Efficient Process Arrival Pattern Aware Collective Communication for HPC and Deep Learning
Authors
Mohammadalizadehbakhtevari, Pedram
Date
Type
thesis
Language
eng
Keyword
MPI, Collective Communication, Deep Learning Frameworks, MPI_Allreduce, Process Arrival Pattern
Alternative Title
Abstract
High-Performance Computing (HPC) is key to tackling computationally intensive problems such as Deep Learning (DL) and scientific applications.
The Message Passing Interface (MPI) is the de facto parallel programming standard that provides communication in HPC systems.
MPI collective communication operations involve all processes within a program-defined group of processes and have been used extensively in parallel applications.
Unfortunately, most studies that attempt to improve the performance of collective operations assume that all processes commence the communication simultaneously. However, researchers have shown that imbalanced Process Arrival Patterns (PAPs) are ubiquitous in real environments. It is therefore important to design new algorithms that improve the performance of collectives through PAP-awareness.
This thesis presents a complete communication characterization of Horovod, one of the most widely used distributed DL frameworks. We provide a thorough study of the PAP of MPI_Allreduce, the most important collective operation used in Horovod, and show that the arrival pattern of MPI processes is indeed imbalanced, especially for small messages.
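To make this kind of measurement concrete, below is a minimal microbenchmark sketch in C with MPI, not taken from the thesis, showing one way to probe the arrival pattern of an MPI_Allreduce call: each rank records a timestamp just before entering the collective, and rank 0 gathers the timestamps and reports the spread. The message size, file name, and output format are illustrative assumptions.

/* pap_probe.c -- hypothetical PAP probe for MPI_Allreduce (illustrative).
 * Build: mpicc pap_probe.c -o pap_probe   Run: mpirun -np 8 ./pap_probe */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1024;                        /* example message size */
    double *sendbuf = malloc(count * sizeof(double));
    double *recvbuf = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++) sendbuf[i] = (double)rank;

    MPI_Barrier(MPI_COMM_WORLD);                   /* rough common start */
    double arrival = MPI_Wtime();                  /* arrival at the collective */
    MPI_Allreduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* Rank 0 gathers the per-rank arrival timestamps and prints the spread. */
    double *arrivals = (rank == 0) ? malloc(size * sizeof(double)) : NULL;
    MPI_Gather(&arrival, 1, MPI_DOUBLE, arrivals, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        double min = arrivals[0], max = arrivals[0];
        for (int i = 1; i < size; i++) {
            if (arrivals[i] < min) min = arrivals[i];
            if (arrivals[i] > max) max = arrivals[i];
        }
        printf("arrival-time spread: %.3f us\n", (max - min) * 1e6);
        free(arrivals);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

In practice, the imbalance of an application collective would be measured by timestamping the application's own MPI_Allreduce invocations rather than a barrier-synchronized loop like this one.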
Furthermore, we present various proposals for improving the MPI collective communication performance in the presence of imbalanced PAPs for different message sizes.
We propose a PAP-aware, shared-memory-aware intra-node MPI_Allreduce algorithm for small messages that dynamically chooses the leader process at each invocation of the collective call based on the arrival times of the processes. The evaluation results show that our design delivers up to 56% improvement over the native algorithms under different imbalanced PAPs.
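As a rough illustration of the leader-selection idea only, and not of the thesis's shared-memory data path, the following sketch elects the earliest-arriving process on a node as leader: each rank atomically takes a ticket from a counter placed in an MPI shared-memory window, and the rank holding ticket 0 leads. The reduction itself is stood in for by a leader-rooted MPI_Reduce plus MPI_Bcast; all names, the window layout, and the single-counter design are assumptions.

/* Illustrative sketch: arrival-order leader election on a node.
 * Assumes the MPI unified memory model for the direct store at init. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Communicator of the ranks that share this node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int rank;
    MPI_Comm_rank(node_comm, &rank);

    /* Shared window holding one ticket counter, owned by node rank 0. */
    MPI_Win win;
    int *counter;
    MPI_Win_allocate_shared(rank == 0 ? (MPI_Aint)sizeof(int) : 0, sizeof(int),
                            MPI_INFO_NULL, node_comm, &counter, &win);
    if (rank == 0) *counter = 0;
    MPI_Barrier(node_comm);                 /* counter is initialized */

    /* ... skewed per-rank computation would happen here ... */
    double value = (double)rank, result = 0.0;

    /* Arrival at the collective: atomically take a ticket; ticket 0 leads. */
    int ticket, one = 1;
    MPI_Win_lock_all(0, win);
    MPI_Fetch_and_op(&one, &ticket, MPI_INT, 0 /* target */, 0 /* disp */,
                     MPI_SUM, win);
    MPI_Win_unlock_all(win);

    /* Agree on the leader's rank (the earliest arriver). */
    int cand = (ticket == 0) ? rank : -1, leader;
    MPI_Allreduce(&cand, &leader, 1, MPI_INT, MPI_MAX, node_comm);

    /* Leader-rooted reduce + broadcast stands in for the shared-memory
     * data path of an actual small-message allreduce design. */
    MPI_Reduce(&value, &result, 1, MPI_DOUBLE, MPI_SUM, leader, node_comm);
    MPI_Bcast(&result, 1, MPI_DOUBLE, leader, node_comm);

    if (ticket == 0)
        printf("leader is rank %d, sum = %.1f\n", leader, result);

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}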
We also propose a PAP-aware algorithm for intra-node MPI_Reduce and MPI_Allreduce with large messages that dynamically constructs the reduction schedule at each invocation of the collective call based on the arrival order of the processes, achieving up to 73% and 44% improvement over the state-of-the-art algorithms, respectively.
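To convey the flavor of an arrival-order-driven schedule, here is a simplified sketch in C: each rank atomically takes an arrival ticket, the order is exchanged, and the partial result flows along a chain from the earliest to the latest arriver, which ends up holding the reduced buffer. A real design would build the schedule incrementally instead of exchanging the full order up front, and would pipeline large messages; the function, window, and naming choices are illustrative assumptions.

/* Illustrative chain reduction scheduled by arrival order (not the thesis
 * algorithm). Assumes the MPI unified memory model for the init store. */
#include <mpi.h>
#include <stdlib.h>

/* Sums `count` doubles across `comm`; the last arriver ends up holding the
 * result in `buf`, and its rank is returned on every process. */
int arrival_ordered_reduce(double *buf, int count, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Atomic ticket counter in an RMA window owned by rank 0. */
    MPI_Win win;
    int *counter;
    MPI_Win_allocate(rank == 0 ? (MPI_Aint)sizeof(int) : 0, sizeof(int),
                     MPI_INFO_NULL, comm, &counter, &win);
    if (rank == 0) *counter = 0;
    MPI_Barrier(comm);

    int ticket, one = 1;
    MPI_Win_lock_all(0, win);
    MPI_Fetch_and_op(&one, &ticket, MPI_INT, 0, 0, MPI_SUM, win);
    MPI_Win_unlock_all(win);

    /* Everyone learns the arrival order: order[t] = rank holding ticket t. */
    int *tickets = malloc(size * sizeof(int));
    int *order = malloc(size * sizeof(int));
    MPI_Allgather(&ticket, 1, MPI_INT, tickets, 1, MPI_INT, comm);
    for (int r = 0; r < size; r++) order[tickets[r]] = r;

    /* Chain: ticket t receives the running partial sum from ticket t-1,
     * folds in its own buffer, and forwards the result to ticket t+1. */
    double *tmp = malloc(count * sizeof(double));
    if (ticket > 0) {
        MPI_Recv(tmp, count, MPI_DOUBLE, order[ticket - 1], 0, comm,
                 MPI_STATUS_IGNORE);
        for (int i = 0; i < count; i++) buf[i] += tmp[i];
    }
    if (ticket < size - 1)
        MPI_Send(buf, count, MPI_DOUBLE, order[ticket + 1], 0, comm);

    int root = order[size - 1];   /* the last arriver holds the final result */
    free(tmp); free(tickets); free(order);
    MPI_Win_free(&win);
    return root;
}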
Finally, we evaluate the performance of two state-of-the-art cluster-wide MPI_Allreduce algorithms and introduce a PAP-tolerant cluster-wide allreduce algorithm that, owing to its hierarchical nature, imposes less data dependency among processes than flat algorithms. This algorithm delivers up to 58% improvement at the microbenchmark level and an average improvement of 10% for a Horovod DL application over the native algorithms.
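For context, the sketch below shows the generic hierarchical (node-aware) allreduce structure that such a cluster-wide design can build on: an intra-node reduce to a node leader, an inter-node allreduce among the leaders, and an intra-node broadcast. This is the standard hierarchical pattern rather than the PAP-tolerant algorithm itself; the function name and the SUM-over-doubles choice are assumptions.

/* Generic hierarchical allreduce sketch (reduce -> leader allreduce -> bcast). */
#include <mpi.h>

void hierarchical_allreduce(const double *sendbuf, double *recvbuf,
                            int count, MPI_Comm comm)
{
    /* Intra-node communicator plus a communicator of the node leaders. */
    MPI_Comm node_comm, leader_comm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                        &node_comm);
    int node_rank, world_rank;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_rank(comm, &world_rank);
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, world_rank,
                   &leader_comm);

    /* Step 1: reduce inside each node to its leader (node rank 0). */
    MPI_Reduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM, 0, node_comm);

    /* Step 2: allreduce among the node leaders only. */
    if (node_rank == 0)
        MPI_Allreduce(MPI_IN_PLACE, recvbuf, count, MPI_DOUBLE, MPI_SUM,
                      leader_comm);

    /* Step 3: broadcast the global result within each node. */
    MPI_Bcast(recvbuf, count, MPI_DOUBLE, 0, node_comm);

    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}

Every rank calls it with matching buffers of count doubles and receives the same global sum; the hierarchy confines most of the data movement to within nodes, which is what reduces cross-process data dependencies relative to a flat allreduce.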
Description
Citation
Publisher
License
Queen's University's Thesis/Dissertation Non-Exclusive License for Deposit to QSpace and Library and Archives Canada
ProQuest PhD and Master's Theses International Dissemination Agreement
Intellectual Property Guidelines at Queen's University
Copying and Preserving Your Thesis
This publication is made available by the authority of the copyright owner solely for the purpose of private study and research and may not be copied or reproduced except as permitted by the copyright laws without written authority from the copyright owner.
CC0 1.0 Universal