Novel Allreduce Algorithms for Distributed Deep Learning

Kitor, Ben
MPI, Collective Communications, Deep Learning
To solve the most computationally challenging problems, scientists and engineers use High-Performance Computing (HPC) clusters. These massive-scale systems comprise thousands of servers, each containing multiple GPUs and CPUs, tied together with a web of interconnect technologies like PCIe, InfiniBand, and NVLink. As a result, HPC systems are incredibly complicated and, by extension, challenging to program. To minimize this burden, developers leverage the Message Passing Interface (MPI). This programming model abstracts the hardware and provides powerful tools like collective communications for exchanging data between processes. HPC is typically associated with scientific computations, with classic examples including molecular dynamics and cosmological simulations, but recently Deep Learning (DL) has become an application of interest. DL is a subset of Machine Learning (ML) that leverages massive models and datasets to achieve incredible performance. DL has revolutionized several fields, including computer vision and natural language processing, and will likely be a critical technology in the future. Because DL targets large scale by design, it benefits significantly from running on HPC clusters. We investigate the state of the art in large-scale DL and identify Horovod, a commonly used MPI-based data parallelism library. Horovod spends a large fraction of its runtime performing MPI_Allreduce, a collective communication. To address this, we propose two new allreduce methods, each leveraging different HPC techniques. First, we describe topology-aware rank reordering for allreduce and broadcast. This method takes the communication pattern of the specified collective algorithm and re-ranks processes to ensure the hardware is used optimally. Evaluation with micro-benchmarks demonstrates up to an 80% improvement, depending on message size and initial process-to-core mappings. The second proposed technique is a multinode process arrival pattern (PAP) aware allreduce. PAP awareness allows the algorithm to introspect on which processes have arrived and can participate in the collective, and to optimize communication accordingly. To achieve this, we propose a novel remote memory access based PAP distribution mechanism and an accompanying hierarchical chain allreduce. The micro-benchmark evaluation shows a 20% improvement over state-of-the-art collective libraries like UCC and NCCL under imbalance. Furthermore, performance improvements with Horovod demonstrate that data parallel training can benefit greatly from PAP awareness.
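
The rank reordering idea in the abstract can be illustrated with a toy model. The sketch below is not the thesis's algorithm: it assumes a ring communication pattern (each process sends to its successor), a hypothetical `node_of` list giving the node each process runs on, and a simple "group by node" reordering. The helper names `inter_node_hops` and `reorder_by_node` are invented for this illustration. It only shows why re-ranking against a known communication pattern can reduce traffic over the slower inter-node links.

```python
def inter_node_hops(order, node_of):
    """Count ring edges (order[i] -> order[i+1]) that cross a node boundary."""
    n = len(order)
    return sum(
        node_of[order[i]] != node_of[order[(i + 1) % n]]
        for i in range(n)
    )


def reorder_by_node(node_of):
    """Return a process order that places processes on the same node adjacently."""
    return sorted(range(len(node_of)), key=lambda p: (node_of[p], p))


# Hypothetical round-robin placement: 8 processes spread over 2 nodes.
node_of = [0, 1, 0, 1, 0, 1, 0, 1]

initial = list(range(len(node_of)))   # ranks used in their default order
reordered = reorder_by_node(node_of)  # ranks grouped by node

print("cross-node ring edges before:", inter_node_hops(initial, node_of))
print("cross-node ring edges after: ", inter_node_hops(reordered, node_of))
```

In this toy placement every ring edge initially crosses between nodes, while the grouped order leaves only the two edges that join the node groups. A real implementation would also have to account for the intra-node topology and for the pattern of the specific collective algorithm in use.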