Novel Allreduce Algorithms for Distributed Deep Learning

Authors

Kitor, Ben

Type

thesis

Language

eng

Keyword

MPI, Collective Communications, Deep Learning

Abstract

To solve the most computationally challenging problems, scientists and engineers use high-performance computing (HPC) clusters. These massive-scale systems comprise thousands of servers, each containing multiple GPUs and CPUs, tied together with a web of interconnect technologies such as PCIe, InfiniBand, and NVLink. As a result, HPC systems are incredibly complicated and, by extension, challenging to program. To minimize this burden, developers leverage the Message Passing Interface (MPI). This programming model abstracts the hardware and provides powerful tools like collective communications for exchanging data between processes. HPC is typically associated with scientific computation, classic examples being molecular dynamics and cosmological simulations, but recently deep learning (DL) has become an application of interest. DL is a subset of machine learning that leverages massive model and dataset sizes to achieve remarkable performance. DL has revolutionized several fields, including computer vision and natural language processing, and will likely be a critical technology in the future. Because DL targets large scale, it benefits significantly from running on HPC clusters.

We investigate the state of the art in large-scale DL and identify Horovod, a commonly used MPI-based data-parallelism library. Horovod spends a large fraction of its runtime performing MPI_Allreduce, a collective communication. To address this, we propose two new allreduce methods, each leveraging different HPC techniques. First, we describe topology-aware rank reordering for allreduce and broadcast. This method takes the communication pattern of the specified collective algorithm and re-ranks processes to ensure the hardware is used optimally. Evaluation with micro-benchmarks demonstrates up to an 80% improvement, depending on message size and the initial process-to-core mapping. The second proposed technique is a multi-node process-arrival-pattern (PAP) aware allreduce. PAP awareness allows the algorithm to introspect on which processes have arrived and can participate in the collective, and to optimize communication accordingly. To achieve this, we propose a novel remote-memory-access-based PAP distribution mechanism and an accompanying hierarchical chain allreduce. The micro-benchmark evaluation shows a 20% improvement over state-of-the-art collective libraries such as UCC and NCCL under process imbalance. Furthermore, performance improvements with Horovod demonstrate that data-parallel training can benefit greatly from PAP awareness.
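To make the central operation concrete, the following is a minimal sketch, not taken from the thesis, of the MPI_Allreduce call that dominates Horovod's runtime in data-parallel training: every rank contributes a local gradient buffer, all ranks receive the element-wise sum, and dividing by the number of workers yields the averaged gradient. The buffer size and dummy gradient values are hypothetical, chosen only for illustration.

```c
#include <mpi.h>
#include <stdio.h>

#define GRAD_LEN 4  /* hypothetical gradient length, for illustration only */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank's local gradients (dummy values for the example). */
    float local_grad[GRAD_LEN]  = {0.1f * rank, 0.2f * rank, 0.3f * rank, 0.4f * rank};
    float global_grad[GRAD_LEN];

    /* The allreduce collective: sum across all ranks, result delivered to every rank. */
    MPI_Allreduce(local_grad, global_grad, GRAD_LEN, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    /* Average the summed gradients over the number of workers. */
    for (int i = 0; i < GRAD_LEN; i++)
        global_grad[i] /= (float)size;

    if (rank == 0)
        printf("averaged grad[0] = %f\n", global_grad[0]);

    MPI_Finalize();
    return 0;
}
```

Because every worker blocks on this collective each training step, its cost scales with message size, network topology, and how evenly processes arrive at the call, which is what motivates the topology-aware and PAP-aware allreduce designs described above.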

License

Queen's University's Thesis/Dissertation Non-Exclusive License for Deposit to QSpace and Library and Archives Canada
ProQuest PhD and Master's Theses International Dissemination Agreement
Intellectual Property Guidelines at Queen's University
Copying and Preserving Your Thesis
This publication is made available by the authority of the copyright owner solely for the purpose of private study and research and may not be copied or reproduced except as permitted by the copyright laws without written authority from the copyright owner.
