High-Performance Interconnect-Aware MPI Communication for Deep Learning Workloads
High-Performance Computing (HPC) refers to using the aggregate compute power of many small compute nodes to solve large, complex problems that cannot be computed in a reasonable time on a single computer. In recent years, HPC clusters have moved towards using accelerators, such as Graphics Processing Units (GPUs), to offload computationally intensive portions of applications. Distributed Deep Learning workloads on these heterogeneous HPC systems have become increasingly important. These new workloads have been built on existing HPC libraries such as the Message Passing Interface (MPI) and Compute Unified Device Architecture (CUDA). MPI communication is critical to distributed Deep Learning applications at scale, as they place a large amount of pressure on the communication subsystem of HPC clusters. Improving the MPI communication runtime could therefore benefit Deep Learning. To that end, we first investigate the characteristics of Deep Learning applications to understand how to design communication mechanisms that address their main communication challenges. We focus on the issues surrounding large GPU messages, which we observed in Deep Learning applications. We begin by studying NVLink usage in the context of point-to-point communication. The Unified Communication X (UCX) framework, used within the Open MPI library, utilises only a small portion of the available NVLink bandwidth for intra-socket GPU-to-GPU communication. We propose a novel GPU-to-GPU data transfer mechanism that stripes the message across multiple intra-socket communication channels and multiple memory regions, using multiple GPU streams to utilise all available NVLink paths. Our approach achieves 1.64x and 1.84x higher bandwidth for UCX and Open MPI + UCX, respectively.
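The striping idea can be illustrated with a minimal host-side sketch. It is not the mechanism implemented in UCX or Open MPI: the function names `stripe_message` and `striped_copy` are hypothetical, the even split across paths is an assumption, and plain byte copies stand in for the asynchronous GPU copies that would each run on their own stream over their own NVLink path.

```python
def stripe_message(nbytes, num_paths):
    """Split the byte range [0, nbytes) into contiguous chunks, one per path.

    Returns a list of (path, offset, length) tuples. Paths that would
    receive an empty chunk are omitted. The even split is an assumption
    made for illustration only.
    """
    if num_paths <= 0:
        raise ValueError("num_paths must be positive")
    base, extra = divmod(nbytes, num_paths)
    chunks = []
    offset = 0
    for path in range(num_paths):
        # The first `extra` paths carry one additional byte each.
        length = base + (1 if path < extra else 0)
        if length:
            chunks.append((path, offset, length))
        offset += length
    return chunks


def striped_copy(src, num_paths):
    """Copy `src` chunk by chunk, following the striping plan.

    On a GPU, each chunk would be an asynchronous copy issued on a
    separate stream; here the copies run sequentially on the host.
    """
    dst = bytearray(len(src))
    for _path, offset, length in stripe_message(len(src), num_paths):
        dst[offset:offset + length] = src[offset:offset + length]
    return bytes(dst)
```

For instance, `stripe_message(10, 3)` assigns 4, 3, and 3 bytes to three paths. In a real transfer the chunk sizes would more plausibly be weighted by the bandwidth of each path.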
We then propose a 3-stage hierarchical, pipelined MPI_Allreduce design that incorporates the new multi-path NVLink data transfer mechanism for intra-socket communication in the first and third stages of the collective, and uses PCIe and X-bus channels for inter-socket GPU-to-GPU communication in the second stage with minimal interference. For large messages, our proposed algorithm achieves a large speedup. Finally, we evaluate our proposed MPI_Allreduce for Deep Learning applications such as Horovod + TensorFlow with a range of Deep Learning models. For Horovod + TensorFlow with VGG16, we observe up to 3.42x speedup over other MPI implementations.
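The data flow of a 3-stage hierarchical allreduce can be sketched as a small simulation. This rests on assumptions not stated above: stage 1 is taken to be a reduction within each socket, stage 2 a combination of the per-socket results across sockets, and stage 3 a broadcast back within each socket. The function name `hierarchical_allreduce` is hypothetical, summation is assumed as the reduction operator, and the pipelining of the proposed design is not modelled.

```python
def hierarchical_allreduce(buffers, gpus_per_socket):
    """Simulate a 3-stage hierarchical sum-allreduce.

    buffers: list of equal-length lists of numbers, one per GPU,
             ordered so that consecutive GPUs share a socket.
    Returns a new list with one fully reduced buffer per GPU.
    """
    if gpus_per_socket <= 0 or len(buffers) % gpus_per_socket != 0:
        raise ValueError("GPU count must be a multiple of gpus_per_socket")

    # Group the per-GPU buffers by socket.
    sockets = [
        buffers[i:i + gpus_per_socket]
        for i in range(0, len(buffers), gpus_per_socket)
    ]

    # Stage 1 (assumed): intra-socket reduction, element by element.
    partial = [[sum(vals) for vals in zip(*socket)] for socket in sockets]

    # Stage 2 (assumed): inter-socket combination of the partial results.
    total = [sum(vals) for vals in zip(*partial)]

    # Stage 3 (assumed): intra-socket broadcast of the final result.
    return [list(total) for _ in buffers]
```

With four GPUs on two sockets, `hierarchical_allreduce([[1, 2], [3, 4], [5, 6], [7, 8]], 2)` leaves every GPU holding the element-wise sum of all four input buffers.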