Improving Communication Performance in GPU-Accelerated HPC Clusters
In recent years, GPUs have been adopted in many High-Performance Computing (HPC) clusters due to their massive computational power and energy efficiency. The Message Passing Interface (MPI) is the de facto standard for parallel programming. Many HPC applications written in MPI use parallel processes and multiple GPUs to achieve higher performance and greater GPU memory capacity. In such applications, efficient GPU inter-process communication is key to application performance. In this dissertation, we present proposals to improve GPU inter-process communication in HPC clusters using novel GPU-aware designs, efficient and scalable algorithms, topology-aware designs, and hardware features. Specifically, we propose various approaches to improve the efficiency of MPI communication routines in GPU clusters. We also propose designs that evaluate an application's total inter-process communication and provide solutions to improve its efficiency. First, we propose efficient GPU-aware algorithms to improve MPI collective performance. We show the importance of minimizing CPU intervention for GPU collective performance. We also utilize GPU features to enhance both collective communication and computation. As inter-process communication scales across multi-GPU nodes and clusters, efficient inter-process communication routines must consider the physical structure of the underlying system. Given the hierarchical nature of GPU clusters with multi-GPU nodes, we propose hierarchy-aware designs for GPU collectives and show that different algorithms are favored at different hierarchy levels. With multiple data copy mechanisms available in modern GPU clusters, it is crucial to make an informed decision on how to use them for efficient inter-process communication. In this regard, we propose designs that intelligently decide which data copy mechanisms to use in GPU collectives.
Using these designs, we reveal the importance of using multiple data copy mechanisms when performing multiple inter-process communications. Finally, we provide topology-aware solutions to improve application inter-process communication efficiency, both within multi-GPU nodes and across GPU clusters. We first study the performance of the different communication channels used for GPU inter-process communication. We then propose topology-aware designs that consider both the system's physical topology and the application's communication pattern. These designs improve communication performance by performing the more intensive inter-process communication over stronger communication channels.
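
The topology-aware idea summarized above, placing the heaviest communication on the strongest channels, can be illustrated with a small sketch. This is not the dissertation's algorithm: the function names `comm_cost` and `best_mapping`, the 4-GPU bandwidth table, and the communication volumes are all hypothetical, and the exhaustive search is only practical for a handful of processes.

```python
from itertools import permutations


def comm_cost(mapping, volume, bandwidth):
    """Estimated transfer time: data volume divided by link bandwidth,
    summed over every pair of processes that exchange data.
    mapping[i] is the GPU assigned to process i (hypothetical model)."""
    n = len(mapping)
    cost = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if volume[i][j] > 0:
                cost += volume[i][j] / bandwidth[mapping[i]][mapping[j]]
    return cost


def best_mapping(volume, bandwidth):
    """Exhaustively search all process-to-GPU assignments and return
    the one with the lowest estimated cost (small process counts only)."""
    n = len(volume)
    return min(permutations(range(n)),
               key=lambda m: comm_cost(m, volume, bandwidth))


# Hypothetical 4-GPU node: GPUs 0-1 and 2-3 share a strong channel
# (50 units of bandwidth); every other pair uses a weaker one (10 units).
bandwidth = [
    [0, 50, 10, 10],
    [50, 0, 10, 10],
    [10, 10, 0, 50],
    [10, 10, 50, 0],
]

# Hypothetical communication pattern: process 0 exchanges heavily with
# process 2, and process 1 with process 3; all other pairs are light.
volume = [
    [0, 1, 100, 1],
    [1, 0, 1, 100],
    [100, 1, 0, 1],
    [1, 100, 1, 0],
]

identity = tuple(range(4))
mapping = best_mapping(volume, bandwidth)
print("naive cost:", comm_cost(identity, volume, bandwidth))
print("mapped cost:", comm_cost(mapping, volume, bandwidth))
```

In this toy setup, the naive rank-order placement puts both heavy pairs on weak channels, while the searched placement moves them onto the strong ones. A design meant for real systems would need a scalable heuristic and measured channel performance rather than an exhaustive search over an assumed table.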
URI for this record: http://hdl.handle.net/1974/23833