Improving Communication Performance through Topology and Congestion Awareness in HPC Systems
High-Performance Computing (HPC) represents the flagship domain in providing high-end computing capabilities that play a critical role in helping humanity solve its hardest problems. Ranging from answering profound questions about the universe to finding a cure for cancer, HPC applications span nearly every aspect of our lives. The impressive power of HPC systems comes mainly from the massive number of processors---on the order of millions---that they provide. Communication among these processors is the main bottleneck in the overall performance of HPC systems. This dissertation presents new algorithms that improve communication performance in HPC systems by exploiting topology information. We propose a parallel topology- and routing-aware mapping heuristic and a refinement algorithm that improve communication performance by reducing congestion across network links. Our experimental results with 4,096 processors show that the proposed approach provides more than 60% improvement in various mapping metrics compared to an initial in-order mapping of processes, and improves communication time by up to 50%.

We also propose four topology-aware mapping heuristics designed specifically for collective communications in the Message Passing Interface (MPI). The heuristics provide a better match between the collective communication algorithm and the physical topology of the system, and decrease communication latency by up to 78%.

Furthermore, we extend topology-aware communication to the scope of accelerated computing. Using accelerators---especially Graphics Processing Units (GPUs)---to speed up certain types of computation plays an increasingly important role in HPC. We present a unified framework for topology-aware process mapping and GPU assignment in multi-GPU systems. Our experimental results on two clusters with 64 GPUs show that the proposed approach improves communication performance by up to 91%.
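To make the notion of a mapping metric concrete, the sketch below computes hop-bytes, one commonly used measure of mapping quality: each message's volume weighted by the number of network hops between the sender's and receiver's nodes. The 2-D mesh topology, ring traffic pattern, and the two mappings are illustrative assumptions, not the dissertation's actual heuristics or experimental setup.

```python
# Hop-bytes: sum over all messages of (bytes sent) x (network hops between
# the sender's and receiver's nodes). Lower hop-bytes generally means less
# load placed on network links. The topology here is a simple 2-D mesh with
# Manhattan (hop-count) distance between nodes.

def hops(node_a, node_b):
    """Manhattan distance between two (x, y) mesh coordinates."""
    return abs(node_a[0] - node_b[0]) + abs(node_a[1] - node_b[1])

def hop_bytes(traffic, mapping):
    """traffic: {(src_rank, dst_rank): bytes}; mapping: rank -> (x, y) node."""
    return sum(vol * hops(mapping[s], mapping[d])
               for (s, d), vol in traffic.items())

# A ring communication pattern over 4 ranks, 1 MB per message.
traffic = {(0, 1): 1 << 20, (1, 2): 1 << 20, (2, 3): 1 << 20, (3, 0): 1 << 20}

# An in-order mapping places consecutive ranks along one mesh row...
in_order = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (3, 0)}
# ...while a topology-aware mapping folds the ring onto a 2x2 block.
aware = {0: (0, 0), 1: (1, 0), 2: (1, 1), 3: (0, 1)}

print(hop_bytes(traffic, in_order))  # the ring's wrap-around edge pays 3 hops
print(hop_bytes(traffic, aware))     # every neighbor is exactly 1 hop away
```

Refinement heuristics of the kind described above repeatedly move or swap ranks between nodes while such a metric decreases, trading a cheap offline search for lower link congestion at run time.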
Finally, we present a novel distributed algorithm that uses process topology information to design optimized communication schedules for MPI neighborhood collectives. The proposed algorithm finds the common neighborhoods in a distributed graph topology and exploits them to improve communication performance through message combining. The optimized schedules reduce the communication latency of MPI neighborhood collectives by more than 50%.
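The message-combining idea can be sketched with a simple counting model: when several ranks share the same set of out-neighbors (a common neighborhood), one representative can forward their payloads in combined messages instead of each rank sending separately. The graph and the grouping rule below are illustrative assumptions, not the dissertation's actual schedule-construction algorithm.

```python
from collections import defaultdict

def message_counts(out_neighbors):
    """out_neighbors: rank -> frozenset of destination ranks.
    Returns (naive message count, message count with combining)."""
    # Naive neighborhood collective: one message per (sender, destination) pair.
    naive = sum(len(dsts) for dsts in out_neighbors.values())

    # Group ranks that have identical out-neighborhoods.
    groups = defaultdict(list)
    for rank, dsts in out_neighbors.items():
        groups[dsts].append(rank)

    combined = 0
    for dsts, senders in groups.items():
        # Non-representatives each send one message to the representative,
        # which then sends one combined message per destination.
        combined += (len(senders) - 1) + len(dsts)
    return naive, combined

# Ranks 0, 1, and 2 share the common neighborhood {4, 5, 6}; rank 3 does not.
graph = {
    0: frozenset({4, 5, 6}),
    1: frozenset({4, 5, 6}),
    2: frozenset({4, 5, 6}),
    3: frozenset({7}),
}
print(message_counts(graph))  # fewer, larger messages after combining
```

Fewer, larger messages amortize per-message latency, which is one plausible source of the latency reductions reported for the optimized neighborhood-collective schedules.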