Design and Evaluation of Efficient Collective Communications on Modern Interconnects and Multi-core Clusters

Thumbnail Image
Qian, Ying
Collective communications , Interconnects
Two driving forces behind high-performance clusters are the availability of modern interconnects and the advent of multi-core systems. As multi-core clusters become commonplace, where each core will run at least one process with multiple intra-node and inter-node connections to several other processes, there will be immense pressure on the interconnection network and its communication system software. Many parallel scientific applications use Message Passing Interface (MPI) collective communications intensively. Therefore, efficient and scalable implementation of MPI collective operations is critical to the performance of applications running on clusters. In this dissertation, I propose and evaluate a number of efficient collective communication algorithms that utilize the modern features of Quadrics and InfiniBand interconnects as well as the availability of multiple cores on emerging clusters. To overcome bandwidth limitations and to enhance fault tolerance, using multiple independent networks known as multi-rail networks is very promising. Quadrics multi-rail QsNetII network is constructed using multiple network interface cards (NICs) per node, where each NIC is connected to a rail. I design and evaluate a number of Remote Direct Memory Access (RDMA) based multi-port collective operations on multi-rail QsNetII network. I also extend the gather and allgather algorithms to be shared memory aware for small to medium messages. The algorithms prove to be much more efficient than the native Quadrics MPI implementation. ConnectX is the newest generation of InfiniBand host channel adapters from Mellanox Technologies. I provide evidence that ConnectX achieves scalable performance for simultaneous communication over multiple connections. Utilizing this ability of ConnectX cards, I propose a number of RDMA based multi-connection and multi-core aware allgather algorithms at the MPI level. My algorithms are devised to target different message sizes, and the performance results show that they outperform the native MVAPICH implementation. Recent studies show that MPI processes in real applications could arrive at an MPI collective operation at different times. This imbalanced process arrival pattern can significantly affect the performance of the collective communication operation. Therefore, design and efficient implementation of collectives under different process arrival patterns is critical to the performance of scientific applications running on modern clusters. I propose novel RDMA-based process arrival pattern aware alltoall and allgather for different message sizes over InfiniBand clusters. I also extend the algorithms to be shared memory aware for small to medium messages under process arrival patterns. The performance results indicate that the proposed algorithms outperform the native MVAPICH implementation as well as other non-process arrival pattern aware algorithms when processes arrive at different times.
External DOI