High-performance Communication in MPI through Message Matching and Neighborhood Collective Design
Message Passing Interface (MPI) is the de facto standard for communication in High Performance Computing (HPC). MPI processes compute on their local data while extensively communicating with each other. Communication is therefore the major bottleneck for performance. This dissertation presents several proposals for improving communication performance in MPI. Message matching is on the critical path of communication in MPI. Therefore, it has to be optimized given the scalability requirements of HPC applications. We propose clustering-based message matching mechanisms as well as a partner/non-partner message queue design that consider the behavior of the applications to categorize the communicating peers into groups, and assign a dedicated queue to each group. The experimental evaluations show that the proposed approaches improve the queue search time and application runtime by up to 28x and 5x, respectively. We also propose a unified message matching mechanism that improves the message queue search time by distinguishing messages coming from point-to-point and collective communications. For collective elements, it dynamically profiles the impact of each collective call on the message queues and uses this information to adapt the queue data structure. For point-to-point elements, it uses the partner/non-partner queue design. The evaluation results show that we can improve the queue search time and application runtime by up to 80x and 5.5x, respectively. Furthermore, we exploit the vectorization capabilities of the many-core processors and coprocessors used in new HPC systems to improve the message matching performance. The evaluation results show that we can improve the queue search time and application runtime by up to 4.5x and 2.92x, respectively. Finally, we propose a collaborative communication mechanism based on common neighborhoods that might exist among groups of k processes.
Such common neighborhoods are used to decrease the number of communication stages through message combining. We consider two design alternatives: topology-agnostic and topology-aware. The former ignores the physical topology of the system and the mapping of processes, whereas the latter takes them into account to further optimize the communication pattern. Our experimental results show that we can gain up to 8x and 5.2x improvement for various process topologies and a sparse matrix-matrix multiplication kernel, respectively.
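
As a rough illustration of message combining over a common neighborhood, the following minimal Python sketch covers the simplest case of k = 2 and ignores physical topology. It is not the dissertation's algorithm: the function names, the even split of the common neighbors between the two processes, and the example graph are all assumptions made for illustration. The sketch also assumes the two cooperating processes are not outgoing neighbors of each other.

```python
from collections import Counter

def direct_messages(out_neighbors):
    """Baseline: every process sends its own data to each outgoing neighbor."""
    return [(src, dst, frozenset({src}))
            for src, dsts in out_neighbors.items()
            for dst in sorted(dsts)]

def combined_messages(out_neighbors, p, q):
    """Hypothetical pairwise combining: p and q first exchange their data,
    then each forwards the combined data to half of their common neighbors."""
    common = sorted(out_neighbors[p] & out_neighbors[q])
    if not common:
        # Nothing to combine, so fall back to the baseline pattern.
        return direct_messages(out_neighbors)
    half = len(common) // 2
    share = {p: common[:half], q: common[half:]}  # assumed even split
    # Exchange step: each partner hands its own data to the other.
    msgs = [(p, q, frozenset({p})), (q, p, frozenset({q}))]
    for src in (p, q):
        # Combined messages carry the data of both p and q.
        msgs += [(src, dst, frozenset({p, q})) for dst in share[src]]
        # Neighbors that are not shared still receive a plain message.
        private = sorted(out_neighbors[src] - set(common))
        msgs += [(src, dst, frozenset({src})) for dst in private]
    # Processes other than p and q keep their direct messages.
    msgs += [m for m in direct_messages(out_neighbors) if m[0] not in (p, q)]
    return msgs

def sends_per_process(msgs):
    """Number of messages each process has to send."""
    return Counter(src for src, _, _ in msgs)

# Example graph (made up): processes 0 and 1 share the neighbors 2, 3, 4, 5.
example = {0: {2, 3, 4, 5}, 1: {2, 3, 4, 5, 6}}
baseline = sends_per_process(direct_messages(example))
combined = sends_per_process(combined_messages(example, 0, 1))
```

In this example, process 0 sends 3 messages instead of 4 and process 1 sends 4 instead of 5, while every neighbor still receives the data of each process that originally targeted it. A topology-aware variant would additionally take the placement of processes into account when choosing partners and splitting the common neighbors, which this sketch does not attempt.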