High-Performance Network- and GPU-Aware Communication for MPI Partitioned and MPI Neighbourhoods

Authors

Temucin, Yiltan Hassan

Date

2024-12-18

Type

thesis

Language

eng

Keyword

MPI, GPU, HPC

Abstract

Advances in High-Performance Computing (HPC) continue to improve the performance of applications in Molecular Dynamics, AI, Deep Learning, and Large Language Models (LLMs), among others, to solve large, complex problems. Over the past decade, CPU core counts have increased and Graphics Processing Units (GPUs) have become widespread, so system software must adapt. Most Message Passing Interface (MPI) implementations and applications are poorly optimized for multi-threaded communication, and MPI currently has no standardized way to enable GPU support. To address some of these issues, MPI Partitioned Point-to-Point Communication was proposed to better support hybrid workloads.

We addressed the lack of open-source micro-benchmarks for MPI Partitioned communication by designing a micro-benchmark suite that allows users to search the parameter space for the optimal partitioned communication configuration for their application. The suite includes benchmarks for halo exchanges and sweeping communication patterns to analyze how partitioned communication can be used. We also designed low-level network optimizations for MPI Partitioned, including a brute-force approach, one based on the Partitioned LogGP model, and a dynamic method relying on timers, and used our micro-benchmarks to evaluate these designs.

To generalize this design to multiple memory regions as well as GPUs, we investigated a UCX-based design for MPI Partitioned Point-to-Point and Collective communication with applications to CPU- and GPU-based clusters. The design was evaluated on the Cerio Rockport Ethernet Fabric and the NVIDIA GH200 platforms to better understand how different network links and GPU-initiated communication can be utilized. On top of the Partitioned Point-to-Point design, we built a partitioned allreduce collective that greatly improves on Open MPI's existing implementation.

Finally, we addressed the challenges associated with GPU-Aware MPI Neighborhood collectives using a topology-aware design that considers the AMD Infinity Fabric. We proposed an algorithm that maps the communication pattern onto the topology to create a hierarchical, leader-based communication pattern, and from these new communication patterns we proposed new MPI_Neighbor_allgather and MPI_Neighbor_allgatherv collective algorithms. We evaluated the performance of the new MPI Neighborhood collectives using a variety of benchmarks and a Sparse Matrix-Matrix multiplication (SpMM) kernel on multiple platforms.
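
For context on the interface the thesis builds on, the sketch below illustrates the MPI 4.0 partitioned point-to-point pattern (MPI_Psend_init, MPI_Pready, MPI_Parrived) that the micro-benchmarks and network optimizations described above exercise. It is a minimal, hypothetical example rather than code from the thesis; the partition count, message size, and polling strategy are placeholder choices.

/*
 * Minimal sketch of MPI 4.0 partitioned point-to-point communication
 * (not code from the thesis). Rank 0 marks each partition ready as it
 * is produced; rank 1 polls for individual partitions as they arrive.
 * Run with at least two ranks, e.g. "mpirun -np 2 ./partitioned_sketch".
 * PARTITIONS and COUNT are illustrative values; in a real hybrid code
 * each partition would typically be produced by a separate thread.
 */
#include <mpi.h>
#include <stdlib.h>

#define PARTITIONS 8
#define COUNT      1024   /* elements per partition */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc((size_t)PARTITIONS * COUNT * sizeof(double));
    MPI_Request req;

    if (rank == 0) {
        MPI_Psend_init(buf, PARTITIONS, COUNT, MPI_DOUBLE, 1, 0,
                       MPI_COMM_WORLD, MPI_INFO_NULL, &req);
        MPI_Start(&req);
        for (int p = 0; p < PARTITIONS; p++) {
            for (int i = 0; i < COUNT; i++)   /* produce partition p */
                buf[p * COUNT + i] = (double)p;
            MPI_Pready(p, req);               /* partition p may be sent now */
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Request_free(&req);
    } else if (rank == 1) {
        MPI_Precv_init(buf, PARTITIONS, COUNT, MPI_DOUBLE, 0, 0,
                       MPI_COMM_WORLD, MPI_INFO_NULL, &req);
        MPI_Start(&req);
        for (int p = 0; p < PARTITIONS; p++) {
            int flag = 0;
            while (!flag)                     /* consume partitions as they land */
                MPI_Parrived(req, p, &flag);
            /* ... partition p of buf is now valid and can be consumed ... */
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Request_free(&req);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}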
