Next Generation Communication Mechanisms exploiting Heterogeneity, Hierarchy and Concurrency for Emerging HPC Systems

Purpose

New communication mechanisms (integrated into MVAPICH2) for Dense Many-Core (DMC) architectures

Overview

This award was partially supported by the CIF21 Software Reuse Venture, whose goals are to support pathways toward sustainable software elements through their reuse and to emphasize the critical role of reusable software elements in a sustainable software cyberinfrastructure for computational and data-enabled science and engineering.

Parallel programming based on MPI (Message Passing Interface) is used with increasing frequency in academia and in government (for both defense and non-defense applications), and it is seeing emerging use in scalable machine learning and big data analytics. The emergence of Dense Many-Core (DMC) architectures such as Intel's Knights Landing (KNL) and accelerator/co-processor architectures such as NVIDIA GPGPUs is enabling the design of systems with high compute density. This, coupled with the availability of Remote Direct Memory Access (RDMA)-enabled commodity networking technologies such as InfiniBand, RoCE, and 10/40GigE with iWARP, is fueling the growth of multi-petaflop and exaflop systems. These DMC architectures have several unique characteristics: deeper levels of memory hierarchy, revolutionary network interconnects, and heterogeneous compute power and data-movement costs (with heterogeneity at both the chip and node levels).

For these emerging systems, a combination of MPI and other programming models, known as MPI+X (where X can be PGAS, Tasks, OpenMP, OpenACC, or CUDA), is being targeted. Current-generation communication protocols and mechanisms for MPI+X programming models cannot efficiently support these emerging DMC architectures. This leads to two broad challenges: 1) How can high-performance and scalable communication mechanisms for next-generation DMC architectures be designed to support MPI+X (including task-based) programming models? 2) How can current and next-generation applications be designed or co-designed with the proposed communication mechanisms?
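As context for the MPI+X model discussed above, the sketch below shows a minimal hybrid MPI+OpenMP (X = OpenMP) program in C: MPI handles communication between ranks/nodes while OpenMP threads use the many cores within each node. This is an illustrative example only, not the communication mechanisms proposed by this project; it assumes a standard MPI library (such as MVAPICH2) and an OpenMP-capable compiler, and would typically be built with an MPI wrapper such as mpicc with OpenMP enabled.

/* Minimal MPI+OpenMP hybrid sketch (illustrative only).
 * Build (typical): mpicc -fopenmp hybrid.c -o hybrid
 * Run (typical):   mpirun -np 4 ./hybrid                         */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Request threaded MPI; MPI_THREAD_FUNNELED lets the master
     * thread of each rank make MPI calls alongside OpenMP regions. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local = 0.0;

    /* On-node parallelism: OpenMP threads compute a partial sum. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; i++)
        local += 1.0 / (double)(i + 1 + rank);

    /* Inter-node communication: MPI combines the per-rank results. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d threads/rank=%d sum=%f\n",
               nranks, omp_get_max_threads(), global);

    MPI_Finalize();
    return 0;
}

On a KNL or GPU-dense node, the OpenMP region above would typically be replaced or augmented by OpenACC, CUDA, PGAS, or task-based code; efficiently supporting the resulting communication patterns on DMC architectures is the focus of the challenges listed above.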

A synergistic and comprehensive research plan, involving computer scientists from The Ohio State University (OSU) and the Ohio Supercomputer Center (OSC) and computational scientists from the Texas Advanced Computing Center (TACC), the San Diego Supercomputer Center (SDSC), and the University of California San Diego (UCSD), is proposed to address these broad challenges with innovative solutions. The research will be driven by a set of applications from established NSF computational science researchers running large-scale simulations on Stampede, Comet, and other systems at OSC and OSU.

Impact

The proposed designs will be integrated into the widely used MVAPICH2 library and made available for public use. Multiple graduate and undergraduate students will be trained through this project as future scientists and engineers in HPC. The established national-scale training and outreach programs at TACC, SDSC, and OSC will be used to disseminate the results of this research to XSEDE users. Tutorials will be organized at XSEDE, SC, and other conferences to share the research results and experience with the community.

Contributors

Bill Barth
Director of High Performance Computing

Si Liu
Research Associate

Publications

K. Hamidouche, A. Awan, A. Venkatesh, and D. K. Panda. "CUDA M3: Designing Efficient CUDA Managed Memory-aware MPI by Exploiting GDR and IPC," 23rd IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), 2016.

M. Li, X. Lu, K. Hamidouche, J. Zhang, and D. K. Panda. "Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA," 23rd IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), 2016.

J. M. Hashmi, K. Hamidouche, and D. K. Panda. "Enabling Performance Efficient Runtime Support for Hybrid MPI+UPC++ Programming Models," 18th IEEE International Conference on High Performance Computing and Communications, 2016.

A. Awan, K. Hamidouche, J. Hashmi, and D. K. Panda. "S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters," 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2017.

H. Subramoni, S. Chakraborty, and D. K. Panda. "Designing Dynamic and Adaptive MPI Point-to-point Communication Protocols for Efficient Overlap of Computation and Communication," International Supercomputing Conference (ISC), 2017.

Funding Source

National Science Foundation

Computing and Communication Foundations (CCF)

NSF Award #1565414