XSEDE Technology Audit Service

Purpose

XDMoD (XD Metrics on Demand), coupled with TACC Stats, provides a comprehensive open-source resource management tool for HPC systems.

Overview

To improve the operational efficiency and management of the XSEDE network, the Technology Audit Service (TAS) award for XSEDE focused first on developing the XDMoD auditing framework, which gives XSEDE stakeholders (the public, users, PIs, support staff, center directors, XSEDE leadership, and NSF program managers) ready access to utilization, performance, and quality-of-service data for XSEDE. While the primary focus of TAS is XSEDE, we realized that such a resource management tool would also be of great utility to high performance computing centers in general, as well as to other data centers managing complex IT infrastructure.

To meet this need, we developed Open XDMoD, an open source version of XDMoD that is already in use at numerous academic and industrial HPC centers. Throughout the course of the TAS award, we recognized various opportunities to expand XDMoD beyond the ideas put forth in the original proposal, both to increase its utility to XSEDE and to move it into the realm of a comprehensive resource management tool for cyberinfrastructure. Perhaps the best example is the incorporation of job-level performance data into XDMoD through TACC_Stats (the SUPReMM program). This functionality gives XDMoD the ability to identify poorly performing applications, improve throughput, characterize the system's workload, and provide metrics critical for the specification of future procurements. Given the scale of today's HPC systems, even modest increases in throughput can have a substantial impact on science and engineering research. For example, with respect to XSEDE, every 1% increase in system performance translates into an additional 15M CPU hours of computer time that can be allocated for research.

Impact

Downloaded more than 230 times since its release at SC13, Open XDMoD is in production at a dozen or more sites, including academic, industrial, and research centers both nationally and internationally (e.g., CERN, Cambridge, Southampton, Rolls Royce, and Dow). It is also running on NCSA's Blue Waters system and the production systems at NCAR. We intend to further broaden its appeal to academic and industrial HPC centers by adding the ability to analyze job-level performance data through TACC_Stats and other popular performance monitoring packages. This will give many of these centers, for the first time, the ability to characterize and subsequently optimize the applications running on their systems.

Funding Source(s)

  • NSF Office of Advanced Cyberinfrastructure (OAC) 1445806

Publications

S. M. Gallo, J. P. White, R. L. DeLeon, T. R. Furlani, H. Ngo, A. K. Patra, M. D. Jones, J. T. Palmer, N. Simakov, J. M. Sperhac, M. Innus, T. Yearke, R. Rathsam. "Analysis of XDMoD/SUPReMM Data Using Machine Learning Techniques," 2015 IEEE International Conference on Cluster Computing (CLUSTER), 2015. doi:10.1109/CLUSTER.2015.114

N. A. Simakov, J. P. White, R. L. DeLeon, A. Ghadersohi, T. R. Furlani, M. D. Jones, S. M. Gallo, A. K. Patra. "Application Kernels: HPC Resources Performance Monitoring and Variance Analysis," Concurrency and Computation: Practice and Experience, v.27, 2015. doi:10.1002/cpe.3564

J. T. Palmer, S. M. Gallo, T. R. Furlani, M. D. Jones, R. L. DeLeon, J. P. White, N. Simakov, A. K. Patra, J. Sperhac. "Open XDMoD: A Tool for the Comprehensive Management of High Performance Computing Resources," Computing in Science and Engineering, v.17, 2015. doi:10.1109/MCSE.2015.68

Related Link(s)

Thomas Furlani, furlani@buffalo.edu (Principal Investigator)
Gregor von Laszewski (Co-Principal Investigator)
Abani Patra (Co-Principal Investigator)
Steven Gallo (Co-Principal Investigator)
Matthew Jones (Co-Principal Investigator)