The Performance Doctor Is In

New software helps identify and fix bottlenecks in scientific computing codes

A PerfExpert case study. The tool was applied to a mantle advection code and found the code to be memory bound. The group was able to achieve a 40% speedup due to node-level optimizations.

Much has been made of the performance gap between the bleeding-edge hardware that runs large-scale computational models and simulations, and the legacy software researchers use to execute these computations.

"Supercomputers are the most expensive and most powerful computers on the planet," explained Martin Burtscher, professor of computer science at Texas State University and the lead developer of the new optimization tool, PerfExpert. "You would think that if you run a program on these top machines, you would also use the highest-end code. But that's not always the case."

In part, this performance gap exists because most of the people who research big problems requiring high-end machines are not computer scientists, much less optimization experts. They are domain experts (chemists, astronomers, particle physicists and the like) with extensive knowledge in their disciplines but less experience with code optimization, the process of modifying a software system so that it runs more efficiently and uses fewer resources.

Optimizing execution behavior on even a single complex modern processor core is difficult, and combining several cores on one chip adds to the complexity. In high performance computing (HPC), several chips are embedded in a node, and many nodes are connected by a network to form a supercomputing cluster. As those cores and their connections become more complicated, so does the problem of programming them to execute more rapidly, to operate with less storage, and to draw less power. Finally, efficient programming on high-performance systems requires not only optimization at the core, chip and node levels, but also coordination of the actions of hundreds to thousands of nodes.

The performance counter tool workflow explains how PerfExpert works and how it differs from other optimization tools.

When scientists and engineers do not optimize their codes, a significant amount of computing time is lost or used ineffectively, slowing the progress of important research. Making optimization expertise readily accessible to application domain experts is the goal of PerfExpert.

PerfExpert was born at TeraGrid 2008, the annual conference for NSF-supported research and technology in HPC.

James Browne, a professor of computer science at The University of Texas at Austin and a 40-year veteran of performance evaluation, attended a session at the conference where the developers of important open-source performance evaluation tools for HPC systems each presented their systems. These tools, which draw on the performance counters built into modern microprocessors to report on the trillions of minute calculations the chips execute each second, are very powerful in the hands of expert users, but using them effectively requires extensive knowledge of core and chip architecture and of the system's software stack.

At the end of the session, Browne asked the audience of approximately 100 application domain researchers whether they had ever used any of the tools that had just been presented. In response, only three researchers said they had.

"It seemed clear to me that performance evaluation tools for the HPC application community needed to be put in a more commercial model where ease-of-use is a primary goal," Browne said.

Martin Burtscher, professor of computer science at Texas State University and the lead developer of the new optimization tool, PerfExpert.

This reality (and a passion for computing performance) led Browne to recruit Burtscher and a team of optimization experts from the Texas Advanced Computing Center (TACC) to create PerfExpert, a general-purpose tool that provides simple, useable advice to help make computer programs run faster.

"The idea was to write a tool that takes our experience and makes it available to people whenever they want it," Burtscher said.

After running an application, PerfExpert provides the user with a list of the most important functions and loops that represent performance bottlenecks and the cause of each bottleneck. Typically less than a page long, the output is graphical and intuitive, and provides examples of how to improve the code's structure and data layout.
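The kind of report described above can be pictured with a small sketch. This is not PerfExpert's implementation or output format: the `Section` record, the `diagnose` function, the 10 percent cutoff, the category names and all of the numbers are hypothetical, invented only to illustrate ranking code sections by their share of runtime and naming the most costly category for each.

```python
from dataclasses import dataclass


@dataclass
class Section:
    """One function or loop from a (hypothetical) profile."""
    name: str
    runtime_share: float   # fraction of total runtime spent here, 0..1
    category_cost: dict    # estimated cost per category; larger is worse


def diagnose(sections, min_share=0.10):
    """Return (name, runtime share, dominant cost category) for the hot
    sections only, ordered from most to least time-consuming."""
    hot = [s for s in sections if s.runtime_share >= min_share]
    hot.sort(key=lambda s: s.runtime_share, reverse=True)
    return [
        (s.name, s.runtime_share, max(s.category_cost, key=s.category_cost.get))
        for s in hot
    ]


# Made-up numbers for illustration; not measurements from any real code.
profile = [
    Section("assemble_rhs", 0.25,
            {"data accesses": 0.4, "floating point": 1.3, "branches": 0.2}),
    Section("write_output", 0.03,
            {"data accesses": 0.2, "floating point": 0.1, "branches": 0.3}),
    Section("advect_loop", 0.55,
            {"data accesses": 2.1, "floating point": 0.6, "branches": 0.1}),
]

for name, share, cause in diagnose(profile):
    print(f"{name}: {share:.0%} of runtime, dominated by {cause}")
```

Filtering out sections that account for little runtime keeps the report short, which matches the article's point that the user should see only the places where there is the most to be gained.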

"PerfExpert deliberately presents a simple and focused diagnosis of performance bottlenecks so the user can concentrate on the important parts where there's the most to be gained," said Burtscher.

PerfExpert also lets users track the optimization progress so they can see if the changes suggested by the tool really made a difference in the performance of the code.

The project team consisted of Burtscher (then at UT's Institute for Computational Engineering and Sciences); John McCalpin, Byoung-Do Kim, Lars Koesterke and Carlos Rosales at TACC; and Jeff Diamond and Steve Keckler in the Department of Computer Science at UT. The research, planning and development of the tool were funded by the National Science Foundation (NSF) through the "World-Class Science Through World Leadership in HPC" award, which financed the deployment of TACC's Ranger supercomputer.

Testing the tool on Ranger, the group used PerfExpert to optimize four robust science codes, and achieved speedups of up to 40 percent overall. "On Ranger, that's like having 25,000 extra cores," Burtscher said.
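The comparison to extra cores can be checked with simple arithmetic. The sketch below assumes Ranger's published size of 62,976 cores (3,936 nodes with 16 cores each), a figure not stated in this article, and assumes that added cores would scale perfectly; the function name is hypothetical.

```python
# Ranger's published core count: 3,936 nodes x 16 cores per node.
# This figure is an outside assumption, not taken from the article.
RANGER_CORES = 3_936 * 16


def equivalent_extra_cores(total_cores: int, speedup_fraction: float) -> int:
    """Cores that would have to be added to gain the same throughput as
    a given speedup, under the idealized assumption of perfect scaling."""
    return round(total_cores * speedup_fraction)


print(RANGER_CORES)                                # 62976
print(equivalent_extra_cores(RANGER_CORES, 0.40))  # 25190
```

Under those assumptions, a 40 percent speedup across the whole machine corresponds to roughly 25,000 additional cores, in line with the figure Burtscher quotes.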

The project team trained the tool by populating a database with lessons learned by optimizing simulation codes and assisting HPC users. Currently, PerfExpert is installed on Ranger and Longhorn at TACC, and the group recently hired a full-time developer, Ashay Rane, to extend the capabilities of PerfExpert and to develop a public release. In the coming months, PerfExpert will become available on other supercomputers on the NSF TeraGrid with different architectures.

James C. Browne, professor of computer science and physics and Regents Chair in computer sciences at The University of Texas at Austin.

For Burtscher, Browne, Rane and the rest of the group, the project represents a rare opportunity to share performance expertise throughout the HPC community and to help users squeeze more performance, and thus more science, out of these machines.

"Lower turn-around times, bigger problem sets, more simulations per time unit: PerfExpert helps in all of these dimensions," said Burtscher. "I love performance, so if I can help other people get their programs to run faster, that's very cool."

PerfExpert has a web page with a QuickStart Guide, slides from a tutorial and other helpful information. Many users have been able to apply PerfExpert on Ranger directly from the QuickStart Guide. You are invited to try it out, too! If you try PerfExpert and have questions or want follow-up information, please contact Ashay Rane (ashay.rane@tacc.utexas.edu). A performance optimization class featuring PerfExpert will be offered at TACC on April 25th.

Published March 2, 2011


The Texas Advanced Computing Center (TACC) at The University of Texas at Austin is one of the leading centers of computational excellence in the United States. The center's mission is to enable discoveries that advance science and society through the application of advanced computing technologies. To fulfill this mission, TACC identifies, evaluates, deploys, and supports powerful computing, visualization, and storage systems and software. TACC's staff experts help researchers and educators use these technologies effectively, and conduct research and development to make these technologies more powerful, more reliable, and easier to use. TACC staff also help encourage, educate, and train the next generation of researchers, empowering them to make discoveries that change the world.

  • PerfExpert is a new easy-to-use performance diagnosis tool for HPC applications created by researchers at The University of Texas at Austin, Texas State University and TACC.

Aaron Dubrow
Science and Technology Writer
aarondubrow@tacc.utexas.edu