PerfExpert 4.0

An Easy-to-Use Performance Diagnosis Tool for HPC Applications.

Quick Start Guide

This guide explains how to run programs using PerfExpert and how to interpret its output using a simple matrix multiplication program. In this chapter, we will use the OpenMP simple matrix multiplication program available in

CAUTION: PerfExpert may, if you choose to use the full capabilities for automated optimization, change your source code during the process of optimization. PerfExpert always saves the original file with a different name (e.g., omp_mm.c.old_27301) as well as adding annotations to your source code for each optimization it makes. We cannot, however, fully guarantee that code modifications for optimizations will not break your code. We recommend having a full backup of your original source code before using PerfExpert.

Environment Configuration

If you are using any of the TACC machines (int this example Stampede specifically), load the appropriate modules:

module load papi hpctoolkit perfexpert

The runs with PerfExpert should be made using a data set size for each compute node which is equivalent to full production runs but for which execution time is not more than about ten or fifteen minutes since PerfExpert will run your application multiple times (actually, three times on Stampede) with different performance counters enabled. For that reason, before you run PerfExpert you should either request iterative access to computational resources (compute node), or modify the job script that you use to run your application and specify a running time that is about 3 (for Stampede) or 6 (for Lonestar) times the normal running time of the program.

To request interactive access to a compute node on Stampede should read Stampede User Guide.

CAUTION: When running your application with a jobscript, be sure to specify a running time that is about 6x the normal running time of the program.

#SBATCH -J myMPI      # job name
#SBATCH -o myMPI.o%j  # output and error filename (%j stands for jobID)
#SBATCH -n 16         # total number of mpi tasks requested
#SBATCH -p development # queue (partition) -- normal, development, etc.
#SBATCH -t 01:30:00          # run time (hh:mm:ss) - 1.5 hours
perfexpert 0.1 ./my_program  # run the executable named my_program

This will instruct PerfExpert to run your application and generate the performance anaylis report.

PerfExpert Options

There are several different options for applying PerfExpert. The following summary shows you how to choose the options to run PerfExpert to match your needs.

$ perfexpert -h
Usage: perfexpert <threshold> [-gvch] [-l level] [-d database]
                  [-r count] [-m target|-s sourcefile] [-p prefix]
                  [-a FILE] [-b FILE] <target_program> [arguments]

  <threshold>        Define the relevance (in % of runtime) of code
                     fragments PerfExpert should take into
                     consideration (> 0 and <= 1)
  -d --database      Select the recommendation database file
                     (default: system defined)
  -r --recommend     Number of recommendations to show
  -m --makefile      Use GNU standard 'make' command to compile the
                     code (it requires the source code available in
                     the current directory)
  -s --source        Set the source code file (if your source code
                     has more than one file please use a Makefile and
                     choose '-m' option. It also enables the automatic
                     optimization option '-a')
  -p --prefix        Add a prefix to the command line (e.g. mpirun).
                     Use double quotes to specify multiple arguments
                     (e.g. -p "mpirun -n 2")
  -b --before        Execute FILE before each run of the application
  -a --after         Execute FILE after each run of the application
  -g --clean-garbage Remove temporary fiels after run
  -v --verbose       Enable verbose mode using default verbose level
  -l --verbose_level Enable verbose mode using a specific verbose level
  -c --colorful      Enable colors on verbose mode, no weird characters
                     will appear on output files
  -h --help          Show this message

Use CC, CFLAGS and LDFLAGS to select compiler and compilation/linkage flags

If you select the -m or -s options, PerfExpert will try to automatically optimize your code and show you the performance analysis report & the list of suggestion for bottleneck remediation when no automatic optimization is possible.

For the -m or -s options, PerfExpert requires access to the application source code. If you select the -m option and the application is composed of multiple files, your source code tree should have a Makefile file to enable PerfExpert compile your code. If your application is composed of a single source code file, the option -s is sufficient for you. If you do not select -m or -s options, PerfExpert requires only the binary code and will show you only the performance analysis report and the list of suggestion for bottleneck remediation.

PerfExpert will run your application multiple times to collect different performance metrics. You may use the -b (or -a) options if you want to execute a program or script before (or after) each run. The argument target_program should be the filename of the application you want to analyze, not a shell script, otherwise, PerfExpert will analyze the performance of the shell script instead of the performance of you application.

Use the -r option to select the number of recommendations for optimization you want for each code section which is a performance bottleneck.

CAUTION: If your program takes any argument that starts with a "-'' signal PerfExpert will interpret this as a command line option. To help PerfExpert handle arguments and options correctly, use quotes and add a space before the program's arguments (e.g., " -s 50'').

CAUTION: In case you are trying to optimize a MPI application, you should use the -p option to specify the MPI launcher and also it's arguments.

For this guide, using the OpenMP simple matrix multiply code, we will use the following command line options:

$ OMP_NUM_THREADS=16 CFLAGS="-fopenmp" perfexpert -s mm_omp.c 0.05 mm_omp

which executes PerfExpert's automatic optimizations and will also generate an OpenMP-enabled binary which will run with 16 threads. In this case, PerfExpert will compile the mm_omp.c code using the system's default compiler, which is GCC in the case of Stampede. PerfExpert will take into consideration only code fragments (loops and functions) that take more 5% of the runtime.

To select a different compiler, you should specify the CC environment variable as below:

$ CC="icc" OMP_NUM_THREADS=16 CFLAGS="-fopenmp" perfexpert -s mm_omp.c 0.05 mm_omp

If you do not want to have PerfExpert trying to optimize the application automatically, just compile your program and run the following command:

$ OMP_NUM_THREADS=16 perfexpert 0.05 mm_omp

WARNING: If the command line you use to run PerfExpert includes the MPI launcher (i.e., mpirun -n 2 my_mpi_app my_mpi_app_arguments ...), PerfExpert will analyze the performance of the MPI launcher instead of the performance of your application. Use the -p command line argument of PerfExpert to set the MPI launcher and all its arguments (e.g., -p "mpirun -n 16").

The Performance Analysis Report

This section explains the performance analysis report and the metrics shown by PerfExpert. We discuss the following sample output:

Loop in function compute() (99.9% of the total runtime)
ratio to total instrns       %  0.......25.........50.......75......100
   - floating point      :    6 ***
   - data accesses       :   33 ****************
* GFLOPS (% max)         :    7 ***
   - packed              :    0 *
   - scalar              :    7 ***
performance assessment     LCPI good....okay....fair....poor....bad...
* overall                :  0.8 >>>>>>>>>>>>>>>>
upper bound estimates
* data accesses          :  2.5 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+
   - L1d hits            :  1.3 >>>>>>>>>>>>>>>>>>>>>>>>>
   - L2d hits            :  0.3 >>>>>
   - L3d hits            :  0.0 >>>>>>>>>>>>>>>>
   - LLC misses          :  0.1 >>
* instruction accesses   :  0.3 >>>>>>
   - L1i hits            :  0.3 >>>>>>
   - L2i hits            :  0.0 >
   - L2i misses          :  0.0 >
* data TLB               :  1.5 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
* instruction TLB        :  0.0 >
* branch instructions    :  0.0 >
   - correctly predicted :  0.0 >
   - mispredicted        :  0.0 >
* floating-point instr   :  0.2 >>>>
   - fast FP instr       :  0.2 >>>>
   - slow FP instr       :  0.0 >

Apart from the total running time, PerfExpert performance analysis report includes, for each code segment:

  • Instruction execution ratios (with respect to total instructions);
  • Approximate information about the computational efficiency (GFLOPs measurements);
  • Overall performance;
  • Local Cycles Per Instruction (LCPI) values for the cost of memory accesses.

The program composition part shows what percentage of the total instructions were computational (floating-point instructions) and what percentage were instructions that accessed data. This gives a rough estimate in trying to understand whether optimizing the program for either data accesses or floating-point instructions would have a significant impact on the total running time of the program.

The PerfExpert performance analysis report also shows the GFLOPs rating, which is the number of floating-point operations executed per second in multiples of 109. The value for this metric is displayed as a percentage of the maximum possible GFLOP value for that particular machine. Although it is rare for real-world programs to match even 50% of the maximum value, this metric can serve as an estimate of how efficiently the code performs computations.

The next, and major, section of the PerfExpert performance analysis report shows the LCPI values, which is the ratio of cycles spent in the code segment for a specific category, divided by the total number of instructions in the code segment. The overall value is the ratio of the total cycles taken by the code segment to the total instructions executed in the code segment. 

Generally, a value of 0.5 or lower for an LCPI is considered to be good. However, it is only necessary to look at the ratings (good, okay, ..., bad) The rest of the report maps this overall LCPI, into the six constituent categories: data accesses, instruction accesses, data TLB accesses, instruction TLB accesses, branches and floating point computations. Without getting into the details of instruction operation on Intel and AMD chips, one can say that these six categories record performance in non-overlapping ways. That is, they roughly represent six separate categories of performance for any application.

The LCPI value is a good indicator of the cost arising from instructions of the specific category. Hence, the higher the LCPI, the slower the program. The following is a brief description of each of these categories:

  • Data accesses: counts the LCPI arising from accesses to memory for program variables.
  • Instruction accesses: counts the LCPI arising from memory accesses for code (functions and loops).
  • Data TLB: provides an approximate measure of penalty arising from strides in accesses or regularity of accesses.
  • Instruction TLB: reflects cost of fetching instructions due to irregular accesses.
  • Branch instructions: counts cost of jumps (i.e., if statements, loop conditions, etc.).
  • Floating-point instructions: counts LCPI from executing computational (floating-point) instructions.

Some of these LCPI categories have subcategories. For instance, the LCPI from data and instruction accesses can be divided into LCPI arising from the individual levels of the data and instruction caches and branch LCPIs can be divided into LCPIs from correctly predicted and from mispredicted branch instructions. For floating-point instructions, the division is based on floating-point instructions that take few cycles to execute (e.g., add, subtract and multiply instructions) and on floating-point instructions that take longer to execute (e.g., divide and square-root instructions).

In each case, the classification (data access, instruction access, data TLB, etc.) is shown so that it is easy to understand which category is responsible for the performance slowdown. For instance if the overall CPI is poor and the data access LCPI is high, then you should concentrate on access to program variables and memory. Additional LCPI details help in relating performance numbers to the process architecture.

IMPORTANT: When PerfExpert runs with automatic performance optimization enabled the performance analysis report shown reflects the performance of the code after all possible automatic optimizations have been applied.

NOTICE: PerfExpert creates a .perfexpert-temp.XXXXXX directory for each time it is executed. This directory has one subdirectory for each optimization cycle PerfExpert completed or attempted. Each subdirectory includes the intermediate files PerfExpert generated during each cycle, including the performance analysis reports.

List of Recommendations for Optimization

If PerfExpert runs with -r option enabled, it will also generate a list of suggestions for performance improvement for each bottleneck. This option is always available, it does not depend on which of the other command line option are. A list of suggestions for this example is shown below:

# Recommendations for /work/02204/fialho/tutorial/3/mm_omp.c:14
# Here is a possible recommendation for this code segment
Description: move loop invariant computations out of loop
Reason: this optimization reduces the number of executed floating-point operations
Code example:
loop i
  x = x + a * b * c[i];
temp = a * b;
loop i
  x = x + temp * c[i];

# Here is a possible recommendation for this code segment
Description: change the order of loops
Reason: this optimization may improve the memory access pattern and make it more cache
and TLB friendly
Code example:
loop i
  loop j {...}
loop j
  loop i {...}

# Here is a possible recommendation for this code segment
Description: componentize important loops by factoring them into their own subroutines
Reason: this optimization may allow the compiler to optimize the loop independently and
thus tune it better
Code example:
loop i {...}
loop j {...}
void li() { loop i {...} }
void lj() { loop j {...} }
James Browne

Professor Emeritus of Computer Science, UT Austin

Antonio Gomez

Research Scientist, UT Austin

Ashay Rane

PhD student, UT Austin