Stampede User Guide

Updates and Notices

  • Use ONLY the "impi" or "mvapich2-mic" modules for symmetric (CPU+MIC) MPI computing. All impi modules support symmetric computing; among the MVAPICH2 modules, only mvapich2-mic currently supports the MIC coprocessor.
  • New queue policy: Jobs in the development queue are now limited to 1 job at a time for a maximum of 2 hours.
  • Stampede's Intel Xeon Phi Coprocessors are now in full production, adding more than 7 PF of performance to the system. Please consult the calendar for upcoming training classes.

System Overview

The TACC Stampede* system is a 10 PFLOPS (PF) Dell Linux Cluster based on 6,400+ Dell PowerEdge server nodes, each outfitted with 2 Intel Xeon E5 (Sandy Bridge) processors and an Intel Xeon Phi Coprocessor (MIC Architecture). The aggregate peak performance of the Xeon E5 processors is 2+PF, while the Xeon Phi processors deliver an additional aggregate peak performance of 7+PF. The system also includes a set of login nodes, large-memory nodes, graphics nodes (for both remote visualization and computation), and dual-coprocessor nodes. Additional nodes (not directly accessible to users) provide management and file system services.

One of the important design considerations for Stampede was to create a multi-use cyberinfrastructure resource, offering large-memory, large-data-transfer, and GPU capabilities for data-intensive, accelerated, or visualization computing. By augmenting some of the compute-intensive nodes within the system with very large memory and GPUs, there is no need to move data for data-intensive computing, remote visualization, and GPGPU computing. For those situations requiring large data transfers from other sites, 4 high-speed data servers have been integrated into the Lustre file systems.

Compute Nodes: The majority of the 6400 nodes are configured with two Xeon E5-2680 processors and one Intel Xeon Phi SE10P Coprocessor (on a PCIe card). These compute nodes are configured with 32GB of "host" memory with an additional 8GB of memory on the Xeon Phi coprocessor card. A smaller number of compute nodes are configured with two Xeon Phi Coprocessors -- the specific number of nodes available in each configuration at any time will be available from the batch queue summary.

Large Memory Nodes: There are an additional 16 large-memory nodes with 32 cores/node and 1TB of memory for data-intensive applications requiring disk caching to memory and large-memory methods.

Visualization Nodes: For visualization and GPGPU processing, 128 compute nodes are augmented with a single NVIDIA K20 GPU per node, each with 5GB of on-board GDDR5 memory.

File Systems: The Stampede system supports a 14PB global, parallel file storage managed as three Lustre file systems. Each node contains a local 250GB disk. Also, the TACC Ranch tape archival system (60 PB capacity) is accessible from Stampede.

Interconnect: Nodes are interconnected with Mellanox FDR InfiniBand technology in a two-level (core and leaf switches) fat-tree topology.

Figure 1.1. Stampede System: Front-row, 8 in all (pre-production image).

Innovative Computing Capability with Intel Xeon Phi Coprocessor

As directed in the NSF award, the system is equipped with an innovative computing component. The TACC solution features an Intel® Many Integrated Core Architecture (Intel® MIC Architecture) coprocessor in each compute node. The Xeon Phi Coprocessor (often called a MIC, pronounced "Mike") has significantly more cores (61) than, and vector registers four times as wide as, those on TACC's Lonestar IV system. The basis of the Phi coprocessor is a lightweight x86 core with in-order instruction processing, coupled with heavyweight 512-bit SIMD registers and instructions. With these two features the Phi die can support 60+ cores, and each vector instruction can operate on 8 double-precision (DP) values at a time. The core count and vector lengths are basic extensions of an x86 processor, and allow the same programming paradigms (serial, threaded, and vector) used on other Xeon (E5) processors. Unlike the GPGPU accelerator model, the same program code can be used efficiently on the host and the coprocessor. Also, the same Intel compilers, tools, libraries, etc. that you use on Intel and AMD systems are available for the Phi processors.

These coprocessors contain a large number of (relatively) simple cores running at lower frequency to deliver much higher peak performance per chip than is available using more traditional multi-core approaches. In the case of the Xeon Phi Coprocessor SE10P used in the Stampede system, each coprocessor chip has a peak performance of ~1070 GFLOPS, approximately six times the peak performance of a single Xeon E5 processor, or three times the aggregate peak performance of the two Xeon E5 processors in each Stampede compute node. Each coprocessor is equipped with 8GB of GDDR5 DRAM with a peak bandwidth of 352GB/s, also significantly higher than the 51.2GB/s peak bandwidth available to each Xeon E5 processor chip.

A critical advantage of the Xeon Phi coprocessor is that, unlike GPU-based coprocessors, the processing cores in the Xeon Phi coprocessor run the Intel x86 instruction set (with 64-bit extensions), allowing the use of familiar programming models, software, and tools.

The many-core Phi processor is integrated into each node as a coprocessor, connected to the E5 processors and the external network (HCA card) through a PCI Express (PCIe) interface as shown in Figure 1.2; the connectivity is similar to the way GPU accelerators are configured in a node.

Figure 1.2. Stampede Zeus Node: 2 Xeon E5 processors and 1 Xeon Phi coprocessor

The MIC coprocessor, however, runs a lightweight BusyBox Operating System (OS), making the MIC function as a separate Symmetric Multiprocessor (SMP). So, while the MIC can be used as a work-offload engine by the E5 processors, it is also capable of working as another independent (SMP) processor. In the latter mode MPI processes can be launched on the MIC and/or the E5 processors. In this "symmetric" mode the MIC appears as an extra node for launching MPI tasks.

System Configuration

All Stampede nodes run CentOS 6.3 and are managed with batch services through SLURM 2.4. Global $HOME, $WORK and $SCRATCH storage areas are supported by three Lustre parallel distributed file systems with 76 IO servers. Inter-node communication (MPI/Lustre) is through an FDR Mellanox InfiniBand network.

The 6,400 Dell Zeus C8220z compute nodes are housed in 160 racks (40 nodes/rack), along with two 36-port Mellanox leaf switches. Each node has two Intel E5 8-core (Sandy Bridge) processors and an Intel Xeon Phi 61-core (Knights Corner) coprocessor, connected by an x16 PCIe bus. The host and coprocessor are configured with 32GB of DDR3 and 8GB of GDDR5 memory, respectively. Forty of the compute nodes are reserved for development and are accessible interactively for a limited time. The 16 large-memory nodes are in a single rack. Each node is a Dell PowerEdge R820 server with 4 E5-4650 8-core processors and 1TB of DDR3 memory.

The interconnect is an FDR InfiniBand network of Mellanox switches, consisting of a fat tree topology of eight core-switches and over 320 leaf switches with a 5/4 oversubscription. The network configuration for the compute nodes is shown in Figure 1.4.

The configuration and features for the compute nodes, interconnect and I/O systems are described below, and summarized in Tables 1.1 through 1.4.

Configuration Details & Technical Specifications

Table 1.1 System Configuration & Performance

Component Technology Performance/Size
Nodes(sled) 2 8-core Xeon E5 processors, 1 61-core Xeon Phi coprocessor 6,400 Nodes
Memory Distributed, 32GB/node 205TB (Aggregate)
Shared Disk Lustre 2.1.3, parallel File System 14 PB
Local Disk SATA (250GB) 1.6PB (Aggregate)
Interconnect InfiniBand Mellanox Switches/HCAs FDR 56 Gb/s

Compute nodes

A compute node consists of a Dell C8220z double-wide sled in a 4-rack-unit chassis with 3 other sleds. Each node runs CentOS 6.3 with the 2.6.32 x86_64 Linux kernel. Each node contains two 8-core 64-bit Intel Xeon E5 processors (16 cores in all) on a single board, as an SMP unit. The core frequency is 2.7GHz, and each core supports 8 floating-point operations per clock period, for a peak performance of 21.6 GFLOPS/core or 346 GFLOPS/node. Each node contains 32GB of memory (2GB/core). The memory subsystem has 4 channels from each processor's memory controller to 4 DDR3 ECC DIMMs, each rated at 1600 MT/s (51.2GB/s for all four channels in a socket). The processor interconnect, QPI, runs at 8.0 GT/s between sockets. The Intel Xeon Phi is a special production model with 61 1.1 GHz cores and a peak performance of 17.6 DP GFLOPS/core, or ~1.07 DP TFLOPS per Phi. Each coprocessor contains 8GB of GDDR5 memory, with 8 dual-channel controllers, with a peak memory bandwidth of 352GB/s.

Table 1.2 Dell DCS (Dell Custom Solution) C8220z Compute Node

Component Technology
Sockets per Node/Cores per Socket
2/8 Xeon E5-2680 2.7GHz (turbo, 3.5)
1/61 Xeon Phi SE10P 1.1GHz
Motherboard Dell C8220, Intel QPI, C600 Chipset
Memory per Host 32GB 8x4GB, 4 channels DDR3-1600MHz
Memory per Coprocessor 8GB GDDR5
QPI 8.0 GT/s
PCI Express Processor x40 lanes, Gen 3
PCI Express Coprocessor x16 lanes, Gen 2 (extended)
Local Disk 250GB, 7.5K RPM SATA

Figure 1.3. Dell Zeus Node: E5 processors (left top) and Phi coprocessor (left bottom, right).
2 half-sleds (left, top & bottom) combine to form a single sled.

Login nodes

Table 1.3 PowerEdge R720 Login Nodes

Component Technology
4 login nodes
Sockets per Node/Cores per Socket 2/8, Xeon E5-2680, 2.7GHz
Motherboard Dell R720, Intel QPI C600 Chipset
Memory Per Node 32GB 8x4GB 4 channels/CPU DDR3-1600 (MT/s)
Cache: 256KB/core L2; 20MB/CPU L3
Global: Lustre, xxGB quota
Local: Shared, 432GB SATA 10K rpm

Intel E5 (Sandy Bridge) processor

The new E5 architecture includes the following features important to HPC:

  • 4-wide DP (double-precision) vector units with the AVX instruction set
  • 4-Channel On-chip (integrated) memory controllers
  • Support for 1600MT/s DDR3 memory
  • Dual Intel QuickPath links between Xeon dual-processor systems support 8.0GT/s
  • Turbo Boost version 2.0, up to peak 3.5GHz in turbo mode.
  • In these Romley platforms PCIe lanes are controlled by CPUs (do not pass through the chip set)
  • Gen 3.0 PCI Express
  • Improved Hyper-Threading (disabled on Stampede, though beneficial for some HPC packages)
  • 64 KB L1 Cache/core (32KB L1 Data and 32KB L1 Instruction)


Interconnect

The 56Gb/s FDR InfiniBand interconnect consists of Mellanox switches, fiber cables and HCAs (Host Channel Adapters). Eight 648-port SX6536 core switches and over 320 36-port SX6025 endpoint switches (2 in each compute-node rack) form a 2-level Clos fat-tree topology, illustrated in Figure 1.4. Core and endpoint switches have capacities of 73 and 4.0 Tb/s, respectively. There is a 5/4 oversubscription at the endpoint (leaf) switches (20 node input ports : 16 core-switch output ports). Any MPI message travels at most 5 hops from source to destination.

Figure 1.4. Stampede Network Topology for 6,400 compute nodes: 8 648-port core switches and 320 36-port leaf switches.

File Systems Overview

User-owned storage on the Stampede system is available in three directories, identified by the $HOME, $WORK and $SCRATCH environment variables. These directories are separate Lustre (global) file systems, and accessible from any node in the system.

Table 1.4 Storage Systems

Storage Class Size Architecture Features
Local (each node)
  Login: 1TB, SATA x.xK rpm, 432GB partition mounted on /tmp
  Compute: 250GB, SATA 7.5K rpm, 80GB partition mounted on /tmp
  Big Mem: 600GB, SATA z.zK rpm, 398GB partition mounted on /tmp
Parallel 14PB Lustre, Version x.x 72 Dell R610 data servers (OSS) through IB; user striping allowed; MPI-IO; XPB, YPB, and ZPB partitions on $HOME/$WORK/$SCRATCH; 4 Dell R710 metadata servers (MDxxx) with 2 Dell MD3220 Storage Arrays.
Ranch (Tape Storage) 60PB SAM-FS (Storage Archive Manager) 10GB/s connection through 4 GridFTP Servers

System Access

Secure Shell

To create a login session from a local machine it is necessary to have an SSH client. Wikipedia is a good source of information on SSH, and provides information on the various clients available for your particular operating system. To ensure a secure login session, users must connect to machines using the secure shell ssh command. Data movement must be done using the secure shell commands scp and sftp.

Do not run the optional ssh-keygen command to set up public-key authentication with a passphrase; the passphrase prompt will interfere with the execution of job scripts in the batch system. If you have already done this, remove the .ssh directory (and the files under it) from your home directory, then log out and log back in to regenerate the keys.

To initiate an ssh connection to a Stampede login node from your local system, use one of the SSH clients supporting the SSH-2 protocol (e.g., OpenSSH, PuTTY, SecureCRT), then execute the following command:

localhost$ ssh taccuserid@stampede.tacc.utexas.edu

Note: the TACC userid is needed if the user name on the local machine and the TACC machine differ.

Login passwords (which are identical to TACC portal passwords, not XUP passwords) can be changed in the TACC Portal. Select "Change Password" under the "HOME" tab after you log in. If you've forgotten your password, go to the TACC portal home page and select the "? Forgot Password" button in the Sign In area.

GSI-OpenSSH (gsissh)

The following commands authenticate using the XSEDE myproxy server, then connect to the gsissh port, 2222, on Stampede:

    localhost$ myproxy-logon -s
    localhost$ gsissh -p 2222 stampede.tacc.utexas.edu

Please consult NCSA's detailed documentation on installing and using myproxy and gsissh, as well as the GSI-OpenSSH User's Guide, for more information.


To report a problem please run the ssh or gsissh command with the "-vvv" option and include the verbose information in the ticket.

Computing Environment

Unix Shell

The most important component of a user's environment is the login shell. It interprets text on each interactive command line and statements in shell scripts. Each account has an entry in the /etc/passwd file whose last field contains the shell launched at login. The default shell is Bash. To determine your login shell, use:

login1$ echo $SHELL

You can change your default login shell by submitting a ticket via the TACC Portal. Select one of the available shells listed in the /etc/shells file, and include it in the ticket. To display the list of available shells, execute:

login1$ cat /etc/shells

After your support ticket is closed, please allow several hours for the change to take effect.

Environment Variables

Another important component of a user's environment is the set of environment variables. Many of the UNIX commands and tools, such as the compilers, debuggers, profilers, editors, and just about all applications that have GUIs (Graphical User Interfaces), look in the environment for variables that specify information they may need to access. To see the variables in your environment execute the command:

login1$ env

The variables are listed as keyword/value pairs separated by an equals (=) sign, as with the $HOME and $PATH variables.


Notice that the $PATH environment variable consists of a colon (:) separated list of directories. Variables set in the environment (with setenv for C-type shells and export for Bourne-type shells) are carried to the environment of shell scripts and new shell invocations, while normal shell variables (created with the set command) are useful only in the present shell. Only environment variables are displayed by the env (or printenv) command. Execute "set" to see the (normal) shell variables.
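The distinction is easy to demonstrate from a Bash prompt: only exported variables survive into a child process. A minimal sketch (the variable names are arbitrary; csh users would use setenv/set instead):

```shell
# A plain shell variable is visible only to the current shell;
# an exported (environment) variable is inherited by child processes.
MYLOCAL="shell only"
export MYENV="inherited"

# The child shell sees MYENV but not MYLOCAL:
sh -c 'echo "MYLOCAL=[$MYLOCAL] MYENV=[$MYENV]"'
# → MYLOCAL=[] MYENV=[inherited]
```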

Startup Scripts

Unix shells allow users to customize their environment via startup files containing scripts. Customizing your environment with startup scripts is not entirely trivial. Below are some simple instructions, as well as an explanation of the shell setup operations.

Technical Background

All UNIX systems set up a default environment that provides administrators and users with the ability to execute additional UNIX commands to alter that environment. These commands are sourced; that is, they are executed by the login shell, and the variables (both normal and environmental), as well as aliases and functions, are included in the present environment. The Xeon E5 hosts on Stampede support the Bourne shell and its variants (/bin/sh, /bin/bash, /bin/zsh) and the C shell and its variants (/bin/csh, /bin/tcsh). The BusyBox operating system on the Xeon Phi coprocessors supports only the Dash shell, a lightweight shell with Bash-like syntax. Each shell's environment is controlled by system-wide and user startup files. TACC deploys system-specific startup files in the /etc/profile.d/ directory. User-owned startup files are dot files (beginning with a period, viewable with the "ls -a" command) in the user's $HOME directory.

Each UNIX shell may be invoked in three different ways: as a login shell, as an interactive shell or as a non-interactive shell. The differences between a login and interactive shell are rather arcane. For our purposes, just be aware that each type of shell runs different startup scripts at different times depending on how it's invoked. Both login and interactive shells are shells in which the user interacts with the operating system via a terminal window. A user issues standard command-line instructions interactively. A non-interactive shell is launched by a script and does not interact with the user, for example, when a queued job script runs.

Bash shell users should understand that login shells, for example, shells launched via ssh, source one and only one of the files ~/.bash_profile, ~/.bash_login, or ~/.profile (whichever the command finds first in file-list order), and will not automatically source ~/.bashrc. Interactive non-login shells, for example shells launched by typing "bash" on the command-line, will source ~/.bashrc and nothing else. We recommend that Bash users use ~/.profile rather than .bash_profile or .bash_login: both Bash on the Xeon E5 host and Dash on the Xeon Phi coprocessor will source ~/.profile when you login via ssh. You may also want to restrict yourself to POSIX-compliant syntax so both shells correctly interpret your commands.
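A minimal, POSIX-compliant ~/.profile might look like the following sketch. The personal bin directory and editor choice are placeholders, not site defaults; the syntax is kept POSIX-compliant so both Bash (host) and Dash (Phi) can parse it:

```shell
# ~/.profile -- read by Bash login shells on the host and by Dash on the
# Xeon Phi coprocessor; keep the syntax POSIX-compliant for both.

# Prepend a personal bin directory to the search path (placeholder path).
PATH="$HOME/bin:$PATH"
export PATH

# Pick a default editor (placeholder choice).
EDITOR=vi
export EDITOR

# Let interactive Bash login shells share aliases with non-login shells.
if [ -n "$BASH_VERSION" ] && [ -f "$HOME/.bashrc" ]; then
    . "$HOME/.bashrc"
fi
```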

The system-wide startup scripts, /etc/profile for Bash and /etc/csh.cshrc for C type shells, set system-wide variables such as ulimit, and umask, and environment variables such as $HOSTNAME and the initial $PATH. They also source command scripts in the /etc/profile.d/ directory that site administrators may use to set up the environments for common user tools (e.g., vim, less) and system utilities (e.g., Ganglia, Modules, Globus).


Modules

TACC continually updates application packages, compilers, communications libraries, tools, and math libraries. To facilitate this task and to provide a uniform mechanism for accessing different revisions of software, TACC uses the modules utility.

At login, modules commands set up a basic environment for the default compilers, tools, and libraries. For example: the $PATH, $MANPATH, $LIBPATH environment variables, directory locations (e.g., $WORK, $HOME), aliases (e.g., cdw, cdh) and license paths are set by the login modules. Therefore, there is no need for you to set them or update them when updates are made to system and application software.

Users that require 3rd party applications, special libraries, and tools for their projects can quickly tailor their environment with only the applications and tools they need. Using modules to define a specific application environment allows you to keep your environment free from the clutter of all the application environments you don't need.

The environment for executing each major TACC application can be set with a module command. The specifics are defined in a modulefile, which sets, unsets, appends to, or prepends to environment variables (e.g., $PATH, $LD_LIBRARY_PATH, $INCLUDE_PATH, $MANPATH) for the specific application. Each modulefile also sets functions or aliases for use with the application. You only need to invoke a single command to configure the application/programming environment properly. The general format of this command is:

module load modulename

where modulename is the name of the module to load. If you often need a specific application, see Controlling Modules Loaded at Login below for details.

Most of the package directories are in /opt/apps/ ($APPS) and are named after the package. In each package directory there are subdirectories that contain the specific versions of the package.

As an example, the fftw3 package requires several environment variables that point to its home, libraries, include files, and documentation. These can be set in your shell environment by loading the fftw3 module:

login1$ module load fftw3

To look at a synopsis about using an application in the module's environment (in this case, fftw3), or to see a list of currently loaded modules, execute the following commands:

login1$ module help fftw3
login1$ module list

Available Modules

TACC's module system is organized hierarchically to prevent users from loading software that will not function properly with the currently loaded compiler/MPI environment (configuration). Two methods exist for viewing the availability of modules: Looking at modules available with the currently loaded compiler/MPI, and looking at all of the modules installed on the system.

To see a list of modules available to the user with the current compiler/MPI configuration, users can execute the following command:

login1$ module avail

This will allow the user to see which software packages are available with the current compiler/MPI configuration (e.g., Intel 13 with MVAPICH2).

To see a list of modules available to the user with any compiler/MPI configuration, users can execute the following command:

login1$ module spider

This command will display all available packages on the system. To get specific information about a particular package, including the possible compiler/MPI configurations for that package, execute the following command:

login1$ module spider modulename

Software upgrades and adding modules

During upgrades, new modulefiles are created to reflect the changes made to the environment variables. TACC will generally announce upgrades and module changes in advance.

Controlling Modules Loaded at Login

Each user's computing environment is initially loaded with a default set of modules. This module set may be customized at any time. During login startup, the following command is run:

login1$ module restore

This command loads the user's personal set of modules (if it exists) or the system default. To create a personal collection, load the modules you want, unload the ones you don't, and then run:

login1$ module save

This saves the collection as the personal default loaded every time you log in. It is also possible to have named collections; run "module help" for more details.

There is a second method for controlling the modules loaded at login. The ".modules" file is sourced by the startup scripts at TACC and is read after the "module restore" command. This file can contain any list of module commands required. You can also place module commands in shell scripts and batch scripts. We do not recommend putting module commands in personal startup files (.bashrc, .cshrc), however; doing so can cause subtle problems with your environment on compute nodes.
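For example, a ~/.modules file might contain the following. fftw3 is discussed above; the other module names are hypothetical, so check "module avail" for what is actually installed:

```shell
# ~/.modules -- sourced at login, after "module restore".
module load fftw3              # add FFTW3 paths to the environment
module load gsl                # hypothetical: load GSL if installed
module swap intel intel/13.0   # hypothetical: pin a compiler version
```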

Transferring Files to and from Stampede

Stampede supports multiple file transfer mechanisms: common command-line utilities such as scp, sftp, and rsync, and services such as Globus Connect and Globus' globus-url-copy utility. XSEDE users can take advantage of Globus Connect and globus-url-copy, which may achieve higher performance than scp and rsync when transferring large files between other XSEDE systems, TACC clusters, and TACC storage systems (Corral & Ranch). During production, scp speeds between Stampede and Ranch average about 30MB/s, while globus-url-copy speeds are about 125MB/s; these values vary with I/O and network traffic. The following transfer methods are presented in order of ease of use.

  1. Globus Connect
  2. Linux command-line utilities scp & rsync
  3. Globus' globus-url-copy command-line utility
  4. GSI-OpenSSH

Globus Connect

Globus Connect is recommended for transferring data between XSEDE sites. Globus Connect provides fast, secure transport via an easy-to-use web interface using pre-defined and user-created "endpoints". XSEDE users automatically have access to Globus Connect via their XUP username/password. Other users may sign up for a free Globus Connect Personal account.

Linux Command-line Data Transfer Utilities


Data transfer from any Linux system can be accomplished using the scp utility to copy data to and from the login node. A file can be copied from your local system to the remote server by using the command:

localhost% scp filename taccuserid@stampede.tacc.utexas.edu:/path/to/destination

Consult the man pages for more information on scp.

login1$ man scp


The rsync command is another way to keep your data up to date. In contrast to scp, rsync transfers only the actual changed parts of a file (instead of transferring an entire file). Hence, this selective method of data transfer can be much more efficient than scp. The following example demonstrates usage of the rsync command for transferring a file named "myfile.c" from the current location on Stampede to Lonestar's $WORK directory.

login1$ rsync myfile.c taccuserid@lonestar.tacc.utexas.edu:/path/to/work

An entire directory can be transferred from source to destination by using rsync as well. For directory transfers the options "-avtr" will transfer the files recursively ("-r" option) along with the modification times ("-t" option) and in archive mode ("-a" option), which preserves symbolic links, devices, attributes, permissions, ownership, etc. The "-v" option (verbose) increases the amount of information displayed during the transfer. The following example demonstrates the use of the "-avtr" options for transferring a directory named "gauss" from the present working directory on Stampede to a directory named "data" in the $WORK file system on Lonestar.

login1$ rsync -avtr ./gauss taccuserid@lonestar.tacc.utexas.edu:/path/to/work/data

For more rsync options and command details, run the command "rsync -h" or:

login1$ man rsync

When executing multiple instantiations of scp or rsync, please limit your transfers to no more than 2-3 processes at a time.
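One way to honor that limit when scripting many transfers is to cap the number of background jobs, as in this Bash sketch. Here do_transfer is a placeholder standing in for your actual scp or rsync invocation:

```shell
#!/bin/bash
# Run many transfers, keeping at most 3 running at any moment.
do_transfer() { sleep 1; }   # placeholder: substitute your scp/rsync here

MAX_JOBS=3
for f in file1 file2 file3 file4 file5; do
    # If MAX_JOBS transfers are already running, wait for a slot to open.
    while [ "$(jobs -rp | wc -l)" -ge "$MAX_JOBS" ]; do
        sleep 1
    done
    do_transfer "$f" &       # launch the next transfer in the background
done
wait                         # block until every transfer has finished
echo "all transfers complete"
```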


globus-url-copy

XSEDE users may also use Globus' globus-url-copy command-line utility to transfer data between XSEDE sites. globus-url-copy, like Globus Connect described above, is an implementation of the GridFTP protocol, providing high-speed transport between GridFTP servers at XSEDE sites. The GridFTP servers mount the specific file systems of the target machine, thereby providing access to your files or directories.

This command requires an XSEDE certificate to create a proxy for passwordless transfers. Use the "myproxy-logon" command with your XSEDE User Portal (XUP) username and password to obtain a proxy certificate. The proxy is valid for 12 hours for all logins on the local machine. On Stampede, the myproxy-logon command is located in the CTSSV4 module (not loaded by default).

login1$ module load CTSSV4
login1$ myproxy-logon -T -l XUP_username

Each globus-url-copy invocation must include the name of the server and a full path to the file. The general syntax looks like:

globus-url-copy [options] source_url destination_url

where each XSEDE URL will generally be formatted as follows (2811 is the standard GridFTP port; the bracketed fields are placeholders for a real server name and path):

    gsiftp://<gridftp_server>:2811/<full_path_to_file>

Users may look up XSEDE GridFTP servers on the Data Transfers & Management page.

Note that globus-url-copy supports multiple protocols (e.g., HTTP, FTP) in addition to the GridFTP protocol. Please consult the Globus documentation for more information.

globus-url-copy Examples

The following command copies "directory1" from TACC's Stampede to Georgia Tech's Keeneland system, renaming it to "directory2". Note that when transferring directories, the directory path must end with a slash ("/"):

login1$ globus-url-copy -r -vb \
    gsiftp://<stampede_gridftp_server>:2811`pwd`/directory1/ \
    gsiftp://<keeneland_gridftp_server>:2811/<path_to>/directory2/

The following command copies a single file, "file1" from TACC's Stampede to "file2" on SDSC's Trestles:

login1$ globus-url-copy -tcp-bs 11M -vb \
    gsiftp://<stampede_gridftp_server>:2811`pwd`/file1 \
    gsiftp://<trestles_gridftp_server>:2811/<path_to>/file2

Use the buffer-size option, "-tcp-bs 11M", to explicitly set the FTP data channel buffer size; otherwise, the transfer may be about 20 times slower! Consult the Globus documentation to select the optimum value: "How do I choose a value for the TCP buffer size (-tcp-bs) option?"

Advanced users may employ the "-stripe" option, which enables striped transfers on supported servers. Stampede's GridFTP servers each have a 10GbE interface adapter and are configured for a 4-way stripe, since most deployed 10GbE interfaces are performance-limited by host PCI-X busses to ~6Gb/s.


GSI-OpenSSH

Additional command-line transfer utilities supporting standard ssh and grid authentication are offered by the Globus GSI-OpenSSH implementation of OpenSSH. The gsissh, gsiscp and gsisftp commands are analogous to the OpenSSH ssh, scp and sftp commands. Grid authentication is provided to XSEDE users by first executing the myproxy-logon command (see above).

Users who need to transfer large amounts of data to Stampede may find it worthwhile to disable gsiscp's default data stream encryption. To do so, add the following three options:

  • -oTcpRcvBufPoll=yes
  • -oNoneEnabled=yes
  • -oNoneSwitch=yes

to your command-line invocation. Note that not all machines support these options. You must explicitly connect to port 2222 on Stampede. The following command copies "file1" on your local machine to Stampede, renaming it to "file2".

localhost$ gsiscp -oTcpRcvBufPoll=yes -oNoneEnabled=yes -oNoneSwitch=yes \  
    -P2222 file1 stampede:file2

Please consult Globus' GSI-OpenSSH User's Guide for further info.

Sophisticated users may wish to apply HPN-SSH patches to their own local OpenSSH installations; see the HPN-SSH project for more information.

Application Development

Programming Models

There are two primary memory models for computing: distributed-memory and shared-memory. A third model, hybrid programming, combines the two. In the distributed-memory model, the Message Passing Interface (MPI) is employed in programs to communicate between processors, each using its own memory address space. In the shared-memory model, threads (lightweight processes) are used to access memory in a common address space. HPC applications often use OpenMP for threading, although other methods such as Intel Threading Building Blocks (TBB), Cilk, and POSIX threads are viable alternatives. The majority of scientific codes that employ threading use the OpenMP paradigm because it is portable (understood by most compilers) and supported on HPC systems. Hence we will emphasize the OpenMP threading paradigm when discussing the shared-memory model.

Distributed-memory Model

For distributed memory systems, single-program multiple-data (SPMD) and multiple-program multiple-data (MPMD) programming paradigms are used.

In the SPMD paradigm, the same program image (e.g., an a.out executable) is loaded onto cores as processes with their own private memory and address space. Each process executes and performs the same operations on different data in its own address space. This is the usual mechanism for MPI code: a single executable is available to all nodes (through a globally accessible file system such as $SCRATCH, $WORK, or $HOME) and launched on cores of each node through the TACC batch MPI launch command (e.g., "ibrun ./a.out").
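A SLURM batch script for such an SPMD run might look like the following sketch. The job name, queue name, node/task counts, time limit, and executable are illustrative placeholders, not defaults; check the queue tables for real limits:

```shell
#!/bin/bash
#SBATCH -J myMPIjob          # job name (placeholder)
#SBATCH -o myMPIjob.%j.out   # stdout file (%j expands to the job id)
#SBATCH -N 2                 # number of nodes (16 E5 cores each)
#SBATCH -n 32                # total number of MPI tasks
#SBATCH -p normal            # queue name (illustrative)
#SBATCH -t 01:00:00          # wall-clock limit, hh:mm:ss

# Launch one a.out process per task on the allocated cores.
ibrun ./a.out
```

Submitted with sbatch, ibrun derives the task count and placement from the SLURM allocation, so no machinefile is needed.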

In the MPMD paradigm at least two different program images are launched. This paradigm has been used in a few multi-physics applications that launch 2 or more different codes that communicate through MPI, and more often now multiple programs are interfaced to each other in workflows.

Two specialty cases of SPMD computing are Parametric Sweeps (PS) and heterogeneous (architecture) computing. Parametric sweeps investigate the parameter space of model simulations by simultaneously launching tens or hundreds of independent executions that work on different data without any (MPI) communication. PS executables are launched through a special launcher job script designed to run independent executables with appropriate inputs from a parameter list file. Details for launching parameter sweep jobs are described in the launcher module:

login1$ module help launcher
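For illustration, a launcher parameter list file is simply a set of independent commands, one per line; the file contents below are hypothetical (invented input names, not from the launcher documentation):

```shell
# Hypothetical paramlist file: each line is an independent serial run
./a.out input_001.dat
./a.out input_002.dat
./a.out input_003.dat
```

Each line is assigned to a core by the launcher script; because the runs are independent, no MPI communication occurs between them.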

In SPMD heterogeneous computing the same program executes on different architectures. On Stampede nodes, MPI executables can be launched on the E5 processors as well as the Phi (MIC) coprocessors. Unlike GPU accelerators, this is possible because the MICs run a micro OS (mOS) for launching processes, and also have an MPI interface to the Host Channel Adapter (HCA) on the node for external communication (see Figure 1.2). Intel refers to this computing paradigm as "symmetric" computing to emphasize that the Phi coprocessor component of a node can be used separately as a stand-alone SMP system for executing MPI codes. On Stampede nodes, MPI applications can be launched solely on the E5 processors, solely on the Phi coprocessors, or on both in a "symmetric" heterogeneous computing mode. For heterogeneous computing, an application is compiled for each architecture, and the MPI launcher ("ibrun" at TACC) is modified to launch the executables on the appropriate processors according to the resource specification for each platform (number of tasks on the E5 component and the Phi component of a node).

Shared-memory Model

Shared-memory programming paradigms that employ threading can only be executed on a set of cores where the threads can access the same (shared) memory. Applications spawn threads on the cores to work on tasks in parallel and access the same memory.

Each Stampede node, though, has a Phi coprocessor that is effectively a stand-alone processor with its own memory space. An OpenMP application can run solely on the E5 processors (host), or solely on the Phi coprocessors (native), or on both.

On a single Stampede node, applications can be compiled to run only on the (host) cores of the E5 processors within its 32GB of shared memory. Likewise, an application can be compiled to run only (natively) on the MIC Phi coprocessor within its 8GB of shared memory. Users can take advantage of both the E5 and Phi processors by executing an OpenMP threaded application on the E5s, and offloading parallel work (usually as OpenMP work sharing constructs) to the MIC. Another use case for threading can be found in MPI Hybrid codes. An MPI task running on an E5 or Phi coprocessor can spawn threads on the local cores that share the process memory of the MPI task that forked the threads.

Common OpenMP (Shared Memory) Use Cases

  • Host only: run only on the E5
  • MIC: run natively on the Phi
  • Offload: run OpenMP on the E5 and on the Phi
  • MPI Hybrid
    • Symmetric: launch MPI tasks on the E5 and the Phi
    • Offload: launch MPI tasks on the E5 and offload OpenMP code to the Phi

See the pertinent compilation options and combinations in Table 5.2, and the section on Symmetric Computing.

Hybrid Model

In multi-core cluster systems with a small core count per node, applications are run in pure MPI mode in which all cores within the cluster have their own private (distributed) memory. For an MPI application this is accomplished by executing multiple a.out's on a node, with each process running in a separate address space using memory accessible only to the process. In this way, all cores appear as a set of distributed memory machines, even though each node has cores that share a single memory subsystem.

Within the last few years more HPC MPI applications (distributed memory) have been employing on-node threading with OpenMP (a shared memory paradigm) to minimize the number of MPI communicating tasks and to increase the memory per MPI task.

As HPC systems evolve from multi- to many-core systems we expect to see even more applications migrate to this hybrid paradigm (employing MPI tasks to manage memory access across nodes, reducing the number of MPI tasks per node, and employing OpenMP to use several to many threads per MPI task).

In the shared-memory parallel programming (SMPP) paradigm, either compiler directives (pragmas in C, special comments in Fortran) or explicit threading calls (e.g., with Pthreads or Cilk) are employed. The majority of science codes now use OpenMP directives, which are understood by most vendor compilers as well as the GNU compilers.

In clusters with SMPs, hybrid programming is sometimes employed to take advantage of higher performance at the node-level for certain algorithms that use SMPP (OMP) parallel coding techniques. In hybrid programming, OMP code is executed on the node as a single process with multiple threads (or an OMP library routine is called), while MPI programming is used at the cluster-level for exchanging data between the distributed memories of the nodes.


The Stampede system's default programming environment uses the Intel C++ and Fortran compilers. These are the only compilers able to build programs for the Phi coprocessors. This section explains and illustrates basic uses of these compilers for all the programming models described above (OpenMP/MPI on the E5/Phi systems in native, offload, and heterogeneous modes). The Intel compiler commands can be used both for compiling (making ".o" object files) and for linking (making an executable from ".o" object files).

The Intel Compiler Suite

The Intel Fortran and C++ Composer XE 2013 suite (version 13.0) is loaded as the default compiler at login. TACC staff recommend using the Intel compilers whenever possible. The Intel suite has been installed with 64-bit standard libraries and compiles programs as 64-bit applications by default. Since the E5s and Phi coprocessors are new architectures that rely on optimizations in the new 2013 compiler, any program compiled for another Intel system should be recompiled.

The Intel Fortran, C, and C++ compiler commands are "ifort", "icc", and "icpc", respectively. Use the "-help" option with any of these commands to display a list and explanation of all the compiler options, useful during debugging and optimization. Release notes for the 2013 compilers, as well as the user and reference guides, are available on the Intel web site.

Basic Compiler Commands and Serial Program Compiling

Appropriate file name extensions are required for each compiler. By default, the executable filename is "a.out", but it may be renamed with the "-o" option. We use "a.out" throughout this guide to designate a generic executable file. The compiler command performs two operations: it makes a compiled object file (having a .o suffix) for each file listed on the command-line, and then combines them with system library files in a link step to create an executable. To compile without the link step, use the "-c" option.

The same code can be compiled to run either natively on the host or natively on the MIC. Use the same compiler commands for the host (E5) or the MIC (Phi) compiling, but include the "-mmic" option to create a MIC executable. We suggest you name MIC executables with a ".mic" suffix.

Table 5.1 Compiling serial programs

Language Compiler File Extension Example
C icc .c icc compiler_options prog.c
C++ icpc .C, .cc, .cpp, .cxx icpc compiler_options prog.cpp
F77 ifort .f, .for, .ftn ifort compiler_options prog.f
F90 ifort .f90, .fpp ifort compiler_options prog.f90

Table 5.2 Host, MIC and Host+MIC offload compilations

Mode            Required Options  Notes
Host (E5)       none              Use -xhost to generate AVX (Advanced Vector Extensions) instructions.
Phi (MIC)       -mmic             Suggestion: name executables with a ".mic" suffix to differentiate them from E5 executables, e.g., "ifort -mmic prog.f90 -o prog.mic"
Host + Offload  none              For automatic offloading of MKL library functions, use environment variables; for direct offloading, use pragmas.

The following examples illustrate how to rename an executable (-o option), compile for the host (run on the E5 processors), and compile for the MIC (run natively on the MIC):

A C program example:

  login1$ icc   -xhost -O2 -o flamec.exe     prog.c
  login1$ icc   -mmic  -O2 -o flamec.exe.mic prog.c

A Fortran program example:

  login1$ ifort -xhost -O2 -o flamef.exe     prog.f90
  login1$ ifort -mmic  -O2 -o flamef.exe.mic prog.f90

Commonly used options may be placed in an icc.cfg or ifc.cfg file for compiling C and Fortran code, respectively.

For additional information, execute the compiler command with the "-help" option to display every compiler option, its syntax, and a brief explanation, or display the corresponding man page, as follows:

  login1$ icc   -help
  login1$ icpc  -help
  login1$ ifort -help
  login1$ man icc
  login1$ man icpc
  login1$ man ifort

Some of the more important options are listed in the Basic Optimization section of this guide.

Compiling OpenMP Programs

Since each Stampede node has many cores (16 E5 cores and 61 Phi cores), applications can take advantage of multi-core shared-memory parallelism through threading paradigms such as OpenMP. For applications with OpenMP parallel directives, include the "-openmp" option on the compiler command line to enable parallel thread generation. Use the "-openmp_report" option to display diagnostic information.

Table 5.3 Important OpenMP compiler options.

Compiler Options
-openmp Enables the parallelizer to generate multi-threaded code based on the OpenMP directives.
Use whenever OpenMP pragmas are present in code for the E5 processor or Phi coprocessor.
-openmp_report[0|1|2] Controls the OpenMP parallelizer diagnostic level

Below are host and MIC compile examples for enabling OpenMP code directives.

  login1$ icc   -xhost -openmp -O2 -o flamec.exe      prog.c
  login1$ icc   -mmic  -openmp -O2 -o flamec.exe.mic  prog.c
  login1$ ifort -xhost -openmp -O2 -o flamef.exe      prog.f90
  login1$ ifort -mmic  -openmp -O2 -o flamef.exe.mic  prog.f90

The Intel compiler accepts OpenMP pragmas and OpenMP API calls that adhere to the OpenMP 3.1 standard. The $KMP_AFFINITY and OpenMP environment variables that set thread affinity and thread control are described in the "running code" section below.

Compiling MPI Programs

Stampede supports two versions of MPI: MVAPICH2, an open-source MPI library from the Network-Based Computing Laboratory (NBCL) at Ohio State University, and the proprietary Intel® MPI Library 4.1. Both libraries provide MPI compiler commands that invoke an appropriate compiler with the appropriate MPI include-file and library-file path options. At login, the MPI MVAPICH2 (mvapich2) and Intel compiler (intel) modules are loaded to produce the default environment, which places the mvapich2 compiler commands (mpixxx) in the user execution path ($PATH). To use Intel MPI, swap in the impi module ("module swap mvapich2 impi"). Use the mpixxx compiler commands in Table 5.4 for both the MVAPICH2 and Intel MPI libraries. Communication performance may differ between the two, and run-time behavior is managed by different sets of environment variables, documented in the MVAPICH2 and Intel MPI user guides and reference manuals.

The MPI compiler commands for MVAPICH2 and Intel libraries are listed for each language in the table below:

Table 5.4 MPI Compiler Commands

Compiler  MPI              Language  File Extension                      Example
mpicc     mvapich2, intel  C         .c                                  mpicc <compiler_options> myprog.c
mpicxx    mvapich2, intel  C++       .C, .cc, .cpp, .cxx, .c++, .i, .ii  mpicxx <compiler_options> myprog.cpp
mpif77    mvapich2, intel  F77       .f, .for, .ftn                      mpif77 <compiler_options> myprog.f
mpif90    mvapich2, intel  F90       .f90, .fpp                          mpif90 <compiler_options> myprog.f90

Appropriate file name extensions are required for each wrapper. By default, the executable is named "a.out"; you may rename it using the "-o" option. To compile without the link step, use the "-c" option. The following examples illustrate renaming an executable and the use of two important compiler optimization options.

MVAPICH2 Compile examples

login1$ mpicc  -xhost -O2 -o simulate.exe mpi_prog.c
login1$ mpif90 -xhost -O2 -o simulate.exe mpi_prog.f90

Include linker options, such as library paths and library names, after the program module names, as explained in the Libraries section below. The Running Applications section explains how to execute MPI executables in batch scripts and interactive batch runs on compute nodes. Note: use the impi or mvapich2-mic modules for running MPI applications on the MIC. The regular mvapich2 modules do not support MIC native applications.

Compiling with gcc

We recommend that you use the Intel compiler for optimal code performance. TACC does not support the use of the gcc compiler for production codes on the E5 processors, and there is no information from Intel about combining Intel and gcc binaries for the Phi coprocessor. For those rare cases when gcc is required, for either a module or the main program, you can direct the MPI compiler wrapper to use gcc with the "-cc=gcc" option for the modules that require it. (Since gcc- and Intel-compiled codes are binary compatible, you should compile all other modules that don't require gcc with the Intel compiler.) When gcc is used to compile the main program, an additional Intel library is required. The examples below show how to invoke the gcc compiler for these two cases:

gcc Compile Examples
  • Create object file suba.o with gcc (note: gcc does not accept the Intel-specific "-xhost" flag)
    login1$ mpicc -O3 -c -cc=gcc suba.c
  • Create a.out: compile main with icc; load in suba.o
    login1$ mpicc -O3 -xhost mymain.c suba.o
  • Create object file suba.o with icc
    login1$ mpicc -O3 -xhost -c suba.c
  • Create a.out: compile main with gcc; load in suba.o
    login1$ mpicc -O3 -cc=gcc -L$ICC_LIB -lirc mymain.c suba.o

Compiling for Phi Offloading

During the execution of an application, work can be redirected to execute on the Phi coprocessor. There are two ways to offload work onto the MIC: automatic offloading and compiler-assisted offloading. Automatic offloading of a growing set of Intel MKL (Math Kernel Library) routines can be invoked simply by setting environment variables prior to execution. Compiler-assisted offloading requires the application developer to insert directives within the application code. Even in this explicit form, synchronous offloading can automatically determine and move data between the host and MIC. Direct control of data allocation, data transfers, and asynchronous offloading allows developers to optimize offloading for the spatial and temporal data requirements of their application. The Offload Coding section explains and illustrates the directives used in compiler-assisted offloading.

Automatic Offloading (AO)

Certain routines in the MKL library have been redesigned to run on the host, run as an offloaded routine on the MIC, or run on both with the workload split between the two. Users can control the mode at run time with environment variables. (There is also an API for developers to code the control directly inside the application.) In the example below the "has_dgemm" program is compiled and the "dgemm" routine is loaded from the MKL library in the usual way. The following lines show how the offload is controlled (in a batch job script or an interactive session). The $OMP_NUM_THREADS and $MIC_OMP_NUM_THREADS variables set the number of threads on the host and the MIC, respectively.

Compile and run example for AO :

  login1$ icc   -O3 -xhost  -mkl has_dgemm.c
  login1$ ifort -O3 -xhost  -mkl has_dgemm.f90
  c401-001$ export MKL_MIC_ENABLE=1
  c401-001$ export OMP_NUM_THREADS=16
  c401-001$ ./a.out

Compiler-assisted Offloading

The Intel C/C++ and Fortran 2013 compilers support directives that offload regions of work to a Phi coprocessor. When the execution of an offload-enabled application on the host encounters an offload region, the data and instructions are sent to a Phi coprocessor for execution and the host waits for completion. If a Phi coprocessor is not available, the application will run an E5-compiled version of the region on the host.

Specific compiler and loader options for offloaded code can be set as a comma-separated list in a string within the "-offload-option" compiler command option (see Table 5.6):

  -offload-option,mic,<tool>,"<opts>"

where <tool> is one of compiler, ld, or as.
In the following example the host code is compiled with "-O3", and the offloaded code is compiled with "-O2 -fma" and linked with "-g".

login1$ icc -O3 -offload-option,mic,compiler,"-O2 -fma" \
            -offload-option,mic,ld,"-g" prog.c

Compiler Optimization Options

Compiler options must be used to achieve optimal performance of any application. Generally, the highest impact can be achieved by selecting an appropriate optimization level, by targeting the architecture of the computer (CPU, cache, memory system), and by allowing for inter-procedural analysis (inlining, etc.). There is no set of options that gives the highest speed-up for all applications. Consequently, different combinations have to be explored.

Levels of optimization are set with the "-On" option, detailed in Table 5.5 below.

Table 5.5 Compiler optimization levels

Level -On Description
n = 0 Fast compilation, full debugging support. Automatically enabled if using -g.
n = 1,2 Low to moderate optimization, partial debugging support:
  • instruction rescheduling
  • copy propagation
  • software pipelining
  • common subexpression elimination
  • prefetching, loop transformations
n = 3 Aggressive optimization - compile time/space intensive and/or of marginal effectiveness; may change code semantics and results (sometimes it even breaks code!):
  • enables -O2
  • more aggressive prefetching, loop transformations

Table 5.6 lists some of the more important compiler options that affect application performance, based on the target architecture, application behavior, loading, and debugging.

Table 5.6 Compiler Options Affecting Performance

Option Description
-assume buffered_io Use this to ensure buffered I/O from your Fortran executables (recommended).
-c For compilation of source file only.
-fpe0 Enable floating point exceptions. Useful for debugging.
-fp-model Enable floating-point model variation:
  • [no-]except : enable/disable floating-point exception semantics
  • fast[=1|2] : enables more aggressive floating-point optimizations
  • precise : allows value-safe optimizations
  • source : allows intermediates in source precision
  • strict : enables fp-model precise, fp-model except, disables contractions and enables pragma stdc fenv_access
  • double : rounds intermediates in 53-bit (double) precision
  • extended : rounds intermediates in 64-bit (extended) precision
-fast combination of aggressive optimization options, implies -static and -no-prec-div (not recommended)
-g Debugging information, generates symbol table (also sets -fp, which disables using EBP as a general-purpose register).
-help lists options.
-ip Enable single-file inter-procedural (IP) optimizations (within files).
-ipo Enable multi-file IP optimizations (between files).
-no-offload Disable any offload usage
-O3 Aggressive optimization (-O2 is default).
-offload-option,mic,<tool>,"<opts>" Appends additional options to the defaults for the tool (compiler, ld, or as). Opts is a comma-separated list of options.
-opt-prefetch Enables data prefetching.
-opt-streaming-stores <arg> Specifies whether streaming stores are generated:
  • Always : Enable streaming stores under the assumption that the application is memory bound
  • Auto : [DEFAULT] Compiler decides when streaming stores are used
  • Never : Disable generation of streaming stores
-openmp Enable the parallelizer to generate multi-threaded code based on the OpenMP directives.
-openmp-report[0|1|2] Controls the OpenMP parallelizer diagnostic level.
-vec_report[0|...|6] controls amount of vectorizer diagnostic information.
-xhost includes specialized code for AVX instruction set.

Basic Optimization for Serial and Parallel Programming using OpenMP and MPI

The MPI compiler wrappers use the same compilers that are invoked for serial code compilation. So, any of the compiler flags used with the icc command can also be used with mpicc; likewise for ifort and mpif90; and icpc and mpicxx.

Developers often experiment with the following options: "-pad", "-align", "-ip", "-no-rec-div" and "-no-rec-sqrt". In some codes performance may decrease. Please see the Intel compiler manual for a full description of each option.

Use the "-help" option with the mpi commands for additional information:

login1$ mpicc  -help
login1$ mpicxx -help
login1$ mpif90 -help
login1$ mpirun -help

Use the options listed for mpirun with the ibrun command in your job script. For details on the MPI standard, see the MPI Forum web site.


Some of the more useful load flags/options for the host environment are listed below. For a more comprehensive list, consult the ld man page.

  • Use the "-l" loader option to link in a library at load time:

      ifort prog.f90 -l<name>

    This links in either the shared library libname.so (default) or the static library libname.a, provided the library can be found in the loader's library search path or in the $LD_LIBRARY_PATH directories.

  • To explicitly include a library directory, use the "-L" option:

      ifort prog.f -L/mydirectory/lib -l<name>

    In the above example, the user's libname.a library is not in the default search path, so the "-L" option is specified to point to the directory containing libname.a. (Only the library name is supplied in the "-l" argument; remove the "lib" prefix and the ".a" suffix.)

Many of the modules for applications and libraries, such as the hdf5 library module, provide environment variables for compiling and linking. Execute the "module help module_name" command for a description, listing, and use cases of the assigned environment variables. The following example illustrates their use for the hdf5 library:

  login1$ icc -I$TACC_HDF5_INC hdf5_test.c -o hdf5_test \
                  -Wl,-rpath,$TACC_HDF5_LIB -L$TACC_HDF5_LIB -lhdf5 -lz

Here, the module-supplied environment variables $TACC_HDF5_LIB and $TACC_HDF5_INC contain the hdf5 library and header file directory paths, respectively. The loader option "-Wl,-rpath" embeds the $TACC_HDF5_LIB directory path in the binary executable. This allows the run-time dynamic loader to determine the location of shared libraries directly from the executable, instead of from $LD_LIBRARY_PATH or the ld.so cache of bindings between shared libraries and directory paths, and so avoids having to set $LD_LIBRARY_PATH (manually or through a module command) before running the executable. (This simple load sequence will work for some of the unthreaded MKL functions; see the MKL Library section for using the various packages within MKL.)

You can view the full path of the dynamic libraries inserted into your binary with the ldd command. The example below shows a partial listing for the hdf5_test binary:

  login1$ ldd hdf5_test
    ... => /opt/apps/intel13/hdf5/1.8.9/lib/... (0x00002a...)
    ... => /lib64/... (0x00000034a7a00000)
    ... => /opt/apps/intel13/hdf5/1.8.9/lib/... (0x00002abdc2...)
    ... => /lib64/tls/...
    ... => /lib64/...

Performance Libraries

ISPs (Independent Software Providers) and HPC vendors provide high performance math libraries that are tuned for specific architectures. Many applications depend on these libraries for optimal performance. Intel has developed performance libraries for most of the common math functions and routines (linear algebra, transformations, transcendental, sorting, etc.) for the EM64T architectures. Details of the Intel libraries and specific loader/linker options are given below.

MKL library

The Math Kernel Library consists of functions with Fortran, C, and C++ interfaces for the following computational areas:

  • BLAS (vector-vector, matrix-vector, matrix-matrix operations) and extended BLAS for sparse computations
  • LAPACK for linear algebraic equation solvers and eigen system analysis
  • Fast Fourier Transforms
  • Transcendental Functions

The MKL library and associated environment variables, $MKLROOT and $TACC_MKL_xxx, are loaded by default with the Intel compiler upon login to Stampede. If you switch compilers, for example to gcc as below, then you must reload the mkl module to gain access to the MKL environment:

login1$ module swap intel gcc

Due to MODULEPATH changes the following have been reloaded:
1) mvapich2/1.9a2

login1$ module load mkl

In addition, MKL offers a set of functions collectively known as VML, the Vector Math Library. VML is a set of vectorized transcendental functions that offer both high performance and excellent accuracy compared to the libm functions (for most of the Intel architectures). The vectorized functions are considerably faster than the standard library functions for vectors longer than a few elements.

Below are example commands for compiling and linking a program that contains calls to BLAS functions (in MKL). Note that the library is for use on a single node; hence it can be used by either the serial compilers or the MPI wrapper scripts.

The following C and Fortran examples illustrate the use for the MKL library with the Intel compilers, dynamic libraries and in sequential (non-threaded) mode:

  login1$ icc   myprog.c   -mkl=sequential
  login1$ ifort myprog.f90 -mkl=sequential

Static linking is more involved. Note that the $MKLROOT environment variable is defined when the Intel compiler module is loaded, and that we have added the -I$MKLROOT/include directory to the compiler flags.

  login1$ icc   -I$MKLROOT/include  myprog.c    -Wl,--start-group \
      $MKLROOT/lib/intel64/libmkl_intel_lp64.a \
      $MKLROOT/lib/intel64/libmkl_sequential.a \
      $MKLROOT/lib/intel64/libmkl_core.a -Wl,--end-group -lpthread -lm
  login1$ ifort -I$MKLROOT/include myprog.f90 -Wl,--start-group \
      $MKLROOT/lib/intel64/libmkl_intel_lp64.a \
      $MKLROOT/lib/intel64/libmkl_sequential.a \
      $MKLROOT/lib/intel64/libmkl_core.a -Wl,--end-group -lpthread -lm

To use threaded libraries with dynamic library linking:

  login1$ icc   -openmp -I$MKLROOT/include   myprog.c   -mkl=parallel
  login1$ ifort -openmp -I$MKLROOT/include   myprog.f90 -mkl=parallel

and threaded libraries with static library linking:

  login1$ icc -openmp -I$MKLROOT/include myprog.c -mkl=parallel -Wl,--start-group \
      $MKLROOT/lib/intel64/libmkl_intel_lp64.a \
      $MKLROOT/lib/intel64/libmkl_intel_thread.a \
      $MKLROOT/lib/intel64/libmkl_core.a -Wl,--end-group -lpthread -lm
  login1$ ifort -openmp -I$MKLROOT/include myprog.f90 -mkl=parallel -Wl,--start-group \
      $MKLROOT/lib/intel64/libmkl_intel_lp64.a \
      $MKLROOT/lib/intel64/libmkl_intel_thread.a \
      $MKLROOT/lib/intel64/libmkl_core.a -Wl,--end-group -lpthread -lm

Assistance in constructing MKL linker options is provided by the MKL Link Line Advisor utility on Intel's web site. Please refer to the Link Line Advisor for information on how to link against the MKL libraries for host and coprocessor, automatic offload, and compiler-assisted offload.

To use the GNU compilers (gcc/gfortran) with MKL in sequential non-threaded mode, static linking is unchanged; dynamic linking changes as follows:

  # First set MKLROOT; otherwise it will NOT be defined
  login1$ export MKLROOT=/opt/apps/intel/13/composer_xe_2013.1.117/mkl
  login1$ gcc -I$MKLROOT/include myprog.c -L$MKLROOT/lib/intel64 \
              -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm
  login1$ gfortran -I$MKLROOT/include myprog.f90 -L$MKLROOT/lib/intel64 \
              -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm

It is important to remember that the $MKLROOT environment variable is only defined if the intel module is loaded; otherwise it must be defined explicitly. If the code is compiled with OpenMP ("-fopenmp"), there may be problems with symbols from the iomp5 library (the Intel OpenMP library), which is needed for the threaded version of the MKL libraries.


The Intel MKL libraries also include a highly optimized version of ScaLAPACK that can be used from C/C++ and Fortran with the Intel and MVAPICH2 MPI libraries. For dynamic linking with the Intel libraries and threading:

  login1$ mpicc prog.c -I$MKLROOT/include -L$MKLROOT/lib/intel64 \
      -lmkl_scalapack_lp64 \
      -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core \
      -lmkl_blacs_intelmpi_lp64 -liomp5 -lpthread -lm
  login1$ mpif90 prog.f90 -I$MKLROOT/include -L$MKLROOT/lib/intel64 \
      -lmkl_scalapack_lp64 \
      -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core \
      -lmkl_blacs_intelmpi_lp64 -liomp5 -lpthread -lm

Note: the C library list includes the Fortran libfcore.a to satisfy miscellaneous Intel run-time routines; other situations may require the portability or POSIX libraries, libifport.a and libifposix.a.

Static libraries, on the other hand, use a different link line:

  login1$ mpicc prog.c -I$MKLROOT/include \
    $MKLROOT/lib/intel64/libmkl_scalapack_lp64.a \
    -Wl,--start-group  $MKLROOT/lib/intel64/libmkl_intel_lp64.a \
    $MKLROOT/lib/intel64/libmkl_sequential.a $MKLROOT/lib/intel64/libmkl_core.a \
    $MKLROOT/lib/intel64/libmkl_blacs_intelmpi_lp64.a -Wl,--end-group -lpthread -lm
  login1$ mpif90 prog.f90 -I$MKLROOT/include \
      $MKLROOT/lib/intel64/libmkl_scalapack_lp64.a \
      -Wl,--start-group $MKLROOT/lib/intel64/libmkl_intel_lp64.a \
      $MKLROOT/lib/intel64/libmkl_sequential.a $MKLROOT/lib/intel64/libmkl_core.a \
      $MKLROOT/lib/intel64/libmkl_blacs_intelmpi_lp64.a -Wl,--end-group -lpthread -lm

Runtime Environment

Bindings to the most recent shared libraries are configured in the file /etc/ld.so.conf (and cached in the /etc/ld.so.cache file). View /etc/ld.so.conf to see the TACC-configured directories, or execute:

login1$ /sbin/ldconfig -p

To see a list of directories and candidate libraries, echo the $LD_LIBRARY_PATH environment variable, and use the -Wl,-rpath option to override the default runtime bindings.

Environment variables set by particular modules can be viewed with the module show command:

login1$ module show modulefile

Libraries on the Phi Coprocessor

Native Libraries

You can build libraries that run natively on the Phi coprocessor by including the "-mmic" option when invoking the compiler. The $MIC_LD_LIBRARY_PATH environment variable can be used to point to the directory containing a native library. Static native libraries are created with the "ar" archiver, just as on the host. Likewise, dynamic native libraries are created the same way as on the host: add the "-mmic" and "-fPIC" options when compiling the object files, and use the "-shared" flag when creating the shared library:

  login1$ icc -mmic -fPIC -c fun1.c fun2.c
  login1$ ar rcs libsampleMIC.a fun1.o fun2.o
  login1$ icc -mmic -shared -o libsampleMIC.so fun1.o fun2.o

Static Libraries in Offloaded Code

Conceptually, the compiler, linker and standard library environments on the host and coprocessor are thinly separated. Code for the host is compiled within the host environment and offloaded code within the coprocessor environment. The coprocessor environment can include libraries and these would be available to be called from offloaded code with no need to use special syntax or runtime features.

When the coprocessor is available, offloaded code is loaded when the host version of the executable is loaded, or when the first offload is attempted. At this time the libraries linked with the offloaded code are initialized. The loaded target executable remains in the target memory until the host program terminates. Thus, any global state maintained by the library is maintained across offload instances.

Separate copies of libraries are linked or loaded with the host program and the offloaded coprocessor code so that there are two sets of global states: one on the host and one on the coprocessor. The host code only sees the host state and the offloaded coprocessor code only sees the state of the library on the coprocessor.

To create static coprocessor libraries, first create MIC object code (by compiling offload decorated modules with "-c" or compiling non-decorated modules with the "-c" and "-mmic" options). Once the *MIC.o objects (and the CPU *.o objects) have been created, use the "xiar" command with the "-qoffload-build" option to build an archive.

  login1$ xiar -qoffload-build [ar_options] archive [member...]
  login1$ xild -lib -qoffload-build [ar_options] archive [member...]

When supplying the name of the library and the list of its member files to xiar or xild, it is important to only specify the file names associated with the host library and host object files, such as lib.a and file.o. "xiar" and "xild" automatically manipulate the corresponding coprocessor library and member files, libMIC.a and fileMIC.o, respectively.

For example:

login1$ xiar -qoffload-build rcs libsample.a fun1.o fun2.o

This will create the libsample.a library for the host and a libsampleMIC.a library for the coprocessor. The libsample.a will contain the object files fun1.o and fun2.o. The libsampleMIC.a library will contain the object files fun1MIC.o and fun2MIC.o.

Linking against the static library follows the same rules as specifying the object files: the host library should be specified and the compiler will automatically incorporate the coprocessor library. The host library can be specified with the linker options -L and -l<library name> (the lib<library name>MIC library will be included automatically by the compiler).


Debugging with DDT

DDT is a symbolic, parallel debugger that allows graphical debugging of MPI applications. For information on how to perform parallel debugging using DDT on Stampede, please see the DDT Debugging Guide.

Code Tuning

In this section we discuss various methods, tips, and tricks for optimizing your code, increasing efficiency, and making the best use of Stampede's hardware.

Memory Tuning

There are a number of techniques for optimizing application code and tuning the memory hierarchy.

Maximize cache reuse

The following snippets of code illustrate the correct way to access contiguous elements (i.e., stride 1) of a matrix in both C and Fortran. Fortran stores arrays column-major, so the first index should vary fastest; C stores arrays row-major, so the last index should vary fastest.

Fortran example:

      real*8 :: a(m,n), b(m,n), c(m,n)
      ...
      do i=1,n
        do j=1,m
          a(j,i) = b(j,i) + c(j,i)
        end do
      end do

C example:

      double a[m][n], b[m][n], c[m][n];
      ...
      for (i=0; i<m; i++){
        for (j=0; j<n; j++){
          a[i][j] = b[i][j] + c[i][j];
        }
      }

Prefetching is the ability to predict the next cache line to be accessed and start bringing it in from memory. If data is requested far enough in advance, the latency to memory can be hidden. The compiler inserts prefetch instructions into loops: instructions that move data from main memory into cache in advance of its use. Prefetching may also be specified by the user with directives.

Example: In the following dot-product example, the number of streams prefetched is increased from 2, to 4, to 6 for the same computation. However, prefetching a larger number of streams does not necessarily translate into increased performance; beyond a threshold, prefetching more streams can be counterproductive.

2 streams:

      do i=1,n
        dotp = dotp + x(i)*y(i)
      end do

4 streams:

      do i=1,n/2
        dotp1 = dotp1 + x(i)*y(i)
        dotp2 = dotp2 + x(i+n/2)*y(i+n/2)
      end do

6 streams:

      do i=1,n/3
        dotp1 = dotp1 + x(i)*y(i)
        dotp2 = dotp2 + x(i+n/3)*y(i+n/3)
        dotp3 = dotp3 + x(i+2*n/3)*y(i+2*n/3)
      end do
      do i=3*(n/3)+1,n
        dotp = dotp + x(i)*y(i)
      end do

Fit the problem

Make sure to fit the problem size to memory (32GB/node); there is no virtual-memory swap available.

  1. Always minimize stride length. Stride-1 access is optimal on most systems, and in particular on vector systems; if that is not possible, aim for the lowest stride you can. Low-stride access increases cache efficiency and sets up hardware and software prefetching. Strides that are multiples of a high power of two are typically the worst case, leading to cache misses.
  2. Another approach is to reuse data in cache through cache blocking. The idea is to load chunks of data that fit in the different levels of cache while the data is in use. Otherwise the data has to be reloaded from memory each time it is needed because it is no longer in cache, a phenomenon known as a cache miss. Cache misses are costly, since the latency of loading data from memory is a few orders of magnitude higher than from cache. The goal is to keep as much of the data in cache as possible while it is in use and to minimize loading it from memory.
  3. This concept is illustrated in the following matrix-matrix multiply example, where the i, j, k loop indices are blocked so that the largest possible sub-matrices remain in cache while the computation proceeds.

    Example: Matrix multiplication

          Real*8 a(n,n), b(n,n), c(n,n)
          do ii=1,n,nb ! nb is the blocking factor
            do jj=1,n,nb
              do kk=1,n,nb
                do i=ii,min(n,ii+nb-1)
                  do j=jj,min(n,jj+nb-1)
                    do k=kk,min(n,kk+nb-1)
                      c(i,j) = c(i,j) + a(i,k)*b(k,j)
                    end do
                  end do
                end do
              end do
            end do
          end do
  4. Another standard issue is the leading dimension of stored arrays: it is always best to avoid leading dimensions that are a multiple of a high power of two. Users should be particularly aware of the cache line size and associativity; performance degrades when the stride is a multiple of the cache-line size.
  5. Example: Consider an L1 cache that is 16 KB in size and 4-way set associative, with a cache line of 64 bytes.

    Problem: A 16 KB 4-way set-associative cache has 4 ways of 4 KB (4096 bytes) each, so addresses that are a multiple of 4096 bytes apart map to the same set. In the loop below, the leading dimension of "a" is 1024 doubles (8192 bytes), so every access to a(1,i) lands in the same set. This effectively reduces L1 from 256 cache lines to only 4: a 256-byte cache instead of the original 16 KB, due to the non-optimal choice of leading dimension.

          Real*8 :: a(1024,50)
          do i=1,n
            sum = sum + a(1,i)  ! successive accesses are 8192 bytes apart
          end do

    Solution: Change the leading dimension to 1028 (1024 + 4 doubles, i.e., half a cache line), so successive accesses no longer map to a single set.

  6. Encourage Data Prefetching to Hide Memory Latency
  7. Work within available physical memory

Floating-Point Tuning

Unroll Inner Loops to Hide FP Latency

The following dot-product example illustrates two points. First, if the inner loop trip count is small, loop overhead makes it better to unroll the inner loop. Second, unrolling into independent partial sums hides floating-point latency. A more advanced notion of micro-level optimization is the ratio of floating-point operations (more precisely, floating multiply-adds) to data accesses in a compute step; the higher this ratio, the better.

      do i=1,n,k
        s1 = s1 + x(i)*y(i)
        s2 = s2 + x(i+1)*y(i+1)
        s3 = s3 + x(i+2)*y(i+2)
        s4 = s4 + x(i+3)*y(i+3)
        ...
        sk = sk + x(i+k-1)*y(i+k-1)
      end do
      dotp = s1 + s2 + s3 + s4 + ... + sk

Avoid Divide Operations

The following example illustrates a very common optimization: a floating-point divide is much more expensive than a multiply. If a loop divides by a loop-invariant value, it is better to compute the reciprocal once outside the loop and multiply instead, provided no dependencies exist. Another alternative is to replace the loop with an optimized vector intrinsics library, if available.

      ! divide inside the loop: one divide per iteration
      do i=1,n
        x(i) = y(i)/scalar
      end do

      ! hoist the divide: one divide total, then cheap multiplies
      const = 1.0/scalar
      do i=1,n
        x(i) = y(i)*const
      end do

Fortran90 Performance Pitfalls

Several coding issues impact the performance of Fortran90 applications. For example, consider the two ways of writing the same array update below, one with explicit loops and one with F90 array syntax:

Case 1:

    do j = js,je
      do k = ks,ke
        do i = is,ie
          rt(i,k,j) = rt(i,k,j) - smdiv*(rt(i,k,j) - rtold(i,k,j))
        end do
      end do
    end do

Case 2:

    rt(is:ie,ks:ke,js:je) = rt(is:ie,ks:ke,js:je) - &
        smdiv*(rt(is:ie,ks:ke,js:je) - rtold(is:ie,ks:ke,js:je))

The array syntax in the second approach, although more elegant, incurs a significant performance penalty relative to explicit loops on cache-based systems. (Vector systems tend to prefer the array syntax from a performance standpoint.) More importantly, the array syntax generates larger temporary arrays on the program stack.

The way the arrays are declared also impacts performance. The following example shows two ways of declaring F90 dummy arrays. The second (assumed-shape) form carries a much higher cost: compile time increases nearly seventy-fold, and run time by roughly 30%.

Case 1:

    REAL, DIMENSION( ims:ime , kms:kme , jms:jme ) :: r, rt, rw, rtold
    Results in F77-style assumed-size arrays
    Compile time: 46 seconds
    Run time: .064 seconds / call

Case 2:

    REAL, DIMENSION( ims: , kms: , jms: ) :: r, rt, rw, rtold
    Results in F90-style assumed-shape arrays
    Compile time: 3120 seconds!!
    Run time: .083 seconds / call

Another issue arises when an F90 assumed-shape array is passed as an argument to a subroutine: the subroutine may receive a copy of the array rather than the address of the array itself. This F90 copy-in/copy-out overhead is inefficient and may cause errors when calling external libraries.

IO Tuning

The TACC Lustre parallel I/O system is a collection of I/O servers and a large number of disks that act as one very large disk. One of the servers, the Meta-Data Server (MDS), tracks where each file is located on the disks of the different I/O servers.

Because Lustre has a large collection of disks, it is possible to read and write large files across many disks quickly. However, all opening, closing, and locating of files must go through the single Meta-Data Server, which often becomes a bottleneck in I/O performance. With hundreds of jobs running at the same time and sharing the Lustre file system, there can be much contention for access to the Meta-Data Server.

When considering I/O performance, follow the "avoid too often and too many" rule: avoid writing many small files, avoid opening and closing a file frequently, and avoid writing to a separate file for each task in a large parallel job; all of these stress the MDS. It is best to aggregate I/O operations whenever possible. For the best I/O performance, consider using a library such as parallel HDF5 to write single files in parallel efficiently.

A common-sense approach is to use what the vendor provides, i.e., to take advantage of the hardware. On Linux clusters, for example, this might mean using the Parallel Virtual File System (PVFS); on IBM systems, it would mean using IBM's fast General Parallel File System (GPFS).

Another sensible approach to optimizing I/O is to be aware of the existence and locations of the file systems, i.e., whether a file system is mounted locally or accessed over the network. Local file systems are much faster, because remote access is limited by network bandwidth, disk speed, and the overhead of accessing the file system over the network; local I/O should always be the goal at the design level.

Other approaches involve choosing the best software options available. Some are enumerated below:

  • Read or write as much data as possible with a single READ/WRITE/PRINT. Avoid performing multiple writes of small records.
  • Use binary instead of ASCII format because of the overhead incurred converting from the internal representation of real numbers to a character string. In addition, ASCII files are larger than the corresponding binary file.
  • In Fortran, prefer direct access to sequential access. Direct or random access files do not have record length indicators at the beginning and end of each record.
  • If available, use asynchronous I/O to overlap reads/writes with computation.

General I/O Tips

  1. Don't open and close files with every I/O operation. Open the files you need at the beginning of execution and close them at the end. Each open/close operation has an overhead cost that adds up, especially when multiple tasks open and close files at the same time.

  2. Use the $SCRATCH filesystem. $SCRATCH has more I/O servers than $WORK or $HOME. If you need to keep your input/output data on $WORK, you may add commands to your batch script to copy data into $SCRATCH at the beginning of a run and copy it out at the end.

  3. Limit the number of files in one directory. Directories that contain hundreds or thousands of files respond slowly to I/O operations due to the overhead of indexing the files. Break up directories with large numbers of files into subdirectories with fewer files.

  4. Aggregate I/O operations as much as possible. Fewer large read/write operations are much more efficient than many small operations. This may be accomplished by reducing the number of writers from every task to one per node (or fewer), balancing the bandwidth of a node against the bandwidth of the I/O servers.

Parallel I/O Striping

There are different Lustre striping options you should use when writing to a single file or multiple files.

  • Multiple Files/Multiple I/O tasks -- Set the stripe count to 1. For a large number of files, set the stripe count for the output directory to 1:
  • login1$ lfs setstripe -c 1 ./my_output_dir

    If you are writing the same amount of data with each write, you should try setting a default stripe size for the output files.

    login1$ lfs setstripe -s 16mb ./my_output_dir

    Please limit these settings to the directory in which the output will be stored.

    Do not set the stripe count higher than 160; attempting to do so will crash the Meta-Data Server.

  • Single File/Multiple tasks -- Increase the stripe count to the number of nodes.
  • For example, if 1024 tasks are writing to the same file from 64 nodes, set the stripe count to 64:

    login1$ lfs setstripe -c 64 ./my_output_file

    Also, experiment with the size of the stripe to improve performance:

    login1$ lfs setstripe -s 16mb ./my_output_file

Writing to a Single File using Parallel Libraries:

There are three common ways to write a single file in parallel: MPI I/O, NetCDF-4, and HDF5. Any of these works well; developers with no existing preference should consider HDF5. An effective way to learn HDF5 is to start with the HDF5 tutorial on the HDF Group website.

The T3PIO library improves the parallel performance of MPI I/O and HDF5 by setting the number of stripes and the stripe size at runtime for each file created. It is available as a module on TACC systems, and the source is available online.

Running Applications

SLURM Batch Environment

Batch facilities such as LoadLeveler, LSF, SGE and SLURM differ in their user interface as well as the implementation of the batch environment. Common to all, however, is the availability of tools and commands to perform the most important operations in batch processing: job submission, job monitoring, and job control (cancel, resource request modification, etc.). Here we provide an overview of the Simple Linux Utility for Resource Management (SLURM) batch environment, describe the Stampede queue structure and associated queue commands, and list basic SLURM job control commands along with options.

Each Stampede node (16 E5 cores and a Phi coprocessor, plus K20 GPUs on gpu nodes) can be assigned to only one user at a time; hence a complete node is dedicated to a user's job and accrues wall-clock time for all 16 cores whether they are used or not. Note: allocation usage is based solely on the E5 core hours used; the Phi coprocessors and the NVIDIA GPUs are "free" components within the nodes.

Stampede Queue Structure

The Stampede production queues and their characteristics (wall-clock and processor limits; priority charge factor; and purpose) are listed in Table 7.1 below. Queues that don't appear in the table (such as systest, sysdebug, and clean) are non-production queues for system and HPC group testing and special support.

Table 7.1 SLURM Batch Environment Queues

Queue Name Max Runtime Max Nodes/Procs Max Jobs in Queue SU Charge Rate Purpose
normal 48 hrs 256 / 4K 50 1 normal production
development 2 hrs 16 / 256 1 1 development nodes
largemem 48 hrs 4 / 128 4 2 large memory 32 cores/node
serial 12 hrs 1 / 16 8 1 serial/shared_memory
large 24 hrs 1024 / 16K 50 1 large core counts (access by request 1)
request 24 hrs -- 50 1 special requests
normal-mic 48 hrs 256 / 4K 50 1 production MIC nodes
normal-2mic 24 hrs 128 / 2K 50 1 production MIC nodes with two co-processors
gpu 24 hrs 32 / 512 50 1 GPU nodes
gpudev 4 hrs 4 / 64 5 1 GPU development nodes
vis 8 hrs 32 / 512 50 1 GPU nodes + VNC service
visdev 4 hrs 4 / 64 5 1 Vis development nodes (GPUs + VNC)

1Access to the large queue on Stampede is granted after demonstration of the scalability of your code up to 4096 cores. Please submit a ticket via the TACC User Portal to request large queue access. Include scalability data for your code in your request. Please see the example SLURM job script for large jobs.

Eight of the 128 GPU nodes have been reserved for development and are only accessible through the gpudev and visdev queues. The gpu and vis queues share the remaining 120 GPU nodes. The purpose of using two queues is to constrain the time limit for remote visualization and to provide a VNC daemon. All 6400 compute nodes contain at least one MIC co-processor. In addition, the 440 nodes in the normal-2mic queue contain two MIC co-processors. The normal queue provides access to any of the 6,120 compute nodes (6,400 less the GPU and development nodes), without regard to the presence of a MIC coprocessor. Use this queue if your application does not require MIC coprocessing. Use the normal-mic or normal-2mic queues to guarantee access to nodes that contain MIC coprocessors.

Viewing Stampede Queue Status

The latest queue information can be displayed several different ways with SLURM's "sinfo" and "squeue" commands, and TACC's "showq" utility.

The "sinfo" command without arguments gives you more information than you probably want. Use the print options prescribed in Table 7.2 below with sinfo for a more readable listing that summarizes each queue on a single line. The column labeled "NODES(A/I/O/T)" of this summary listing displays the number of nodes in the Allocated, Idle, and Other states, along with the Total node count for the partition. See "man sinfo" for more information. See also the squeue command detailed below.

TACC's "showq" job monitoring command-line utility displays jobs in the batch system in a manner similar to PBS' utility of the same name. showq summarizes running, idle, and pending jobs, also showing any advanced reservations scheduled within the next week. See Table 7.2 for some showq options.

login1$ showq

Table 7.2 sinfo and showq Commands

Command Description/Options
sinfo -o "%20P %5a %.10l %16F"
Lists the availability and status of queues
showq Summarizes jobs in the batch system:
  -l displays queue and node count columns
  -u reports only the current user's active and waiting jobs

SLURM Job Control

Job Submission with sbatch

Use SLURM's sbatch command to submit job scripts:

login1$ sbatch myjobscript

where myjobscript is the name of a UNIX format text file containing job script commands. This file should contain both shell commands and special statements that include sbatch options and resource specifications. Some of the most common options are described below in Table 7.3 and the example job scripts. Details are available online in man pages (e.g., execute "man sbatch" on Stampede).

Options can be passed to sbatch on the command-line or specified in the job script file. The latter approach is preferable. It is easier to store commonly used sbatch commands in a script file that will be reused several times rather than retyping the sbatch commands at every batch request. In addition, it is easier to maintain a consistent batch environment across runs if the same options are stored in a reusable job script. All batch submissions MUST specify a time limit and total tasks. Jobs that do not use the -t (time) and -n (total tasks) options will be rejected.

Batch scripts contain two types of statements: scheduler directives and shell commands. Scheduler directive lines begin with #SBATCH and are followed with sbatch options. The UNIX shell commands are interpreted by the shell specified on the first line after the #! sentinel; otherwise the Bash shell (/bin/bash) is used. By default, a job begins execution in the directory of submission with the local (submission) environment. The job script below requests an MPI job with 32 cores and 1.5 hours of run time:

#!/bin/bash
#SBATCH -J myMPI           # job name
#SBATCH -o myMPI.o%j       # output and error file name (%j expands to jobID)
#SBATCH -n 32              # total number of mpi tasks requested
#SBATCH -p development     # queue (partition) -- normal, development, etc.
#SBATCH -t 01:30:00        # run time (hh:mm:ss) - 1.5 hours
#SBATCH --mail-type=begin  # email me when the job starts
#SBATCH --mail-type=end    # email me when the job finishes
ibrun ./a.out              # run the MPI executable named a.out

By default, stderr and stdout are both sent to a file named "slurm-%j.out", where %j is replaced by the job ID; with only an -o option, both streams are directed to the designated output file. If you don't want stderr and stdout directed to the same file, use both the -e and -o options to designate separate output files.

Table 7.3 Common sbatch Options

Option Argument Function
-p queue_name Submits to queue (partition) designated by queue_name
-J job_name Job Name
-n total_tasks The job acquires enough nodes to execute total_tasks tasks (launching 16 tasks/node).
Use the -N option with the -n option when fewer than 16 tasks/node are required (e.g. for hybrid codes).
-N nodes This option can only be used in conjunction with the -n option (above). Use this option to launch fewer than 16 tasks per node. The job acquires nodes nodes, and total_tasks/nodes tasks are launched on each node.
--ntasks-per-xxx N/A The ntasks-per-core/socket/node options are not available on Stampede. The -N and -n options provide all the functionality needed for specifying a task layout on the nodes.
-t hh:mm:ss Wall clock time for job. Required
--mail-user= email_address Specify the email address to use for notifications.
--mail-type= begin, end, fail, or all Specify when user notifications are to be sent (one option per line).
-o output_file Direct job standard output to output_file (without -e option error goes to this file)
-e error_file Direct job error output to error_file
-d afterok:jobid Specifies a dependency: this run will start only after the specified job (jobid) successfully finishes
-A project_account_name Charges run to project_account_name. Used only for multi-project logins. Account names and reports are displayed at login.
Job Dependencies

Some workflows may have job dependencies, for example a user may wish to perform post-processing on the output of another job, or a very large job may have to be broken up into smaller pieces so as not to exceed maximum queue runtime. In such cases you may use SLURM's --dependency= options. The following command submits a job script that will run only upon successful completion of another previously submitted job:

 login1$ sbatch --dependency=afterok:jobid job_script_name
Job Monitoring with squeue

After job submission, users may monitor the status of their jobs in several ways. While a job is waiting, the system continuously monitors the number of nodes that become available and applies fair-share and backfill algorithms to determine a fair, expedient schedule that keeps the machine running at optimum capacity. Both the showq and squeue commands with the "-u username" option display similar information:

login1$ showq -u janeuser
WAITING JOBS------------------------
1676351   helloworld janeuser      Waiting 4096    15:30:00  Wed Sep 11 11:59:53
1676352   helloworld janeuser      Waiting 4096    15:30:00  Wed Sep 11 12:00:07
1676354   helloworld janeuser      Waiting 4096    15:30:00  Wed Sep 11 12:00:09
login1$ squeue -u janeuser
1676351      normal hellowor janeuser  PD       0:00    256 (Resources)
1676352      normal hellowor janeuser  PD       0:00    256 (Resources)
1676354      normal hellowor janeuser  PD       0:00    256 (Resources)

In the above examples, each command's output lists the three jobs (1676351, 1676352 & 1676354) waiting to run. The showq command displays cores and time requested, while the squeue command displays the partition (queue), the state (ST) of the job along with the node list when allocated. In this case, all three jobs are in the Pending (PD) state awaiting "Resources", (nodes to free up). Table 7.4 details common squeue options and Table 7.5 describes the command's output fields.

Table 7.4 Common squeue Options

Option Result
-i <interval> Repeatedly report at intervals (in seconds)
-j <job_list> Displays information for the specified job(s)
-p <part_list> Displays information for the specified partitions (queues)
--start Use with -j to estimate job start time
-t <state_list> Shows jobs in the specified state(s). See the squeue man page for state abbreviations: "all" or a list of {PD,R,S,CG,CD,CF,CA,F,TO,PR,NF}
-u <username> Displays information for the specified user

(<list> denotes a comma-separated list.)

Table 7.5 Columns in the squeue command output

Field Description
JOBID job id assigned to the job
USER user that owns the job
STATE current job status, including, but not limited to:
CD (completed)
CA (cancelled)
F (failed)
PD (pending)
R (running)

Using the squeue command with the --start and -j options can provide an estimate of when a particular job will be scheduled:

login1$ squeue --start -j 1676354
1676534    normal hellow   janeuser  PD  2013-08-21T13:42:03    256 (Resources)

Even more extensive job information can be found using the scontrol command. The output shows quite a bit about the job: job dependencies, submission time, node and CPU counts, the location of the job script, the working directory, etc. See the man page for more details.

login1$ scontrol show job 1676354
JobId=1676991 Name=mpi-helloworld
   UserId=slindsey(804387) GroupId=G-40300(40300)
   Priority=1397 Account=TG-STA110012S QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=15:30:00 TimeMin=N/A
   SubmitTime=2013-09-11T15:12:49 EligibleTime=2013-09-11T15:12:49
   StartTime=2013-09-11T17:40:00 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=normal AllocNode:Sid=login4:27520
   ReqNodeList=(null) ExcNodeList=(null)
   NumNodes=256-256 NumCPUs=4096 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Job Deletion with scancel

The scancel command is used to remove pending and running jobs from the queue. Include a space-separated list of job IDs that you want to cancel on the command-line:

login1$ scancel job_id1 job_id2 ...

Use "showq -u" or "squeue -u username" to see your jobs.

Example job scripts are available online in /share/doc/slurm. They include details for launching large jobs, running multiple executables with different MPI stacks, executing hybrid applications, and other operations.

SLURM and Environment Variables

In addition to the environment variables that can be inherited by the job from the interactive login environment, SLURM provides environment variables for most of the values used in the #SBATCH directives. These are listed at the end of the sbatch man page. The environment variables SLURM_JOB_ID, SLURM_JOB_NAME, SLURM_SUBMIT_DIR and SLURM_NTASKS_PER_NODE may be useful for documenting run information in job scripts and output. Table 7.6 below lists some important SLURM-provided environment variables.

Note that environment variables cannot be used in an #SBATCH directive within a job script. For example, a directive such as the following will not work as expected:

#SBATCH -o myMPI.o$SLURM_JOB_ID

Instead, use the following directive:

#SBATCH -o myMPI.o%j

where "%j" expands to the jobID.

You may also pass variables from the host submission environment to SLURM via the command-line using the "--export" option to sbatch, or you may use the pre-defined SLURM environment variables in the shell portion of your job script. The following command passes two environment variables, VAR1 and VAR2 from the host submission environment to SLURM:

login1$ sbatch --export="VAR1=valueA,VAR2=valueB" myjobscript

Table 7.6. SLURM Environment Variables

Environment Variable Description
SLURM_JOB_ID batch job id assigned by SLURM upon submission
SLURM_JOB_NAME user-assigned job name
SLURM_NNODES number of nodes
SLURM_NODELIST list of nodes
SLURM_NTASKS total number of tasks
SLURM_QUEUE queue (partition)
SLURM_SUBMIT_DIR directory of submission
SLURM_TASKS_PER_NODE number of tasks per node

Runtime Environments

This section discusses how to submit jobs for your particular programming model: MPI, hybrid (OpenMP+MPI), symmetric, and serial codes. Stampede also provides interactive access to the development nodes for users still experimenting with their codes.

In the examples below we use "a.out" as the executable name, but of course the name may be any valid application name, possibly with arguments and file redirections, e.g.:

ibrun tacc_affinity ./myprogram myargs < myinput

Please consult the sample SLURM job submission scripts below for various runtime configurations.

Runtime Environment for Scalable MPI Programs

The MVAPICH2 MPI package provides a runtime environment that can be tuned for scalable codes. For applications dominated by short messages, there is a FAST_PATH option that can reduce communication costs, as well as a mechanism to share receive queues. There is also a Hot-Spot Congestion Avoidance option for quelling communication patterns that produce hot spots in the switch. See Chapter 9, Scalable Features for Large Scale Clusters and Performance Tuning, and Chapter 10, MVAPICH2 Parameters, of the MVAPICH2 User Guide for more information.

It is relatively easy to distribute tasks on nodes in the SLURM parallel environment when only the E5 cores are used. The -N option sets the number of nodes and the -n option sets the total MPI tasks.

Launching MPI Applications with ibrun

For all codes compiled with any MPI library, use the ibrun command (NOT mpirun) to launch the executable within the job script. The syntax is:

login1$ ibrun ./myprogram 
login1$ ibrun ./a.out 2000

The ibrun command supports options for advanced host selection: a subset of the processors from the list of all hosts can be selected to run an executable, with an offset applied into the host list. The offset can also be used to run two different executables on two different subsets simultaneously. The option syntax is:

ibrun -n number_of_cores -o hostlist_offset myprogram myprogram_args

For the following advanced example, 64 cores were requested in the job script.

login1$ ibrun -n 32 -o  0 ./a.out &
login1$ ibrun -n 32 -o 32 ./a.out &
login1$ wait

The first call launches a 32-core run on the first 32 hosts in the hostfile, while the second call launches a 32-core run on the second 32 hosts in the hostfile, concurrently (by terminating each command with "&"). The wait command (required) waits for all processes to finish before the shell continues. The wait command works in all shells. Note that the -n and -o options must be used together.

The ibrun command also supports the "-np" option which limits the total number of tasks used by the batch job.

ibrun -np number_of_cores myprogram myprogram_args 

Unlike the "-n" option, the "-np" option requires no offset. It is assumed that the offset is 0.

Using a multiple of 16 cores per node

For many pure MPI applications, the most cost-efficient choice is to use 16 tasks per node, which ensures that every core on every node is assigned one task. Specify the total number of tasks as a value evenly divisible by 16; SLURM will automatically place 16 tasks on each node. (If the number of tasks is not divisible by 16, fewer than 16 tasks will be placed on one of the nodes.)

The following example will run on 4 nodes, 16 tasks per node:

  #SBATCH -n 64

Do not use the -N (number of nodes) option alone; only a single task will be launched on each node in this case.

Using fewer than 16 cores per node

When fewer than 16 tasks are needed per node, use a combination of -n and -N. The following resource request

#SBATCH -N 4 -n 32

requests 4 nodes with 32 tasks distributed evenly across the nodes and sockets. The number of tasks per node is determined from the ratio tasks/nodes, and the tasks on a node are divided evenly across the two sockets (one socket acquires an extra task when the task count is odd). When the tasks/nodes ratio is not an integer, floor(tasks/nodes) tasks are placed on each node, and the remaining tasks are assigned sequentially to nodes in the hostfile list, one to each node until none remain. The distribution across sockets allows maximal memory bandwidth to each socket.
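The placement rule can be sketched with shell arithmetic (an illustration of the documented behavior, not SLURM itself); for "-N 4 -n 30", a non-integer ratio, it yields 8, 8, 7, 7 tasks per node:

```shell
# Sketch of SLURM's task placement for "#SBATCH -N 4 -n 30" (illustrative).
NODES=4
TASKS=30
BASE=$((TASKS / NODES))    # floor(tasks/nodes) placed on every node
EXTRA=$((TASKS % NODES))   # leftover tasks, one each to the first nodes listed
i=0
while [ $i -lt $NODES ]; do
  if [ $i -lt $EXTRA ]; then n=$((BASE + 1)); else n=$BASE; fi
  echo "node $i: $n tasks"
  i=$((i + 1))
done
```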

Runtime Environment for Hybrid Programs

For hybrid jobs, specify a total-task/node ratio of 1, 2, 4, 6, 8, 12, or 16. Then set the $OMP_NUM_THREADS environment variable to the number of threads per task and use tacc_affinity with ibrun. The Bourne-shell example below illustrates the parameters for a hybrid job: it requests 2 nodes and 4 tasks (2 tasks per node) with 8 threads per task.

  #SBATCH -n4 -N2
  export OMP_NUM_THREADS=8 #8 threads/task
  ibrun tacc_affinity ./a.out

Please see the sample SLURM job scripts section below for a complete hybrid job script.

Runtime Environment for Symmetric Computing

An MPI application can run tasks on both host CPUs and MICs. This is called symmetric computing because the host CPUs and the MICs act as separate nodes on which MPI processes may be launched "symmetrically", unlike offloading, which depends upon the host to distribute work to the MIC. As a reminder, a node contains both host CPUs and a MIC. Below, references to "CPUs" refer to the Sandy Bridge component of the node and references to a "MIC" refer to the Intel Xeon Phi coprocessor card.

There are 4 steps of preparation for a symmetric run:

  1. Build separate executables for the host and the MIC
  2. Determine the appropriate number of CPU and MIC MPI tasks per node, as well as the number of host and MIC OpenMP threads per MPI task.
  3. Create a job script that requests resources and specifies the distribution of tasks across hosts and MICs
  4. Launch the two executables (CPU/MIC binaries) using ibrun.symm.

Presently, CPU and MIC MPI executables for symmetric execution can only be built with impi or mvapich2-mic. Since the mvapich2 module is loaded by default, swap this module for either the impi or mvapich2-mic module as shown below.

  $ module swap mvapich2 impi        # or: module swap mvapich2 mvapich2-mic
  $ mpif90/mpicc/mpic++ -O3 -xhost myprogram.f90/c/cpp -o a.out.cpu
  $ mpif90/mpicc/mpic++ -O3 -mmic  myprogram.f90/c/cpp -o a.out.mic

Once MIC and CPU executables for an application have been created, named "a.out.mic" and "a.out.cpu", respectively, the two executables may be launched using ibrun.symm in a job script with the following syntax:

$ ibrun.symm -m ./a.out.mic -c ./a.out.cpu

where the "-m" and "-c" options specify the MIC and CPU executables, respectively. ibrun.symm does not support the "-o" and "-n" options of the regular ibrun command.

If the executables require command-line arguments, combine each executable and its arguments in quotes, as shown here:

$ ibrun.symm -m "./a.out.mic args" -c "./a.out.cpu args"

where args are the command-line arguments required by the "a.out.mic" and "a.out.cpu" executables. If the executables require redirection from stdin, create a simple executable shell script that runs the executable with the redirection, e.g.,

#!/bin/sh
./a.out.mic args < inputfile

Note: The bash, tcsh, and csh shells are not available on the MIC. Only the sh shell interpreter runs on the MIC, therefore we begin each MIC shell script with "#!/bin/sh" to specify that the sh shell is to be used and remind users that only sh syntax (a subset of Bash) is allowed in the MIC shell script. Any shell can be used for the host script. These scripts are to be used as arguments of the ibrun.symm command:

$ ibrun.symm -m ./my_mic_script.sh -c ./my_cpu_script.sh

No quoting is required for executable arguments in the shell scripts.

The total number of tasks to be executed on the host and the number of hosts to use should be entered as SLURM resource options ("-n total_tasks" and "-N nodes") in the job script, as described in the SLURM Parallel Environment section above. However, only the total number of tasks needs to be specified if 16 tasks per host are to be used. The following SLURM directive assumes 16 tasks per node:

#SBATCH -n total_tasks

A combination of the total number of nodes and total number of tasks should be specified when fewer than 16 tasks per host (CPUs) are required. The following SLURM directive allocates total_tasks/nodes tasks per node across the specified number of nodes.

#SBATCH -N nodes -n total_tasks

The number of SUs billed depends on the total number of nodes used:

  SUs billed = # nodes * 16 cores/node * wallclock time

for all queues except the largemem queue. This charge applies whether or not you use the MICs.
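For example (a hypothetical job), the charge for a 4-node, 2-hour run in a standard queue works out as:

```shell
# SU charge sketch: SUs = nodes * 16 cores/node * wallclock hours
NODES=4
HOURS=2
SUS=$((NODES * 16 * HOURS))
echo "$NODES nodes for $HOURS hours: $SUS SUs"
```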
Use $OMP_NUM_THREADS to set the number of OpenMP threads per CPU process (CPU MPI task).

Use the "$MIC_PPN" and "$MIC_OMP_NUM_THREADS" environment variables to set the number of tasks and threads for each MIC: $MIC_PPN holds the number of MPI processes (tasks) per MIC, and $MIC_OMP_NUM_THREADS the number of threads per MIC process (MPI task). The job script snippet below illustrates a job that executes 8 MPI CPU tasks per node on 4 nodes using 2 threads per CPU task. With the "a.out.mic" executable it executes 2 MPI processes (tasks) on each MIC, with 60 threads per MIC task.

  #SBATCH -N 4 -n 32
  export OMP_NUM_THREADS=2
  export MIC_PPN=2
  export MIC_OMP_NUM_THREADS=60
  ibrun.symm -m a.out.mic  -c a.out.cpu

The MPI tasks will be allocated in consecutive order by node (CPU tasks first, then MIC tasks). For example, the task allocation described by the above script snippet will be:

  NODE1:  8 host tasks ( 0 - 7) :  2 MIC tasks ( 8 - 9)
  NODE2:  8 host tasks (10 -17) :  2 MIC tasks (18 -19)
  NODE3:  8 host tasks (20 -27) :  2 MIC tasks (28 -29)
  NODE4:  8 host tasks (30 -37) :  2 MIC tasks (38 -39)

Although the total number of host MPI tasks may still be controlled with the $MY_NSLOTS environment variable, the number of MIC MPI tasks is set uniformly across the MICs with the $MIC_PPN environment variable. If your application needs a more advanced configuration of the CPU and MIC task topology, please contact TACC Consulting.
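The consecutive CPU-then-MIC numbering follows directly from the per-node counts; the sketch below (plain shell, for illustration only) reproduces the rank ranges shown above:

```shell
# Rank-range sketch: 8 CPU + 2 MIC tasks per node on 4 nodes,
# numbered consecutively by node, CPU tasks first.
CPU_PPN=8; MIC_PPN=2; NODES=4
PER_NODE=$((CPU_PPN + MIC_PPN))
node=0
while [ $node -lt $NODES ]; do
  first=$((node * PER_NODE))
  echo "NODE$((node + 1)): host tasks $first-$((first + CPU_PPN - 1)) : MIC tasks $((first + CPU_PPN))-$((first + PER_NODE - 1))"
  node=$((node + 1))
done
```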

Please see the sample SLURM job scripts section below for a sample symmetric job script.

Runtime Environment for Serial Programs

For serial batch executions, use one node and one task, do not use the ibrun command to launch the executable (just use the executable name), and submit your job to the serial queue (partition). The serial queue has a 12-hour runtime limit and allows up to 6 simultaneous runs per user. There are 148 nodes available for the serial queue.

  #SBATCH -N 1 -n 1  # one node and one task
  #SBATCH -p serial  # run in serial queue
  ./a.out            # execute your application (no ibrun)

Runtime Environment for Development

Interactive access to a single node on the supercomputer is extremely useful for developing and debugging codes that may not be ready for full-scale deployment. Interactive sessions are charged to accounts just like normal batch jobs. Please restrict usage to the (default) development queue.


TACC's HPC staff have recently implemented the idev application on Stampede. idev provides interactive access to a single node and then spawns the resulting interactive environment to as many terminal sessions as needed for debugging purposes. idev is simple to use, bypassing the arcane syntax of the srun command.

In the sample session below, a user requests interactive access to a single node for 15 minutes in order to debug the progindevelopment application. idev returns a compute node login prompt:

login1$ idev -m 15    
--> Sleeping for 7 seconds...OK
--> Creating interactive terminal session (login) on master node c557-704.
c557-704$ vim progindevelopment.c
c557-704$ make progindevelopment

Now the user may open another window to run the newly-compiled application, while continuing to debug in the original terminal session:

WINDOW2 c557-704$ ibrun -np 16 ./progindevelopment
WINDOW2 ...program output ...
WINDOW2 c557-704$

Use the "-h" switch to see more options:

login1$ idev -h

SLURM's srun command requests a batch job interactively, returning a prompt on a compute node, usually within a short period of time. Issue the srun command only from a login node. Command syntax is:

srun --pty -A acct -p queue -t hh:mm:ss -n tasks -N nodes /bin/bash -l

The "-A", "-p", "-t", "-n" and "-N" batch options respectively specify the account, queue (partition), time, total number of tasks and the number of nodes. If a login is associated with only one project, the account does not need to be specified. The batch job is terminated when the shell is exited. The following example illustrates a request for 1 hour in the development queue on one compute node using the bash shell, followed by an MPI executable launch.

login1$ srun --pty -p development -t 01:00:00 -n16 /bin/bash -l
c423-001$ ibrun ./a.out

Sample SLURM Job Scripts

Sample SLURM job submission scripts are provided below.

Affinity and Memory Locality

HPC workloads often benefit from pinning processes to hardware instead of allowing the operating system to migrate them at will. This is particularly important in multicore and heterogeneous systems, where process (and thread) migration can lead to suboptimal memory access and resource sharing patterns, and thus a significant performance degradation. TACC provides an affinity script called tacc_affinity to enforce strict local memory allocation and pin processes to their sockets. For most HPC workloads, the use of tacc_affinity will ensure that processes do not migrate and memory accesses are local. To use tacc_affinity with your MPI executable, use this command:

  c423-001$ ibrun tacc_affinity a.out

This will apply an affinity appropriate to the tasks_per_socket option (or a sensible default affinity if tasks_per_socket is not used) and a memory policy that forces memory allocations to the local socket. Try ibrun with and without tacc_affinity to determine whether your application runs better with the TACC affinity settings.

However, there may be instances in which tacc_affinity is not flexible enough to meet a user's requirements. This section describes techniques to control process affinity and memory locality that can be used to improve execution performance on Stampede and other HPC resources. In this section an MPI task is synonymous with a process.

Do not use multiple methods to set affinity simultaneously as this can lead to unpredictable results.

Using numactl

numactl is a Linux command that allows explicit control of process affinity and memory policy. Since each MPI task is launched as a separate process, numactl can be used to specify the affinity and memory policy for each task. There are two ways to exercise NUMA control when launching a batch executable:

c423-001$ ibrun numactl options ./a.out
c423-001$ ibrun my_affinity ./a.out

The first command sets the same options for each task. Because the ranks for the execution of each a.out are not known to numactl it is not possible to use this command-line to tailor options for each individual task. The second command launches an executable script, my_affinity, that sets affinity for each task. The script will have access to the number of tasks per node and the rank of each task, and so it is possible to set individual affinity options for each task using this method. In general any execution using more than one task should employ the second method to set affinity so that tasks can be properly pinned to the hardware.

In threaded applications, the same numactl command may be used, but its scope extends globally to all threads, because every forked process or thread inherits the affinity and memory policy of the parent. This behavior can be modified from within a program using the NUMA API: the basic calls for binding processes and threads are "sched_getaffinity" and "sched_setaffinity", while memory policy is controlled through the libnuma routines. Note, on the login nodes the core numbers for masking are assigned round-robin to the sockets (cores 0, 2, 4, ... are on socket 0 and cores 1, 3, 5, ... are on socket 1), while on the compute nodes they are assigned contiguously (cores 0-7 are on socket 0 and 8-15 are on socket 1).
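A per-task wrapper in the spirit of my_affinity can be sketched as follows. The script name, the rank variable, and the socket rule are assumptions for illustration (MVAPICH2 exports the node-local rank as MV2_COMM_WORLD_LOCAL_RANK); the sketch prints its binding decision, and a real wrapper would instead end by exec-ing numactl:

```shell
#!/bin/sh
# Hypothetical per-task affinity wrapper (sketch; adapt to your MPI stack).
# Assumption: the launcher exports the node-local rank, e.g. MVAPICH2's
# MV2_COMM_WORLD_LOCAL_RANK; default to 0 so the sketch runs standalone.
LOCAL_RANK=${MV2_COMM_WORLD_LOCAL_RANK:-0}
SOCKET=$((LOCAL_RANK % 2))   # alternate tasks between the two sockets
echo "task $LOCAL_RANK -> socket $SOCKET"
# A real wrapper would end with:
#   exec numactl -N $SOCKET -m $SOCKET "$@"
```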

The TACC-provided affinity script, tacc_affinity, enforces strict local memory allocation to the socket (forcing eviction of the previous user's IO buffers) and distributes tasks evenly across sockets. Use this script as a template if a custom affinity script is needed for your jobs.

Table 7.7 Common numactl options

Option        Arguments                               Description
-N            0,1                                     Socket affinity. Execute the process only on this (these, comma-separated list of) socket(s).
-C            0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15   Core affinity. Execute the process only on this (these, comma-separated list of) core(s).
-l            none                                    Memory policy. Allocate only on the socket where the process runs; fall back to the other socket if full.
-i            0,1                                     Memory policy. Strictly allocate round-robin on these (comma-separated list of) sockets. No fallback; abort if no more allocation space is available.
-m            0,1                                     Memory policy. Strictly allocate on this (these, comma-separated list of) socket(s). No fallback; abort if no more allocation space is available.
--preferred=  0,1 (select only one)                   Memory policy. Allocate on this socket; fall back to the other if full.

Additional details on numactl are given in its man page and help information:

  login1$ man numactl
  login1$ numactl --help

Using Intel's KMP_AFFINITY

To alleviate the complexity of setting affinity on architectures that support multiple hardware threads per core, such as the MIC family of coprocessors, Intel provides a means of controlling thread pinning via the $KMP_AFFINITY environment variable.

The way to set $KMP_AFFINITY is:

login1$ export KMP_AFFINITY=[<modifier>,...]type

For offload execution on the Phi coprocessors use the $MIC_KMP_AFFINITY variable to control affinity.

A comprehensive description and list of modifiers and types is given in the Intel "Thread Affinity Interface" document, but the basic options are shown in the table below.

Table 7.8 Common KMP_AFFINITY options

Argument Default Description
Modifier   noverbose   Optional. String consisting of a keyword and specifier.
  • granularity=<specifier>
    Takes the following specifiers: fine (pinned to HW thread) or core (pinned to core, able to jump between HW threads within the core).
  • norespect / respect (OS thread placement)
  • noverbose / verbose
  • nowarnings / warnings
  • proclist={<proc-list>} (for explicit affinity setting)
Type       none        Required string. Indicates the affinity type to use.
  • compact (pack threads close to each other)
  • explicit (use the proclist modifier to pin threads)
  • none (does not pin threads)
  • scatter (Round-Robin threads to cores)
  • balanced (Phi coprocessor only, use scatter but keep OMP thread ids consecutive)

The meaning of the different affinity types is best explained with an example. Imagine that we have a system with 4 cores and 4 hardware threads per core. If we place 8 threads, the assignments produced by the compact, scatter, and balanced types are shown in the figure below. Notice that compact does not fully utilize all the cores in the system; for this reason it is recommended that applications be run using the scatter or balanced (Phi coprocessor only) types in most cases.
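The three placements for this 4-core, 4-hardware-thread example can be sketched with simple index arithmetic (illustrative only; the real assignments come from the OpenMP runtime):

```shell
# Sketch: thread-to-core maps for 8 OpenMP threads on 4 cores x 4 HW threads.
CORES=4; HWTHREADS=4; NTHREADS=8
PER_CORE=$((NTHREADS / CORES))   # balanced: consecutive thread ids per core
t=0
while [ $t -lt $NTHREADS ]; do
  compact=$((t / HWTHREADS))     # fill every HW thread of a core before moving on
  scatter=$((t % CORES))         # round-robin threads across cores
  balanced=$((t / PER_CORE))     # spread across cores, keep ids consecutive
  echo "thread $t: compact=core$compact scatter=core$scatter balanced=core$balanced"
  t=$((t + 1))
done
```

Note that compact leaves cores 2 and 3 idle in this case, which is why scatter or balanced is usually preferred.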

Figure 1.5 KMP affinity-type distributions

File Systems

The TACC HPC platforms have several different file systems with distinct storage characteristics. There are pre-defined, user-owned directories in these file systems for users to store their data. Of course, these file systems are shared with other users, so they are managed by either a quota limit, a purge policy (time-residency) limit, or a migration policy.

The $HOME, $WORK and $SCRATCH directories on Stampede are Lustre file systems, designed for parallel, high-performance access to large files from within applications. They have been configured to work well with MPI-IO and support access from many compute nodes. Since metadata services for each file system are provided by a single server (a limitation of Lustre), users should consider efficient strategies for minimizing file operations (opening and closing files) when scaling applications to large node counts.

To determine the amount of disk space used in a file system, cd to the directory of interest and execute the df -k . command (including the dot, which represents the current directory), as demonstrated below:

  login1$ cd mydirectory
  login1$ df -k .
  Filesystem          1K-blocks   Used     Available   Use% Mounted on
                      15382877568 31900512 15350977056 1%   /home1

In the command output above, the file system name appears on the left (an IP number followed by the file system name), the used and available space (in units of 1 KB, selected by -k) appear in the middle columns, followed by the percent used and the mount point.

To determine the amount of space used in a user-owned directory, cd to the directory and execute the du -sh command (s = summary, h = human-readable units):

  login1$ du -sh

To determine quota limits and usage on $HOME and $WORK, execute the Lustre file system "lfs quota" command with the directory as its argument (from any directory). Usage and quotas are also reported at each login.

  login1$ lfs quota $HOME
  login1$ lfs quota $WORK

Stampede's major file systems, $HOME, $WORK, $SCRATCH, /tmp and $ARCHIVE, are detailed below.

$HOME
  • At login, the system automatically sets the current working directory to your home directory.
  • Store your source code and build your executables here.
  • This directory has a quota limit of 5GB and 150K files.
  • This file system is backed up.
  • The login nodes and any compute node can access this directory.
  • Use the environment variable $HOME to reference your home directory in scripts.
  • Use the "cdh" or "cd" commands to change to $HOME.

$WORK
  • This directory has a quota limit of 400GB and 3M files.
  • Store large files here.
  • Change to this directory in your batch scripts and run jobs in this file system.
  • The work file system is approximately 450 TB.
  • This file system is not backed up.
  • The login nodes and any compute node can access this directory.
  • Purge Policy: not purged
  • Use the environment variable $WORK to reference this directory in scripts.
  • Use "cdw" to change to $WORK.

$SCRATCH
  • Store large files here.
  • Change to this directory in your batch scripts and run jobs in this file system.
  • The scratch file system is approximately 8.5PB.
  • This file system is not backed up.
  • The login nodes and any compute node can access this directory.
  • Purge Policy: Files with access times greater than 10 days are purged.
  • Use $SCRATCH to reference this directory in scripts.
  • Use the "cds" command to change to $SCRATCH.
  • NOTE: TACC staff may delete files from $SCRATCH if the file system becomes full, even if files are less than 10 days old. A full file system inhibits use of the file system for everyone. The use of programs or scripts to actively circumvent the file purge policy will not be tolerated.

/tmp
  • This is a directory in a local disk on each node where you can store files and perform local I/O for the duration of a batch job.
  • It is often more efficient to use and store files directly in $SCRATCH (to avoid moving files from /tmp at the end of a batch job).
  • The /tmp file system has approximately 80GB available to users.
  • Files stored in the /tmp directory on each node are removed immediately after the job terminates.
  • Use "/tmp" to reference this file system in scripts.

$ARCHIVE
Stampede's archival storage system is Ranch and is accessible via the $ARCHIVER and $ARCHIVE environment variables.

  • Store permanent files here for archival storage.
  • This file system is NOT NFS mounted (directly accessible) on any node.
  • Use the $ARCHIVE file system only for long-term file storage to the $ARCHIVER system; it is not appropriate to use it as a staging area.
  • Use the scp command to transfer data to and from this system. For example, to copy a file to Ranch:
login1$ scp mybigfile ${ARCHIVER}:
    To retrieve a file:
login1$ scp ${ARCHIVER}:$ARCHIVE/mybigfile $WORK
  • Use the ssh command to login to the $ARCHIVER system from any TACC machine. For example:
login1$ ssh $ARCHIVER
  • Some files stored on the archiver may require staging prior to retrieval.
  • See the Ranch User Guide for more on archiving your files.

Coprocessor (MIC) Programming

Many Fortran or C/C++ applications designed to run on the E5 processor (host) can be modified to automatically execute blocks of code or routines on the Phi coprocessor through directives. The Intel compiler, without requiring any additional options, will interpret the directives and include Phi executable code within the normal executable binary. A binary with Phi executable offload code can be launched on the host in the usual manner (with ibrun for MPI codes, and as a process execution for serial and OpenMP), and the offloaded sections of code will automatically execute on the Phi coprocessor.

There are a few points to remember when discussing computations on the host (E5 CPUs) and coprocessor (Phi):

  1. The instruction sets and architectures of the host E5 and Phi coprocessor are quite similar, but are not identical. (Expect differences in performance.)
  2. Host processors and MIC coprocessors have their own memory subsystems. They are effectively separate SMP systems with their own OS and environment.
  3. Programming details for offloading can be found in the respective user guides listed below.

    Automatic Offloading

    Some of the MKL routines that perform large amounts of floating-point operations relative to data accesses (computational complexity O(n^3) versus O(n^2) data accesses; e.g., level 3 BLAS) have been configured with automatic offload (AO) capabilities. This capability allows the user to offload work in the library routines automatically, without any coding changes. No special compiler options are required: just compile with the usual flags and the MKL library load options (-mkl is the new shortened way to load the MKL libraries). Then set the "$MKL_MIC_ENABLE" environment variable to request automatic offload at run time:

      login1% ifort -mkl -xhost -O3 app_has_MKLdgemm.f90
      login1% export MKL_MIC_ENABLE=1
      login1% ./a.out

    Depending upon the problem size (e.g., n>2048 for dgemm) the library runtime may choose to run all, part, or none of the routine on the coprocessor. Offloading and the work division between the CPU and MIC are transparent to the user, but they may be controlled with environment variables and Fortran/C/C++ APIs (application program interfaces), particularly when compiler-assisted offloading is also employed. MPI applications that use multiple tasks per node will also need to adjust the workload division so that all of the tasks share the coprocessor. For example, setting the $MKL_MIC_WORKDIVISION environment variable, or calling the support function mkl_mic_set_workdivision() with a fraction value, advises the runtime to give the MIC that fraction of the work. Set the $OFFLOAD_REPORT variable value, or the mkl_mic_set_offload_report function argument, to 0-2 to disclose a range of information, as shown below:

      login1% export MKL_MIC_ENABLE=1 OFFLOAD_REPORT=2
      login1% ./a.out

    Details and a list of all the automatic offload controls are available in the MKL User Guide document.

    Compiler Assisted Offloading

    Developers can explicitly direct a block of code or a routine to be executed on the MIC using directives in the base Fortran and C/C++ languages. The code to be executed on the MIC is called an offload region. No special coding is required in an offloaded region, and Intel-specific and OpenMP threading methods may be used. Code Example 1 illustrates an offload directive for a code block containing an OpenMP loop. The "target(mic:0)" clause specifies that the MIC coprocessor with id=0 should execute the code region.

    When the host execution encounters the offload region the runtime performs several offload operations: detection of a target Phi coprocessor, allocation of memory space on the coprocessor, data transfer from the host to the coprocessor, execution of the coprocessor binary on the Phi, transfer of data from the Phi back to the host after the completion of the coprocessor binary, and memory deallocation. The offload model is suitable when the data exchanged between the host and the MIC consists of scalars, arrays, and Fortran derived types and C/C++ structures that can be copied using a simple memcpy. This data characteristic is often described as being flat or bit-wise copyable. The data to be transferred at the offload point need not be declared or allocated in any special way if the data is within scope (as in Code Example 1); although pointer data (arrays pointed to by a pointer) need their size specified (see Advanced Offloading).

    Example 9.1 Offloaded OpenMP code block with automatic data transfer

    C code:

        int main(){
            float a[N], b[N], c[N];
        #pragma offload target(mic:0)
        #pragma omp parallel for
            for (int i=0; i<N; i++){
                /* loop body */
            }
        }

    F90 code:

        program main
          real :: a(N), b(N), c(N)
          !dir$ offload begin target(mic:0)
          !$omp parallel do
          do i=1,N
          end do
          !dir$ end offload
        end program

    By default the compiler recognizes any offload directive. During development it is useful to observe the names and sizes of the variables tagged for transfer by including the "-opt-report-phase=offload" option, as shown here:

      login2% ifort/icc/icpc -openmp -O3 -xhost -opt-report-phase=offload myprogram.f90/c/cpp
      login2% export OMP_NUM_THREADS=16
      login2% export MIC_ENV_PREFIX=MIC  MIC_OMP_NUM_THREADS=240  KMP_AFFINITY=scatter
      login2% ./a.out

    The "-openmp" and "-O3" options apply to both the host (E5 CPU) and offload (MIC) code regions, while "-xhost" is specific to the host code. Environment variables such as $OMP_NUM_THREADS will normally have different values on the host and the MIC. In these cases variables intended for the MIC should be prefixed with "MIC_" and set on the host as shown above; the "$MIC_ENV_PREFIX" variable must also be set to "MIC". (Actually, any prefix may be used, but we strongly recommend MIC.)

    Advanced Offloading

    A few of the important concepts you will need to develop and optimize offload paradigms are summarized below. The corresponding directives, clauses and qualifiers are explained as well. More details and examples, as well as references to Intel documentation are provided in TACC's Advanced Offloading document.

    Data Transfers: in/out/inout: In Code Example 1 the compiler ensures that the a, b and c arrays are copied over to the MIC before the offloaded region is executed, and are copied back at the end of the execution. Because the a array is only written on the MIC there is no reason to "copy in" the array to the coprocessor; likewise there is no reason to "copy out" b and c. To eliminate unnecessary transfers, use the data intent clauses (in, out, inout) on the offload directive.

    Persistent Data: alloc_if() and free_if():

    The automatic data transfers in Code Example 1 allocate storage on the MIC, transfer the data, and deallocate the storage for each call. If the same data is to be used in different offloads, the data can be made to persist across offloads by modifying the memory-allocation defaults with the alloc_if(arg) and free_if(arg) qualifiers within the data intent clauses (in, out, inout). If the argument is false (.false. for Fortran, 0 for C, false for C++) the allocation or deallocation, respectively, is not performed.

    Data Transfer Directive: offload_transfer:

    The programmer can transfer data without offloading executable code. The offload_transfer directive fulfills this function. It is a stand-alone directive (requiring no code block), and uses all the same data clauses and modifiers of a normal offload statement. One common use case is to initially load persistent data (asynchronously) onto the MIC at the beginning of a program.

    Asynchronous Offloading: signal and wait:

    Often a developer may want to transfer data or do offload work while continuing to do work on the CPU. An offload region can be executed asynchronously when a signal clause is included on the directive. The host process encountering the offload initiates it (offload or offload_transfer) and then immediately continues to execute the program code following the offload region. The offload event is identified by a variable argument within the signal clause; the same variable is then used in a wait clause on a subsequent offload directive or in a stand-alone wait directive.

    GPU Programming

    Accelerator (CUDA) Programming

    NVIDIA's CUDA compiler and libraries are accessed by loading the CUDA module:

    login1$ module load cuda

    Use the nvcc compiler on the login node to compile code, and run executables on nodes with GPUs -- there are no GPUs on the login nodes. Stampede's K20 GPUs are compute capability 3.5 devices. When compiling your code, make sure to specify this level of capability with:

    nvcc -arch=compute_35 -code=sm_35 ...

    GPU nodes are accessible through the gpu queue for production work and the devel-gpu queue for development work. Production job scripts should include the "module load cuda" command before executing CUDA code; likewise, load the cuda module before or after acquiring an interactive development GPU node with the "srun" command.

    The NVIDIA CUDA debugger is cuda-gdb. Applications must be debugged through a VNC session or an interactive srun session. Please see the relevant srun and VNC sections for more details.

    The NVIDIA Compute Visual Profiler, computeprof, can be used to profile both CUDA and OpenCL programs that have been developed in NVIDIA CUDA/OpenCL programming environment. Since the profiler is X based, it must be run either within a VNC session or by ssh-ing into an allocated compute node with X-forwarding enabled. The profiler command and library paths are included in the $PATH and $LD_LIBRARY_PATH variables by the CUDA module. The computeprof executable and libraries can be found in the following respective directories:


    For further information on the CUDA compiler, programming, the API, and debugger, please see:

    • $TACC_CUDA_DIR/doc/nvcc.pdf
    • $TACC_CUDA_DIR/doc/CUDA_C_Programming_Guide.pdf
    • $TACC_CUDA_DIR/doc/CUDA_Toolkit_Reference_Manual.pdf
    • $TACC_CUDA_DIR/doc/cuda-gdb.pdf

    Heterogeneous (OpenCL) Programming

    The OpenCL heterogeneous computing language is supported on all Stampede computing platforms. The Intel OpenCL environment will support the Xeon processors and Xeon Phi coprocessors, and the NVIDIA OpenCL environment supports the Tesla accelerators.

    Using the Intel OpenCL Environment

    The Intel OpenCL stack is not yet installed on Stampede. A user news announcement will be sent once it is installed.

    Using the NVIDIA OpenCL Environment

    The NVIDIA OpenCL environment supports the v1.1 API and is accessible through the cuda module:

      login1$ module load cuda

    For programming with NVIDIA OpenCL, please see the OpenCL specification at:

    Use the g++ compiler to compile NVIDIA-based OpenCL code. The include files are located in the $TACC_CUDA_DIR/include subdirectory. The OpenCL library is installed in the /usr/lib64 directory, which is on the default library path. Use these paths and g++ options to compile OpenCL code:

      login1$ export OCL=$TACC_CUDA_DIR
      login1$ g++ -I $OCL/include -lOpenCL prog.cpp

    Visualization on Stampede

    While batch visualization can be performed on any Stampede node, a set of nodes have been configured for hardware-accelerated rendering. The vis queue contains a set of 128 compute nodes configured with one NVIDIA K20 GPU each. The largemem queue contains a set of 16 compute nodes configured with one NVIDIA Quadro 2000 GPU each.

    Remote Desktop Access

    Remote desktop access to Stampede is provided through a VNC connection to one or more visualization nodes. Users must first connect to a Stampede login node (see System Access) and submit a special interactive batch job that:

    • allocates a set of Stampede visualization nodes
    • starts a vncserver process on the first allocated node
    • sets up a tunnel through the login node to the vncserver access port

    Once the vncserver process is running on the visualization node and a tunnel through the login node is created, an output message identifies the access port for connecting a VNC viewer. A VNC viewer application is run on the user's remote system and presents the desktop to the user.

    Follow the steps below to start an interactive session.

    1. Start a Remote Desktop

      TACC has provided a VNC job script (/share/doc/slurm/job.vnc) that requests one node in the vis queue for four hours, creating a VNC session.

       login1$ sbatch /share/doc/slurm/job.vnc

      You may modify or overwrite script defaults with sbatch command-line options:

      • "-t hours:minutes:seconds" modifies the job runtime
      • "-A projectnumber" specifies the project to be charged
      • "-N nodes" sets the number of nodes needed
      • "-p partition" specifies an alternate partition (queue)

      See more sbatch options in Table 7.4

      All arguments after the job script name are sent to the vncserver command. For example, to set the desktop resolution to 1440x900, use:

       login1$ sbatch /share/doc/slurm/job.vnc -geometry 1440x900

      The job.vnc script starts a vncserver process and writes the output file, vncserver.out, in the job submission directory; this file contains the connection port for the vncviewer. Watch for the "To connect via VNC client" message at the end of the output file, or watch the output stream in a separate window with the commands:

       login1$ touch vncserver.out ; tail -f vncserver.out

      The lightweight window manager, xfce, is the default VNC desktop and is recommended for remote performance. GNOME is available; to use GNOME, open the "~/.vnc/xstartup" file (created after your first VNC session) and replace "startxfce4" with "gnome-session". Note that GNOME may lag over slow internet connections.

    2. Create an SSH Tunnel to Stampede

      TACC requires users to create an SSH tunnel from the local system to the Stampede login node to assure that the connection is secure. On a Unix or Linux system, execute the following command once the port has been opened on the Stampede login node:

       localhost$ ssh -f -N -L xxxx:stampede.tacc.utexas.edu:yyyy username@stampede.tacc.utexas.edu


      • "yyyy" is the port number given by the vncserver batch job
    • "xxxx" is a port on your local system. Generally, the port number specified on the Stampede login node, yyyy, is a good choice to use on your local system as well
    • "-f" puts the ssh command into the background after connecting
    • "-N" instructs ssh not to execute a remote command, only forward ports
      • "-L" forwards the port

      On Windows systems, find the menu in the Windows SSH client where tunnels can be specified, enter the local and remote ports as required, and then ssh to Stampede.

    3. Connecting vncviewer

      Once the SSH tunnel has been established, use a VNC client to connect to the local port you created, which will then be tunneled to your VNC server on Stampede. Connect to localhost::xxxx, where xxxx is the local port you used for your tunnel. (Some VNC clients accept localhost:xxxx.)

      We recommend the TigerVNC VNC Client, a platform independent client/server application.

      Once the desktop has been established, two initial xterm windows are presented (which may be overlapping). One, which is white-on-black, manages the lifetime of the VNC server process. Killing this window (typically by typing "exit" or "ctrl-D" at the prompt) will cause the vncserver to terminate and the original batch job to end. Because of this, we recommend that this window not be used for other purposes; it is just too easy to accidentally kill it and terminate the session.

      The other xterm window is black-on-white, and can be used to start serial programs running on the node hosting the vncserver process, or parallel jobs running across the set of cores associated with the original batch job. Additional xterm windows can be created using the window-manager left-button menu.

    Running Applications on the VNC Desktop

    From an interactive desktop, applications can be run from icons or from xterm command prompts. Two special cases arise: running parallel applications, and running applications that use OpenGL.

    Running Parallel Applications from the Desktop

    Parallel applications are run on the desktop using the same ibrun wrapper described above (see Running Applications). The command:

    c442-001$ ibrun [ibrun options] application [application options]

    will run application on the associated nodes, as modified by the ibrun options.

    Running OpenGL/X Applications On The Desktop

    Running OpenGL/X applications on Stampede visualization nodes requires that the native X server be running on each participating visualization node. Like other TACC visualization servers, on Stampede the X servers are started automatically on each node (this happens for all jobs submitted to the vis and largemem queues).

    Once native X servers are running, several scripts are provided to enable rendering in different scenarios.

    • vglrun: Because VNC does not support OpenGL applications, VirtualGL is used to intercept OpenGL/X commands issued by application code and redirect them to a local native X display for rendering; rendered results are then automatically read back and sent to VNC as pixel buffers. To run an OpenGL/X application from a VNC desktop command prompt:
        c442-001$ vglrun [vglrun options] application application-args
    • tacc_xrun: Some visualization applications present a client/server architecture, in which every process of a parallel server renders to local graphics resources, then returns rendered pixels to a separate, possibly remote client process for display. By wrapping server processes in the tacc_xrun wrapper, the $DISPLAY environment variable is manipulated to share the rendering load across the two GPUs available on each node. For example,
        c442-001$ ibrun tacc_xrun application application-args
      will cause the tasks to utilize each node, but will not render to any VNC desktop windows.
    • tacc_vglrun: Other visualization applications incorporate the final display function in the root process of the parallel application. This case is much like the one described above except for the root node, which must use vglrun to return rendered pixels to the VNC desktop. For example,
        c442-001$ ibrun tacc_vglrun application application-args
      will cause the tasks to utilize the GPU for rendering, but will transfer the root process' graphics results to the VNC desktop.

    Visualization Applications

    Stampede provides a set of visualization-specific modules listed below:

    • VisIt: access to the VisIt visualization application
    • ParaView: access to the ParaView visualization application

    Running Parallel VisIt on Stampede

    VisIt was compiled with the Intel compiler and the mvapich2 MPI stack.

    After connecting to a VNC server on Stampede, as described above, load the VisIt module at the beginning of your interactive session before launching the VisIt application:

    c442-001$ module load visit
    c442-001$ vglrun visit

    VisIt first loads a dataset and presents a dialog allowing for selecting either a serial or parallel engine. Select the parallel engine. Note that this dialog will also present options for the number of processes to start and the number of nodes to use; these options are actually ignored in favor of the options specified when the VNC server job was started.

    Preparing data for Parallel Visit

    In order to take advantage of parallel processing, VisIt input data must be partitioned and distributed across the cooperating processes. This requires that the input data be explicitly partitioned into independent subsets at the time it is input to VisIt. VisIt supports SILO data, which incorporates a parallel, partitioned representation. Otherwise, VisIt supports a metadata file (with a .visit extension) that lists multiple data files of any supported format that are to be associated into a single logical dataset. In addition, VisIt supports a "brick of values" format, also using the .visit metadata file, which enables single files containing data defined on rectilinear grids to be partitioned and imported in parallel. Note that VisIt does not support VTK parallel XML formats (.pvti, .pvtu, .pvtr, .pvtp, and .pvts). For more information on importing data into VisIt, see Getting Data Into VisIt; though this documentation refers to VisIt version 2.0, it appears to be the most current available.
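
    As a hedged illustration (the file names here are hypothetical), a ".visit" metadata file is simply a plain-text list of the per-partition data files, optionally preceded by a !NBLOCKS directive giving the number of files that make up each time step:

    ```
    !NBLOCKS 4
    part0.vtk
    part1.vtk
    part2.vtk
    part3.vtk
    ```

    VisIt then treats the four files as a single logical dataset and can assign the partitions to different parallel engine processes.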

    Running Parallel ParaView on Stampede

    After connecting to a VNC server on Stampede, as described above, do the following:

    1. Set the $NO_HOSTSORT environment variable to 1

      csh shell:  login1% setenv NO_HOSTSORT 1
      bash shell: login1$ export NO_HOSTSORT=1

    2. Set up your environment with the necessary modules:

      If you intend to use the Python interface to ParaView via any of the following methods:

      • the Python scripting tool available through the ParaView GUI
      • pvpython
      • loading the paraview.simple module into python

      then load the python, qt and paraview modules in this order:

       c442-001$ module load python qt paraview
      Otherwise, just load the qt and paraview modules in this order:

       c442-001$ module load qt paraview

      Note that the qt module is always required and must be loaded prior to the paraview module.

    3. Launch ParaView:

       c442-001$ vglrun paraview [paraview client options]

    4. Click the "Connect" button, or select File -> Connect
    5. If this is the first time you've used ParaView in parallel (or failed to save your connection configuration in your prior runs):

      1. Select "Add Server"
      2. Enter a "Name" e.g. "ibrun"
      3. Click "Configure"
      4. For "Startup Type" in the configuration dialog, select "Command" and enter the command:
         ibrun tacc_xrun pvserver [paraview server options]
        and click "Save"
      5. Select the name of your server configuration, and click "Connect"

    You will see the parallel servers being spawned and the connection established in the ParaView Output Messages window.


    Timing Tools

    Measuring the performance of a program should be an integral part of code development. It provides benchmarks to gauge the effectiveness of performance modifications and can be used to evaluate the scalability of the whole package and/or specific routines. There are quite a few tools for measuring performance, ranging from simple timers to hardware counters. Reporting methods vary too, from simple ASCII text to X-Window graphs of time series.

    Most of the advanced timing tools access hardware counters and can provide performance characteristics about floating point/integer operations, as well as memory access, cache misses/hits, and instruction counts. Some tools can provide statistics for an entire executable with little or no instrumentation, while others require source code modification.

    The most accurate way to evaluate changes in overall performance is to measure the wall-clock (real) time when an executable is running in a dedicated environment. On Symmetric Multi-Processor (SMP) machines, where resources are shared (e.g., the TACC IBM Power4 P690 nodes), user time plus sys time is a reasonable metric; but the values will not be as consistent as when running without any other user processes on the system. The user and sys times are the amount of time a user's application executes the code's instructions and the amount of time the kernel spends executing system calls on behalf of the user, respectively.
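
    The distinction can be seen with the standard C library: clock() accumulates CPU time consumed by the process, while gettimeofday() reports elapsed wall-clock time. The sketch below (illustrative only; the loop size and constants are arbitrary) measures both around a busy loop:

    ```c
    /* Sketch: distinguishing wall-clock time from CPU time.
       clock() measures CPU time consumed by this process;
       gettimeofday() measures elapsed wall-clock time. */
    #include <stdio.h>
    #include <time.h>
    #include <sys/time.h>

    /* Return the current wall-clock time in seconds as a double. */
    double wall_seconds(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
    }

    int main(void) {
        double w0 = wall_seconds();
        clock_t c0 = clock();

        /* Busy work: consumes both CPU time and wall time. */
        volatile double sum = 0.0;
        for (long i = 0; i < 20000000L; i++)
            sum += (double)i * 0.5;

        double cpu  = (double)(clock() - c0) / CLOCKS_PER_SEC;
        double wall = wall_seconds() - w0;

        printf("wall = %.3f s, cpu = %.3f s\n", wall, cpu);
        return 0;
    }
    ```

    On a dedicated node the two values will be close; on a shared node the wall time can grow well beyond the CPU time.
    
    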

    Package Timers

    The time command is available on most UNIX systems. In some shells there is a built-in time command, but it doesn't have the functionality of the command found in /usr/bin. Therefore you might have to use the full pathname to access the time command in /usr/bin. To measure a program's time, run the executable with time using the syntax:

    login1$ /usr/bin/time -p ./a.out

    The -p option specifies traditional precision output, units in seconds. See the time man page for additional information.

    To use time with an MPI task, use:

    login1$ /usr/bin/time -p mpirun -np 4 ./a.out

    This example provides timing information only for the rank 0 task on the master node (the node that executes the job script); however, the time output labeled real is applicable to all tasks since MPI tasks terminate together. The user and sys times may vary markedly from task to task if they do not perform the same amount of computational work (not load balanced).

    Code Section Timers

    Section timing is another popular mechanism for obtaining timing information. Use these to measure the performance of individual routines or blocks of code by inserting the timer calls before and after the regions of interest. Several of the more common timers and their characteristics are listed below.

    Table 6.1 Code Section Timers

    Routine         Type            Resolution (usec)   OS/Compiler
    times           user/sys        1000                Linux/AIX/IRIX/UNICOS
    getrusage       wall/user/sys   1000                Linux/AIX/IRIX
    gettimeofday    wall clock      1                   Linux/AIX/IRIX/UNICOS
    rdtsc           wall clock      0.1                 Linux
    read_real_time  wall clock      0.001               AIX
    system_clock    wall clock      system dependent    Fortran90 Intrinsic
    MPI_Wtime       wall clock      system dependent    MPI Library (C and Fortran)

    For general purpose or coarse-grain timings, precision is not important; therefore, the millisecond and MPI/Fortran timers should be sufficient. These timers are available on many systems and hence can also be used when portability is important. For benchmarking loops, it is best to use the most accurate timer (and time as many loop iterations as possible to obtain a time duration at least an order of magnitude larger than the timer resolution). The times, getrusage, gettimeofday, rdtsc, and read_real_time timers have been packaged into a group of C wrapper routines (also callable from Fortran). The routines are function calls that return double (precision) floating point numbers with units in seconds. All of these TACC wrapper timers (x_timer) can be accessed in the same way:


    Fortran:

        real*8, external :: x_timer
        real*8 :: sec0, sec1, tseconds
        sec0 = x_timer()
        sec1 = x_timer()
        tseconds = sec1-sec0

    C:

        double x_timer(void);
        double sec0, sec1, tseconds;
        sec0 = x_timer();
        sec1 = x_timer();
        tseconds = sec1-sec0;
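
    For illustration, a minimal wall-clock wrapper in this style might be implemented with gettimeofday. This is a sketch of the pattern, not the TACC x_timer source:

    ```c
    /* Sketch of a gettimeofday()-based wall-clock timer wrapper, similar
       in spirit to the TACC x_timer routines (illustration only; not the
       TACC implementation). Returns seconds as a double. */
    #include <stdio.h>
    #include <sys/time.h>

    double x_timer(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
    }

    int main(void) {
        double sec0 = x_timer();

        /* ... region of interest ... */
        volatile double s = 0.0;
        for (long i = 0; i < 1000000L; i++)
            s += (double)i;

        double sec1 = x_timer();
        printf("elapsed: %.6f s\n", sec1 - sec0);
        return 0;
    }
    ```
    
    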

    Standard Profilers

    The gprof profiling tool provides a convenient mechanism to obtain timing information for an entire program or package. gprof reports a basic profile of how much time is spent in each subroutine and can direct developers to where optimization might be beneficial to the most time-consuming routines, the hotspots. As with all profiling tools, the code must be instrumented to collect the timing data and then executed to create a raw-data report file. Finally, the data file must be read and translated into an ASCII report or a graphic display. The instrumentation is accomplished by simply recompiling the code using the "-p" (Intel compiler) option. The compilation, execution, and profiler commands for gprof are shown below with a sample Fortran program.

    Profiling Serial Executables

      login1$ ifort -p prog.f90  ;instruments code
      login1$ a.out              ;produces gmon.out trace file
      login1$ gprof              ;reads gmon.out (default args: a.out gmon.out),
                                 ;report sent to STDOUT

    Profiling Parallel Executables

      login1$ mpif90 -p prog.f90            ;instruments code
      login1$ setenv GMON_OUT_PREFIX gout   ;forces each task to produce a gout.<pid> file
      login1$ mpirun -np <#> a.out          ;produces gout.<pid> trace files
      login1$ gprof -s a.out gout.*         ;combines gout files into gmon.sum
      login1$ gprof a.out gmon.sum          ;reads executable (a.out) and gmon.sum,
                                            ;report sent to STDOUT
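
    As a toy target for the workflow above (the function names are hypothetical), the program below concentrates its work in one routine; after compiling with -p (Intel) or -pg (GNU) and running it, gprof's flat profile should attribute nearly all of the time to hotspot():

    ```c
    /* A minimal, hypothetical program for trying the gprof workflow:
       compile with "ifort/icc -p" (or "gcc -pg"), run to produce
       gmon.out, then inspect with gprof. hotspot() should dominate
       the flat profile; cheap() should be negligible. */
    #include <stdio.h>

    double hotspot(long n) {
        double sum = 0.0;
        for (long i = 1; i <= n; i++)
            sum += 1.0 / (double)i;   /* the "expensive" work */
        return sum;
    }

    double cheap(void) {
        return 42.0;                  /* near zero in the profile */
    }

    int main(void) {
        double h = hotspot(5000000L);
        double c = cheap();
        printf("hotspot=%f cheap=%f\n", h, c);
        return 0;
    }
    ```
    
    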

    Detailed documentation is available at

    Profiling with PerfExpert

    Source-code performance optimization has four stages: measurement, diagnosis of bottlenecks, determination of optimizations, and rewriting source code. Executing these steps for today's complex multi-processor and heterogeneous computer architectures requires a wide spectrum of knowledge that many application developers would rather not have to learn. PerfExpert, an expert system built on generations of performance measurement and analysis tools, utilizes knowledge of architectures and compilers to implement (partial) automation of performance optimization for multicore chips and heterogeneous nodes of cluster computers. PerfExpert automates the first three performance optimization stages, then implements those optimizations as part of the fourth stage.

    PerfExpert is available on the Stampede Sandy Bridge nodes, but not yet on the MICs. PerfExpert depends on the Java interface, HPCToolkit, and the PAPI hardware counter utility, and requires the papi, hpctoolkit, and perfexpert modules to be loaded. The "module help" command provides additional information.

    login1$ module load papi hpctoolkit perfexpert
    login1$ module help perfexpert


    The following manuals and other reference documents were used to gather information for this User Guide and may contain additional information of use.


    TACC resources are deployed, configured, and operated to serve a large, diverse user community. It is important that all users are aware of and abide by TACC Usage Policies. Failure to do so may result in suspension or cancellation of the project and associated allocation and closure of all associated logins. Illegal transgressions will be addressed through UT and/or legal authorities. The Usage Policies are documented here:


    TACC offers several means of user support.

    * Stampede was generously funded by the National Science Foundation (NSF) through award ACI-1134872.

    Last update: April 16, 2014