NOTICE: On February 4, 2013 the Ranger compute cluster and the Spur visualization cluster will be decommissioned. The last day of production for both systems is February 3, 2013. After February 3, 2013 users will NOT be able to run jobs on either system. The Ranger and Spur login nodes will, however, remain available throughout the month of February so that users can access their $WORK and $SCRATCH directories and migrate data to Stampede.


Ranger User Guide

System Overview
System Access
Transferring files to Ranger
Application Development
Running your applications
Tools
Reference

System Overview

Overview

The Sun Constellation Linux Cluster, named Ranger, is one of the largest computational resources in the world. Ranger was made possible by a grant awarded by the National Science Foundation in September 2006 to TACC and its partners including Sun Microsystems, Arizona State University, and Cornell University. Ranger entered formal production on February 4, 2008 and supports high-end computational science for NSF XSEDE researchers throughout the United States, academic institutions within Texas, and components of the University of Texas System.

The Ranger system comprises 3,936 16-way SMP compute nodes providing 15,744 AMD Opteron processors for a total of 62,976 compute cores, 123 TB of total memory and 1.7 PB of raw global disk space. It has a theoretical peak performance of 579 TFLOPS. All Ranger nodes are interconnected using InfiniBand technology in a full-CLOS topology providing 1 GB/sec point-to-point bandwidth. A 10 PB capacity archival system is available for long-term storage and backups. Pictures highlighting various components of the system are shown in Figures 1-3.

Figure 1. One of six Ranger rows: Management/IO Racks (front, black), Compute Rack (silver), and In-row Heat Exchanger (black).
Figure 2. SunBlade x6420 motherboard (compute blade).
Figure 3. InfiniBand Constellation Core Switch (partially wired).

Architecture

The Ranger compute and login nodes run a Linux OS and are managed by the Rocks 4.1 cluster toolkit. Two 3456 port Constellation switches provide dual-plane access between NEMs (Network Element Modules) of each 12-blade chassis. Several global, parallel Lustre filesystems have been configured to target different storage needs. The configuration and features for the compute nodes, interconnect and I/O systems are described below.

Ranger is a blade-based system. Each node is a SunBlade x6420 blade running a 2.6.18.8 Linux kernel. Each node contains four AMD Opteron Quad-Core 64-bit processors (16 cores in all) on a single board, as an SMP unit. The core frequency is 2.3 GHz, and each core can execute 4 floating-point operations per clock period, for a peak performance of 9.2 GFLOPS/core or 147.2 GFLOPS/node.

Each node contains 32 GB of memory. The memory subsystem has a 1.0 GHz HyperTransport system bus and 2 channels of 667 MHz DDR2 DIMMs. Each socket possesses an independent memory controller connected directly to an L3 cache.

The interconnect topology is a 7-stage, full-CLOS fat tree with two large Sun InfiniBand Datacenter switches at the core of the fabric (each switch can support up to 3,456 SDR InfiniBand ports). Each of the 328 compute chassis is connected directly to the 2 core switches. Twelve additional frames are also connected directly to the core switches and provide filesystem, administration, and login capabilities.

Filesystems: Ranger's filesystems are hosted on 72 Sun x4500 disk servers, each containing 48 SATA drives, and six Sun x4600 metadata servers. From this aggregate space of 1.7PB, three global filesystems are available for all users.

File Systems

Ranger has several different filesystems with distinct storage characteristics. There are predefined directories in these filesystems for you to store your data. Since these filesystems are shared with others, they are managed either by a quota limit or a purge policy.

Ranger Shared Filesystems

Three Lustre filesystems are available: $HOME, $WORK, and $SCRATCH. The $HOME directory has a 6GB quota. $WORK on Ranger is a non-purged, non-backed up filesystem with a 200GB quota. $SCRATCH is periodically purged and not backed up, and has a very large, 400TB quota. All filesystems also impose an inode limit, which affects the number of files allowed.

$HOME
  • At login, the system automatically sets the current working directory to your home directory.
  • Store your source code and build your executables here.
  • This directory has a quota limit of 6 GB.
  • The frontend nodes and any compute node can access this directory.
  • Use $HOME to reference your home directory in scripts.
  • Use cdh to change to $HOME.
$WORK
  • Store large files here.
  • Change to this directory in your batch scripts and run jobs in this filesystem.
  • This directory has a quota limit of 200 GB.
  • The frontend nodes and any compute node can access this directory.
  • Currently, there is no purging policy in effect on this filesystem.
  • This filesystem is not backed up.
  • Use $WORK to reference this directory in scripts.
  • Use cdw to change to $WORK.
$SCRATCH
  • This is NOT a local disk filesystem on each node.
  • This is a global Lustre filesystem for storing temporary files.
  • The quota on this system is 400 TB.
  • Files on this filesystem may be purged when a file's access time exceeds 10 days.
  • If possible, have your job scripts use and store files directly in $WORK (to avoid moving files from $SCRATCH later, before they are purged).
  • Use $SCRATCH to reference this filesystem in scripts.
  • Use cds to change to $SCRATCH.

NOTE: TACC staff may delete files from scratch if the scratch filesystem becomes full, even if files are less than 10 days old. A full filesystem inhibits use of the filesystem for everyone. The use of programs or scripts to actively circumvent the file purge policy will not be tolerated.
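
Putting the recommendations above into practice, a batch script typically changes to $WORK to run and copies anything worth keeping out of $SCRATCH before it is purged. This is only a sketch; the project directory, executable, and file names are placeholders:

cd $WORK/my_project                       # run where files are not purged
ibrun ./a.out                             # hypothetical MPI executable
cp $SCRATCH/my_project/temp_out.dat .     # rescue any needed files written to $SCRATCH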

$ARCHIVE

Additionally, the $ARCHIVE variable is set on Ranger to point to the user's directory on the tape archival system, Ranch, and $ARCHIVER is set to the Ranch hostname. Use these variables when copying files to Ranch. For example:

login3$ scp filename ${ARCHIVER}:${ARCHIVE}

See the Ranch User Guide for more information on archiving files.

Ranger Local Filesystems

Although Ranger has no local disk, there are two filesystems that use RAM for a local filesystem. These filesystems may be used to greatly improve I/O performance. However, any files written to these filesystems are purged at the end of the job. Files that need to be saved must be copied to $SCRATCH, $WORK, or $HOME at the end of the batch script.

/tmp

A small amount of space is carved out of memory as a tmpfs filesystem in /tmp. This filesystem is limited to 300 MB.

/dev/shm

For the adventurous, the /dev/shm partition may also be used as a local filesystem. As with /tmp, this is also carved out of memory and provides up to 16 GB of space. However, this 16 GB comes from the total 32 GB of RAM per node, and, if oversubscribed, may cause your job to crash unexpectedly. Use at your own risk.
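
Purely as an illustration (directory and file names are placeholders), a batch script could stage I/O-intensive work in a RAM filesystem and save the results before the job ends:

mkdir -p /dev/shm/$USER              # RAM-backed scratch area (shares the node's 32 GB)
cd /dev/shm/$USER
$WORK/my_project/io_heavy_app        # hypothetical executable writing local temporary files
cp results.dat $WORK/my_project/     # save results; RAM contents vanish when the job ends
rm -rf /dev/shm/$USER                # release the memory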

Determining your filesystem usage

To determine the amount of disk space used in a filesystem, cd to the directory of interest and execute the following command:

login3$ df -k .

It is important to include the dot which represents the current directory. Without the dot, all filesystems are reported.

In the command output below, the filesystem name appears on the left, and the used and available space (-k, in units of 1 KB blocks) appear in the middle columns, followed by the percent used and the mount point:

login3$ df -k .
File System 1k-blocks Used Available Use% Mounted on
129.114.97.1@o2ib:/share 103836108384 290290196 103545814376 1% /share

To determine the amount of space occupied in a user-owned directory, cd to the directory and execute the du command with the -sm options (s=summary, m=units of MB):

login3$ du -sm *

To determine quota limits and usage, execute the lfs quota command with your login (user) name and the directory of interest:

login3$ lfs quota -u userid $HOME
login3$ lfs quota -u userid $WORK
login3$ lfs quota -u userid $SCRATCH

To determine quota limits and usage for yourself and your group:

login3$ lfs quota $HOME

System Access

Methods of Access

ssh

To ensure a secure login session, users must connect to machines using the secure shell program, ssh. Unix-based systems generally have an ssh client available locally. Freely available clients for other platforms also exist; a popular choice for Windows users is PuTTY.

To initiate an ssh connection to a Ranger login node from a UNIX or Linux system with an ssh client already installed, execute the following command:

login3$ ssh userid@ranger.tacc.utexas.edu

where userid is replaced with the Ranger user name assigned to you during the allocation process. Note that this userid specification is only required if the user name on the local machine and the TACC machine differ. You can also login to a specific login node. For example, to log in to Ranger's login4 node, execute the following:

login3$ ssh userid@login4.ranger.tacc.utexas.edu

Important Note: Do not run the optional ssh-keygen command on Ranger to set up public-key authentication. This option sets up a passphrase that will interfere with applications running in the batch queues. If you have already done this, remove the .ssh directory (and the files under it) from your top-level home directory. Log out and log back in to generate a new .ssh directory with TACC generated keys.

Logging into Ranger

Login Shell

The most important component of a user's environment is the login shell that interprets text on each interactive command line and statements in shell scripts. Each login has a line entry in the /etc/passwd file, and the last field contains the shell launched at login. To determine your login shell, use:

login3$ echo $SHELL

You can use the chsh command to change your login shell. Full instructions are in the chsh man page. Available shells, along with their full paths, are defined in the /etc/shells file.

To display the list of available shells with chsh and change your login shell to bash, execute the following:

login3$ chsh -l <username>
login3$ chsh -s /bin/bash

User Environment

The next most important component of a user's environment is the set of environment variables. Many of the UNIX commands and tools, such as the compilers, debuggers, profilers, editors, and just about all applications that have GUIs (Graphical User Interfaces), look in the environment for variables that specify information they may need to access. To see the variables in your environment execute the command:

login3$ env

The variables are listed as keyword/value pairs separated by an equal (=) sign, as illustrated below by the $HOME and $PATH variables.

HOME=/home/utexas/staff/jones
PATH=/bin:/usr/bin:/usr/local/apps:/opt/intel/bin

Notice that the $PATH environment variable consists of a colon (:) separated list of directories. Variables set in the environment (with setenv for C shells and export for Bourne shells) are "carried" to the environment of shell scripts and new shell invocations, while normal "shell" variables (created with an assignment) are useful only in the present shell. Only environment variables are displayed by the env (or printenv) command. Execute set to see the (normal) shell variables.
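
For example (the variable names are arbitrary), an environment variable is created with export (Bourne shells) or setenv (C shells), while a plain assignment creates a shell variable that is not passed on to child shells:

login3$ export MY_DATA=$WORK/inputs      # environment variable, Bourne shells (sh/bash)
login3$ setenv MY_DATA $WORK/inputs      # environment variable, C shells (csh/tcsh)
login3$ myvar=42                         # normal shell variable (Bourne shells); not exported
login3$ env | grep MY_DATA               # environment variables appear in env output
login3$ set | grep myvar                 # shell variables are shown by set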

Startup Scripts

Unix shells allow the user to customize their environment via startup files containing scripts. Since there are several types of shell, setting up these customizations is not entirely trivial. Simple instructions follow, along with a fuller explanation of what is going on.

Technical background

All UNIX systems set up a default environment and provide administrators and users with the ability to execute additional UNIX commands to alter that environment. These commands are sourced; that is, they are executed by your login shell, and the variables (both normal and environmental), as well as aliases and functions, are included in the present environment. TACC supports the Bourne shell and its variants (/bin/sh, /bin/bash) and the C shell and its variants (/bin/csh, /bin/tcsh). Each shell's environment is controlled by system-wide and the user's own startup files. TACC deploys system-specific startup files in the /etc/profile.d/ directory.

Each UNIX shell may be invoked in three different ways: as a login shell, as an interactive shell or as a non-interactive shell. The differences between a login and interactive shell are rather arcane. For our purposes, just be aware that each type of shell runs different startup scripts at different times depending on how it's invoked. Both login and interactive shells are shells in which the user interacts with the operating system via a terminal window. A user issues standard command-line instructions interactively. A non-interactive shell is one that is launched by a script and does not interact with the user such as when a queued job runs. We've outlined the correct startup file configurations for the bash and csh shells below.

The commands in the startup scripts set operating system interaction and the initial PATH, ulimit, umask, and environment variables such as HOSTNAME. They also source command scripts in /etc/profile.d -- the /etc/csh.cshrc sources files ending in ".csh", and /etc/profile sources files ending in ".sh". Many site administrators use these scripts to set up the environments for common user tools (vim, less, etc.) and system utilities (ganglia, modules, Globus, etc.)

For long-time TACC users

Long-time users of TACC computers have controlled their initial environment with the system-supplied versions of ~/.profile (bash) or ~/.login and ~/.cshrc (csh), and have placed their personal setup in ~/.profile_user (bash) or ~/.login_user (csh). We will continue to support sourcing these files. However, this old system is no longer required: all the required setup has been moved to system files, so users are free to have their own startup files.

Startup files in a nutshell

To ensure that your modules and other customizations will always be properly loaded, please follow the instructions below. It is best to load modules in your ~/.bashrc or ~/.cshrc file so that your modules are always loaded independent of login or interactive shell startup. This way your modules are loaded only once and not reloaded on every sub-shell. This can be important when running large parallel jobs.

/bin/bash startup file configurations

To load modules correctly, bash users must do two things:

  1. Create a .profile file in your home directory containing the code snippet below. These commands will source your ~/.bashrc file.

    if [ -f ~/.bashrc ]; then
        . ~/.bashrc
    fi

  2. Then place any module commands in your ~/.bashrc file:

    if [ -z "$__BASHRC_READ__" -a "$ENVIRONMENT" != BATCH ]; then
        export __BASHRC_READ__="read"
        # place module commands below this line
        module load git fftw2
        module load gotoblas
    fi
    # no module commands below this line

One might ask: why can't I just put everything in my ~/.profile file and ignore my ~/.bashrc? The answer is that bash login shells source only your ~/.profile, and bash interactive shells source only your ~/.bashrc. A login shell therefore sources ~/.profile, which in turn sources ~/.bashrc, whereas an interactive shell ignores ~/.profile and sources only your ~/.bashrc.

/bin/csh startup file configurations

The C based shells (csh, tcsh, etc.) source two types of files. The .*rc file, (.cshrc, .tcshrc) is sourced first. This file is used to set up the execution environment used by all scripts and for access to the machine without an interactive login. For example, the following commands execute only the .cshrc type files on the remote machine:

login1$ scp data lonestar.tacc.utexas.edu:
login1$ ssh lonestar.tacc.utexas.edu date

The ~/.cshrc file is always sourced upon any type of shell invocation. Place any module commands in your ~/.cshrc file:

if ( ! $?__CSHRC_READ__ && ! $?ENVIRONMENT ) then
    setenv __CSHRC_READ__ "read"
    # place module commands below this line
    module load git fftw2
    module load gotoblas
endif
# no module commands below this line

Transferring files to Ranger

TACC supports the use of ssh/scp, rcp and globus-url-copy for file transfers.

globus-url-copy

To transfer data between XSEDE sites, use globus-url-copy; see the Globus documentation for complete command details. The globus-url-copy command requires the use of an XSEDE certificate to create a proxy for password-less transfers. It has a complex syntax, but provides high-speed access to other XSEDE machines that support GridFTP services (the protocol used by globus-url-copy). High-speed transfers of a file or directory occur between the GridFTP servers at the XSEDE sites. The GridFTP servers mount the filesystems of the compute machines, thereby providing access to your files or directories. Third-party transfers are possible (transfers initiated between two machines from another machine). For a list of XSEDE GridFTP servers and mounted directory names, please see the XSEDE Data Transfers & Management page.

Use myproxy-logon to obtain a proxy certificate. For example:

login3$ myproxy-logon

This command will prompt for the certificate password. The proxy is valid for 12 hours for all logins on the local machine. With globus-url-copy, you must include the name of the server and a full path to the file. The general syntax looks like:

login3$ globus-url-copy <options>
   gsiftp://<gridftp_server1>/<directory>|<file> \
   gsiftp://<gridftp_server2>/<directory>|<file>

An example file transfer might look like this:

login3$ globus-url-copy -stripe -tcp-bs 11M -vb \
   gsiftp://gridftp.ranger.tacc.xsede.org/`pwd`/file1 \
   gsiftp://tg-gridftp.ncsa.xsede.org/home/ncsa/johndoe/file2

An example directory transfer might look like this:

login3$ globus-url-copy -stripe -tcp-bs 11M -vb \
   gsiftp://gridftp.ranger.tacc.xsede.org/`pwd`/directory1/ \
   gsiftp://gridftp.ranch.tacc.xsede.org/home/00000/johndoe/directory2/

Use the stripe and buffer size options (-stripe to use multiple service nodes, -tcp-bs 11M to set ftp data channel buffer size). Otherwise, the speed will be about 20 times slower! When transferring directories, the directory path must end with a slash (/). The -rp option (not shown) allows paths relative to the user's "starting" directory of the filesystem mounted on the server. Without this option, you must specify the full path to the file or directory.

Application Development

Software on TACC Resources

A collection of program libraries and software packages are supported on Ranger across diverse disciplines. For a comprehensive list of software packages available, visit the TACC Software page. These software products for the supercomputing environment have been selected on the basis of quality, history of performance, system compatibility, and benefit to the scientific community. If you need a particular software package for your work, let us know via the Consulting Form. An organized and customizable listing of all packages (name, version, etc.) as well as execution/loading information is available in the Software and Tools Table. The same information for individual packages can be found in the module files on the machine through the execution of the module help command (module help <module_name>).

Modules

TACC continually updates application packages, compilers, communications libraries, tools, and math libraries. To facilitate this task and to provide a uniform mechanism for accessing different revisions of software, TACC uses the modules utility.

At login, modules commands set up a basic environment for the default compilers, tools, and libraries. For example: the $PATH, $MANPATH, $LIBPATH environment variables, directory locations ($WORK, $HOME, etc.), aliases (cdw, cdh, etc.) and license paths. Therefore, there is no need for you to set them or update them when updates are made to system and application software.

Users that require 3rd party applications, special libraries, and tools for their projects can quickly tailor their environment with only the applications and tools they need. Using modules to define a specific application environment allows you to keep your environment free from the clutter of all the application environments you don't need.

Each of the major TACC applications has a modulefile that sets, unsets, appends to, or prepends to environment variables such as $PATH, $LD_LIBRARY_PATH, $INCLUDE_PATH, $MANPATH for the specific application. Each modulefile also sets functions or aliases for use with the application. You need only to invoke a single command to configure the application/programming environment properly. The general format of this command is:

login3$ module load <module_name>

where <module_name> is the name of the module to load. If you often need an application environment, place the module commands required in your .login_user and/or .profile_user shell startup file.

Most of the package directories are in /opt/apps ($APPS) and are named after the package. In each package directory there are subdirectories that contain the specific versions of the package.

As an example, the fftw3 package requires several environment variables that point to its home, libraries, include files, and documentation. These can be set up by loading the fftw3 module:

login3$ module load fftw3

To see a list of available modules, a synopsis of a particular modulefile's operations (in this case, fftw3), and a list of currently loaded modules, execute the following commands:

login3$ module avail
login3$ module help fftw3
login3$ module list

During upgrades, new modulefiles are created to reflect the changes made to the environment variables. TACC will always announce upgrades and module changes in advance.

Programming Models

There are two distinct memory models for computing: distributed-memory and shared-memory. In the former, the message passing interface (MPI) is employed in programs to communicate between processors that use their own memory address space. In the latter, OpenMP (OMP) programming techniques are employed for multiple threads (lightweight processes) to access memory in a common address space.

For distributed-memory systems, single-program multiple-data (SPMD) and multiple-program multiple-data (MPMD) programming paradigms are used. In the SPMD paradigm, each processor loads the same program image and executes and operates on data in its own address space (different data). This is the usual mechanism for MPI code: a single executable (a.out) is available on each node (through a globally accessible filesystem such as $WORK or $HOME) and is launched on each node (through the batch MPI launch command, "ibrun a.out").

In the MPMD paradigm, each processor loads up and executes a different program image and operates on a different data set. This paradigm is often used by researchers who are investigating the parameter space (parameter sweeps) of certain models and need to launch tens or hundreds of single-processor executions on different data. (This is a special case of MPMD in which the same executable is used and there is NO MPI communication.) The executables are launched through the same mechanism as SPMD jobs, but a UNIX script is used to assign input parameters for the execution command (through the batch MPI launcher, "ibrun script_command"). Details of the batch mechanism for parameter sweeps are described in the help information for the launcher module:

login3$ module help launcher
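
The exact mechanism is documented in the launcher help above. Purely as a sketch, a per-task wrapper script could pick its own input file from the task's MPI rank; the rank variable shown ($PMI_RANK), the executable, and the file names are assumptions that depend on your MPI stack and application:

#!/bin/bash
# hypothetical wrapper, launched as: ibrun ./sweep_task.sh
RANK=${PMI_RANK:-0}                                   # task rank; the variable name is stack-dependent
./my_model input_${RANK}.dat > output_${RANK}.log     # run one parameter case per task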

The shared-memory programming model is used on Symmetric Multi-Processor (SMP) nodes, like the large memory nodes on Lonestar (5 nodes: 24 cores, 1TB memory per node) or any Ranger node (16 cores, 32GB memory per node). The programming paradigm for this memory model is called Parallel Vector Processing (PVP) or Shared-Memory Parallel Programming (SMPP). The latter name is derived from the fact that vectorizable loops are often employed as the primary structure for parallelization. The main point of SMPP computing is that all of the processors in the same node share data in a single memory subsystem. There is no need for explicit messaging between processors as with MPI coding.

In the SMPP paradigm, either compiler directives (pragmas in C, special comments in Fortran) or explicit threading calls (e.g. with Pthreads) are employed. The majority of science codes now use OpenMP directives, which are understood by most vendor compilers as well as the GNU compilers.

In cluster systems that have SMP nodes and a high speed interconnect between them, programmers often treat all CPUs within the cluster as having their own local memory. On a node an MPI executable is launched on each CPU and runs within a separate address space. In this way, all CPUs appear as a set of distributed memory machines, even though each node has CPUs that share a single memory subsystem.

In clusters with SMPs, hybrid programming is sometimes employed to take advantage of higher performance at the node-level for certain algorithms that use SMPP (OMP) parallel coding techniques. In hybrid programming, OMP code is executed on the node as a single process with multiple threads (or an OMP library routine is called), while MPI programming is used at the cluster-level for exchanging data between the distributed memories of the nodes.

Compiling Code

The following sections present the compiler invocations for serial and MPI executions, followed by a section on options. All compiler commands can be used for compiling only (the -c option creates just the ".o" object files) or for compiling and linking (to create executables). To use a different (non-default) compiler, first unload the MPI environment (mvapich), swap the compiler environment, and then reload the MPI environment.

Compiling Serial Programs

The compiler invocation commands for the supported vendor compiler systems are tabulated below.

Table 4.1 Compiler Invocations

Vendor  Compiler  Language      File Extensions
intel   icc       C             .c
intel   icpc      C++           .C, .c, .cc, .cpp, .cxx, .c++, .i, .ii
intel   ifort     F77/F90/F95   .f, .for, .ftn, .f90, .fpp
pgi     pgcc      C             .c
pgi     pgcpp     C++           .C, .cc
pgi     pgf95     F77/F90/F95   .f, .F, .FOR, .f90, .f95, .hpf
gnu     gcc       C             .c
sun     sun_cc    C             .c
sun     sun_CC    C++           .C, .cc, .cpp, .cxx
sun     sunf90    F77/F90/F95   .f, .F, .FOR, .f90, .hpf
sun     sunf95    F95           .f, .F, .FOR, .f90, .f95, .hpf

Note: pgf90 is an alias for pgf95.

Appropriate source code file extensions are required for each compiler. By default, the executable file name is a.out. It may be renamed with the -o option. To compile without the link step, use the -c option. The following examples illustrate renaming an executable and the use of two important compiler optimization options.

An Intel C program example:

login3$ icc -o flamec.exe -O2 -xW prog.c

A PGI Fortran program example:

login3$ pgf95 -o flamef.exe -fast -tp barcelona-64 prog.f90

A gnu C program example:

login3$ gcc -o flamec.exe -mtune=barcelona -march=barcelona prog.c

A Sun Fortran program example:

login3$ sunf90 -o flamef.exe -xarch=sse2 prog.f90

To see a list of compiler options, their syntax, and a terse explanation, execute the compiler command with the -help or --help option. Also, man pages are available.

Compiling Parallel Programs with MPI

The "mpicmds" commands support the compilation and execution of parallel MPI programs for specific interconnects and compilers. At login, MPI MVAPICH (mvapich) and PGI compiler (intel) modules are loaded to produce the default environment which provides the location to the corresponding mpicmds. Compiler scripts (wrappers) compile MPI code and automatically link startup and message passing libraries into the executable. The compiler and MVAPICH library are selected according to the modules that have been loaded. The following table lists the compiler wrappers for each language:

Table 4.2 Compiler Wrappers

Wrapper  Language  File Extensions
mpicc    C         .c
mpicxx   C++       intel: .C/c/cc/cpp/cxx/c++/i/ii; pgi: .C/c/cc/cpp/cxx/I
mpif90   F77/F90   .f, .for, .ftn, .f90, .f95, .fpp

Appropriate source code file extensions are required for each wrapper. By default, the executable name is a.out. It may be renamed with the -o option. To compile without the link step, use the -c option. The following examples illustrate renaming an executable and the use of two important compiler optimization options:

An Intel Fortran example:

login3$ mpif90 -o prog.exe -O2 -xW prog.f90

A PGI C example:

login3$ mpicc -o prog.exe -fast -tp barcelona-64 prog.c

Include linker options such as library paths and library names after the program module names, as explained in the Loading Libraries Section below. The Running Code Section explains how to execute MPI executables in batch scripts and "interactive batch" runs on compute nodes.

We recommend that you use either the Intel or the PGI compiler for optimal code performance. TACC does not support the use of the gcc compiler for production codes on the Ranger system. For those rare cases when gcc is required, for either a module or the main program, you can specify the gcc compiler with the mpicc -cc option. (Since gcc- and Intel-compiled codes are binary compatible, you should compile all other modules that don't require gcc with the Intel compiler.) When gcc is used to compile the main program, an additional Intel library is required. The examples below show how to invoke the gcc compiler in combination with the Intel compiler for the two cases:

login3$ mpicc -O2 -xW -c -cc=gcc suba.c                          # module compiled with gcc
login3$ mpicc -O2 -xW mymain.c suba.o                            # main program compiled with Intel
login3$ mpicc -O2 -xW -c suba.c                                  # module compiled with Intel
login3$ mpicc -O2 -xW -cc=gcc -L$ICC_LIB -lirc mymain.c suba.o   # main program compiled with gcc; link Intel's libirc

Compiling OpenMP Programs

Since each of the blades (nodes) of the Ranger cluster is an AMD Opteron quad-processor quad-core system, applications can use the shared memory programming paradigm "on node". With a total number of 16 cores per node, we encourage the use of a shared-memory model on the node.

The OpenMP compiler options are listed below for those who need SMP support on the nodes. For hybrid programming, use the MPI compiler commands and include the OpenMP options.

With the Intel C compiler, for example:

login3$ mpicc -O2 -xW -openmp prog.c

Using the PGI Fortran compiler:

login3$ mpif90 -fast -tp barcelona-64 -mp prog.f90

Basic Optimization

The MPI compiler wrappers use the same compilers that are invoked for serial code compilation. So, any of the compiler flags used with the icc command can also be used with mpicc; likewise for ifort and mpif90; and icpc and mpicxx. Below are some of the common serial compiler options with descriptions.

Table 4.3 Intel Compiler Options

Option Description
-O3 performs some compile time and memory intensive optimizations in addition to those executed with -O2, but may not improve performance for all programs.
-ipo Creates inter-procedural optimizations.
-vec_report[0|...|5] Controls the amount of vectorizer diagnostic information.
-xW Includes specialized code for SSE and SSE2 instructions (recommended).
-xO Includes specialized code for SSE, SSE2 and SSE3 instructions. Use, if code benefits from SSE3 instructions.
-fast Includes -ipo, -O2, and -static. DO NOT USE -- static linking is not allowed.
-g -fp Produces debugging information; disables use of EBP as a general-purpose register.
-openmp Enable the parallelizer to generate multi-threaded code based on the OpenMP directives.
-openmp_report[0|1|2] Controls the OpenMP parallelizer diagnostic level.
-help Lists options.

Developers often experiment with the following options: -pad, -align, -ip, -no-rec-div and -no-rec-sqrt. In some codes performance may decrease. Please see the Intel compiler manual (below) for a full description of each option.

Use the -help option with the mpicmds commands for additional information, for example:

login3$ mpicc -help
login3$ mpif90 -help
login3$ mpirun -help

Use the options listed in the mpirun man page with the ibrun command.

Table 4.4 PGI Compiler Options

Option Description
-O3 Performs some compile time and memory intensive optimizations in addition to those executed with -O2, but may not improve performance for all programs.
-Mipa=fast,inline Creates inter-procedural optimizations. There is a loader problem with this option; include the following path in the LD_LIBRARY_PATH environment variable:
    Bourne-based shells: export LD_LIBRARY_PATH=/share/apps/binutils-amd/070220/lib64:${LD_LIBRARY_PATH}
    C-based shells: setenv LD_LIBRARY_PATH /share/apps/binutils-amd/070220/lib64:${LD_LIBRARY_PATH}
-tp barcelona-64 Includes specialized code for the barcelona chip.
-fast Includes: -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse -Mcache_align -Mflushz
-g, -gopt Produces debugging information.
-mp Enables the parallelizer to generate multi-threaded code based on the OpenMP directives.
-Minfo=mp,ipa Provides information about OpenMP, and inter-procedural optimization.
-help Lists options.
-help -fast Lists options for the -fast option.

Compiler Usage Guidelines

The AMD Compiler Usage Guidelines document provides the "best-known" peak switches for various compilers tailored to their Opteron products. Developers and installers should read Chapter 5 of this document before experimenting with PGI and Intel compiler options.

The Intel Compiler Suite

The Intel compilers are not the default compilers. You must use the module commands to load the Intel compiler environment. (The 10.1 suite is the most current installed, but the 9.1 compilers are available for special porting needs.) The gcc 3.4.6 compiler and module are also available. We recommend using the Intel (or the PGI) suite whenever possible. The 10.1 suite is installed with the 64-bit standard libraries and will compile programs as 64-bit applications (as the default compiler mode).

The PGI 7.1 Compiler Suite

The PGI 7.1 compilers are loaded as the default compilers at login with the pgi module. We are recommending the use of the PGI suite whenever possible (at this time). The 7.1 suite is installed with the 64-bit standard libraries and will compile programs as 64-bit applications (as the default compiler mode).

Selecting Compiler and MPI Environments

The Ranger programming environment supports several compilers (Intel 9/10 and PGI 7) and several MPI stacks (mvapich, mvapich2 and openmpi) for MPI programs. Each MPI stack must have a library compiled for each of the compilers, so that applications compiled with compiler x can load the x-compiled MPI libraries. Hence, the MPI environment is dependent upon the compiler environment you select. The generic way to change these two environments is to: unload the MPI environment, change the compiler environment (or not if you continue to use the present compiler environment), and then load the MPI environment you will need. The possible scenarios and examples are:

Change Only MPI stack

login3$ module unload <MPI_old>
login3$ module load <MPI_new>

Change Only Compiler (COMP)

login3$ module unload <MPI_orig>
login3$ module swap <COMP_old> <COMP_new>
login3$ module load <MPI_orig>

Change Compiler (COMP) and MPI stack

login3$ module unload <MPI_old>
login3$ module swap <COMP_old> <COMP_new>
login3$ module load <MPI_new>

Example of changing only the MPI stack:

login3$ module unload mvapich
login3$ module load mvapich2

Example of changing the compiler:

login3$ module unload mvapich
login3$ module swap pgi intel
login3$ module load mvapich

Example of changing the compiler and the MPI stack (the version is specified in this example):

login3$ module unload mvapich
login3$ module swap pgi intel/10.1
login3$ module load openmpi

By default, the pgi compiler and mvapich environments are set up at login. Execute the module avail command to determine the modulefile names for all the available compilers; they have the syntax compiler/version. Note, only certain MPI stacks and other compiler-dependent libraries are seen for each compiler environment. The above commands can be placed in your .login_user (C shells) or .profile_user (Bourne shells) file to automatically set an alternate default compiler and MPI stack in your environment at login. The matrix below shows the available combination of compilers and MPI stacks.
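
For instance, a bash user who always wants the Intel compiler with mvapich2 could place a snippet like the following in ~/.profile_user (a sketch; adjust module names and versions to your needs):

# ~/.profile_user -- replace the default pgi/mvapich environment at login
module unload mvapich
module swap pgi intel
module load mvapich2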

Table 4.5 Available Compilers and MPI Stack combinations

MPI Family     Compilers Supported           MPI-1  Full MPI-2  Notes
mvapich/1.0    pgi, intel/9.1, intel/10.1    Yes    No          The current recommended stack for large-scale analysis on Ranger; it has been used to run applications with O(32K) MPI tasks.
mvapich2/1.0   pgi, intel/9.1, intel/10.1    Yes    Yes         Supports full MPI-2 functionality with a job-startup mechanism recommended for job sizes in the range of 16-2048 tasks.
openmpi/1.2.4  pgi, intel/9.1, intel/10.1    Yes    Yes         OpenMPI also supports MPI-2 semantics and is the successor to the LAM/MPI project.

Compiler Options

Compiler options must be used to achieve optimal performance of any application. Generally, the highest impact can be achieved by selecting an appropriate optimization level, by targeting the architecture of the computer (CPU, cache, memory system), and by allowing for interprocedural analysis (inlining, etc.). There is no set of options that gives the highest speed-up for all applications. Consequently, different combinations have to be explored.

Table 4.6 More PGI Compiler Options

Option Description
-O[n] Optimization level, n=0, 1, 2 or 3
-tp barcelona-64 Targeting the architecture
-Mipa[=option] Interprocedural analysis, option=fast, inline

Table 4.7 More Intel Compiler Options

Option Description
-O[n] Optimization level, n=0, 1, 2 or 3
-x[p] Targeting the architecture, p=W or O
-ip, -ipo Interprocedural analysis

See the Development section for the use of these options.

Libraries

TACC provides several ISP (Independent Software Providers) and HPC vendor math libraries that can be used in many applications. These libraries provide highly optimized math packages and functions for the Ranger system.

The ACML library (AMD Core Math Library) contains many common math functions and routines (linear algebra, transformations, transcendental, sorting, etc.) specifically optimized for the AMD Barcelona processor. The ACML library also supports multi-processor (threading) FFT and BLAS routines. The Intel MKL (Math Kernel Library) has a similar set of math packages. Also, the GotoBLAS libraries contain the fastest set of BLAS routines for this machine.

The default compiler representation for Ranger consists of 32-bit ints and 64-bit longs and pointers. Likewise for Fortran, integers are 32-bit and pointers are 64-bit. This is called LP64 mode (Longs and Pointers are 64-bit; ints and integers are 32-bit). Libraries built with 64-bit integers are often suffixed with ILP64.

ACML and MKL libraries

The "AMD Core Math Library" and "Math Kernel Library" consists of functions with Fortran, C, and C++ interfaces for the following computational areas:

  • BLAS (vector-vector, matrix-vector, matrix-matrix operations) and extended BLAS for sparse computations
  • LAPACK for linear algebraic equation solvers and eigensystem analysis
  • Fast Fourier Transforms
  • Transcendental Functions

Note, Intel provides performance libraries for most of the common math functions and routines (linear algebra, transformations, transcendental, sorting, etc.) for their em64t and Core-2 systems. These routines also work well on the AMD Opteron microarchitecture. In addition, MKL offers a set of functions collectively known as VML -- the "Vector Math Library". VML is a set of vectorized transcendental functions that offer both high performance and excellent accuracy compared to libm for vectors longer than a few elements.

GotoBLAS library

The "GotoBLAS Library" (pronounced "goat-toe") contains highly optimized BLAS routines for the Barcelona microarchitecture. The library has been compiled for use with the PGI and Intel Fortran compilers on Ranger. The GotoBLAS routines are also supported on other architectures, and the source code is available free for academic use. We recommend the GotoBLAS libraries for performing the following linear algebra operations and solving matrix equations:

  • BLAS (vector-vector, matrix-vector, matrix-matrix operations)
  • LAPACK for linear algebraic equation solvers and eigensystem analysis

Loading Libraries

Some of the more useful load flags/options are listed below. For a more comprehensive list, consult the ld man page.

Use the -l loader option to link in a library at load time. For example:

compiler prog.f90 -l<name>

This links in either the shared library libname.so (default) or the static library libname.a, provided that the correct path can be found in the linker's (ld) library search path or in the LD_LIBRARY_PATH environment variable paths.

To explicitly include a library directory, use the -L option:

compiler prog.f -L/mydirectory/lib -lname

In this example, the libname.a library is not in the default search path, so the "-L" option is specified to point to the libname.a directory /mydirectory/lib.

MKL

Many of the modules for applications and libraries, such as the mkl library module, provide environment variables for compiling and linking commands. Execute "module help <module_name>" for a description, listing, and use cases for the assigned environment variables.

The following examples illustrate their use for the mkl library:

login3$ module load mkl
login3$ mpicc prog.c \
   -I$TACC_MKL_INC -Wl,-rpath,$TACC_MKL_LIB \
   -L$TACC_MKL_LIB -lmkl -lguide

login3$ module load mkl
login3$ mpif90 prog.f90 -Wl,-rpath,$TACC_MKL_LIB \
   -L$TACC_MKL_LIB -lmkl -lguide

Here, the module-supplied environment variables TACC_MKL_LIB and TACC_MKL_INC contain the MKL library and header file directory paths, respectively. The loader option -Wl specifies that the $TACC_MKL_LIB directory should be included in the binary executable. This allows the run-time dynamic loader to determine the location of shared libraries directly from the executable instead of from the $LD_LIBRARY_PATH environment variable or the ld.so dynamic cache of bindings between shared libraries and directory paths. This avoids having to set LD_LIBRARY_PATH (manually or through a module command) before running the executables.

gotoBLAS

The gotoBLAS library contains statically compiled routines and does not require the -Wl,-rpath option to locate the library directory at run time. The multi-threaded library (for hybrid computing) has a "_mp" suffix. Also, codes that use long long int (C) or integer kind=8 (Fortran) should use the libraries differentiated with the "_ilp64" moniker. Otherwise, for normal single-core executions within MPI or serial code, use the libgoto_lp64.a library, as shown below (after loading the gotoBLAS module with the command module load gotoblas).

login3$ mpicc prog.c -L$TACC_GOTOBLAS_LIB -lgoto_lp64
login3$ mpif90 prog.f90 -L$TACC_GOTOBLAS_LIB -lgoto_lp64

If you are using the gotoBLAS in conjunction with other libraries that include the BLAS routines, make sure to include the gotoBLAS reference (-L$TACC_GOTOBLAS_LIB -lgoto_lp64) before any other library specification. A load map, which shows the library module for each static routine, can be used to validate that gotoBLAS libraries are being called from your program. Use the following loader option to create a map:

Use the compiler-independent loader option -Wl,-Map,mymap to create a load map (written here to the file mymap).
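
For example (the map file name is arbitrary), link with the map option and then search the map to confirm that the BLAS symbols were resolved from the GotoBLAS library:

login3$ mpif90 prog.f90 -L$TACC_GOTOBLAS_LIB -lgoto_lp64 -Wl,-Map,mymap
login3$ grep -i goto mymap     # BLAS routines should be satisfied by libgoto_lp64.a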

ScaLAPACK

Since the Intel LAPACK routines are highly optimized for em64t-type architectures, TACC recommends loading the Intel MKL library to satisfy the LAPACK references within ScaLAPACK routines. In this case it is important to specify the ScaLAPACK library options BEFORE MKL and other libraries on the loader/compiler command line. Because ScaLAPACK uses two static libraries in the APIs for accessing the BLACS routines, it may be necessary to explicitly reference them more than once, since the loader makes only a single pass through the library path list while loading libraries. If you need help determining a path sequence, please contact TACC staff through the portal consulting system or the XSEDE Help Desk (help@xsede.org). The examples below illustrate the normal library specification to try for C codes, and the loading sequence found to work for Fortran codes (don't forget to load the scalapack and mkl modules with the command module load scalapack mkl):

login3$ mpicc \
   -I$TACC_SCALAPACK_INC prog.c -Wl,-rpath,$TACC_MKL_LIB \
   -L$TACC_SCALAPACK_LIB -lscalapack \
   $TACC_SCALAPACK_LIB/blacsCinit_MPI-LINUX-0.a \
   $TACC_SCALAPACK_LIB/blacs_MPI-LINUX-0.a \
   -L$TACC_MKL_LIB -lmkl -lguide -L$IFC_LIB -lifcore

login3$ mpif90 prog.f90 -Wl,-rpath,$TACC_MKL_LIB \
   -L$TACC_SCALAPACK_LIB -lscalapack \
   $TACC_SCALAPACK_LIB/blacsF77init_MPI-LINUX-0.a \
   $TACC_SCALAPACK_LIB/blacs_MPI-LINUX-0.a \
   $TACC_SCALAPACK_LIB/blacsF77init_MPI-LINUX-0.a \
   -L$TACC_MKL_LIB -lmkl -lguide

Note, the C library list includes the Fortran libifcore.a (via -lifcore) to satisfy miscellaneous Intel run-time routines; other situations may require the portability or POSIX libraries (libifport.a and libifposix.a).

Code Tuning

Additional performance can be obtained with these techniques:

  • Memory Subsystem Tuning: Optimize access to memory by minimizing the stride length and/or employing cache-blocking techniques.
  • Floating-Point Tuning: Unroll inner loops to hide FP latencies, and avoid costly operations like division and exponentiation.
  • I/O Tuning: Use direct-access binary files to improve I/O performance.

These techniques are explained in further detail, with examples, in a separate Memory Subsystem Tuning document.

Basic Optimization

The most practical approach to enhancing the performance of applications is to use advanced compiler options, employ high-performance libraries for common mathematical algorithms and scientific methods, and tune the code to take advantage of the architecture. Compiler options and libraries can provide a large benefit for a minimal amount of work. Always profile the entire application to ensure that optimization efforts are focused on the areas with the greatest return on effort.

"Hot spots" and performance bottlenecks can be discovered with basic profiling tools like prof and gprof. Observe the relative changes in performance among the routines when experimenting with compiler options. Sometimes it might be advantageous to break out routines and compile them separately with different options than those used for the rest of the package. Also, review routines for "hand-coded" math algorithms that can be replaced by standard (optimized) library routines. You should also be familiar with general code tuning methods and restructure statements and code blocks so that the compiler can take advantage of the microarchitecture.

Code should:

  • be clear and comprehensible
  • provide flexible compiler support
  • be portable

Avoid too many architecture-specific code constructs. Use language features and restructure code so the compiler can discover how to optimize code for the architecture. That is, expose optimization when possible for the compiler, but don't rewrite the code specifically for the architecture.

Some best practices:

  • Avoid excessive program modularization (i.e. too many functions/subroutines)
  • Write routines that can be inlined
  • Use macros and parameters whenever possible
  • Minimize the use of pointers
  • Avoid casts or type conversions, implicit or explicit
  • Avoid branches, function calls, and I/O inside loops
  • Structure loops to eliminate conditionals
  • Move a loop that surrounds a subroutine call into the subroutine itself

Running your Applications

Runtime Environment

Bindings to the most recent shared libraries are configured in the file /etc/ld.so.conf (and cached in the /etc/ld.so.cache file). cat the /etc/ld.so.conf file to see the TACC-configured directories, or execute:

login3$ /sbin/ldconfig -p

to see a list of directories and candidate libraries. Use the -Wl,-rpath loader option or the LD_LIBRARY_PATH environment variable to override the default runtime bindings.
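
For example (the directory is hypothetical), to have the runtime loader search a private library directory first:

login3$ export LD_LIBRARY_PATH=$HOME/mylibs:$LD_LIBRARY_PATH    # Bourne shells
login3$ setenv LD_LIBRARY_PATH $HOME/mylibs:${LD_LIBRARY_PATH}  # C shells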

The Intel compiler, MKL math libraries, GOTO libraries, and other application libraries are located in directories specified by their respective environment variables, which are set when the module is loaded. Use the following command:

login3$ module help <modulefile>

to display instructions and examples on loading libraries for a particular modulefile.

The SGE Batch System

Batch facilities such as LoadLeveler, LSF, and SGE differ in their user interface as well as the implementation of the batch environment. Common to all, however, is the availability of tools and commands to perform the most important operations in batch processing: job submission, job monitoring, and job control (hold, delete, resource request modification). The following paragraphs list the basic batch operations and their options, explain how to use the SGE batch environment, and describe the queue structure.

Job submission

SGE provides the qsub command for submitting batch jobs:

login3$ qsub myjobscript

where "myjobscript" is the name of a UNIX format text file containing job script commands. This file should contain both shell commands and special statements that include qsub options and resource specifications. Some of the most common options are described in Table 5.1. Details on using these options and examples of job scripts follow.

Table 5.1 Common qsub options

Option Argument Function
-q <queue_name> Submits to queue designated by <queue_name>.
-pe <TpN>way <NoN x 16> Executes the job using the specified number of tasks (cores to use) per node ("wayness") and the number of nodes times 16 (total number of cores). (See example script below.)
-N <job_name> Names the job <job_name>.
-M <email_address> Specify the email address to use for notifications.
-m {b|e|a|s|n} Specify when user notifications are to be sent.
-V   Use current environment setting in batch job.
-cwd   Use current directory as the job's working directory.
-j y Join stderr output with the file specified by the -o option. (Don't also use the -e option.)
-o <output_file> Direct job output to <output_file>.
-e <error_file> Direct job error to <error_file>. (Don't also use the -j option.)
-A <project_account_name> Charges run to <project_account_name>. Used only for multi-project logins. Account names and reports are displayed at login.
-l <resource>=<value> Specify resource limits. (See qsub man page.)

Options can be passed to qsub on the command line or specified in the job script file. The latter approach is preferable. It is easier to store commonly used qsub commands in a script file that will be reused several times rather than retyping the qsub commands at every batch request. In addition, it is easier to maintain a consistent batch environment across runs if the same options are stored in a reusable job script.

Batch scripts contain two types of statements: special comments and shell commands. Special comment lines begin with #$ and are followed by qsub options. The SGE shell_start_mode has been set to unix_behavior, which means the UNIX shell commands are interpreted by the shell specified on the first line after the #! sentinel; otherwise the Bourne shell (/bin/sh) is used. The example job script below requests an MPI job with 32 cores and 1.5 hours of run time:

#!/bin/bash
#$ -V                      # Inherit the submission environment
#$ -cwd                    # Start job in submission directory
#$ -N myMPI                # Job Name
#$ -j y                    # Combine stderr and stdout
#$ -o $JOB_NAME.o$JOB_ID   # Name of the output file (eg. myMPI.oJobID)
#$ -pe 16way 32            # Requests 16 tasks/node, 32 cores total
#$ -q normal               # Queue name "normal"
#$ -l h_rt=01:30:00        # Run time (hh:mm:ss) - 1.5 hours
#$ -M                      # address for email notification
#$ -m be                   # email at Begin and End of job
set -x                     # Echo commands, use "set echo" with csh
ibrun ./a.out              # Run the MPI executable named "a.out"

If you don't want stderr and stdout directed to the same file, replace the -j option line with a -e option to name a separate output file for stderr (but don't use both). By default, SGE sends stdout to the file <job_name>.o<job_id> and stderr to <job_name>.e<job_id>.

Example job scripts are available online in /share/doc/sge. They include details for launching large jobs, running multiple executables with different MPI stacks, executing hybrid applications, and other operations.

Please note that TACC does not support the qlogin command.

MPI Environment for Scalable Code

The MVAPICH-1 and MVAPICH-2 MPI packages provide runtime environments that can be tuned for scalable code. For packages with short messages, there is a "FAST_PATH" option that can reduce communication costs, as well as a mechanism to "Share Receive Queues". Also, there is a "Hot-Spot Congestion Avoidance" option for quelling communication patterns that produce hot spots in the switch. See Chapter 9, "Scalable features for Large Scale Clusters and Performance Tuning" and Chapter 10, "MVAPICH2 Parameters" of the MVAPICH2 User Guide for more information.

The SGE Parallel Environment

Each Ranger node (of 16 cores) is assigned to only one user at a time; hence, each node assigned to a job is dedicated only to that user and accrues wall-clock time for all 16 cores whether any cores are used or not. The SGE Parallel Environment option, -pe, sets the number of MPI Tasks per Node (TpN), and the Number of Nodes (NoN) in a job. The syntax is:

-pe <TpN>way <NoN x 16>

where "TpN" is the number of Tasks per Node, and "NoN x 16" is the total number of cores requested (Number of Nodes times 16). The Tasks per Node, called the "wayness", is the number of MPI tasks launched on each node through the ibrun command. Regardless of the value of "TpN", the second argument is always 16 times the number of nodes that you are requesting, because the batch scheduler can only allocate nodes (groups of 16 cores).

For example, to run the MPI a.out executable on 4 nodes with 16 tasks on each node (64 tasks in total), use the following two lines in the job script:

#$ -pe 16way 64
...
ibrun ./a.out

For MPI jobs that require more than 2GB/task or use hybrid techniques (MPI code with OpenMP threads), the wayness (Tasks per Node) will be less than the number of cores per node. In the following example the hybrid MPI a.out code launches 2 tasks per node and 6 threads per task on 4 nodes (64 cores assigned):

#$ -pe 2way 64
...
export OMP_NUM_THREADS=6 #(setenv OMP_NUM_THREADS 6 --for C-type shell)
ibrun tacc_affinity ./a.out

Here the OMP_NUM_THREADS environment variable is set using the syntax appropriate for the shell interpreter that executes the script. The shell is set on the first line of the job script, preceded by the "shebang" character sequence (#!). Most often users set the shell to #!/bin/tcsh or #!/bin/bash on the first line to invoke the tcsh shell or the bash shell. The shell selection is a personal preference. The tacc_affinity wrapper is used to make sure that the 2 tasks launched on each node run on separate processors (often called "sockets"). Ranger nodes have four processors, each with four cores.

Using a multiple of 16 cores per node

For "pure" MPI applications, the most cost-efficient choices are: 16 tasks per node (16way) and a total number of tasks that is a multiple of 16, as described above. This will ensure that each core on all the nodes is assigned one task.

Using a large number of cores

For core counts above 4,000, cache the executable in each node's memory immediately before the ibrun statement with the following command:

cache_binary $PWD ./a.out
ibrun tacc_affinity ./a.out

In this example ./a.out is the executable, $PWD is evaluated to the present working directory, and cache_binary is a perl script that caches the executable, libraries, and other files on each node. Set the DEBUG_CACHE_BINARY environment variable to any value to get a report of the files cached by the cache_binary command.

Using fewer than 16 cores per node

When you want to use fewer than 16 MPI tasks per node, the choice of tasks per node is limited to the set {1, 2, 4, 8, 12, 15}. When the number of tasks you need is equal to "Number of Tasks per Node x Number of Nodes", then use the following command:

#$ -pe <TpN>way <NoN x 16>

where <TpN> is a number in the set {1, 2, 4, 8, 12, 15}.
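
For example (a minimal illustration), to run 24 tasks as 12 tasks on each of 2 nodes:

#$ -pe 12way 32    # 12 tasks/node on 2 nodes (32 cores) = 24 tasks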

If the Total number of Tasks that you need is less than "Number of Tasks per Node x Number of Nodes", then set the MY_NSLOTS environment variable to the Total number of Tasks <TnoT> needed. In a job script, use the following -pe option and environment variable statement:

#$ -pe <TpN>way <NoN x 16>
export MY_NSLOTS=<TnoT> # For Bourne shells

or

setenv MY_NSLOTS <TnoT> # For C shells

where <TpN> is a number in the set {1, 2, 4, 8, 12, 15}. For example, using a Bourne shell:

#$ -pe 8way 64 # Use 8 Tasks per Node, 4 Nodes requested
export MY_NSLOTS=31 # 31 tasks are launched

Program Environment for Serial Programs

For serial batch executions, use the 1-way program environment (#$ -pe 1way 16), don't use the ibrun command to launch the executable (just use "./my_executable my_arguments"), and submit your job to the serial queue (#$ -q serial). The serial queue has a 12-hour runtime limit, 12 nodes reserved within its pool, and allows up to 6 simultaneous runs per user.

#$ -pe 1way 16    # 1 execution on node of 16 cores
#$ -q serial           # run in serial queue
./my_executable   # execute your applications (no ibrun)

Program Environment for Hybrid Programs

For hybrid jobs, specify the MPI Tasks per Node through the first -pe argument (1/2/4/8/12/15/16way) and the Number of Nodes through the second -pe argument (as the number of assigned cores = Number of Nodes x 16). Then use the OMP_NUM_THREADS environment variable to set the number of threads per task. (Make sure that "Tasks per Node x Number of Nodes" is less than or equal to the number of assigned cores, the second argument of the -pe option.) The hybrid Bourne shell example below illustrates the use of these parameters: it requests 4 tasks per node, 4 threads per task, and a total of 32 cores (2 nodes x 16 cores).

#$ -pe 4way 32                                   # 4 tasks/node, 32 cores total
export OMP_NUM_THREADS=4   # 4 threads/task

SGE provides several environment variables that can be referenced in the #$ option lines and are evaluated at submission time. The $JOB_ID string is substituted with the job id, and the job name (set with -N) is assigned to the environment variable JOB_NAME. The memory limit per task on a node is automatically adjusted to the maximum memory available to a user application (for serial and parallel codes).
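
As an illustration, these variables are commonly used to name output files and may be referenced in the #$ option lines (the job name below is a placeholder):

#$ -N mytest               # job name, available to the job as $JOB_NAME
#$ -o $JOB_NAME.o$JOB_ID   # stdout file tagged with the job id
#$ -e $JOB_NAME.e$JOB_ID   # stderr file tagged with the job id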

Batch query

After job submission, users can monitor the status of their jobs with the qstat command. The following table lists qstat options:

Table 5.2 Common qstat options

Option Result
-t Show additional information about subtasks
-r Shows resource requirements of jobs
-ext Displays extended information about jobs
-j Displays information for specified job
-qs {a|c|d|o|s|u|A|C|D|E|S} Shows jobs in the specified state(s)
-f Shows "full" list of queue/job details

The qstat command output includes a listing of jobs and the following fields for each job:

Table 5.3 qstat Output Field Descriptions

Field Description
JOBID job id assigned to the job
USER user that owns the job
STATE current job status, including, but not limited to:
  w(aiting)
  s(uspended)
  r(unning)
  h(old)
  E(rror)
  d(eletion)

For convenience, TACC has created an additional job monitoring utility which summarizes jobs in the batch system in a manner similar to the "showq" utility from PBS. Execute:

login3$ showq

to summarize running, idle, and pending jobs, along with any advanced reservations scheduled within the next week. The command showq -u will show jobs associated with your userid only (type "showq --help" to obtain more information on available options).

The latest queue information can be determined using the following commands:

Table 5.4 SGE Queue Commands

Command Comment
qconf -sql Lists the available queues.
qconf -sq <queue_name> The s_rt and h_rt values are the soft and hard wall-clock limits for the queue.
cat /share/sge/default/tacc/sge_esub_control The first value after max_cores_per_job_<queue_name> is the queue core limit.
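
For example, to check the wall-clock limits of the normal queue (any queue name from the table of Ranger queues below may be substituted):

login3$ qconf -sq normal | grep _rt    # shows the s_rt and h_rt limits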

Job control

Control of job behavior takes many forms:

Job modification while in the pending/run state

Users can reset the qsub options of a pending job with the qalter command, using the following syntax:

login3$ qalter <options>

where <options> refers only to the following qsub resource options:

Table 5.5 qalter Command Options

-l h_rt= per-job wall clock time
-o output file
-e error file
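
For example, to adjust a pending job (the job id below is a placeholder):

login3$ qalter -l h_rt=12:00:00 123456   # request a 12-hour wall-clock limit
login3$ qalter -o myrun.log 123456       # redirect stdout to a different file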

Job deletion

The qdel command is used to remove pending and running jobs from the queue. The following table explains the different qdel invocations:

Table 5.6 qdel Command Options

qdel Removes pending or running job.
qdel -f Force immediate dequeuing of running job.

Use forced job deletion (the -f option) only if all else fails, and immediately report any forcibly deleted job IDs to TACC staff through the portal consulting system or the XSEDE Help Desk (help@xsede.org), since forced deletion may leave hung processes that can interfere with the next job.

Job suspension/resumption

The qhold command allows users to prevent jobs from running. It may be used to hold serial or parallel jobs and can be invoked by a user on his or her own jobs. Users cannot resume jobs suspended by a system administrator, nor can they hold or resume jobs owned by other users.

Jobs that have been placed on hold by qhold can be resumed by using the qalter -h U ${JOB_ID} command. (Note the character spacing is mandatory.)
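
A typical hold-and-release sequence looks like the following (the job id is a placeholder):

login3$ qhold 123456          # place a user hold; the job remains pending
login3$ qalter -h U 123456    # release the user hold so the job can be scheduled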

SGE Batch Environment

In addition to the environment variables inherited by the job from the interactive login environment, SGE sets additional variables in every batch session. The following table lists some of the important SGE variables:

Table 5.7 SGE Variables

Environment Variable Contains
JOB_ID batch job id
TASK_ID task id of an array job sub-task
JOB_NAME name user assigned to the job
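
A hedged sketch of how these variables might be used inside a Bash job script to create a unique working directory (the directory layout is illustrative only):

RUNDIR=$SCRATCH/$JOB_NAME.$JOB_ID   # per-job directory under $SCRATCH
mkdir -p $RUNDIR
cp ./a.out $RUNDIR
cd $RUNDIR
ibrun ./a.out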

Ranger Queue Structure

The Ranger production queues and their characteristics (wall-clock and processor limits; priority charge factor; and purpose) are listed in the table below. Queues that don't appear in the list (such as systest, sysdebug, and clean) are non-production queues for system and HPC group testing and special support. From Ranger, users can submit jobs to the Spur nodes through the "vis" queue, using their Spur project name in the qsub job account option (#$ -A). All Ranger projects have a Spur account, with at least a complimentary allocation for experimenting with the visualization tools. More details are available in the Spur User Guide.

Queue Limits: Users are restricted to a maximum of 50 jobs total over all queues plus a special limit of 5 jobs at a time in the development queue.

Table 5.8 Ranger Queues

Queue Name Max Runtime Max Procs SU Charge Rate Purpose
normal 24 hrs 4096 1 Normal Priority
long 48 hrs 1024 1 Long Times
large 24 hrs 16384 1 Large Core Count
development 2 hrs 256 1 Development
serial 12 hrs 16 1 Serial, Large Memory
request -- -- 1 Special Requests
vis 24 hrs 32 1 Visualization

Launching MPI Applications with ibrun

For all codes compiled with any MPI library, use the ibrun command to launch the executable within the job script. The syntax is:

Simple Syntax ibrun ./my_exec <code_options>
Example ibrun ./a.out 2000

The ibrun command also supports options for advanced host selection: a subset of the processors from the list of all hosts can be selected to run an executable by supplying a core count and an offset into the host list. The offset also makes it possible to run two different executables on two different subsets simultaneously. The option syntax is:

ibrun -n <# of cores> -o <hostlist offset> my_exec <code_options>

For the following advanced example, 64 cores were requested in the job script.

ibrun -n 32 -o 0 ./a.out &
ibrun -n 32 -o 32 ./a.out &
wait

The first call launches a 32-core run on the first 32 hosts in the hostfile, while the second call launches a 32-core run on the second 32 hosts in the hostfile, concurrently (by terminating each command with &). The wait command (required) waits for all processes to finish before the shell continues. The wait command works in all shells. Note, the -n and -o options must be used together.
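
The same mechanism can launch two different executables side by side within one 64-core allocation; a sketch (the executable names are placeholders):

ibrun -n 32 -o 0  ./model_a.out &
ibrun -n 32 -o 32 ./model_b.out &
wait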

Affinity Policies

Controlling Process Affinity and Memory Locality

While many applications will run optimally using 16 MPI tasks on each node (a single task on each core), certain applications will run more efficiently with fewer than 16 cores per node and/or with a non-default memory policy. In these cases, consider using the techniques described in this section for a specific arrangement of tasks (process affinity) and memory allocation (memory policy). In HPC batch systems an MPI task is synonymous with a process. This section uses these two terms interchangeably. The "wayness" of a job, specified by the -pe option, determines the number of tasks, or processes, launched on a node. For example, the SGE batch option "#$ -pe 16way 128" will launch 16 tasks on each of the nodes assembled from a set of 128 cores.

Since each MPI task is launched on a node as a separate process, the process affinity and memory policy can be specified for each task as it is launched with the numactl wrapper command. When a task is launched it can be bound to a socket or a specific core; likewise, its memory allocation can be bound to any socket. The assignment of tasks to sockets (and cores) and memory to socket memories are specified as options of the numactl wrapper command. There are two forms that can be used for numa control when launching a batch executable (a.out).

Example 1.

login3$ ibrun numactl options ./a.out

In this first example, ibrun executes "numactl options a.out" for all tasks (a.out's) using the same options. Because the rank of each a.out instance is not known to the job script, numactl options cannot be tailored to individual tasks on this command line.

Example 2.

login3$ ibrun my_affinity ./a.out

Here, ibrun launches an executable UNIX script, my_affinity, for each task on a node and passes the rank and wayness to the script through environment variables. The script can therefore use these two values to derive the numactl option arguments appropriate for each task and then execute "numactl options a.out" itself.

Consult the numactl man pages and command help for further information.

login3$ man numactl
login3$ numactl --help

In order to use numactl, it is necessary to understand the node configuration and the mapping between the logical core IDs, also called CPU Numbers, and the socket IDs, also called the Socket Numbers, and their physical location. The figure below shows the basic layout of the memory, sockets and cores of a Ranger node.

Figure: Layout of the memory, sockets, and cores of a Ranger node.

Both the CPU and Socket Numbers start at 0. Note also the asymmetry in the HyperTransport (HT) connections: an HT port on sockets 0 and 3 connects to a PCI-Express bus, and socket 3's PCI-e bus connects to the InfiniBand Host Channel Adapter (HCA) and out to the InfiniBand network fabric.

Both memory bandwidth-limited and cache-friendly applications are affected by process and memory locality, and numa control can be used to tailor affinity and memory policy to the specifics of the Ranger node architecture. The following figure shows two extreme cases of core scalability on a node. The upper curve shows the performance of a cache-friendly DGEMM matrix-matrix multiply from the GotoBLAS library; it achieves a speedup of 15.2 on a 16-core run. The lower curve shows the scaling of the STREAM benchmark, which is limited by the memory bandwidth to each socket. Efficiently parallelized codes should exhibit scaling between these two curves. Applications that are memory-bandwidth limited are often more sensitive to memory policy and affinity than CPU-limited applications.

Numactl Command and Options

The table below lists important options for assigning processes and memory allocation to sockets, and assigning processes to specific cores. The "no fallback" condition implies a process will abort if no more allocation space is available.

Table 5.9 numactl Commands and Options

Control Type Command Option Arguments Description
Socket Affinity numactl -N {0,1,2,3} Only execute process on cores of this (these) socket(s).
Memory Policy numactl -l none Allocate only on the socket where the process runs; fall back to another socket if full.
Memory Policy numactl -i {0,1,2,3} Strictly allocate round robin (interleave) on these (comma separated list) sockets; no fallback.
Memory Policy numactl --preferred= {0,1,2,3} select only one Allocate on this (only one) socket; fallback to another if full.
Memory Policy numactl -m {0,1,2,3} Strictly allocate on this (these, comma separated list) socket(s); no fallback.
Core Affinity numactl -C {0,1,2,3, 4,5,6,7, 8,9,10,11, 12,13,14,15} Only execute process on this (these, comma separated list) core(s).

In threaded applications the same numactl command can be used, but it applies globally to ALL threads: any process or thread forked by an executable launched under numactl inherits the process affinity and memory policy of the parent. Hence, in OpenMP codes, all threads at a parallel region take on the affinity and policy of the master process (thread). Even though multiple threads are forked (or spawned) at run time from a single process, each thread can still be controlled (migrated) after it is spawned through a numa API (application program interface). The basic underlying API utilities for binding processes (MPI tasks) and threads (OpenMP/Pthreads threads) are sched_get/setaffinity and the numalib memory functions, respectively.
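
For example, a purely threaded (non-MPI) run could be confined to a single socket as follows; all forked threads inherit the binding (the executable name is a placeholder):

export OMP_NUM_THREADS=4
numactl -N 1 -m 1 ./omp_app    # run the 4 threads on socket 1 and allocate memory there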

There are three levels of numa control that can be used when submitting batch jobs. These are:

  1. Global (usually with 16 tasks);
  2. Socket level (usually with 4, 8, and 12 tasks); and
  3. Core level (usually with an odd number of tasks)

Launch commands will be illustrated for each. The control of numactl can be directed either at a global level on the ibrun command line (usually for memory allocation only), or within a script (which accesses MPI variables) to specify the socket (-N) or core (-C) mapping to MPI ranks assigned to the execution. (Note, the numactl man page refers to sockets as "nodes". In HPC systems a node is a blade; to avoid confusion, we will only use "node" when we refer to a blade).

The default memory policy is local allocation; that is, allocation preferentially occurs on the socket where the process is executing. Note that the (physical) memory allocation occurs when variables/arrays are first touched (assigned a value) in a program, not when the memory storage is malloc'd (C) or allocated (F90). Task assignments are handled by numa control within MPI (and should not be a concern for 16way MPI and 1way pure OMP runs); nevertheless, some applications can benefit from their own numa control. The syntax and script templates for the three levels of control are presented below.

Numa Control in Batch Scripts

Global Control

Global control will affect every execution. This is often only used to control the layout of memory for 16-way MPI executions and SMP executions of 16 threads on a node. Since allocated I/O buffers may remain on a node from a previous job, TACC provides a script, tacc_affinity, to enforce a strict local memory allocation to the socket (-m memory policy), thereby removing I/O buffers (the default local memory policy does not evict the IO memory allocation, but simply overflows to another socket's memory and incurs a higher overhead for memory access). The tacc_affinity script also distributes 4-, 8-, and 12way executions evenly across sockets. Use this script as a template for implementing your own control. The examples below illustrate the syntax.

ibrun ./a.out                   # -pe 16way 16 ; uses local memory; MPI does core affinity
ibrun numactl -i all ./a.out    # -pe 1way 16  ; interleaved memory; possibly use for OMP
ibrun tacc_affinity ./a.out     # -pe 4/8/12/16way ... ; assigns MPI tasks round robin to sockets, with mandatory memory allocation to the socket

Socket Control

Socket-level affinity is often used with hybrid applications (such as 4 tasks and 4 threads/task on a node, a 4x4 layout), and when the memory per process must be two to four times the default (~2GB/task). In these scenarios it is important to distribute 1, 2 or 3 tasks per socket. In the 4x4 hybrid model it is critically important to launch a single task per socket and to keep its 4 threads on that socket. For the usual cases TACC provides the tacc_affinity script, which implements the recommended behavior.

An example numa control script and the related job-script commands are shown below to illustrate how a task (rank) is mapped to a socket and how the process memory policy is assigned. The numa.sh script below (based on TACC's tacc_affinity script) is executed once for each task on the node. It captures the assigned rank (from the particular MPI stack, one of MPIRUN_RANK, PMI_RANK, etc.) and the wayness from ibrun and uses them to assign a socket to the task. Ranks are assigned to the list of nodes sequentially in blocks whose size is the "wayness" of the Program Environment (available in the PE environment variable). For example, in the job below, a 32-core, 4way job (#$ -pe 4way 32) has ranks {0,1,2,3} on the first node, {4,5,6,7} on the next node, and so on. The "export OMP_NUM_THREADS=4" line specifies a hybrid code using 4 threads per task. With these job-script parameters, a task is assigned to each of sockets 0, 1, 2 and 3 on each of the two nodes. The script also works for 8way, 12way and 16way jobs.

Bourne shell based job script

...
#$ -pe 4way 32
...
export OMP_NUM_THREADS=4
ibrun numa.sh

numa.sh

#!/bin/bash

# Unset any MPI affinities so that numactl controls placement
export MV2_USE_AFFINITY=0
export MV2_ENABLE_AFFINITY=0
export VIADEV_USE_AFFINITY=0
export VIADEV_ENABLE_AFFINITY=0

# Get rank from appropriate MPI API variable
[ "x$MPIRUN_RANK" != "x" ] && myrank=$MPIRUN_RANK
[ "x$PMI_ID" != "x" ] && myrank=$PMI_ID
[ "x$OMPI_COMM_WORLD_RANK" != "x" ] && myrank=$OMPI_COMM_WORLD_RANK
[ "x$OMPI_MCA_ns_nds_vpid" != "x" ] && myrank=$OMPI_MCA_ns_nds_vpid

# TasksPerNode
TPN=`echo $PE | sed 's/way//'`
[ ! $TPN ] && echo "TPN NOT defined!" && exit 1

# local_rank = 0...wayness = Rank modulo wayness

local_rank=$(( $myrank % $TPN ))

# Assign sockets in groups of 1, 2, 3, and 4 tasks for
# 4-, 8-, 12-, and 16-wayness, respectively.

socketID=$(( $local_rank / ( $TPN / 4 ) ))

exec numactl -N $socketID -m $socketID ./a.out

Core Control

When there is no simple arithmetic algorithm (such as using the modulo function above) to map the ranks, lists may be used. The template below illustrates the use of a list for mapping tasks and memory allocation onto sockets.

We use the numactl -C option to assign tasks to cores. Each element of the task2core array holds the core assigned to a task, indexed by local rank: using localrank as the index of task2core maps the task (for that local rank) onto a core ID. Similarly for the task2socket array: the local rank values 0-15 are used as indexes to provide socket IDs for the -m memory argument. In this case the Program Environment is 16way, and the local ranks {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15} are mapped onto the cores {15,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14} and sockets {3,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3}, respectively.

Task 0, which is often the root rank in MPI broadcasts, is assigned to core 15, with its memory allocated on socket 3, which is connected directly through the HyperTransport via the PCI-e bridge to the HCA (network card). Such a mapping may actually degrade performance for other reasons; hence, custom mappings should be thoroughly tested for their benefits before being used on a production application. Nevertheless, the approach is quite general and provides a mechanism for list-directed control over any desired mapping.

NOTE: When the number of tasks is not a multiple of 16 (e.g., 28 in this case), set the MY_NSLOTS environment variable to the number of tasks within the job script, as shown below, AND the second argument of the -pe option (32 below) must equal the value of MY_NSLOTS rounded up to the nearest multiple of 16.

Bourne shell based job script

...
#$ -pe 14way 32
...
export MY_NSLOTS=28
ibrun numa.sh

numa.sh

#!/bin/bash

# Unset any MPI Affinities
export MV2_USE_AFFINITY=0
export MV2_ENABLE_AFFINITY=0
export VIADEV_USE_AFFINITY=0
export VIADEV_ENABLE_AFFINITY=0

# Get rank from appropriate MPI API variable
[ "x$MPIRUN_RANK" != "x" ] && myrank=$MPIRUN_RANK
[ "x$PMI_ID" != "x" ] && myrank=$PMI_ID
[ "x$OMPI_COMM_WORLD_RANK" != "x" ] && myrank=$OMPI_COMM_WORLD_RANK
[ "x$OMPI_MCA_ns_nds_vpid" != "x" ] && myrank=$OMPI_MCA_ns_nds_vpid

# TasksPerNode
TPN=`echo $PE | sed 's/way//'`
[ ! $TPN ] && echo "TPN NOT defined!" && exit 1

# local_rank = 0...wayness = Rank modulo wayness

if [ $TPN = 16 ]; then
task2socket=( 3 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3)
task2core=( 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14)
fi

localrank=$(( $myrank % $TPN ))

# bash arrays: the first element has index 0

socket=${task2socket[$localrank]}
core=${task2core[$localrank]}

exec numactl -C $core -m $socket ./a.out

In hybrid codes, a single MPI task (process) is launched and becomes the "master thread". It uses any numactl options specified on the launch command. When a parallel region forks the slave threads, the slaves inherit the affinity and memory policy of the parent, the master thread (launch process). The usual case is for an application to launch a single task on a node without numactl process binding, but to provide a memory policy for the task (and hence all of the threads).

For instance, "ibrun numactl -i all ./a.out" would be used to assign interleave as the memory policy. Another hybrid scenario is to assign a single task on each socket (a 4x4 hybrid). In this case, a socket binding and some form of local memory policy should be employed, as in the example above.

In the first case, a socket could have been assigned to the single task (socket 0 is the usual default). In both cases, the memory policy could have been omitted, and that may even be the most appropriate choice. Without an explicit memory policy, a local allocation, sometimes called a "first touch" mechanism, is used: for any shared array, the first time an element within a block of memory (a page of 4096 bytes) is accessed, that page is assigned to the memory of the socket on which the accessing thread is running, regardless of where the master thread performed the allocation statement. (For special cases you can force touching at allocation time with the --touch option, but that should not be used for a shared array because it assigns all of the memory to a single socket.)
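
A minimal sketch combining these ideas for a 16-thread SMP run on one node with interleaved memory allocation (values are illustrative):

#$ -pe 1way 16                 # one task on one 16-core node
export OMP_NUM_THREADS=16      # 16 OpenMP threads in the single task
ibrun numactl -i all ./a.out   # interleave memory allocation across all four sockets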

Tools

Program Timers and Performance Tools

Measuring the performance of a program should be an integral part of code development. It provides benchmarks to gauge the effectiveness of performance modifications and can be used to evaluate the scalability of the whole package and/or specific routines. There are quite a few tools for measuring performance, ranging from simple timers to hardware counters. Reporting methods vary too, from simple ASCII text to X-Window graphs of time series.

The most accurate way to evaluate changes in overall performance is to measure the wall-clock (real) time when an executable is running in a dedicated environment. On Symmetric Multi-Processor (SMP) machines, where resources are shared (e.g., the TACC IBM Power4 P690 nodes), user time plus sys time is a reasonable metric; but the values will not be as consistent as when running without any other user processes on the system. The user and sys times are the amount of time a user's application executes the code's instructions and the amount of time the kernel spends executing system calls on behalf of the user, respectively.

Package Timers

The time command is available on most UNIX systems. In some shells there is a built-in time command, but it doesn't have the functionality of the command found in /usr/bin. Therefore, use the full pathname to access the time command in /usr/bin. To measure a program's time, run the executable with time using the syntax "/usr/bin/time -p <executable>" (-p specifies traditional "precision" output, with units in seconds):

login3$ /usr/bin/time -p ./a.out #Time for a.out execution
real 1.54 # Output (in seconds)
user 0.5
sys 0
login3$ /usr/bin/time -p ibrun -np 4 ./a.out #Time for rank 0 task

The MPI example above provides only the timing information for the rank 0 task on the master node (the node that executes the job script); however, the real time is applicable to all tasks since MPI tasks terminate together. The user and sys times may vary markedly from task to task if they do not perform the same amount of computational work (are not load balanced).

Code Section Timers

"Section" timing is another popular mechanism for obtaining timing information. The performance of individual routines or blocks of code can be measured with section timers by inserting the timer calls before and after the regions of interest. Several of the more common timers and their characteristics are listed below:

Table 6.1 Timers

Routine Type Resolution (usec) OS/Compiler
times user/sys 1000 Linux
getrusage wall/user/sys 1000 Linux
gettimeofday wall clock 1 Linux
rdtsc wall clock 0.1 Linux
system_clock wall clock system dependent Fortran 90 Intrinsic
MPI_Wtime wall clock system dependent MPI Library (C & Fortran)

For general purpose or coarse-grain timings, precision is not important; the millisecond and MPI/Fortran timers are sufficient and, because they are available on many systems, are also useful when portability is important. For benchmarking loops, it is best to use the most accurate timer (and to time enough loop iterations that the measured duration is at least an order of magnitude larger than the timer resolution). The times, getrusage, gettimeofday, rdtsc, and read_real_time timers have been packaged into a group of C wrapper routines (also callable from Fortran). The routines are function calls that return double-precision floating point numbers with units in seconds. All of these TACC wrapper timers (x_timer) are accessed in the same way:

Fortran code

real*8, external :: x_timer
real*8 :: sec0, sec1, tseconds
...
sec0 = x_timer()
sec1 = x_timer()
tseconds = sec1-sec0

C code

double x_timer(void);
double sec0, sec1, tseconds;
...
sec0 = x_timer();
sec1 = x_timer();
tseconds = sec1-sec0;

Standard Profilers

The gprof profiling tool provides a convenient mechanism to obtain timing information for an entire program or package. Gprof reports a basic profile of how much time is spent in each subroutine and can direct developers to the routines where optimization might be most beneficial, the hotspots. As with all profiling tools, the code must be instrumented to collect the timing data and then executed to create a raw-data file. Finally, the data file is read and translated into an ASCII report or a graphic display. The instrumentation is accomplished by simply recompiling the code with the -qp (Intel compiler) option. The compilation, execution, and profiler commands for gprof are shown below for a sample Fortran program:

Profiling Serial Executables

login1$ ifort -qp prog.f90 ;instruments code
login1$ a.out              ;produces gmon.out trace file
login1$ gprof              ;reads gmon.out (default args: a.out gmon.out), report sent to STDOUT

Profiling Parallel Executables

login1$ mpif90 -qp prog.f90           ;instruments code
login1$ setenv GMON_OUT_PREFIX gout   ;each task produces its own gout.<pid> file
login1$ mpirun -np <#> a.out          ;produces gmon.out trace file
login1$ gprof -s gout.*               ;combines gout files into gmon.sum
login1$ gprof a.out gmon.sum          ;reads executable (a.out) and gmon.sum, report sent to STDOUT

Detailed documentation is available at www.gnu.org.

Timing Tools

Most of the advanced timing tools access hardware counters and can provide performance characteristics such as floating point/integer operation counts, memory accesses, cache misses/hits, and instruction counts. Some tools can provide statistics for an entire executable with little or no instrumentation, while others require source code modification.

Debugging with DDT

DDT is a symbolic, parallel debugger that allows graphical debugging of MPI applications. For information on how to perform parallel debugging using DDT on Ranger, please see the DDT Debugging Guide.

Reference

The following manuals and other reference documents were used to gather information for this User Guide and may contain additional information of use.

AMD

Portland Group Compiler

Intel

Last updated: October 30, 2012