Architecture
TACC's enhanced IBM Power5 System, Champion, now has a homogeneous selection of resources for both MPI and OMP parallel computing paradigms. It also provides almost 50% improvement for peak performance on a per processor basis, and improved sustained performance compared to the previous heterogeneous configuration of Longhorn.

Champion consists of 12 nodes each with 8 processors, aggregating to a total of 96 processors. Ten of the twelve nodes are used for dedicated jobs, or where the node usage is not shared. In the remaining two nodes, including the login or frontend node, node usage is shared between multiple concurrent jobs. One of the 8 processors in the frontend node will be used for the General Parallel File System (GPFS). Two frontend proccessors will be used for compiling. A visual of the single p5-575 frame with configuration details is illustrated below in Figure 1.
Figure 1. Power5 System Configuration

All the nodes employ the same 1.9GHz IBM Power5 microarchitecture, run the same AIX 5.3 operating system, and are connected through an IBM High Performance "Federation" Switch (HPS). These features provide a common and integrated parallel programming environment throughout the system. The homogeneous node-architectures provide different computational spaces for parallel algorithms and paradigms.

The architectural features of the Power5 chip are designed with a high memory bandwidth to accommodate the superscalar operations of a 1.9 Gigahertz processor. Features such as speculative branching, out-of-order execution, predication, 8-stream prefetching, and a 3-tier cache hierarchy provide a continuous and high data throughput for high peak performance.

Figure 2. Power5 Chip

Some important design changes were made to the Power5 chip that improved the overall system performance. One such change is moving the L3 cache from the memory side of the ASIC fabric to the processor side of the fabric, but off-chip. This removed the L3 cache from the path between the chip and the memory controller, as was the case in Power4. This allowed for the memory controller to be moved from the fabric to on-chip. This can be seen in Figure 2. These two changes also had significant additional benefits: reduced latency to the L3 and to memory, increased bandwidth to the L3 cache and the memory, and improvement in intrinsic reliability resulting from the reduction in the number of chips necessary to build a system. The p5-575 system thus comprises of two chips: a POWER5 chip and a L3 cache chip. Finally both the General-Purpose Registers and the Floating-Point registers have been increased to 120, from 80 and 72 respectively, in the Power4 chip. The additional FP rename capability had the benefit of increasing Power5's single thread performance on some HPC workloads. This gain stemmed from an enhanced ability to execute critical code sections out of order.

The cache and memory characteristics for Power5 are as follows:

Each core processor has 2 floating point units that can each perform a "fused" multiply-add operation per clock cycle. At a clock speed of 1.9GHz, four floating point operations per cycle can deliver 7.6 GFLOPS from each processor. In all, the 96 processors of the TACC Power5 Champion system can deliver a peak performance of about 730 GFLOPS.

Beyond the die level, the next architectural level for the Power5 system is the Dual-Chip Module (DCM). This is the basic building block of IBM's p5-575 HPC system node which has 8 such DCMs. Unlike the Power4 MCM which had 4 chip cores, there is only one processor chip in a Power5 DCM and a L3 cache chip. Each processor chip is dual-core, with only one core or CPU active on TACC's Champion system (see Figure 2). This allows for greater bandwidth for the single active microprocessor across all levels of the memory hierarchy, and thus is extremely beneficial for most HPC applications.

Note that a considerable amount of the silicon real estate for the Power5 chip is now devoted to the on-chip memory controller as well as the directory for the L3 cache, and is consequently sometimes referred to as a "system on a chip". These design changes and the modest increases in L2 and L3 cache sizes as well as increased Simultaneous Multi-Threading (SMT) functionality offered in the new Power5 chip, have increased the Power5 die-size to 389 mm**2, from the previous 247 mm**2 size for the Power4 chip.

Figure 3. Champion Interconnect Configuration

It is important to remember that there are two ports to each and every node. However only one of the ports is linked to the High Performance Switch as only one switch is currently needed for a system of this size. This can be seen in Figure 3. The sustained switch performance should reflect this configuration aspect. The peak observed MPI bandwidth from the HPS switch is 1.88 GB/s (peak is 2 GB/s) while the reported latency is 4 us MPI task latency.

TACC's Power5 system has a variety of disk solutions for different storage needs. A General Parallel File System, GPFS, is available from any node and provides parallel access to files through MPI-I/O or native Unix calls with a distributed locking protocol for coherent access from any node. This file system has 7.2 TB of disk and one GPFS processor server, with large stripe groups across multiple disks. File access is served over the High Performance "Federation" Switch for high throughput. There are local scratch disks on each node, except the login node, for tasks that do independent, local I/O. Home directories are mounted on all computational nodes. Also, for large-file, long term storage, the TACC archival file system is accessible from the login node only. The Archive file system is not accessible from any other Champion nodes outside of the frontend. The connectivity and size of these file systems are shown in Figure 7.

Last modified: July 14 2009 10:51:40.


System Access

SSH

To ensure a secure login session, users must connect to machines using the secure shell, ssh program. Telnet is no longer allowed because of the security vulnerabilities associated with it. The "r" commands rlogin, rsh, and rcp, as well as ftp, are also disabled on this machine for similar reasons. These commands are replaced by the more secure alternatives included in SSH --- ssh, scp, and sftp.

Before any login sessions can be initiated using ssh, a working SSH client needs to be present in the local machine. Go to the TACC introduction to SSH for information on downloading and installing SSH.

To initiate a ssh connection to a machine, type the following on the local workstation

ssh <login-name> @ <machine-name>.tacc.utexas.edu
Note that the <login-name> is only needed if the user name on the machine being logged onto differs from the user name on the workstation.

Last modified: July 14 2009 10:51:40.


Login Info
Login Shell User Environment Startup Scripts Modules

Login Shell

The most important component of a user's environment is the login shell that interprets text on each interactive command line and statements in shell scripts. Each login has a line entry in the /etc/passwd file, and the last field contains the shell launched at login. To determine your login shell, execute:

grep <my_login_name> /etc/passwd {to see your login shell}

You can use the chsh command to change your login shell; instructions are in the man page. Available shells are listed in the /etc/shells file with their full-path. To change your login shell, execute:

cat /etc/shells {select a <shell> from list}
chsh -s <shell> <username> {use full path of the shell}


User Environment

The next most important component of a user's environment is the set of environment variables. Many of the Unix commands and tools, such as the compilers, debuggers, profilers, editors, and just about all applications that have GUIs (Graphical User Interfaces), look in the environment for variables that specify information they may need to access. To see the variables in your environment execute the command:

env {to see environment variables}

The variables are listed as keyword/value pairs separated by an equal (=) sign, as illustrated below by the HOME and PATH variables.

HOME=/home/utexas/staff/milfeld
PATH=/bin:/usr/bin:/usr/local/apps:/opt/intel/bin

(PATH has a colon (:) separated list of paths for its value.) It is important to realize that variables set in the environment (with setenv for C shells and export for Bourne shells) are "carried" to the environment of shell scripts and new shell invocations, while normal "shell" variables (created with the set command) are useful only in the present shell. Only environment variables are seen in the env (or printenv) command; execute set to see the (normal) shell variables.


Startup Scripts

All Unix systems set up a default environment and provide administrators and users with the ability to execute additional Unix commands to alter the environment. These commands are "sourced"; that is, they are executed by your login shell, and the variables (both normal and environmental) as well as aliases and functions are included in the present environment. We recommend that you customize the login environment by inserting your "startup" commands in .cshrc_user, .login_user, and .profile_user files in your home directory.

Basic site environment variables and aliases are set in

/etc/csh.cshrc {C-shell, non-login specific}
/etc/csh.login {C-shell, specific to login}
/etc/profile {Bourne-type shells}

For historical reasons, the C shells source two types of files. The .cshrc type files are sourced first (/etc/csh.cshrc--> $HOME/.cshrc--> /usr/local/cshrc--> $HOME/.cshrc_user). These files are used to set up environments that are to be executed by all scripts and used for access to the machine without a login. For example, the following commands only execute the .cshrc type files on the remote machine:

scp data lonestar.tacc.utexas.edu: {only .cshrc sourced on lonestar}
ssh lonestar.tacc.utexas.edu date {only .cshrc sourced on lonestar}

The .login type files are used to setup environment variables that you commonly use in an interactive session. They are sourced after the .cshrc type files (/etc/csh.login--> $HOME/.login--> /usr/local/login-->
$HOME/.login_user
). Similarly, if your login shell is a Bourne shell (bash, sh, ksh, ...), the profile files are sourced (/etc/profile--> $HOME/.profile--> /usr/local/profile--> $HOME/.profile_user).

The commands in the /etc files above are concerned with operating system behavior and set the initial PATH, ulimit, umask, and environment variables such as the HOSTNAME. They also source command scripts in /etc/profile.d -- the /etc/csh.cshrc sources files ending in .csh, and /etc/profile sources files ending in .sh. Many site administrators use these scripts to setup the environments for common user tools (vim, less, etc.) and system utilities (ganglia, modules, Globus, LSF, etc.)

TACC has to coordinate the environments on platforms of several operating systems: AIX, Linux, IRIX, Solaris, and Unicos. In order to efficiently maintain and create a common environment among these systems, TACC uses its own startup files in /usr/local/etc. (A corresponding file in this etc directory is sourced by the .profile, , and .login files that reside in your home directory. (Please do not remove these files and the sourcing commands in them, even if you are a Unix guru.) Any commands that you put in your .login_user, .cshrc_user, or .profile_user file are sourced (if the file exists) at the end of the corresponding /usr/local/etc command files. If you accidentally remove your .login, .cshrc, and .login, you can copy new ones from /usr/local/etc/start-up or execute

/usr/local/bin/install_ut_startups

to get a new copy (your old files are renamed with a date suffix).


Modules

TACC is constantly including updates and installing revisions for application packages, compilers, communications libraries, and tools and math libraries. To facilitate the task of updating and to provide a uniform mechanism for accessing different revisions of software, TACC uses the modules utility.

At login, a basic environment for the default compilers, tools, and libraries is set by several modules commands. Your PATH, MANPATH, LIBPATH, directory locations (WORK, ARCHIVE, HOME, ...), alias (cdw, cda, ...) and license paths, are just a few of the environment variables and aliases created for you. This frees you from having to initially set them and update them whenever modifications and updates are made in system and application software.

Users who need 3rd party applications, special libraries, and tools for their development can quickly tailor their environment with only the applications and tools they need. (Building your own specific application environment through modules allows you to keep your environment free from the clutter of all the other application environments you don't need.)

Each of the major TACC applications has a modulefile that sets, unsets, appends to, or prepends to environment variables such as $PATH, $LD_LIBRARY_PATH, $INCLUDE_PATH, $MANPATH for the specific application. Each modulefile also sets functions or aliases for use with the application. A user need only invoke a single command,

module load <application>

at each login to configure an application/programming environment properly. If you often need an application environment, place the modules command in your .login_user and/or .profile_user shell startup file.

Most of the package directories are in /usr/local/apps ($APPS) and are named after the package name (<app>). In each package directory there are subdirectories that contain the specific version of the package. The APPS directory structure is shown in the diagram below:

Lonestar Applications Directory Structure

TACC Applications Directory Structure

The directory structure for the fftw package is shown below. The directory fftw in /usr/local/apps contains 3 different version directories for the package: 2.1.3, 2.1.5 and version 3.0. Since fftw-2.1.5 is the present default version, a fftw link is created to the default, the fftw-2.1.5 subdirectory.

Example FFTW Applications Directory Structure

Example FFTW Applications Directory Structure

The directory paths for the different fftw package versions, can be constructed easily with the help of the $APPS variable:

$APPS/<app>/<app.version> {path to specific package version}
$APPS/<app>/<app> {link to default version}
$APPS/fftw/fftw {example, default version directory for fftw}
$APPS/fftw/fftw-2.1.3 {example, directory for earlier version of fftw}

The fftw package requires several environment variables that point to its home, libraries, include files, and documentation. These can be set in your environment by loading the fftw module:

module load fftw

The details of the environmental changes are in the modulefile, /usr/local/opt/modules/modulefiles/fftw. To see a list of available modules and a synopsis of a modulefile's operations, execute:

module available {lists modules}
module help <app> {lists environment changes performed for <app>}

During upgrades, new modulefiles are created to reflect the changes made to the environment variables. TACC will always announce upgrades and module changes in advance.

Another feature of modules is the ease in changing the environment for experimenting with new updates or backing down to older application versions. TACC will often make a link from <app>.new to the updated package modulefile (<app>.<new-version>) that has not become the default version yet. Also, the retired default modulefile is often linked to <app>.old. This makes it easier for users to change to new or old environments with the commands:

module swap <app> <app>.old
module swap <app> <app>.new

(If the app module has not been loaded, then it is only necessary to load the new or old version; e.g. module load <app>.old.)

For more information on modules and a description of how to build modulefiles, check out the man pages and the following link.

For information on customizing your login, check out this link.

Last modified: July 14 2009 10:51:42.


File Systems

The TACC HPC platforms have several different file systems with distinct storage characteristics. There are predefined, user-owned directories in these file systems for users to store their data. Of course, these file systems are shared with other users, so they are managed by either a quota limit, a purge policy (time-residency) limit, or a migration policy.

To determine the size of a file system, cd to the directory of interest and execute the "df" command with the syntax:

df -k .

or simply execute it without the "dot" to see all file systems. In the example below the file system name appears on the left, and the used and available space (-k, in units of 1KBytes) appear in the middle columns followed by the percent used:

% df -k .          
File System 1k-blocks Used Available Use% Mounted on
/dev/vg/home 8256952 6675732 1161792 86% /home

To determine the amount of space occupied in a user-owned directory, cd to the directory and execute the du command with the -sb option (s=summary, b=units in bytes):

du -sb

To determine quota limits and usage on $HOME, execute the quota command without any options (from any directory):

quota

The four major file systems and directories available on TACC HPC machines are:

home directory
The system automatically changes to a user's home directory at login and this is the recommended location to store your source codes and build your executables.
Lonestar quota limit is 200MB.
Use $HOME to reference your home directory in scripts.
Use cd to change to $HOME.

work directory
Store large files and perform most of your job runs from this file system. This file system is accessible from all the nodes.
Lonestar quota limit is 500GB.
Use $WORK to reference your work directory in scripts.
Use cdw to change to $WORK.
Files are purged if they have not been accessed within 10 days.

scratch or temporary directory
This is the directory on each node where you can store files and perform local I/O for the duration of a batch job. Often, in batch jobs it is more efficient to use and store files directly on $WORK (to avoid moving files from scratch at the end of a job). The scratch directory is only available for the duration of a job.
Lonestar file system size and limit is 25GB on compute nodes.
Use $SCRATCH to reference the scratch directory in scripts.

san directory
The SAN directory is available on login nodes (front-ends) of Lonestar, Wrangler, and Champion. Space on the SAN is an allocatable resource; that is, space is not automatically available to a project, the Principal Investigator must request space on this file system.
Lonestar project limit is 250GB.
The top level directory of the san is /san/hpc.

archive directory
Store permanent files here. This file system has "archive" characteristics (see below). The access speed is low relative to the work directory.
Lonestar archive has no space limit..
Use $ARCHIVE to reference your archive directory in scripts.
Use cda to change to $ARCHIVE.

$HOME Directories

A user's home directory is the place to store files that are routinely used in development and day-to-day work. If the output files from production runs are small, then it is reasonable to store them in $HOME. Home directories are backed up daily; so, if you accidentally remove a critical file, submit a request using the Portal to recover the last saved version (include the full path name of the file(s) or directory, as well as the machine name). Since the home file system is of limited size, a 200 megabyte quota limit is imposed on every user (the quota limit is machine specific).

$WORK Directories

The work file system is configured with fast disks on TACC machines and should, therefore, be used when I/O performance significantly affects program performance. Work can also be used to store large files temporarily. The work file system may be as simple as SCSI disks arranged in a RAID-3/4/5 configuration and exported through NFS to compute nodes. On Champion parallel and global access to work is made through GPFS, a General Parallel File System. On Lonestar work is an (LUSTRE) File System; it can be used for parallel I/O; and it is accessible from the development/login nodes and all compute nodes.

The files in work are NOT backed up and are temporary. Files that are corrupted or accidentally removed are not recoverable. Files that are not accessed within 10 days are removed. (Each night a "scrubber" program evaluates access times for every file and removes outdated files. The scrubber does not remove any files if the user has a running batch job.) Reminder: for permanent storage use the TACC data archive. To see a daily log of the files that have been removed from your directory, view the file:

/work/reaver/$USER_YYYYMMDD

where $USER is evaluated as your login name, and YYYY, MM, and DD are the year(4 digits), month(2 digits), and day(2 digits) of the log date.

PLEASE NOTE: TACC staff may delete files from work if the work file system becomes full and directories consume an inordinately large amount of disk space, even if files are less than 10 days old. A full work file system inhibits use of the file system for ALL users. The use of programs or scripts to actively circumvent the file purge policy will not be tolerated.

$SCRATCH Directories

$SCRATCH is a "scratch" directory on each compute node, where a batch job can store files or perform local I/O. The $SCRATCH directory on each node is scrubbed after the job completes, so a job script's final commands should copy any valuable data to home or work. The scratch directory is the /tmp file system on the Linux clusters. On the Lonestar front-end, $SCRATCH is /tmp. If you use $SCRATCH on the front end, please create your own subdirectory, e.g. $SCRATCH/$USER, to isolate your files within a single directory.

$SAN Directory

The TACC SAN is a Storage Area Network that is accessible from the front-ends of Lonestar, Wrangler, and Champion. The /san/hpc/<project_name> directories are for projects that have been awarded (allocated) long-term space. The present configuration has ~5TB space for persistent, project-oriented storage. For more details read:

More on SAN.
$ARCHIVE Directories

For long term file storage, use the archive file system ($ARCHIVE). This file system physically resides on an SGI Origin 2000 (archive.tacc.utexas.edu), a machine dedicated to supporting the archive file system.

A user's archive directory is available on all TACC HPC computers because the archive file system is mounted as /archive on the front-end of each system. It appears as a normal UNIX file system but is managed by DMF, SGI's Data Migration Facility. Files that have not been accessed in 10 days are moved offline (migrated) to tape via two StorageTek 9310 robots. DMF automatically and transparently performs the archival and retrieval of files from the tape robot system. When an off-line file is accessed, DMF automatically retrieves the file while the process that is accessing the files waits. Under normal circumstances it takes less than a minute for the robots to start streaming a file's data back to the disk and for the user's process to continue.

Note: On all TACC production machines (Champion,Lonestar) $ARCHIVE is only mounted on (available from) the front-end and NOT from any of the compute nodes. As a consequence, users must be careful to migrate files to and from $ARCHIVE before jobs are submitted or after completion of their run. Access of any files located in $ARCHIVE through Loadleveler/LSF script options will result in the job hanging in an idle state.

When moving files from archive to a local file system (home or work) use ftp or rcp if the files are large (>100MB). The cp command transfer rates from an NSF mounted file system are about 5 times slower than a transfer with ftp or rcp.

rcp ${ARCHIVER}:$ARCHIVE/myfile    $WORK

For the fastest large-file transfer to a local work directory from archive, use the command above. Don't forget to include the archive machine name ($ARCHIVER is defined for you), else the rcp utility will simply use "cp" to do the transfer. In some shells, curly braces may be required around the environment variables, e.g. ${ARCHIVER} when followed by a colon (:).

Last modified: July 14 2009 10:51:40.


Programming Models

For both the Intel clusters and the IBM Power5 SMP clusters, there are a couple of programming models for consideration.

First, in the distributed memory, message-passing model one may use either the SPMD or the MPMD programming paradigms. With the former, each processor loads the same program image and executes but on different data sets. This is dependent on the local portion of distributed arrays according to their distribution, array sizes, and number of processors determined at runtime. With the MPMD paradigm, each processor loads up and executes a different program image and on different data sets, with a similar set of parameters as the SPMD paradigm. This distributed memory message passing programming model strictly uses MPI for its communication. This is advocated as the best model for optimal use of the clusters and is also suitable for the Power5 nodes.

The other programming model that is always recommended on the Symmetric Multi- Processor based nodes like the IBM Power5 nodes or Parallel Vector Processing (PVP) systems, is the SMP (Shared Memory Programming) model using either proprietary directives (or pragmas), vendor implementations of standards like OpenMP, or explicit threaded implementations such as Pthreads. Here, the major point of distinction is that all of the processors in the same node share data, hence strict processor to processor message-passing is unnecessary. Further, additional parallelism can be extracted by use of threads or multi-threaded implementations of the compilers, libraries, or tools. With the present state-of-the-art operating systems, compilers, and tools it is highly recommended to use OpenMP on SMP based nodes with MPI across nodes in a hybrid programming environment. There isn't sufficient evidence at this time to suggest that using OpenMP with MPI in a hybrid environment on the clusters would be of great benefit from the performance point of view. However, we will inform users if this changes in the future.

For further information on OpenMP, MPI and on programming models/paradigms, please see the manuals and packages sections for documentation.

Last modified: July 14 2009 10:51:40.


Compilation
Serial OpenMP MPI Optimization Loading Libraries Limiting Memory Usage

Compiling and Running Serial Programs

The compiler commands can be used for both compiling and loading (making an executable from a ".o" object file). The table(s) below lists the syntax for serial and parallel program compilation.

Compiling Serial Programs

Compiler Program TypeSuffix Example
xlc_r c .c xlc -c [options] ibmc.c
xlC_r C++ .cpp, .C xlC -c [options] ibmcpp.C
xlf_r F77 .f xlf -c [options] ibmfor.f
xlf90_r F90 .f xlf90 -c [options] ibmf90.f

Appropriate program-name suffixes are required for each compiler. By default an a.out executable is created by the compiler invocation. The executable may be renamed with the -o option, and execution is performed by specifying the executable on the command line. Also "options" denote additional compiler non-default options and may be added during the compilation.

Note: For those users migrating from Longhorn Power4 system, for the sake of uniformity, we are now recommending two major changes. First, use of the re-entrant version of the compiler (or use of compilers with trailing"_r" symantic notation) for all types of codes: serial, threaded or parallel. The second change is for building 64-bit codes for all application. All the threaded and parallel codes see a clear performance benefit from these changes. Additional details on this subject is provided below.

Compile/Link code prog.cpp or prog.f, naming the executable prog
C xlc_r -o prog -q64 [other options] prog.c [linker options]
Fortran xlf90_r -o prog -q64 [other options] prog.f [linker options]

Here "linker options" denote zero or more linker loader options such as library directories and their names. Also the "option" compiler options take precedence over the "link loader options."

To run the above interactive program, execute:

./prog

The relative path expression "./" tells the shell to look in the present working directory for the executable. It is often used to make sure that an executable of the same name in another directory (as determined by the PATH environment variable) is not executed. Also, if the "." is not in the PATH variable it is necessary to use "./" for the shell to find the executable. If it is in the PATH variable however, just typing the executable name is suffient to run your program.

A brief discussion of compiler options, their syntax, and a terse explanation is given below in the optimization section. In addition, typing in the compiler names xlc or xlf etc., will give additional information on all of the compiler options. Online documentation and user guides for compilers are also available.


Compiling OpenMP Programs

All the nodes are eight-way symmetric multiprocessors using shared memory, and as such are a perfect model for using Shared Memory Programming (SMP) paradigm. All the nodes have an equal amount of shared memory, roughly 14 GBytes, available for application use. Using any of the nodes as an SMP, they may be effectively used in combination with thread-safe compilers with appropriate compiler options and setting run-time variables, and libraries optimal for SMP environment.

The compilers are used similar to serial compilation, as given below, except in the slight change in the naming syntax. As an example, a program compiled using the SMP environment by the xlf90_r compiler is:

xlf90_r -qsmp=noauto:omp -qnosave -o prog [options] progOMP.f [linker options]

Use "-qsmp=noauto:omp" when compiling programs with OpenMP directives or pragmas, to disable automatic parallelization by the compiler. The "-qnosave" option sets all local variables as automatic, and is required to ensure correct behavior of Fortran programs that call routines within a parallel region. Also note that -qhot is turned on, by "-qsmp=auto" but not by "-qsmp=omp, so make sure to include the correct SMP flag for high order transformation. The option -qnohot should be used for consistency with directive/pragma based SMP runs. Preferably any optimizations including -qhot should be tested in a single threaded manner before using -qsmp, whenever possible.

Note that unlike the serial programs, any programs using threads, or parallel MPI calls, must use the reentrant version of the compiler or the compiler using the _r symantic notation . Here the xlf90_r is the xlf90 compiler which links to the thread-safe components, as well as other threaded version of the libraries.

Similarly, any compiler can be used in its place with the same syntax change. The compiler options are the same as the serial ones. The compiler options specifically using the SMP environment either automatically and/or with/without directives must be included as part of the linker options . A list of such options with description are given below Compiler Options. The use of SMP optimized libraries are given below Performance Libraries .


Compiling Parallel Programs with MPI

The compiler commands can be used for both compiling and loading (making an executable from a ".o" object file). IBM MPI shell scripts for Power5 servers are prefixed with mp. The table(s) below lists the syntax for serial and parallel program compilation.

Compiling Parallel Programs

Compiler Program TypeSuffix Example
mpcc c .c mpcc_r [options] ibmc.c
mpCC_r C++ .cpp, .C mpCC_r -cpp -q64 [other options] ibmcpp.C
mpxlf_r F77 .f mpxlf_r -q64 [other options] ibmfor.f
mpxlf90_r F90 .f mpxlf90_r -q64 [other options] ibmf90.f

Appropriate program-name suffixes are required for each compiler. By default an a.out executable is created by the compiler invocation. The executable may be renamed with the -o option, and execution is performed by specifying the executable on the command line. Also, "options" denote additional compiler non-default options and may be added during the compilation. For C++ programs in MPI, the mpCC_r needs to be used with -cpp option, along with any other options of the user's choice.

Compile/Link code prog.c or prog.f, naming the executable prog
C mpcc_r -o prog -q64 [other options] prog.c [linker options]
Fortran mpxlf90_r -o prog -q64 [other options] prog.f [linker options]

Here "linker options" represents zero or more linker loader options. Also the "option" compiler options take precedence over the "link loader options."

As we briefly stated above, from performance standpoint it is recommended that all MPI programs use the 64-bit version of the MPI libraries. These 64-bit MPI libraries are thread-safe, hence the re-entrant version of the compilers should be used in conjunction. For those needing to use the 32-bit MPI libraries do not need to use the _r semantic in the compiler invocation but will also see poorer performance from the 32-bit versions of these libraries. There are additional details on this issue later in the section for MPI Parallel Programming Environment .

To run the above compiled program interactively, execute:

% poe ./a.out -procs n

A brief list of compiler options, their syntax, and a terse explanation is given below in the optimization section. In addition, man mpcc or man mpxlf etc. provide additional details on all of the compiler options. User guides and online documentation provide additional resources.


Program Optimization for Serial and Parallel Programming using OpenMP and MPI

The MPI shell scripts can compile programs using the normal serial compiler options. Thus, as an example, any of the compiler flags normally accepted by the xlc command can also be used on mpcc. Given below are some of the common serial compiler options with descriptions.

Compiler Options Description
-O3 performs some compile time and memory intensive optimizations in addition to those executed with -O2. They also have the potential to alter semantics of a user's program.
-qarch=pwr5 (or auto) specifies instruction-set architecture for Power5 hardware
-qtune=pwr5 (or auto) produces object files optimized for Power5 architecture, including instruction scheduling and memory hierarchy.
-qhot high order transformations to maximize efficiency of loops and array language.
-qcache=auto specifies cache configuration for compiling machine or relevant executing environment.
-qipa optimizes by performing analysis across procedures.
-bmaxdata: Specifies the maximum amount of space to reserve for the program data segment for programs where the size of these regions is a constraint (units=bytes).

Warning: It has come to our attention that certain codes are giving erroneous results on use of the -qhot (high-order transformation) compiler option. Note that, for optimization levels above -O3, -qhot is used as default. Please verify your results for correctness when compiling at that level or when using -qhot directly. Using the option -qnohot avoids using any high-order transformation even at -O4 or -O5 compiler option level. This bug has been brought to the IBM compiler groups attention. We will apply any future patches for this issue.

In addition, in a Shared Memory Programming (SMP) environment, some of the additional compiler options, which must be used along with runtime environment variables and SMP optimized libraries, are listed below.

SMP Compiler Options Description
-qsmp Enables use of shared memory parallelization by user directives, OpenMP or explicitly or by MPI calls.
-qsmp=omp Explicit or User directed OpenMP environment or IBM proprietory directives (or pragmas for C/C++).
-qsmp=auto Automatic parallelization enabled.
-qsmp=explicit Explicit parallelization by use of Pthreads

For further details on additional compiler options on both IBM serial compilers and their respective MPI scripts, please see the online documentation and man pages.


Limiting Memory Usage

On the Champion Power5 system, the default object mode is 64-bit, which means that in the default environment you will build executables that run in 64-bit address mode. Large-memory applications use the -q64 compiler/loader option to enable access to essentially all of the available memory of any of the TACC Power5 nodes. All of the third-party applications will be built 64-bit, by default. The 32-bit versions of these applications will only be built for users who need them for dependencies in there legacy code.

The job memory-per-task limit specified in the ConsumableMemory resource statement is used by LoadLeveler to assign an appropriate queue to a job proportional to the number of parallel tasks and global memory required by the job. It does not actually limit the memory used by the executable on the Champion system.

Hence, large-memory applications (that use the -q64 option), in particular those that use dynamic memory, should be compiled not to exceed their "fair share" of physical memory per task (processor).

 Compile with -bmaxdata=1887000000 option 
Exceeding this limit will cause the executable to swap, and in some cases the node will crash when the total virtual memory requested exceeds the physical memory plus the available swap space (the swap disk space has been configured to equal the physical memory size). To avoid this problem, we ask developers to use the maxdata loader option to limit the memory used by executables. Note that for the new Power5 Champion system this limit will be enforced where the node usage is being shared. For users in dedicated mode, this memory limit will not be strictly enforced and monitored, but users should be aware of the performance impact.


Loading Libraries

Some of the more useful load flags/options are listed below. For a more comprehensive list, check out the ld manpage.

Last modified: July 14 2009 10:51:40.


Running Code
Runtime Environment Running Interactive Programs Running Batch Code

Runtime Environment

Parallel Environment
MPI
OpenMP

Parallel Environment

The Parallel Environment (PE) from IBM is designed for the development and execution of parallel C, C++, and Fortran programs and has the following:

Parallel Operating Environment

The Parallel Operating Environment (POE) is one part of the Parallel Environment. It can be looked upon as a "user interface" to PE and has an interactive parallel shell with a syntax similar to ksh.

Setting the Environment Variables

There are two ways to configure the way a parallel program is executed -- with environment variables or command-line arguments. The POE environment variables are set using either setenv (csh/tcsh) or export (ksh, bsh) commands and can be set:

The environment variables can be over-ridden with POE command line flags. The environment variable names are the same as the command line option names, but they start with "MP_" and are all in upper case. For example, the command line flag -procs has the corresponding environment variable "MP_PROCS." The following are a few, typical POE command line flags:

Set number of nodes or tasks

  • procs - number of task processes
  • nodes - number of physical nodes on which to run parallel tasks
  • tasks_per_node - number of tasks to be run on each of the physical nodes

Those interested in using POE command line flags exclusively, instead of using them through job scripts, should learn about the remaining flags viewed here. Those interested in using the environment variables through the use of Loadleveler job scripts, should go here.


MPI Parallel Programming Environment

The IBM Power5 nodes are connected to each other via the high performance singly-linked Federation or High Performance Switch (HPS) interconnect. All off node communication is available for use through the IP protocol as well as the recommended User Space (US) protocol for better performance over the HPS switch using both switch interfaces or adaptors on each node (device is "sn_all"). Users can thus run their MPI or OpenMP applications with the HPS switch interconnect with either protocol submitted through their batch scripts. In addition, the MPI implementation now uses threads instead of signals for asynchronization activities.

MPI can use several protocols for communication between tasks. IP (internet protocol) can be used for communication between tasks on a node or on different nodes. Aside from relatively high latencies, this also incurs high overhead from use of the slow IP protocol. This availability is a useful alternative on such systems which do not have the SP interconnect switch, since it is the only mode of communication protocol for ethernet, GigE networks etc., for MPI tasks. Off node communication between tasks is fastest when the US (user space) protocol is used over the HPS. Although use of this protocol significantly reduces the overhead and latency when compared to IP, on node communication may be more efficiently handled using the shared memory protocol.

In addition, the MPI implementation is now layered over the Low-level Application Programming Interface (LAPI) transport layer , now available in the current version of the PE. In contrast to the "reliable byte stream" approach of the previous PE, LAPI provides a "reliable message" protocol, which uses much less storage for jobs with a large number of tasks. It is based on an "active message style" mechanism that provides a one-sided communications model in which one process initiates an operation and the completion of that operation does not require any other process to take a complementary action. In addition, the 64-bit MPI library has shown improved performance particularly with features such as RDMA, so it is recommended for users to build there application in 64-bit, even if they do not have any large memory requirements.

The Remote Direct Memory Access (RDMA) data transfer feature has also been made available on the High Performance Switch interconnect. RDMA allows the network adapter direct access to the user memory of the application.This allows the network adapter to move data between computational tasks with minimal CPU involvement, offering greater potential for overlapping calculation and communication when using non-blocking communication. RDMA enables large messages to utilise both network adapters attached to each of the compute nodes of the p5-575 system without utilising multiple CPUs. This promises a substantially increased bandwidth for large messages, particularly for the 64-bit applications. We show below how RDMA use is enabled by environment variables.

The parallel operating environment or POE is responsible, among other things, for managing all communication between the MPI processes whether on-node or between nodes. The runtime behavior of MPI jobs can be modified by configuring the many environment variables in POE.


OpenMP Parallel Programming Environment

Implementation of shared memory parallelization is done through the creation of user threads, which are mapped to kernel threads by the operating system.

OpenMP codes run initially with just one user thread, but a team of threads is created when the first parallel region is encountered. Upon exiting the parallel region, only one thread resumes execution of the serial portion, while the rest consumes CPU time waiting for the next parallel section. The maximum time a thread spends waiting is regulated by the environment variable SPINLOOPTIME. After this maximum interval is exceeded, a thread can either go to sleep or yield its place on the kernel to another runnable thread, provided that a yield time has been specified as part of the OpenMP environment. Reactivation of a sleeping thread is more costly than one that is in a yielded state.

The behavior of threads can be modified by setting several POE environment variables. Please check the reference for more details.


Running Interactive Programs

All parallel jobs run under the Parallel Operating Environment (POE). Use the interactive session only to develop and debug your parallel programs. Use a command similar to the one given below for interactive jobs.

% poe your_job job_args -tasks_per_node n -nodes 1 -rmpool 1

n is the number of processors on which one wants to run. Interactive runs, which are queued through the development class, are limited to up to 5 processors, for 4 hours and to upto 8 GBytes of total memory requirements.


Running Batch Code using LoadLeveler

What is LoadLeveler?
LoadLeveler Batch Resources
Building a Simple Job Command File
Building an Advanced Job Command File
LoadLeveler Keywords
Submitting a Job
Managing a Job
Summary of LoadLeveler Commands


What is the LoadLeveler?

LoadLeveler is a batch job scheduling software that provides the ability to submit and manage jobs. It has three interfaces.


LoadLeveler Batch Resources on Champion

New Resource Limits

Following the recent upgrade with the IBM Power5 Node(s), the entire system consists of the eleven eight-way IBM Power5 HPC nodes, and a login node with 7 processors available for computation. The complete system consists of 96 Power5 compute processors with an aggregate memory of 192 GB, 7.2 TB of total disk and theoretical peak performance of about 730 Gigaflops .

In order to optimize node usage or job throughput on the newly configured Power5 system, a new resource policy has been implemented. The following table summarizes the new resource limits:

Table I: New resource limits on the TACC IBM Power5 system

# cpus memory walltime class
2-5 procs 1.8 GB/procs 4 hrs limit development
1-2 procs 1.8 GB/procs 12 hrs limit serial
upto 32 procs 1.8 GB/procs 24 hrs. limit normal

Users need to understand the logic employed by the filter which parses the jobscript commands and schedules the job accordingly. Due to the limitation of the total processor counts, and the homogeneous nature of the Champion system, relative to the Longhorn system, the filter logic is simplified. If certain user options are not allowed, then the filter provides the closest options for the user to make the appropriate choices. The filter logic employs the following general logic guidelines:


Building a Simple Job Command File

For all batch submitted jobs, a job script or job command file needs to be created. This section is for the early or first time users to the IBM system who need to start with a simple job script described in this section. Users already experienced with Loadleveler job scripts should go to the next section where the intricacies for building more advanced and detailed job scripts are described.

Some of the general rules and information necessary for building a simple job script are given below followed by examples of scripts for an MPI and OpenMP job.

  1. Keyword statements, which are case insensitive, begin with #@ followed by LoadLeveler keywords and statement components; to comment out any statements begin with ##;
  2. Common functionality such as shell, initialdir, input, output, job name etc., are common to all job scripts, and should be taken verbatim from the examples below and modified according to specific user and job.
  3. Shell command statements can go anywhere except where LoadLeveler keyword statements are being defined.
  4. A special keyword statement used in the simple job script is the total number of tasks for MPI jobs and number of threads for OpenMP jobs. Either of these options should be specified by the user in the job script as tasks or threads , depending on whether it is an MPI or an OpenMP job respectively.
  5. Users should also have an idea of the amount of memory the job consumes. If it is not specified, the job filter assumes the maximum limit which may not be sufficient for the job. The job will then be scheduled and may exit with erroneous statements. The memory statement is another special keyword statement and can be specified as

    #@ memory = 2

    This implies that the memory per task or thread is 2MB.The total memory is then computed by the job filter by multiplying it with the number of tasks or threads that are specified. Note that generally memory per task or thread should not exceed 2 GB or 2000 MB.

  6. Finally, the user should have a rough idea of the runtime. This is for fairness of how jobs are scheduled. Jobs which take a shorter runtime are scheduled ahead of those that are longer, all other things being equal, so the user may be penalised in seeing the expected turnaround time. The specification for the runtime, another special keyword statement, can be done as follows:

    #@ walltime = 00:10:00

    This specifies that the wall clock time is 10 minutes.

  7. Based on the user providing correct information on all of the above three special keyword statements, the job filter will then infer the other keyword statements that are necessary for the job to be scheduled. In certain situations, these generated conditions may create contradictory options for the filter, which is then returned to the user for making the appropriate choice.
  8. Another Loadleveler keyword statement

    # @ job_type = parallel

    also has to be added in addition to the special keyword statements above for all MPI and OpenMP jobs. Although this keyword statement is not a special one, it is necessary for the job filter to know the classification of the job.

  9. Users who belong to multiple projects should specify the project name to charge, as shown below for the A-abc project:

    # @ account_no = A-abc

    Project names are listed in the accounting report at login.

  10. The environment variable OMP_NUM_THREADS has to be specified by the user for OpenMP jobs. All other environment variables will be automatically generated by the filter for both OpenMP and MPI jobs.

Note: Users should be aware that all of the special keyword statements can ONLY be used in the simple job scripts. Any combination of the special keyword statements with some of the more advanced keyword statements will result in incorrect filtering and will result in either the script being returned to the user or an erroneous job script sent to the scheduler.

An example of a job command file for MPI job follows:

#@ shell = /bin/csh
#@ initialdir = /home/login_name/work_dir
#@ job_name = my_job
#@ input = /dev/null
#@ output = $(job_name).o$(jobid)
#@ error = $(job_name).o$(jobid)
#@ notify_user = your_email
#@ job_type = parallel
#@ environment = COPY_ALL;
#@ walltime = 01:00:00
#@ tasks = 8
#@ memory = 100
#@ class = normal
#@ queue
poe ./a.out
An example of a job command file for OpenMP job follows:

#@ shell = /bin/csh
#@ initialdir = /home/login_name/work_dir
#@ job_name = my_job
#@ notify_user = your_email
#@ input = /dev/null
#@ output = $(job_name).o$(jobid)
#@ error = $(job_name).o$(jobid)
#@ job_type = parallel
#@ environment = COPY_ALL; LL_JOB=TRUE;
#@ walltime = 01:00:00
#@ threads = 8
#@ memory = 200
#@ class = normal
#@ queue
setenv OMP_NUM_THREADS 8
./a.out

Building an Advanced Job Command File

Every LoadLeveler job must be submitted through a job command file. The job command file has LoadLeveler statement lines with LoadLeveler keywords that describe the job. Some of the general rules/information are explained below, in addition to explaining the more important and complicated LoadLeveler keyword statements whose understanding is absolutely essential.

Note that the special keywords used for the simple jobscript CANNOT be used for the advanced version of the job script described in this section.

Different MPI and OpenMP jobs require separate job scripts and some examples of the different scenarios are listed following the explanation of the general rules below.

  1. Keyword statements, which are case insensitive, begin with #@ followed by LoadLeveler keywords and statement components; to comment out any statements begin with ##;
  2. Shell command statements can go anywhere except where LoadLeveler keyword statements are being defined.
  3. The fine grained resource management on node is done by the Work Load Manager (WLM). The following resource statement must be included in batch scripts.

    #@ resources = ConsumableCpus(CPUSpTask) ConsumableMemory (MEMpTASK)

             where  MEMpTASK = amount of MEMory  required Per TASK,
             and   CPUSpTASK = number of CPUs    required Per TASK.
             For OpenMP jobs use the value of OMP_NUM_THREADS for CPUSpTASK.
             For MPI    jobs use the value of        1        for CPUSpTASK.
            
    E.g. if ConsumableMemory(1024MB) is used with tasks_per_node=4, then 4096 Megabytes of memory will be reserved for the job. For OpenMP programs there is only 1 task, so you must make MEMpTASK equal to the amount of memory you need for ALL threads. [A master thread and its spawned group of threads is a single task and the ConsumableCPU's argument is the number of CPUs reserved per (for the) task.]

  4. Parallel MPI jobs that run across multiple nodes have to specify the use of the HPS Switch interconnect for off-node communication. This involves setting MP_EUILIB to US and MP_EUIDEVICE to sn_all. A concise way of doing this is to set the #@ network.MPI_LAPI keyword to the following
    #@ network.MPI_LAPI = <device>,<shared|not_shared>,<protocol>

    where <device> = sn_all uses dual-plane Switch adapters
      en1 Gigabit ethernet adapter
    where <protocol> = US to be used by the dual-plane Switch adaptors only
      IP can be used by ALL of the interconnects

  5. A number of environment variables , when used appropriately impact performance. Some of the environment variables are set as default by the system. Other environment variables need to be used under certain conditions. Here are a listing of some of them:

    Default Settings

    These settings are already set by default and the users need not declare them in the jobscript, unless they need to be changed.

    	 MP_SHARED_MEMORY=yes 
    	
    	 MP_EUIDEVICE=sn_all  
    
    	 MEMORY_AFFINITY=MCM 
    

    Using RDMA

    Users wanting to use the RDMA feature need to set the following environment in there job scripts, when running there 64-bit application

    	  MP_USE_BULK_XFER=yes  
    
    	  MP_BULK_MIN_MSG_SIZE=150K, (Note: this size can be increased upto 2MB.)   
    
            

    Special Environment Use

    The following environments can be declared and used in most of the general conditions but not under all circumstances. These exceptions are pointed out.

    	 MP_SINGLE_THREAD=yes (Except for  MPI-I/O or explict MPI threading routines)  
    
    	 MP_TASK_AFFINITY=MCM (Except OpenMP routines) 
    

  6. All OpenMP jobs, whether "serial" or "parallel", and hybrid MPI-OpenMP jobs which are "parallel", are so termed simply from the point of the number of tasks that are executed. But each of these tasks spawns multiple threads per process or task. This is done by the

    #@ resources = ConsumableCpus(n)..

    statement, where n is an integer between 1-8 for p575 HPC nodes. But this spawning of threads is equivalent to reserving n processes, not setting OpenMP threads. This is done by the statement

    setenv OMP_NUM_THREADS n

  7. For all MPI, OpenMP or a hybrid combination jobs, the job filter evaluates the number of processors to reserve per node by the following formula

    number of processors/node = tasks_per_node x ConsumableCpus(n)

    For MPI jobs, n=1, while for OpenMP jobs, n=OMP_NUM_THREADS.

An example of a job command file for MPI jobs on one node of p575 HPC node, follows:

#@ shell = /bin/csh
#@ initialdir = /home/login_name/work_dir
#@ job_name = my_job
#@ input = /dev/null
#@ output = $(job_name).o$(jobid)
#@ error = $(job_name).o$(jobid)
#@ job_type = parallel
#@ environment = COPY_ALL; MP_SINGLE_THREAD=yes;
#@ resources = ConsumableCpus(1) ConsumableMemory(100mb)
#@ network.MPI_LAPI = sn_all, not_shared, US
#@ wall_clock_limit = 01:00:00
#@ node = 1
#@ tasks_per_node = 8
#@ node_usage = not_shared
#@ notify_user = your_email
#@ notification = error
#@ class = normal
#@ queue
poe ./a.out

An example of a job command file for MPI jobs on multiple nodes on p575 nodes , follows:

#@ shell = /bin/csh
#@ initialdir = /home/login_name/work_dir
#@ job_name = my_job
#@ input = /dev/null
#@ output = $(job_name).o$(jobid)
#@ error = $(job_name).o$(jobid)
#@ job_type = parallel
#@ environment = COPY_ALL; MP_SINGLE_THREAD=yes;
#@ node_usage = not_shared
#@ resources = ConsumableCpus(1) ConsumableMemory(100mb)
#@ wall_clock_limit = 24:00:00
#@ network.MPI_LAPI = sn_all, not_shared, US
#@ node = 4
#@ tasks_per_node = 8
#@ notification = never
#@ class = normal
#@ queue
poe ./a.out

Note: Comparing the two MPI job scripts for running application codes on one node and for more than node, the main difference is specifying the appropriate network/adaptor information. In the latter script, that is contained in the line

#@ network.MPI_LAPI = sn_all, not_shared, US
An example of a job command file for OpenMP job on a single p575 node, follows:

#@ shell = /bin/csh
#@ initialdir = /home/login_name/work_dir
#@ job_name = my_job
#@ input = /dev/null
#@ output = $(job_name).o$(jobid)
#@ error = $(job_name).o$(jobid)
#@ job_type = parallel
#@ environment = COPY_ALL; LL_JOB=TRUE;
#@ resources = ConsumableCpus(8) ConsumableMemory(1000mb)
#@ network.MPI_LAPI = sn_all, not_shared, US
#@ wall_clock_limit = 01:00:00
#@ node = 1
#@ tasks_per_node = 1
#@ node_usage = not_shared
#@ notify_user = your_email
#@ notification = error
#@ class = normal
#@ queue
setenv OMP_NUM_THREADS 8
./a.out

Note that unlike the MPI jobs, the ConsumableCpus(8) statement above is critical for OpenMP jobs, which is "serial" in terms of processors (tasks_per_node = 1), but uses multiple threads per process or task. This statement reserves 8 processes, but does not set the number of OpenMP threads which is always done by the environment variable OMP_NUM_THREADS . See Rule 5 for further explanation.

An example of a job command file for hybrid MPI-OpenMP job across multiple p690 HPC nodes, follows: (Note that here a single MPI process uses multiple OpenMP threads.)

#@ shell = /bin/csh
#@ initialdir = /home/login_name/work_dir
#@ job_name = my_job
#@ input = /dev/null
#@ output = $(job_name).o$(jobid)
#@ error = $(job_name).o$(jobid)
#@ job_type = parallel
#@ environment = COPY_ALL; LL_JOB=TRUE
#@ resources = ConsumableCpus(4) ConsumableMemory(1000mb)
#@ wall_clock_limit = 24:00:00
#@ node = 2
#@ tasks_per_node = 2
#@ network.MPI_LAPI = sn_all,not_shared,US
#@ node_usage = not_shared
#@ notification = error
#@ class = normal
#@ queue
setenv OMP_NUM_THREADS 4
poe ./a.out

Note that in the above the job script, the #@ resources = ConsumableCpus(4) statement is critical for OpenMP jobs. In the above script, 4 total processors are reserved for MPI tasks (tasks_per_node x node) and 4 threads are reserved for each process or task (ConsumableCpus). Also note that this statement reserves 8 processes, but does not set the number of OpenMP threads which is done by the environment variable OMP_NUM_THREADS , which must ALWAYS be used for setting the correct number of OpenMP threads. See Rule 6 for additional information.


LoadLeveler Keywords

Keyword Description Example
environment specifies the inital environment variables when the job starts -- environment specifications should be separated with semicolons COPY_ALL
specifies all environment variables from your shell be copied
var = value
specifies the environment variable to which var should be set
error specifies the name of the file to use as standard error when the job step runs error = file_name
executable for serial jobs, executable identifies the name of the program to run -- if this keyword is not included and the job command file is a shell script, LoadLeveler uses the script file as the executable executable = script
initialdir this is the pathname to the working directory during execution of the job step initialdir = pathname or "." for current working dir.
input specifies the name of the file to use as standard input when the job step is run input = file_name (/dev/null if no input is reqd.)
job_name specifies the name of the job -- the name can be any combination of letters, numbers, or both job_name = my_job
job_type specifies the type of job to process -- valid entries are "parallel" or "serial" job_type = string
node specifies the minimum and maximum number of nodes requested by a job step -- at least one of these values must be specified node = [min][,max]
notification secifies when the user named in the notify_user keyword is sent mail notification = always | error | start | never | complete
  • always - notifies the user when the job begins, ends, or if there is an error
  • error - notifies the user only if the job fails
  • start - notifies the user only when the job begins
  • never - never notifies the user
  • complete - notifies the user when the job ends -- this is the default
notify_user specifies the user to whom mail is sent based on the notification keyword - the submitting user and the submitting machine are the defaults notify_user = login_name@tacc.utexas.edu
output specifies the name of the file to use as standard output when a job step runs output = file_name
node_usage whether a user wants to share their memory in the SMP node with other users. Depending on the class of job i.e. on 4 processors or 8 processors, the job filter always changes it to shared. node_usage = shared or not_shared
queue places one copy of the job step in the queue -- this statement is required and essentially marks the end of the job step  
resources specifies quantities of the consumable resources "consumed" by each job. The resources may be machine resources or floating resources. resources = name(count) ... name(count)
where name could be ConsumableCpus or ConsumableMemory. The count for ConsumableCpus is an integer between 1-8. The count for ConsumableMemory is integral units expressed in terms of MB or GB. The user requirement is checked against a filter and the job is not submitted if memory requirements are exceeded.
shell specifies the name of the shell to use for the job step -- the default is the shell used in the owner's password file entry shell = name
network device specifies the correct interconnect device for use; The other devices is en1 MP_EUIDEVICE = sn_all;
#@ network.MPI = csss;..;..
network internode communication specifies that parallel MPI programs use fastest internode communication available; other option is using ip protocol MP_EUILIB = us
#@ network.MPI = ..;..;us
tasks_per_node specifies the number of tasks to run per node of a parallel job and should be used in conjunction with the node keyword -- the maximum number of tasks a job step can request is limited by the total_tasks keyword in the administration file tasks_per_node = number
wall_clock_limit user limit for the elapsed time for job, but is tested for hard-limit by the filter wall_clock_limit = 04:00:00


Submitting a Job

A job can be submitted using the GUI xloadl or at the command line with the following code.

llsubmit my_job

LoadLeveler assigns the job a three part identifier and also sets the environment variables for the job. The identifier consists of the following:


Managing a Job We will go through some of the basic Loadleveler commands to illustrate how they can be used to monitor and/or manage user jobs.

Showing the Status of a Job

The LoadLeveler command llq is used to obtain information about jobs in the LoadLeveler queue. The following is a small snippet of job listing that is obtained with the llq command showing all three of the components of the heterogenous Power4 system. These listings are headed by abbreviated and/or cryptic acronyms (in some cases), which we will then explain.

Id                     Owner      Submitted  ST  PRI Class        Running On
------------          ----------- ---------  --  --- -----------  ----------
champ01.440.0          ux454432    7/13 11:23 R  50  normal       champ10
champ01.447.0          mlane       7/16 16:22 R  50  normal       champ07

Id: The identification number of the user submitted Job. Use this number to check status or cancel the job.
ST: State of job;
  R (running), job is currently running.
  I (Idle), job is waiting in queue to run but not currently running.
  H (hold), job is placed on hold by the user. The job will not run unless released from Hold state
Class: Which class the job is running. User determines if it is either normal or development and then the appropriate class according to the nodes is determined by the filter.
Running On: The node(s) -- champ10, champ07 or champ11 where the job has started running. The other numbers in the sequence are only of importance for system administration purposes and should not be of concern to the user.

Note: For special cases including when users erroneously access the $ARCHIVE filesystem at runtime, the job hangs in an idle state rather than quit with appropriate error messages. After the job tries to run once, the "(alloc)" state is displayed in the "Running on" column in the llq output.

Placing and releasing a hold on a job

A job can be placed on hold, where it will remain in the queue until it is released. To place a hold on a job with the identifier machine.123.0 use the following code.

llhold machine.123.0

To release the hold on the above job

llhold -r machine.123.0

Displaying the Status of a Machine

To display information about the machines in the LoadLeveler pool use the llstatus command. For a long listing use the -l flag as follows.

llstatus -l

Cancelling a Job

The llcancel command can be used to cancel either running or queued jobs as follows.

llcancel machine.123.0


Summary of LoadLeveler Commands

Command Function
llcancel cancels a submitted job
llclass returns information about LoadLeveler classes
llckpt used to checkpoint a single job step
llextSDR extracts adapter information from the system data repository (SDR)
llhold holds or releases a hold on a job
llmatrix returns GANG matrix information in the LoadLeveler cluster when GANG scheduling is used
llmodify changes attributes or characteristics of a submitted job step
llprio changes the user priority
llq queries the status of LoadLeveler jobs
llstatus queries the status of LoadLeveler machines
llsubmit submits a job
llsummary returns resource information on completed jobs
Last modified: July 14 2009 10:51:40.


Tools

Program Timers & Performance Tools

Measuring the performance of a program should be an integral part of code development. It provides benchmarks to gauge the effectiveness of performance modifications and can be used to evaluate the scalability of the whole package and/or specific routines. There are quite a few tools for measuring performance, ranging from simple timers to hardware counters. Reporting methods vary too, from simple ASCII text to X-Window graphs of time series.

The most accurate way to evaluate changes in overall performance is to measure the wall-clock (real) time when an executable is running in a dedicated environment. On Symmetric Multi-Processor (SMP) machines, where resources are shared (e.g., the TACC IBM Power4 P690 nodes), user time plus sys time is a reasonable metric; but the values will not be as consistent as when running without any other user processes on the system. The user and sys times are the amount of time a user's application executes the code's instructions and the amount of time the kernel spends executing system calls on behalf of the user, respectively.

Package Timers

The time command is available on most Unix systems. In some shells there is a built-in time command, but it doesn't have the functionality of the command found in /usr/bin. Therefore you might have to use the full pathname to access the time command in /usr/bin. To measure a program's time, run the executable with time using the syntax "/usr/bin/time -p <args>" (-p specifies traditional "precision" output, units in seconds):

/usr/bin/time -p ./a.out {Time for a.out execution}
real 1.54 {Output (in seconds)}
user 0.5  
sys 0  
/usr/bin/time -p mpirun -np 4 ./a.out {Time for rank 0 task}

The MPI example above only gives the timing information for the rank 0 task on the master node (the node that executes the job script); however, the real time is applicable to all tasks since MPI tasks terminate together. The user and sys times may vary markedly from task to task if they do not perform the same amount of computational work (are not load balanced).

Code Section Timers

"Section" timing is another popular mechanism for obtaining timing information. The performance of individual routines or blocks of code can be measured with section timers by inserting the timer calls before and after the regions of interest. Several of the more common timers and their characteristics are listed below in Table 1.

Routine Type Resolution (usec) OS/Compiler
times user/sys 1000 Linux/AIX/IRIX/UNICOS
getrusage wall/user/sys 1000 Linux/AIX/IRIX
gettimeofday wall clock 1 Linux/AIX/IRIX/UNICOS
rdtsc wall clock 0.1 Linux
read_real_time wall clock 0.001 AIX
system_clock wall clock system dependent Fortran 90 Intrinsic
MPI_Wtime wall clock system dependent MPI Library (C & Fortran)

For general purpose or course-grain timings, precision is not important; therefore, the millisecond and MPI/Fortran timers should be sufficient. These timers are available on many systems; and hence, can also be used when portability is important. For benchmarking loops, it is best to use the most accurate timer (and time as many loop iterations as possible to obtain a time duration of at least an order of magnitude larger than the timer resolution). The times, getrussage, gettimeofday, rdtsc, and read_real_time timers have been packaged into a group of C wrapper routines (also callable from Fortran). The routines are function calls that return double (precision) floating point numbers with units in seconds. All of these TACC wrapper timers (x_timer) can be accesses in the same way:

     external   x_timer                 double x_timer(void);
     real*8  :: x_timer                 ...
     real*8  :: sec0, sec1, tseconds    double sec0, sec1, tseconds;
     ...                                ...
     sec0     = x_timer()               sec0     = x_timer();
     ...Fortran Code                    ...C Codes
     sec1     = x_timer()               sec1     = x_timer();
     tseconds = sec1-sec0               tseconds = sec1-sec0

The wrappers and a makefile are available with a test example from TACC for instrumenting codes.

Standard Profilers

The gprof profiling tool provides a convenient mechanism to obtain timing information for an entire program or package. Gprof reports a basic profile of how much time is spent in each subroutine and can direct developers to where optimization might be beneficial to the most time-consuming routines, the "hotspots". As with all profiling tools, the code must be instrumented to collect the timing data and then executed to create a raw-date report file. Finally, the data file must be read and translated into an ASCII report or a graphic display. The instrumentation is accomplished by simply recompiling the code using the -qp (Intel compiler) option. The compilation, execution, and profiler commands for gprof are shown below for a Fortran program:

Profiling Serial Executables

ifc -qp prog.f90 {Instruments code}
a.out {Produces gmon.out trace file}
gprof {Reads gmon.out (default args: a.out gmon.out)
(report sent to STDOUT)}

Profiling Parallel Executables

mpif90 -qp prog.f90 {Instruments code}
setenv GMON_OUT_PREFIX gout.* {Forces each tasks to produce a gout.<pid>}
mpirun -np <#> a.out {Produces gmon.out trace file}
gprof -s gout.* {Combines gout files into gmon.sum}
gprof a.out gmon.sum {Reads executable (a.out) & gmon.sum
(report sent to STDOUT)}

Detailed documentation is available at www.gnu.org. A documented example of a gprof flat profile and call graph output is available to help you interpret gprof output.

Timing Tools

Most of the advanced timing tools access hardware counters and can provide performance characteristics about floating point/integer operations, as well as memory access, cache misses/hits, and instruction counts. Some tools can provide statistics for an entire executable with little or no instrumentation, while others requires source code modification.

PAPI Toolkit

The PAPI (Performance Application Programming Interface) toolkit is available on Tejas and Lonestar. A user guide and overview are available at the University of Tennessee-Knoxville Innovative Computing Laboratory's PAPI home page. PAPI collects hardware monitor data and provides raw counts and/or derived performance information. A simple easy-to-use, high-level interface is available for most of the common performance metrics. There is also a low-level interface for more control and direct access to the native hardware events.

To use PAPI on TACC systems, first load the PAPI module using the command module load papi. This will set some environment variables that can then be used for compiling and linking. Once the PAPI module is loaded, a list of the high-level PAPI counters available on the system can be shown by running $PAPI_DIR/ctests/avail. The third column of the output indicates if the PAPI counter is available for use on the system processor. The available low-level native counters can be listed by running $PAPI_DIR/ctests/native_avail. The "$PAPI_DIR/ctests" and "$PAPI_DIR/ftests" directories contain many programs which can be examined to see examples of how PAPI may be implemented. Some of the more interesting events collected by PAPI are shown in Table 2 below. It is important to note that there are a limited number of counters available and some PAPI event counters may interfere with other counters. It is up to the user to ensure that the desired counters are operating correctly. Also, make sure to always check the error codes to verify that the PAPI functions are running properly. An example C program that can be used to test the different counters can be downloaded from here.

IMPORTANT: Programs with PAPI functions will not run correctly on the Lonestar head node. Only run these programs by submitting them as batch jobs on the compute nodes.

PAPI Name Description
PAPI_BR_INS Branch Instructions
PAPI_FP_INS Floating Point Instructions
PAPI_TOT_INS Total Instructions completed
PAPI_L2_DCM L2 Data Cache Misses
PAPI_L2_LDM L2 Load Misses
PAPI_TLB_DM Data TLB Misses
PAPI_TLB_TM Total TLB Misses

The simplest method for instrumentating your progam with PAPI is to use the high level function PAPI_flops in C or PAPIF_flops for Fortran. This function is called once at the beginning to initialize the PAPI counters and timers, then each subsequent call returns the total wall time, processor time, and floating point operations since the initial PAPI_flops call. An example C program using PAPI_flops is shown below

   #include <papi.h>
   #include <stdio.h>

   int handle_error(int code)
   {
      char errmesg[PAPI_MAX_STR_LEN];
      PAPI_perror(code, errmesg, PAPI_MAX_STR_LEN);
      printf("PAPI Error %i: %s\n", code, errmesg);
   }

   int main()
   {
     int err_code;
     float wall_time, proc_time, mflops;
     long_long floatpnt_ops;

     if ((err_code=PAPI_flops(&wall_time, &proc_time, &floatpnt_ops, &mflops)) != PAPI_OK)
         handle_error(err_code);

     ...<main code>...

     if ((err_code=PAPI_flops(&wall_time, &proc_time, &floatpnt_ops, &mflops)) != PAPI_OK)
         handle_error(err_code);

     printf("Wall time = %g\n", wall_time);
     printf("Processor time = %g\n", proc_time);
     printf("MFLOPS = %g\n", floatpnt_ops/proc_time);
    }

Note: the MegaFLOPS value returned by PAPI_flops function in PAPI 3.0 Beta may be incorrect, hence, the actual value is determined in the above code by taking the number of floating point operations and dividing by the processor time. Compiling the above code is accomplished by issuing the command:

icc -Wl,-rpath,$PAPI_LIB -I$PAPI_INC prog.c -L$PAPI_LIB -lpapi

The PAPI instrumented example Fortran code below uses the high-level PAPI counter functions to collect the number of floating point operations, total processor cycles, total number of instructions issued and the number of L2 data cache misses.

     program example
     #include "f90papi.h"
     integer, parameter :: nevents = 4
     integer*4 ::  ievents(nevents), ierr
     integer*8 ::  icounts(nevents)
     ...
     events(1)= PAPI_FP_OPS   !  This event is for PAPI 3.0 and later
     events(2)= PAPI_TOT_CYC
     events(3)= PAPI_TOT_INS
     events(4)= PAPI_L2_DCM


     call PAPIF_start_counters(ievents,nevents,ierr)
     ...<main code>...
     call PAPIF_read_counters( icounts,nevents,ierr)
     print*,icounts
     stop
     end

PAPI's default loader mode is to dynamically load its library objects into executables; therefore, it is best to use the -rpath loader option to indicate the location of the libraries at run time. If you loaded the papi module, the library path is automatically included in LD_LIBRARY_PATH, and the library and include paths are set in PAPI_LIB and PAPI_INC, respectively. Compiling a Fortran code instrumented with PAPI has the following syntax:

ifc -Wl,-rpath,$PAPI_LIB -I$PAPI_INC prog.f90 -L$PAPI_LIB -lpapi

Further details and examples can be found on the PAPI home page.

Last modified: July 14 2009 10:51:40.


Single Processor Optimization

Tuning an application for single-processor performance is important for two reasons. First, optimized code runs faster and provides quicker turnaround, thereby delivering results back to the user sooner. Secondly, code optimization has a positive impact on system throughput and facilitates the processing of more jobs in a given time period; it will benefit your colleagues. Therefore, programmers should be conscientious about improving code performance for the benefit of the programmer/user and to prevent inefficient code (resource-hogging) from lowering system throughput, thus inconveniencing other users waiting for free computational resources.

For an in-depth look at single processor optimization on the IBM Power4 System click here.

Last modified: July 14 2009 10:51:40.


Performance Libraries

Most programs that perform scientific and technical computation depend on common mathematical operations (e.g. matrix-vector multiplication, dot products) and/or spend a considerable amount of time calling mathematical functions like sqrt, sin, log, etc. For this purpose, IBM has developed several high performance libraries specifically optimized for the Power4 architecture. Users are encouraged to link in these mathematical and scientific libraries instead of hand-coding their own routines:

  1. ESSL library

    The Engineering and Scientific Subroutine Library (ESSL) consists of routines in the following computational areas:

    • BLAS (vector-vector, matrix-vector, matrix-matrix operations)
    • linear algebraic equations
    • eigensystem analysis
    • Fourier transforms, convolutions and correlations, and other signal processing operations
    • sorting and searching
    • interpolation
    • numerical quadrature
    • random number generation

    Two types of runtime libraries are provided by ESSL:

    • The ESSL SMP (symmetric multiprocessing) library provides thread-safe versions of the ESSL routines for use on SMP architectures. Some subroutines are even designed for multithreaded execution. Consult this table for a list of these routines that support shared memory parallel processing.

      To call an ESSL routine with multithreaded programming support (i.e. routines that can execute in parallel using multiple threads), link in the ESSL SMP library, e.g.:

      xlf90_r -qsmp=auto prog.f -lessl_r

      Parallel execution of ESSL can only take place when the number of threads specified in the runtime environment is greater than one. To set the number of threads, adjust the OMP_NUM_THREADS environment variable:

      setenv OMP_NUM_THREADS n (in csh)
      export OMP_NUM_THREADS=n (in sh)

      Please consult chapter 7 of the POWER4 Processor Introduction and Tuning Guide for a discussion of the environment variables that affect runtime behavior of SMP applications.

    • The ESSL serial library provides thread-safe versions of the ESSL subroutines for use on all processors. Serial applications can link in this library using:

      xlf90 prog.f -lessl

    Refer to the ESSL User's Guide for more information on the use of this library, as well as details on the calling interface and syntax of each routine.
  2. PESSL library

    Parallel ESSL is the scalable version of ESSL, and uses the message passing (MPI) library for multiprocessor execution of mathematical computational routines in the following areas:

    • PBLAS levels 2 and 3 (matrix-vector and matrix-matrix operations)
    • linear algebraic equations
    • eigensystem analysis and SVD
    • Fourier transforms
    • random number generation

    Because PESSL routines make calls to MPI, use the "mp" invocation of the compiler when compiling and linking in the PESSl library, e.g.:

    mpxlf90 prog.f -lpessl
    mpcc prog.c -lpessl

    On the Power4 system, PESSL SMP libraries are provided for use with the MPI thread-safe library. Note that PESSL routines cannot be called simultaneously from multiple threads -- that is, calls to PESSL must not be within an SMP parallel region. Use the SMP version of PESSL when linking code with the thread-safe MPI libraries:

    mpxlf90_r prog.f -lpessl_r

    Both SMP and "serial" versions of PESSl support 64-bit and 32-bit applications.

    For more information on the PESSL library, please refer to the PESSL User's Guide.
  3. MASS and MASSV Libraries

    Short for Mathematical Acceleration Subsystem Library, MASS provides fast, high-performance versions of a subset of Fortran intrinsic functions at the expense of decreased accuracy (i.e. the last one or two bits might be inaccurate).

    For vector operands, the MASSV library should be used instead. However, this would require modification of the code to call the vector equivalent of the intrinsic function. For example:

    Integer, Parameter :: n=1000
    real(8) a(n)
    ...
    do i=1,n
    ...
    a(i)=sqrt(1.0/a(i))
    ...
    end do

    should be replaced by:

    ...
    call vsrsqrt(a,a,n)    ! 32-bit environment
    call vrsqrt(a,a,n)    ! 64-bit environment
    ...

    Execute the following command when linking in the MASS or MASSV libraries, making sure to link in MASS before libm.a:

    xlc prog.c -L/usr/local/mass -lmass -lm
    xlc prog.c -L/usr/local/mass -lmassvp4 -lm

    Fortran codes link in libm.a automatically so this need not be linked in explicitly. However, use -lmass prior to -lxlf90 in the load step.

    MASS contains support for both 64-bit and 32-bit applications, except for vatan2 and vsatan2.

    Please consult /usr/local/apps/MASS.readme or this page for more information on the scalar and vector versions of the MASS library.

Last modified: July 14 2009 10:51:40.


Manuals & References

Last modified: July 14 2009 10:51:42.