Ranger User Guide
- General System Info
- Overview
- History
- System & Programming Notes
- Architecture
- System Access
- Environment
- Login Information
- Login Shell
- User Environment
- Startup Scripts
- Modules
- File Systems
- Login Information
- Development
- Programming Models
- Compilation
- Guidelines
- Serial
- OpenMP
- MPI
- Basic Optimization
- Loading Libraries
- Running Code
- Runtime Environment
- Batch
- Interactive
- Process Affinity and Memory Policy
- Ranger Sockets
- Numactl
- Numa Control in Batch Scripts
- Tools
- Optimization
- Performance Libraries
- GotoBlas
- Performance Libraries
- Manuals & References
Ranger is one of the largest computational resources in the world, serving NSF TeraGrid researchers throughout the United States, academic institutions within Texas, and the components of The University of Texas System.
The Sun Constellation Linux Cluster, Ranger, is configured with 3,936 16-way SMP compute-nodes (blades), 123TB of total memory and 1.73PB of global disk space. The theoretical peak performance is 579 TFLOPS. Nodes are interconnected with InfiniBand technology in a full-CLOS topology providing a 1GB/sec point-to-point bandwidth. Also, a 2.8PB archive system and 5TB SAN network storage system are available through the login/development nodes.
Figure 1. One of 6 rows: Management/IO Racks (black), Compute Rack (silver), and In-row Cooler (black).
Figure 2. SunBlade x6420 motherboard.
Figure 3. Constellation Switch (partially wired).
Project summary:
TACC and The University of Texas at Austin lead a project team that includes Sun Microsystems, Arizona State University, and Cornell University to build, deploy, operate, and support a new supercomputer of unprecedented capabilities under the National Science Foundation's HPC Systems Acquisitions program. The new system, Ranger, will be one of the more powerful supercomputing systems in the world when it enters full production in February 2008, offering 1/2 petaflop of performance, 1.7 petabytes of disk, and 125 terabytes of memory. Ranger will enable researchers to address problems that are too computationally demanding for today's terascale systems, stimulating and accelerating research across domains that advance science and society.
Configuration/Capabilities:
- 579 Teraflops of compute power
- 3,936 Sun four-socket blades
- 15,744 AMD "Barcelona" processors: quad-core, four flops/cycle (dual pipelines)
- 125 Terabytes aggregate memory: 2 GB/core, 32 GB/node, 132 GB/s aggregate bandwidth
- 1.7 Petabytes disk: 72 Sun x4500 "Thumper" I/O servers, 24TB each; 72 GB/sec total aggregate I/O bandwidth; 1 PB in largest filesystem
- 10 Gbps / 2.3 µsec latency Sun InfiniBand-based switches (2), up to 3456 4x ports each
- Full non-blocking 7-stage Clos fabric
Progress & Plans
Ranger is being constructed in TACC's new machine room, located on the J.J. Pickle Research Campus in Austin, Texas. As of SC07, all compute, storage, and management hardware is in place, and cabling/integration and scalability testing are in progress. Ranger will be accessible to initial users in December 2007, and is scheduled for full production status as the TeraGrid's most powerful resource in February 2008.
Photo Journal of the Ranger build
- Empty machine room
  Photographed by Kent Milfeld, December 2006
- Raised floor with chillers
  Photographed by ?, March 6, 2007
- Chiller racks
  Photographed by Kent Milfeld, March 6, 2007
- Power supply delivery
  Photographed by Kent Milfeld, March 12, 2007
- More power installed
  Photographed by Kent Milfeld, March 12, 2007
- One of many deliveries
  Photographed by Kent Milfeld, March 21, 2007
- Ranger nodes on dock
  Photographed by Kent Milfeld, September 15, 2007
- Many deliveries
  Photographed by Kent Milfeld, September 28, 2007
- Magnum assembly
  Photographed by Kent Milfeld, June 26, 2007
- Magnum switch
  Photographed by Kent Milfeld, October 26, 2007
- Magnum switch
  Photographed by Kent Milfeld, December 17, 2007
- Ranger Blade
  Photographed by Kent Milfeld, September 20, 2007
- Ranger rack
  Photographed by Kent Milfeld, October 26, 2007
- Ranger row view
  Photographed by Jo Wozniak, October 23, 2007
- Ranger footprint
  Photographed by Kent Milfeld, October 26, 2007
- Login Nodes now have Barcelona chips
  - The 4-socket login3 and login4 (ranger.tacc.utexas.edu) nodes have been populated with Barcelona chips. These are 2.2 GHz chips; the compute nodes run at 2.3 GHz. (Do not run codes on the login nodes.)
- Compiling on Login Nodes
  - When you log in to ranger.tacc.utexas.edu you will be connected to either login3.ranger.tacc.utexas.edu or login4.ranger.tacc.utexas.edu (login1 and login2 are not available yet).
- MPI Support for Compilers
  - Only the Intel and PGI compilers will support MPI. The mvapich2 libraries have been compiled with both compilers, and are automatically linked by the mpicc and mpif90 compiler drivers when correctly loaded through the module commands. (By default the MPI compiler drivers use the PGI-compiled mvapich2 libraries and the default compilers are PGI.)
- Debugging and Profiling
  - DDT is not available yet. Please use the idb (Intel) debugger, pgdbg and pgprof (PGI), and gdb and gprof (GNU) for debugging and profiling.
- /tmp on Compute Nodes
  - On the compute nodes, the only physical storage device is an 8GB compact flash, which stores the OS. Only 150MB are available in /tmp for user storage. Program developers should use $SCRATCH to store temporary files. (The /tmp directories on login nodes are 36GB disk devices.)
- Parallel Environment (using fewer than 16 cores/node)
  - The Parallel Environment Section shows how to use fewer than 16 tasks per node, and how to run hybrid codes.
- MPI (mvapich) Options for Scalable code
  - See the mvapich1/2 User Guides.
- Core Affinity and Memory Allocation Policy
  - See the Numa Section for controlling process/thread execution on sockets and cores, and the memory allocation policy on sockets.
- Core Count for Batch SGE Jobs
  - See the Numa Section (look for MY_NSLOTS) for core counts other than a multiple of 16.
- Experienced Users
  - Check out the Quick Start Notes.
The Ranger compute and login nodes run a Linux OS and are managed by the Rocks 4.1 cluster toolkit. Two 3456-port Constellation switches provide dual-plane access between NEMs (Network Element Modules) of each 12-blade chassis. Several global, parallel Lustre file systems have been configured to target different storage needs. Each compute node contains 16 cores as a 4-socket, quad-core platform. The configuration and features for the compute nodes, interconnect and I/O systems are described below, and summarized in Tables 1-3.
- Compute Nodes: Ranger is a blade-based system. Each node is a SunBlade x6420 blade running a 2.6.18.8 x86_64 Linux kernel from kernel.org. Each node contains four AMD Opteron Quad-Core 64-bit processors (16 cores in all) on a single board, as an SMP unit. The core frequency is 2.3GHz, and each core supports 4 floating-point operations per clock period, for a peak performance of 9.2 GFLOPS/core or 147.2 GFLOPS/node.
Each node contains 32GB of memory. The memory subsystem has a 1.0GHz HyperTransport system bus, and 2 channels with 667MHz DDR2 DIMMs. Each socket possesses an independent memory controller connected directly to L3 cache.
- Interconnect: The interconnect topology is a full-CLOS fat tree. Each of the 328 12-node compute chassis is connected directly to the 2 core switches. 12 additional frames are also connected directly to the core switches and provide file systems, administration and login capabilities.
- File systems: Ranger's file systems are built on 72 Sun x4500 disk servers, each containing 48 SATA drives, and two Sun x4600 metadata servers. From this aggregate space of 1.73PB, several file systems will be partitioned (see Table 5).
| Table 1. System Configuration & Performance | ||
| Component | Technology | Performance/Size |
| Peak Floating Point Operations | | 579 TFLOPS (Theoretical) |
| Nodes(blades) | Four Quad-Core AMD Opteron processors | 3,936 Nodes / 62,976 Cores |
| Memory | Distributed | 123TB (Aggregate) |
| Shared Disk | Lustre, parallel File System | 1.73PB |
| Local Disk | Compact Flash | 31.4TB (Aggregate) |
| Interconnect | InfiniBand Switch | 1 GB/s P-2-P Bandwidth |
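The aggregate figures in Table 1 can be cross-checked from the per-node numbers quoted in this section. The following is a minimal sketch in bash integer arithmetic; the variable names are illustrative, and the choice of units (MFLOPS, and 1 TB taken as 1024 GB for memory) is an assumption made to keep the arithmetic in integers.

```shell
#!/bin/bash
# Cross-check of the Table 1 aggregates from per-node figures in this guide.
nodes=3936            # compute blades
cores_per_node=16     # 4 sockets x 4 cores
clock_mhz=2300        # 2.3 GHz core frequency
flops_per_clock=4     # floating-point results per clock period
mem_per_node_gb=32

total_cores=$(( nodes * cores_per_node ))
core_mflops=$(( clock_mhz * flops_per_clock ))
node_mflops=$(( core_mflops * cores_per_node ))
peak_tflops=$(( nodes * node_mflops / 1000000 ))    # integer part only
total_mem_tb=$(( nodes * mem_per_node_gb / 1024 ))  # assumes 1 TB = 1024 GB

echo "cores: $total_cores"
echo "peak per core: $core_mflops MFLOPS"
echo "peak per node: $node_mflops MFLOPS"
echo "system peak: about $peak_tflops TFLOPS"
echo "aggregate memory: $total_mem_tb TB"
```

With a 2.3 GHz clock and 4 results per clock period, the per-core peak works out to 9.2 GFLOPS, which reproduces the 62,976 cores, 579 TFLOPS, and 123TB entries of Table 1.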
| Table 2. SunBlade x6420 Compute Node | |
| Component | Technology |
| Sockets per Node/Cores per Socket | 4/4 (Barcelona) |
| Clock Speed | 2.3GHz |
| Memory Per Node | 32GB memory |
| System Bus | HyperTransport, 6.4GB bidirectional |
| Memory | 2GB DDR2/667, PC2-5300 ECC-registered DIMMs |
| PCI Express | x8 |
| Compact Flash | 8GB |
| Table 3. Sun x4600 Login Nodes | |
| Component | Technology |
| 4 login nodes | ranger.tacc.utexas.edu: login1.tacc.utexas.edu (Not Available), login2.tacc.utexas.edu (Not Available), login3.tacc.utexas.edu, login4.tacc.utexas.edu |
| Sockets per Node/Cores per Socket | 4/4 (Barcelona) |
| Clock Speed | 2.2GHz |
| Memory Per Node | 32GB |
| Table 4. AMD Barcelona Processor | |
| Technology | 64-bit |
| Clock Speed | 2.3GHz |
| FP Results/Clock Period | 4 |
| Peak Performance/core | 9.2 GFLOPS/core |
| L3 Cache | 2MB on-die (shared) |
| L2 Cache | 4 x 512KB |
| L1 Cache | 64KB |
| Table 5. Storage Systems | |||
| Storage Class | Size | Architecture | Features |
| Local | 8GB/node | Compact Flash | not available to users (O/S only) |
| Parallel | 1.73PB | Lustre, Sun x4500 disk servers | 72 Sun x4500 I/O data servers, 2 Sun x4600 Metadata servers (See Table 6 for breakdown of the parallel filesystems) |
| SAN | 15TB | Synergy FS, SUN Storage Tek | QLogic switch, SUN V880 Server, mnt on /san/hpc/ |
| Ranch (Tape Storage) | 2.8PB | SAMFS (Storage Archive Manager) | 10Gb/s connection through 8 GridFTP Servers |
| Table 6. Parallel Filesystems | |||
| Storage Class | Size | Quota (per User) | Features |
| HOME | ~100TB | 6GB | Backed up nightly; Not purged |
| WORK | ~200TB | 350GB | Not backed up; Not purged |
| SCRATCH | ~800TB | 400TB | Not backed up; Purged every 10 days |
SSH
To ensure a secure login session, users must connect to machines using the secure shell program, ssh. Telnet is no longer allowed because of the security vulnerabilities associated with it. The "r" commands rlogin, rsh, and rcp, as well as ftp, are also disabled on this machine for similar reasons. These commands are replaced by the more secure alternatives included in SSH: ssh, scp, and sftp.
Before any login sessions can be initiated using ssh, a working SSH client needs to be present in the local machine. Go to the TACC introduction to SSH for information on downloading and installing SSH.
To initiate an ssh connection to a Ranger login node, execute the following command on your local workstation:
ssh <username>@ranger.tacc.utexas.edu
Additionally, each of the login nodes can be accessed directly, to allow users to move data to/from local disk space on the login nodes. These nodes are directly accessible by using the node name:
ssh <username>@<login{3|4}>.ranger.tacc.utexas.edu
Password changes (with the passwd command) are forced to adhere to "strength checking" rules, and users are asked to comply with practices presented in the TACC password guide.
Login Shell
The most important component of a user's environment is the login shell that interprets text on each interactive command line and statements in shell scripts. Each login has a line entry in the /etc/passwd file, and the last field contains the shell launched at login. To determine your login shell, execute:
| echo $SHELL | {to see your login shell} |
You can use the chsh command to change your login shell; instructions are in the man page. Available shells are listed in the /etc/shells file with their full-path. To change your login shell, execute:
| cat /etc/shells | {select a <shell> from the list} |
| chsh -s <shell> | {use full path of the shell} |
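As noted above, the login shell is the last colon-separated field of an account's /etc/passwd entry. The sketch below extracts that field in bash; the function name and the sample entry are hypothetical, and the getent lookup is an assumption that only applies on systems where that command is installed.

```shell
#!/bin/bash
# Print the login shell (the last colon-separated field) of a
# passwd-format line.
login_shell_from_entry() {
    local entry="$1"
    printf '%s\n' "${entry##*:}"
}

# Hypothetical entry, for illustration only.
sample='jdoe:x:1234:100:Jane Doe:/home/jdoe:/bin/tcsh'
login_shell_from_entry "$sample"

# For the current account, when getent is available on this machine:
if command -v getent >/dev/null 2>&1; then
    entry=$(getent passwd "$(id -un)" || true)
    if [ -n "$entry" ]; then
        login_shell_from_entry "$entry"
    fi
fi
```

The result for your own account should agree with the value reported by echo $SHELL at login.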
User Environment
The next most important component of a user's environment is the set of environment variables. Many of the Unix commands and tools, such as the compilers, debuggers, profilers, editors, and just about all applications that have GUIs (Graphical User Interfaces), look in the environment for variables that specify information they may need to access. To see the variables in your environment execute the command:
env
The variables are listed as keyword/value pairs separated by an equal (=) sign, as illustrated below by the HOME and PATH variables.
| HOME=/home/utexas/staff/milfeld |
| PATH=/bin:/usr/bin:/usr/local/apps:/opt/intel/bin |
(PATH has a colon (:) separated list of paths for its value.) It is important to realize that variables set in the environment (with setenv for C shells and export for Bourne shells) are "carried" to the environment of shell scripts and new shell invocations, while normal "shell" variables (created with the set command) are useful only in the present shell. Only environment variables are seen in the env (or printenv) command; execute set to see the (normal) shell variables.
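The difference between environment variables and normal shell variables can be seen by starting a child shell. This is a minimal sketch for Bourne-type shells (bash syntax); the variable names are made up for the demonstration, and C-shell users would use setenv and set instead.

```shell
#!/bin/bash
# A normal shell variable stays in the current shell;
# an exported (environment) variable is inherited by child processes.
plain_var="visible only here"
export EXPORTED_VAR="visible in children"

# Start a child shell and report what it can see.
child_view=$(bash -c 'echo "plain=${plain_var:-unset} exported=${EXPORTED_VAR:-unset}"')
echo "$child_view"
```

The child shell reports the plain variable as unset and the exported one with its value, which is the "carried" behavior described above.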
Startup Scripts
All Unix systems set up a default environment and provide administrators and users with the ability to execute additional Unix commands to alter the environment. These commands are "sourced"; that is, they are executed by your login shell, and the variables (both normal and environmental) as well as aliases and functions are included in the present environment. We recommend that you customize the login environment by inserting your "startup" commands in .cshrc_user, .login_user, and .profile_user files in your home directory.
Basic site environment variables and aliases are set in
| /usr/local/etc/cshrc | {C-shell, non-login specific} |
| /usr/local/etc/login | {C-shell, specific to login} |
| /usr/local/etc/profile | {Bourne-type shells} |
For historical reasons, the C shells source two types of files. The .cshrc type files are sourced first (/etc/csh.cshrc--> $HOME/.cshrc--> /usr/local/etc/cshrc--> $HOME/.cshrc_user). These files are used to set up environments that are to be executed by all scripts and used for access to the machine without a login. For example, the following commands only execute the .cshrc type files on the remote machine:
| scp data ranger.tacc.utexas.edu: | {only .cshrc sourced on ranger} |
| ssh ranger.tacc.utexas.edu date | {only .cshrc sourced on ranger} |
The .login type files are used to set up environment variables that you commonly use in an interactive session. They are sourced after the .cshrc type files (/etc/csh.login--> $HOME/.login--> /usr/local/etc/login--> $HOME/.login_user). Similarly, if your login shell is a Bourne shell (bash, sh, ksh, ...), the profile files are sourced (/etc/profile--> $HOME/.profile--> /usr/local/etc/profile--> $HOME/.profile_user).
The commands in the /etc files above are concerned with operating system behavior and set the initial PATH, ulimit, umask, and environment variables such as the HOSTNAME. They also source command scripts in /etc/profile.d: /etc/csh.cshrc sources files ending in .csh, and /etc/profile sources files ending in .sh. Many site administrators use these scripts to set up the environments for common user tools (vim, less, etc.) and system utilities (ganglia, modules, Globus, LSF, etc.).
TACC has to coordinate the environments on platforms of several operating systems: AIX, Linux, IRIX, Solaris, and Unicos. In order to efficiently maintain and create a common environment among these systems, TACC uses its own startup files in /usr/local/etc. A corresponding file in this etc directory is sourced by the .profile, .cshrc, and .login files that reside in your home directory. (Please do not remove these files and the sourcing commands in them, even if you are a Unix guru.) Any commands that you put in your .login_user, .cshrc_user, or .profile_user file are sourced (if the file exists) at the end of the corresponding /usr/local/etc command files. If you accidentally remove your .login, .cshrc, or .profile, you can copy new ones from /usr/local/etc/start-up or execute
to get a new copy (your old files are renamed with a date suffix).
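The "sourced if the file exists" behavior described above can be pictured with a short sketch. This is an illustration for Bourne-type shells, not a copy of the TACC startup files, which may be written differently; a temporary directory stands in for $HOME, and the customizations shown are hypothetical.

```shell
#!/bin/bash
# Sketch of the "source the user file if it exists" step that closes a
# site startup file. A temporary directory stands in for $HOME.
demo_home=$(mktemp -d)

# Hypothetical user customizations, as they might appear in .profile_user.
cat > "$demo_home/.profile_user" <<'EOF'
export EDITOR=vi
alias ll='ls -l'
EOF

# The guard: source the file only when it is present.
if [ -f "$demo_home/.profile_user" ]; then
    . "$demo_home/.profile_user"
fi

echo "EDITOR is now: $EDITOR"
rm -rf "$demo_home"
```

Because the user file is read last, settings placed in it take effect after the site defaults and can override them.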
Modules
TACC is constantly including updates and installing revisions for application packages, compilers, communications libraries, and tools and math libraries. To facilitate the task of updating and to provide a uniform mechanism for accessing different revisions of software, TACC uses the modules utility.
At login, a basic environment for the default compilers, tools, and libraries is set by several modules commands. Your PATH, MANPATH, LIBPATH, directory locations (WORK, HOME, ...), aliases (cdw ...) and license paths, are just a few of the environment variables and aliases created for you. This frees you from having to initially set them and update them whenever modifications and updates are made in system and application software.
Users who need 3rd party applications, special libraries, and tools for their development can quickly tailor their environment with only the applications and tools they need. (Building your own specific application environment through modules allows you to keep your environment free from the clutter of all the other application environments you don't need.)
Each of the major TACC applications has a modulefile that sets, unsets, appends to, or prepends to environment variables such as $PATH, $LD_LIBRARY_PATH, $INCLUDE_PATH, $MANPATH for the specific application. Each modulefile also sets functions or aliases for use with the application. A user need only invoke a single command,
module load <application>
at each login to configure an application/programming environment properly. If you often need an application environment, place the module command in your .login_user and/or .profile_user shell startup file.
Most of the package directories are in /opt/apps ($APPS) and are named after the package name (
As an example, the fftw package requires several environment variables that point to its home, libraries, include files, and documentation. These can be set in your environment by loading the fftw module:
module load fftw
To see a list of available modules and a synopsis of a modulefile's operations, execute:
| module available | {lists modules} |
| module help <module_name> | {lists environment changes performed for <module_name>} |
During upgrades, new modulefiles are created to reflect the changes made to the environment variables. TACC will always announce upgrades and module changes in advance.
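To make the effect of a modulefile more concrete, the sketch below mimics one kind of change it performs: prepending a directory to a colon-separated variable. This is an illustration in bash only, not how the modules utility is implemented; the function name, the DEMO_PATH variable, and the /opt/apps/fftw/bin location are hypothetical.

```shell
#!/bin/bash
# Illustration only: mimics the kind of change a modulefile makes when it
# prepends a directory to a colon-separated variable such as PATH.
prepend_path() {
    local var_name="$1" dir="$2"
    local current="${!var_name}"
    case ":$current:" in
        *":$dir:"*) ;;                                    # already present
        *) export "$var_name=$dir${current:+:$current}" ;;
    esac
}

# Hypothetical package location under /opt/apps.
DEMO_PATH="/bin:/usr/bin"
prepend_path DEMO_PATH /opt/apps/fftw/bin
prepend_path DEMO_PATH /opt/apps/fftw/bin   # second call changes nothing
echo "$DEMO_PATH"
```

Unloading a module reverses such changes, which is why loading and unloading through the module command is preferable to editing these variables by hand.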
File Systems
The TACC HPC platforms have several different file systems with distinct storage characteristics. There are predefined, user-owned directories in these file systems for users to store their data. Of course, these file systems are shared with other users, so they are managed by either a quota limit, a purge policy (time-residency) limit, or a migration policy.
Three Lustre file systems are available to users: $HOME, $WORK and $SCRATCH. Users have 6GB for $HOME. $WORK on our Ranger system is NOT a purged file system, but is limited by a large quota. Use $SCRATCH for temporary, large file storage; this file system is purged periodically (TBD), and has a very large quota. All file systems also impose an inode limit.
To determine the size of a file system, cd to the directory of interest and execute the "df -k ." command.
Without the "dot" all file systems are reported. In the df command output below, the file system name appears on the left (IP number, "ib" protocol, using OFED gen2) , and the used and available space (-k, in units of 1KBytes) appear in the middle columns followed by the percent used and the mount point:
% df -k .
File System                1k-blocks     Used       Available     Use%  Mounted on
129.114.97.1@o2ib:/share   103836108384  290290196  103545814376  1%    /share
To determine the amount of space occupied in a user-owned directory, cd to the directory and execute the du command with the -sb option (s=summary, b=units in bytes):
du -sb
To determine quota limits and usage, execute the lfs quota command with your username and the directory of interest:
lfs quota -u <username> $HOME
lfs quota -u <username> $WORK
lfs quota -u <username> $SCRATCH
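The du and lfs quota commands above can be combined into one report. The following bash sketch assumes the directory variables are set in your environment; the function names are hypothetical, and the lfs call is only attempted where that command is found.

```shell
#!/bin/bash
# Print the number of bytes used under a directory.
usage_bytes() {
    # du -sb prints "<bytes><TAB><path>"; keep only the byte count.
    du -sb "$1" 2>/dev/null | cut -f1
}

# Report usage for each named directory variable (e.g. HOME WORK SCRATCH).
report_usage() {
    local var dir
    for var in "$@"; do
        dir="${!var}"
        if [ -n "$dir" ] && [ -d "$dir" ]; then
            echo "$var ($dir): $(usage_bytes "$dir") bytes used"
            # On Lustre clients, also show the quota for this file system.
            if command -v lfs >/dev/null 2>&1; then
                lfs quota -u "$(id -un)" "$dir"
            fi
        else
            echo "$var: not set or not a directory"
        fi
    done
}

# Example call on a login node:
#   report_usage HOME WORK SCRATCH
```

Running du over a large directory tree can take a while on a parallel file system, so the report is best run occasionally rather than from a startup file.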
The four major file systems available on Ranger are:
- home directory
  - At login, the system automatically changes to your home directory.
  - This is the recommended location to store your source codes and build your executables.
  - The quota limit on home is 6GB.
  - A user's home directory is accessible from the frontend node and any compute node.
  - Use $HOME to reference your home directory in scripts.
  - Use cd to change to $HOME.
- work directory
  - Store large files here.
  - Often users change to this directory in their batch scripts and run their jobs in this file system.
  - A user's work directory is accessible from the frontend node and any compute node.
  - The user quota is 350GB for Early Users, and will be set to a smaller value when $SCRATCH comes on line.
  - Purge Policy: There is no purging on this file system.
  - This file system is not backed up.
  - Use $WORK to reference your work directory in scripts.
  - Use cdw to change to $WORK.
- scratch or temporary directory
  - UNLIKE the TACC Lonestar system, this is NOT a local disk file system on each node.
  - This is a global Lustre file system for storing temporary files.
  - Purge Policy: Files on this system are purged when a file's access time exceeds 10 days.
  - PLEASE NOTE: TACC staff may delete files from scratch if the scratch file system becomes full and directories consume an inordinately large amount of disk space, even if files are less than 10 days old. A full file system inhibits use of the file system for all users. The use of programs or scripts to actively circumvent the file purge policy will not be tolerated.
  - Often, in batch jobs it is more efficient to use and store files directly in $WORK (to avoid moving files from scratch later before they are purged).
  - The quota on this system is 400TB.
  - Use $SCRATCH to reference this file system in scripts.
  - Use cds to change to $SCRATCH.
- archive
  - More on Archive -- rls and sinc not available yet
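Because the scratch purge policy is based on access time, it can be useful to see which files are old enough to be purge candidates so they can be moved to $WORK or the archive in time. The sketch below uses the standard find command; the function name is hypothetical, and the 10-day default simply mirrors the purge age stated above. It only lists files and does not modify them.

```shell
#!/bin/bash
# List regular files under a directory that were last accessed more than
# N days ago (default 10, matching the stated $SCRATCH purge age).
# find counts whole 24-hour periods, so the cutoff is approximate.
list_purge_candidates() {
    local dir="$1" days="${2:-10}"
    find "$dir" -type f -atime +"$days" -print
}

# Example call:
#   list_purge_candidates "$SCRATCH"
```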
High Speed File Transfers
There are two utilities, bbcp and globus-url-copy, that can be used to achieve higher performance than the rcp and scp programs when transferring large files between TACC clusters and the TACC archive (ranch). During production, the scp and rcp speeds between Ranger and Ranch average about 15MB/s, while bbcp and globus-url-copy speeds are about 125MB/s. These values vary with I/O and network traffic.
The bbcp utility works much the same way as scp; it is only available on TACC machines and includes an option to copy subdirectories. For each transfer command it is necessary to provide your ssh passphrase or login password. You can use the ssh-agent and ssh-add commands to automatically supply passphrases for ssh commands (including bbcp) during your login session. Details are available in the ssh Cookbook section of the TACC SSH Guide. The bbcp syntax is:
bbcp -help {displays option list}
bbcp options <file or directory> <toMachine>:<relativepath>/<fileordirectory>
e.g.
bbcp data ranch: {transfers "data" to ranch as "data"}
bbcp -f data ${ARCHIVER}:data {transfers "data" to ranch, force replacement}
bbcp -r dir1 ${ARCHIVER}: {transfers directory dir1, and subdirectories}
By default, files are not overwritten; use the -f option to force replacement of files. The -r option recursively transfers subdirectories. You can also use bbcp to transfer files between lonestar and ranger.
TeraGrid users who wish to transfer data between TG sites can use globus-url-copy. This command requires the use of your TeraGrid certificate to create a proxy for passwordless transfers. It is syntactically complex, but provides high-speed access to other TeraGrid machines that support GridFTP services (the protocol for globus-url-copy). High-speed transfers are accomplished by transferring files/directories between the different FTP servers at the TG sites. The file systems of the compute machines are mounted on GridFTP servers, thereby providing access to your files/directories through these servers. Third party transfers are possible (a transfer initiated between two machines from another machine). All TG GridFTP servers and mounted directory names are listed in the TG transfer-location page.
It is necessary to use grid-proxy-init to obtain a proxy certificate. The proxy is valid for 12 hours for all logins on the local machine. With globus-url-copy you must include the name of the server, and a full path to the file. The startup command and the syntax are:
grid-proxy-init {prompted for certificate password, proxy good for 12hrs.}
globus-url-copy <options> <gsiftp://<gridftp_server1>/<directory>/file> \
  <gsiftp://<gridftp_server2>/<directory>/file>
e.g. file transfer:
globus-url-copy -stripe -tcp-bs 11M -vb \
  gsiftp://gridftp.ranger.tacc.teragrid.org/`pwd`/file1 \
  gsiftp://tg-gridftp.ncsa.teragrid.org/home/ncsa/johndoe/file2
e.g. directory transfer:
globus-url-copy -stripe -tcp-bs 11M -vb \
  gsiftp://gridftp.ranger.tacc.teragrid.org/`pwd`/directory1/ \
  gsiftp://gridftp.ranch.tacc.teragrid.org/home/00000/johndoe/directory2/
It is important to use the stripe and buffer size options (-stripe to use multiple service nodes, -tcp-bs 11M to set the ftp data channel buffer size); otherwise the speed will be about 20 times slower! By default all paths are root-relative; that is, a full path must be specified. Note: when transferring directories, the directory path MUST end with a slash (/). The -rp option (not shown) allows paths relative to the user's "starting" directory of the filesystem mounted on the server.
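The two path rules above (full paths, and a trailing slash on directories) are easy to get wrong when typing long URLs by hand. The bash sketch below builds the URLs with a small helper; the function gsiftp_url is hypothetical and not part of the Globus tools, and the final command is only printed, not executed.

```shell
#!/bin/bash
# Hypothetical helper: build a gsiftp URL from a server name and an
# absolute path. Pass "dir" as the third argument to guarantee the
# trailing slash that directory transfers require.
gsiftp_url() {
    local server="$1" path="$2" kind="${3:-file}"
    case "$path" in
        /*) ;;
        *) echo "gsiftp_url: path must be absolute: $path" >&2; return 1 ;;
    esac
    if [ "$kind" = "dir" ]; then
        path="${path%/}/"
    fi
    printf 'gsiftp://%s%s\n' "$server" "$path"
}

src=$(gsiftp_url gridftp.ranger.tacc.teragrid.org "$PWD/directory1" dir)
dst=$(gsiftp_url gridftp.ranch.tacc.teragrid.org /home/00000/johndoe/directory2 dir)

# Dry run: show the command that would be issued after grid-proxy-init.
echo globus-url-copy -stripe -tcp-bs 11M -vb "$src" "$dst"
```

Printing the command first gives a chance to check both URLs before starting a long transfer.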
There are two distinct memory models for computing: distributed-memory and shared-memory. In the former, the message passing interface (MPI) is employed in programs to communicate between processors that use their own memory address space. In the latter, open multiprocessing (OMP) programming techniques are employed for multiple threads (light weight processes) to access memory in a common address space.
For distributed memory systems, single-program multiple-data (SPMD) and multiple-program multiple-data (MPMD) programming paradigms are used. In the SPMD paradigm, each processor loads the same program image and executes and operates on data in its own address space (different data). This is illustrated in Figure 4. It is the usual mechanism for MPI code: a single executable (a.out in the figure) is available on each node (through a globally accessible file system such as $WORK or $HOME), and launched on each node (through the batch MPI launch command, "ibrun a.out").
In the MPMD paradigm, each processor loads up and executes a different program image and operates on different data sets, as illustrated in Figure 4. This paradigm is often used by researchers who are investigating the parameter space (parameter sweeps) of certain models, and need to launch 10s or 100s of single processor executions on different data. (This is a special case of MPMD in which the same executable is used, and there is NO MPI communication.) The executables are launched through the same mechanism as SPMD jobs, but a Unix script is used to assign input parameters for the execution command (through the batch MPI launcher, "ibrun script_command"). Details of the batch mechanism for parameter sweeps are described in the Running Programs section.
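A parameter sweep of the kind just described can be pictured as one script that every task runs, with each task picking its own input. The bash sketch below is only an illustration under stated assumptions: the task number is assumed to be passed in as an argument (how the launcher exposes it is not covered here), and the function and file names are made up.

```shell
#!/bin/bash
# Hypothetical wrapper for a parameter sweep: every task runs this same
# script, and each task selects its own input from its task number.
# Assumption: the task number (0, 1, 2, ...) is supplied as an argument;
# how it is obtained depends on the MPI launcher and is not shown here.
select_input() {
    local task="$1"
    shift
    local inputs=("$@")
    if [ "$task" -ge "${#inputs[@]}" ]; then
        echo "no input for task $task" >&2
        return 1
    fi
    printf '%s\n' "${inputs[$task]}"
}

# Illustrative parameter files (names are made up).
inputs=(case_a.in case_b.in case_c.in case_d.in)
task=2
input=$(select_input "$task" "${inputs[@]}")
echo "task $task would run: ./a.out < $input"
```

Since the executions are independent and exchange no MPI messages, the only coordination needed is this mapping from task number to input.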
The shared-memory programming model is used on Symmetric Multi-Processor (SMP) nodes, like the TACC Champion Power5 System (8 CPUs, 16GB memory per node) or the TACC Ranger Cluster (16 cores, 32GB memory per node).
The programming paradigm for this memory model is called Parallel Vector Processing (PVP) or Shared-Memory Parallel Programming (SMPP). The former name is derived from the fact that vectorizable loops are often employed as the primary structure for parallelization. The main point of SMPP computing is that all of the processors in the same node share data in a single memory subsystem, as shown in Figure 5. There is no need for explicit messaging between processors as with MPI coding.
In the SMPP paradigm either compiler directives (as pragmas in C, and special comments in FORTRAN) or explicit threading calls (e.g. with Pthreads) are employed. The majority of science codes now use OpenMP directives that are understood by most vendor compilers, as well as the GNU compilers.
In cluster systems that have SMP nodes and a high speed interconnect between them, programmers often treat all CPUs within the cluster as having their own local memory. On a node an MPI executable is launched on each CPU and runs within a separate address space. In this way, all CPUs appear as a set of distributed memory machines, even though each node has CPUs that share a single memory subsystem.
In clusters with SMPs, hybrid programming is sometimes employed to take advantage of higher performance at the node-level for certain algorithms that use SMPP (OMP) parallel coding techniques. In hybrid programming, OMP code is executed on the node as a single process with multiple threads (or an OMP library routine is called), while MPI programming is used at the cluster-level for exchanging data between the distributed memories of the nodes.
For further information on OpenMP, MPI and on programming models/paradigms, please see the manuals and packages sections of this document.
Compiler Usage Guidelines
The AMD Compiler Usage Guidelines document provides the "best-known" peak switches for various compilers tailored to their Opteron products. Developers and installers should read Chapter 5 of this document before experimenting with PGI and Intel compiler options.
See Chapter 5 of the AMD Compiler Usage Guidelines.
The Intel 10.1 Compiler Suite
The Intel 10.1 compilers are NOT the default compilers. You must use the module commands to load the Intel compiler environment (see above). (The 9.1 compilers are available for special porting needs.) The gcc 3.4.6 compiler and module are also available. We recommend using the Intel (or the PGI) suite whenever possible. The 10.1 suite is installed with the 64-bit standard libraries and will compile programs as 64-bit applications (as the default compiler mode).
Web accessible Intel manuals are available: Intel 10.1 C++ Compiler Documentation and Intel 10.1 Fortran Compiler Documentation.
The PGI 7.1 Compiler Suite
The PGI 7.1 compilers are loaded as the default compilers at login with the pgi module. We are recommending the use of the PGI suite whenever possible (at this time). The 7.1 suite is installed with the 64-bit standard libraries and will compile programs as 64-bit applications (as the default compiler mode).
A PDF version of the PGI User's Guide is available.
Selecting Compiler and MPI Environments
The Ranger programming environment supports several compilers (Intel 9/10 and PGI 7) and several MPI stacks (mvapich, mvapich2 and openmpi) for MPI programs. Each MPI stack must have a library compiled for each of the compilers, so that applications compiled with compiler x can load the x-compiled MPI libraries. Hence, the MPI environment is dependent upon the compiler environment you select. The generic way to change these two environments is to: unload the MPI environment, change the compiler environment (or not if you continue to use the present compiler environment), and then load the MPI environment you will need. The possible scenarios and examples are:
| Change Only MPI stack | Change Only Compiler (COMP) | Change Compiler (COMP) and MPI stack |
| module unload <MPI_old>; module load <MPI_new> | module unload <MPI_now>; module swap <COMP_old> <COMP_new>; module load <MPI_now> | module unload <MPI_old>; module swap <COMP_old> <COMP_new>; module load <MPI_new> |
| Change Only MPI stack | Change Only Compiler (COMP) | Change Compiler (COMP) and MPI stack |
| module unload mvapich2; module load mvapich/1.0 | module unload mvapich2; module swap pgi intel; module load mvapich2 | module unload mvapich2; module swap pgi intel/9.1; module load openmpi |
By default, the pgi compiler and mvapich2 environments are set up at login. Execute the module avail command to determine the modulefile names for all the available compilers; they have the syntax compiler/version. Note that only certain MPI stacks and other compiler-dependent libraries are visible in each compiler environment. The above commands can be placed in your .login (C shells) or .profile (Bourne shells) file to automatically set an alternate default compiler and MPI stack in your environment at login. The matrix below shows the available combinations of compilers and MPI stacks.
| MPI Family | Compiler Support | MPI-1 | Full MPI-2 | Notes |
| mvapich/1.0 | pgi | Yes | No | This is the current recommended stack for large scale analysis on Ranger. It has been used to run applications with O(32K) MPI tasks. |
| mvapich2/1.0 | pgi | Yes | Yes | This supports full MPI-2 functionality with a job-startup mechanism that is recommended for job sizes in the range from 16-2048 tasks. |
| openmpi/1.2.4 | pgi | Yes | Yes | OpenMPI also supports MPI-2 semantics and is the successor to the LAM/MPI project. |
Available Compiler and MPI Stack combinations, and MPI-1 and MPI-2 compliance.
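The unload, swap, load ordering described in this section can be sketched as a small dry-run helper. The function name print_env_switch and its argument order are hypothetical, chosen only for illustration. It prints the module commands rather than running them, so the sequence can be reviewed before it is pasted into a .profile file.

```shell
# Hypothetical dry-run helper (not a TACC-provided command).
# Arguments: current MPI stack, current compiler, new compiler, new MPI stack.
# It only prints the module commands; it does not execute them.
print_env_switch() {
    mpi_old=$1
    comp_old=$2
    comp_new=$3
    mpi_new=$4
    echo "module unload $mpi_old"
    # Swap compilers only when the compiler actually changes.
    if [ "$comp_old" != "$comp_new" ]; then
        echo "module swap $comp_old $comp_new"
    fi
    echo "module load $mpi_new"
}

# Change only the compiler (pgi to intel), keeping mvapich2.
print_env_switch mvapich2 pgi intel mvapich2
```

Passing the same compiler name twice omits the swap line, which corresponds to the "Change Only MPI stack" column.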
Compiling Codes
The following sections present the compiler invocations for serial and MPI programs, followed by a section on options. All compiler commands can be used for just compiling with the -c option (create just the ".o" object files) or for compiling and linking (to create executables). To use a different (non-default) compiler, first unload the MPI environment (mvapich2), swap the compiler environment, and then reload the MPI environment.
Compiling Serial Programs
The compiler invocation commands for the supported vendor compiler systems are tabulated below.
| Vendor | Compiler | Program Type | Suffix |
| intel | icc | C | .c |
| intel | icc | C++ | .C, .cc, .cpp, .cxx |
| intel | ifort | F90 | .f, .for, .ftn, .f90, .fpp |
| pgi | pgcc | C | .c |
| pgi | pgcpp | C++ | .C, .cc |
| pgi | pgf95 | F77/90/95 | .f, .F, .FOR, .f90, .f95, .hpf |
| gnu | gcc | C | .c |
| sun | sun_cc | C | .c |
| sun | sun_CC | C++ | .C, .cc, .cpp, .cxx |
| sun | sunf90 | F77/90 | .f, .F, .FOR, .f90, .hpf |
| sun | sunf95 | F95 | .f, .F, .FOR, .f90, .f95, .hpf |
Appropriate program-name suffixes are required for each compiler. By default, the executable name is a.out. It may be renamed with the -o option. To compile without the link step, use the -c option. The following examples illustrate renaming an executable and the use of two important compiler optimization options:
intel : icc/ifort -o flamec.exe -O2 -xW prog.c/cc/f90
pgi : pgcc/pgcpp/pgf95 -o flamef.exe -fast -tp barcelona-64 prog.c/cc/f90
gnu : gcc -o flamef.exe -mtune=barcelona -march=barcelona prog.c
sun : sun_cc/sun_CC/sunf90 -o flamef.exe -xarch=sse2 prog.c/cc/f90
A list of all compiler options, their syntax, and a terse explanation is given when the compiler command is executed with the -help option. Man pages are also available. To see the help or man information, execute one of:
compiler -help {with compiler = ifort, pgf90/95 or sunf90}
man compiler
gcc --help
man gcc
Some of the more important options are listed below.
Compiling Parallel Programs with MPI
The "mpicmds" commands support the compilation and execution of parallel MPI programs for specific interconnects and compilers. At login, the MVAPICH (mvapich2) and PGI 7 compiler (pgi) modules are loaded to produce the default environment, which provides the location of the corresponding mpicmds. Compiler scripts (wrappers) compile MPI code and automatically link startup and message passing libraries into the executable. Note that the compiler and MVAPICH library are selected according to the modules that have been loaded. The following table lists the compiler wrappers for each language:
| Compiler | Program Type | Suffix |
| mpicc | C | .c |
| mpicxx | C++ | .cc, .C, .cpp, .cxx |
| mpif90 | F77/F90 | .f, .for, .ftn, .f90, .f95, .fpp |
Appropriate program-name suffixes are required for each wrapper. By default, the executable name is a.out. It may be renamed with the -o option. To compile without the link step, use the -c option. The following examples illustrate renaming an executable and the use of two important compiler optimization options:
intel : mpicc/mpif90 -o prog.exe -O2 -xW prog.cc/f90
pgi : mpicc/mpif90 -o prog.exe -fast -tp barcelona-64 prog.f90
Include linker options such as library paths and library names after the program module names, as explained in the Loading Libraries section below. The Running Code section explains how to execute MPI executables in batch scripts and "interactive batch" runs on compute nodes.
We recommend that you use either the Intel or the PGI compiler for optimal code performance. TACC does not support the use of the gcc compiler for production codes on the Ranger system. For those rare cases when gcc is required, for either a module or the main program, you can specify the gcc compiler with the -cc=gcc option. (Since gcc- and Intel-compiled code are binary compatible, you should compile all other modules that don't require gcc with the Intel compiler.) When gcc is used to compile the main program, an additional Intel library is required. The examples below show how to invoke the gcc compiler in combination with the Intel compiler for the two cases (first gcc for a module, then gcc for the main program):
mpicc -O2 -xW -c -cc=gcc suba.c
mpicc -O2 -xW mymain.c suba.o
mpicc -O2 -xW -c suba.c
mpicc -O2 -xW -cc=gcc -L$ICC_LIB -lirc mymain.c suba.o
Compiling OpenMP Programs
Since each of the blades (nodes) of the Ranger cluster is an AMD Opteron quad-processor quad-core system, applications can use the shared memory programming paradigm "on node". With a total number of 16 cores per node, we encourage the use of a shared-memory model on the node.
The OpenMP compiler options are listed below for those who do need SMP support on the nodes. For hybrid programming, use the mpi-compiler commands, and include the openmp options.
Intel : mpicc/mpif90 -O2 -xW -openmp
pgi : mpicc/mpif90 -fast -tp barcelona-64 -mp
Basic Optimization for Serial and Parallel Programming using OpenMP and MPI
The MPI compiler wrappers use the same compilers that are invoked for serial code compilation. So, any of the compiler flags used with the icc command can also be used with mpicc; likewise for ifort and mpif90; and iCC and mpicxx. Below are some of the common serial compiler options with descriptions.
Intel Compiler
| Compiler Options | Description |
| -O3 | performs some compile time and memory intensive optimizations in addition to those executed with -O2, but may not improve performance for all programs. |
| -ipo | Interprocedural optimizations |
| -vec_report[0|...|5] | control amount of vectorizer diagnostic information |
| -xW | includes specialized code for SSE and SSE2 instructions (recommended). |
| -xO | includes specialized code for SSE, SSE2 and SSE3 instructions. Use, if code benefits from SSE3 instructions. |
| -fast | -ipo, -O2, -static DO NOT USE -- static load not allowed. |
| -g -fp | debugging information produced, disable using EBP as general purpose register |
| -openmp | enable the parallelizer to generate multi-threaded code based on the OpenMP directives |
| -openmp_report[0|1|2] | control the OpenMP parallelizer diagnostic level. |
| -help | lists options |
Developers often experiment with the following options: -pad, -align, -ip, -no-rec-div and -no-rec-sqrt. In some codes performance may decrease. Please see the Intel compiler manual (below) for a full description of each option.
Use the -help option with the mpicmds commands for additional information:
mpicc -help
mpif90 -help
mpirun -help {use the listed options with the ibrun cmd}
For details on the MPI standard, go to the URL: www.mcs.anl.gov/mpi.
PGI Compiler
| Compiler Options | Description |
| -O3 | performs some compile time and memory intensive optimizations in addition to those executed with -O2, but may not improve performance for all programs. |
| -Mipa=fast,inline | Interprocedural optimizations. There is a loader problem with this option. |
| -tp barcelona-64 | includes specialized code for the barcelona chip. |
| -fast | -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse -Mcache_align -Mflushz |
| -g, -gopt | debugging information produced |
| -mp | enable the parallelizer to generate multi-threaded code based on the OpenMP directives |
| -Minfo=mp,ipa | Information about OpenMP, interprocedural optimization |
| -help | lists options |
| -help -fast | lists options for -fast |
Loading Libraries
Some of the more useful load flags/options are listed below. For a more comprehensive list, consult the ld man page.
- Use the -l loader option to link in a library at load time, e.g.
compiler prog.f90 -lname
This links in either the shared library libname.so (default) or the static library libname.a, provided that the correct path can be found in ld's library search path or the LD_LIBRARY_PATH environment variable paths.
- To explicitly include a library directory, use the -L option, e.g.
compiler prog.f -L/mydirectory/lib -lname
In the above example, the user's libname.a library is not in the default search path, so the "-L" option is specified to point to the libname.a directory.
Many of the modules for applications and libraries, such as the mkl library module, provide environment variables for compiling and linking commands. Execute module help module_name for a description, listing and use cases for the assigned environment variables. The following example illustrates their use for the mkl library:
mpicc -Wl,-rpath,$TACC_MKL_LIB -I$TACC_MKL_INC mkl_test.c \
      -L$TACC_MKL_LIB -lmkl_em64t
Here, the module-supplied variables TACC_MKL_LIB and TACC_MKL_INC contain the MKL library and header directory paths, respectively. The loader option -Wl,-rpath specifies that the $TACC_MKL_LIB directory should be recorded in the binary executable. This allows the run-time dynamic loader to determine the location of shared libraries directly from the executable instead of from LD_LIBRARY_PATH or the dynamic cache of bindings between shared libraries and directory paths. (This avoids having to set LD_LIBRARY_PATH, "manually" or through a module command, before running the executables.)
Runtime Environment
Bindings to the most recent shared libraries are configured in the file /etc/ld.so.conf (and cached in the /etc/ld.so.cache file). Use cat /etc/ld.so.conf to see the TACC configured directories, or execute
/sbin/ldconfig -p
to see a list of directories and candidate libraries. Use the -Wl,-rpath loader option or the LD_LIBRARY_PATH environment variable to override the default runtime bindings.
The Intel compiler and MKL math libraries are located in the /opt/intel directory (installation date TBD), and application libraries are located in /usr/local/apps ($APPS). The GOTO libraries are located in /opt/apps/gotoblas/gotoblas-1.02 (installation date TBD). Use the
module help libname
command to display instructions and examples on loading libraries.
The SGE Batch System
Batch facilities like LoadLeveler, NQS, LSF, OpenPBS or SGE differ in their user interface as well as in their implementation of the batch environment. Common to all, however, is the availability of tools and commands to perform the most important operations in batch processing: job submission, job monitoring, and job control (hold, delete, resource request modification). Section I describes basic batch operations and their options. Section II discusses the SGE batch environment, and Section III provides the queue structure on SGE. The references at the end of this section include links to SGE manuals. New users should visit the SGE wiki and read the first chapter of the Introduction to the N1 Grid Engine 6 Software document. To help users who are migrating from other systems, a comparison of the IBM LoadLeveler, OpenPBS, LSF and SGE batch options and commands is presented in a separate document.
Section I: Three Operations of Batch Processing: submission, monitoring, and control
Step 1: Job submission
SGE provides the qsub command for submitting batch jobs. Its syntax is:
qsub job_script
where job_script is the name of a file with unix commands. This "job script" file can contain both shell commands and special commented statements that include qsub options and resource specifications. Details on how to build a script follow.
| Table 1. List of the Most Common qsub Options | ||
| Option | Argument | Function |
| -q | queue_name | Submits to queue designated by queue_name |
| -pe | pe_name min_proc[-max_proc] | Executes job via the Parallel Environment designated by pe_name with min_proc-max_proc number of processes |
| -N | job_name | Names the job job_name |
| -S | shell (absolute path) | Use shell as shell for the batch session |
| -M | emailaddress | Specify user's email address |
| -m | {b|e|a|s|n} | Specify when user notifications are to be sent |
| -V | | Use current environment setting in batch job |
| -cwd | | Use current directory as the job's working directory |
| -o | output_file | Direct job output to output_file |
| -e | error_file | Direct job error to error_file |
| -A | account_name | Charges run to account_name. Used only for multi-project logins. Account names and reports are displayed at login. |
| -l | resource=value | Specify resource limits (see qsub man page) |
Options can be passed to qsub on the command line or specified in the job script file. The latter approach is preferable. It is easier to store commonly used qsub options in a script file that will be reused several times than to retype them at every batch request. In addition, it is easier to maintain a consistent batch environment across runs if the same options are stored in a reusable job script.
Batch scripts contain two types of statements: special comments and shell commands. Special comment lines begin with #$ and are followed by qsub options. The SGE shell_start_mode has been set to unix_behavior, which means that the Unix shell commands are interpreted by the shell specified on the first line after the #! sentinel; otherwise the Bourne shell (/bin/sh) is used. The job file below requests an MPI job with 32 cores and 1.5 hours of run time:
#!/bin/bash
#$ -V # Inherit the submission environment
#$ -cwd # Start job in submission directory
#$ -N myMPI # Job Name
#$ -j y # combine stderr & stdout into stdout
#$ -o $JOB_NAME.o$JOB_ID # Name of the output file (eg. myMPI.oJobID)
#$ -pe 16way 32 # Requests 16 cores/node, 32 cores total
#$ -q normal # Queue name
#$ -l h_rt=01:30:00 # Run time (hh:mm:ss) - 1.5 hours
## -M myEmailAddress # Email notification address (UNCOMMENT)
## -m be # Email at Begin/End of job (UNCOMMENT)
set -x #{echo cmds, use "set echo" in csh}
ibrun ./a.out # Run the MPI executable named "a.out"
If you don't want stderr and stdout directed to the same file, don't include the -j option line, and insert a -e option (stderr) to name the file that is to receive stderr output.
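As a sketch of that variant, the script below is the same 32-core request with the -j line dropped and a -e line added. The stderr file name ($JOB_NAME.e$JOB_ID) is a hypothetical choice for illustration; any file name can be used.

```shell
#!/bin/bash
#$ -V                        # Inherit the submission environment
#$ -cwd                      # Start job in submission directory
#$ -N myMPI                  # Job Name
#$ -o $JOB_NAME.o$JOB_ID     # stdout file (eg. myMPI.oJobID)
#$ -e $JOB_NAME.e$JOB_ID     # stderr file (name is an assumption)
#$ -pe 16way 32              # Requests 16 cores/node, 32 cores total
#$ -q normal                 # Queue name
#$ -l h_rt=01:30:00          # Run time (hh:mm:ss) - 1.5 hours
set -x                       # Echo commands
ibrun ./a.out                # Run the MPI executable named "a.out"
```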
MPI Environment for Scalable Code
The MVAPICH-1 and MVAPICH-2 (default) MPI packages provide runtime environments that can be tuned for scalable code. For packages with short messages, there is a "FAST_PATH" option that can reduce communication costs, as well as a mechanism to "Share Receive Queues". There is also a "Hot-Spot Congestion Avoidance" option for quelling communication patterns that produce hot spots in the switch. See Chapter 9, "Scalable features for Large Scale Clusters and Performance Tuning" and Chapter 10, "MVAPICH2 Parameters" of the MVAPICH2 User Guide for more information. The User Guides are available in PDF format at:
MVAPICH User Guides
Understanding the SGE Parallel Environment
Each Ranger node (of 16 cores) can only be assigned to one user; hence, a complete node is dedicated to a user's job and accrues wall-clock time for 16 cores whether they are used or not. The SGE parallel environment option "-pe" sets the number of MPI Tasks per Node (TpN), and the Number of Nodes (NoN). The syntax is:
#$ -pe <TpN>way <NoN x 16>
e.g.
#$ -pe 16way 64 {16 MPI tasks per node, 4 nodes (= a total of 64 assigned cores/16)}
where:
TpN is the Tasks per Node, and
NoN is the Number of Nodes requested.
Note, regardless of the value of TpN, the second argument is always 16 times the number of nodes that you are requesting.
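The arithmetic behind the second -pe argument can be sketched with a short shell function. The name pe_directive is hypothetical and used only for illustration. It takes the tasks per node and the number of nodes and prints the matching directive, with the slot count always computed as nodes x 16.

```shell
# Hypothetical helper (not part of SGE or the TACC environment).
# $1 = tasks per node (TpN), $2 = number of nodes (NoN).
# The second -pe argument is always NoN x 16, whatever TpN is.
pe_directive() {
    tpn=$1
    non=$2
    echo "#\$ -pe ${tpn}way $((non * 16))"
}

pe_directive 16 4   # prints: #$ -pe 16way 64
pe_directive 8 4    # prints: #$ -pe 8way 64
```

Both calls request four nodes, so both print 64 as the second argument; only the number of tasks started on each node differs.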
Using a multiple of 16 cores per node
For "pure" MPI applications, the most cost-efficient choices are: 16 tasks per node (16way) and a total number of tasks that is a multiple of 16. This will ensure that each core on all the nodes is assigned one task. In this case use:
#$ -pe 16way <NoN x 16>
Using fewer than 16 cores per node
When you want to use fewer than 16 MPI tasks per node, the choice of tasks per node is limited to the set of numbers {1, 2, 4, 8, 12, and 15}. When the number of tasks you need is equal to "Number of Tasks per Node x Number of Nodes", then use the following prescription:
#$ -pe <TpN>way <NoN x 16>
where
TpN is a number in the set {1, 2, 4, 8, 12, 15}
If the total number of tasks that you need is less than "Number of Tasks per Node x Number of Nodes", then set the MY_NSLOTS environment variable to the total number of tasks. In a job script, use the following -pe option and environment variable statement:
#$ -pe <TpN>way <NoN x 16>
...
setenv MY_NSLOTS <Total Number of Tasks> { C-type shells }
or
export MY_NSLOTS=<Total Number of Tasks> { Bourne-type shells }
e.g.
#$ -pe 8way 64 {use 8 Tasks per Node, 4 Nodes requested}
...
setenv MY_NSLOTS 31 {31 tasks are launched}
where
TpN is a number in the set {1, 2, 4, 8, 12, 15}
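The choice between the two prescriptions above can be sketched as follows. The function name pe_for_tasks is hypothetical. It rounds the task count up to whole nodes, prints the -pe directive, and prints a MY_NSLOTS setting (Bourne-type syntax) only when the last node is not completely filled. It does not check that the tasks-per-node value belongs to the allowed set.

```shell
# Hypothetical helper (not part of SGE or the TACC environment).
# $1 = tasks per node (TpN), $2 = total number of MPI tasks wanted.
pe_for_tasks() {
    tpn=$1
    tasks=$2
    # Round up to whole nodes: ceil(tasks / tpn).
    nodes=$(( (tasks + tpn - 1) / tpn ))
    echo "#\$ -pe ${tpn}way $((nodes * 16))"
    # MY_NSLOTS is only needed when fewer tasks than TpN x nodes are wanted.
    if [ $((nodes * tpn)) -gt "$tasks" ]; then
        echo "export MY_NSLOTS=$tasks"
    fi
}

pe_for_tasks 8 31
```

For 31 tasks at 8 tasks per node this reproduces the 8way 64 request with MY_NSLOTS set to 31, as in the example above.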
Program Environment for Hybrid Programs
For hybrid jobs, specify the MPI Tasks per Node through the first -pe argument (1/2/4/8/15/16way) and the Number of Nodes through the second -pe argument (as the number of assigned cores = Number of Nodes x 16). Then, use the OMP_NUM_THREADS environment variable to set the number of threads per task. (Make sure that "Tasks per Node x Number of Nodes" is less than or equal to the number of assigned cores, the second argument of the -pe option.) The hybrid job script below illustrates the use of these parameters. It requests 4 tasks per node, 4 threads per task, and a total of 32 cores (2 nodes x 16 cores).
#!/bin/bash                    # {use bash shell}
...
#$ -pe 4way 32                 # {4 tasks/node, 32 cores total}
...
set -x                         #{echo cmds, use "set echo" in csh}
export OMP_NUM_THREADS=4       #{4 threads/task; use "setenv OMP_NUM_THREADS 4" in csh}
ibrun ./hybrid.exe
The job output and error are sent to out.o
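A quick check of the hybrid layout arithmetic can be sketched as below. The function name check_hybrid is hypothetical. It multiplies the tasks per node by the threads per task and reports whether the product fits within the 16 cores of a Ranger node.

```shell
# Hypothetical sanity check (not a TACC-provided command).
# $1 = MPI tasks per node (the "Nway" value), $2 = OMP_NUM_THREADS.
check_hybrid() {
    tpn=$1
    threads=$2
    cores=$((tpn * threads))
    if [ "$cores" -gt 16 ]; then
        echo "oversubscribed: $cores threads on 16 cores"
        return 1
    fi
    echo "ok: $cores of 16 cores used per node"
}

check_hybrid 4 4    # the 4way, 4 threads/task layout used in the script above
```

A layout such as 8 tasks per node with 4 threads each would be reported as oversubscribed, with a non-zero return status.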
Step 2: Batch query
After job submission, users can monitor the status of their jobs with the qstat command. Table 2 lists the qstat options:
| Table 2. List of qstat Options | |
| Option | Result |
| -t | Show additional information about subtasks |
| -r | Show resource requirements of jobs |
| -ext | Displays extended information about jobs |
| -j | Displays information for specified job |
| -qs {a|c|d|o|s|u|A|C|D|E|S} | Show jobs in the specified state(s) |
| -f | Shows "full" list of queue/job details |
The qstat command output includes a listing of jobs and the following fields for each job:
| Table 3. Some of the fields in qstat command output | |
| Field | Description |
| JOBID | job id assigned to the job |
| USER | user who owns the job |
| STATE | current job status, includes (but not limited to) |
| w | waiting |
| s | suspended |
| r | running |
| h | on hold |
| E | errored |
| d | deleted |
For convenience, TACC has created an additional job monitoring utility which summarizes all jobs in the batch system in a manner similar to the "showq" utility from PBS. Execute
showq
to summarize all running, idle, and pending jobs, along with any advanced reservations scheduled within the next week. Note that showq -u will show jobs associated with your userid only (issue showq --help to obtain more information on available options). An example output from showq is shown below:
ACTIVE JOBS--------------------
JOBID     JOBNAME    USERNAME  STATE    PROC  REMAINING  STARTTIME
14694     equillda   user1     Running  16    18:54:07   Tue Feb 3 17:32:41
14701     V          user2     Running  16    7:02:41    Tue Feb 3 17:41:15
14707     V          user3     Running  16    19:11:02   Tue Feb 3 17:49:36
14708     jet08      user4     Running  32    0:38:36    Tue Feb 3 18:17:10
14713     rti        user5     Running  64    3:58:25    Tue Feb 3 20:36:59
14714     cyl        user6     Running  128   23:16:36   Tue Feb 3 21:55:10
6 Active jobs     272 of 556 Processors Active (48.92%)

IDLE JOBS----------------------
JOBID     JOBNAME    USERNAME  STATE    PROC  WCLIMIT    QUEUETIME
14716     bigjob     user7     Idle     512   0:15:00    Tue Feb 3 22:18:57
14719     smalljob   user7     Idle     256   0:15:00    Tue Feb 3 22:35:31
2 Idle jobs

BLOCKED JOBS-------------------
JOBID     JOBNAME    USERNAME  STATE    PROC  WCLIMIT    QUEUETIME
14717     hello      user7     Held     16    0:15:00    Tue Feb 3 22:19:07
14718     hello      user7     Held     32    0:15:00    Tue Feb 3 22:19:15
4 Blocked jobs

Total Jobs: 12   Active Jobs: 6   Idle Jobs: 2   Blocked Jobs: 4

ADVANCED RESERVATIONS----------
RESV ID   PROC  RESERVATION WINDOW
karl#79   556   Tue Mar 23 09:00:00 2004 - Tue Mar 23 17:30:00 2004
The latest queue information can be determined from the following commands (a single command to extract queue structure information will be available later):
qconf -sql {Lists the available queues.}
qconf -sq
cat /share/sge/default/tacc/sge_esub_control {first value after max_cores_per_job_
Step 3: Job control
Control of job behavior takes many forms:
- Job modification while in the pending/run state
Users can reset the qsub options of a pending jo



