TACC's enhanced IBM Power5 System, Champion, now has a homogeneous selection of
resources for both the MPI and OpenMP parallel computing paradigms. It also
provides an almost 50% improvement in peak performance per processor,
and improved sustained performance, compared with the previous heterogeneous
configuration of Longhorn.
Champion consists of 12 nodes with 8 processors each, for a total of
96 processors. Ten of the twelve nodes run dedicated jobs, that is, jobs
that do not share a node. On the remaining two nodes, including the login
(frontend) node, multiple concurrent jobs share each node.
One of the 8 processors in the frontend node is used for
the General Parallel File System (GPFS), and two frontend
processors are used for compiling. Figure 1 illustrates the single p5-575 frame
with its configuration details.
Figure 1. Power5 System Configuration
All the nodes employ the same 1.9GHz IBM Power5 microarchitecture, run
the same AIX 5.3 operating system, and are connected through an
IBM High Performance "Federation" Switch (HPS). These features provide
a common and integrated parallel programming environment throughout
the system. The homogeneous node-architectures provide different
computational spaces for parallel algorithms and paradigms.
The architectural features of the Power5 chip are designed with a high
memory bandwidth to accommodate the superscalar operations of a 1.9
Gigahertz processor. Features such as speculative branching,
out-of-order execution, predication, 8-stream prefetching, and a
3-tier cache hierarchy provide a continuous and high data throughput
for high peak performance.
Figure 2. Power5 Chip
Several important design changes in the Power5 chip improved overall
system performance. One is moving the L3 cache from the memory side of
the ASIC fabric to the processor side of the fabric, though still off-chip.
This removed the L3 cache from the path between the chip and the memory
controller, where it sat in Power4, and in turn allowed the memory
controller to be moved from the fabric onto the chip (see Figure 2).
These two changes had significant additional benefits: reduced latency
to the L3 cache and to memory, increased bandwidth to both, and improved
intrinsic reliability from the reduction in the number of chips needed
to build a system. The p5-575 system thus comprises two chips: a POWER5
chip and an L3 cache chip. Finally, the General-Purpose Registers and the
Floating-Point Registers have both been increased to 120, from 80 and 72
respectively in the Power4 chip. The additional FP rename capability
increases Power5's single thread performance on some HPC workloads. This
gain stems from an enhanced ability to execute critical code sections
out of order.
The cache and memory
characteristics for Power5 are as follows:
- L1 cache consists of a 32 KB write-through data cache with 4-way set associativity and a cache line size of 128 bytes. The transfer rate from the L1 data cache to registers is about 2 words/clock period, or about 30 GB/s, with a latency of 2 clock periods. The L1 instruction cache is 64 KB and is direct-mapped.
- L2 cache is 1.9 MB with write-in, 10-way set associativity and a cache line size of 128 bytes. The transfer rate from L2 to the L1 data cache is 4 words/clock period, or about 60 GB/s, with a latency of 10 clock periods.
- L3 cache is 36 MB with 12-way set associativity and a cache line size of 256 bytes. The transfer rate from L3 to L2 cache is 0.87 words/clock period, or about 9 GB/s, with a latency of about 90 clock periods.
- Memory is 2 GB per DCM (see below), or 16 GB per node, for a total of 192 GB for the whole system. The memory bandwidth is about 16 GB/s per chip with a latency of about 200 clock periods. This is about four times the bandwidth and a third less latency than the Power4 Longhorn system. In performance benchmarks, the observed Power5 bandwidth is about twice that of similar Power4 systems.
Each processor core has 2 floating point units, each of which can perform a
"fused" multiply-add operation (two floating point operations) per clock cycle.
At a clock speed of 1.9 GHz, four floating point operations per cycle deliver a
peak of 7.6 GFLOPS per processor. In all, the 96 processors of the TACC Power5
Champion system can deliver a peak performance of about 730 GFLOPS.
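The peak figures above follow from simple arithmetic. The sketch below reproduces them with awk; the variable names are illustrative only, and the inputs are the numbers quoted in the text.

```shell
# Figures taken from the text above.
clock_ghz=1.9        # clock speed in GHz
flops_per_cycle=4    # 2 floating point units x 1 fused multiply-add (2 flops) each
processors=96        # 12 nodes x 8 processors

# GHz x flops/cycle gives GFLOPS per processor; multiply by the processor count
# for the whole system.
per_cpu=$(awk -v c="$clock_ghz" -v f="$flops_per_cycle" \
    'BEGIN { printf "%.1f", c * f }')
system=$(awk -v c="$clock_ghz" -v f="$flops_per_cycle" -v p="$processors" \
    'BEGIN { printf "%.1f", c * f * p }')

echo "Peak per processor:  $per_cpu GFLOPS"
echo "Peak for the system: $system GFLOPS"
```

The system value comes out just under the "about 730 GFLOPS" quoted above.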
Beyond the die level, the next architectural level of the Power5 system
is the Dual-Chip Module (DCM). This is the basic building block of
an IBM p5-575 HPC system node, which has 8 such DCMs.
Unlike the Power4 MCM, which had 4 chip cores, a Power5 DCM holds only one
processor chip and an L3 cache chip. Each processor chip is
dual-core, with only one core (CPU) active on TACC's Champion
system (see Figure 2). This gives the single active microprocessor greater
bandwidth across all levels of the memory hierarchy, which is
extremely beneficial for most HPC applications.
Note that a considerable amount of the silicon real estate of the Power5 chip
is now devoted to the on-chip memory controller and to the directory for the
L3 cache; consequently, the chip is sometimes referred to as a "system on a chip".
These design changes, together with the modest increases in L2 and L3 cache sizes
and the increased Simultaneous Multi-Threading (SMT) functionality of the new
Power5 chip, have increased the die size to 389 mm^2, from 247 mm^2
for the Power4 chip.
Figure 3. Champion Interconnect Configuration
Note that every node has two ports. However, only one of the ports is linked
to the High Performance Switch, as only one switch is currently needed for a
system of this size (see Figure 3). The sustained switch performance should
reflect this aspect of the configuration. The highest observed MPI bandwidth
through the HPS is 1.88 GB/s (the peak is 2 GB/s), and the reported MPI task
latency is 4 microseconds.
TACC's Power5 system has a variety of disk solutions for different
storage needs. A General Parallel File System, GPFS, is available from
any node and provides parallel access to files through MPI-I/O or
native Unix calls with a distributed locking protocol for coherent
access from any node. This file system has 7.2 TB of disk and one GPFS
processor server, with large stripe groups across multiple disks. File
access is served over the High Performance "Federation" Switch for
high throughput. There are local scratch disks on each node, except
the login node, for tasks that do independent, local I/O. Home
directories are mounted on all computational nodes. Also, for
large-file, long term storage, the TACC archival file system is accessible
from the login node only. The Archive file system is not accessible
from any other Champion nodes outside of the frontend.
The connectivity and size of these file systems
are shown in Figure 7.
Last modified: March 03 2008 11:01:59.
SSH
To ensure a secure login session, users must connect to machines using the
secure shell program, ssh. Telnet is no longer allowed
because of the security vulnerabilities associated with it.
The "r" commands rlogin, rsh, and rcp, as well as ftp,
are also disabled on this machine for similar reasons. These commands are
replaced by the more secure alternatives included in SSH: ssh,
scp, and sftp.
Before any login session can be initiated using ssh, a working
SSH client needs to be present on the local machine. Go to the TACC introduction to SSH for information on downloading
and installing SSH.
To initiate an ssh connection to a machine, type the following on the
local workstation:
ssh <login-name>@<machine-name>.tacc.utexas.edu
Note that the <login-name> is only needed if the user name on the machine being logged
onto differs from the user name on the workstation.
Last modified: October 04 2006 09:47:16.
Login Shell
The most important component of a user's environment is the login shell that interprets text on
each interactive command line and statements in shell scripts. Each login has a line entry in the
/etc/passwd file, and the last field contains the shell launched at login. To
determine your login shell, execute:
grep <my_login_name> /etc/passwd {to see your login shell}
You can use the chsh command to change your login shell; instructions are in
the man page. Available shells are listed in the /etc/shells file with their
full-path. To change your login shell, execute:
cat /etc/shells {select a <shell> from list}
chsh -s <shell> <username> {use full path of the shell}
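Because the login shell is the last colon-separated field of a passwd entry, it can also be extracted directly rather than read off the grep output. The sketch below runs against a small sample file so that it is self-contained; the user jdoe and the entry are made up for illustration.

```shell
# A sample passwd-style file; the entry is hypothetical.
passwd_file=$(mktemp)
printf '%s\n' 'jdoe:x:1001:100:Jane Doe:/home/jdoe:/bin/tcsh' > "$passwd_file"

# Fields are colon-separated; $NF is the last one, the login shell.
login_shell=$(awk -F: -v user="jdoe" '$1 == user { print $NF }' "$passwd_file")
echo "Login shell for jdoe: $login_shell"

rm -f "$passwd_file"
```

On a real system, point the awk command at /etc/passwd and pass your own login name.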
User Environment
The next most important component of a user's environment is the set of environment variables.
Many of the Unix commands and tools, such as the compilers, debuggers, profilers, editors, and
just about all applications that have GUIs (Graphical User Interfaces), look in the environment
for variables that specify information they may need to access. To see the variables in your
environment execute the command:
env {to see environment variables}
The variables are listed as keyword/value pairs separated by an equal (=) sign, as illustrated
below by the HOME and PATH variables.
HOME=/home/utexas/staff/milfeld
PATH=/bin:/usr/bin:/usr/local/apps:/opt/intel/bin
(PATH has a colon (:) separated list of paths for its value.) It is important to realize that
variables set in the environment (with setenv for C shells and
export for Bourne shells) are "carried" to the environment of shell scripts
and new shell invocations, while normal "shell" variables (created with the
set command) are useful only in the present shell. Only environment variables
are seen in the env (or printenv) command; execute set to
see the (normal) shell variables.
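The distinction between environment variables and normal shell variables can be checked with a short experiment. This is a minimal sketch in Bourne-shell syntax (C shells use setenv and set instead); the variable names LOCAL_ONLY and EXPORTED are arbitrary.

```shell
# Bourne-type shell (sh, bash, ksh) syntax.
LOCAL_ONLY="shell variable"        # normal shell variable, not exported
EXPORTED="environment variable"
export EXPORTED                    # now part of the environment

# A child shell inherits only the exported variable.
child_sees_local=$(sh -c 'echo "${LOCAL_ONLY:-unset}"')
child_sees_exported=$(sh -c 'echo "${EXPORTED:-unset}"')

echo "child sees LOCAL_ONLY: $child_sees_local"
echo "child sees EXPORTED:   $child_sees_exported"
```

The child shell reports LOCAL_ONLY as unset, which is why settings that scripts and tools depend on must be exported.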
Startup Scripts
All Unix systems set up a default environment and provide administrators and users with the
ability to execute additional Unix commands to alter the environment. These commands are
"sourced"; that is, they are executed by your login shell, and the variables (both normal and
environmental) as well as aliases and functions are included in the present environment. We
recommend that you customize the login environment by inserting your "startup" commands in
.cshrc_user, .login_user, and
.profile_user files in your home directory.
Basic site environment variables and aliases are set in
/etc/csh.cshrc {C-shell, non-login specific}
/etc/csh.login {C-shell, specific to login}
/etc/profile {Bourne-type shells}
For historical reasons, the C shells source two types of files. The .cshrc
type files are sourced first (/etc/csh.cshrc--> $HOME/.cshrc--> /usr/local/cshrc-->
$HOME/.cshrc_user). These files are used to set up environments that are to be executed
by all scripts and used for access to the machine without a login. For example, the following
commands only execute the .cshrc type files on the remote machine:
scp data lonestar.tacc.utexas.edu: {only .cshrc sourced on lonestar}
ssh lonestar.tacc.utexas.edu date {only .cshrc sourced on lonestar}
The .login type files are used to set up environment variables that you
commonly use in an interactive session. They are sourced after the .cshrc type
files (/etc/csh.login--> $HOME/.login--> /usr/local/login-->
$HOME/.login_user). Similarly, if your login shell is a Bourne shell (bash, sh, ksh,
...), the profile files are sourced (/etc/profile--> $HOME/.profile-->
/usr/local/profile--> $HOME/.profile_user).
The commands in the /etc files above are concerned with operating system
behavior and set the initial PATH, ulimit,
umask, and environment variables such as the HOSTNAME.
They also source command scripts in /etc/profile.d -- the
/etc/csh.cshrc sources files ending in .csh, and
/etc/profile sources files ending in .sh. Many site
administrators use these scripts to set up the environments for common user tools (vim, less, etc.)
and system utilities (ganglia, modules, Globus, LSF, etc.).
TACC has to coordinate the environments on platforms running several operating systems: AIX, Linux,
IRIX, Solaris, and Unicos. In order to efficiently maintain and create a common environment among
these systems, TACC uses its own startup files in /usr/local/etc. A
corresponding file in this etc directory is sourced by the
.profile, .cshrc, and .login files that
reside in your home directory. (Please do not remove these files and the sourcing commands in
them, even if you are a Unix guru.) Any commands that you put in your
.login_user, .cshrc_user, or
.profile_user file are sourced (if the file exists) at the end of the corresponding
/usr/local/etc command files. If you accidentally remove your
.login, .cshrc, or .profile, you can
copy new ones from /usr/local/etc/start-up or execute
/usr/local/bin/install_ut_startups
to get a new copy (your old files are renamed with a date suffix).
Modules
TACC is constantly including updates and installing revisions for application packages, compilers,
communications libraries, and tools and math libraries. To facilitate the task of updating and to
provide a uniform mechanism for accessing different revisions of software, TACC uses the modules
utility.
At login, a basic environment for the default compilers, tools, and libraries is set by several
modules commands. Your PATH, MANPATH,
LIBPATH, directory locations (WORK,
ARCHIVE, HOME, ...), aliases (cdw,
cda, ...) and license paths, are just a few of the environment variables and
aliases created for you. This frees you from having to initially set them and update them whenever
modifications and updates are made in system and application software.
Users who need 3rd party applications, special libraries, and tools for their development can
quickly tailor their environment with only the applications and tools they need. (Building your
own specific application environment through modules allows you to keep your environment free from
the clutter of all the other application environments you don't need.)
Each of the major TACC applications has a modulefile that sets, unsets,
appends to, or prepends to environment variables such as $PATH,
$LD_LIBRARY_PATH, $INCLUDE_PATH,
$MANPATH for the specific application. Each modulefile
also sets functions or aliases for use with the application. A user need only invoke a single
command,
module load <application>
at each login to configure an application/programming environment properly. If you often need an
application environment, place the modules command in your .login_user and/or
.profile_user shell startup file.
Most of the package directories are in /usr/local/apps
($APPS) and are named after the package name
(<app>). In each package directory there are subdirectories that contain
the specific version of the package. The APPS directory structure is shown in
the diagram below:
The directory structure for the fftw package is shown below. The directory fftw in
/usr/local/apps contains 3 different version directories for the package:
2.1.3, 2.1.5 and version 3.0. Since fftw-2.1.5 is the present default version, a fftw link is
created to the default, the fftw-2.1.5 subdirectory.
The directory paths for the different fftw package versions can be
constructed easily with the help of the $APPS variable:
$APPS/<app>/<app.version> {path to specific package version}
$APPS/<app>/<app> {link to default version}
$APPS/fftw/fftw {example, default version directory for fftw}
$APPS/fftw/fftw-2.1.3 {example, directory for earlier version of fftw}
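As a small illustration of the path pattern above, the following sketch builds both paths from variables. APPS is assigned here only so the snippet runs anywhere (on the system it is set for you at login), and the hyphenated version directory name follows the fftw-2.1.3 example; other packages may be named differently.

```shell
APPS=/usr/local/apps     # value given in the text; normally already set at login
app=fftw
version=2.1.3

default_dir="$APPS/$app/$app"             # link to the default version
version_dir="$APPS/$app/$app-$version"    # a specific version

echo "$default_dir"
echo "$version_dir"
```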
The fftw package requires several environment variables that point to its home, libraries, include
files, and documentation. These can be set in your environment by loading the fftw module:
module load fftw
The details of the environmental changes are in the modulefile,
/usr/local/opt/modules/modulefiles/fftw. To see a list of available modules and a
synopsis of a modulefile's operations, execute:
module available {lists modules}
module help <app> {lists environment changes performed for <app>}
During upgrades, new modulefiles are created to reflect the changes made to
the environment variables. TACC will always announce upgrades and module changes in advance.
Another feature of modules is the ease in changing the environment for experimenting with new
updates or backing down to older application versions. TACC will often make a link from
<app>.new to the updated package modulefile
(<app>.<new-version>) that has not become the default version
yet. Also, the retired default modulefile is often linked to
<app>.old. This makes it easier for users to change to new or old
environments with the commands:
module swap <app> <app>.old
module swap <app> <app>.new
(If the app module has not been loaded, then it is only necessary to load the new or old version;
e.g. module load <app>.old.)
For more information on modules and a description of how to build
modulefiles, check out the man pages and the following
link.
For information on customizing your login, check out this
link.
Last modified: October 04 2006 09:47:16.
The TACC HPC platforms have several different file systems with distinct storage characteristics.
There are predefined, user-owned directories in these file systems for users to store their data.
Of course, these file systems are shared with other users, so they are managed by either a quota
limit, a purge policy (time-residency) limit, or a migration policy.
To determine the size of a file system, cd to the directory of interest and execute
the "df" command with the syntax:
df -k .
or simply execute it without the "dot" to see all file systems. In the example
below the file system name appears on the left, and the used and available space
(-k, in units of 1KBytes) appear in the middle columns followed by the percent
used:
% df -k .
File System     1k-blocks   Used      Available   Use%   Mounted on
/dev/vg/home    8256952     6675732   1161792     86%    /home
To determine the amount of space occupied in a user-owned directory, cd to the
directory and execute the du command with the -sb option
(s=summary, b=units in bytes):
du -sb
To determine quota limits and usage on $HOME, execute the quota command
without any options (from any directory):
quota
The five major file systems and directories available on TACC HPC machines are:
- home directory
- The system automatically changes to a user's home directory at login
and this is the recommended location to store your source codes and build your executables.
Lonestar quota limit is 200MB.
Use $HOME to reference your home directory in scripts.
Use cd to change to $HOME.
- work directory
- Store large files and perform most of your job runs from this file system.
This file system is accessible from all the nodes.
Lonestar quota limit is 500GB.
Use $WORK to reference your work directory in scripts.
Use cdw to change to $WORK.
Files are purged if they have not been accessed within 10 days.
- scratch or temporary directory
- This is the directory on each node where you can store files and perform local I/O for the
duration of a batch job. Often, in batch jobs it is more efficient to use and store files
directly on
$WORK (to avoid moving files from scratch
at the end of a job). The scratch directory is only available for the duration of a job.
Lonestar file system size and limit is 25GB on compute nodes.
Use $SCRATCH to reference the scratch directory in scripts.
- san directory
- The SAN directory is available on login nodes (front-ends) of Lonestar, Wrangler, and Champion.
Space on the SAN is an allocatable resource; that is, space is not automatically available
to a project, the Principal Investigator must request space on this file system.
Lonestar project limit is 250GB.
The top level directory of the san is /san/hpc.
- archive directory
- Store permanent files here. This file system has "archive" characteristics (see below). The
access speed is low relative to the work directory.
Lonestar archive has no space limit.
Use $ARCHIVE to reference your archive directory in scripts.
Use cda to change to $ARCHIVE.
$HOME Directories
A user's home directory is the place to store files that are routinely used in development and
day-to-day work. If the output files from production runs are small, then it is reasonable to
store them in $HOME. Home directories are backed up daily; so, if you
accidentally remove a critical file, submit a request using the
Portal to recover the last saved version
(include the full path name of the file(s) or directory, as well as the machine name).
Since the home file system is of limited size, a 200 megabyte quota limit is imposed on every user
(the quota limit is machine specific).
$WORK Directories
The work file system is configured with fast disks on TACC machines and should, therefore,
be used when I/O performance significantly affects program performance. Work can also be used to
store large files temporarily. The work file system may be as simple as SCSI disks arranged in a
RAID-3/4/5 configuration and exported through NFS to compute nodes. On Champion,
parallel and global access to work is provided through GPFS, a General
Parallel File System. On Lonestar, work is a Lustre file system; it can be used
for parallel I/O, and it is accessible from the development/login nodes and all
compute nodes.
The files in work are NOT backed up and are temporary. Files that are corrupted or
accidentally removed are not recoverable. Files that are not accessed within 10 days are
removed. (Each night a "scrubber" program evaluates access times for every file and removes
outdated files. The scrubber does not remove any files if the user has a running batch job.)
Reminder: for permanent storage use the TACC data archive. To see a daily log of the files that
have been removed from your directory, view the file:
/work/reaver/$USER_YYYYMMDD
where $USER is evaluated as your login name, and YYYY, MM, and DD are the year(4 digits), month(2 digits), and day(2 digits) of the log date.
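When building this log file name in a script, note that the shell would read $USER_YYYYMMDD as a single variable name, so braces are needed around USER. The sketch below assumes you want the log for today's date; the fallback login name jdoe is made up so the snippet runs even where USER is unset.

```shell
USER=${USER:-jdoe}           # fallback login name, hypothetical
log_date=$(date +%Y%m%d)     # today's date as YYYYMMDD; substitute the date you need

# Braces are required: "$USER_..." would be read as a different variable name.
purge_log="/work/reaver/${USER}_${log_date}"
echo "$purge_log"
```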
PLEASE NOTE: TACC staff may delete files from work if the work file system becomes full and directories consume an inordinately large amount of disk space, even if files are less than 10 days old. A full work file system inhibits use of the file system for ALL users. The use of programs or scripts to actively circumvent the file purge policy will not be tolerated.
$SCRATCH Directories
$SCRATCH is a "scratch" directory on each compute node, where a batch job can
store files or perform local I/O. The $SCRATCH directory on each node is scrubbed
after the job completes, so a job script's final commands should copy any valuable data to
home or work. The scratch directory is the /tmp file system on the Linux clusters.
On the Lonestar front-end, $SCRATCH is /tmp.
If you use $SCRATCH on the front end, please create your own subdirectory, e.g. $SCRATCH/$USER,
to isolate your files within a single directory.
$SAN Directory
The TACC SAN is a Storage Area Network that is accessible from the front-ends of Lonestar, Wrangler, and Champion.
The /san/hpc/<project_name> directories are for
projects that have been awarded (allocated) long-term space. The present configuration
has ~5TB space for persistent, project-oriented storage. For more details read:
More on SAN.
$ARCHIVE Directories
For long term file storage, use the archive file system ($ARCHIVE). This file
system physically resides on an SGI Origin 2000 (archive.tacc.utexas.edu), a machine dedicated to
supporting the archive file system.
A user's archive directory is available on all TACC HPC computers because the archive file system is mounted as
/archive on the front-end of each system.
It appears as a normal UNIX file system but is managed by DMF, SGI's
Data Migration Facility. Files that have not been accessed in 10 days are moved offline
(migrated) to tape via two StorageTek 9310 robots. DMF automatically and transparently performs
the archival and retrieval of files from the tape robot system. When an off-line file is accessed,
DMF automatically retrieves the file while the process that is accessing the files waits. Under
normal circumstances it takes less than a minute for the robots to start streaming a file's data
back to the disk and for the user's process to continue.
Note: On all TACC production machines (Champion, Lonestar)
$ARCHIVE is only mounted on (available from) the front-end and NOT from any of the
compute nodes. As a consequence, users must be careful to migrate files to and from
$ARCHIVE before jobs are submitted or after completion of their run. Access of any
files located in $ARCHIVE through LoadLeveler/LSF script options will result in the job
hanging in an idle state.
When moving files from archive to a local file system (home or work) use ftp
or rcp if the files are large (>100MB). The cp command
transfer rates from an NFS-mounted file system are about 5 times slower than a transfer with
ftp or rcp.
rcp ${ARCHIVER}:$ARCHIVE/myfile $WORK
For the fastest large-file transfer to a local work directory from archive, use the command above.
Don't forget to include the archive machine name ($ARCHIVER is defined for
you), else the rcp utility will simply use "cp" to do the
transfer. In some shells, curly braces may be required around the environment variables, e.g.
${ARCHIVER} when followed by a colon (:).
Last modified: July 19 2007 16:02:27.
For both the Intel clusters and the IBM Power5 SMP clusters, there are
a couple of programming models to consider.
First, in the distributed memory, message-passing model one may use either
the SPMD or the MPMD programming paradigm. With SPMD, each processor
loads the same program image but executes it on a different data set. Each
processor's data set is its local portion of the distributed arrays, which depends
on the distribution, the array sizes, and the number of processors determined at
runtime. With the MPMD paradigm, each processor loads and executes a different
program image on a different data set, with a similar set of parameters as in the
SPMD paradigm. This distributed memory, message-passing programming model uses MPI
exclusively for its communication. It is advocated as the best model for optimal use
of the clusters and is also suitable for the Power5 nodes.
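To make the SPMD idea of a "local portion" concrete, here is a minimal sketch of one common choice, a block distribution, in which every rank runs the same code and derives its own index range from the array size, the number of processors, and its rank. The function name block_range is hypothetical, and the loop only imitates ranks; in a real MPI program the rank and processor count come from the MPI library at runtime.

```shell
# block_range N NPROCS RANK -> prints "first last" (0-based, inclusive)
block_range() {
    n=$1; nprocs=$2; rank=$3
    base=$((n / nprocs))        # elements every rank gets
    extra=$((n % nprocs))       # the first 'extra' ranks get one more
    if [ "$rank" -lt "$extra" ]; then
        first=$((rank * (base + 1)))
        last=$((first + base))
    else
        first=$((rank * base + extra))
        last=$((first + base - 1))
    fi
    echo "$first $last"
}

# The same code on every "rank", each working on its own slice of a
# 100-element array spread over 8 processors.
for r in 0 1 2 3 4 5 6 7; do
    echo "rank $r: $(block_range 100 8 "$r")"
done
```

Other distributions (cyclic, block-cyclic) change only the index arithmetic, not the overall pattern.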
The other programming model, always recommended on Symmetric Multi-Processor
based nodes such as the IBM Power5 nodes or on Parallel Vector Processing (PVP)
systems, is the SMP (Shared Memory Programming)
model, using either proprietary directives (or pragmas), vendor implementations of
standards like OpenMP, or explicit threaded implementations such as Pthreads.
Here, the major point of distinction is that all of the processors in the
same node share data, hence strict processor to processor message-passing
is unnecessary. Further, additional parallelism can be extracted by use
of threads or multi-threaded implementations of the compilers, libraries,
or tools. With the present state-of-the-art operating systems, compilers,
and tools, it is
highly recommended to use OpenMP on SMP based nodes with MPI across
nodes in a hybrid programming environment. There isn't sufficient evidence
at this time to suggest that using OpenMP with MPI in a hybrid
environment on the clusters would be of great benefit from the performance
point of view. However, we will inform users if this changes in the future.
For further information on OpenMP, MPI and on programming models/paradigms,
please see the manuals and
packages sections for documentation.
Last modified: October 04 2006 09:47:16.
Compiling and Running Serial Programs
The compiler commands can be used for both compiling and loading (making an
executable from a ".o" object file). The tables below list the syntax for
serial and parallel program compilation.
Compiling Serial Programs
| Compiler | Program Type | Suffix | Example |
| xlc_r | C | .c | xlc_r -c [options] ibmc.c |
| xlC_r | C++ | .cpp, .C | xlC_r -c [options] ibmcpp.C |
| xlf_r | F77 | .f | xlf_r -c [options] ibmfor.f |
| xlf90_r | F90 | .f | xlf90_r -c [options] ibmf90.f |
Appropriate program-name suffixes are required for each compiler. By
default, the compiler invocation creates an executable named a.out.
The executable may be renamed with the -o option,
and it is run by specifying its name on the command
line. Here "options" denotes additional, non-default compiler
options that may be added during compilation.
Note: For users migrating from the Longhorn Power4
system, for the sake of uniformity, we now recommend two major
changes. First, use the re-entrant version of the compiler (the
compiler names with the trailing "_r") for all types of codes:
serial, threaded, or parallel. Second, build 64-bit
codes for all applications. All threaded and parallel codes see
a clear performance benefit from these changes. Additional details on this
subject are provided below.
| Compile/Link code | prog.c or prog.f, naming the executable prog |
| C | xlc_r -o prog -q64 [other options] prog.c [linker options] |
| Fortran | xlf90_r -o prog -q64 [other options] prog.f [linker options] |
Here "linker options" denotes zero or more linker (loader) options, such
as library directories and library names. The compiler options
("other options") take precedence over the linker options.
To run the above program interactively, execute:
./prog
The relative path expression "./" tells the shell to look in the present
working directory for the executable. It is often used to make sure that an
executable of the same name in another directory (as determined by the PATH
environment variable) is not executed. Also, if "." is not in the PATH
variable, it is necessary to use "./" for the shell to find the executable. If
"." is in the PATH variable, however, just typing the executable name is
sufficient to run your program.
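The effect of "./" can be seen with a stand-in executable. The sketch below creates a tiny script named prog in a temporary directory and runs it with and without "./"; the PATH value used for the second attempt is an assumption chosen to exclude ".".

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A stand-in "executable": a tiny script named prog.
printf '#!/bin/sh\necho "hello from prog"\n' > prog
chmod +x prog

# With an explicit path the shell does not search PATH at all.
with_dot_slash=$(./prog)

# Without "./" the shell searches PATH; since "." is not in it, prog is not found.
without_dot_slash=$( (PATH=/usr/bin:/bin; prog) 2>/dev/null || echo "not found" )

echo "./prog -> $with_dot_slash"
echo "prog   -> $without_dot_slash"

cd / && rm -rf "$workdir"
```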
A brief discussion of compiler options, their syntax, and a terse explanation
is given below in the optimization section. In addition, typing a compiler name
(xlc, xlf, etc.) on the command line
will give additional information on all of the compiler options.
Online documentation and user guides for compilers are also available.
Compiling OpenMP Programs
All the nodes are eight-way symmetric multiprocessors using shared memory, and as
such are well suited to the Shared Memory Programming (SMP) paradigm. All the nodes
have an equal amount of shared memory, roughly 14 GBytes, available for application use.
Any node can be used effectively as an SMP by combining thread-safe compilers,
appropriate compiler options, run-time variables, and libraries optimized for the
SMP environment.
The compilers are used much as in serial compilation, as given below, except for a slight change
in the naming syntax. As an example, a program is compiled for the SMP environment with the
xlf90_r compiler as follows:
xlf90_r -qsmp=noauto:omp -qnosave -o prog [options] progOMP.f [linker options]
Use "-qsmp=noauto:omp" when compiling programs with OpenMP directives
or pragmas; the "noauto" suboption disables automatic parallelization by the
compiler. The "-qnosave" option makes all local variables automatic and is
required to ensure correct behavior of Fortran programs that
call routines within a parallel region. Also note that -qhot is
turned on by "-qsmp=auto" but not by "-qsmp=omp", so make
sure to include the correct SMP flag if you want high order transformation.
Use -qnohot for consistency with
directive/pragma based SMP runs.
Whenever possible, test any optimizations, including -qhot,
in single-threaded runs
before using -qsmp.
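The pieces above can be put together as a dry run that only prints the commands, since xlf90_r exists only on the AIX system itself. The flags are the ones named in the text; OMP_NUM_THREADS is the standard OpenMP environment variable for the thread count, and the value 8 is an assumption matching the eight processors of a node.

```shell
# Dry run: assemble and print the commands instead of executing them.
smp_flags="-qsmp=noauto:omp -qnosave"
compile_cmd="xlf90_r $smp_flags -q64 -o prog progOMP.f"

# Run with one thread per processor of an eight-way node (assumed setting).
run_cmd="env OMP_NUM_THREADS=8 ./prog"

echo "$compile_cmd"
echo "$run_cmd"
```

In C shells, the run-time variable would instead be set with setenv OMP_NUM_THREADS 8 before launching the program.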
Note that, unlike serial programs,
any program that uses threads or parallel MPI calls must use the re-entrant
version of the compiler, that is, the compiler name with the trailing _r.
Here xlf90_r is the xlf90 compiler linked to the
thread-safe components, as well as the threaded versions of the libraries.
Any of the other compilers can be used in its place with the same naming change. The
compiler options are the same as for serial compilation. The compiler options that
select the SMP environment, whether automatic or directive-based,
must also be included as part of the linker options. A list of such
options, with descriptions, is given below under
Compiler Options. The use of SMP optimized libraries is described below under
Performance Libraries.
Compiling Parallel Programs with MPI
The compiler commands can be used for both compiling and loading (making an
executable from a ".o" object file). IBM MPI shell scripts for Power5
servers are prefixed with mp. The tables below list the syntax for
serial and parallel program compilation.
Compiling Parallel Programs
| Compiler | Program Type | Suffix | Example |
| mpcc_r | C | .c | mpcc_r [options] ibmc.c |
| mpCC_r | C++ | .cpp, .C | mpCC_r -cpp -q64 [other options] ibmcpp.C |
| mpxlf_r | F77 | .f | mpxlf_r -q64 [other options] ibmfor.f |
| mpxlf90_r | F90 | .f | mpxlf90_r -q64 [other options] ibmf90.f |
Appropriate program-name suffixes are required for each compiler. By
default an a.out executable is created by the compiler invocation. The
executable may be renamed with the -o option, and execution is performed by
specifying the executable on the command line. Here "options" denotes additional,
non-default compiler options that may be added during the compilation. For C++
MPI programs, mpCC_r must be used with the -cpp option, along with any other options of the
user's choice.
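The suffix rules in the table above can be sketched as a small shell helper. This is only an illustration: mpi_compiler is a hypothetical function name, it prints a command name rather than invoking a compiler, and because the table lists .f for both F77 and F90, it assumes mpxlf90_r for Fortran sources.

```shell
#!/bin/sh
# Hypothetical helper: print the thread-safe IBM MPI compiler script
# that the suffix table suggests for a given source file.
mpi_compiler() {
    case "$1" in
        *.c)        echo "mpcc_r" ;;        # C
        *.C|*.cpp)  echo "mpCC_r -cpp" ;;   # C++ needs the -cpp option
        *.f)        echo "mpxlf90_r" ;;     # assumption: treat .f as F90
        *)          echo "unknown suffix: $1" >&2; return 1 ;;
    esac
}

mpi_compiler ibmc.c
mpi_compiler ibmcpp.C
mpi_compiler ibmf90.f
```

The printed name can then be combined with -q64 and any other options, as in the Example column.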
| Compile/Link code | prog.c or prog.f, naming the executable prog |
| C | mpcc_r -o prog -q64 [other options] prog.c [linker options] |
| Fortran | mpxlf90_r -o prog -q64 [other options] prog.f [linker options] |
Here "linker options" represents zero or more link-loader options.
The compiler options ("options") take precedence over the link-loader options.
As briefly stated above,
from a performance standpoint it is recommended that all MPI programs
use the 64-bit version of the MPI libraries. These 64-bit MPI libraries are
thread-safe, hence the reentrant versions of the compilers should be used in
conjunction with them. Users who need the 32-bit MPI libraries do not need to
use the _r suffix in the compiler invocation, but will see
poorer performance from the 32-bit versions of these libraries. There are
additional details on this issue later in the section on the
MPI Parallel Programming Environment.
To run the above compiled program interactively, execute:
A brief list of compiler options, their syntax, and a terse
explanation is given below in the optimization section. In addition,
man mpcc or man mpxlf etc. provide additional details on all of the
compiler options. User guides and online documentation provide
additional resources.
Program Optimization for Serial and Parallel Programming using OpenMP and MPI
The MPI shell scripts can compile programs using the normal serial compiler options. Thus,
as an example, any of the compiler flags normally accepted by the xlc command can
also be used on mpcc. Given below are some of the common serial compiler options
with descriptions.
| Compiler Options | Description |
| -O3 | Performs some compile-time and memory-intensive optimizations in addition to those executed with -O2. These optimizations also have the potential to alter the semantics of a user's program. |
| -qarch=pwr5 (or auto) | Specifies the instruction-set architecture for Power5 hardware. |
| -qtune=pwr5 (or auto) | Produces object files optimized for the Power5 architecture, including instruction scheduling and memory hierarchy. |
| -qhot | High-order transformations to maximize the efficiency of loops and array language. |
| -qcache=auto | Specifies the cache configuration for the compiling machine or the relevant executing environment. |
| -qipa | Optimizes by performing analysis across procedures. |
| -bmaxdata: | Specifies the maximum amount of space to reserve for the program data segment, for programs where the size of these regions is a constraint (units=bytes). |
Warning: It has come to our attention that certain codes
give erroneous results when the -qhot (high-order
transformation) compiler option is used. Note that, for optimization levels
above -O3, -qhot is on by default. Please verify your results for
correctness when compiling at those levels or when using -qhot
directly. The option -qnohot disables all high-order
transformations, even at the -O4 or -O5 optimization level. This bug has been
brought to the attention of the IBM compiler group. We will apply any future patches
for this issue.
In addition, in a Shared Memory Programming (SMP) environment, some of the additional
compiler options, which must be used along with runtime environment variables and
SMP optimized libraries, are listed below.
| SMP Compiler Options | Description |
| -qsmp | Enables use of shared memory parallelization by user directives, OpenMP, explicitly, or by MPI calls. |
| -qsmp=omp | Explicit or user-directed OpenMP environment or IBM proprietary directives (or pragmas for C/C++). |
| -qsmp=auto | Automatic parallelization enabled. |
| -qsmp=explicit | Explicit parallelization by use of Pthreads. |
For further details on additional compiler options on both IBM serial compilers and their
respective MPI scripts, please see the
online documentation and man pages.
Limiting Memory Usage
On the Champion Power5 system, the default object mode is 64-bit, which means
that in the default environment you will build executables that run in 64-bit address mode.
Large-memory applications use the -q64 compiler/loader option to enable access to
essentially all of the available memory of any of the TACC Power5 nodes. All of the
third-party applications will be built 64-bit by default. The 32-bit versions of these
applications will only be built for users who need them for dependencies in their legacy code.
The job memory-per-task limit specified in the ConsumableMemory
resource statement is used by LoadLeveler to assign an appropriate
queue to a job proportional to the number of parallel tasks and global
memory required by the job. It does not actually limit the memory used
by the executable on the Champion system.
Hence, large-memory applications (that use the -q64 option), in
particular those that use dynamic memory, should be compiled not to
exceed their "fair share" of physical memory per task (processor).
Compile with the -bmaxdata:1887000000 option.
Exceeding this limit will cause the executable to swap, and in some
cases the node will crash when the total virtual memory requested
exceeds the physical memory plus the available swap space (the swap
disk space has been configured to equal the physical memory size). To
avoid this problem, we ask developers to use the maxdata loader option
to limit the memory used by executables. Note that for the new Power5
Champion system this limit will be enforced where the node usage is
being shared. For users in dedicated mode, this memory limit will not
be strictly enforced and monitored, but users should be aware of the performance impact.
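As a quick check of what the suggested maxdata value means per task, the byte count can be converted to megabytes. This is a sketch that assumes 1 MB = 1048576 bytes; the variable names are illustrative.

```shell
#!/bin/sh
# Convert the suggested maxdata value (bytes) into whole megabytes,
# assuming 1 MB = 1048576 bytes.
maxdata_bytes=1887000000
maxdata_mb=$((maxdata_bytes / 1048576))
echo "maxdata is about ${maxdata_mb} MB per task"
```

The result is just under 1800 MB, in line with the roughly 1.8 GB per processor quoted in the resource limits of this guide.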
Loading Libraries
Some of the more useful load flags/options are listed below. For a more comprehensive list, check
out the ld manpage.
- Use the -l loader option to link in a library at load time, e.g.
xlf90 prog.f -lname
This links in the library libname.a, provided it is found in ld's library search
path.
- To add a directory to the library search path, use the -L option, e.g.
xlf90 prog.f -L/usr/local/mass -lmass
In the above example, the MASS library linked in by the user is not in the default search
path, so the "-L" option must be specified to point to the directory containing
libmass.a.
- To change the name of an external symbol, use the -brename flag at
load time. For example, the command:
xlc prog.c -brename:.fortran_func,.FORTRAN_FUNC
will rename the symbol fortran_func to its upper-case equivalent.
By default, all Fortran symbol names are converted to lowercase. In contrast, C and C++ are
case-sensitive, so upper-case function names remain upper-case. The above loader directive
is useful in ensuring compatibility between Fortran and C. In the previous example, the
Fortran symbol fortran_func is renamed to FORTRAN_FUNC, so that references to
FORTRAN_FUNC in C code that calls the Fortran function resolve correctly.
- To produce a loadmap, use -bsxref:filename as shown in the example below:
xlf90 prog.f -bsxref:map -lsomelib
This command generates address maps for the object file prog.o and writes them, in
alphabetical order, to the file "map" in the current working directory.
Last modified: October 04 2006 09:47:16.
Runtime Environment
Parallel Environment
MPI
OpenMP
Parallel Environment
The Parallel Environment (PE) from IBM is designed for the development and
execution of parallel C, C++, and Fortran programs and has the following:
- Parallel Operating Environment for submitting and managing jobs;
- Message passing libraries (MPI and MPL);
- Parallel debugger pdbx; and
- Xprofiler for analyzing a parallel application's performance.
Parallel Operating Environment
The Parallel Operating Environment (POE) is one part of the Parallel
Environment. It can be looked upon as a "user interface" to PE and has an
interactive parallel shell with a syntax similar to ksh.
Setting the Environment Variables
There are two ways to configure the way a parallel program is executed --
with environment variables or command-line arguments. The POE environment
variables are set using either setenv (csh/tcsh) or
export (ksh, bsh) commands and can be set:
- at the command prompt;
- in the "dot" files (.cshrc_user, .profile_user); or
- in the job script file.
The environment variables can be overridden with POE command line flags. The
environment variable names are the same as the command line option names, but
they start with "MP_" and are all in upper case. For example, the
command line flag -procs has the corresponding environment variable
MP_PROCS.
The following are a few, typical POE command line flags:
- Set number of nodes or tasks
- procs - number of task processes
- nodes - number of physical nodes on which to run parallel tasks
- tasks_per_node - number of tasks to be run on each of the
physical nodes
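The naming rule for POE environment variables (prefix MP_, upper case) can be sketched as a small shell function. poe_env_name is a hypothetical helper used only to illustrate the mapping; it does not call POE.

```shell
#!/bin/sh
# Hypothetical helper: map a POE command line flag to the name of the
# corresponding environment variable (MP_ prefix, upper case).
poe_env_name() {
    flag=${1#-}    # drop the leading "-"
    echo "MP_$(printf '%s' "$flag" | tr '[:lower:]' '[:upper:]')"
}

poe_env_name -procs            # MP_PROCS
poe_env_name -nodes            # MP_NODES
poe_env_name -tasks_per_node   # MP_TASKS_PER_NODE
```

So a flag such as -procs 8 on the poe command line corresponds to setting MP_PROCS to 8 with export (ksh) or setenv (csh) before the run.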
Those interested in using POE command line flags exclusively,
instead of using them through job scripts, should consult the POE documentation
for the remaining flags.
Those interested in using the environment variables through LoadLeveler
job scripts should see the section on running batch code using LoadLeveler.
MPI Parallel Programming Environment
The IBM Power5 nodes are connected to each other via the high
performance singly-linked Federation or High Performance Switch (HPS) interconnect. All
off-node communication can use the IP protocol
or the recommended User Space (US) protocol, which gives better performance
over the HPS switch by using both switch interfaces or adapters on each
node (device is "sn_all"). Users can thus run their MPI or OpenMP
applications over the HPS switch interconnect with either protocol,
as selected in their batch scripts. In addition, the MPI implementation
now uses threads instead of signals for asynchronous activities.
MPI can use several protocols for communication between tasks. IP
(internet protocol) can be used for communication between tasks on a
node or on different nodes, but it incurs relatively high latencies and
high overhead. It remains a useful alternative on systems that do not
have the SP interconnect switch, since it is the only communication
protocol available to MPI tasks over Ethernet, GigE networks, etc.
Off-node communication between tasks is fastest when the US
(user space) protocol is used over the HPS. Although use of this
protocol significantly reduces the overhead and latency when compared
to IP, on-node communication may be more efficiently handled using
the shared memory protocol.
In addition, the MPI implementation is now layered over the
Low-level Application Programming Interface (LAPI) transport layer,
available in the current version of the PE. In
contrast to the "reliable byte stream" approach of the previous PE,
LAPI provides a "reliable message" protocol, which uses much less
storage for jobs with a large number of tasks. It is based on an
"active message style" mechanism that provides a one-sided
communications model in which one process initiates an operation and
the completion of that operation does not require any other process to
take a complementary action. In addition, the 64-bit MPI library has
shown improved performance, particularly with features such as RDMA,
so users are advised to build their applications in 64-bit mode, even
if they do not have any large memory requirements.
The Remote Direct Memory Access (RDMA) data transfer feature has also
been made available on the High Performance Switch interconnect. RDMA
allows the network adapter direct access to the user memory of the
application. This allows the network adapter to move data between
computational tasks with minimal CPU involvement, offering greater
potential for overlapping calculation and communication when using
non-blocking communication. RDMA enables large messages to
utilise both network adapters attached to each of the compute nodes
of the p5-575 system without utilising multiple CPUs. This promises a
substantially increased bandwidth for large messages, particularly
for the 64-bit applications. We show below
how RDMA use is enabled by environment variables.
The parallel operating environment or POE is responsible, among other
things, for managing all communication between the MPI processes whether
on-node or between nodes. The runtime behavior of MPI jobs can be modified
by configuring the many environment variables in POE.
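As an illustration of configuring POE through environment variables, the RDMA settings named in the job script section of this guide (MP_USE_BULK_XFER and MP_BULK_MIN_MSG_SIZE) could be set as follows in a ksh-style shell. This is a sketch, not a tuned configuration; csh users would use setenv instead of export.

```shell
#!/bin/sh
# RDMA (bulk transfer) settings for a 64-bit MPI application, ksh/sh syntax.
export MP_USE_BULK_XFER=yes
export MP_BULK_MIN_MSG_SIZE=150K   # per this guide, can be raised up to 2MB
echo "MP_USE_BULK_XFER=$MP_USE_BULK_XFER MP_BULK_MIN_MSG_SIZE=$MP_BULK_MIN_MSG_SIZE"
```

In a LoadLeveler job script the same name=value pairs can instead be listed on the environment keyword line, in the same way MP_SINGLE_THREAD=yes is listed in the MPI job script examples.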
OpenMP Parallel Programming Environment
Implementation of shared memory parallelization is done through the
creation of user threads, which are mapped to kernel threads by the
operating system.
OpenMP codes run initially with just one user thread, but a team of threads
is created when the first parallel region is encountered. Upon exiting the
parallel region, only one thread resumes execution of the serial portion,
while the rest consume CPU time waiting for the next parallel region.
The maximum time a thread spends waiting is regulated by the environment
variable SPINLOOPTIME. After this maximum interval is exceeded,
a thread can either go to sleep or yield its place on the kernel to
another runnable thread, provided that a yield time has been specified as
part of the OpenMP environment. Reactivating a sleeping thread is more
costly than reactivating one that is in a yielded state.
The behavior of threads can be modified by setting several POE environment
variables. Please check the reference for more details.
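A minimal sketch of such settings in a ksh-style shell is given here. The numeric values are placeholders rather than tuned recommendations, and YIELDLOOPTIME is assumed to be the AIX variable that controls the yield phase; check the IBM documentation before relying on either.

```shell
#!/bin/sh
# Illustrative thread settings for an OpenMP run on one eight-way node.
export OMP_NUM_THREADS=8    # one thread per processor of the node
export SPINLOOPTIME=500     # placeholder: spin phase before yielding or sleeping
export YIELDLOOPTIME=100    # assumed variable name; placeholder value
echo "threads=$OMP_NUM_THREADS spin=$SPINLOOPTIME yield=$YIELDLOOPTIME"
```

In csh the equivalent form is setenv OMP_NUM_THREADS 8, as used in the job script examples of this guide.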
Running Interactive Programs
All parallel jobs run under the Parallel Operating Environment (POE). Use the
interactive session only to develop and debug your parallel programs. Use a
command similar to the one given below for interactive jobs.
| % poe your_job job_args -tasks_per_node n -nodes 1 -rmpool 1 |
Here n is the number of processors on which to run. Interactive runs, which are
queued through the development class, are limited to 5 processors,
4 hours, and up to 8 GBytes of total memory.
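The processor limit can be checked before launching. The sketch below uses a hypothetical helper, interactive_cmd, which only prints the poe command line (with ./a.out standing in for your_job job_args) and refuses counts above the five-processor limit.

```shell
#!/bin/sh
# Hypothetical helper: print the interactive poe command for n processors,
# or fail if n is outside the interactive limit of 5 processors.
interactive_cmd() {
    n=$1
    if [ "$n" -lt 1 ] || [ "$n" -gt 5 ]; then
        echo "interactive runs are limited to 5 processors" >&2
        return 1
    fi
    echo "poe ./a.out -tasks_per_node $n -nodes 1 -rmpool 1"
}

interactive_cmd 4
```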
Running Batch Code using LoadLeveler
What is LoadLeveler?
LoadLeveler Batch Resources
Building a Simple Job Command File
Building an Advanced Job Command File
LoadLeveler Keywords
Submitting a Job
Managing a Job
Summary of LoadLeveler Commands
What is LoadLeveler?
LoadLeveler is batch job scheduling software that provides the ability to
submit and manage jobs. It has three interfaces.
- Command line interface - provides access to basic job and
administrative functions.
- Graphical user interface (xloadl) - provides the same functionality as
the command line interface; however, the command line interface is
generally found to be more efficient.
- Application programming interface (API) - allows application programs
that are written by users and administrators to interact with the
LoadLeveler environment.
LoadLeveler Batch Resources on Champion
New Resource Limits
Following the recent upgrade with the IBM Power5 nodes, the entire
system consists of eleven eight-way IBM Power5 HPC nodes and a
login node with 7 processors available for computation. The complete
system consists of 96 Power5 compute processors with an aggregate
memory of 192 GB, 7.2 TB of total disk, and a theoretical peak
performance of about 730 Gigaflops.
In order to optimize node usage or job throughput on the newly
configured Power5 system, a new resource policy
has been implemented. The
following table summarizes the new resource limits:
Table I: New resource limits on the TACC IBM Power5 system
| # cpus | memory | walltime | class |
| 2-5 procs | 1.8 GB/proc | 4 hrs limit | development |
| 1-2 procs | 1.8 GB/proc | 12 hrs limit | serial |
| up to 32 procs | 1.8 GB/proc | 24 hrs limit | normal |
Users need to understand the logic employed by the filter which parses
the job script commands and schedules the job accordingly. Because of the limited
total processor count and the homogeneous nature of the Champion system
relative to the Longhorn system, the filter logic is simplified. If certain user options
are not allowed, the filter provides the closest options
so that the user can make the appropriate choice. The filter follows these
general guidelines:
- Users need to specify whether they require node usage to be shared or not, unless
all eight processors of the node(s) are in use, in which case the
node usage automatically becomes "not shared" or dedicated. The filter
will return the job to the user if this requirement is not fulfilled in
the script.
- All jobs requiring a single node or more will require dedicated usage. In
this situation, users are requested to submit processor counts which are
multiples of eight, if feasible. If the user requests, for example, 20 processors or 2.5 nodes, they
will be charged for using three nodes.
- For users requiring fewer than eight processors, the first option
is to use the development queue. If it is unavailable, or if the necessary run time
or memory requirements exceed those offered in the development queue, then users
should request a processor count of four in the normal queue for shared usage. Due to limited
resources and for fair usage, users cannot run on processor counts different from
four for shared usage on a single node; such a job must be resubmitted with
the correct processor count of four.
- Application codes using threads, for OpenMP code or hybrid MPI-OpenMP code,
are allocated a single processor for each thread. However, users
must make sure that the total number of processors and threads in use does
not exceed the total number of "(processors) x (nodes)".
- Interactive jobs are currently queued through the development queue class only, and
are not allowed to run through any other available queue class. Similarly, interactive serial
jobs are queued through the serial queue class. This policy will be enforced
for fairness to all users. Users with special requests, such as benchmarks, which fall
beyond the purview of the current class queues should make these requests well ahead of
their scheduled runs.
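The charging rule for dedicated nodes amounts to rounding the processor count up to whole eight-way nodes. A minimal sketch of that arithmetic follows; nodes_charged is a hypothetical helper name and is not part of the job filter.

```shell
#!/bin/sh
# Hypothetical helper: number of eight-way nodes charged for a processor
# count, rounding up to whole nodes.
nodes_charged() {
    procs=$1
    per_node=8
    echo $(( (procs + per_node - 1) / per_node ))
}

nodes_charged 20   # 20 processors (2.5 nodes) are charged as 3 nodes
nodes_charged 32   # 32 processors are charged as 4 nodes
```

Requesting multiples of eight therefore avoids paying for processors that sit idle.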
Building a Simple Job Command File
For all batch submitted jobs, a job script or job command file needs
to be created. This section is for first-time users
of the IBM system, who should start with the simple job script described
here. Users already experienced with LoadLeveler job
scripts should go to the next section, where the intricacies
of building more advanced and detailed job scripts are described.
Some of the general rules and information necessary for building a simple job
script are given below, followed by example scripts for an MPI job and an OpenMP job.
- Keyword statements, which are case insensitive, begin with #@
followed by LoadLeveler keywords and
statement components; to comment out any statements begin with ##;
- Common functionality such as shell, initialdir, input, output, job
name etc., are common to all job scripts, and should be taken verbatim from the examples
below and modified according to specific user and job.
- Shell command statements can go anywhere except where LoadLeveler
keyword statements are being defined.
- A special keyword statement used in the simple job script is the total number of tasks for MPI jobs or the number of threads for OpenMP jobs. The user should specify one of these in the job script as tasks or threads, depending on whether it is an MPI or an OpenMP job respectively.
- Users should also have an idea of the amount of memory the job consumes. If it
is not specified, the job filter assumes the maximum limit, which may not be
sufficient for the job. The job will then be scheduled and may exit with
errors. The memory statement is another special keyword statement; its value is the memory per task or thread in MB (see the memory keyword in the example scripts below).
For example, a value of 2 implies that the memory per task or thread is 2 MB. The job filter then computes the total memory by multiplying this value by the number of tasks or threads specified. Note that generally memory per task or thread should not exceed 2 GB or 2000 MB.
- Finally, the user should have a rough idea of the runtime. This is for fairness in
how jobs are scheduled. Jobs with a shorter runtime are scheduled ahead of those
that are longer, all other things being equal, so an overly long runtime request may lengthen
the expected turnaround time. The runtime is specified with another special keyword statement (see the walltime keyword in the example scripts below, given as hh:mm:ss).
For example, a value of 00:10:00 specifies that the wall clock time is 10 minutes.
- Based on the user providing correct information in all three of the above special keyword statements, the job
filter will then infer the other keyword statements that are necessary for the job to be scheduled. In certain situations, these generated conditions may create contradictory options
for the filter; the job script is then returned to the user to make the appropriate choice.
- Another Loadleveler keyword statement
also has to be added in addition to the special keyword statements above for all
MPI and OpenMP jobs. Although
this keyword statement is not a special one, it is necessary for the job filter to
know the classification of the job.
- Users who belong to multiple projects should specify the project name to charge,
as shown below for the A-abc project:
Project names are listed in the accounting report at login.
- The environment variable OMP_NUM_THREADS has to be specified by the user
for OpenMP jobs. All other environment variables will be automatically generated
by the filter for both OpenMP and MPI jobs.
Note: Users should be aware that all of the special keyword statements can ONLY be used in the simple
job scripts. Any combination of the special keyword statements with some of the more
advanced keyword statements will result in incorrect filtering and will result in either the script being returned to the user or an erroneous
job script sent to the scheduler.
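The total memory that the job filter derives from the simple keywords is the per-task (or per-thread) value multiplied by the task or thread count. The sketch below reproduces that arithmetic; total_memory_mb is a hypothetical helper, and its 2000 MB check follows the guideline stated above.

```shell
#!/bin/sh
# Hypothetical helper: total memory (MB) for a simple job script,
# given the task/thread count and the memory value per task/thread.
total_memory_mb() {
    count=$1
    per_mb=$2
    if [ "$per_mb" -gt 2000 ]; then
        echo "memory per task or thread should not exceed 2000 MB" >&2
        return 1
    fi
    echo $(( count * per_mb ))
}

total_memory_mb 8 100   # tasks = 8, memory = 100  -> 800 MB
total_memory_mb 8 200   # threads = 8, memory = 200 -> 1600 MB
```

The two calls use the values from the MPI and OpenMP example scripts that follow.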
An example of a job command file for an MPI job follows:
#@ shell = /bin/csh
#@ initialdir = /home/login_name/work_dir
#@ job_name = my_job
#@ input = /dev/null
#@ output = $(job_name).o$(jobid)
#@ error = $(job_name).o$(jobid)
#@ notify_user = your_email
#@ job_type = parallel
#@ environment = COPY_ALL;
#@ walltime = 01:00:00
#@ tasks = 8
#@ memory = 100
#@ class = normal
#@ queue
poe ./a.out
An example of a job command file for an OpenMP job follows:
#@ shell = /bin/csh
#@ initialdir = /home/login_name/work_dir
#@ job_name = my_job
#@ notify_user = your_email
#@ input = /dev/null
#@ output = $(job_name).o$(jobid)
#@ error = $(job_name).o$(jobid)
#@ job_type = parallel
#@ environment = COPY_ALL; LL_JOB=TRUE;
#@ walltime = 01:00:00
#@ threads = 8
#@ memory = 200
#@ class = normal
#@ queue
setenv OMP_NUM_THREADS 8
./a.out
Building an Advanced Job Command File
Every LoadLeveler job must be submitted through a job command file. The job
command file has LoadLeveler statement lines with LoadLeveler keywords that
describe the job. Some of the general rules/information are explained below,
in addition to explaining the more important and complicated LoadLeveler keyword
statements whose understanding is absolutely essential. Note that the special
keywords used for the simple jobscript CANNOT be used for the advanced version
of the job script described in this section.
Different MPI and OpenMP
jobs require separate job scripts and some examples of the different scenarios
are listed following the explanation of the general rules below.
- Keyword statements, which are case insensitive, begin with #@
followed by LoadLeveler keywords and
statement components; to comment out any statements begin with ##;
- Shell command statements can go anywhere except where LoadLeveler
keyword statements are being defined.
- The fine-grained resource management on a node is done by the Work Load
Manager (WLM). The following resource statement
must be included in batch scripts.
| #@ resources = ConsumableCpus(CPUSpTASK) ConsumableMemory(MEMpTASK) |
where MEMpTASK = amount of MEMory required Per TASK,
and CPUSpTASK = number of CPUs required Per TASK.
For OpenMP jobs use the value of OMP_NUM_THREADS for CPUSpTASK.
For MPI jobs use the value of 1 for CPUSpTASK.
E.g. if ConsumableMemory(1024MB) is used with
tasks_per_node=4, then 4096 Megabytes of memory will
be reserved for the job. For OpenMP programs there is only 1 task,
so you must make MEMpTASK equal to the amount of memory you need
for ALL threads.
[A master thread and its spawned group of threads form a single
task, and the ConsumableCpus argument is the number of CPUs reserved
for the task.]
- Parallel MPI jobs that run across multiple nodes have to specify
the use of the HPS Switch interconnect for off-node
communication. This involves setting MP_EUILIB to US and
MP_EUIDEVICE to sn_all.
A concise way of doing this is to set the #@ network.MPI_LAPI keyword to
the following
| #@ network.MPI_LAPI = <device>,<shared|not_shared>,<protocol> |
| where <device> = | sn_all | uses dual-plane Switch adapters |
| | en1 | Gigabit ethernet adapter |
| where <protocol> = | US | to be used by the dual-plane Switch adapters only |
| | IP | can be used by ALL of the interconnects |
- A number of environment variables, when used appropriately, impact performance.
Some of the environment variables are set by default by the system. Other environment variables
need to be used under certain conditions. Here is a listing of some of them:
Default Settings
These settings are already set by default and users need not declare them in the job script
unless they need to be changed.
MP_SHARED_MEMORY=yes
MP_EUIDEVICE=sn_all
MEMORY_AFFINITY=MCM
Using RDMA
Users wanting to use the RDMA feature need to set the following environment variables in their job scripts when
running their 64-bit application:
MP_USE_BULK_XFER=yes
MP_BULK_MIN_MSG_SIZE=150K (Note: this size can be increased up to 2MB.)
Special Environment Use
The following environment variables can be declared and used under most general conditions, but not
under all circumstances. The exceptions are pointed out.
MP_SINGLE_THREAD=yes (except for MPI-I/O or explicit MPI threading routines)
MP_TASK_AFFINITY=MCM (except for OpenMP routines)
- All OpenMP jobs, whether "serial" or "parallel", and
hybrid MPI-OpenMP jobs, which are "parallel", are so termed simply from
the number of tasks that are executed. Each of these tasks, however,
spawns multiple threads. The processors for these threads are reserved by the
|
#@ resources = ConsumableCpus(n)..
|
statement, where n
is an integer between 1 and 8 for p575 HPC nodes. Note that this statement only
reserves n processors per task; it does not set the number of OpenMP threads. That is
done through the OMP_NUM_THREADS environment variable, as in the setenv OMP_NUM_THREADS line of the OpenMP example below.
- For all MPI, OpenMP or a hybrid combination jobs, the job filter
evaluates the number of processors to reserve per node by the following
formula
|
number of processors/node = tasks_per_node x ConsumableCpus(n)
|
For MPI jobs, n=1, while for OpenMP jobs, n=OMP_NUM_THREADS.
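The per-node arithmetic described in this list can be sketched in shell. The helper names procs_per_node and memory_per_node_mb are hypothetical; they mirror the formula above and the ConsumableMemory example, and do not query LoadLeveler.

```shell
#!/bin/sh
# Hypothetical helpers mirroring the job filter arithmetic.
# procs_per_node: tasks_per_node x ConsumableCpus(n)
procs_per_node() {
    echo $(( $1 * $2 ))
}
# memory_per_node_mb: tasks_per_node x ConsumableMemory (in MB)
memory_per_node_mb() {
    echo $(( $1 * $2 ))
}

procs_per_node 8 1          # MPI: 8 tasks, 1 CPU each     -> 8
procs_per_node 1 8          # OpenMP: 1 task, 8 threads    -> 8
memory_per_node_mb 4 1024   # 4 tasks, 1024 MB per task    -> 4096
```

Both the MPI and the OpenMP layout fill one eight-way node, which is why their example scripts request not_shared node usage.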
An example of a job command file for MPI jobs on one p575 HPC node follows:
#@ shell = /bin/csh
#@ initialdir = /home/login_name/work_dir
#@ job_name = my_job
#@ input = /dev/null
#@ output = $(job_name).o$(jobid)
#@ error = $(job_name).o$(jobid)
#@ job_type = parallel
#@ environment = COPY_ALL; MP_SINGLE_THREAD=yes;
#@ resources = ConsumableCpus(1) ConsumableMemory(100mb)
#@ network.MPI_LAPI = sn_all, not_shared, US
#@ wall_clock_limit = 01:00:00
#@ node = 1
#@ tasks_per_node = 8
#@ node_usage = not_shared
#@ notify_user = your_email
#@ notification = error
#@ class = normal
#@ queue
poe ./a.out
An example of a job command file for MPI jobs on multiple p575 nodes follows:
#@ shell = /bin/csh
#@ initialdir = /home/login_name/work_dir
#@ job_name = my_job
#@ input = /dev/null
#@ output = $(job_name).o$(jobid)
#@ error = $(job_name).o$(jobid)
#@ job_type = parallel
#@ environment = COPY_ALL; MP_SINGLE_THREAD=yes;
#@ node_usage = not_shared
#@ resources = ConsumableCpus(1) ConsumableMemory(100mb)
#@ wall_clock_limit = 24:00:00
#@ network.MPI_LAPI = sn_all, not_shared, US
#@ node = 4
#@ tasks_per_node = 8
#@ notification = never
#@ class = normal
#@ queue
poe ./a.out
Note: Comparing the two MPI job scripts for running application codes
on one node and on more than one node, the main difference is specifying the appropriate
network/adapter information. In the latter script, that is contained in the line
| #@ network.MPI_LAPI = sn_all, not_shared, US |
An example of a job command file for an OpenMP job on a single p575 node follows:
#@ shell = /bin/csh
#@ initialdir = /home/login_name/work_dir
#@ job_name = my_job
#@ input = /dev/null
#@ output = $(job_name).o$(jobid)
#@ error = $(job_name).o$(jobid)
#@ job_type = parallel
#@ environment = COPY_ALL; LL_JOB=TRUE;
#@ resources = ConsumableCpus(8) ConsumableMemory(1000mb)
#@ network.MPI_LAPI = sn_all, not_shared, US
#@ wall_clock_limit = 01:00:00
#@ node = 1
#@ tasks_per_node = 1
#@ node_usage = not_shared
#@ notify_user = your_email
#@ notification = error
#@ class = normal
#@ queue
setenv OMP_NUM_THREADS 8
./a.out
Note that unlike the MPI jobs, the ConsumableCpus(8) statement above is
critica |