HPC Systems

Sun Constellation Linux Cluster

System Name: Ranger
Host Name: ranger.tacc.utexas.edu
IP Address: 129.114.50.163
Operating System: Linux
Number of Nodes: 3,936
Number of Processing Cores: 62,976
Total Memory: 123TB
Peak Performance: 504TFlops
Total Disk: 1.73PB (shared), 31.4TB (local)
Description:

“Ranger” is the largest computing system in the world for open science research. As the first of the new NSF Track2 HPC acquisitions, this system provides unprecedented computational capabilities to the national research community and ushers in the petascale science era. Ranger will enable breakthrough science that has never before been possible, and will provide groundbreaking opportunities in computational science & technology research – from parallel algorithms to fault tolerance, from scalable visualization to next-generation programming languages.

Ranger went into production on February 4, 2008 using Linux (based on a CentOS distribution). The system components are connected via a full-CLOS InfiniBand interconnect. Eighty-two compute racks house the quad-socket compute infrastructure, with additional racks housing login, I/O, and general management hardware. Compute nodes are provisioned using local storage. Global, high-speed file systems are provided by the Lustre file system, running across 72 I/O servers. Users interact with the system via four dedicated login servers and a suite of eight high-speed data servers. Resource management for job scheduling is provided by Sun Grid Engine (SGE).
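
The headline figures above can be broken down per node. The following is a minimal Python sketch, not TACC tooling: it uses only numbers quoted on this page, assumes decimal units (1 TB = 1,000 GB, 1 TFLOPS = 1,000 GFLOPS), and all variable names are illustrative.

```python
# Per-node breakdown of the Ranger figures quoted on this page.
# Assumption: decimal units (1 TB = 1,000 GB, 1 TFLOPS = 1,000 GFLOPS).

NODES = 3_936            # "Number of Nodes"
CORES = 62_976           # "Number of Processing Cores"
MEMORY_TB = 123          # "Total Memory"
PEAK_TFLOPS = 504        # "Peak Performance"
SOCKETS_PER_NODE = 4     # "quad-socket compute infrastructure"

cores_per_node = CORES // NODES
cores_per_socket = cores_per_node // SOCKETS_PER_NODE
memory_per_node_gb = MEMORY_TB * 1_000 / NODES
peak_per_core_gflops = PEAK_TFLOPS * 1_000 / CORES

print(f"cores per node:       {cores_per_node}")
print(f"cores per socket:     {cores_per_socket}")
print(f"memory per node (GB): {memory_per_node_gb:.2f}")
print(f"peak per core (GF):   {peak_per_core_gflops:.2f}")
```

Under these assumptions each node works out to 16 cores (4 per socket) and about 31 GB of memory in decimal units, and each core to roughly 8 GFLOPS of peak performance.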

Any researcher at a U.S. institution can submit a proposal to request an allocation of cycles on the system. The request must describe the research, justify the need for such a powerful system to achieve new scientific discoveries, and demonstrate that the proposer's team has the expertise to utilize the resource effectively.
     · 90% of the system is dedicated to the TeraGrid (http://www.teragrid.org)
     · 5% of the system is allocable to Texas higher education institutions
     · 5% of the system is allocable to industry through TACC’s Science & Technology Affiliates for Research (STAR) Program
To submit a proposal to request an allocation, please visit the TeraGrid website.


Researchers at Texas higher education institutions, please contact Chris Hempel.

For more information about using Ranger, see the Ranger User Guide.


Dell Linux Cluster

System Name: Lonestar
Host Name: lonestar.tacc.utexas.edu
 (lslogin1.tacc.utexas.edu
 lslogin2.tacc.utexas.edu)
IP Address: 129.114.50.31 & .32
Operating System: Linux
Number of Processors: 5,840 cores (compute)
Total Memory: 11.6 TB
Peak Performance: 62 TFLOPS
Total Disk: 106.5TB (local), 70TB (global)
Description:
The TACC Dell Linux Cluster contains 5,840 cores within 1,460 Dell PowerEdge 1955 compute blades (nodes), 16 PowerEdge 1850 compute-I/O server-nodes, and 2 PowerEdge 2950 (2.66GHz) login/management nodes. Each compute node has 8GB of memory, and the login/development nodes have 16GB. The system storage includes a 70TB parallel (WORK) Lustre file system, and 106.5TB of local compute-node disk space (73GB/node). An InfiniBand switch fabric, employing PCI Express interfaces, interconnects the nodes (I/O and compute) through a fat-tree topology, with a point-to-point bandwidth of 1GB/sec (unidirectional speed).

Compute nodes have two processors, each a Xeon 5100 series 2.66GHz dual-core processor with a 4MB unified (Smart) L2 cache. Peak performance for the four cores is 42.6 GFLOPS. Some of the key features of the Core micro-architecture are: dual-core, L1 Instruction cache, 14-stage pipeline, eight pre-fetch units, Macro Ops Fusion, double-speed integer units, Advanced Smart (sharing) L2 cache, and 16 new Supplemental SSE3 (SSSE3) instructions. The memory system uses Fully Buffered DIMMs (FB-DIMMs) and a 1333 MHz (10.7 GB/sec) front side bus.
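
The Lonestar totals can be reproduced from the per-node figures. The sketch below is illustrative only. It assumes each core retires 4 double-precision floating-point operations per clock cycle, a value that is not stated in the text but is consistent with the quoted 42.6 GFLOPS per node, and it uses decimal units (1 TB = 1,000 GB).

```python
# Reproduce the Lonestar totals from the per-node figures on this page.
# Assumption (not stated in the text): 4 double-precision floating-point
# operations per core per clock cycle. Units are decimal (1 TB = 1,000 GB).

NODES = 1_460                  # Dell PowerEdge 1955 compute blades
CORES_PER_NODE = 4             # two dual-core Xeon 5100 processors
CLOCK_GHZ = 2.66
FLOPS_PER_CYCLE = 4            # assumed, see note above
MEMORY_PER_NODE_GB = 8
LOCAL_DISK_PER_NODE_GB = 73

total_cores = NODES * CORES_PER_NODE
node_peak_gflops = CORES_PER_NODE * CLOCK_GHZ * FLOPS_PER_CYCLE
system_peak_tflops = NODES * node_peak_gflops / 1_000
total_memory_tb = NODES * MEMORY_PER_NODE_GB / 1_000
total_local_disk_tb = NODES * LOCAL_DISK_PER_NODE_GB / 1_000

print(f"total cores:          {total_cores}")
print(f"peak per node (GF):   {node_peak_gflops:.2f}")
print(f"system peak (TF):     {system_peak_tflops:.1f}")
print(f"total memory (TB):    {total_memory_tb:.2f}")
print(f"total local disk (TB): {total_local_disk_tb:.2f}")
```

The computed values (5,840 cores, about 42.6 GFLOPS per node, 62.1 TFLOPS, 11.68 TB of memory, 106.58 TB of local disk) agree with the listed figures up to rounding or truncation.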

Get more information about using Lonestar in the Dell Linux Cluster User Guide. For information about how to request an allocation on this system, go to the HPC section of the Allocations page.


TACC Stampede Cluster

System Name: Stampede
Host Name: slogin1.tacc.utexas.edu
IP Address: 129.114.50.77
Operating System: Linux
Number of Processors: 1744 (compute cores)
Total Memory: 1800 GB
Peak Performance: 16 TFLOPS
Total Disk: 520 GB (local), 536 GB (shared), 68 TB (global, shared)
Description:

The current configuration of Stampede consists of 218 compute nodes, two login nodes, and a dedicated file server attached to one of the compute nodes. The nodes are interconnected using Gigabit Ethernet technology. Each compute node has two quad-core Intel Clovertown processors, 8 GB of memory, 600 GB of local disk space (of which 520 GB is available to the user), and 536 GB of shared disk. The dedicated file server provides 3.7 TB of storage to certain users and is mounted on all of the compute nodes. The system can access 68 TB of global, parallel file storage that is managed by the Lustre file system and shared with the TACC Lonestar system.

Get more information about using Stampede in the Stampede User Guide. For information about how to request an allocation on this system, go to the HPC section of the Allocations page.


Sun Microsystems® StorageTek Mass Storage Facility

System Name: Ranch
Host Name: ranch.tacc.utexas.edu
IP Address: 129.114.50.81
Operating System: Linux
Total Disk: 1 PB uncompressed data (2,000 tapes / 15.5 TB disk cache)
Description:

TACC's long-term mass storage solution is a Sun Microsystems® StorageTek Mass Storage Facility, nicknamed Ranch. Ranch utilizes Sun's Storage Archive Manager Filesystem (SAM-FS) for migrating files to/from a tape archival system with a current storage capacity of 1 PB.

Ranch's disk cache is built on a Sun ST6540 disk array containing approximately 15.5 TB of spinning disk. This disk array is controlled by a Sun x4600 SAM-FS metadata server, which has 16 CPUs and 32 GB of RAM.

A single Sun StorageTek SL8500 Automated Tape Library houses all of the offline archival storage. Each SL8500 library contains 10,000 tape slots and 64 tape drive slots. Each tape is capable of holding 500 GB of uncompressed data, so when fully populated, a single SL8500 library can house 5 PB. Each SL8500 library also contains four handbots to manage tapes and move them to/from the tape drives. If necessary, up to four SL8500 libraries can be integrated into a single archival solution, allowing for an offline storage capacity of 20 PB.

The current Ranch configuration has 2,000 tapes and is capable of housing 1 PB of uncompressed data. Future plans call for further population of the tape slots as well as upgrades of the physical media from 500 GB capacity to 1 TB capacity.
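
The capacity figures in this section follow from simple multiplication. Here is a small sketch of that arithmetic, assuming decimal units (1 PB = 1,000,000 GB); the function and constant names are illustrative and not part of any Ranch tooling.

```python
# Tape capacity arithmetic for the SL8500 figures quoted in the text.
# Assumption: decimal units (1 PB = 1,000,000 GB).

GB_PER_PB = 1_000_000
SLOTS_PER_LIBRARY = 10_000   # tape slots in one SL8500 library
MAX_LIBRARIES = 4            # libraries that can be integrated together
CURRENT_TAPES = 2_000        # tapes in the current Ranch configuration


def tape_capacity_pb(tapes: int, gb_per_tape: int) -> float:
    """Uncompressed capacity in PB for a given number of tapes."""
    return tapes * gb_per_tape / GB_PER_PB


current_pb = tape_capacity_pb(CURRENT_TAPES, 500)
full_library_pb = tape_capacity_pb(SLOTS_PER_LIBRARY, 500)
max_complex_pb = MAX_LIBRARIES * full_library_pb
current_with_1tb_media_pb = tape_capacity_pb(CURRENT_TAPES, 1_000)

print(current_pb)                 # 1.0
print(full_library_pb)            # 5.0
print(max_complex_pb)             # 20.0
print(current_with_1tb_media_pb)  # 2.0
```

The last value shows that moving the existing 2,000 tapes to 1 TB media would double the current capacity without populating any additional slots.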

Get more information about using Ranch in the Ranch User Guide. For information about how to request an allocation on this system, go to the HPC section of the Allocations page.


IBM Power5 System

System Name: Champion
Host Name: champion.tacc.utexas.edu
IP Address: 129.114.4.52
Operating System: AIX
Number of Processors: 96 Power5 processors
Total Memory: 192 GB
Peak Performance: 730 GFLOPS
Total Disk: 7.2 TB
Description:

The TACC IBM Power5 System consists of 12 IBM P5 575 shared memory server nodes. Each server node contains 8 Power5 processors running at 1.9 GHz. In total, the 96-processor system has a peak performance of 730 GFLOPS with an aggregate memory of 192 GB. Each node is also supported by 36 GB of local disk, for a total of 432 GB, and a faster 7.2 TB GPFS file system. All server nodes are connected by an IBM high-performance Federation switch. The Power5 systems run AIX, a scalable UNIX operating system with High Availability Cluster Multi-Processing (HACMP) capabilities.
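
As a rough check of the totals in the paragraph above, the following sketch recomputes them from the per-node figures. It assumes 4 floating-point operations per processor per clock cycle, which is not stated in the text but reproduces the quoted 730 GFLOPS after rounding; the names are illustrative.

```python
# Recompute the Champion totals from the per-node figures in the text.
# Assumption (not stated in the text): 4 floating-point operations per
# processor per clock cycle.

NODES = 12
PROCESSORS_PER_NODE = 8
CLOCK_GHZ = 1.9
FLOPS_PER_CYCLE = 4          # assumed, see note above
TOTAL_MEMORY_GB = 192
LOCAL_DISK_PER_NODE_GB = 36

processors = NODES * PROCESSORS_PER_NODE
peak_gflops = processors * CLOCK_GHZ * FLOPS_PER_CYCLE
memory_per_node_gb = TOTAL_MEMORY_GB / NODES
total_local_disk_gb = NODES * LOCAL_DISK_PER_NODE_GB

print(f"processors:           {processors}")
print(f"peak (GFLOPS):        {peak_gflops:.1f}")
print(f"memory per node (GB): {memory_per_node_gb:.0f}")
print(f"local disk (GB):      {total_local_disk_gb}")
```

With these inputs the peak comes to about 729.6 GFLOPS, each node has 16 GB of memory, and the local disk total matches the 432 GB quoted above.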

The new IBM Power5 processor offers industry-leading performance on floating point calculations, including almost double the performance of IBM's Power4 processor. The key physical technologies of the chip are Silicon-on-insulator (SOI), copper connections, 130 million transistors per die, an on-chip L2 cache, and Multi-Chip-Modules (MCM). The key architectural features are a high-speed clock, 64-bit architecture, 3-tier cache hierarchy, superscalar with speculative branching, out-of-order execution, pre-fetch streaming and Simultaneous Multi-threading or SMT technology. Evolving from the Power4/Power4+ architecture, the Power5 chip architecture now has faster processor speed and larger L2 and L3 cache. In addition, the L3 chips have now been moved closer to the chip on the module, an on-chip memory controller has been added, and the number of registers has been increased.

Get more information about using Champion in the IBM Power5 System User Guide. For information about how to request an allocation on this system, go to the HPC section of the Allocations page.


Operations

Daily Schedule and Operator Coverage

TACC resources are generally available 24 hours a day, seven days a week. Operator coverage is as follows.

Monday - Friday: 8am to 5pm (Central) staffed; all other hours not staffed
Saturday: Not staffed
Sunday: Not staffed

Preventive maintenance periods on TACC resources are scheduled as follows:

System     Maintenance Periods                                           Notes
archive    each Tuesday from 0800 to 1200 hours USA Central Time Zone    always scheduled but not always taken
lonestar   each Tuesday from 0900 to 1600 hours USA Central Time Zone
champion   each Tuesday from 0800 to 1200 hours USA Central Time Zone
mustang    each Tuesday from 0800 to 1200 hours USA Central Time Zone
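
If you want a script to avoid the preventive maintenance windows listed above, one possible approach is sketched below. This is an illustration, not a TACC-provided tool: it assumes the timestamp passed in is already expressed in US Central time, treats the end of each window as exclusive, and the function name is hypothetical. Remember that a scheduled window is not always taken, and that maintenance can also occur outside this schedule.

```python
from datetime import datetime, time

# Weekly preventive maintenance windows copied from the schedule above.
# datetime.weekday() numbers Monday as 0, so Tuesday is 1.
TUESDAY = 1
MAINTENANCE_WINDOWS = {
    "archive":  (TUESDAY, time(8, 0), time(12, 0)),
    "lonestar": (TUESDAY, time(9, 0), time(16, 0)),
    "champion": (TUESDAY, time(8, 0), time(12, 0)),
    "mustang":  (TUESDAY, time(8, 0), time(12, 0)),
}


def in_maintenance_window(system: str, when: datetime) -> bool:
    """Return True if `when` falls inside the scheduled window for `system`.

    `when` is assumed to be a naive datetime in US Central time.
    The window start is inclusive and the end is exclusive.
    """
    weekday, start, end = MAINTENANCE_WINDOWS[system]
    return when.weekday() == weekday and start <= when.time() < end


# Example: Tuesday, February 5, 2008 at 10:00 Central.
example = datetime(2008, 2, 5, 10, 0)
print(in_maintenance_window("lonestar", example))
print(in_maintenance_window("champion", example))
```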

During hours when operations does not staff the center, you may leave voice mail at 512-475-9498. Calls will be returned the next business day.

When software or hardware maintenance is required outside the above schedule, we will notify you of any scheduled interrupts via the message of the day, through the User News email list, and by posting alerts on the web at http://www.tacc.utexas.edu/services/usernews. To subscribe to the User News email list go to http://www.tacc.utexas.edu/services/usernews/#manage. Every attempt will be made to notify you of downtime at least 24 hours in advance so that you can plan your work schedule around the interruption.


Processing Modes

The normal production mode on our supercomputers is multi-user, that is, the supercomputers are available for batch and interactive processing in a way that equitably shares the computing resources among the users of the system. In multi-user mode on the SV1, we run a job mix scheduler (jobmixd) every three minutes that evaluates the current job mix and selects the best candidates for processing according to CPU time remaining, memory size, processing priority and service history.

At selected non-prime hours of the day when requested, we will suspend multi-user production and initiate a special production period that we call blocktime mode. This service allows you to run very large production jobs by dedicating the majority of the system's resources to one user or one project. During blocktime production, normal NQS processing is suspended and the blocktime queues are started. Batch requests in these queues are run serially, one request at a time, so that the full complement of processors, memory and scratch disk space is available to each job. Blocktime production continues so long as there are requests in the queue or until the scheduled production period ends.


Unscheduled Interrupts

When an unscheduled interrupt occurs, we will log the outage promptly and, if the affected system will be out of service for more than a few minutes, note it on the User News page. If you cannot log in to one of our systems, check the User News page or the current system status to see whether that system is down due to an unscheduled outage.


Troubleshooting and Problem Reporting

If you cannot access one of our systems, this may be due to any one of several reasons:

  • your workstation or departmental server is not running properly.

    No one likes to think that their PC or workstation is the cause of the problem but it might be. Before you report a problem accessing one of the TACC supercomputers, make sure that your system is working correctly.

    • Is it responsive to keyboard input and mouse movements?
    • Do the usual commands and utilities seem to be working properly?
    • Are you able to connect to other computers on your local network and/or other machines in your department?
    • Are other ITS servers reachable from your system?
    The problem might well be elsewhere, but it is wise to make sure things are in order on your desktop first.
  • the network between your workstation and the supercomputer may be out of service.

    If your system has traceroute on it, run this diagnostic utility and see if the route from your workstation to the supercomputer is working properly, e.g.,

    traceroute champion.tacc.utexas.edu
    If you don't have traceroute, try ping instead:
    ping -s champion.tacc.utexas.edu
    If you do not get a positive response from either of these utilities, it is likely that there is a networking outage between your system and ours.
  • the system you wish to use is down.

    Try using traceroute and/or ping to another TACC server to see if it is up. If so, then your problem is probably not a networking problem. If champion is responsive, log in there and see if the interruption has been announced in the message of the day. If no mention of the outage appears there, it may well be because the TACC operator is busy trying to get the down system back up. Report your issue using the web-based consulting form located at https://portal.tacc.utexas.edu/consulting. If you pinged that system, include the response, if any, in your report.

    Above all, please be patient with us when things are awry because the duty operator has much to do when a system goes down: make an initial judgment as to what is wrong so he or she can notify the relevant staff and/or vendor personnel, log the interruption, perhaps take a system dump for later analysis, and attempt to reboot the system to get it back into production. They cannot do these things if they are constantly being interrupted by telephone calls inquiring as to what is wrong. We understand your desire to know what is happening and we promise we will exert every reasonable effort to get the system back up as soon as we possibly can.
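
The three checks above can be combined into a simple decision procedure. The sketch below is only an illustration of that reasoning and makes several assumptions: it substitutes a TCP connection attempt for the traceroute and ping tests described in the text, it assumes the target accepts connections on port 22 (SSH), and the function names are hypothetical.

```python
import socket


def tcp_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This stands in for ping/traceroute; port 22 (SSH) is an assumption.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(local_network_ok: bool,
             target_reachable: bool,
             other_tacc_reachable: bool) -> str:
    """Map the three checks from the list above to a likely cause."""
    if not local_network_ok:
        return "check your own workstation and local network first"
    if target_reachable:
        return "the target system is reachable; no outage detected"
    if other_tacc_reachable:
        return "the target system is probably down; check the message of the day and user news"
    return "likely a network outage between your system and TACC"


# Example with fixed inputs: the local network works, the target does not
# respond, but another TACC server does.
print(diagnose(True, False, True))
```

In practice the three arguments could be filled in by calling tcp_reachable on a machine in your own department, on the system you want to use, and on another TACC server such as champion.tacc.utexas.edu.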