Overview
All users with active accounts on any TACC HPC or SciVis system automatically have an account on
the data archive. Currently, there is no quota for user accounts on the archive system.
If you are interested in a storage allocation on the data archive but do not need an HPC or SciVis
account, please email info@tacc.utexas.edu.
SGI Origin 2000 Terascale Data Archive Server
archive.tacc.utexas.edu
To provide long-term, reliable data storage, TACC operates a four-processor SGI Origin 2000 with
four gigabytes of fast, dynamic RAM and 1.3 terabytes of high-performance, high-availability Fibre
Channel RAID-3 disk. The system is configured for dedicated file service and uses SGI's Data
Migration Facility (DMF) for hierarchical storage management. The disk farm on the Origin 2000
acts as a cache for recently accessed files, which are permanently stored in two StorageTek
PowderHorn 9310 automated cartridge systems. The data archive is exported to each supercomputer at
the Center via the Network File System (NFS). High-speed network access to the files stored in the
/archive file system is provided via a High Performance Parallel Interface (HiPPI) link running at
800 megabits per second. Because the Origin 2000 is a dedicated server, direct user access is
limited to the commands needed to manage your own /archive area.
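To give a feel for what the 800 megabits per second HiPPI rate means in practice, the following minimal sketch computes a lower bound on transfer time for a file of a given size. The function name and the example file size are illustrative assumptions. The result ignores protocol overhead, disk speed, and any delay while DMF recalls a file from tape, so real transfers will take longer.

```python
def min_transfer_seconds(size_bytes: int, link_megabits_per_s: float = 800.0) -> float:
    """Lower bound on transfer time at the link's nominal rate.

    Assumes 1 megabit = 1,000,000 bits and no protocol or storage overhead.
    """
    bits = size_bytes * 8
    return bits / (link_megabits_per_s * 1_000_000)


# A 1 GB (10**9 byte) file needs at least 10 seconds at 800 megabits per second.
print(min_transfer_seconds(10**9))  # 10.0
```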
Storage Technology Corporation 9310 Automated Cartridge System
The StorageTek PowderHorn 9310 silo is a fully automated cartridge system capable of holding 6000
tape cartridges and performing 450 tape exchanges per hour. TACC operates two silos: one manages
small files with five STK 9840 tape drives and 2000 9840 cartridges, and the other manages large
files with six STK 9940B tape drives and 5300 9940 cartridges. With the media on hand, TACC can
provide an off-line storage capacity of over 1.5 petabytes. SGI's Data Migration Facility (DMF),
running on the Origin 2000 archive server, manages data movement between the 1.3 terabyte online
disk cache and the 9310 silos. Access to the archival store is through the /archive file system,
which is exported to each TACC system via the Network File System. This combination of multiple
STK automated cartridge systems and two high-capacity, high-performance STK tape drive
technologies provides the TACC scientific user community with fast, reliable access to off-line
data sets.
Storage Area Network
The TACC Storage Area Network (SAN) is intended to provide the user community with a high-speed, shared storage facility that is available to all TACC computational and visualization resources. Currently, the SAN is managed by a Sun V880 server configured with 8 UltraSPARC III processors, 16 GB of memory, and the Sun UFS file system. Approximately 5 TB of Sun T3 storage is shared among the TACC computational resources and available to each resource at Fibre Channel speed. This true file sharing is accomplished via software running on the V880 in conjunction with Tivoli SANergy client software running on the computational resources, Fibre Channel interfaces in the V880 and client systems, and a Fibre Channel network fabric based on a 64-port QLogic switch.
Get more information about using the SAN in the appropriate
User Guide.
Operations
Daily Schedule and Operator Coverage
TACC resources are generally available 24 hours a day, seven days
a week. Operator coverage is as follows.
| Monday - Friday: | 8am to 5pm (Central): staffed; other hours: not staffed |
| Saturday: | Not staffed |
| Sunday: | Not staffed |
Preventive maintenance periods on TACC resources are scheduled as follows:
| System | Maintenance Periods | Notes |
| archive | each Tuesday from 0800 to 1200 hours USA Central Time Zone | always scheduled but not always taken |
| lonestar | each Tuesday from 0900 to 1600 hours USA Central Time Zone | |
| champion | each Tuesday from 0800 to 1200 hours USA Central Time Zone | |
| mustang | each Tuesday from 0800 to 1200 hours USA Central Time Zone | |
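If you submit work from scripts, it can help to check the schedules above before starting a long transfer or job. The sketch below encodes the operator coverage and maintenance tables as data. The function names are illustrative, and the code assumes that the datetime passed in is already expressed in US Central time and that each window includes its start time but not its end time. Because the archive window is "always scheduled but not always taken", a True result means only that maintenance may be in progress.

```python
from datetime import datetime, time

# Tuesday maintenance windows from the table above (Central time, 24-hour clock).
MAINTENANCE = {
    "archive": (time(8, 0), time(12, 0)),
    "lonestar": (time(9, 0), time(16, 0)),
    "champion": (time(8, 0), time(12, 0)),
    "mustang": (time(8, 0), time(12, 0)),
}
TUESDAY = 1  # datetime.weekday() counts Monday as 0


def in_maintenance_window(system: str, when: datetime) -> bool:
    """True if `when` (Central time) falls in the system's scheduled window."""
    start, end = MAINTENANCE[system]
    return when.weekday() == TUESDAY and start <= when.time() < end


def is_staffed(when: datetime) -> bool:
    """True if operators are on duty: Monday to Friday, 8am to 5pm Central."""
    return when.weekday() < 5 and time(8, 0) <= when.time() < time(17, 0)


# 1 February 2005 was a Tuesday.
print(in_maintenance_window("lonestar", datetime(2005, 2, 1, 8, 30)))  # False
print(in_maintenance_window("archive", datetime(2005, 2, 1, 8, 30)))   # True
```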
During hours when Operations does not staff the center, you may leave voice mail at 512-475-9498. Calls will be returned the next business day.
When software or hardware maintenance is required outside the above schedule, we will notify you of
any scheduled interrupts via the message of the day, through the User News
email list, and by posting alerts on the web at
http://www.tacc.utexas.edu/services/usernews. To subscribe to
the User News email list, go to
http://www.tacc.utexas.edu/services/usernews/#manage.
Every attempt will be made to notify you of downtime at least 24 hours in advance so that you can
plan your work schedule around the interruption.
Processing Modes
The normal production mode on our supercomputers is multi-user: the supercomputers are available
for batch and interactive processing in a way that shares the computing resources equitably among
the users of the system. In multi-user mode on the SV1, we run a job mix scheduler (jobmixd) every
three minutes. It evaluates the current job mix and selects the best candidates for processing
according to CPU time remaining, memory size, processing priority, and service history.
At selected non-prime hours of the day when requested, we will suspend multi-user production and
initiate a special production period that we call blocktime mode. This service allows you
to run very large production jobs by dedicating the majority of the system's resources to one user
or one project. During blocktime production, normal NQS processing is suspended and the blocktime
queues are started. Batch requests in these queues are run serially, one request at a time, so that
the full complement of processors, memory, and scratch disk space is available to each job.
Blocktime production continues so long as there are requests in the queue or until the scheduled
production period ends.
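The serial behavior of blocktime production can be pictured with a short simulation. This is only a sketch under stated assumptions, not a description of how NQS is actually configured: the function name, the job names, and the runtimes are hypothetical, and the code assumes that a request started before the period ends is allowed to run to completion, which the text above does not specify.

```python
from collections import deque


def run_blocktime(requests, period_minutes):
    """Simulate serial blocktime processing.

    requests: list of (name, runtime_minutes) tuples in queue order.
    Returns (completed, still_queued) as lists of request names.
    Assumption: a request that starts before the period ends runs to completion.
    """
    queue = deque(requests)
    clock = 0  # minutes since the blocktime period began
    completed = []
    # Continue while requests remain and the scheduled period has not ended.
    while queue and clock < period_minutes:
        name, runtime = queue.popleft()
        clock += runtime  # one request at a time, with the whole machine
        completed.append(name)
    return completed, [name for name, _ in queue]


done, waiting = run_blocktime([("a", 120), ("b", 200), ("c", 60)], period_minutes=240)
print(done, waiting)  # ['a', 'b'] ['c']
```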
Unscheduled Interrupts
When an unscheduled interrupt occurs, we will log the outage in a timely fashion and, if the
affected system will be out of service for more than a few minutes, we will note this fact on the
User News page. If you cannot log in to one of our systems, check the User News page to see
whether that system is down due to an unscheduled outage. You might also check the current system
status.
Troubleshooting and Problem Reporting
If you cannot access one of our systems, this may be due to any one of several reasons:
- your workstation or departmental server is not running properly.
No one likes to think that their PC or workstation is the cause of the problem but it might
be. Before you report a problem accessing one of the TACC supercomputers, make sure that your
system is working correctly.
- Is it responsive to keyboard input and mouse movements?
- Do the usual commands and utilities seem to be working properly?
- Are you able to connect to other computers on your local network and/or other machines
in your department?
- Are other ITS servers reachable from your system?
The problem might well be elsewhere, but it is wise to make sure things are in order on your
desktop first.
- the network between your workstation and the supercomputer may be out of service.
If your system has traceroute on it, run this diagnostic utility and see if the
route from your workstation to the supercomputer is working properly, e.g.,
traceroute champion.tacc.utexas.edu
If you don't have traceroute, try ping instead:
ping -s champion.tacc.utexas.edu
If you do not get a positive response from either of these utilities, it is likely that there
is a networking outage between your system and ours.
- the system you wish to use is down.
Try using traceroute and/or ping to another TACC server to see if it
is up. If so, then your problem is probably not a networking problem. If
champion is responsive, log in there and see if the interruption has
been announced in the message of the day. If no mention of the outage appears there, it may
well be because the TACC operator is busy, trying to get the down system back up. Report your
issues using the web-based consulting form located at
https://portal.tacc.utexas.edu/consulting. If
you pinged that system, include the response, if any, in your report.
Above all, please be patient with us when things are awry, because the duty operator has much
to do when a system goes down: make an initial judgment as to what is wrong in order to notify
the relevant staff and/or vendor personnel, log the interruption, perhaps take a system dump for
later analysis, and attempt to reboot the system to get it back into production. The operator
cannot do these things while being constantly interrupted by telephone calls inquiring as to
what is wrong. We understand your desire to know what is happening, and we promise we will exert
every reasonable effort to get the system back up as soon as we possibly can.
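The checklist above can be summarized as a small decision procedure. The sketch below is illustrative only: the function names are hypothetical, and `ping_ok` assumes a ping that accepts `-c 1` to send a single packet (common on Linux and macOS; other systems use different options). The `diagnose` function takes three observations and returns the most likely explanation, in the same order as the list.

```python
import subprocess


def ping_ok(host: str) -> bool:
    """Return True if a single ping to `host` succeeds.

    Assumes a ping command that accepts `-c 1`; adjust for your platform.
    """
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def diagnose(local_ok: bool, target_ok: bool, other_tacc_ok: bool) -> str:
    """Map reachability observations to the likely cause.

    local_ok: a machine on your own network responds.
    target_ok: the TACC system you want to use responds.
    other_tacc_ok: some other TACC server responds.
    """
    if not local_ok:
        return "check your workstation and local network first"
    if target_ok:
        return "target system is reachable"
    if other_tacc_ok:
        return "target system is probably down; check the message of the day and user news"
    return "likely a network outage between your system and TACC"
```

For example, `diagnose(True, False, True)` points to the target system being down rather than to a network problem. To gather the inputs, you could call `ping_ok` on a machine in your department, on the system you want to use, and on another TACC server such as champion.tacc.utexas.edu.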