Data Storage

Overview

All users with active accounts on any TACC HPC or SciVis system automatically have an account on the data archive. Currently, there is no quota for user accounts on the archive system.

If you are interested in a storage allocation on the data archive but do not need an HPC or SciVis account, please email info@tacc.utexas.edu.


SGI Origin 2000 Terascale Data Archive Server
archive.tacc.utexas.edu

To provide long-term, reliable data storage, TACC operates a four-processor SGI Origin 2000 with four gigabytes of fast dynamic RAM and 1.3 terabytes of high-performance, high-availability Fibre Channel RAID-3 disk. This system is configured for dedicated file service using SGI's Data Migration Facility (DMF) for hierarchical storage management: the disk farm on the Origin 2000 acts as a cache for recently accessed files, while permanent copies of the files are kept in two StorageTek PowderHorn 9310 automated cartridge systems. The data archive is exported to each supercomputer at the Center via the Network File System (NFS), and high-speed network access to files in the /archive file system is provided by a High Performance Parallel Interface (HiPPI) running at 800 megabits per second. Because the Origin 2000 is a dedicated server, direct user access is limited to the commands needed to manage each user's /archive area.
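To illustrate the cache-over-tape idea described above, here is a minimal Python sketch of a two-tier store in which every file has a permanent tape copy and a small disk cache holds only recently accessed files. It assumes simple least-recently-used (LRU) eviction; DMF's actual space-management policies are configurable and are not described here, and the names `ToyHSM`, `write`, and `read` are hypothetical.

```python
from collections import OrderedDict


class ToyHSM:
    """Toy hierarchical store: an LRU disk cache in front of a tape tier.

    Hypothetical illustration only; this is not how DMF is implemented.
    """

    def __init__(self, cache_capacity_bytes):
        self.capacity = cache_capacity_bytes
        self.tape = {}               # name -> bytes: permanent copies
        self.cache = OrderedDict()   # name -> bytes: recently used, LRU order
        self.cached_bytes = 0
        self.recalls = 0             # reads that had to go back to tape

    def write(self, name, data):
        self.tape[name] = data       # every file gets a permanent tape copy
        self._cache_put(name, data)  # newly written data is also on disk

    def read(self, name):
        if name in self.cache:
            self.cache.move_to_end(name)  # hit: mark as most recently used
            return self.cache[name]
        self.recalls += 1                 # miss: recall the file from tape
        data = self.tape[name]
        self._cache_put(name, data)
        return data

    def _cache_put(self, name, data):
        if name in self.cache:
            self.cached_bytes -= len(self.cache.pop(name))
        self.cache[name] = data
        self.cached_bytes += len(data)
        # Evict least recently used files until the cache fits; their tape
        # copies remain, so nothing is lost. Always keep the newest file.
        while self.cached_bytes > self.capacity and len(self.cache) > 1:
            _, evicted = self.cache.popitem(last=False)
            self.cached_bytes -= len(evicted)
```

For example, with a 10-byte cache, writing three 5-byte files pushes the first one off disk, so a later read of it counts as a tape recall. In a real archive, such a recall involves mounting a cartridge, which is why the first access to a file that has not been used recently can be much slower than later accesses.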

For more information about using the archive server, see TACC Mass Storage System.


Storage Technology Corporation 9310 Automated Cartridge System

The StorageTek PowderHorn 9310 silo is a fully automated cartridge system capable of holding 6,000 tape cartridges and performing 450 tape exchanges per hour. TACC operates two silos: small files are managed in one, with five STK 9840 tape drives and 2,000 9840 cartridges, and large files in the other, with six STK 9940B tape drives and 5,300 9940 cartridges. With the media on hand, TACC can provide an off-line storage capacity of over 1.5 petabytes. SGI's Data Migration Facility (DMF), running on the Origin 2000 archive server, manages data movement between the 1.3-terabyte online disk cache and the 9310 silos. Access to the archival store is through the /archive file system, which is exported to each TACC system via the Network File System. Together, the multiple STK automated cartridge systems and two high-capacity, high-performance STK tape drive technologies give the TACC scientific user community fast, reliable access to off-line data sets.
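The small-file/large-file split between the two silos can be pictured as a simple size-based routing rule. The following Python sketch is only an illustration: the page does not say where TACC draws the line between "small" and "large", so the 256 MiB threshold and the names `SMALL_FILE_LIMIT_BYTES`, `SILOS`, and `choose_silo` are hypothetical; the drive and cartridge counts come from the description above.

```python
# Hypothetical cut-off between "small" and "large" files; the real
# threshold TACC uses is not stated in this document.
SMALL_FILE_LIMIT_BYTES = 256 * 1024 * 1024  # 256 MiB, an assumption

# Per-silo configuration taken from the description above.
SILOS = {
    "small-file silo": {"drive_model": "STK 9840", "drives": 5, "cartridges": 2000},
    "large-file silo": {"drive_model": "STK 9940B", "drives": 6, "cartridges": 5300},
}


def choose_silo(size_bytes):
    """Return the name of the silo that would hold a file of this size."""
    if size_bytes < 0:
        raise ValueError("file size cannot be negative")
    if size_bytes < SMALL_FILE_LIMIT_BYTES:
        return "small-file silo"
    return "large-file silo"
```

With this rule, `choose_silo(4096)` returns `"small-file silo"` and `choose_silo(10 * 1024**3)` returns `"large-file silo"`. As a rough figure for the robotics, 450 exchanges per hour means each silo averages one cartridge exchange every 3600 / 450 = 8 seconds.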

Get more information about using the data archive in the TACC Mass Storage System.


Storage Area Network

The TACC Storage Area Network (SAN) is intended to provide the user community with a high-speed, shared storage facility that is available to all TACC computational and visualization resources. Currently the SAN is managed by a Sun V880 server configured with eight UltraSPARC III processors, 16 GB of memory, and the Sun UFS file system. Approximately 5 TB of Sun T3 storage is shared among the TACC computational resources and is available to each resource at Fibre Channel speed. This true file sharing is accomplished through software running on the V880 in conjunction with Tivoli SANergy client software on the computational resources, Fibre Channel interfaces in the V880 and the client systems, and a Fibre Channel network fabric based on a 64-port QLogic switch.

Get more information about using the SAN in the appropriate User Guide.


Operations

Daily Schedule and Operator Coverage

TACC resources are generally available 24 hours a day, seven days a week. Operator coverage is as follows:

Monday - Friday, 8am to 5pm (Central): staffed
Monday - Friday, all other hours: not staffed
Saturday: not staffed
Sunday: not staffed

Preventive maintenance periods on TACC resources are scheduled as follows:

System      Maintenance period (USA Central Time)   Notes
archive     Tuesdays, 0800 to 1200                  Always scheduled, but not always taken
lonestar    Tuesdays, 0900 to 1600
champion    Tuesdays, 0800 to 1200
mustang     Tuesdays, 0800 to 1200
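If you script job submissions or transfers, the weekly windows above can be checked before starting long work. This Python sketch encodes the table as data; the function name `in_maintenance_window` is hypothetical, and it assumes the caller passes a naive `datetime` already expressed in US Central time (a timezone-aware value could first be converted with `when.astimezone(zoneinfo.ZoneInfo("America/Chicago"))` on Python 3.9+). Because the archive window is "not always taken", a `True` result means maintenance is scheduled, not that the system is necessarily down.

```python
from datetime import datetime, time

# Weekly preventive-maintenance windows from the table above, in USA
# Central time. weekday() numbers Monday as 0, so Tuesday is 1.
TUESDAY = 1
MAINTENANCE_WINDOWS = {
    "archive": (TUESDAY, time(8, 0), time(12, 0)),
    "lonestar": (TUESDAY, time(9, 0), time(16, 0)),
    "champion": (TUESDAY, time(8, 0), time(12, 0)),
    "mustang": (TUESDAY, time(8, 0), time(12, 0)),
}


def in_maintenance_window(system, when_central):
    """Return True if when_central (a naive datetime in Central time)
    falls inside the system's scheduled weekly maintenance window."""
    weekday, start, end = MAINTENANCE_WINDOWS[system]
    local_time = when_central.time()
    # Windows are half-open: 0800 counts as inside, 1200 as outside.
    return when_central.weekday() == weekday and start <= local_time < end
```

For example, 2 January 2024 was a Tuesday, so `in_maintenance_window("archive", datetime(2024, 1, 2, 10, 0))` is `True`, while the same time on the following day is `False`.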

When the center is not staffed, you may leave voice mail at 512-475-9498. Calls will be returned the next business day.

When software or hardware maintenance is required outside the above schedule, we will notify you of any scheduled interruptions through the message of the day, the User News email list, and alerts posted on the web at http://www.tacc.utexas.edu/services/usernews. To subscribe to the User News email list, go to http://www.tacc.utexas.edu/services/usernews/#manage. Every attempt will be made to notify you of downtime at least 24 hours in advance so that you can plan your work around the interruption.


Processing Modes

The normal production mode on our supercomputers is multi-user: each system runs batch and interactive work concurrently and shares the computing resources equitably among its users. In multi-user mode on the SV1, a job mix scheduler (jobmixd) runs every three minutes, evaluates the current job mix, and selects the best candidates for processing according to CPU time remaining, memory size, processing priority, and service history.
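To make the idea of ranking jobs on several factors concrete, here is a minimal Python sketch of a weighted scoring rule. It is not jobmixd's algorithm: the weights, the direction each factor pushes the score (favoring short, small, high-priority jobs that have had little recent service), and the names `Job`, `score`, and `select_candidates` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cpu_hours_remaining: float  # estimated CPU time still needed
    memory_gb: float            # memory footprint
    priority: int               # larger means more important
    recent_cpu_hours: float     # service history: CPU time received recently


# Hypothetical weights; the actual jobmixd scoring is not documented here.
WEIGHTS = {"priority": 2.0, "cpu": 1.0, "memory": 0.5, "service": 1.0}


def score(job):
    """Higher is better: favor high-priority, short, small jobs that have
    received little service recently."""
    return (WEIGHTS["priority"] * job.priority
            - WEIGHTS["cpu"] * job.cpu_hours_remaining
            - WEIGHTS["memory"] * job.memory_gb
            - WEIGHTS["service"] * job.recent_cpu_hours)


def select_candidates(jobs, slots):
    """Pick up to `slots` jobs to run during the next scheduling interval."""
    return sorted(jobs, key=score, reverse=True)[:slots]
```

A scheduler following this pattern would call `select_candidates` once per interval (every three minutes, in the description above) with the current job mix; penalizing recent service is one simple way to keep a single user from monopolizing the machine.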

When requested, at selected non-prime hours of the day, we suspend multi-user production and start a special production period called blocktime mode. This service lets you run very large production jobs by dedicating the majority of the system's resources to one user or one project. During blocktime production, normal NQS processing is suspended and the blocktime queues are started. Batch requests in these queues run serially, one request at a time, so that the full complement of processors, memory, and scratch disk space is available to each job. Blocktime production continues until the queue is empty or the scheduled production period ends, whichever comes first.
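The serial behavior of a blocktime queue can be simulated on a simple clock. In this Python sketch (the name `simulate_blocktime` and the hour-based clock are hypothetical), requests run strictly one after another, and no new request is started once the queue is empty or the period has ended. The page does not say whether a request that is still running when the period ends is allowed to finish; this sketch only stops launching new requests.

```python
from collections import deque


def simulate_blocktime(requests, period_start, period_end):
    """Simulate a blocktime queue on a clock measured in hours.

    requests: list of (name, run_hours) in queue order.
    Requests run serially, each with the whole machine to itself. A new
    request is started only while the blocktime period is still open.
    Returns a list of (name, start_hour, finish_hour) tuples.
    """
    queue = deque(requests)
    clock = period_start
    schedule = []
    while queue and clock < period_end:  # stop when empty or period over
        name, run_hours = queue.popleft()
        schedule.append((name, clock, clock + run_hours))
        clock += run_hours  # one request at a time: the next one waits
    return schedule
```

For example, with a four-hour period and requests of 2, 3, and 1 hours, the first two requests start (the second at hour 2) and the third stays queued because the period has closed by the time the second finishes.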


Unscheduled Interrupts

When an unscheduled interruption occurs, we will log the outage promptly and, if the affected system will be out of service for more than a few minutes, note this on the User News page. If you cannot log in to one of our systems, check the User News page to see whether that system is down due to an unscheduled outage. You can also check the current system status to see whether that system is down.


Troubleshooting and Problem Reporting

If you cannot access one of our systems, this may be due to any one of several reasons:

  • your workstation or departmental server is not running properly.

    No one likes to think that their PC or workstation is the cause of the problem, but it might be. Before you report a problem accessing one of the TACC supercomputers, make sure that your own system is working correctly.

    • Is it responsive to keyboard input and mouse movements?
    • Do the usual commands and utilities seem to be working properly?
    • Are you able to connect to other computers on your local network and/or other machines in your department?
    • Are other ITS servers reachable from your system?
    The problem might well be elsewhere, but it is wise to make sure things are in order on your desktop first.
  • the network between your workstation and the supercomputer may be out of service.

    If your system has traceroute on it, run this diagnostic utility and see if the route from your workstation to the supercomputer is working properly, e.g.,

    traceroute champion.tacc.utexas.edu
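    If you don't have traceroute, try ping instead. The -s option shown here is Solaris syntax, where it sends one packet per second and prints statistics; on Linux and macOS, -s sets the packet size instead, so use ping -c 4 champion.tacc.utexas.edu there:
    ping -s champion.tacc.utexas.edu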
    If you do not get a positive response from either of these utilities, it is likely that there is a networking outage between your system and ours.
  • the system you wish to use is down.

    Try using traceroute and/or ping against another TACC server to see if it is up. If it is, your problem is probably not a networking problem. If champion is responsive, log in there and check whether the interruption has been announced in the message of the day. If the outage is not mentioned there, it may well be because the TACC operator is busy trying to get the down system back up. Report your issue using the web-based consulting form at https://portal.tacc.utexas.edu/consulting. If you pinged the down system, include the output, if any, in your report.

    Above all, please be patient with us when things go wrong, because the duty operator has much to do when a system goes down: make an initial judgment about what is wrong in order to notify the relevant staff and/or vendor personnel, log the interruption, perhaps take a system dump for later analysis, and attempt to reboot the system to get it back into production. The operator cannot do these things while constantly being interrupted by telephone calls asking what is wrong. We understand your desire to know what is happening, and we promise to make every reasonable effort to get the system back up as soon as possible.
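Firewalls sometimes block ping and traceroute probes even when a system is up, so a complementary check is to try opening a TCP connection to a service the system offers. The Python sketch below is one way to do that; the function name `tcp_reachable`, the choice of SSH on port 22, and the 5-second timeout are assumptions for illustration, not TACC guidance.

```python
import socket


def tcp_reachable(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 22 (SSH) is an assumed default; pass the port of whatever service
    you normally use on the system. Name-resolution failures, refused
    connections, and timeouts are all reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers socket.gaierror and socket.timeout as well
        return False
```

For example, `tcp_reachable("champion.tacc.utexas.edu")` returning `False` while other hosts answer suggests a problem with that one system rather than with your own network; include the result along with any ping or traceroute output when you report the problem.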