Corral User Guide

System Information

System Overview

Corral is a data applications resource; it primarily consists of 1.2 Petabytes of online disk in a DDN S2A 9900 storage system, along with 16 Dell server nodes which provide various access mechanisms to the data. Corral provides a high-performance Lustre filesystem with around 800TB of available storage, and it also provides storage using MySQL, Postgres, and Microsoft SQL Server.

Although we provide a login node for accessing your data stored on Corral, you may also store and access data on Corral from all other TACC resources, using either the file system interface or one of the database interfaces. You may also manage data on Corral through the specialized iRODS interface for managing data across multiple TACC systems.

Architecture

The architecture of Corral consists of the 1.2PB set of disk arrays, connected to a variety of Linux hosts which provide various mechanisms for storing and manipulating data. Servers which may be of interest, and their corresponding services, are shown in the following table:

Corral Services and Hosts

Host                            Service
db.corral.tacc.utexas.edu       Postgres Database, MySQL Database
corral-login.tacc.utexas.edu    Login/SSH/SCP
irods.corral.tacc.utexas.edu    iRODS Master
web.corral.tacc.utexas.edu      Web/iRODS WebDAV

All services on Corral run on the default ports.
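
As a rough illustration of what running on the default ports means for a client, the sketch below connects to the database host with the standard command-line clients. It assumes the usual upstream defaults (5432 for Postgres, 3306 for MySQL); the user name joeuser and database name myproject are hypothetical placeholders, not values defined by this guide.

```shell
#!/bin/bash
# Sketch: connect to the Corral database services on their default ports.
# DB_USER and DB_NAME are hypothetical; use the values from your allocation.
DB_HOST="db.corral.tacc.utexas.edu"
DB_USER="joeuser"
DB_NAME="myproject"

connect_postgres() {
    # 5432 is the standard Postgres port (assumed, not stated in this guide).
    psql -h "$DB_HOST" -p 5432 -U "$DB_USER" "$DB_NAME"
}

connect_mysql() {
    # 3306 is the standard MySQL port (assumed). A bare -p makes the
    # client prompt for a password instead of reading it from the command line.
    mysql -h "$DB_HOST" -P 3306 -u "$DB_USER" -p "$DB_NAME"
}
```

Because these are the client defaults, the port options can normally be left out; they are spelled out here only to make the assumption visible.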

User Environment

System Access

Corral is accessible through a variety of mechanisms, depending on the service you wish to use. The most common Corral service is the Lustre file system, which you access under the mount point /corral/< project path > on the login nodes of all major TACC systems. For users without access to other systems at TACC, access to a special Corral login node is available upon request. You may also store and retrieve data on Corral through the iRODS interface, through a MySQL or Postgres database, or through other custom services, depending on your application needs and your allocation arrangement.

When requesting an allocation on Corral, you will need to indicate what type of service you wish to use, and in most cases provide a name for the directory, database, or data collection you will be storing on Corral. This name will be used to identify a shared project directory or otherwise tag your service for access.

Login Information

The Corral login node is login.corral.tacc.utexas.edu (or corral-login.tacc.utexas.edu). This node may be accessed using any SSH client from your desktop or from other TACC systems. In addition to providing access to the /corral file system, this node also provides client software for the iRODS data management system, the Postgres database server, and the MySQL database server. It provides a basic C, C++, and Java development environment, and can also be used to run scripts to manipulate your data using most common scripting languages, including Perl, Python, and Ruby. Most Corral-specific applications are available in /opt/apps/, and will also be in your default login environment.

File Systems

Corral provides a single large file system which is accessible from multiple resources at TACC. This file system is mounted as /corral on the Corral server nodes, the Ranger login nodes, and the Lonestar login nodes, as well as all nodes of Champion. To access the /corral file system you must have an allocation. If you are a researcher at the University of Texas at Austin, you may apply for space on /corral through the TACC user portal. If your request is approved, a directory will be created for you with appropriate user and group ownership.

You may use the /corral file system just as you would any other file system, using standard Unix commands such as cp and mv to copy data to your /corral directory. For systems which have access to /corral from the compute nodes, you may also read and write directly to /corral from within your compute job. The path to data on /corral is identical on all resources. For example, if you create a directory called /corral/utexas/myproject on Ranger, you will also be able to access your data in /corral/utexas/myproject on Lonestar.
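
A minimal sketch of copying a directory into a Corral project directory follows. It reuses the example path /corral/utexas/myproject from the text; the function name copy_to_corral and the CORRAL_PROJECT variable are illustrative, not part of any TACC tooling.

```shell
#!/bin/bash
# Sketch: copy a local directory into a Corral project directory.
# CORRAL_PROJECT defaults to the example path used in this guide.
CORRAL_PROJECT="${CORRAL_PROJECT:-/corral/utexas/myproject}"

copy_to_corral() {
    local src="$1"
    # -r copies directories recursively; -p preserves modes and timestamps.
    cp -rp "$src" "$CORRAL_PROJECT/"
}
```

For example, copy_to_corral "$HOME/results" would place the copy at /corral/utexas/myproject/results, and the same path would then be visible from any other system that mounts /corral.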

Usage Policies

Use of any Corral file system or software service is by allocation or special arrangement only. Corral file system allocations are open to researchers at The University of Texas at Austin, and software services such as Postgres, MySQL, and iRODS are open to all TACC users. If you are a researcher at the University of Texas at Austin, you may use the TACC User Portal to request a Corral allocation. To access database or iRODS services, please contact TACC User Services to discuss your needs.

Data Transfers

File transfers to and from Corral can be performed using SCP/SFTP to the Corral login node, or using GridFTP to any of the Ranger and Lonestar GridFTP servers. In addition, the iRODS data management software can be used to transfer data to and from Corral and Ranch.
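
The SCP route can be sketched as follows, assuming a standard OpenSSH scp client. The function names are illustrative, and the user name and paths in the usage note are hypothetical.

```shell
#!/bin/bash
# Sketch: transfer files to and from Corral through the login node.
CORRAL_LOGIN="login.corral.tacc.utexas.edu"

push_to_corral() {
    local user="$1" src="$2" dest="$3"
    # -r allows src to be a directory as well as a single file.
    scp -r "$src" "${user}@${CORRAL_LOGIN}:${dest}"
}

pull_from_corral() {
    local user="$1" src="$2" dest="$3"
    scp -r "${user}@${CORRAL_LOGIN}:${src}" "$dest"
}
```

A call such as push_to_corral joeuser ./results /corral/utexas/myproject/ would copy a local results directory into the project directory.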

iRODS User Guide

Introduction to iRODS at TACC

iRODS is a data grid/data management tool. It allows you to store data in a unified namespace using multiple storage resources, to replicate data so that copies exist on multiple systems, and to store checksums and arbitrary metadata with a file. The TACC iRODS configuration supports accessing iRODS through either the native iRODS tools, such as the UNIX i-Commands, or through WebDAV. iRODS is configured to store data on a 20TB cache filesystem for relatively short-term storage, on the 750TB Corral Lustre file system, and on any of the 3 Ranch archive file systems. Each of these storage systems is referred to as a resource within iRODS. The resource names are documented in the following table:

Resource Name    Storage System
cache            20TB cache filesystem
ranch1           Ranch /home1 archive
ranch2           Ranch /home2 archive
ranch3           Ranch /home3 archive
corral           750TB Lustre file system

Use of the Corral resource may be subject to allocation constraints. Unless you have an allocation on Corral, please limit yourself to the Ranch archive file systems for long-term storage.

Setup for command-line usage

On Ranger and Lonestar:

The directory /opt/apps/irods contains an example configuration file, irodsEnv.

  1. mkdir -p ~/.irods && cp /opt/apps/irods/irodsEnv ~/.irods/.irodsEnv
  2. Edit ~/.irods/.irodsEnv to include your username in the indicated locations
  3. Add /opt/apps/irods/bin to your path so you can access the binaries, or use module load irods
  4. Run the iinit command to initialize the system. It will ask for a password, which you only have to enter once per session
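
The file-copy and path steps above can be sketched as a small shell function. The function name setup_irods_env and the IRODS_APPS variable are illustrative, and the choice not to overwrite an existing configuration file is an assumption of this sketch rather than documented behavior.

```shell
#!/bin/bash
# Sketch of the setup steps above. IRODS_APPS defaults to the
# installation path given in this guide.
IRODS_APPS="${IRODS_APPS:-/opt/apps/irods}"

setup_irods_env() {
    # The ~/.irods directory may not exist on a fresh account.
    mkdir -p "$HOME/.irods"
    # Assumption: keep a configuration file you have already edited.
    if [ ! -e "$HOME/.irods/.irodsEnv" ]; then
        cp "$IRODS_APPS/irodsEnv" "$HOME/.irods/.irodsEnv"
    fi
    # Make the i-commands available in the current shell.
    export PATH="$IRODS_APPS/bin:$PATH"
}
```

After calling setup_irods_env, you would still edit ~/.irods/.irodsEnv by hand to add your username, and then run iinit to authenticate.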

Basic Command-line Usage

Once you have configured the ~/.irods/.irodsEnv file, you can use the i-commands to access data in the system. The i-commands are generally the same as the standard Unix file management commands, but with an i prepended: ils, imkdir, icd, and so on. The -R switch is used to specify a target resource for commands that store data in the system. Other common switches include:

  • -v for verbose output
  • -r for recursive operation
  • -h for detailed information on usage of any given command

The ils -l command can be used to see all the copies of files in the system. If a file has been replicated from corral to ranch2, for example, the file will be listed twice. Each listing indicates the resource where the file or replica is stored, along with the replica number. Commands such as irm and iget use the replica number to target a specific copy of a file.

Storing data into and retrieving data from the iRODS system

The iput command is used to store data into the system:

iput -v < source_filename > /tacc/home/username/< target_subdirectory >

If you used the default .irodsEnv file, this will store data in the largest Ranch file system, /home3. To specify a resource target, use the -R switch. To see all the resources available, use the command ilsresc.

Including the -K switch will cause a checksum to be generated when the data is stored. This checksum can then be verified when retrieving the data.

The iget command is used to retrieve data from the system.

iget -v /tacc/home/username/< filename_to_retrieve >

As with the iput command, the -R switch can be used to select the resource from which to retrieve the data. Including the -K switch will trigger verification of the checksum the user may have generated when storing the data using the iput command.
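
A store-and-retrieve round trip with checksums might look like the sketch below. The function names are illustrative, and /tacc/home/joeuser stands in for your own iRODS home collection.

```shell
#!/bin/bash
# Sketch: store a file with a checksum, then retrieve and verify it.
# IRODS_HOME is a placeholder collection; substitute your own username.
IRODS_HOME="/tacc/home/joeuser"

store_with_checksum() {
    local src="$1" resource="$2"
    # -K generates a checksum at store time; -R selects the target resource.
    iput -v -K -R "$resource" "$src" "$IRODS_HOME/$(basename "$src")"
}

fetch_with_verify() {
    local name="$1" resource="$2" dest="${3:-.}"
    # -K verifies the stored checksum once the file has been retrieved.
    iget -v -K -R "$resource" "$IRODS_HOME/$name" "$dest"
}
```

For example, store_with_checksum ./data.tar cache followed later by fetch_with_verify data.tar cache would write to and read from the cache resource, checking the checksum on retrieval.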

Long-term storage of data from the cache

To replicate data stored in the cache file system into Ranch for long-term storage, use the irepl command as follows:

irepl -R ranch2 < path_and_filename >

The command above will replicate the cache file into the Ranch resource target ranch2.
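
As a sketch, replication can be paired with a listing so that the new copy is confirmed immediately. The function name archive_from_cache is illustrative, and its default of ranch2 simply mirrors the example above.

```shell
#!/bin/bash
# Sketch: replicate a cached file to a Ranch resource, then list replicas.
archive_from_cache() {
    local path="$1" resource="${2:-ranch2}"
    # Stop here if replication fails, so the listing is not misleading.
    irepl -R "$resource" "$path" || return 1
    # ils -l shows each replica with its resource and replica number.
    ils -l "$path"
}
```

Running archive_from_cache /tacc/home/joeuser/data.tar should leave the file listed on both the cache and ranch2 resources.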

Deleting data from one or all resources

Use the irm command to delete data entirely or from a single resource. irm without options deletes all copies of a file. irm with the -n # switch deletes a specific replica, where # is the replica number. For example, if you have stored data initially in the cache resource and then replicated it to Ranch, replica 0 will be the copy stored on the cache, and the command:

irm -n 0 < path_and_filename >

will remove the file from the iRODS cache resource, while leaving the archived copy intact. Use ils -l before deleting a replica to ensure that you have a copy in more than one resource, and that you are deleting the correct replica.
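
The check-before-delete advice can be sketched as a guard function. It assumes that ils -l on a single file prints one line per replica, which you should confirm against the output on your system; the function name remove_replica is illustrative.

```shell
#!/bin/bash
# Sketch: delete one replica only if at least one other copy exists.
remove_replica() {
    local path="$1" replica="$2"
    local count
    # Assumption: `ils -l` on a single file prints one line per replica.
    count=$(ils -l "$path" | grep -c .)
    if [ "$count" -lt 2 ]; then
        echo "Refusing to delete: only $count replica(s) of $path" >&2
        return 1
    fi
    irm -n "$replica" "$path"
}
```

This guard only counts copies; it does not check which resource holds replica 0, so it is still worth reading the ils -l output yourself before removing anything.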

Synchronizing a local directory with iRODS

The irsync command can be used to synchronize a local directory with iRODS, similar to the rsync Unix command. It can be used to make an exact copy of a directory hierarchy on a local disk within iRODS, or retrieve an exact copy of a directory hierarchy already stored in iRODS. It may also be used to create an exact copy of a file or directory within iRODS. iRODS paths are identified with an i: prefix in the irsync command. For example, if you have created a directory within iRODS called /tacc/home/joeuser/myproject, and you wish to retrieve an exact copy of that directory on Ranger, run the command:

irsync -r i:/tacc/home/joeuser/myproject /path/to/joeusers/work/directory

After editing the files on Ranger, you can then synchronize the data back into iRODS using the command:

irsync -r /path/to/joeusers/work/directory i:/tacc/home/joeuser/myproject

If you are storing or retrieving data to Ranch with the -R ranch2 option, you should also use the -s switch - this will use the size rather than the checksum of the file to determine whether synchronization is necessary, thereby avoiding the need to retrieve all the files from tape to compute checksums. This will greatly improve the performance of synchronization with Ranch.
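
A size-based synchronization with Ranch might be wrapped as in the sketch below. The function names are illustrative, and the paths in the usage note are the example paths from this section.

```shell
#!/bin/bash
# Sketch: synchronize with a Ranch resource by size rather than checksum.
sync_to_ranch() {
    local local_dir="$1" irods_path="$2" resource="${3:-ranch2}"
    # -s compares file sizes, so files need not be recalled from tape.
    irsync -r -s -R "$resource" "$local_dir" "i:$irods_path"
}

sync_from_ranch() {
    local irods_path="$1" local_dir="$2" resource="${3:-ranch2}"
    irsync -r -s -R "$resource" "i:$irods_path" "$local_dir"
}
```

For example, sync_to_ranch /path/to/joeusers/work/directory /tacc/home/joeuser/myproject pushes local changes into iRODS. Because only sizes are compared, an edit that leaves a file the same size would not be detected.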

Last updated August 18, 2011