Corral: Celebrating 100 Diverse Data Collections

TACC's high-performance storage system serves scientists and their "Big Data" needs


Published on October 3, 2013 by Aaron Dubrow

The world is full of fascinating scientific data, from wind maps to water tables to brain databases. All of that information must be stored somewhere and cared for if it is to stand the test of time, just as documents in physical archives are.

Dropbox, Amazon or a university server might be sufficient to house small to medium-sized digital collections, or for short-term storage, but to maintain massive (hundreds of gigabytes to petabyte-sized) datasets for years or decades, special-purpose technologies and expertise are required.

Corral, the large-scale data repository at the Texas Advanced Computing Center (TACC), came online in 2009 to support the storing and sharing of research, data and results. The system just achieved a milestone: Corral now hosts 100 unique scientific research collections, ranging from measurements of Earth's gravity field to whale songs to mass spectrometry data. And its usage is growing: the total has risen by 10% per month over the last six months, and Corral recently crossed the one-petabyte mark in total data stored.

"We've seen ever-increasing growth in the number and diversity of collections on Corral over the past several years," said Chris Jordan, manager of the data management and collections group at TACC. "This shows how important a resource dedicated to data collections is to modern research practices, both for the researchers who are creating data and the worldwide community of researchers who use public data collections to further their own research."

Corral complements the existing suite of TACC resources for data-intensive computing, including the Ranch tape archive (more than 100 petabytes), Stampede, the newest petascale supercomputer (more than 15 petabytes of dedicated storage), and a scalable global file system (20 petabytes) released to users in fall 2013.

Whereas Stampede and the shared file systems will be used for short-term data retention related to ongoing simulations and analyses, and Ranch is the long-term repository for archived work, Corral is where large collections that are actively serving the community reside.

At six petabytes (six million gigabytes) and growing, Corral stores collections that can't live anywhere else because of their scale and complexity. Corral also makes it easy to share data, control access to information, and analyze large datasets. Connected to TACC's other advanced computing systems via a high-speed network, Corral is a critical part of the end-to-end research workflow for scientists.

Corral's data-crunching capabilities depend not only on powerful hardware, but also on the people behind the machine. These include the TACC team who designed and optimized the system for high-performance, data-driven science, as well as those who work closely with individual researchers to help them make the most of their data using this powerful machine. TACC has expertise in data management, metadata creation, data mining, data visualization and GIS (geographic information system) technologies, and in many cases, TACC works with scientists to design custom solutions, workflows or interfaces to maximize the impact of each dataset.

Increasingly, scientific breakthroughs will be powered by advanced computing capabilities like Corral that help researchers explore and manipulate massive datasets.