Attention Researchers: Protect Your Data!
TACC, UT Libraries and ITS partner to offer data management resources to researchers
Increasingly, government funding agencies, private foundations, and universities are asking faculty researchers to pay extra attention to their data.
"It costs a lot and requires a great deal of expertise to generate research data," said Maria Esteva, a digital archivist at the Texas Advanced Computing Center (TACC). "It must be preserved in a useable state to get the most out of the investment."
Easier said than done. Our ability to create data through observations, experiments and computer simulations has outstripped our ability to preserve and organize it properly. Decoding a single genome can result in terabytes of data, enough to fill thousands of DVDs, and the data is often inaccessible to other researchers because of the difficulty of transferring or organizing it. This leads to duplication of effort, lost knowledge, and wasted resources.
Beginning in January 2011, proposals submitted to the National Science Foundation (NSF) have had to include a supplementary Data Management Plan describing how the proposal will conform to the agency's policy on the dissemination and sharing of research data. Similar requirements are in place for proposals to the National Institutes of Health, the National Endowment for the Humanities, and even private funders like the Mellon Foundation.
In the case of NSF, the agency provided guidelines to follow and questions for researchers to answer. For instance: What type of data will be generated (scientific data sets, software, published articles)? How can one validate and replicate the data gathered? How is the data described? Where will it be archived for the long term?
Data management refers to the systematic organization of data throughout the research lifecycle. A data management plan helps researchers properly manage data for their own use, to meet funding requirements, and to share information with others. The plan describes the structure and nature of the data as well as the activities used to gather, merge, transfer, organize, document, analyze and preserve it.
To address the growing need for data management plans, the Libraries at The University of Texas at Austin have begun providing new services to researchers. Over the last year, they partnered with TACC and Information Technology Services (ITS) at the University to develop an ecosystem through which faculty can manage data related to diverse research projects, with each organization offering distinct resources and capabilities.
"The NSF's policy opened up a new opportunity for collaboration," said Amy Rushing, head librarian of digital access services at The University of Texas Libraries. "We were able to draw on and leverage the resources and the expertise that we have on campus."
In December 2011, the partners launched a website to answer many of the basic questions about data management and to alert researchers in all fields to the resources at their disposal. Among the benefits are long-term archiving of publications and papers; supercomputer-powered servers for hosting petascale data collections; and consulting services from knowledgeable digital archivists.
As part of the coordinated strategy, UT Libraries serves as the long-term home for datasets up to one gigabyte, including publications and papers. TACC will house larger (terabyte and petabyte-sized) datasets and collections that require complex architectures and functionalities, such as geographic information system (GIS) data and relational database management system services. ITS will provide hardware, co-location, network access, web services and technical assistance.
"Each organization in the partnership has an important role," Rushing said.
(For many researchers, domain-specific repositories like the Protein Data Bank or the Inter-university Consortium for Political and Social Research (ICPSR) will still be the preferred location for data archiving.)
Organizing and storing information can be a challenge for anyone, but the data problems that UT researchers face are often daunting. Imagine trying to verify the locations of tens of thousands of records dating back almost a century.
Fishes of Texas project staff (clockwise from upper left): Doug Martin, Dean Hendrickson, Melissa Casarez, Ben Labay, Adam Cohen and Jessica Rains.
This is the challenge faced by Dean Hendrickson, Curator of Ichthyology, and his assistant, Adam Cohen, a research associate, both at the Texas Natural History Collections. They created and continue to expand and maintain the Fishes of Texas Project database. Over the course of several years, they were able to incorporate records from museums around the world into a massive database, and to georeference (provide locations for) the fish taken at various collecting events over the years, creating a map of species throughout the state. Researchers use such maps to explore how fish populations have changed over time, and already the work has expanded the known historic ranges of many species. However, they and their colleagues needed to update the database frequently and found it challenging to do so.
"We're finding reasons to edit coordinates and, in order to get the GIS coordinates integrated into the website, we have to go outside of the current database system, extract the data and edit the georeferences, and put it back in," Cohen said. "We want to be able to do that on the fly."
With many more records waiting to be entered into the database, the Fishes of Texas project team hopes to upgrade its data management system and enable a more intuitive, user-friendly experience for those accessing the data.
They turned to TACC for help. TACC already hosts the Fishes of Texas collection (as well as several other collections created by the Texas Natural Science Center) on the Corral system, which assures performance, web access and security. Working with the Data Management and Collections group at TACC, they are now developing solutions to make the database more dynamic and to automate time-consuming error checking. Collaborations like this one, involving storage resources, advanced data management software, and expert consulting, are the kind the archivists envisioned.
(L to R) Maria Esteva, Amy Rushing, Angela Newell (Coordinator, ITS), Colleen Lyon (Digital Repository Librarian, UT Libraries). [Photo Credit: Travis Willmann]
"Data management in itself can be complex," said Rushing. "The goal is to pull relevant information resources together in one place and also to educate researchers on the need for data management. If we educate researchers before their data gathering starts and we give them good tips on managing their data, it hopefully will make the whole process easier."
The management of digital data is an evolving field. Many aspects of the project are still being discussed as the team develops protocols for the UT research community. They are tackling issues such as licensing, privacy, acceptable file formats, and the creation of metadata to assist in the organization and retrieval of information.
In addition, they will deploy a web-based template tool that will allow researchers to generate tailored data plans automatically and to reuse and adapt previous plans.
The Office of Sponsored Projects is organizing and promoting training sessions for interested researchers in the spring, in association with TACC, UT Libraries, and ITS. To learn more about the project, explore the Data Management at UT page on the UT Libraries website.
"It's an exciting collaboration," said Rushing. "I'm really proud that UT was able to pull together to provide a strong resource for the campus."
January 24, 2012
The Texas Advanced Computing Center (TACC) at The University of Texas at Austin is one of the leading centers of computational excellence in the United States. The center's mission is to enable discoveries that advance science and society through the application of advanced computing technologies. To fulfill this mission, TACC identifies, evaluates, deploys, and supports powerful computing, visualization, and storage systems and software. TACC's staff experts help researchers and educators use these technologies effectively, and conduct research and development to make these technologies more powerful, more reliable, and easier to use. TACC staff also help encourage, educate, and train the next generation of researchers, empowering them to make discoveries that change the world.
Aaron Dubrow
Science and Technology Writer
aarondubrow@tacc.utexas.edu

