Finding Meaning in Massive Datasets
Heavy lifting, in the form of storing, analyzing and curating big data, is enabling data-driven science across many fields
Look around and you'll notice data-collecting devices everywhere. From smartphones to thermostats, from car alarms to Twitter feeds, we're generating and collecting more data than ever before.
This is the case in science, too, where ever-more-powerful instruments—supercomputers, telescopes, gene sequencers—create more, and more detailed, data about our natural world.
This explosion of information has led some to announce the arrival of a new kind of science: driven by data, rather than theory or equations.
"The quest for knowledge used to begin with grand theories," Chris Anderson, editor-in-chief of Wired wrote in July 2008. "Now it begins with massive amounts of data. Welcome to the Petabyte Age."
Research centers like the Texas Advanced Computing Center (TACC) now dedicate expert staff and systems to explore data-driven science with the goal of finding needles of insight in digital haystacks. In traditional computational fields like astrophysics, as well as in emerging applications like smart grids and genomics, early projects are showing the benefits of using advanced computing to find meaning in massive datasets.
Take, for example, cosmology: the attempt to understand the evolution of the universe from its earliest moments.
For decades, researchers have used supercomputers to perform simulations of cosmic evolution. These simulations approximate the lifecycles of stars, planets and black holes as they have changed over eons, influenced by the initial condition of the Big Bang, and by their spinning, swallowing, exploding neighbors.
These images show whole-home energy usage averaged by census block in the Mueller development on August 15, 2011, at 7 a.m., when the temperature was about 79°F, and at 7 p.m., when the temperature was about 100°F. Note the large increase in evening energy usage, driven primarily by air conditioning. New visualization tools can clearly represent energy usage to consumers, energy operators, and officials, and help individuals make decisions about their own usage. [Courtesy of Paul Navratil, Adam Kubatch, David Walling and Chris Jordan, Texas Advanced Computing Center; Chris Holcomb and Bert Haskell, Pecan Street Inc.]
Over time cosmological simulations have grown in complexity and become unwieldy for analysis. Scientific visualization helps, but at such a scale, the complexity can easily overwhelm the insights.
This was the case for University of Sussex Professor Ilian Iliev, whose dark matter simulations of three billion years of the Universe comprised 216 billion particles (corresponding to almost five terabytes of data per output, or roughly one-half of the printed collection of the U.S. Library of Congress).
"Universal dark matter simulations track the interactions of billions of particles to better understand the evolution of the modern Universe," said Paul Navratil, a research associate at TACC and a collaborator on the project. "The massive amount of data produced by these simulations overwhelms traditional three-dimensional visualizations."
In collaboration with researchers from TACC, The University of Texas at Austin, and Pervasive Software, Iliev developed a new way of data mining scientific simulations. The researchers used MapReduce, a programming model Google created to process its massive datasets, and Hadoop, an open-source implementation that enables applications to work across thousands of computing nodes and petabytes of data. Adapting these tools to their problem, the researchers developed a search mechanism by which they could identify regions of interest in otherwise overwhelming simulation output. Specifically, the project focused on the emergence of dark matter halos, which play a key role in current models of galaxy formation and evolution.
Using the Longhorn visualization cluster at TACC, configured with Hadoop storage drives, the halo project programmatically identified the location and size of dense particle regions that indicate galaxy locations in the simulation. These locations were then used to guide visualizations of the halo regions.
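The article doesn't spell out the project's code, but the general pattern is easy to sketch. The following is a minimal, illustrative Hadoop-streaming-style example in Python, not the researchers' actual pipeline: the map step bins particle positions into a coarse spatial grid, and the reduce step keeps only cells whose particle counts exceed a threshold, marking them as candidate halo regions to guide visualization. The input format, grid size, box size, and density threshold are all assumptions.

```python
#!/usr/bin/env python
"""Illustrative halo-candidate search in the Hadoop-streaming style.

Map step: assign each particle (one "x y z" position per input line) to a
coarse grid cell. Reduce step: sum particle counts per cell and keep only
dense cells. Grid size, box size, and threshold are assumed values.
"""
import sys

GRID_CELLS = 256            # grid cells per axis (assumed)
BOX_SIZE = 100.0            # simulation box size, arbitrary units (assumed)
DENSITY_THRESHOLD = 5000    # particles per cell to flag a halo candidate (assumed)


def mapper():
    """Emit 'i,j,k<TAB>1' for the grid cell containing each particle."""
    cell = BOX_SIZE / GRID_CELLS
    for line in sys.stdin:
        try:
            x, y, z = (float(v) for v in line.split()[:3])
        except ValueError:
            continue                      # skip malformed lines
        i = int(x / cell) % GRID_CELLS
        j = int(y / cell) % GRID_CELLS
        k = int(z / cell) % GRID_CELLS
        sys.stdout.write("%d,%d,%d\t1\n" % (i, j, k))


def reducer():
    """Sum counts for each cell key and print cells above the threshold."""
    current, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None and count >= DENSITY_THRESHOLD:
                sys.stdout.write("%s\t%d\n" % (current, count))
            current, count = key, 0
        count += int(value)
    if current is not None and count >= DENSITY_THRESHOLD:
        sys.stdout.write("%s\t%d\n" % (current, count))


if __name__ == "__main__":
    # Hadoop streaming runs the same script in both roles, e.g.:
    #   -mapper "halo_cells.py map"  -reducer "halo_cells.py reduce"
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

Because the reducer only ever sees one cell's counts at a time, this kind of search scales across many nodes without ever loading the full particle set into memory at once.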
"The vast amounts of data being produced by our simulations pushes the limits of even the largest computers," Iliev said. "Therefore, the development of memory- and CPU-efficient parallel tools to reduce, analyze and visualize this data is crucial."
Traveling from the cosmic plane to the everyday, TACC also supports the analysis of large amounts of data from an experimental smart grid developed by Austin Energy and The University of Texas Energy Institute and based at the Mueller Development in Austin. The project equipped 100 new homes with sensors to measure consumer energy usage with extreme granularity.
Power generation accounts for 40 percent of the U.S. carbon footprint, and energy efficiency is an increasingly important priority. More than ever, there is significant momentum from both the general public and government to develop "smart grids" capable of delivering electric power more efficiently.
The Mueller smart grid project gathers and analyzes large amounts of information about energy usage from its test site. However, the datasets being collected require powerful advanced computing systems to capture, process, and analyze rapidly and productively.
"TACC has some of the world's fastest computers, so we're confident they can do any kind of crunching, rendering, or data manipulation," said Bert Haskell, technology director for the project. "They have the technical expertise to look at different database structures and know how to organize the data so it's more efficiently managed."
New data is generated every 15 seconds showing precisely how much energy is being used on an individual circuit. To date, the database contains more than 500 million individual power readings and continues to grow. TACC and Austin Energy are calibrating the data to develop an accurate baseline of energy usage in the city of Austin. They are also creating new visualization tools to clearly represent energy usage to consumers, energy operators, and officials.
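As a rough illustration of the kind of rollup behind those visualizations (a sketch, not the project's actual code), the example below averages 15-second whole-home readings by census block for a chosen hour. The CSV file name and column names (timestamp, home_id, census_block, watts) are assumptions.

```python
"""Illustrative rollup of 15-second smart-grid readings.

Assumed CSV columns: timestamp, home_id, census_block, watts, where each
row is one whole-home power reading. File name and layout are assumptions.
"""
import csv
from collections import defaultdict


def average_usage_by_block(path, hour_prefix):
    """Return {census_block: average whole-home watts} for readings whose
    timestamp starts with hour_prefix (e.g. '2011-08-15 19')."""
    totals = defaultdict(float)    # (block, home) -> summed watts
    samples = defaultdict(int)     # (block, home) -> number of readings

    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if not row["timestamp"].startswith(hour_prefix):
                continue
            key = (row["census_block"], row["home_id"])
            totals[key] += float(row["watts"])
            samples[key] += 1

    # Average each home's readings over the hour, then average homes per block.
    per_block = defaultdict(list)
    for (block, home), total in totals.items():
        per_block[block].append(total / samples[(block, home)])
    return {block: sum(v) / len(v) for block, v in per_block.items()}


if __name__ == "__main__":
    # Example: compare the 7 a.m. and 7 p.m. snapshots described above.
    morning = average_usage_by_block("mueller_readings.csv", "2011-08-15 07")
    evening = average_usage_by_block("mueller_readings.csv", "2011-08-15 19")
    for block in sorted(set(morning) | set(evening)):
        print(block, morning.get(block, 0.0), evening.get(block, 0.0))
```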
"We're trying to create very rich resources for people to use in analyzing patterns of energy usage," said Chris Jordan, a member of TACC's Advanced Computing Systems group. "We're really interested to see what people can do with it, and how the data stream can be transformed into a decision-making device for city planners and individual consumers."
One of the areas where big data is expected to have the largest impact on our lives is genomics. The explosion of DNA sequencing technology in the last five years means a researcher who used to be able to generate 1,000, or maybe 10,000, DNA sequences in a day can now generate tens of billions of DNA sequences daily.
A multi-institute project spanning China, Canada and the United States, the 1000 Plant Genomes Project is an example of a large-scale genomics study using vast amounts of distributed information to research the genetic and genomic diversity in the plant kingdom.
With more than 1.2 petabytes of storage, data-transfer speeds of up to 10 gigabytes per second, and direct connectivity to TACC's other computational resources, TACC's Corral storage system is built to handle data-intensive problems like this one.
Of the estimated 500,000 species of plants, only a handful have undergone genomic studies. The idea behind the project is to characterize the major genes in 1,000 different plant species and draw insights from comparing their structures.
"It's a very big, complex data management task and data distribution task," said Chris Jordan of TACC. "The data has to be distributed from the main principal investigator to all of the co-principal investigators who are working on different aspects of the science."
It's not just about shuffling and housing data. Data-driven science is about increasing scientific efficiency through reliable high-performance network speeds and effective user interfaces. This is where the researchers rely on TACC's expert staff and advanced computing systems to do the heavy lifting in terms of storing, backing up, distributing, querying, retrieving, analyzing and annotating the data. This saves a lot of time and effort for the researchers on the project.
"A couple of years ago, biologists were completely overwhelmed by big data, but our technological prowess has started to catch up," said Matthew Vaughn, a research associate in computational biology at TACC. "We're starting to understand best practices about how to curate and analyze genomics data."
Early on, computer cluster architectures weren't fully optimized for the types of analysis needed in biology, but this too is changing. "Biology applications are very file access-oriented, and a lot of computer cluster technologies weren't designed with that in mind," Vaughn said. Data storage systems like Corral help scientists by providing fast, reliable access to genome sequence data no matter where they, or their HPC systems, are in the world.
Efforts like the 1000 Plant Genomes Project are making great strides. "Comparative genomics lets you roll back the clock on evolution to find out how various traits have evolved, how the architectures of the genomes have evolved, and learn about the biology of plants in general, as well as the biology of an individual species," Vaughn said.
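To give a flavor of the comparative approach (a toy sketch, not the project's pipeline), the example below tallies which gene families are present in which species, separating families shared across all sampled species from those found in only one lineage. The species names and gene-family identifiers are invented placeholders.

```python
"""Toy comparative-genomics tally: shared versus lineage-specific gene families.

The mapping of species to gene-family IDs is an invented placeholder; a real
study would derive it from annotated genome or transcriptome assemblies.
"""

# Hypothetical presence data: species -> set of gene-family identifiers.
gene_families = {
    "species_A": {"FAM001", "FAM002", "FAM003", "FAM007"},
    "species_B": {"FAM001", "FAM002", "FAM004"},
    "species_C": {"FAM001", "FAM003", "FAM004", "FAM009"},
}

# Families present in every sampled species (candidates for ancient, conserved genes).
core = set.intersection(*gene_families.values())

# Families seen in exactly one species (candidates for lineage-specific innovation).
all_species = list(gene_families)
unique = {
    fam: sp
    for sp in all_species
    for fam in gene_families[sp]
    if sum(fam in gene_families[other] for other in all_species) == 1
}

print("Core families shared by all species:", sorted(core))
print("Lineage-specific families:", dict(sorted(unique.items())))
```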
With assistance from TACC staff and computational experts at other research centers around the country, biologists are finding it easier to work with bigger and more complex data.
"What TACC has done differently is bring together a group of researchers who actually understand the domain sciences so we can apply these advanced computing technologies to the field," TACC Computational Biology Program Director Michael Gonzales said. "This is where TACC has really taken a leading role."
December 1, 2011
The Texas Advanced Computing Center (TACC) at The University of Texas at Austin is one of the leading centers of computational excellence in the United States. The center's mission is to enable discoveries that advance science and society through the application of advanced computing technologies. To fulfill this mission, TACC identifies, evaluates, deploys, and supports powerful computing, visualization, and storage systems and software. TACC's staff experts help researchers and educators use these technologies effectively, and conduct research and development to make these technologies more powerful, more reliable, and easier to use. TACC staff also help encourage, educate, and train the next generation of researchers, empowering them to make discoveries that change the world.
- An explosion of available information has led to the emergence of a new kind of science: driven by data, rather than theory or equations.
- In traditional computational fields like astrophysics, as well as in emerging applications like smart grids and plant biology, early projects are showing the benefits of using advanced computing to find meaning in massive datasets.
- TACC's systems support hundreds of projects and collections that are leading to the development of best practices in fields as diverse as archeology, epidemiology, and linguistics.
Science and Technology Writer