Race Car Code for Computational Biology

Performance optimization speeds algorithms for plant genetic studies

Variegated maize ears. Genetic differences in the maize genome lead to traits like coloration, growth rates, and hardiness. Scientists are trying to understand these genetic associations in order to create more productive crops. [Photo courtesy of Sam Fentress.]

Once upon a time, it was thought that genes were static. If you had this gene, you had brown eyes; if you had some other gene, you had a terrible disease. But scientists are increasingly realizing that most traits are determined by a complex network of genes working together to control the biochemical processes from which a trait emerges.

"We're now understanding that genes are literally like an online cybernetic control system that every minute of every day are turning each other on and off," said Steve Welch, professor of agronomy[1] at Kansas State University. "It really is like a computer control system."

Welch is part of a team using the pattern-detecting power of supercomputers to find important relationships among genes that may be responsible for traits in plants. The impetus for the project came from Pat Schnable at Iowa State University and Dan Nettleton, a statistician there with whom he collaborates. Their group was searching for pairs of genes involved in important traits. Although the project is quite new, it is already advancing the methodology of statistical association studies.

Few societal questions are as important as how to feed the world. Climate change, nutrient loss, and rising populations conspire to make this harder and harder. Scientists are racing to understand the genetic nature of drought resistance and insect hardiness so they can engineer tomorrow's supercrops, capable of sustaining a growing human population.

"It's getting cheaper every day to find out every letter in the DNA of all of the genes for many plants," he said. "But it leaves you with terabytes of information you have to sort through and not just find the single genes that may be controlling traits, but to also look for the combinations. That's the frontier right now: how do we start looking for combinations of genes?"

One of the main ways scientists "sort through" large datasets—whether genetic information or exoplanets—is by using high performance computers. In doing so, researchers transform their scientific problem into mathematical equations that are simulated via parallel computing.

Potential epistatic interactions in the maize genome are revealed by interactive visualization of gene expression, genome structure, and the outcomes of the pairwise interaction application. Screen capture from iPlantInteractionBrowser (Matt Vaughn, TACC)

The team developed an algorithm to find genetic associations, but early estimates suggested it would take 1,600 years using conventional hardware and software approaches to complete even a simplified version of the problem. The project might not have progressed any further were it not for the performance optimization experts at the Texas Advanced Computing Center (TACC).
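One reason a search for gene pairs becomes expensive is how quickly the number of candidate pairs grows. The short Python sketch below counts unordered pairs for a few gene counts. The gene counts are illustrative assumptions, not figures from the project, and the sketch says nothing about the cost of testing each pair.

```python
from math import comb


def pair_count(n_genes: int) -> int:
    """Number of unordered gene pairs: n * (n - 1) / 2."""
    return comb(n_genes, 2)


# Hypothetical gene counts, chosen only to show the growth rate.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} genes -> {pair_count(n):>13,} pairs")
```

Going from 1,000 to 100,000 genes raises the pair count from 499,500 to nearly 5 billion, so a hundredfold increase in genes means roughly a ten-thousandfold increase in pairs to examine.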

Schnable happened to run into Steve Goff, project director for iPlant, a large NSF program aimed at improving the tools for plant biology. Schnable told Goff about his problem, and Goff connected him with Welch and others involved with relating genes to their resulting traits.  This included TACC staff, who not only run world-class supercomputers, but who are also experts at improving scientific computing codes.

Working with Lars Koesterke, a performance evaluation and optimization expert at TACC, the team simplified the mathematics of the problem, converted the code from Python to a parallel implementation built on MPI (the Message Passing Interface, a standard for communication between processes in parallel programs), and got it to run on the Ranger supercomputer.

In doing so, Koesterke made the code run 3.2 million times faster, according to Welch, reducing the time to solution from 1,600 years to 4.5 hours, while at the same time increasing the number of iterations by an order of magnitude, improving the accuracy of the studies.
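The scale of that speedup can be checked with simple arithmetic. The sketch below converts 1,600 years to hours and divides by 4.5 hours. The year length of 365.25 days is an assumption, and the function name is only illustrative.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumption: average year length in hours


def speedup(before_years: float, after_hours: float) -> float:
    """Ratio of the original runtime to the optimized runtime."""
    return before_years * HOURS_PER_YEAR / after_hours


factor = speedup(1_600, 4.5)
print(f"about {factor / 1e6:.1f} million times faster")
```

This gives a factor of a little over 3 million, the same order as the 3.2 million Welch cites. The quoted times are rounded and the sketch ignores the tenfold increase in iterations, so it only confirms the order of magnitude.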

Race cars and computer codes have many things in common. Both are created by engineers using intuition, trial and error, and logistical knowledge. And both go as fast as possible in repeated loops.

"Most of the computing time is spent in loops, and you have to organize the loops so that the loop body is executed at the highest efficiency on the hardware," said Koesterke. "That means that you exploit the hardware parallelism optimally, and that's very difficult, even if you write the code correctly."

Lars Koesterke, research associate in the performance optimization group at the Texas Advanced Computing Center, helped make the plant association algorithm run millions of times faster.

As in race car testing, small changes in the design can produce big changes in the outcome, or none at all. Treating software optimization like an engineering problem, Koesterke created several different kernels (the kernel being the innermost, most important loop of the code), each with slightly different logic, to see how the order of the procedures affected the speed.
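The trial-and-error approach can be pictured as a small timing harness. The Python sketch below is not the project's code: the two kernels, the table of numbers, and the harness are all hypothetical stand-ins. It runs two loop orderings that compute the same sum, checks that they agree, and reports which one ran faster.

```python
import time


def kernel_rows_outer(table):
    """Variant A: walk the table row by row."""
    total = 0
    for row in table:
        for value in row:
            total += value
    return total


def kernel_cols_outer(table):
    """Variant B: same sum, visited column by column."""
    total = 0
    for j in range(len(table[0])):
        for row in table:
            total += row[j]
    return total


def fastest_kernel(kernels, data, repeats=3):
    """Time each kernel, check that all agree, return (winner, timings)."""
    reference = None
    timings = {}
    for name, kernel in kernels.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            result = kernel(data)
            best = min(best, time.perf_counter() - start)
        if reference is None:
            reference = result
        elif result != reference:
            raise ValueError(f"{name} disagrees with the reference result")
        timings[name] = best
    return min(timings, key=timings.get), timings


# Hypothetical test data: a 200 x 200 table of small integers.
table = [[(i * j) % 7 for j in range(200)] for i in range(200)]
kernels = {"rows_outer": kernel_rows_outer, "cols_outer": kernel_cols_outer}
winner, timings = fastest_kernel(kernels, table)
print(winner, timings)
```

Which variant wins depends on the machine and the interpreter, which is the point of measuring rather than guessing. The correctness check matters as much as the timing: a faster kernel is only better if it returns the same answer.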

"It was just trial and error," he said. "If it runs faster, it's better."

Koesterke made the code run incredibly fast by eliminating unnecessary arithmetic and by making parts of the solution small enough to reside in the Level 2 cache (a small, fast store of memory close to the processor).
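As a rough illustration of what eliminating unnecessary arithmetic can mean, the Python sketch below computes a made-up pairwise score two ways. The scoring formula and function names are assumptions for illustration only, not the team's method. The first version repeats the same standardization for every pair; the second does it once per value and reuses the result.

```python
import math


def _mean_and_std(values):
    """Population mean and standard deviation of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return mean, std


def pair_scores_naive(values):
    """Standardize both values inside the inner loop, for every pair."""
    mean, std = _mean_and_std(values)
    scores = []
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            zi = (values[i] - mean) / std  # recomputed for every j
            zj = (values[j] - mean) / std  # recomputed for every i
            scores.append(zi * zj)
    return scores


def pair_scores_hoisted(values):
    """Standardize each value once, then reuse it in the pair loop."""
    mean, std = _mean_and_std(values)
    z = [(v - mean) / std for v in values]  # n operations instead of ~n^2
    scores = []
    for i in range(len(z)):
        zi = z[i]  # loop-invariant for the inner loop
        for j in range(i + 1, len(z)):
            scores.append(zi * z[j])
    return scores


# Hypothetical measurements; any non-constant list works.
values = [0.5, 1.5, 2.0, 4.0, 3.0]
naive = pair_scores_naive(values)
hoisted = pair_scores_hoisted(values)
print(len(naive), len(hoisted))
```

Both functions return the same scores, but the second performs the subtraction and division once per value rather than twice per pair. Keeping the reused data in a compact array is also the kind of change that helps it fit in cache, although cache behavior is hardware-dependent and is not modeled here.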

"If your code is three times faster, then you can solve a problem that is three times bigger," Koesterke said. "But if your code is a million times faster, then you can do transformational science."

The first round of analysis did not turn up any important associations. But the innovations in the algorithm and code were significant enough to inspire a paper about the research, which was accepted to HiCOMB 2011, the International Workshop on High Performance Computational Biology.

Association studies are hugely important, not only in plant biology, but also in studies of human health where they are expected to help uncover the genetic roots of most diseases. "Genes are genes," Welch said. "Improved food, feed and fiber production are among the direct benefits of working with plants.  However, better methods of genetic investigation, even if developed with plants, can benefit biomedical research as well."

Welch is serving as a scientist-in-residence at TACC during his sabbatical from Kansas State University. He will be working with TACC staff, including computational biologists Michael Gonzales and Matthew Vaughn, and other programmers, to improve several algorithms related to plant biology.

For Koesterke, the project was an exciting opportunity to translate his knowledge across disciplines while creating new codes from scratch.

"I'm an astrophysicist by training, but I'm more interested in successful collaborations where I can bring my expertise writing really fast code, to the table," he said. "If you have collaborators who are excited, then it's a recipe for success."

March 31, 2011


The Texas Advanced Computing Center (TACC) at The University of Texas at Austin is one of the leading centers of computational excellence in the United States. The center's mission is to enable discoveries that advance science and society through the application of advanced computing technologies. To fulfill this mission, TACC identifies, evaluates, deploys, and supports powerful computing, visualization, and storage systems and software. TACC's staff experts help researchers and educators use these technologies effectively, and conduct research and development to make these technologies more powerful, more reliable, and easier to use. TACC staff also help encourage, educate, and train the next generation of researchers, empowering them to make discoveries that change the world.

  • Researchers are increasingly using gene sequencing technology to explore plant genetics.
  • This technology generates large amounts of data, and high performance computers are needed to identify connections between genes and traits.
  • Working with TACC staff, researchers were able to make the algorithms used to find associations among genes 3.2 million times faster, reducing the time to solution from 1,600 years to 4.5 hours.
  • These advances will help plant and human geneticists find valuable insights in association studies.

Aaron Dubrow
Science and Technology Writer
aarondubrow@tacc.utexas.edu