Race Car Code for Computational Biology

Published on March 31, 2011 by Aaron Dubrow

Variegated maize ears. Genetic differences in the maize genome lead to traits like coloration, growth rates, and hardiness. Scientists are trying to understand these genetic associations in order to create more productive crops. [Photo courtesy of Sam Fentress.]

Once upon a time, it was thought that genes were static. If you had this gene, you had brown eyes; if you had some other gene, you had a terrible disease. But scientists are increasingly realizing that most traits are determined by a complex network of genes working together to control the biochemical processes that determine the emergence of a trait.

"We're now understanding that genes are literally like an online cybernetic control system that every minute of every day are turning each other on and off," said Steve Welch, professor of agronomy at Kansas State University. "It really is like a computer control system."

Welch is part of a team using the pattern-detecting power of supercomputers to find important relationships among genes that may be responsible for traits in plants. The impetus for the project came from Pat Schnable at Iowa State University and Dan Nettleton, a statistician there with whom he collaborates. Their group was searching for pairs of genes involved in important traits. While the project is quite new, it is making important strides in terms of developing the methodologies for statistical association studies.

Few societal questions are as important as how to feed the world. Climate change, nutrient loss, and rising populations conspire to make this harder and harder. Scientists are racing to understand the genetic nature of drought resistance and insect hardiness so they can engineer tomorrow's supercrops, capable of sustaining a growing human population.

"It's getting cheaper every day to find out every letter in the DNA of all of the genes for many plants," Welch said. "But it leaves you with terabytes of information you have to sort through and not just find the single genes that may be controlling traits, but to also look for the combinations. That's the frontier right now: how do we start looking for combinations of genes?"

One of the main ways scientists "sort through" large datasets—whether genetic information or exoplanets—is by using high performance computers. In doing so, researchers transform their scientific problem into mathematical equations that are simulated via parallel computing.

Potential epistatic interactions in the maize genome are revealed by interactive visualization of gene expression, genome structure, and the outcomes of the pairwise interaction application. Screen capture from iPlantInteractionBrowser (Matt Vaughn, TACC)

The team developed an algorithm to find genetic associations, but early estimates suggested it would take 1,600 years using conventional hardware and software approaches to complete even a simplified version of the problem. The project might not have progressed any further were it not for the performance optimization experts at the Texas Advanced Computing Center (TACC).
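To see why a search over gene pairs becomes expensive so quickly, it helps to count the pairs. The sketch below is illustrative only: the gene counts are hypothetical round numbers, not figures from the project, and `count_gene_pairs` is a name invented for this example.

```python
from math import comb


def count_gene_pairs(num_genes: int) -> int:
    """Number of unordered pairs among num_genes genes: n * (n - 1) / 2."""
    return comb(num_genes, 2)


# Hypothetical gene counts, chosen only to show the quadratic growth.
for num_genes in (1_000, 10_000, 100_000):
    print(f"{num_genes:>7,} genes -> {count_gene_pairs(num_genes):,} pairs")
```

Each tenfold increase in the number of genes multiplies the number of pairs by roughly one hundred, and every pair needs its own statistical test.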

Schnable happened to run into Steve Goff, project director for iPlant, a large NSF program aimed at improving the tools for plant biology. Schnable told Goff about his problem, and Goff connected him with Welch and others involved with relating genes to their resulting traits. This included TACC staff, who not only run world-class supercomputers, but who are also experts at improving scientific computing codes.

Working with Lars Koesterke, a performance evaluation and optimization expert at TACC, the team simplified the mathematics of the problem, rewrote the Python code as a parallel program using MPI (Message Passing Interface, a standard for passing messages between processes in parallel computing), and got it to run on the Ranger supercomputer.
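An MPI program typically runs as many cooperating processes, called ranks, each handling its own share of the work. The article does not describe how the team divided the gene pairs, so the following is only a minimal sketch that assumes a simple round-robin split. It uses no MPI library and simulates the ranks in a loop; `pairs_for_rank` is a hypothetical helper.

```python
from itertools import combinations


def pairs_for_rank(num_genes, rank, num_ranks):
    """Yield the gene-index pairs assigned to one rank, dealt out round-robin."""
    for position, pair in enumerate(combinations(range(num_genes), 2)):
        if position % num_ranks == rank:
            yield pair


# Tiny hypothetical problem: 6 genes (15 pairs) shared among 4 ranks.
num_genes, num_ranks = 6, 4
shares = [list(pairs_for_rank(num_genes, r, num_ranks)) for r in range(num_ranks)]
for rank, share in enumerate(shares):
    print(f"rank {rank}: {share}")
```

In a real MPI run each process would compute only its own share and the results would be combined at the end; here every share is built in a single process purely to show that the pairs are covered once each.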

In doing so, Koesterke made the code run 3.2 million times faster, according to Welch, reducing the time to solution from 1,600 years to 4.5 hours, while at the same time increasing the number of iterations by an order of magnitude, improving the accuracy of the studies.
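The scale of that speedup can be checked with simple arithmetic. The sketch below assumes an average year of 365.25 days and ignores the tenfold increase in iterations, so it is a rough estimate and does not exactly reproduce the figure Welch quotes.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumption: average year length

original_hours = 1_600 * HOURS_PER_YEAR  # early estimate: 1,600 years
optimized_hours = 4.5                    # reported time on Ranger

speedup = original_hours / optimized_hours
print(f"speedup is roughly {speedup:,.0f}x")
```

The result is a little over 3.1 million, the same order of magnitude as the quoted figure.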

Race cars and computer codes have many things in common. Both are created by engineers using intuition, trial and error, and logistical knowledge. And both go as fast as possible in repeated loops.

"Most of the computing time is spent in loops, and you have to organize the loops so that the loop body is executed at the highest efficiency on the hardware," said Koesterke. "That means that you exploit the hardware parallelism optimally, and that's very difficult, even if you write the code correctly."

Lars Koesterke, research associate in the performance optimization group at the Texas Advanced Computing Center, helped make the plant association algorithm run millions of times faster.

As in race car testing, small changes in the design can produce big changes in the outcome, or none at all. Treating software optimization like an engineering problem, Koesterke created several different kernels (the innermost, most performance-critical loops of the code), each with different logic, to see how the order of different procedures affected the speed.

"It was just trial and error," he said. "If it runs faster, it's better."
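That trial-and-error approach can be sketched as a small timing harness: write several versions of the same kernel, confirm they agree, and keep the fastest. The two toy kernels below are hypothetical stand-ins, not the project's code. They compute the same sum of products over all pairs, with the second moving work out of the inner loop.

```python
import math
import timeit


def kernel_naive(values):
    """Sum of values[i] * values[j] over all pairs i < j, one multiply per pair."""
    total = 0.0
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            total += values[i] * values[j]
    return total


def kernel_hoisted(values):
    """Same sum, but values[i] is factored out so each row needs one multiply."""
    total = 0.0
    for i in range(len(values)):
        total += values[i] * sum(values[i + 1:])
    return total


def time_kernels(kernels, values, repeats=3):
    """Return the best wall-clock time in seconds for each kernel."""
    timings = {}
    for kernel in kernels:
        runs = timeit.repeat(lambda: kernel(values), number=1, repeat=repeats)
        timings[kernel.__name__] = min(runs)
    return timings


values = [float(k % 7) for k in range(300)]  # small made-up input
assert math.isclose(kernel_naive(values), kernel_hoisted(values))

timings = time_kernels([kernel_naive, kernel_hoisted], values)
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.6f} s")
```

Checking that every variant returns the same answer before comparing times matters as much as the timing itself: a faster kernel is only better if it is still correct.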

Koesterke made the code run incredibly fast by eliminating unnecessary arithmetic and by breaking the solution into pieces small enough to reside in the Level 2 cache (a small, fast pool of memory close to the processor).
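One common way to keep a working set small enough for cache is to visit the data in tiles, or blocks, rather than sweeping across all of it at once. The article does not say which technique was used, so the sketch below only assumes a generic blocking scheme over gene pairs. `blocked_pairs` and the block size are hypothetical, and pure Python will not show the cache benefit that a compiled kernel would.

```python
from itertools import combinations


def blocked_pairs(n, block):
    """Yield every index pair (i, j) with i < j, visiting them tile by tile."""
    for i0 in range(0, n, block):          # first index of the row tile
        for j0 in range(i0, n, block):     # first index of the column tile
            for i in range(i0, min(i0 + block, n)):
                # On diagonal tiles, start after i so each pair appears once.
                for j in range(max(j0, i + 1), min(j0 + block, n)):
                    yield i, j


# Blocking changes the order of the visits, not the set of pairs visited.
visited = list(blocked_pairs(10, 4))
assert sorted(visited) == list(combinations(range(10), 2))
print(f"{len(visited)} pairs visited in tiles of 4")
```

Within one tile the loop touches at most `2 * block` distinct indices, so in a compiled implementation the block size could be chosen to make the data for those indices fit in cache.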

"If your code is three times faster, then you can solve a problem that is three times bigger," Koesterke said. "But if your code is a million times faster, then you can do transformational science."

The first round of analysis did not turn up any important associations. But the innovations in the algorithm and code were significant enough to inspire a paper about the research, which was accepted to HiCOMB 2011, the International Workshop on High Performance Computational Biology.

Association studies are hugely important, not only in plant biology, but also in studies of human health where they are expected to help uncover the genetic roots of most diseases. "Genes are genes," Welch said. "Improved food, feed and fiber production are among the direct benefits of working with plants. However, better methods of genetic investigation, even if developed with plants, can benefit biomedical research as well."

Welch is serving as a scientist-in-residence at TACC during his sabbatical from Kansas State University. He will be working with TACC staff, including computational biologists Michael Gonzales and Matthew Vaughn, and other programmers, to improve several algorithms related to plant biology.

For Koesterke, the project was an exciting opportunity to translate his knowledge across disciplines while creating new codes from scratch.

"I'm an astrophysicist by training, but I'm more interested in successful collaborations where I can bring my expertise writing really fast code, to the table," he said. "If you have collaborators who are excited, then it's a recipe for success."

