A Boost for Bioinformatics in the Context of Evolution

Biology is a data-rich science--and Professor Daniel Miranker at The University of Texas at Austin (UT Austin) has a revolutionary new methodology for making the most of it. He and his research group use the resources of the Texas Advanced Computing Center (TACC) and collaborate with TACC researchers to develop and prove their concepts.

Screenshot of multi-colored text
This is a sequence alignment retrieved from the PlantsP database (plantsp.sdsc.edu).

Each line of letters is part of a sequence of amino acids belonging to a protein called calcium-dependent protein kinase (CPK). Each line is a slightly different CPK. Each letter stands for a different amino acid. The vertical alignment shows differences at each position (different letters in the column) and gaps in the rows where the alignment over the entire family just does not work for that sequence. This kind of map represents the best guess at aligning the different CPK proteins with the sequence given on the first line, as made by a computer program given certain constraints.

All of the sequences come from a family of CPKs found in an unassuming little green plant commonly called a mouse-ear cress, related to mustard, with the Latin name of Arabidopsis thaliana. Because its characteristics make it ideal for genetic studies, Arabidopsis was the first flowering plant whose DNA (genome) was sequenced. Small portions of that genome (genes) code for each of the CPKs whose amino acid sequences are shown here. Graphic courtesy of Michael Gribskov, Purdue University, principal investigator for PlantsP.

Biologists, like other scientists, want easy, rapid access to the mountains of data they have built. But biological data are not like most other kinds of data. They must be understood as cross-linked by complex temporal, spatial, and historical relationships. Every datum is part of a larger context that cannot be ignored. This is true whether the data in question are sequences of DNA; mass spectra from the proteins of an entire genome; or the shapes of active sites on proteins critical to drug design.

The larger context is evolution, the history of life on earth. The technology that led to the human genome sequence has been applied now to hundreds of organisms; in addition to "genomics," scientists study "proteomics," the proteins created from the instructions in the genes of the genomes. Key to interpreting these data is the effort to relate them to their origins in evolutionary history. The evolutionary "distance" from one DNA or protein sequence to another may be a single change in a nucleotide or amino acid or a long series of such changes; in either case, when it can be quantified, it supplies another, crucial dimension to the data.
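As a minimal illustration of how such a distance can be quantified, the sketch below counts the single-letter substitutions, insertions, and deletions needed to turn one sequence into another (the Levenshtein edit distance). This is a simplifying assumption made for illustration: the article does not say which distance measures Miranker's group uses, and the function name and example sequences are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Fewest single-letter substitutions, insertions, and deletions turning a into b."""
    # prev[j] is the distance between the letters of a seen so far and the first j letters of b.
    prev = list(range(len(b) + 1))
    for i, letter_a in enumerate(a, start=1):
        curr = [i]  # turning i letters of a into an empty prefix of b takes i deletions
        for j, letter_b in enumerate(b, start=1):
            cost = 0 if letter_a == letter_b else 1
            curr.append(min(prev[j] + 1,          # delete letter_a
                            curr[j - 1] + 1,      # insert letter_b
                            prev[j - 1] + cost))  # substitute, or keep a match
        prev = curr
    return prev[-1]


# Hypothetical toy sequences: identical, one change apart, and two changes apart.
print(edit_distance("GATTACA", "GATTACA"))
print(edit_distance("GATTACA", "GACTACA"))
print(edit_distance("GATTACA", "GTTACAA"))
```

A single changed letter gives a distance of one, and a longer series of changes gives a larger number, which is the sense in which distance adds a dimension to the data.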

"Our group is leveraging this fact to invent new ways to index, store, and access biological data whose meaning is rooted in the mathematics of evolution," says Daniel Miranker, professor of computer sciences at UT Austin. "We've been using computational resources at TACC to show that our methods deliver faster access, depict and detect relationships accurately, and can be easily generalized to many different types of biological data, ranging from sequences to mass spectra to volumetric models of the chemical fields around proteins." Miranker's group includes Willard Briggs, Rui Mao, Shu Wang, and Weijia Xu, plus M.S. students and undergraduates.

"We're also delivering ease of use--simple programmability and a high level of abstraction--by integrating our new algorithms into existing database languages and systems. We use Web-based tools to make our system workable and transparent for both biologists and computer scientists," he says. Miranker sees a crying need for this work. "Just look at any issue of Bioinformatics, one of the main journals in the field," he says. "Every discovery--even if it seems like a simple database search--calls forth a new computer program, or at least ad-hoc scripting and custom parameterization of existing tools."

The rather anarchic state of bioinformatics does not result from inadequacies of biologists or computer scientists, Miranker says, but rather directly from the multidimensional nature of biological data. Programs written to analyze certain data cannot cope when the biological context shifts or the amount of data grows hyper-exponentially--and this growth has been happening all across the field of computational biology.

"Some life science data--especially sequence databases for genes and proteins--are growing much faster than the growth in computer processor speeds," Miranker says, "which means that the performance of any exhaustive-search algorithm applied to ever-larger volumes of data is guaranteed to degrade, even though the computers run faster."

MoBIoS: A Unifying Biological Information System
Sidebar: Metric Space

"Metric space" is a term belonging to mathematical set theory. The "space" is not empty but is rather a mathematical set of objects, such as a set of short sequences or subsequences in a sequence database. For every three objects x, y, and z, the following conditions are true:

  • The distance between any two objects x and y is either zero (in which case x is identical to y) or greater than zero. This is the condition of positivity.
  • The distance from x to y is the same as the distance from y to x. This is the condition of symmetry.
  • The distance from x to z added to the distance from z to y is greater than or equal to the distance from x to y. This is the triangle inequality.
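The three conditions can be checked by brute force on a small set of objects. The sketch below does so, using the Hamming distance (the number of positions at which two equal-length sequences differ) as an example measure. The choice of Hamming distance, the function names, and the toy sequences are assumptions made for illustration. The check allows equality in the triangle inequality, which occurs, for instance, when z coincides with x or y.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def is_metric(objects, dist) -> bool:
    """Check positivity, symmetry, and the triangle inequality on every triple."""
    for x, y, z in product(objects, repeat=3):
        if dist(x, y) < 0 or (dist(x, y) == 0) != (x == y):
            return False  # positivity fails
        if dist(x, y) != dist(y, x):
            return False  # symmetry fails
        if dist(x, z) + dist(z, y) < dist(x, y):
            return False  # triangle inequality fails
    return True


sequences = ["ACGT", "ACGA", "TCGA", "TTTT"]  # hypothetical toy data
print(is_metric(sequences, hamming))

# A squared difference is not a metric: from 0 to 2 directly costs 4,
# but going through 1 costs only 1 + 1.
print(is_metric([0, 1, 2], lambda a, b: (a - b) ** 2))
```

Passing such a check on a sample does not prove that a measure is a metric in general, but a single failing triple is enough to show that it is not.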

Intuitively, the triangle inequality says that if two objects are similar to a third object, they cannot be too dissimilar from one another. Algorithmically, the triangle inequality helps organize subsets of very similar data into clusters. If a new data element is sufficiently dissimilar to a given element in a cluster, then one can rule out similarity with the remaining elements in the cluster, without any additional similarity comparisons. This obviously saves computational time.
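The pruning rule can be sketched as follows. Here each cluster is stored with a center and a "covering radius" (the largest distance from the center to any member), which is one common way of organizing such clusters and is an assumption of this example rather than a description of MoBIoS. If the query lies farther from the center than the covering radius plus the search radius, the triangle inequality guarantees that no member can match, so the whole cluster is skipped after a single comparison. The names and toy data are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def build_cluster(center, members, dist):
    """Store a cluster together with its covering radius, computed once in advance."""
    return center, max(dist(center, m) for m in members), members


def range_search(query, clusters, radius, dist):
    """Return (members within radius of query, number of query comparisons made)."""
    matches, comparisons = [], 0
    for center, cover, members in clusters:
        comparisons += 1
        if dist(query, center) - cover > radius:
            continue  # triangle inequality: every member is farther than radius
        for member in members:
            comparisons += 1
            if dist(query, member) <= radius:
                matches.append(member)
    return matches, comparisons


clusters = [
    build_cluster("AAAAAAAA", ["AAAAAAAA", "AAAAAAAT", "AAAAAATT"], hamming),
    build_cluster("CCCCCCCC", ["CCCCCCCC", "CCCCCCCG", "CCCCCCGG"], hamming),
    build_cluster("GGGGGGGG", ["GGGGGGGG", "GGGGGGGA", "GGGGGGAA"], hamming),
]
matches, comparisons = range_search("AAAAAAAC", clusters, 1, hamming)
print(matches, comparisons)
```

In this toy case two of the three clusters are dismissed with one comparison each. The saving grows with the size of the clusters that can be ruled out.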

Some biological similarity measurements give high scores to similar sequences, but in metric space, similarity is expressed as a distance, here scaled to lie between zero and one, with zero (identity) being the most similar. This feature also saves computational time when the data to be compared or retrieved can be stored or indexed in metric space.

There are several categories of metric-space index structure, and the one used in Miranker's MoBIoS system is called a "multiple vantage-point tree structure" (MVP-tree), which partitions the data mathematically to enable the most rapid searching. The calculation of vantage points, done to accord with the memory and I/O characteristics of the computer or computers on which MoBIoS resides, again increases the efficiency of data mining.
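To give a feel for how a vantage point partitions data, here is a deliberately simplified sketch: a tree with a single vantage point per node that splits the remaining objects into those nearer than the median distance and those at or beyond it. A true MVP-tree uses several vantage points per node and is tuned to memory and I/O, none of which is modeled here. The class, the function names, the random choice of vantage points, and the use of Hamming distance are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Optional


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


@dataclass
class Node:
    vantage: str
    mu: int = 0                       # median distance from the vantage point
    inside: Optional["Node"] = None   # objects with distance < mu
    outside: Optional["Node"] = None  # objects with distance >= mu


def build(points, dist, rng):
    """Recursively partition points around randomly chosen vantage points."""
    if not points:
        return None
    rest = list(points)
    vantage = rest.pop(rng.randrange(len(rest)))
    if not rest:
        return Node(vantage)
    dists = [dist(vantage, p) for p in rest]
    mu = sorted(dists)[len(dists) // 2]
    inside = [p for p, d in zip(rest, dists) if d < mu]
    outside = [p for p, d in zip(rest, dists) if d >= mu]
    return Node(vantage, mu, build(inside, dist, rng), build(outside, dist, rng))


def range_search(node, query, radius, dist, found=None):
    """Collect every stored object within radius of query."""
    found = [] if found is None else found
    if node is None:
        return found
    d = dist(query, node.vantage)
    if d <= radius:
        found.append(node.vantage)
    # The triangle inequality tells us which partitions can hold a match.
    if d - radius < node.mu:
        range_search(node.inside, query, radius, dist, found)
    if d + radius >= node.mu:
        range_search(node.outside, query, radius, dist, found)
    return found


rng = random.Random(0)
seqs = ["".join(rng.choice("ACGT") for _ in range(12)) for _ in range(200)]
tree = build(seqs, hamming, rng)
query = seqs[0]
hits = range_search(tree, query, 4, hamming)
# The tree search returns the same objects as an exhaustive scan.
print(sorted(hits) == sorted(s for s in seqs if hamming(query, s) <= 4))
```

Whenever only one of the two conditions holds at a node, an entire partition is skipped without comparing the query to any object in it.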

What Miranker and his group have been developing is a database management system named MoBIoS (pronounced mobius), for Molecular Biological Information System. It relies chiefly on a methodology called metric-space indexing (see sidebar). Some researchers have applied metric-space indexing for finding images in image databases, but it has yet to be applied properly to the main types of biological data. "That is a pity," Miranker says, "because molecular biological data are related not only in their chemical structures but also in their temporal, that is, evolutionary distance from one another. They are thus inherently multidimensional, a property that can be captured by metric-space indexing. Evolution dictates the inherent clustering of similar data in metric space."

Most of the large, public biological databases (e.g., GenBank and TreeBase, to name just two) are not contained within proper database management systems at all. Instead, they consist of relational database tables plus programs or utilities (usually with a common look and feel) that perform searches.

A good example is sequence data. Nothing is growing as fast as the sequence data for DNAs (with only a four-letter alphabet) from the genomes of hundreds of organisms, their various corresponding RNAs (also a four-letter alphabet), and their product proteins (a much more complex 20-letter alphabet). How do biologists manipulate these data?

The well-known local alignment tool BLAST, for example, finds homologous (similar) sequences by taking a query sequence and doing a brute-force, pairwise comparison with each stored sequence, then calculating a likeness score. The higher the score, the better a stored sequence matches the query sequence. The many variations on this code are more sophisticated, often tailored to find specific kinds or lengths of sequence, but the underlying process is still an exhaustive search of all the data.
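The exhaustive pattern described here can be sketched in a few lines. This is not BLAST, which has its own algorithms and scoring. It is a toy scan whose "likeness score" is simply the largest number of identical letters found when the query is slid, without gaps, along a stored sequence. The scoring rule, the function names, and the three-entry database are hypothetical.

```python
def likeness(query: str, stored: str) -> int:
    """Most identical letters over all ungapped placements of query within stored."""
    best = 0
    for offset in range(len(stored) - len(query) + 1):
        window = stored[offset:offset + len(query)]
        best = max(best, sum(q == s for q, s in zip(query, window)))
    return best


def exhaustive_search(query, database):
    """Score the query against every stored sequence; highest score first."""
    scored = [(likeness(query, seq), name) for name, seq in database.items()]
    return sorted(scored, reverse=True)


database = {  # hypothetical toy database
    "seq1": "TTGACGTACGTT",
    "seq2": "GGGGGGGGGGGG",
    "seq3": "AACGTTCGTAAA",
}
for score, name in exhaustive_search("ACGTACG", database):
    print(name, score)
```

Every stored sequence is scored on every query, so the work grows in direct proportion to the size of the database.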

From the description of metric-space indexing in the sidebar, it can be seen that any two objects in a metric-indexed set score closer to zero the more similar they are. Searching is simpler when the most similar data cluster together. Imagine that the data are points within a sphere. Choose a point, and its nearest neighbors will be located within some chosen radial distance from it. This clustering alone accounts for a vast increase in search speed in Miranker's system, and, indeed, some other computer scientists have experimented with metric indexing for various kinds of data.

"What is different about MoBIoS," Miranker says, "is that it is a database management system that can accommodate many data types so long as we take care to represent true biological relationships--evolutionary distances among sequences, for example--via the metric-space indexing. Our data structures and query language are based on the semantics of biology and are thus logical and scalable. MoBIoS can accommodate biology's data growth."

To test the performance of MoBIoS as a data management system against other algorithms, Miranker needed a computer with a large, shared memory. "Our initial tests were performed on Milagros, the Sun 6800 used by the TACC Scientific Visualization group," Miranker says, "and now we are moving our studies to Maverick, TACC's new Sun Fire E25K with 512 gigabytes of shared memory, which is very suitable for our purposes."

In sum, then, MoBIoS is a specialized, next-generation database management system targeting life science data. Its storage manager depends on metric-space indexing; it uses object-relational models of complex biological data types; and the group extended SQL (Structured Query Language, the standard query language for relational databases) to encompass biological data types. "Just as geographic information systems have been enabled by spatial databases," Miranker says, "we argue that biological information systems will be enabled by metric-space databases."

A Test Case

One of the talents of MoBIoS is the ability to perform "whole genome joins" easily and to search rapidly on the results. Professor Randy Linder and his students Anneke Padolina and Ruth Timme in the UT Austin Department of Integrative Biology are interested in the evolutionary history of flowering plants, which is difficult to reconstruct because of hybridization and horizontal gene transfers. Their problem formed a test case for MoBIoS.

If two genes in two plant genomes have evolved from a common ancestor by hybrid speciation, they are likely to be found in highly conserved regions of the DNA of each genome. In the lab, reconstruction of hybrid speciation requires the comparison of multiple, independent, orthologous (diverging after speciation) DNA sequences to determine hybrid parentage.

To find such sequences, Linder, Padolina, and Timme wanted to take advantage of the fact that pairs of primers (short sequences that tag the ends of candidate ortholog sequences) for the polymerase chain reaction (which amplifies the DNA sequence between the primers) could uniquely identify the potential orthologous genes if the primer sequences could be found in each genome. To search for highly conserved primer-length regions between two genomes, it seemed a necessity to cross-compare the entire genomes themselves.
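A much-simplified version of this cross-comparison is a "join" of two sequences on the short words they share. The sketch below indexes every substring of length k in each sequence and reports the words present in both, with their positions. It finds exact matches only, ignores the opposite DNA strand, and uses an ordinary hash table rather than metric-space indexing, so it illustrates the task and not the MoBIoS method. The function names and toy sequences are hypothetical, and k is 6 here only to keep the example small.

```python
def kmers(sequence: str, k: int) -> dict:
    """Map each length-k substring to the list of positions where it starts."""
    positions = {}
    for i in range(len(sequence) - k + 1):
        positions.setdefault(sequence[i:i + k], []).append(i)
    return positions


def shared_kmers(genome_a: str, genome_b: str, k: int) -> dict:
    """Join two sequences on exactly matching length-k words."""
    index_a, index_b = kmers(genome_a, k), kmers(genome_b, k)
    return {word: (index_a[word], index_b[word])
            for word in index_a.keys() & index_b.keys()}


genome_a = "TTTACGTGCAAAGGG"  # hypothetical toy "genomes"
genome_b = "CCACGTGCATTCC"
for word, (pos_a, pos_b) in sorted(shared_kmers(genome_a, genome_b, 6).items()):
    print(word, pos_a, pos_b)
```

Candidate primer pairs would then be two shared words that lie a suitable distance apart in both sequences, a filtering step left out of this sketch.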

Close-up screenshot of multi-colored text
A portion of the sequence alignment seen in full above, centered on the letters that denote the active site of the CPK protein. Since all the variants are very much alike here, there are no gaps in this portion of the alignment.

Linder and students thus had a problem that challenged the speed with which MoBIoS could sort through entire genomes once the sequences were stored via metric-space indexing. Working with Miranker and students, they were able to find conserved primer pair candidates between two recently sequenced flowering-plant genomes, Arabidopsis thaliana and rice. (Arabidopsis is a small plant in the mustard family whose qualities have made it an exemplary laboratory plant; hence it was among the first organisms sequenced, about four years ago. Rice, which has a larger genome, is of great economic importance, and several research groups completed its sequence shortly after that of Arabidopsis. Both genomes are available in public databases.)

Using MoBIoS, Linder and Miranker joined and searched the two genomes. They found a large number (about 13,000) of 18-nucleotide-long primer pairs. In lab experiments based on the findings, the researchers found orthologous regions in other flowering plants (a sunflower and six orchid species).

"Dan Miranker's database system was essential in our search for conserved regions that might serve as primers for PCR," Linder says. "We needed to be able to quickly search two complete plant genomes (rice and Arabidopsis) and filter our results in biologically relevant ways. MoBIoS, with its extended SQL, made it all possible." Linder reported the results at the Intelligent Systems for Molecular Biology annual meeting in August 2004, in Scotland, and Linder's and Miranker's groups published them in the Proceedings of the meeting.

Future Plans

Miranker has conducted tests of the MoBIoS data management system on data from mass spectrometry of proteins, and the group has also been working with the TACC Data and Information Systems group on a number of projects. "We expect this collaboration to bear fruit in the coming year," Miranker says, "in line with TACC plans to expand the role of computation in biological research."

Miranker notes that the great generality of the system renders it, philosophically speaking, a kind of metasystem that can reduce the anarchy of bioinformatics. "What we are developing here," he says, "is essentially a body of software offering an integrated set of services for the management of biological data." Those interested in the details of MoBIoS and metric-space indexing may contact Miranker or visit the group's web site.

--Merry Maisel

Research Feature - December 23, 2004