ANALYSIS & OPINION

Supercomputing transforms study of evolutionary relationships

11 April 2011



A supercomputing resource created by researchers at the San Diego Supercomputer Center (SDSC) at the University of California, San Diego, USA, is allowing scientists to study evolutionary relationships among large populations of living things in significantly less time – and without the need to understand how to operate large, complex computer systems.

The new resource, called the CIPRES (CyberInfrastructure for Phylogenetic RESearch) Gateway, is an internet portal that allows scientists anywhere in the world to upload their data via a standard Web browser and perform phylogenetic analyses. The most time-consuming analyses use supercomputers, such as SDSC’s new Trestles system, that are part of the National Science Foundation’s TeraGrid, a collection of high-performance computing resources dedicated to academic research.

‘In addition to answering the age-old questions of how all living things are related to each other, understanding evolutionary relationships has some very important practical benefits,’ said Mark Miller, principal investigator in SDSC’s Research, Education and Development group, and leader of the CIPRES Gateway project. ‘For example, knowing the evolutionary relationships among a group of viruses or bacteria can help doctors understand where an infection came from, effectively treat patients who are infected, and work to contain the spread of disease during an outbreak.’

Furthermore, understanding how individual species adapt for survival in a specific geographic location can help scientists manage a species for long-term survival in that location, or engineer crops for higher productivity there. Evolutionary relationships are uncovered by comparing DNA sequences from the individuals under study. Just as a single DNA sequence can be used to identify a criminal with a very high degree of accuracy, a group of DNA sequences can be used to determine, with great precision, how closely related the members of any group of living things are.

‘DNA sequences from individuals can be prepared so quickly and cheaply now, we can understand evolutionary relationships more accurately than ever before,’ according to Miller. ‘The problem is, the number of computations required grows quickly as the amount of data grows. There are only three possible relationships between any four individuals, but there are more than two million different relationships between 10 individuals. A computer that could analyse a million trees per second would require about 20 billion years to test all the possible relationships for just 22 individuals!’
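The growth Miller describes can be reproduced with a short calculation. The sketch below assumes the figures refer to unrooted, fully bifurcating trees, for which the number of distinct trees on n labelled individuals is the product of the odd numbers up to 2n - 5; the article does not say which kind of tree is meant, and the function names are illustrative only. Under that assumption the four- and ten-individual counts match the quote, and the 22-individual case comes out on the order of ten billion years at a million trees per second, the same order of magnitude as the quoted figure (the exact number depends on which kinds of trees are counted).

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year, used only for a rough estimate


def count_unrooted_trees(n_individuals: int) -> int:
    """Count unrooted, fully bifurcating trees on n labelled tips: (2n - 5)!!.

    Assumption: this is the tree type behind the figures quoted in the article.
    """
    if n_individuals < 3:
        raise ValueError("at least three individuals are needed")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for odd in range(3, 2 * n_individuals - 4, 2):
        count *= odd
    return count


def years_to_test_all(n_individuals: int, trees_per_second: float = 1e6) -> float:
    """Rough time, in years, to evaluate every tree at a fixed rate."""
    seconds = count_unrooted_trees(n_individuals) / trees_per_second
    return seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(count_unrooted_trees(4))    # 3
    print(count_unrooted_trees(10))   # 2027025
    print(f"{years_to_test_all(22):.2e} years")
```

Because exhaustive enumeration is out of reach even at this scale, phylogenetics programs rely on heuristic searches over tree space rather than testing every tree.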

Solving this problem is where the CIPRES Gateway and TeraGrid supercomputers come in. The power of supercomputers comes from parallel computing, in which large analyses are broken into smaller pieces that are run simultaneously on many processor cores. Under the TeraGrid’s Advanced User Support program, Wayne Pfeiffer, a scientist at SDSC, helped improve the parallel performance of RAxML and MrBayes, two widely used phylogenetics codes.

‘Most RAxML analyses submitted to the CIPRES Gateway now run on 60 cores of Trestles,’ said Pfeiffer. ‘With a typical speedup over a single core of about 30, this means that analyses that would require a month on a laptop can be completed in a day via the gateway.’
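Pfeiffer's arithmetic can be checked in a few lines. The sketch below takes a month to be 30 days and uses the standard definitions of speedup (serial time divided by parallel time) and parallel efficiency (speedup divided by core count); the function names are illustrative and not part of RAxML or the gateway.

```python
def parallel_runtime_days(serial_days: float, speedup: float) -> float:
    """Wall-clock time on the parallel machine, given the single-core time."""
    return serial_days / speedup


def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup / cores


if __name__ == "__main__":
    # Figures from the quote: 60 cores, typical speedup of about 30.
    print(parallel_runtime_days(serial_days=30, speedup=30))  # 1.0
    print(parallel_efficiency(speedup=30, cores=60))          # 0.5
```

A speedup of 30 on 60 cores corresponds to an efficiency of about 50 per cent, which is why a month-long single-core run shrinks to roughly a day rather than half a day.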
