A team of scientists is using the pattern-detecting power of supercomputers to find important relationships among genes that may be responsible for traits in plants. The impetus for the project came from Pat Schnable at Iowa State University, USA, and Dan Nettleton, a statistician at the same university. Their group is searching for pairs of genes involved in important traits and, while the project is quite new, it is already making strides in developing methodologies for statistical association studies.
‘It’s getting cheaper every day to find out every letter in the DNA of all of the genes for many plants,’ said Steve Welch, professor of agronomy at Kansas State University and a member of the project team. ‘But it leaves you with terabytes of information you have to sort through, not just to find the single genes that may be controlling traits, but also to look for the combinations. That’s the frontier right now: how do we start looking for combinations of genes?’
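The scale of that frontier follows from simple combinatorics: single-gene scans grow linearly with the number of genes, but pairwise scans grow with the square. A minimal illustration in C (the gene counts below are hypothetical round numbers, not figures from the project):

    #include <stdio.h>

    /* The number of gene pairs is n*(n-1)/2, so doubling the gene count
     * roughly quadruples the pairwise workload. */
    int main(void) {
        long long ns[] = {1000, 10000, 40000};   /* hypothetical gene counts */
        for (int i = 0; i < 3; i++) {
            long long n = ns[i];
            printf("%lld genes: %lld single-gene tests, %lld pairwise tests\n",
                   n, n, n * (n - 1) / 2);
        }
        return 0;
    }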
One of the main ways scientists ‘sort through’ large datasets, whether genetic information or exoplanet observations, is with high-performance computers: researchers translate their scientific problem into mathematical form that can be computed in parallel. The team developed an algorithm to find genetic associations, but early estimates suggested that completing even a simplified version of the problem would take 1,600 years with conventional hardware and software approaches.
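To see where an estimate like that comes from, consider a minimal serial sketch of a pairwise scan. The article does not describe the team’s actual association statistic, so pair_score() below is a trivial placeholder, and the problem sizes are invented for illustration:

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical problem sizes; the real study would be far larger. */
    enum { N_GENES = 2000, N_PLANTS = 100 };
    static double data[N_GENES][N_PLANTS];

    /* Placeholder for the team's (unpublished) association statistic:
     * here just a dot product of two genes' values across all plants. */
    static double pair_score(int i, int j) {
        double s = 0.0;
        for (int k = 0; k < N_PLANTS; k++)
            s += data[i][k] * data[j][k];
        return s;
    }

    int main(void) {
        /* synthetic input so the sketch runs standalone */
        srand(1);
        for (int i = 0; i < N_GENES; i++)
            for (int k = 0; k < N_PLANTS; k++)
                data[i][k] = (double)rand() / RAND_MAX;

        /* the expensive part: every pair (i, j) is scored exactly once */
        double best = -1.0;
        int bi = 0, bj = 1;
        for (int i = 0; i < N_GENES; i++)
            for (int j = i + 1; j < N_GENES; j++) {
                double s = pair_score(i, j);
                if (s > best) { best = s; bi = i; bj = j; }
            }
        printf("strongest pair: (%d, %d), score %.3f\n", bi, bj, best);
        return 0;
    }

Even this toy version makes N_GENES*(N_GENES-1)/2 calls to pair_score(); with realistic gene counts and a real statistic, the serial cost balloons toward estimates like the one above.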
Working with Lars Koesterke, a performance evaluation and optimisation expert at the Texas Advanced Computing Center (TACC), the team simplified the mathematics of the problem, converted the code from Python to a parallel implementation using MPI (the Message Passing Interface, a standard for communication between processes in parallel computing), and got it running on the Ranger supercomputer. According to Welch, Koesterke made the code run 3.2 million times faster, cutting the time to solution from 1,600 years to 4.5 hours while also increasing the number of iterations by an order of magnitude, which improved the accuracy of the studies.
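As a sanity check on those figures, 1,600 years is roughly 14 million hours, and dividing by 3.2 million gives about 4.4 hours, consistent with the quoted result. The team’s code itself is not shown in the article, but the usual shape of such an MPI port is to give each process (rank) a share of the outer loop and combine per-rank results at the end. The sketch below does this for the placeholder scan above; it is an assumption about structure, not the team’s implementation. Compile with mpicc and launch with mpirun:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    enum { N_GENES = 2000, N_PLANTS = 100 };   /* hypothetical sizes */
    static double data[N_GENES][N_PLANTS];

    static double pair_score(int i, int j) {
        double s = 0.0;
        for (int k = 0; k < N_PLANTS; k++)
            s += data[i][k] * data[j][k];      /* placeholder statistic */
        return s;
    }

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* every rank generates the same synthetic input (same seed) */
        srand(1);
        for (int i = 0; i < N_GENES; i++)
            for (int k = 0; k < N_PLANTS; k++)
                data[i][k] = (double)rand() / RAND_MAX;

        /* cyclic decomposition: rank r takes outer rows r, r+size, ... */
        struct { double score; int rank; } local = { -1.0, rank }, best;
        for (int i = rank; i < N_GENES; i += size)
            for (int j = i + 1; j < N_GENES; j++) {
                double s = pair_score(i, j);
                if (s > local.score) local.score = s;
            }

        /* combine the per-rank maxima into one global winner */
        MPI_Reduce(&local, &best, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0,
                   MPI_COMM_WORLD);
        if (rank == 0)
            printf("best score %.3f (found on rank %d)\n", best.score, best.rank);

        MPI_Finalize();
        return 0;
    }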
Treating software optimisation as an engineering problem, Koesterke created several variants of the kernel (the innermost, most performance-critical loop of the code), changing the logic in each to see how the ordering of operations affected speed. He made the code fast by eliminating unnecessary arithmetic and by keeping the working set small enough to reside in the Level 2 cache (a small, fast pocket of memory close to the processor).
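The real kernels are not published, but the two transformations named, removing redundant arithmetic and shrinking the working set to fit in L2, look roughly like this toy example (the sizes and the statistic are again assumptions):

    #include <stdio.h>

    enum { N_GENES = 1024, N_PLANTS = 128, TILE = 32 };
    static double g[N_GENES][N_PLANTS];

    /* before: an invariant product is recomputed on every iteration */
    static double score_naive(int i, int j, double scale) {
        double s = 0.0;
        for (int k = 0; k < N_PLANTS; k++)
            s += (g[i][k] * scale) * (g[j][k] * scale);  /* scale*scale, N_PLANTS times */
        return s;
    }

    /* after: the invariant is hoisted out of the hot loop */
    static double score_hoisted(int i, int j, double scale) {
        double s = 0.0;
        for (int k = 0; k < N_PLANTS; k++)
            s += g[i][k] * g[j][k];
        return s * scale * scale;                        /* computed once */
    }

    /* tiling: score pairs tile by tile, so the working set (two tiles of
     * rows, 2 * 32 * 128 * 8 bytes = 64 KB here) can stay resident in L2 */
    static double scan_tiled(double scale) {
        double best = -1.0;
        for (int ti = 0; ti < N_GENES; ti += TILE)
            for (int tj = ti; tj < N_GENES; tj += TILE)
                for (int i = ti; i < ti + TILE; i++)
                    for (int j = (tj == ti ? i + 1 : tj); j < tj + TILE; j++) {
                        double s = score_hoisted(i, j, scale);
                        if (s > best) best = s;
                    }
        return best;
    }

    int main(void) {
        for (int i = 0; i < N_GENES; i++)
            for (int k = 0; k < N_PLANTS; k++)
                g[i][k] = (double)((i + k) % 7) / 7.0;
        /* the two kernel variants agree; only their cost differs */
        printf("naive %.3f == hoisted %.3f\n",
               score_naive(0, 1, 0.5), score_hoisted(0, 1, 0.5));
        printf("best tiled score: %.3f\n", scan_tiled(0.5));
        return 0;
    }

Timing variants like score_naive and score_hoisted against one another is essentially the experiment described above, repeated across several candidate kernels.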
Speaking to Scientific Computing World, Koesterke said that if you improve the speed of a code by a factor of two, scientists simply wait half as long for a result, ‘but if you improve the speed of the code by a factor of 100,000, they are able to tackle questions that would have taken exponentially longer. And improving codes to this degree enables scientists to take their research in completely new directions that they would have never thought possible before.’
Steve Welch, who is serving as a scientist-in-residence at TACC during his sabbatical from Kansas State University, added that the team was able to produce a working code in a relatively short period of time. ‘At that point in time none of us working on the biology side were really familiar with the power of supercomputing. What was really astonishing to us was that when the proper language is used, and an expertise is brought to bear from a supercomputing code optimisation perspective, enormous speed-ups are achievable that expand the realm of the possible.’