HPC facility
Jack Wells, director of science for the Oak Ridge Leadership Computing Facility
'The drug discovery community is a new one for us, but we have some indications that it will become a growing user base. ORNL is a production computing facility, but we try to be at the leading edge of what’s possible in terms of size and capability. In 2009 we had the world’s fastest supercomputer on the Top500 list, Jaguar, and in August 2012 we began to upgrade the system to Titan. We used the same cabinets and, in order to get the 10x improvement in peak computational capability, we went with GPUs. Although this has meant a marginal increase of 15 per cent in our electricity consumption, it is drawing in these new communities as they are already exploring general-purpose GPU computing (GPGPU) on desktop machines, workstations or clusters.
'Pharmaceuticals is an aggressive industry that faces big problems. Getting potential drugs through the three-stage review process isn’t easy; in fact, most fail in the third round of testing, after considerable amounts of time, effort and money have already been invested. If the cause of that failure, such as some unexpected interaction, could have been identified earlier, that investment could have been saved. This is one clear driver for high-performance computing resources being used in drug discovery. But there are a lot of changes occurring at the high end of computing right now, and that is causing some users to be aggressive and others to be hesitant in terms of writing new software. Almost all our applications have implicit parallelism, but our codes are not always written to express it, and that’s the test.'
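What Wells means by expressing parallelism can be sketched in a few lines of plain Python (not ORNL code; the scoring function and ligand list are hypothetical stand-ins). The serial loop below performs completely independent work, but nothing in how it is written lets a runtime exploit that, whereas the pooled version makes the independence explicit so it can be spread across cores or, with a GPU framework, across accelerators.

```python
from multiprocessing import Pool

def score_ligand(ligand):
    # Hypothetical stand-in for an expensive, independent docking/scoring calculation.
    return sum(ord(c) for c in ligand) % 97

def main():
    ligands = ['LIG%04d' % i for i in range(10000)]

    # Implicit parallelism: every iteration is independent, but the code is
    # written serially, so nothing can exploit that independence.
    serial_scores = [score_ligand(l) for l in ligands]

    # The same work with the parallelism expressed explicitly.
    with Pool() as pool:
        parallel_scores = pool.map(score_ligand, ligands)

    assert serial_scores == parallel_scores

if __name__ == '__main__':
    main()
```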
GPUs
Sumit Gupta, general manager of the Tesla accelerated computing business unit at Nvidia
'GPUGRID.net is a distributed computing platform that was born out of the realisation that, as millions of PCs all over the world spend much of their time idling, these cycles could be used to further biomedical research. The challenge is that when conducting molecular dynamics simulations, these large jobs must first be broken down into smaller pieces before they are distributed, the data received and the results aggregated. The issue is that work given to any one PC may not be completed, as someone may turn it off or the network may go down. That’s the difficulty with distributed systems, but the genius of the software is that you can actually get useful work out of them. GPU accelerators play a significant part: instead of being able to run one simulation per day on a single PC, researchers can run hundreds simultaneously – an order-of-magnitude improvement that shortens the time to discovery.
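The work-unit model Gupta describes can be illustrated with a toy scheduler (a Python sketch, not GPUGRID’s actual software; the unit count and failure rate are assumed for illustration): a long simulation is split into pieces, each piece is handed to a volunteer host, anything not returned is simply reissued, and the results are aggregated once every unit is accounted for.

```python
import random

WORK_UNITS = 20      # pieces the long simulation is split into (assumed)
FAILURE_RATE = 0.3   # chance a volunteer PC never returns its unit (assumed)

def run_unit(unit_id):
    # Stand-in for a short molecular dynamics segment computed on a volunteer GPU.
    return {'unit': unit_id, 'result': unit_id * unit_id}

def dispatch(unit_id):
    # A volunteer may switch off or lose its connection; model that by
    # sometimes returning nothing, so the unit has to be reissued.
    if random.random() < FAILURE_RATE:
        return None
    return run_unit(unit_id)

def main():
    pending = set(range(WORK_UNITS))
    completed = {}
    rounds = 0
    while pending:
        rounds += 1
        for unit_id in list(pending):
            reply = dispatch(unit_id)
            if reply is not None:
                completed[reply['unit']] = reply['result']
                pending.discard(unit_id)
    # Aggregate the partial results once every unit has come back.
    total = sum(completed.values())
    print(f'aggregated {len(completed)} units in {rounds} dispatch rounds, total={total}')

if __name__ == '__main__':
    main()
```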
'I would say that four discrete steps in the life sciences pipeline make use of GPU accelerators: gene sequencing; sequence analysis; molecular modelling; and diagnostic imaging. In fact, there are a number of gene sequencing machines that now contain GPUs within them. Gene sequencing, sequence analysis and molecular modelling are emerging scientific disciplines within pharmaceuticals, and nearly all large pharma companies have departments focused on computational methods.
'Many companies still focus on theory and experimentation, however, because there are a lot of challenges associated with the use of computational methods. One reason why simulation is not more pervasive in the industry is that it is not yet detailed enough. As GPUs reduce the amount of time it takes to run a simulation, more complex and accurate models can be developed, and I believe this will lead to a wider adoption of computational methods within the pharma pipeline. I would say that the use of computer simulations has been held back in the pharma world because they simply haven’t been fast enough. GPUs are changing this.'
Research
Dr Gianni De Fabritiis, group leader at the Computational Biophysics Laboratory of the Research Programme on Biomedical Informatics (GRIB) within the Barcelona Biomedical Research Park
'The ability we have to run complex molecular simulations is largely due to GPUs; before them, computers were far too slow to be of use. The work we are currently conducting into the maturation of the AIDS virus is exactly the same as we would run on a supercomputer, but by using a distributed computing platform, GPUGRID.net, we are able to achieve around 50 or 60 microseconds of aggregate data per day for a system of roughly 100,000 atoms. The level of detail we are getting is unprecedented, and it provides us with a new perspective as we are able to study ligand binding in terms of kinetics.
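As a rough reading of those figures (the per-GPU throughput below is an illustrative assumption, not something De Fabritiis states): if a single GPU of the period produced on the order of 100 nanoseconds of trajectory per day for a system of that size, 50-60 microseconds of aggregate sampling per day would correspond to several hundred GPUs contributing in parallel.

```python
# Back-of-envelope estimate; the single-GPU figure is assumed, not quoted.
aggregate_us_per_day = 55.0      # midpoint of the quoted 50-60 microseconds/day
single_gpu_ns_per_day = 100.0    # assumed per-GPU throughput for ~100,000 atoms

aggregate_ns_per_day = aggregate_us_per_day * 1000.0
equivalent_gpus = aggregate_ns_per_day / single_gpu_ns_per_day
print(f'roughly {equivalent_gpus:.0f} GPU-equivalents running concurrently')
```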
'The program we use for molecular simulation, ACEMD, is designed to run on GPUs. Acellera developed it in 2008 and I would say that it’s the fastest of its type. We started to create software in-house and built the distributed system because we needed more power; we then developed a cloud system and have gone so far as to design and manufacture GPU cases. The advantages of this approach are that we have full control over the solution and can guarantee that any patch will work in future generations. The disadvantage is that we have had to undertake all the work ourselves, which takes time away from running simulations.
'Molecular dynamics’ impact has so far been limited because there is no systematic use of simulation within the industry. The tools have been around for a while, but only now, with GPUs, can they be used at a level that generates real interest. If simulation is adopted on a wider scale, its impact will be considerable, given that it will practically replace part of the pipeline. There are problems with this, however. We have 200TB of storage and are producing 30TB of data per year, and we think that in five years, at the current rate of growth, we will be close to 1PB per year. Even handling our analyses is incredibly difficult. Our best solution is to get the data out of the server, use standard scripting to filter it down to around 700GB, and then analyse that. Data is arguably the biggest challenge – one we have yet to solve.'
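A quick sanity check of those storage figures (our arithmetic, not De Fabritiis’s): going from 30TB per year to roughly 1PB per year over five years implies the data rate approximately doubling every year, as the short calculation below shows.

```python
current_tb_per_year = 30.0     # quoted current output
target_tb_per_year = 1000.0    # ~1PB, the quoted five-year expectation
years = 5

# Constant annual growth factor needed to reach the target.
growth_factor = (target_tb_per_year / current_tb_per_year) ** (1.0 / years)
print(f'implied annual growth factor: {growth_factor:.2f}x')   # ~2.0x

rate = current_tb_per_year
for year in range(1, years + 1):
    rate *= growth_factor
    print(f'year {year}: ~{rate:,.0f} TB/year')
```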
Storage
Jeff Denworth, VP of marketing at DataDirect Networks
'With the advent of high-throughput sequencing, there is now a tight alignment between analytics and HPC. The generation of huge amounts of genomic data means that pharmaceutical companies are evolving beyond their traditional file storage capabilities and are looking to scale-out NAS to accelerate their drug discovery pipeline. Storage costs are a necessary part of the strategy for these companies, but the challenge is scaling up capacity to support growing sequence analysis farms. Everyone wants to deploy HPC-style technologies because, once they get on that scalability curve, they can make business decisions without needing to make possibly disruptive IT decisions. Ultimately, the industry is moving away from a standard infrastructure and towards parallel file systems. From an IT perspective, there is a reconciliation that has to happen for these organisations to learn how to use these tools and how to scale them up effectively.
'In terms of where investments are being deployed, it’s no longer standard IT initiatives – budget cuts are forcing organisations to make smarter decisions on optimising and consolidating their infrastructures. New technologies, such as the Apache Hadoop software stack, are being accepted by industries sooner rather than later, and pharma is one of them: there, parallel computing and enterprise business intelligence are coming together for Big Data analytics.'
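For readers unfamiliar with the pattern Denworth refers to, the sketch below (plain Python rather than Hadoop itself, operating on a few hypothetical assay records) shows the map and reduce phases that the Hadoop stack distributes across a cluster: mappers emit key-value pairs from their slice of the data, a shuffle groups values by key, and reducers aggregate each group.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical assay records: (compound, measured activity).
records = [
    ('compound_A', 0.72), ('compound_B', 0.10),
    ('compound_A', 0.65), ('compound_C', 0.91),
    ('compound_B', 0.12),
]

def mapper(record):
    # Emit (key, value) pairs from one record.
    compound, activity = record
    yield compound, activity

def reducer(key, values):
    # Aggregate all values sharing a key: here, mean activity per compound.
    return key, sum(values) / len(values)

# 'Shuffle': group mapper output by key, as the framework would between phases.
grouped = defaultdict(list)
for key, value in chain.from_iterable(mapper(r) for r in records):
    grouped[key].append(value)

results = dict(reducer(k, v) for k, v in grouped.items())
print(results)   # prints the mean activity per compound
```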
Research
Jerome Baudry, assistant professor at the University of Tennessee and group leader at the UT/ORNL Center for Molecular Biophysics
'HPC has allowed us to scale up the complexity of the problems we can tackle, and in the world of drug discovery we are now able to calculate not only the behaviour of the drug candidate in a test tube, but how it behaves in a cell. While statistical approaches exist to predict the toxicity of drug candidates, we can now systematically reproduce on the computer what happens in the cell of the patient. In addition to aiding drug discovery, this reproduction facilitates the repurposing of drugs and enables us to find new applications for existing compounds. In a few years, when we have reached the next level of computing, we will be able to reproduce what happens not only within a cell but in the entire body of a patient, and within a group of patients. This computational power will be used to calculate what will happen, rather than simply extrapolating possibilities from what we observe in a test tube. Computing beyond what would actually happen in a test tube is a revolution in the way we approach drug discovery.
'The challenge is that in order to make efficient use of these resources, each discipline has to embrace the cultures of the others and essentially translate the foreign languages of science. Being able to explain the biology to a computer scientist or the computational difficulties to a biologist, for example, is fundamental if we are to catalyse the use of HPC. I’m a computational biophysicist, not a computational scientist, and I came to HPC because it was clear that it would allow us to make a quantum leap in the kind of scientific questions we are able to answer. I somewhat naively thought that when moving our work to supercomputers we would simply upload the program and run it, but of course that’s not what happened. It’s very important that the code is optimised to run on these incredible machines and to my surprise I found myself spending the first few years of my research optimising the program with computer scientists and mathematicians who could translate the equations into efficient computer programs. The fundamental science is here, the hardware is here, but we had to do a lot of work on the software to parallelise it efficiently.'
Interviews by Beth Harlen