Phil Butcher, the head of IT at The Wellcome Trust Sanger Institute, discusses how HPC is being used to tackle enormous amounts of genome data
The Wellcome Trust Sanger Institute is a large-scale genome research organisation that is centred around DNA sequencing, but focused on understanding the role of genetics in health and disease. Set up in 1993, one of the Institute's earliest scientific endeavours was our participation in the Human Genome Project and at the time we were hoping to sequence one sixth of a single human genome within 10 years.
This was a massive undertaking, but not only did we succeed, we in fact sequenced approximately one third within eight years. Sequencing techniques have changed and, instead of looking at single genomes, scientists are now planning to sequence in the order of 100,000 genomes in the next five years – half the time it took for the Human Genome Project to sequence the first genome. Advances in high-performance computing (HPC) and IT have played an enormous part in this, but as we try to progress further we are facing major challenges.
Moore's law has underpinned and aided the development of our methods of handling the massive amount of data associated with a single genome, but the speed of change in life sciences is certainly running far ahead of it now. I would go so far as to say that DNA sequencing techniques are advancing at a rate of 20 times that of Moore's law, which means that we are pushing the limits of the HPC solutions to achieve our goals. However, there is a lot of innovation and effort going into making those goals a reality.
In 2005 we had a storage capacity of 300 Tbytes and today we are running 12 Pbytes – a phenomenal growth in our HPC that resulted from the need to accommodate a huge increase in scientific output. That expansion has included the deployment of large-scale clusters and we currently run 14,500 cores. We also use Platform Computing’s LSF tools to distribute workflows across these platforms, which in turn enables the science.
The main problem we have is not just the fact that we need to keep scaling up and finding new solutions, it's that the industry itself is evolving. There has been a significant consolidation of vendors over the past few years, and this has led to a big reduction in our choices. Looking at the hardware, we have to work very closely with organisations that are willing and have the ability to provide solutions that fit our business in life sciences. It sounds obvious, but it's harder than you might think!
For example, we use a particular network attached storage (NAS) product that has been bought by two separate organisations and is now a much smaller part of a big storage group. Because the product is on a much smaller scale compared to the rest of the products they offer, these companies will no doubt stop providing it. Consolidation can be a good thing for the industry, but it can also mean that we, as users, can lose products that are useful to us. And there isn't always a lot of choice to begin with at the scale of computing our scientists require.
We work very closely with the scientists here at the Sanger Institute to provide them with solutions, and view our relationship as a partnership rather than a service. We are here to recommend hardware that will address their needs, but at the same time they have to ensure they develop software that can take advantage of those platforms. We attempt, as much as possible, to build homogeneous systems, but the number of different projects we have at the Sanger Institute at any given time has led to a lot of complexity. During the Human Genome Project, for example, we were predominantly 64-bit Alpha and single-vendor based, and therefore the solutions we ran were much easier to manage. Now that we have more complex systems from a few vendors, it has become a very diverse HPC environment, albeit with a fairly defined architecture.
The scale of our IT group has had to reflect these changes. When the Institute first opened we had a relatively small IT team, whereas we arguably have quite a large one now. Each individual area, such as the databases and infrastructure, has a dedicated team of five people, so we have ended up with small distinct groups of people who manage a diverse range of systems. And we expect to keep expanding those numbers to reflect the continual growth in sequence data output.
At the moment we have 12 Pbytes of storage, but we estimate that we will have more than 25 Pbytes within the next five years. The number of cores in our compute farms will also rise in order to cope with the analyses being done. This is of course possible, but the question becomes how do you build ever larger IT systems that can sustain that output? These are not one-off projects that are being conducted here and at other similar institutions, and so high-performance computing has become a more general part of life sciences. However, this is what we must provide in order to continue supporting and enabling scientific research and I do believe that we will evolve the solutions to keep pace, but it will be a major challenge.
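The storage figures quoted in this article can be put into rough perspective with a little arithmetic. What follows is a minimal Python sketch, not a description of any tooling used at the Institute. It assumes decimal units (1 Pbyte = 1,000 Tbytes) and treats the estimate of "more than 25 Pbytes" as an exact target, so the second result is a lower bound. The function and variable names are hypothetical.

```python
def growth_factor(start, end):
    """Return how many times larger `end` is than `start`."""
    return end / start


def annual_growth_rate(start, end, years):
    """Return the compound annual growth rate that takes `start` to `end` in `years`."""
    return (end / start) ** (1 / years) - 1


TB_PER_PB = 1000  # assumption: decimal units, 1 Pbyte = 1,000 Tbytes

storage_2005_tb = 300                   # capacity in 2005, from the article
storage_now_tb = 12 * TB_PER_PB         # capacity "today", from the article
storage_projected_tb = 25 * TB_PER_PB   # five-year estimate, taken as exact (assumption)

past_factor = growth_factor(storage_2005_tb, storage_now_tb)
future_rate = annual_growth_rate(storage_now_tb, storage_projected_tb, years=5)

print(f"Growth since 2005: {past_factor:.0f}x")
print(f"Implied minimum annual growth over the next five years: {future_rate:.1%}")
# Growth since 2005: 40x
# Implied minimum annual growth over the next five years: 15.8%
```

On these assumptions, capacity has grown 40-fold since 2005, while the five-year estimate implies compound growth of at least roughly 16 per cent a year.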