Meeting the big data needs of life sciences computing
George Vacek discusses why heterogeneous computing in the cloud will become a necessity in life sciences
The combination of heterogeneous computing and cloud computing is emerging as a powerful new option for researchers who require high-performance computing (HPC). Neither cloud computing nor the use of hybrid computing architectures is completely new, of course, but the rise of big data as a defining feature of modern life sciences and the proliferation of vastly differing applications to mine the data have dramatically changed the landscape of computing requirements.
Heterogeneous cloud computing offers the potential to flexibly shift from one HPC architecture to another in a secure public or private cloud environment. As such, it meets two critical needs for life sciences computing: crunching more data faster, and increasing access to accelerated performance. As data sets have ballooned, HPC approaches have evolved and diversified to better match specific problems with their most effective architecture. Indeed, some problems (for example, de novo metagenome assembly) are essentially intractable unless tackled with special-purpose solutions. Today, no single approach is optimal for all analyses. Heterogeneous computing embodies the use of multiple approaches to computational processing (CPUs, GPUs, FPGAs, etc.) to achieve superior throughput for each big data workload.
While the scope and complexity of HPC resources have grown, the ability of research groups to identify, afford, and support them has diminished. Budget constraints are limiting access to necessary compute resources at the very time when the explosive growth in life sciences data makes access increasingly desirable. HPC-oriented clouds supporting the latest heterogeneous architectures can provide even small research groups with affordable access to diverse compute resources.
Heterogeneous computing has become a necessity in life sciences, where the output from high-data-rate instruments, including next-generation sequencing (NGS) machines, represents a data tipping point. This data deluge has outpaced even the steady performance-doubling of Moore’s Law.
Recently, systems based on field-programmable gate arrays (FPGAs) have been gaining momentum throughout genomics and life sciences. Programmable ‘on the fly’, FPGAs are a way of achieving hardware-based, application-specific performance without the time and cost of developing application-specific integrated circuits (ASICs). FPGAs work well on many bioinformatics applications – for example, those that perform searching and alignment. Such applications rely on many simple, independent operations and are thus highly parallelisable.
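To make the point about independent operations concrete, here is a minimal Python sketch of an ungapped search for a short query in a reference sequence. It is an illustration only, not how any particular FPGA system or alignment tool works: the function names `mismatches` and `best_match` and the sample sequences are hypothetical, and the thread pool merely stands in for parallel hardware (it will not give a real speed-up in standard Python).

```python
from concurrent.futures import ThreadPoolExecutor


def mismatches(window: str, query: str) -> int:
    """Count positions where window and query differ (Hamming distance)."""
    return sum(a != b for a, b in zip(window, query))


def best_match(reference: str, query: str, workers: int = 4):
    """Return (offset, mismatch_count) for the best ungapped placement.

    Each window is scored without reference to any other window, which is
    what makes this kind of search easy to spread across many compute units.
    """
    k = len(query)
    if k == 0 or k > len(reference):
        raise ValueError("query must be non-empty and no longer than reference")

    # One candidate window per possible offset in the reference.
    windows = [reference[i:i + k] for i in range(len(reference) - k + 1)]

    # Score all windows; the order of evaluation does not matter.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda w: mismatches(w, query), windows))

    # Pick the lowest-scoring offset (first one wins on ties).
    offset = min(range(len(scores)), key=scores.__getitem__)
    return offset, scores[offset]


if __name__ == "__main__":
    print(best_match("ACGTTGCAACGT", "TGCA"))
```

Because no window's score depends on another's, the scoring step could in principle be handed to as many parallel units as there are windows, which is the property that suits this class of problem to FPGAs.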
To date, the vast majority of bioinformatics and healthcare applications have been run on standard clusters. That model is beginning to change as research organisations hit technical and financial roadblocks that prevent them from obtaining sufficient HPC resources to analyse all the output of high-data-rate experimental instruments. An example would be sequencing centres dealing with large data sets coming from NGS. This big data is forcing changes in attitudes and driving a need for faster, more power-efficient computing.
A confounding aspect in life sciences and healthcare is that biologists, physicians, and other users are usually not IT or HPC experts. For many, it is challenging enough to choose the best application for a given problem, let alone determine which HPC architecture would run it most efficiently.
Another concern with special-purpose HPC architectures is the need to adapt existing software or develop new software to take advantage of the specific approach, which can consume both time and resources. Increasingly, systems makers are tackling this problem and working to ensure their application coverage is attractive. User groups are also springing up around particular architectures, developing their own accelerated applications and making them available to others. The emergence of more high-level tools for FPGAs and GPUs is also helping to speed and simplify applications development.
Moving heterogeneous HPC assets into a cloud computing environment is a natural step. It provides the widely discussed benefits of cloud computing such as lower costs and rapid scalability, and magnifies them in cases where heterogeneous HPC resources entail greater cost, integration, and management challenges than standard cluster-based resources. A few of cloud computing’s substantial benefits include:
• Pay as you go. With cost-containment an increasing priority, many research organisations are focusing funds on core competencies, preferring to outsource where practical. Cloud environments make it possible to rapidly scale jobs up (or down) as needed, and heterogeneous HPC cloud charges typically vary based on which resources are used.
• Different budget. It’s often easier to tap variable operations budgets than to go through a lengthy approval process for scarce capital equipment funds. This also holds for grants where funds can be used for analysis of the data generated in an experiment, including renting the necessary compute cycles, but not for the acquisition of servers.
• Reduced IT support. Many computer administration costs (and worries) are shifted to the services provider, rather than burdening the principal investigator who is leading the research project.
• Technology upgrades. Pushed by competition and customers, cloud providers can be expected to be earlier adopters of new hardware and software technology. This enables the user community to benefit from the latest advances without undertaking technology refreshes every couple of years.
• Public or private. Cloud providers increasingly offer secure public clouds, used by many clients, or private clouds firewalled off and dedicated 24/7 to a single organisation.
Heterogeneous HPC cloud computing is quickly maturing. Advances in virtualisation technology and the optimisation of key algorithms on high-performance hardware are important enablers of cloud-based, heterogeneous computing. While challenges remain, heterogeneous HPC in the cloud is demonstrating the potential to broadly enable researchers and clinicians.
George Vacek, Ph.D., is director of Convey Computer’s Life Sciences business unit