HPC in the cloud - is it real?
Optimising high-performance computing applications is all about understanding both the application and the target platform. HPC developers worry about memory bandwidth, data placement, cache behaviour and the floating point performance of the target processors and compute accelerators in order to deliver the very best performance. Cloud computing, on the other hand, is all about virtualisation, which hides details of the target architecture from the application. Cloud offers both resource usage and business model flexibility, which is good – but perhaps not for HPC applications, or is it?
The transition from grid to cloud
The term grid computing became popular in technical computing circles around the turn of the century to describe an on-demand model for the use of distributed compute resources – the word ‘grid’ being chosen as it is analogous to the way that the power grid delivers pervasive access to electricity. Grid approaches could be used at many levels. One characterisation, proposed by Wolfgang Gentzsch while director of grid computing at Sun Microsystems, comprised cluster grids, enterprise grids, and the global grid. Cluster grids are now simply called clusters, for which grid middleware is used to ensure effective resource management.
An enterprise grid has much in common with what is now called a private cloud, while public clouds and the global grid have many similarities. There are many definitions of both grid and cloud computing, and the exact terminology doesn’t really matter. The key points are that the emergence of cloud computing owed much to the work of the pioneers in grid computing, and that a major difference between grid and cloud is the cloud’s exploitation of virtualisation, although virtualisation is often omitted from HPC-specific cloud offerings.
HPC systems deliver insight to scientists and engineers through the simulation of everything from subatomic particles to stars and galaxies, for purposes as diverse as fighting disease, climate modelling, and designing complex products such as cars and aircraft. But few organisations can keep large HPC facilities busy full time, so many cannot justify buying their own personal supercomputer. This is where the cloud comes in. By accessing HPC facilities in the cloud an organisation can use (and pay for) the facility it needs, when it needs it – with no responsibility for paying for or managing the resource beyond the period when it is required by a project. The utility business model of cloud computing is very powerful, but can cloud deliver the same value for HPC that it does for outsourced email, web applications or business applications delivered through software-as-a-service (SaaS)?
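The utilisation argument can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name and every figure in it are hypothetical, and it ignores real-world factors such as staffing, data transfer and reserved-capacity pricing. It computes the fraction of the year an owned cluster would need to be busy before owning becomes cheaper than renting the same cores by the hour.

```python
HOURS_PER_YEAR = 8760


def break_even_utilisation(annual_owned_cost, cores, cloud_price_per_core_hour):
    """Fraction of the year an owned cluster must be busy to match cloud pricing."""
    # Cost of renting the same number of cores for every hour of the year.
    cost_if_rented_all_year = cores * HOURS_PER_YEAR * cloud_price_per_core_hour
    return annual_owned_cost / cost_if_rented_all_year


if __name__ == "__main__":
    # Hypothetical figures, chosen only to show the shape of the calculation.
    u = break_even_utilisation(
        annual_owned_cost=400_000, cores=1_000, cloud_price_per_core_hour=0.10
    )
    print(f"Owning pays off above {u:.0%} utilisation")
```

An organisation whose real utilisation sits well below the break-even figure is the kind of user for whom the pay-for-use model is most attractive.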
HPC cloud offerings can be described – using very broad brushstrokes – as falling into one of two camps. On the one hand are simple clusters, while on the other hand are highly tuned facilities, adding compute accelerators and high-performance interconnects to standard clusters. HPC applications deliver high performance through exploiting a high degree of parallelism. Different classes of application behave in different ways. Some, for example a Monte Carlo simulation of financial risk analysis, can compute many independent calculations before combining all of the results at the end of the run. Others, such as weather forecasting, require the regular exchange of data between the many parallel threads of computation. So some HPC applications can run effectively on standard cloud infrastructures, while others require specific HPC capabilities.
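The Monte Carlo case can be sketched in a few lines of Python. This is a minimal illustration rather than a real risk model: the loss model (the positive part of a standard normal draw), the function names and the batch sizes are all assumptions made for the example. The point is the communication pattern: every batch runs independently, and the only exchange of data is the final combination of results.

```python
import random
from concurrent.futures import ProcessPoolExecutor


def simulate_batch(seed, trials):
    """Sum the losses from one independent batch of trials."""
    rng = random.Random(seed)  # per-batch seed keeps batches independent and repeatable
    # Toy loss model (an assumption): the positive part of a standard normal draw.
    return sum(max(0.0, rng.gauss(0.0, 1.0)) for _ in range(trials))


def expected_loss(batches, trials_per_batch, map_fn=map):
    """Run every batch via map_fn, then combine the results once at the end."""
    totals = map_fn(simulate_batch, range(batches), [trials_per_batch] * batches)
    return sum(totals) / (batches * trials_per_batch)


if __name__ == "__main__":
    # Each batch runs in its own process; the only communication is the final sum.
    with ProcessPoolExecutor() as pool:
        print(f"Estimated expected loss: {expected_loss(8, 50_000, map_fn=pool.map):.4f}")
```

Because the batches never talk to each other, the speed of the network between workers barely matters, which is why this class of application tends to run well on standard cloud infrastructure. A weather code, by contrast, would need to exchange boundary data between workers at every time step.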
It’s the latency, stupid
Some HPC applications are highly scalable and can run efficiently in a standard cloud environment, while others are so dependent on fast communication between nodes that they need a dedicated cluster designed with HPC in mind. High latency (or indeed low bandwidth) of communications can limit performance in several ways. First, it can simply slow the application down. Second, it can limit scalability: latency that is acceptable when 12 nodes are communicating may become a limiting factor at hundreds of nodes. Finally, cloudbursting brings a latency penalty of its own. An application that runs effectively within a private or a public cloud may not work if it cloudbursts across a hybrid cloud, as the latency between a private and a public cloud is often orders of magnitude higher than for communications within a single cloud.
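A toy model shows why latency bites harder as the node count grows. The sketch below assumes that each time step splits a fixed amount of work evenly across the nodes and then synchronises them with a tree-style exchange that takes about log2(N) latency-bound rounds. The model, the function names and the latency figures are all illustrative assumptions, not measurements of any real system.

```python
import math


def step_time(nodes, work_s, latency_s):
    """Time for one step: work is split evenly, then a tree exchange synchronises nodes."""
    compute = work_s / nodes
    # Assumption: a tree-based exchange needs ceil(log2(nodes)) latency-bound rounds.
    rounds = math.ceil(math.log2(nodes)) if nodes > 1 else 0
    return compute + rounds * latency_s


def efficiency(nodes, work_s, latency_s):
    """Parallel efficiency: achieved speed-up as a fraction of the node count."""
    return step_time(1, work_s, latency_s) / (nodes * step_time(nodes, work_s, latency_s))


if __name__ == "__main__":
    work_s = 1.0  # hypothetical single-node compute time per step
    scenarios = [
        ("HPC interconnect", 2e-6),   # illustrative latencies only
        ("standard Ethernet", 1e-4),
        ("between clouds", 2e-2),
    ]
    for label, latency_s in scenarios:
        row = ", ".join(
            f"{n} nodes: {efficiency(n, work_s, latency_s):.0%}" for n in (12, 128, 1024)
        )
        print(f"{label:>18} | {row}")
```

In this model the compute share of each step shrinks as nodes are added while the communication share grows, so a latency that is negligible at 12 nodes comes to dominate the run time at around a thousand.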
Many cloud providers appreciate that some HPC applications are more demanding than mainstream applications, and provide special offerings to meet the needs of the HPC community, including non-virtualised solutions, HPC platform-as-a-service, HPC applications in a SaaS model, systems architected to meet the needs of HPC applications and a range of HPC services. This section highlights a cross section of these, but is far from complete as offerings are evolving all the time, and the flexibility of cloud means that new companies can appear out of nowhere and quickly deliver sophisticated solutions leveraging the cloud.
Advania is a Nordic IT company that hosts facilities for corporate clients worldwide. Among its offerings is an HPC cloud service for the academic community in Sweden, Denmark and Norway, which it delivers from its energy-efficient Thor data centre in Iceland.
The Amazon Elastic Compute Cloud (EC2) provides two types of cluster specially configured for HPC needs, cluster compute and cluster GPU instances, both of which use 10 Gbps Ethernet networks. The nodes in the GPU instances include Nvidia GPUs and support both CUDA and OpenCL development tools. Cluster configuration and management tools are also available from Adaptive Computing, Bright Computing, Cycle Computing, Intel, Platform Computing, StackIQ and Univa.
As part of its cloud computing offering, BT has a number of HPC options, including systems pre-loaded with the Open, Oracle or Univa version of Grid Engine and the BT Life Sciences HPC virtual machine template. This is supported by a growing library of tools that are often used to manage life sciences workloads.
Bull’s ‘extreme factory’ delivers a range of engineering applications in the cloud using a SaaS model. The applications include Altair Hyperworks, Ansys CFX and Fluent, EXA Powerflow, LS-DYNA, OpenFOAM, STAR-CCM+ and VPS2012 to support design simulation; BLAST, FASTA, GROMACS and NAMD in life sciences; and EnSight, ParaView, Tecplot 360 and VISIT for visualisation.
As a component of its cloud services, Colt offers a managed grid service, leveraging Tibco’s DataSynapse middleware, which can be used as a stand-alone facility, or as an extension to an in-house grid capability – i.e. in cloudbursting mode.
Cycle Computing helps maximise the use of HPC resources with CycleServer, and also exploit HPC in the Cloud with its CycleCloud offering, which automates the complex process of building a cluster in the Amazon cloud and handling the management of the workload. Using Condor, GridEngine or Torque clusters, Cycle has demonstrated delivering a 50,000-core cluster for HPC workloads on demand.
Dell views the cloud as being a commercial instantiation of grid computing and offers private and public clouds, as well as desktop and other applications as a service. The Dell HPC Cloud Bursting solution offers a leasing model for HPC capacity in 24-hour increments rather than strictly on demand.
As part of the Google Cloud Platform, the Google Compute Engine delivers infrastructure as a service (IaaS) well suited to the needs of some classes of HPC workload. Highly scalable applications are catered for, but latency sensitive applications require a higher-performance interconnect.
HP builds many of the components that others deploy in their data centres when delivering HPC in the cloud, including HPC servers and storage, cluster platforms, cluster software, data centre infrastructure (including its popular, energy-efficient PODs) and HPC Services and Support.
IBM has a wide range of platform and software offerings that support HPC in the cloud, with many of the software components coming through its acquisition of leading middleware supplier Platform Computing. Platform LSF is a widely used workload management tool for distributed HPC systems, while Platform HPC is an integrated management solution for HPC environments. Another key component is IBM’s GPFS parallel file system. IBM also has many hardware platform options and services to address a range of HPC in the Cloud needs, and the IBM SmartCloud offering can provide IaaS for HPC requirements.
Microsoft provides an integrated stack of HPC software tools including its Compute Cluster Pack, MPI and other libraries. Crucially for Azure Cloud users, there is an HPC version of its scheduler for compute intensive, parallel applications that also includes runtime support for MPI communications. Microsoft includes the capability to burst from your private HPC cluster to Azure if your demand exceeds available resources – all under the control of your scheduler configuration.
Cloud provider Nimbix has partnered with HPC system designer Convey to deliver Nimbix Accelerated Cloud Computing, which is SaaS for a wide range of applications including bioinformatics, CFD, rendering and animation, computational finance and geophysics and seismic processing.
Peer1 Hosting has two HPC Cloud offerings that deliver on-demand HPC. For longer-term use the Managed HPC Cloud service is preferred, while the Self-Service HPC Cloud option is great for meeting short-term requirements. Based in Toronto and in the UK, both services can leverage Nvidia Tesla GPUs.
Penguin Computing builds private clouds for in-house use by its clients, and also delivers HPC facilities on demand via its Penguin On Demand (POD) service. POD systems can be configured to suit specific needs, with options including Nvidia GPUs, InfiniBand interconnects and the Lustre or PanFS parallel file systems. Penguin also has a Hybrid cloud offering that enables POD to augment a user’s internal HPC capacity.
Rackspace offers private, public and hybrid cloud solutions utilising the OpenStack scalable cloud platform. The company’s HPC offering is a cluster supporting the MPI communication library for parallel applications. OpenStack is proving to be popular with academics, as they can not only use it to run highly parallel workloads, but they can also contribute to the open-source community.
T Systems has a wide range of mainstream hosted and cloud services, including a shared grid service, and an HPC-cloud offering that provides dedicated resources on an on-reservation basis. T Systems participates in the European Commission-funded Helix Nebula project.
Cloud offerings provided for HPC users are evolving rapidly, as the previous section demonstrates. But is the user community also evolving in order to be able to exploit these offerings? Two initiatives, the Helix Nebula project and the Uber-Cloud Experiment, suggest that good progress is being made.
Helix Nebula is a pan-European project aimed at building what it calls the Science Cloud. Supported by both academia and industry, and part funded by the European Commission, it has the objective of developing and exploiting a cloud computing infrastructure for use by European scientists (academics, government and business). Its backers include three of Europe’s major research centres (CERN, EMBL and ESA) and many European providers of cloud components, including Atos, CloudSigma, Logica, SAP, T Systems, Terradue and The Server Labs. The project uses the OpenNebula cloud management platform, a popular tool for building HPC clouds and for providing users with ‘HPC as a service’ resource provisioning models. OpenNebula is also used by Dell, IBM, Santander, SAP, Telefonica and many hosting and cloud service providers to deliver HPC Clouds.
The motivation for the Uber-Cloud project (www.hpcexperiment.com) came from a series of conversations between Wolfgang Gentzsch and Burak Yenier, who wanted to better understand how real the perceived problems that constrain running HPC in the cloud actually are. These problems include concerns about privacy and security, unpredictable costs, ease of use, application performance and software licensing. (The last of these is an issue that many independent software vendors (ISVs) still need to address, as cloud-based licences for HPC applications are the exception rather than the rule.) The experiment was planned to help address these concerns. The project has no funding and is backed by no commercial or governmental organisation. It is a labour of love, and an opportunity to build a community that may change the way that high-performance computing delivers value to businesses. The objective of the experiment is to explore the end-to-end process for scientists and engineers as they access remote HPC facilities on which to run their applications. The first round of projects ran from August to October 2012, and the project is now considering the fourth round of proposals.
The requirements of users on HPC cloud facilities vary enormously, as do the capabilities of HPC clouds, so it is important to pick the right cloud – not all clouds are equal. Each case study has been supported not by an individual user or a single company, but by a team comprising users, software providers, resource providers and experts in HPC and cloud computing. This approach has been invaluable in ensuring that someone in the team is always able to understand issues that arise, such as scalability, software licensing or application performance. The case studies reported by the experiment (the report is available on its website) are extremely valuable to anyone wanting to understand the big issues before taking their first steps in the cloud.
Grid is dead, long live the cloud
Grid and cloud are imprecise terms. More than a decade ago people were using the term grid to describe many flavours of distributed computing, but suddenly the word seemed to disappear. Today the term cloud is fashionable. Everything is ‘in the cloud’ even if what is really meant is that something is accessible over the internet. One of the interesting aspects of the evolution of HPC in the cloud is that some of the solutions being offered look more like grids than clouds.
But the terms used don’t matter. What is important is whether HPC in the cloud actually works. The question asked in the title of this article was ‘is it real?’ The answer, with some caveats around topics such as virtualisation, latency and software licensing, is a resounding yes. Clouds have evolved to provide better support for HPC, and the HPC community has responded by embracing cloud. There are still problems to be solved and lessons to be learned, but HPC is well on its way to being a first-class citizen in the cloud.
Public cloud: compute resources available for anyone to use on an on-demand, pay-for-use basis.
Private cloud: compute resources available to many constituents within a single organisation.
Hybrid cloud: a cloud comprising both private cloud and (during peak usage times) public cloud resources.
HPC cloud: a cluster with specific capabilities often required by HPC applications (e.g. compute accelerators and high-performance interconnect), delivered using a cloud business model. This type of resource often requires reservation rather than being available on demand.
Infrastructure-as-a-Service (IaaS): real or virtual machines; storage; and network.
Platform-as-a-Service (PaaS): IaaS plus operating environment; database; web server; etc.
Software-as-a-Service (SaaS): typically email; CRM; etc., but some suppliers are offering complex HPC applications using a SaaS model.
Virtualisation: most cloud offerings virtualise access to servers as it makes better use of resources, but this can be a poor fit for HPC applications, whose performance could be compromised by sharing resources with other applications.