HPC PROJECTS: CLOUD COMPUTING
Looking for the silver lining
If you believe the hype, then 'The Cloud' will be the next big thing across all strata of computing. Stephen Mounsey asks what it can bring to the HPC party
Cloud computing means many different things to many different people, with so-called ‘cloud’ offerings ranging from simple online storage services to a handful of providers of supercomputing-as-a-service, with many variations in between. While the space is poorly defined, cloud-based services are rapidly becoming an established part of the HPC landscape, serving a broad spectrum of users from science and engineering to big business.
Like a grid?
In order to pin down the definition of cloud computing as applied to HPC, it is useful to compare it to its closest cousin – grid computing. Ian Foster is director of the Computation Institute, run jointly by Argonne National Laboratory and the University of Chicago, and has worked extensively on research-level computing grids. ‘The term “grid computing” is in reference to the electrical power grid,’ he notes, referring to the goal of on-demand computing analogous to the on-demand electricity supplied by national power grids. ‘On-demand computing is, of course, the same concept that underpins cloud computing.’ Grid computing came to refer to technologies for sharing resources, such as computer systems and storage systems, located at various geographical locations. ‘It’s like the electric power grid; not only does it link power to consumers, it also links many generators together,’ he notes.
‘Grid computing was pioneered in the scientific community, where people not only need a lot of computing power, but also need to federate data sources.’ An illustration of this can be found in the grid computing system currently under development to deal with the data produced by the Large Hadron Collider (LHC), at CERN, Geneva: ‘It’s a rather curious example; it links together computers at literally hundreds of universities, to enable analysis of LHC data. If you were designing something from scratch to analyse data, this isn’t necessarily the way that you’d do it, but the way that physicists like to do things is that everybody comes to the party with their own computers – their own contributions if you like.’
While a large organisation such as CERN may have tens of thousands of idle CPUs on its internal network, most HPC users do not. National and international grid initiatives have gone some way towards ensuring that academic users have access to the resources they need, but these solutions do not extend to business users, and they lack the true on-demand component of cloud computing. According to Foster, a grid makes the most of what its users contribute to it, whereas cloud computing is designed from the outset as a service. ‘The cloud is about out-sourcing, and grids are about federation.’
Cloud services offer users processors and storage space, billed per unit of time. The best-known current example is the Amazon Elastic Compute Cloud (EC2), which offers users a wide choice of configurable virtual clusters with various pricing structures. ‘The term “cloud computing” is used to mean a few different things,’ says Foster, ‘but I think that the most interesting aspect of it is the emergence of real commercial providers of computing services – the emergence of Amazon [EC2] in particular, although others are also getting into the space. That aspect of cloud is about providing the reliable sources of computing that were perhaps lacking in the early days of grid computing. We ended up building our own supplies of storage, computing and other things, but we could never operate those as efficiently as a company like Amazon could do. It’s a very exciting development.’
Screening millions of compounds for drug-receptor interactions is a compute-intensive process. Customers of software provider Schrödinger are able to use the flexibility of cloud computing to increase their screening capacity as and when they need to. Here, the interaction of a staurosporine molecule and a CDK2 receptor has been simulated by the software. Image courtesy of Schrödinger
Jason Stowe started Cycle Computing in 2005 with a view to helping to manage internal clusters across a wide range of HPC-using organisations. ‘There wasn’t really a cloud at that time,’ he notes. ‘The application environments ranged from small life-science clusters of tens of nodes to a university with 32,000 cores. We noticed that a lot of our customers had a peak vs. median usage problem; essentially, they couldn’t provision enough nodes to make computing power available when they needed it the most.’ In 2007, Cycle’s answer was to develop cluster management software that would re-size the cluster, depending on the work that was put into the queue. The software not only makes use of the on-demand processing offered by Amazon and similar cloud service providers, but it also allows the user to harvest and use internal resources efficiently.
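The resizing behaviour Stowe describes can be reduced to a simple control loop: inspect the job queue, work out how many nodes the backlog justifies, and provision or release the difference, bounded by an always-on floor and a budget ceiling. A minimal sketch of that decision follows; all names and numbers are invented for illustration, not taken from Cycle's actual software.

```python
# Hypothetical sketch of queue-driven cluster resizing: size the cluster
# to the backlog, bounded by a floor (always-on capacity) and a ceiling
# (budget cap). Thresholds are illustrative only.

def desired_nodes(queued_jobs, jobs_per_node, min_nodes, max_nodes):
    """Node count the provisioning loop should converge on."""
    wanted = -(-queued_jobs // jobs_per_node)       # ceiling division
    return max(min_nodes, min(wanted, max_nodes))

# An idle queue falls back to the floor; a burst is capped by the ceiling.
print(desired_nodes(0, 8, min_nodes=2, max_nodes=100))     # 2
print(desired_nodes(4000, 8, min_nodes=2, max_nodes=100))  # 100
```

In practice, a scheduler would call a function like this periodically and add or terminate cloud instances to close the gap, which is the ‘elasticity’ that distinguishes this approach from a fixed in-house cluster.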
Peter Shenkin is vice president of US-based software company Schrödinger, which he describes as the largest supplier of ‘hardcore scientific simulation software and molecular modelling software into the pharmaceutical industry’. Pharmaceutical companies use the company’s software in drug discovery, to evaluate drug-receptor interactions in silico. ‘When a drug company decides to target a particular disease, there are many pathways in the body that lead to that disease, and by enhancing some and knocking out others, you can attack, control, or in some cases cure that disease,’ explains Shenkin. ‘Once a pathway has been chosen, and once the developer has decided which particular receptor (usually a specific protein in the body) to target, that’s when our software comes into play.’ Most pharmaceutical companies already have large in-house compound libraries, often running into millions of compounds acquired through R&D and through IP purchases. ‘[The pharma companies] want to figure out which of these millions of compounds is the most promising starting point for developing a drug. Our software tests these millions of compounds to see how strongly they will bind against an identified receptor. It narrows the list down from 10 million candidates to something more like 2,000, for example.’
In terms of computing requirements, the software is moderately hungry: ‘Typically, to screen three million compounds takes about a year of CPU time. On the cloud we could do it in 18 hours, using 480 processors, which is a typical, non-maximal size for a virtual cluster. This is why the cloud is such a win for our software and for our industry,’ says Shenkin.
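The arithmetic behind Shenkin's figures is easy to sanity-check: a year of serial CPU time divided across 480 cores gives roughly 18 hours of wall-clock time, assuming the near-perfect scaling that independent screening jobs approach.

```python
# Back-of-the-envelope check of the quoted speedup: one CPU-year of
# compound screening spread across 480 cloud cores, assuming near-ideal
# scaling (each compound is screened independently of the others).
cpu_hours = 365 * 24            # one year of serial CPU time, in hours
cores = 480
wall_hours = cpu_hours / cores  # ideal wall-clock time on 480 cores
print(wall_hours)               # ≈ 18 hours, matching the quoted figure
```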
Currently, Schrödinger’s customers request additional usage of the software from the company, with Cycle supplying the compute power, courtesy of Amazon. Shenkin explains that there were several reasons why the company chose to use Cycle as a supplier, rather than to deal directly with Amazon’s service: ‘Cycle’s infrastructure is cloud-agnostic; on the back end it knows how to talk to Amazon, but it also knows how to talk to several other cloud vendors. If a customer should, for example, have a close relationship with IBM, Cycle just points the back-end towards IBM’s cloud service, and we don’t have to worry about porting our software to a different set of standards.’ Another reason for choosing to access the cloud via Cycle was ease of integration: ‘Cycle takes care of making the Amazon cloud look like a standard cluster, while maintaining its elasticity; compute nodes on the virtual cluster go in and out of existence dependent on the workload,’ explains Shenkin.
Shenkin is quick to point out that Schrödinger does not intend to move to a software-as-a-service model, although billing may become automated in the future. ‘We want to maintain a close relationship with our customer base,’ he notes, adding that the company’s software is complicated, and that cloud computing can be expensive if used incorrectly. ‘It’s not a huge expense as these things go, but somehow or other, people’s perceptions change when it’s real money, rather than amortisation of a capital expense [internally owned servers].’
A sprint or a marathon
According to ANL’s Foster, Amazon’s cloud services were initially designed to support the specific set of e-commerce workloads coming out of the company’s main business. ‘They provided services for queuing and content distribution and so forth, and some of the main users have been companies with these kinds of demands – [online video rental firm] Netflix, for example, now runs its business on EC2,’ he notes. Foster states that the nodes offered by EC2 are of a good standard, but that they’re not well interconnected. ‘If you’re running an application that requires a high-speed interconnect, it’s not going to run well on a service like Amazon’s. On the other hand, if the application is fairly compute-intensive, and if there isn’t communication between processors, then these sorts of systems work quite effectively.’
Cycle’s Jason Stowe categorises the potential users of cloud computing: ‘There are two types of HPC – there’s the sprinter type, where we have users trying to run a highly parallel application, and then there’s the marathon runner type of HPC, in which applications are pleasantly parallel.’ In sprinter-type applications, he says, latency is of key importance, and performance must be optimised at every level to get results. In contrast, Stowe describes marathon applications (which he says form the bulk of the demand for HPC) as ‘high throughput computing, and not necessarily high performance computing’.
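The ‘marathon’ pattern is simple to illustrate in miniature: each task runs independently, no data moves between workers while they run, so throughput scales roughly linearly with core count and interconnect latency never enters the picture. In the toy sketch below, the score function is a hypothetical stand-in for a real per-compound screening or rendering job, not taken from any vendor's software.

```python
# A "marathon" (pleasantly parallel) workload in miniature: independent
# tasks, no inter-worker communication, so wall-clock time shrinks with
# core count and interconnect speed is irrelevant.
from multiprocessing import Pool

def score(compound_id):
    # Stand-in for an independent, compute-heavy task (e.g. docking one
    # compound against a receptor); it depends only on its own input.
    return compound_id * compound_id % 97

if __name__ == "__main__":
    with Pool() as pool:                         # one worker per local core
        results = pool.map(score, range(1000))   # no cross-task messaging
    print(len(results))                          # prints 1000
```

A sprint-type code, by contrast, would have its workers exchanging data at every step, which is exactly where a commodity cloud interconnect becomes the bottleneck Foster describes.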
Scale-up applications – those requiring the most sophisticated compute hardware – can now be catered for alongside parallel ‘marathon’ tasks by HPC-oriented cloud services
SGI’s recently launched Cyclone cloud service aims to cater for both scale-up (sprint) and scale-out (marathon) applications. Christian Tanasescu, general manager of Cyclone and VP of software engineering at SGI, describes the company’s offering: ‘As opposed to other cloud offerings, mostly based on established, older technologies, our approach aims to offer a choice of the best available technology. We offer scale-up systems (large, shared-memory machines for data analytics) and scale-out machines (for scalable applications).’ As well as offering virtual clusters at a range of sizes (2,000, 512, or 128 cores) with high-speed InfiniBand interconnects, Cyclone clusters can be configured to include HPC-specific components such as GPU accelerators (Nvidia Tesla for double-precision work and ATI for single-precision) and the new line of integer accelerators from SGI’s partner company Tilera. Tanasescu notes that both IBM and Penguin Computing also offer services along similar lines.
Foster speculates as to factors preventing more customers from making use of cloud processing: ‘People have been concerned about security, and there are a variety of worries: If a user’s data is stored somewhere external, there’s a fear that it could be stolen – if somebody breaks into Amazon for example. People also worry about the possibility of cross-talk between applications running on one server, although this is just theoretical really. In the US there’s also a concern about legal issues; data on Amazon is more easily accessible to the government than data on your personal hard drive, and a search warrant is not required to access it.’
Cycle’s Stowe states that his company has been careful to ensure security for its clients from the outset: ‘Actually, we thank our lucky stars that we got started with finance and defence customers, as they tend to have exceptionally stringent security requirements. When we built the cluster systems that we have, we built in a number of security measures that the clients use internally,’ he says.
Despite the hype surrounding cloud services, will users make use of them? Songnian Zhou is CEO of Platform Computing, a company that has recently added cloud management software to its existing range of cluster and grid management tools. Zhou believes that cloud computing represents a particularly attractive opportunity to users within scientific computing, in that the cloud is ‘complementary and supplemental’ to the work that users have been carrying out using clusters and grids. ‘I think cloud is a significant advance,’ he says, adding that making the switch to the cloud need not be difficult: ‘Through configuration, and not programming, these customers can build their own cloud. It’s new, but it’s not a revolution for them – they all know supercomputing, and they all know how to run an application on commodity clusters. Ideally, all of this is mostly for the IT people to worry about anyway! The end users should get the benefit without ever knowing that they are running their applications in the cloud; the whole idea of cloud computing is to virtualise and insulate the resources away from users and their applications.’
Foster summarises the indistinct nature of cloud computing when he describes it as ‘partly a business model and partly a usage model. Cloud providers have a set of services that are good for a particular set of applications, and I think that the users are still learning in order to fully understand what those [applications] are.’