HPC finally climbs into the cloud
Although commerce and consumers have been computing in the cloud for years, the high-performance computing sector has been more hesitant. But all that may now be changing.
The cost of cloud computing for HPC is falling, while new programming models that will allow HPC workloads to run more efficiently in the cloud are becoming available. ‘Public cloud’ providers are installing hardware configurations that are more suited to HPC, while private clouds are giving users experience of how to run their jobs in a cloud environment.
Nonetheless, HPC in the cloud has not got off to a flying start. The market research company Intersect360 reported last year that the HPC users it surveyed spent only about three per cent of their budgets on cloud computing – and the percentage has not changed much over the past five years.
One of the barriers is simply cost – initial claims that the cloud could lower the cost of high-performance computing owed more to marketing hype than to reality. According to Andrew Jones, leader of HPC consulting and services at NAG: ‘If you can keep your machine busy enough, a dedicated solution is cheaper than the cloud. One general rule of thumb is that, if you are going to keep a machine more than half busy more than half the time, then it’s more economic to do it in-house, rather than on a cloud infrastructure’.
Deepak Khosla, founder and CEO of X-ISS, said that, in the past: ‘If you had a predictable workload and a predictable backlog of business, then it did not make any sense to go to the cloud. It was 10 times more expensive.’ Compute cycles were not the only cost issue, he added: ‘There is a pretty significant cost to just leaving data in – or egress of data from – the cloud. Cost has always been an issue.’
However, Khosla pointed out that the situation is changing: ‘Over the last couple of years, [the public cloud providers] have brought their prices down significantly – both compute and storage’. Nonetheless, the feedback that X-ISS, as an HPC consulting company, is getting from its customers is that, for regular, predictable workloads, the cloud is still at least twice as expensive as in-house solutions. Jones also believed that the financial calculus will change over time as the costs of both cloud and in-house solutions evolve.
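The rule of thumb quoted by Jones can be made concrete with a little arithmetic. The sketch below is an illustration only: the function names and every cost input are assumptions introduced here, not figures from NAG or X-ISS, and it ignores data storage and egress charges.

```python
HOURS_PER_YEAR = 365 * 24


def owned_cost_per_used_hour(capital_cost, annual_running_cost,
                             lifetime_years, utilisation):
    """Effective cost of each node-hour actually used on an owned node."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    total_cost = capital_cost + annual_running_cost * lifetime_years
    used_hours = HOURS_PER_YEAR * lifetime_years * utilisation
    return total_cost / used_hours


def break_even_utilisation(capital_cost, annual_running_cost,
                           lifetime_years, cloud_price_per_hour):
    """Utilisation above which owning is cheaper than renting by the hour.

    A result above 1.0 means renting is cheaper even for a fully busy node.
    """
    total_cost = capital_cost + annual_running_cost * lifetime_years
    rented_cost_if_always_on = (cloud_price_per_hour * HOURS_PER_YEAR
                                * lifetime_years)
    return total_cost / rented_cost_if_always_on
```

With hypothetical inputs of 6,000 to buy a node, 1,000 a year to run it, a four-year life, and a cloud price of 0.60 per node-hour, `break_even_utilisation(6000, 1000, 4, 0.60)` comes out at roughly 48 per cent, which is in the region of the 'half busy' figure Jones describes. Different inputs will move the threshold considerably.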
The cost of software licences
The price of compute cycles in the cloud, and the cost of getting and keeping data there, are not the only factors impeding wider use. Khosla said: ‘Licensing has been a pain because the idea behind the cloud is pay-per-use but the major independent software vendors (ISVs) haven’t changed their model for licensing to that. These guys are expensive. They have a pretty tight hold on their customer base. My gut feeling is that for as long as they can maximise that revenue, they will.’
A significant sign of change is that some of the major ISVs are now offering their software over the cloud – or even acting as cloud providers themselves. In January 2016, the French ‘virtual prototyping’ company ESI Group announced that it had started delivering advanced engineering modelling and simulation in the cloud, across multiple physics and engineering disciplines, using Amazon Web Services. Some months earlier, in May 2015, Ansys had also chosen Amazon Web Services for the launch of the Ansys Enterprise Cloud for engineering simulation. On the other hand, MathWorks has imposed geographical restrictions on the availability of Matlab on the cloud: Matlab Distributed Computing Server for Amazon EC2 is confined to North America and some European countries – excluding Poland, the Czech Republic, and Slovenia, for example.
In Khosla’s view, the ISVs ‘are going to have to change, or they will start losing. As people like ESI, or Open Source competition starts to come in, customers will start to say “we cannot afford to pay”. Those that have locked-in their customer base – because it’s pretty hard to move – are going to be slower, because they want to make the most they can.’
Cloud benefits developers, not just users
But cloud computing could actually be a good ally for the ISVs in the development of their own software, in the view of David Lecomber, founder and chief executive of Allinea. ‘If you are an ISV – let’s say an automotive software company – the chances are you don’t have a million dollars of supercomputer for your development team to use, whereas the chances are that your customer does. So if your customer reports a problem on 500 nodes, or 100 nodes, or whatever it is, then you don’t have a machine-room with that kind of machine in, but you can reproduce that same problem on the Amazon cloud. In a morning, you can get that machine, pay for it by the hour, and turn it off when you are done. So I think that’s going to be really interesting for more software companies. Even larger ISVs wouldn’t have the size of system that their main customers have. The economics wouldn’t work out for them.’ In perfect accord with the judgements of Jones and Khosla, he remarked that if a machine is going to be 95 per cent busy: ‘It makes sense to own it. But for intermittent use, then it makes total sense to rent.’
A subtly different application of cloud techniques in internal software development was identified by Matthijs van Leeuwen, chief executive of Bright Computing. In this case, it was not reaching out to a public cloud, but using OpenStack within the company’s own data centre to provide facilities flexibly for the company’s own development team. OpenStack is open-source software for cloud computing that controls pools of hardware resources for processing, storage, and networking throughout a data centre.
‘We have HPC, Big Data, cloud developers, and they need to do testing, play, and try,’ he said. ‘They need to stand up a cluster very quickly, use it for a while, and shut it down. They need to very dynamically change the size of the cluster. In the early days, they would really be messing about with physical servers and spending quite a bit of time in the data centre downstairs in the basement of our office, unplugging servers and almost fighting over them. As soon as we started playing with OpenStack, we realised that this would be a fantastic tool to allow our developers to reserve resources and build a cluster-as-a-service model. They start them up – all virtualised – and then shut them down. Having used it for one and a half years, it’s become stable and versatile.’
Because of their own experience in using the package, Van Leeuwen claims that the product the company has developed from it, Bright OpenStack, ‘is the most HPC-optimised OpenStack distribution on the market.’
Different clouds, different uses
Van Leeuwen makes a useful distinction that clarifies the different types of cloud technologies currently being used, by pointing to the difference between ‘cloud bursting’ and running HPC workloads on OpenStack in data centres. The difference is, he explained, that: ‘the original cloud-bursting scenario is about adding more capacity in a flexible way; whereas, if you have 100 servers in your data centre, by adding OpenStack to it you don’t get a single extra CPU cycle, you’re just redistributing it in a more effective, more efficient manner.’
Even within cloud bursting, which Bright has been offering for several years, there are two distinct scenarios. The first is to set up a ‘cluster in the cloud, an independent but complete HPC cluster in, for example, Amazon. The second is where you extend an on-premise existing cluster into the cloud. Your control and management node is still in your data-centre, but some of the additional servers are in the cloud.’ The hybrid of traditional, on-premise, bare metal extended into a virtualised public cloud is the more popular option, and the use case is straightforward, he said, because it is a way of responding to varying demands throughout the year for HPC resources.
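The hybrid scenario, in which an on-premise cluster extends into the cloud only when demand exceeds local capacity, can be sketched as a simple placement rule. This is a toy illustration under stated assumptions, not a description of how Bright's software works: the function name, the first-come ordering, and the fixed cap on rented cores are all hypothetical.

```python
def place_jobs(job_cores, on_prem_cores, max_cloud_cores):
    """Assign each job to 'on_prem' or 'cloud', or leave it 'queued'.

    Jobs are considered in submission order. On-premise capacity is used
    first, and only the overflow bursts out to rented cloud nodes.
    """
    free_local = on_prem_cores
    free_cloud = max_cloud_cores
    placement = []
    for cores in job_cores:
        if cores <= free_local:
            free_local -= cores
            placement.append("on_prem")
        elif cores <= free_cloud:
            free_cloud -= cores
            placement.append("cloud")
        else:
            # Neither pool can take the job right now.
            placement.append("queued")
    return placement
```

A real scheduler would also weigh data movement, licence availability, and the hourly price before bursting; the sketch shows only the capacity decision.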
What surprised Bright, however, was the degree of interest among HPC customers for the OpenStack option. ‘With Bright OpenStack they can build a data-centre infrastructure where they can offer their own internal users a choice: today, do they want to run an HPC application, a big data application, or other workloads? They can also offer a choice of infrastructure: does this user want to run bare metal; does the other user want to run a virtual machine; and yet another user in a container?’ The latest release of the software, in January, offers a scenario for bursting not just from bare metal, but also from a private to the public cloud.
Failure to observe Van Leeuwen’s helpful clarification may contribute to some of the current confusion surrounding the exact role of the cloud in HPC. Intersect360 reports that only about three per cent of HPC budgets go on the cloud, and that demand has hardly increased as yet. IDC, another market research organisation, reported last year a significantly higher take-up of the cloud, at around 25 per cent, a doubling since 2011.
Embarrassingly parallel jobs suit the cloud
Although it may not yet have been translated into significant increases in usage, attitudes among end-users are changing, and interest in the cloud is growing. David Power, head of HPC at Bios-IT, said: ‘We are starting to see a bit of a shift in our job users’ perceptions. Some of the initial barriers to the cloud have been around the spec of the hardware from the larger cloud providers’. A few years ago, he continued, the cloud did not offer a performance benefit since most of the hardware ‘was a generation or two old, with no fast interconnect and relatively slow storage, and it was all virtualised’.
Both Power and Khosla cited Cycle Computing as one of the pioneers, offering a service using Amazon Web Services to big pharma companies. According to Power, such genomics jobs are embarrassingly parallel, and so do not require heavy MPI communication between threads, cores, and different jobs. ‘That is where people began to realise there is some merit here’. Khosla took the same view: ‘What has worked well in the cloud are massively parallel applications that are not running for a long, long time and do not have sensitivity to storage or other compute nodes. Bio apps and pharma have worked well, mostly in the burst capability.’
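What makes such jobs 'embarrassingly parallel' is that each work item can be processed without reference to any other, so no fast interconnect is needed. The sketch below illustrates the pattern; the per-sequence task (a GC-content calculation) and the function names are hypothetical stand-ins, not anything used by Cycle Computing or its customers.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def score_sequence(seq):
    """Hypothetical per-item task: fraction of G and C bases in a sequence."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def run_independent_tasks(sequences, workers=4,
                          executor_cls=ProcessPoolExecutor):
    """Map the task over all inputs; no task communicates with any other.

    Because the items are independent, the same pattern scales from local
    cores to separate cloud nodes with only the executor changing.
    """
    with executor_cls(max_workers=workers) as pool:
        # pool.map preserves input order in its results.
        return list(pool.map(score_sequence, sequences))
```

Passing `executor_cls=ThreadPoolExecutor` runs the same code in threads, which is convenient for quick experiments. A tightly coupled MPI application, by contrast, cannot be decomposed this way, because its tasks exchange messages throughout the run.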
In contrast, as Van Leeuwen pointed out, oil and gas companies and those doing seismic processing require huge amounts of data and it would be far too time-consuming to upload it all to the cloud. Power reiterated the point: ‘We are starting to see the low hanging fruit of HPC workloads accepted as a decent fit for the cloud. Anything that’s loosely coupled, without too much heavy I/O, without having to move too much data in and out – I think they are decent candidates for cloud workloads. Whereas, if you look at the very high-memory requirement workloads, or highly parallelised jobs that run on 10,000 cores and above, they’re probably not good candidates for cloud workloads today.’ Khosla’s assessment is similar; the cloud is not suitable for those cases where there are affinity-type requirements, and users are running on InfiniBand, requiring very high performance and low latency. ‘That’s been a show-stopper for most people. Other than Azure, which has just announced InfiniBand, that’s just not available.’
Pitfalls in the cloud
There are also synergistic hardware/software issues. Khosla pointed out that, for companies and organisations that have written their own applications in-house, ‘going to the cloud is not very easy. You have a lot of dependencies that the developers write in, that assume a static environment. So to go to the cloud, there has to be an effort to decouple them,’ and the code has to be re-written.
Over the past year, the ability to spin up resources in the cloud has got a lot better than, say, four years ago, Khosla observed, and X-ISS is seeing a number of organisations trying to get round the cost issue by opting for ‘spot pricing’. But this in itself can present challenges, he argued, because ‘the cloud is not always there. Nodes can disappear, because someone using spot pricing took them away. So you have to have applications handle the fact that this can happen more often than in your own environment – when they only go away when you have a hardware issue, which is not that often. So now you have to write your check-pointing, and your failure recovery, in a way where you are not losing too much momentum as the application progresses, and you can recover’. The challenges of disappearing nodes and of removing dependencies ‘are not that easy for people who have written their own applications and are trying to provide more features. I think it is right for them long-term – to provide more recovery features – but these are the challenges that they have seen,’ he said.
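The check-pointing and recovery that Khosla describes can be sketched in a few lines. This is a minimal illustration, assuming a toy computation (a running sum) and a local JSON file as the checkpoint store; the function names and the `fail_at` parameter, which simulates a node being taken away, are hypothetical.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"next_step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write to a temporary file, then rename, so a node lost mid-write
    cannot leave a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def run(path, n_steps, checkpoint_every=10, fail_at=None):
    """Sum the step indices, resuming from the last checkpoint if present."""
    state = load_checkpoint(path)
    for step in range(state["next_step"], n_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated loss of a spot node")
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

The checkpoint interval is the trade-off Khosla alludes to: frequent checkpoints cost I/O time, while infrequent ones mean more work is repeated after a node disappears. On a real spot instance the checkpoint would need to live on storage that survives the node.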
Allinea’s David Lecomber said that the company had started using the Amazon cloud three or four years ago, to deliver training: ‘We could spin up a cluster in the training room for everyone to play with. They could debug; they could profile; without interfering with anyone else. We basically had on-demand access to clusters out in the cloud, rather than having to configure something locally or disrupt a local service while training was going on.’
Originally, back in 2010, it was hard work to get started in the cloud, he observed: ‘It was like building your own HPC system, but with virtual hardware.’ However, like Khosla, he believes that it is now easier than before. ‘There is a good package on Amazon called CFN cluster. Within 10 minutes you can create a one-time set-up there and boot up a cluster in less than five minutes.’ All the usual software stack, job schedulers and queues are included and ‘it even comes with a dynamic number of nodes. So, as someone asks to do more work on a system, it will spin up another couple of nodes and add them to the cluster – not something you can do in a regular HPC centre.’
Cloud providers interested in HPC
The major cloud providers themselves are showing more interest in HPC. According to David Power: ‘We have started to see some cloud providers put together higher-end offerings for HPC. You’ve got 10G systems; VM that can go up to 16 cores with a decent amount of RAM in it; they’re beginning to look a bit more like your traditional HPC compute node.’ (Although InfiniBand interconnects dominate the Top500 list of the world’s fastest supercomputers, there are still nearly 200 machines on the list with Ethernet/10GigE interconnects.)
Bios-IT is now putting together its own cloud solution. ‘The ability to burst – get additional capacity for short periods of time – is something that a lot of our customers are now interested in investigating, so we have set up a proof-of-concept cloud-based system in our lab. We put in all the usual HPC hardware – parallel file system, InfiniBand interconnects. We were able to use Docker and Ironic to do bare-metal provisioning, so you didn’t have the performance hits from virtualisation.’
Acceptance of the cloud in HPC is growing, Power believes: ‘I think in the future you’re going to see more and more people using this sort of approach towards HPC’. There has been interest in Bios-IT’s trial cloud service from two categories of customers. One group consists of those who want to burst out into the cloud to gain extra capacity for a short period of time ‘because of some research, a grant proposal, or conference coming up where they needed urgent access to something and were willing to pay us for that instead of waiting in queues for the local resource’.
Leapfrog into HPC
However, both Power and Khosla highlighted one category of user for whom the cloud may offer the opportunity to ‘leap-frog’ into HPC direct from a workstation-based infrastructure, without having to invest in the hardware of an HPC cluster (or the management overheads of running it). Khosla said: ‘We are seeing people on workstations wanting to move to HPC, but they don’t have an existing HPC infrastructure. Now it becomes a serious conversation for them to say “should we look at the cloud and see how it goes because we’ve heard that getting up and running in the cloud is a lot easier? I don’t have to buy facilities – and who would manage it for me?” We’re beginning to see more interest there,’ he observed.
According to Power, for the infrastructure that Bios-IT is creating, ‘we have had a number of requests from people just to do hosted HPC. They don’t want to look after the data centre themselves, just submit their codes and get the information out of it.’
What makes the cloud so attractive in this scenario, Khosla continued, is that the return on investment (ROI) is easier to assess. In general, it’s hard for people to say what their ROI is if they buy their own cluster, he explained; they have to factor in not just the cost of the hardware, but power requirements, support and management costs. It is one of the areas on which X-ISS provides specialist advice. Whether for cloud or cluster, however, ‘getting the right expertise is a big challenge. We have a range of HPC solutions just for that issue, and at predictable cost. The smaller guys don’t need a full-time person to manage their cluster, so we provide the service and they can easily assess the ROI.’ It may turn out that, with experience gained by leap-frogging to HPC in the cloud, a company will decide that HPC is worth the investment and buy its own dedicated cluster.
But Khosla warned that there are still pitfalls in trying to use the cloud. Specialist expertise is needed to meet the challenge of different types of users coming in with Open Source variants, or other commercial code that they want to try and test: ‘The Open Source stuff is very, very painful, because each application has its own dependencies and often they conflict. Some of the cloud technologies – OpenStack, Containers and so on – are now beginning to be leveraged for private clouds to provide resources to these researchers on a quick basis, so they can test and validate their stuff before they go into production. The nice thing about that is it starts making sense; once their signature is understood, they will maybe be able to move to a public cloud also.’
The future: federated clouds
The importance of the cloud in HPC is growing therefore, and there are further benefits to come. Allinea’s Lecomber said: ‘One thing I do find nicer about it is that it’s constantly improving. If you have an HPC system you are tied to it and its architecture for three to four years until the next one comes along, whereas Amazon does a refresh at least once a year in terms of the kinds of processors that you can pay for, so you’ve always got the ability to try something new. It may not be the absolute leading edge, but you do always get the ability to pay for access to some decent new hardware.’
Bright’s Van Leeuwen thinks the reach of the cloud will extend still further: ‘The next step is to burst from one private cloud to another private cloud – the term for that is “cloud federation”.’ This development will be really interesting for multinational companies that have data centres in multiple locations, he continued, so that they can balance compute resources across all their data centres and, rather than just sharing workloads within one data centre, can do so between multiple data centres.
Could cloud computing become not just an alternative but a dominant way of executing HPC workloads?
As the main article makes clear, hardware configuration has been a barrier to the wider adoption of HPC in the cloud; the main ‘public cloud’ providers, such as Amazon, Google, and Microsoft, have – understandably – invested in commodity hardware that is adapted to commercial and business computing needs, but is less well configured for many tightly coupled high-performance computing workloads.
There are parallels with the way in which Beowulf clusters started to change the face of high-performance computing just over 20 years ago. Clusters used commodity components, lowered the financial cost of owning a supercomputer, and thus widened the pool of users who could run high-performance workloads. But the configuration of the hardware was different from what people were used to, and they had to change the way they wrote their programs.
Just as clusters changed the way supercomputers were programmed, so new developments are adapting HPC to the cloud. Marcos Novaes, cloud platform solutions engineer with Google, has no doubts: ‘I think this is the year in which we will see a radical shift in HPC’.
As a self-confessed ‘HPC old timer’ – he worked on the design of the IBM SP2 computer, which was once the second largest in the world, at the Lawrence Livermore National Lab in 1999 – he is realistic about the limitations of the cloud. Traditionally, HPC uses tightly coupled architectures: ‘Which basically means an InfiniBand interconnect with micro-second latency. It is hard to achieve such latency in a multi-tenant cloud environment without sacrificing the flexibility of resource allocation, which provides the best economy of scale. So, to date, most of the HPC workloads that we see moving to the cloud are embarrassingly parallel.’
‘However, for the message-intensive workloads that are affected by latency, we do have a challenge that will require a different approach,’ Novaes continued. He believes that it can be found in recent work by Professor Jack Dongarra of the University of Tennessee, one of the pioneers of HPC, whose Linpack benchmark the Top500 list has used since 1993 as the common application for evaluating the performance of supercomputers. Dongarra has developed a new Directed Acyclic Graph (DAG) scheduler called Parsec and a new parallel library called DPlasma, both of which support heterogeneous computing environments and address the need to cope with higher latencies as well. According to Dongarra, the future of HPC is in algorithms that avoid communication.
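The idea behind a DAG scheduler is that a computation is expressed as tasks plus the dependencies between them, and each task is launched as soon as its inputs are ready rather than in lock-step with the others. The sketch below illustrates that general idea using only the Python standard library. It is not Parsec's interface, and the `run_dag` function and its arguments are assumptions made for this example.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_dag(tasks, deps, workers=4):
    """Run each task once all of its predecessors have finished.

    tasks: name -> callable taking a dict of its predecessors' results
    deps:  name -> set of predecessor names (omit for tasks with none)
    """
    sorter = TopologicalSorter({name: deps.get(name, set()) for name in tasks})
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    results, running = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while sorter.is_active():
            # Launch everything whose dependencies are already satisfied.
            for name in sorter.get_ready():
                inputs = {d: results[d] for d in deps.get(name, set())}
                running[pool.submit(tasks[name], inputs)] = name
            # Block until at least one running task completes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                results[name] = future.result()
                sorter.done(name)  # may unlock successor tasks
    return results
```

For example, with tasks `a` and `b` that return 2 and 3, and a task `c` that depends on both and multiplies their results, `run_dag` runs `a` and `b` concurrently and then `c`. Because no task waits on a global barrier, a slow message or a slow node delays only the tasks that depend on it, which is one reason this style of scheduling tolerates higher latency.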
Novaes points out that the migration of HPC to the cloud means that ‘the move of HPC to a new communication-avoiding platform has already started, and we will see a very strong acceleration, starting this year, as such technologies become available’.