Robert Roe looks at the changing ways that the HPC industry uses cloud computing technology
Cloud usage models are changing as the price of cloud computing falls and specialist services are developed for HPC users. These take the form of cloud-based disaster recovery services, testbed services, and bare-metal HPC and AI infrastructure built to support HPC users.
Mahesh Pancholi, business development manager at OCF, commented that the falling cost of cloud is making users more aware of its potential for HPC. ‘There is definitely a change in attitude towards cloud. I think there is also an increasing variety of use cases, because public cloud providers have gone beyond just selling their spare cycles.
‘They have realised just how much of a business opportunity there is and so they are starting to offer more niche and specialised services and these are the kind of things that cover the needs of HPC or research computing users.’
The launch of cloud-based HPC services from companies such as vScaler, Oracle, IBM, Xtreme-D and many others, combined with increasingly HPC-friendly hardware installed by the main cloud providers, is creating a more favourable environment for HPC users.
While some workloads have already been deployed on cloud resources, many HPC applications demand a level of performance that requires HPC-specific technologies such as high-bandwidth networking, fast storage and large-memory nodes. It has taken time for these services to become more widely available, which has helped to bring costs down and opened up new possibilities for HPC services.
The signs point to increasing use of cloud by HPC users over the coming years, as the number of possible use cases grows and the cost of using cloud falls. Naoki Shibata, founder and CEO at Xtreme-D, an HPC and AI cloud computing provider based in Tokyo, Japan, noted that Hyperion’s recent research into cloud computing points to considerable growth in the market.
‘Judging by the trending of HPC in the cloud, we see continued fast growth of the cloud HPC market. Hyperion finds that whereas in 2011, 13 per cent of HPC sites used cloud, in 2018, this figure is at 64 per cent. This still allows for accelerated growth because so far just seven to eight per cent of the work is being done using cloud. With greater availability, flexibility, and ease of access, we conclude that a much greater portion of HPC work will be migrated to the cloud in the coming months and years,’ stated Shibata.
New ways to use the cloud
Disaster recovery technology in HPC can be incredibly expensive, as it traditionally requires hardware to be kept in reserve in case of emergency. OCF initially set out to create a cloud-based service for disaster recovery that could reduce the cost of reserving additional hardware. The service keeps the management infrastructure used to provision nodes running at all times, so that users can quickly set up a cluster if disaster strikes.
‘Generally with a cluster you have some management nodes that make sure everything is running; you have a scheduler available and it can be used to deploy new hardware as it comes in. We are taking that approach and putting that into the cloud. You will have your management software running and then as you need you can spin up additional nodes,’ said Pancholi.
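The model Pancholi describes can be sketched in a few lines. This is a minimal illustration, not OCF’s actual tooling: only the management layer runs continuously, and compute nodes are provisioned on demand. All class and method names here are invented for the example.

```python
# Illustrative sketch (not OCF's real implementation) of a standby DR cluster:
# the management/scheduler layer stays up 24/7, while compute nodes are only
# provisioned when they are actually needed.

class StandbyCluster:
    def __init__(self, management_nodes=2):
        # The management nodes run permanently, ready to deploy new hardware.
        self.management_nodes = management_nodes
        self.compute_nodes = []  # nothing provisioned until disaster strikes

    def spin_up(self, count):
        """Provision additional compute nodes on demand."""
        start = len(self.compute_nodes)
        self.compute_nodes.extend(f"node-{i}" for i in range(start, start + count))
        return len(self.compute_nodes)

    def running_footprint(self):
        """Number of nodes currently running (and so incurring cost)."""
        return self.management_nodes + len(self.compute_nodes)

cluster = StandbyCluster()
assert cluster.running_footprint() == 2   # in standby, only management nodes run
cluster.spin_up(64)                       # disaster: replicate the in-house cluster
assert cluster.running_footprint() == 66
```

The point of the design is in the standby state: the footprint, and therefore the bill, stays tiny until the replica cluster is actually needed.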
This gives users the safety of having cloud resources available to them quickly, without paying for the majority of the resource unless it is needed. Pancholi stressed that this is not just for HPC users, however, as the service can recreate an organisation’s internal infrastructure so that researchers can explore novel technologies.
Pancholi gave an example of a recent customer who had become responsible for not just the HPC resources but a whole range of research computing services. ‘They have got people looking into novel areas like IoT and AI and the size of your team to run those services means that you have to be quite clear about where you put those resources. For cutting-edge research you often need to put a disproportionate amount of resources to set it up and keep it running for a small number of people,’ said Pancholi.
‘If you leverage a public cloud resource for that it makes it easy to investigate and see if it is worthwhile to invest that time and money in the underlying infrastructure without having to move people away from the main services that everyone else is relying on,’ Pancholi added.
However, OCF’s cloud replication service does more than provide users with cloud availability: it has been designed to give users a replica of their own internal cluster. In the past this has been prohibitively expensive, because without doubling up on servers it is not possible to have a replacement cluster ready for when a major issue strikes.
Pancholi states: ‘We have seen a trend towards HPC becoming more and more accepted as a critical service for research institutions. From a business perspective we have also seen changes in how these institutions are run.
‘They are moving away from a home-grown IT directive to more commercially aware CIO types. The first thing they want to do is make sure they have disaster recovery coverage across the organisation for their key services.’
OCF aimed to provide the reassurance of having a disaster recovery cluster available without the huge costs. This meant keeping the software and management infrastructure available 24 hours a day, so that users can quickly provision the nodes to replicate their own internal HPC cluster. ‘The work that we have been doing with public cloud providers led us to believe that we could start to do something that is a proper replica of your cluster in the cloud,’ said Pancholi.
‘There have been attempts to try and fill this void before, and predominantly people have tried to host software in the cloud and say, well, you do mainly Ansys stuff on your HPC cluster so you can come and do Ansys in the cloud. Now my personal experience of running an HPC cluster for a university helps me understand that there is a big difference between running a piece of software on two different infrastructures. There are so many different parameters that can change things, but actually what you need is something that is installed as close as possible to the original system.’
‘What we have aimed to do with our Cloud Replication Service is to actually build – from the ground up – a replica of your nodes and your software stack and provide a way for your live data to be available to you in the private cloud,’ said Pancholi.
‘We think this is a very novel approach, and the cost implication of having that on standby instead of buying a second cluster is something that we have found to be a very attractive proposition when we have spoken to people.’
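The cost argument can be made concrete with some back-of-envelope arithmetic. All figures below (node counts, a notional price per node-hour, outage duration) are invented for illustration; the article quotes no prices.

```python
# Hypothetical cost comparison: a duplicate on-premise cluster kept on standby
# all year, versus a cloud replica where only a small management layer runs
# until disaster actually strikes. All prices are illustrative assumptions.

HOURS_PER_YEAR = 24 * 365  # 8,760

def duplicate_cluster_cost(nodes, cost_per_node_hour):
    # Second physical cluster: every node powered and paid for all year.
    return nodes * cost_per_node_hour * HOURS_PER_YEAR

def cloud_standby_cost(nodes, cost_per_node_hour, mgmt_nodes, outage_hours):
    # Management nodes run all year; compute nodes only during an outage.
    return (mgmt_nodes * cost_per_node_hour * HOURS_PER_YEAR
            + nodes * cost_per_node_hour * outage_hours)

# A 64-node cluster at a notional $1/node-hour, 2 management nodes,
# and one week of disaster-recovery running per year.
dup = duplicate_cluster_cost(64, 1.0)                                  # 560,640
standby = cloud_standby_cost(64, 1.0, mgmt_nodes=2, outage_hours=7 * 24)  # 28,272
assert standby < dup  # standby wins as long as full-scale use stays rare
```

Under these assumed numbers the standby model costs a small fraction of a duplicate cluster; the trade-off tips the other way only if the replica ends up running most of the year.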
While some companies develop specialist services for disaster recovery and testing new software, others are looking to make cloud an avenue for traditional HPC workloads. Several companies now offer so-called ‘bare metal’ cloud servers or platforms, designed to deliver performance comparable to an in-house HPC cluster.
At SC18, the largest US HPC conference, held in November each year, Xtreme-D launched Xtreme-Stargate, a gateway appliance that delivers a new way to access and manage high-performance cloud resources.
The Xtreme-Stargate system is a new HPC and AI cloud platform from Xtreme-D. It provides high-performance computing and graphics processing, and the company aims to provide a cost-effective platform for both simulation and data analysis. The company has already announced a Proof of Concept (PoC) programme, which has been tested with several early adopters.
Xtreme-D has announced that public services will start at the end of 2018 in Japan, collaborating with Sakura Internet as the major Japanese-focused HPC and datacentre provider. Services in Europe and the US are expected to commence in 2019.
Xtreme-D’s founder and CEO Shibata commented that the main benefit of using the company’s Xtreme-D DNA product is in quickly establishing an optimal HPC virtual cluster on public cloud. ‘It saves hours of paid use on the cloud that it would have taken to create the configuration without the use of Xtreme-D’s HPC templates as a means for gaining access.’
‘The tool also provides ongoing status of the job being processed, thus allowing better budget control of spending on the cloud and alerts for action in case resources are not sufficient so that work done will not be lost. Xtreme-D Stargate adds a higher level of data security through hardware-based data transfer. The ease of access and the resulting cost savings help accelerate the use of cloud for HPC workloads,’ comments Dr David Barkai, technical advisor, Xtreme-D.
Shibata also notes that the bare metal infrastructure provides a unified environment that helps to reduce the burden of managing multiple types of computing resources.
‘It is difficult to manage a hybrid cloud approach for an HPC solution. Customers incur a double cost and need to put in twice the effort in order to manage both on-premise and cloud implementations,’ comments Shibata. ‘There is also the significant cost of uploading and downloading data to and from the cloud datacentre. Therefore, the HPC native cloud customer requires a separate workload setup for each environment (cloud and on-premise).’
‘The bare metal approach allows customers to manage a single environment. For example, AI and Deep Learning are best suited for cloud computing, but it is very difficult to size the infrastructure for Deep Learning. Xtreme-Stargate provides elastic AI and HPC-focused infrastructure, and is thus the best solution for both HPC and AI. It addresses both data I/O and general purpose computing,’ adds Barkai.
Serendipitous adoption of cloud services
When OCF set out to create the cloud replication service the idea was fairly simple: to create a service that could be used for disaster recovery and mimic an in-house cluster. The service provides an opportunity to massively reduce downtime and gives users the chance to get their applications up and running again as quickly as possible.
However, as Pancholi notes, by designing a robust service they had created something that could be used for a number of different use cases. ‘Initially the thought was “what if my cluster goes bang?” and that is how we came up with the idea, but actually we started to realise that you can utilise it for testing, or you can utilise it for providing a workspace for a group of users that are new or haven’t necessarily got access rights to your cluster,’ stated Pancholi.
‘In the testing scenario let’s say there is a critical update to a critical piece of software for your cluster. Other than deploying that on your private cluster, you could test that out on the cloud. You can see, as close as is possible, the consequences of that update and if you are happy with it you can deploy it across the main cluster.
‘It’s not going to be 100 per cent all of the time – nothing can be – but it certainly provides a level of assurance that you didn’t have previously.’
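The testing workflow Pancholi describes amounts to a simple gate: apply the update to the cloud replica first, validate it there, and only then promote it to the production cluster. A minimal sketch of that logic, with invented function names and a stand-in validation check:

```python
# Hypothetical sketch of test-on-replica-then-promote: a critical update is
# applied to the cloud replica, validation jobs run there, and the update only
# reaches the production cluster if they pass. The "passes" rule is a stand-in
# for real regression jobs.

def run_validation_jobs(environment, update):
    # Stand-in for real validation: here, any update version >= 2 "passes"
    # when exercised on the cloud replica.
    return environment == "cloud-replica" and update["version"] >= 2

def deploy_update(update):
    # Step 1: exercise the update on the replica, which mirrors the
    # production software stack as closely as possible.
    if not run_validation_jobs("cloud-replica", update):
        return "rolled back on replica; production untouched"
    # Step 2: the update behaved as expected, so promote it to the main cluster.
    return "deployed to production cluster"

assert deploy_update({"version": 1}) == "rolled back on replica; production untouched"
assert deploy_update({"version": 2}) == "deployed to production cluster"
```

The value is in the failure path: a bad update burns some cloud hours rather than taking down the cluster everyone depends on.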
Pancholi outlined other use cases, such as creating a workspace for a group of new users, or providing a testbed for new software or a new architecture. ‘You create a service and put it out there, and then find that people will find novel ways to use it that you hadn’t considered, because it fits a niche that they have,’ adds Pancholi.
OCF’s Dean commented that these kinds of cloud services can be particularly suited to such testbed activities. Even in the short time the service has existed, OCF has already found users carrying out AI and IoT research in this way.
‘You can imagine that with these technologies they are cool and interesting technologies – a bit of a buzzword at the minute – when you are talking to an organisation a lot of people will say they want to do it but you don’t know what the uptake is going to be like until you build something,’ said Dean.
‘With computing resources being pre-built for you in the cloud, you can grant access to that and then you can see how many of your users actually start using these services. Then you can start doing some analysis to find out if it will be viable to bring this service in-house,’ added Dean.