Until relatively recently, the use of high-performance computing (HPC) was dependent upon the ability to attract significant levels of funding to make use of existing resources at large institutions, or indeed for the development of an in-house IT infrastructure.
George Vacek, Life Sciences business development at Convey Computer, summed it up nicely when he stated that there was a time when access to resources was so rare that running a single HPC simulation or set of analyses would be worthy of a paper in itself. These days, scientific computing has become a fundamental part of research projects. Funding continues to be a significant hurdle for many, but an equally pressing matter is usage: not everyone who uses HPC does so continuously, and absolutely no one wants to be faced with, nor can afford, idle cycles. So what are the options for these users?
The answer is that there are several possibilities for meeting intermittent demand – not least of which are HPC-on-demand services, cloud resources, and deskside supercomputers. The latter remains a good choice for many modelling and simulation projects, and there are benefits in companies investing in their own small and unobtrusive system that can be customised to ensure it meets precise requirements. As the complexity of the models and simulations increases, however, it becomes necessary to move to higher levels of compute power. Of course this must then be weighed against the demand for capacity.
Oliver Tennert, director of HPC solutions at Transtec, likens this to the difference between taking a taxi and buying a car: ‘If a customer is only in need of HPC capacity from time to time it makes sense to go into the cloud, but if that demand rises above a certain threshold then it is no longer a cost-effective option. At this point it makes sense to procure your own resources.’ Below this threshold, he added, users should consider cloud bursting as an option. Determining where that threshold lies essentially comes down to calculating the level of hardware, and the amount of usage time, at which owning resources outright becomes cheaper than paying per use.
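Tennert's taxi-versus-car threshold can be sketched as a simple break-even calculation. All of the figures below are illustrative assumptions, not vendor pricing: a hypothetical purchase price, a cloud rate per core-hour, and a running cost per core-hour for owned hardware.

```python
# Break-even sketch for the "taxi vs car" comparison.
# All figures are hypothetical placeholders, not real vendor pricing.

def break_even_core_hours(capex, cloud_rate, owned_opex):
    """Usage level above which owning becomes cheaper than renting.

    capex:       up-front cost of buying the cluster
    cloud_rate:  cloud price per core-hour
    owned_opex:  running cost (power, cooling, admin) per owned core-hour
    """
    if cloud_rate <= owned_opex:
        return float("inf")  # renting never breaks even
    return capex / (cloud_rate - owned_opex)

# e.g. a 50,000-euro cluster versus a 0.10/core-hour cloud, with
# 0.02/core-hour running costs for the owned machine: buying pays off
# only beyond roughly 625,000 core-hours of demand.
threshold = break_even_core_hours(50_000, 0.10, 0.02)
```

Below that usage level the ‘taxi’ wins; above it, the ‘car’ does.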
Companies then need to take a number of factors into account, as Jerry Dixon, business development manager at OCF, explained: ‘Building and operating an HPC system, or even just a few racks, can be difficult. Making a capital investment is fraught with danger for the uninitiated – high performance equipment ages quickly – and maintaining investment in new equipment can be a burden. Moreover, the skills required to build and maintain a server cluster are costly. Such specialist people are difficult to retain too, so how does a department keep them fully utilised and occupied at all times?’
Each of these points must be answered internally and then compared to the cost of on-demand services. If the use of HPC moves beyond the occasional, and the performance of a personal cluster meets demand, then a permanent resource is a sensible choice. It is interesting to note, however, that a number of large vendors, such as Cray and SGI, have pulled away from the deskside market in recent years. Could the growing accessibility of on-demand services account for this shift?
Previously seen as little more than the latest buzzword, the cloud has garnered increasing amounts of attention in recent years. Reflecting this trend, the organisers of the International Supercomputing Conference, ISC Events, launched a new conference devoted to the topic in 2010 – a conference that has demonstrated growth year-on-year. Companies are taking note of this and, according to Oliver Tennert, during the past 12 months Transtec has experienced a significant rise in customer demand for cloud bursting, a service whereby applications are deployed locally and then ‘burst’ into a cloud when additional compute capacity is required.
To meet this demand the company is currently building up its own cloud resources. Expectations are that in the second half of 2013 it will have a fully productive and resalable cloud capacity that will come with some applications pre-installed. Tennert stressed, however, that this market is just beginning to develop, making it incredibly difficult to predict where demand will lie in the future. He does believe that the number of HPC cloud capacity providers will steadily increase and that several consolidation processes will take place, either through companies combining or through specialisation in niche areas of the market.
Bart Mellenberg, the director of Dell HPC for EMEA, believes that cloud bursting is not always the best option, however. ‘It really does depend on the application,’ he said. ‘If a user is running a local fluid dynamics application on 10 servers, but finds a need for the capacity of 20 servers they might feel that the natural choice is to rent 10 servers in the cloud and look at what they have as one big cluster. But that won’t work. At the very least, latencies will become an issue.’
He continued by saying that this model of deployment does, however, work for certain applications, such as risk assessments within the financial industry, and could possibly be adapted and then adopted within certain areas of the scientific community within the next five years. Again, this depends on the computational demands of the applications in question.
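Mellenberg's latency point lends itself to a back-of-envelope sketch. The numbers below are illustrative assumptions, not measurements: a solver that alternates ten milliseconds of compute with a hundred small message exchanges per iteration, running over a microsecond-scale local interconnect versus a millisecond-scale WAN hop to a cloud.

```python
# Why a tightly coupled solver cannot simply span local and cloud nodes:
# every message exchange pays the interconnect latency in full.
# Figures are illustrative assumptions, not measurements.

def time_per_iteration(compute_s, messages, latency_s):
    """Per-iteration time for a solver that interleaves compute
    with latency-bound message exchanges."""
    return compute_s + messages * latency_s

local = time_per_iteration(0.010, 100, 2e-6)    # InfiniBand-class: ~2 us
burst = time_per_iteration(0.010, 100, 20e-3)   # WAN hop to a cloud: ~20 ms
slowdown = burst / local                        # roughly 200x per iteration
```

A loosely coupled workload, with few or no exchanges per iteration, barely notices the difference, which is why the fit depends so strongly on the application.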
One key benefit of cloud deployments is the assurance of the ideal resource for each individual application. ‘Not only might it be beyond an individual principal investigator to purchase personal equipment, it would certainly be beyond them to buy focused resources for each of several tasks that they need to complete,’ said Convey Computer’s George Vacek. ‘Cloud resources on the other hand can provide very high-performance sub-clusters that match each analysis requirement and problem type that a researcher may have.’ He added that if people are just beginning to do analytics, the cloud offers a way of sampling to see whether it would be a useful tool before any long-term investment is made.
Back down to Earth
Cloud computing is not without its drawbacks. As mentioned earlier in this article, should use reach a sustained level, cloud generally becomes more expensive than purchasing outright. For occasional users, one significant cost is licensing. Dell’s Bart Mellenberg commented that the current licensing model means that this expense can easily exceed the cost of the hardware infrastructure of a small HPC cluster. ‘Even the more specific HPC cloud providers don’t offer all the licenses,’ he said. According to Mellenberg, the solution is for HPC cloud providers to speak with licensors to implement a pay-per-use model. While some already do this, the majority do not. The difficulty, he added, is that the main independent software vendors will need to accommodate this model.
Interestingly, Mellenberg commented that there are still not many providers who can offer a cost-effective cloud solution, and so elements of the HPC community are turning to cheaper, generic cloud providers in an effort to reduce expenditure, as some applications, such as Monte Carlo calculations, can run well on generic hardware. The problem is that these clouds often use generic interconnects, rather than something more specialised, like InfiniBand. ‘But there is a lot happening in the interconnect industry at the moment as many companies look to build their own. This will drive down the cost and enable people to move to the more capable interconnects. In the meantime, it is important that the community move away from the use of non-HPC clouds.’
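The reason Monte Carlo workloads tolerate generic interconnects is that each batch of samples is computed independently; the only communication is a single reduction of the partial results at the end. A toy sketch (estimating pi, with sequential ‘workers’ standing in for cloud nodes):

```python
# Toy illustration of an embarrassingly parallel Monte Carlo job:
# each worker's batch is independent, so interconnect latency only
# matters for the final one-off reduction of partial counts.
import random

def hits_in_unit_circle(n_samples, seed):
    """Count random points in the unit square that fall inside
    the quarter circle of radius 1."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n_samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)

def estimate_pi(n_workers=8, samples_per_worker=100_000):
    # In a real deployment each call would run on a separate cloud
    # node; running them sequentially here changes only the wall time.
    total_hits = sum(hits_in_unit_circle(samples_per_worker, seed=w)
                     for w in range(n_workers))
    return 4.0 * total_hits / (n_workers * samples_per_worker)
```

A tightly coupled CFD solver, by contrast, exchanges data every iteration, which is where the generic interconnect becomes the bottleneck.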
Another issue that Mellenberg cited was that many cloud providers deliver hardware and some level of software, but without offering users the level of knowledge needed for running applications on something that can be more complicated than a workstation. He said: ‘Providers may offer a few application licenses, but they won’t necessarily hold users’ hands in ensuring that everything is up and running.’ This brings us to the need for system administrators who can provide an understanding of Linux – as many clusters remain Linux-based – and how applications will run in parallel. Companies’ internal processes need to adapt in order to take advantage of cloud resources.
Migrating jobs to the cloud is a change process – a factor, said Transtec’s Oliver Tennert, behind cloud’s slow albeit steady adoption within the HPC community. He added that being able to analyse the resulting data in the cloud would be an ideal option were it not for the hurdles of network latency and the need to change workflows. Instead, the standard procedure for many users is to transfer the data out of the cloud and then back to a workstation for analysis.
Should the technical challenges surrounding data transfer be solved, however, Tennert believes that workstations will become less important in the industry. He added that this will take at least a few years, but before it happens one further problem needs to be solved where cloud and other such on-demand services are concerned: the fear surrounding security. ‘There is currently no good answer for the secure transportation of data,’ said Tennert. ‘These external resources are shared resources and the danger lies in the fact that the data is not encrypted. This issue is not well addressed and I believe this will be an inhibitor for larger enterprises to migrate to a cloud infrastructure.’
Manolo Quiroga Teixeiro, co-founder and R&D manager at Gridcore, added that whether companies choose to use the cloud or not, security is already a concern as researchers connect to IT resources and interact with data on a daily basis – they upload and download information, search databases, send emails, etc. The main question should therefore be whether the cloud provider can match the policies for security at these companies. ‘Cloud services should be an extension of a company’s own IT environment,’ he said.
Gridcore offers an on-demand service that is based around middleware and deployed on customers’ premises. This is matched to any existing high-performance computing environment, making it easy for users to adapt. Private clusters are available for use for anything between one day and three years, and should a company require increased computing power it can choose to add more nodes. Gridcore’s remote visualisation software aggregates users from any part of the world so that organisations can engage globally; as a result of this aggregation, companies can enforce a policy of ensuring all data is located in Gompute. Because everything is manipulated remotely, and no complete set of data is ever transferred, security risks are minimised.
Jerry Dixon, business development manager at OCF, believes that an HPC-on-demand service is the way forward for many companies. ‘A high-performance server cluster accessible via the Internet enables research teams to access far larger compute power than they could afford themselves. Researchers need only pay for the compute power that they actually use, and they can use the service with confidence that all the equipment is fully up to date. And concerns over energy, cooling and space are transferred to the supplier,’ he said.
Regardless of whether the decision is made to invest in deskside supercomputing or in on-demand services like the cloud, there are companies whose services are designed to ensure optimal use of resources. Univa’s Grid Engine software is a distributed resource management, or batch-queuing, system that orchestrates users’ workloads onto available compute resources, reducing idle cycles. Grid Engine was initially developed and supported by Sun Microsystems as an open-source solution; as a consequence it was easily attainable and had a thriving community discussing how to tune, optimise and integrate it. Fritz Ferstl, CTO at Univa, explained that because of the software’s history, and because its development was sponsored by Sun, it was a production-mature technology from the start rather than the kind of emergent solution that often begins as an incredibly shaky piece of software. Grid Engine evolved further over time and many features were upgraded, making it a cost-effective option for many users.
When Oracle acquired Sun, the open product development ended and eventually the development team and the evolution of Grid Engine moved over to long-term partner Univa. ‘Essentially, the software turns the entire computing infrastructure into a black box,’ said Ferstl. ‘Once users have integrated their applications and workflow, any hardware that’s put behind it – whether it is then upgraded, expanded, re-configured – becomes completely transparent to the end user at the front. They simply submit their jobs as usual and, depending on what is done in the back end of the system, jobs might run better and faster, but the key point is that users don’t need to concern themselves with what’s behind it.’
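What Ferstl describes can be caricatured in a few lines: users submit against an abstract queue, and only the scheduler knows how many slots the back end currently provides. The class and method names below are purely illustrative, not Grid Engine's actual API.

```python
# A minimal caricature of what a distributed resource manager does:
# jobs queue up against an abstract front end, and the scheduler maps
# them onto however many slots the back end happens to provide.
# Names are illustrative only, not Grid Engine's real interface.
from collections import deque

class ToyScheduler:
    def __init__(self, slots):
        self.slots = slots    # concurrent slots in the current back end
        self.queue = deque()  # pending jobs, first in, first out

    def submit(self, job_name):
        """Users only ever see this call, whatever the hardware."""
        self.queue.append(job_name)

    def dispatch(self):
        """Start as many queued jobs as there are slots; return them."""
        batch = []
        while self.queue and len(batch) < self.slots:
            batch.append(self.queue.popleft())
        return batch
```

Swapping the back end (more slots, faster nodes, even a cloud) changes only the constructor argument; the submission interface the user sees is untouched, which is the ‘black box’ point.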
Should a company choose to completely change its infrastructure and move the computing over to cloud, or even if it were to replace the hardware, the same software layers could still be used. ‘Companies are free to deploy new solutions or applications, or take advantage of developing accelerator card technology (e.g. GPUs) without the need to retrain users or administrators. For the end user, very little would change.’
The danger, however, is that software like this is mission critical because it sits between the hardware and applications, and enables the workflow. Ferstl explained that if the software were to fail, the hardware would become useless as no jobs would be running. ‘Much like a conveyor belt in a factory, if the software stops, everything stops.’
A rising number of academic institutions and supercomputing centres are offering on-demand services. Accelerator is one such example.
Offering 200,000 cores of raw compute power, the expanded Accelerator service is being promoted as the most comprehensive and flexible HPC on-demand resource in Europe. Provided by EPCC, the world-class supercomputing centre at the University of Edinburgh in the UK, Accelerator has been created by combining access to HECToR, the UK’s national computing service, with two new systems: an IBM BlueGene/Q, and Indy, a dual configuration Linux-Windows HPC cluster.
Accelerator is aimed at scientists and engineers solving complex simulation and modelling problems in fields such as bioinformatics, computational biology, computational chemistry, computational fluid dynamics, finite element analysis, life sciences, and earth sciences. It has the capability to support a wide range of modelling and simulation scenarios, from the simple modelling of subsystem parts through to the modelling of entire systems. Combined with the ability to run multiple simulations in parallel, Accelerator can dramatically reduce discovery and innovation timescales.
Operating on a pay-per-use model, users gain access to resources via an Ethernet connection, enabling them to have supercomputing capability at their desktop. Unlike cloud-based services, no virtualisation techniques are deployed and the service is administered directly by users through a range of administration and reporting functions. The service is fully supported with an integral help desk, and EPCC support staff are available to help with usage problems such as compiling codes and running jobs.
Addressing the concern of security, Accelerator’s hardware, which includes petabyte-scale data storage, is housed in EPCC’s Advanced Computing Facility, which was purpose-built to provide what EPCC says are the highest levels of physical and online security. Service access is governed by tightly controlled security and access procedures, with users managing all their own data, code and resource usage through unique, secure accounts.