Robert Roe explores the options available to optimise the use of HPC resources
While it is normally the latest application or performance figure that grabs attention, IT organisations must employ software to efficiently manage resources from traditional HPC clusters to OpenStack, big data or even deep learning infrastructure.
As computing systems grow in terms of the sheer number of computing, storage and networking components, resources become increasingly difficult to manage. Ensuring that a cluster is utilised efficiently is a full-time job. However, employing the latest software can reduce the burden of supporting a clustered computing infrastructure – reducing the number of people needed to manage the resource or allowing more science and engineering to be completed by using the resource more efficiently.
The options available for cluster management software are as varied as the types of computing systems on which they can be deployed. Whether you are an academic institution leveraging open-source software due to budget restrictions, or a commercial company paying for software with additional support and maintenance, choosing the right software package can save key resources – including the time of expert staff, as Lee Carter, vice-president of world-wide alliances at Bright Computing, explains.
‘The real premise of why Bright has been successful, and why we do what we do, is really about two things – making things easy and saving you time,’ said Carter. ‘None of us on this planet have as much time as we would like. When I first started working in the UK, at 5pm you might have been heading for the pub and you might not think about work until Monday morning. Well the world has changed and no-one has enough time anymore,’ Carter stated.
Carter explained that, no matter the industry, from academic users with limited budgets all the way up to the largest commercial IT organisations, ‘there is a universal appeal for what we do at Bright.’ Largely, this is because there is a cost associated with managing a cluster – no matter the vertical or user base. In some organisations, academics or smaller IT companies may want to set up an open-source solution to limit the financial investment needed to get the system up and running. However, this cost is then paid in the time of experts who could be better utilised in other areas, as an open-source solution does not provide the support or maintenance that you get with a comparable commercial software package.
‘We find universally, across all of the sectors that we work in, that our clients come back to us and say that they can do more with less people,’ commented Carter.
‘We hear time and time again that we have enabled a lean IT organisation to take on additional tasks because we have automated and taken away some of the mundane mechanical things that you might need to do with your cluster,’ Carter added.
Opening the door to new technology
At Bright Computing, cluster management software and HPC have been at the heart of the business since its inception, when the company was created to commercially support the cluster management solutions supplied alongside ClusterVision, an HPC systems integrator.
‘In the early days it was predominantly academics in the university space, so we have hundreds of university customers all over the planet. A lot of them are here in the UK such as Leicester, the University of Sussex, Greenwich – and a lot of universities have made a commitment to Bright,’ stated Carter.
Over this time the company has seen the convergence of other computing paradigms including big data, OpenStack and deep learning, leveraging its expertise to create a single hybrid platform that spans across all of these technologies managed through the same core software that Bright has developed over many years.
‘The way that we deliver OpenStack and deep learning is on the base foundation of Bright, so you don’t have to throw anything away; you don’t have to build a new infrastructure. Subject to licensing for the big data or the OpenStack piece, you just have to turn it on,’ said Carter.
‘That is really what we are all about – because, regardless of whether you are an academic, government or a commercial customer, you are not trying to figure out how to set up OpenStack. You are trying to figure out how to leverage OpenStack to better service your users and your business.’
Carter gave an example of the UK Met Office, which is using a hybrid clustered environment to support the physical cluster with smaller virtualised environments that could be used for application development or testing. These virtualised environments, while not offering the same performance as the ‘bare metal’ physical cluster, give users more agility as software engineers can set up dynamic virtual environments separate to the main cluster to accelerate application development.
‘It is an example of a mini supercomputer servicing a humongous supercomputer. The main system at the Met Office is a multi-petaflop system with millions of pounds worth of investment,’ stated Carter. ‘We find that happening with a lot of our customers; they want to employ a more innovative, agile platform to do these things because at the end of the day they are trying to get the maximum return on the investment they have made in a system.’
However, this is not the only way that traditional HPC users are using OpenStack to augment their computing infrastructure. Carter gave a second example of a large engineering company that is using Bright software to limit the number of personnel needed to support the computing needs of the company.
‘They started off as an HPC customer but now they use Bright for big data, Bright for OpenStack, and they use Bright for HPC and they do it over multiple clusters,’ explained Carter.
Carter stated that this is largely because of the graphical user interface that Bright provides with its software. ‘They have only got three people who run hundreds of nodes for this commercial organisation. They see the value in Bright because they can manage the entire infrastructure with a very small team,’ Carter added.
Turnkey solutions for cluster management
In March 2017 the Fox Chase Cancer Center in Philadelphia began using Bright Cluster Manager to support a new HPC cluster. The 30-node cluster supports bioinformatics initiatives for Fox Chase’s cancer research programs. Debbie Willbanks, senior partner at Data in Science Technologies, which manages the cluster for the centre, explained that they chose the Bright software solution because of the functionality and ease of use provided by the comprehensive GUI.
‘We evaluated many cluster management tools and Bright was the obvious choice, especially since Fox Chase was transitioning from two different Linux systems,’ said Willbanks.
‘Activities that are difficult in other cluster management tools are easy with Bright, providing a turnkey solution for cluster management. Using a GUI and a few mouse clicks, administrators can easily accomplish tasks that were previously command-line driven and required numerous ad hoc tools. There is now consistency in our approach to any problem – glitches are easy to diagnose and solve,’ Willbanks concluded.
While software can take some of the sting out of managing HPC resources, many academic institutions must turn to open-source software because they cannot afford the licences required for commercial software.
At Durham University the HPC team are taking delivery of the latest hardware addition to their COSMA system. This new system, originally from STFC, is due to go into full production on 1 April as part of Durham University’s Institute of Computational Cosmology (ICC).
The ICC currently houses COSMA 4, which is the main production system for its users, and COSMA 5, the DiRAC-2 Data Centric system, which serves the DiRAC community of astrophysics, cosmology, particle physics and nuclear physics as part of the DiRAC integrated supercomputing facility.
In June 2016 the centre obtained COSMA 6; this system is currently being set up and configured, and is expected to go into service in April 2017. Dr Lydia Heck, senior computer manager in the Department of Physics at Durham University, explained that it is not just on software but on the supercomputers themselves that academic institutions need to save money, as the COSMA 6 system was given to the centre by the STFC.
‘We were waiting for DiRAC-3; we are currently at DiRAC-2, but this new system will be a considerably larger amount of money that is under consideration as part of the “national e-infrastructure”,’ stated Heck.
‘While we are waiting, we do not want the UK to lose the edge with regard to research competition. The system from the STFC was a welcome addition at around 8,000 cores – it is not a small system,’ said Heck.
Heck explained that Durham and the ICC did not need to pay for this new system but they did need to pay for transport, set-up and configuration – a complex and time-consuming task. However, getting the system up and running was not the only hurdle the team had to overcome, as they needed to learn a new workload management system, SLURM, which the team is using for the new COSMA 6 system.
‘The two previous systems, COSMA 4 and COSMA 5, are currently using LSF with a GPFS file system – but the new system COSMA 6 will be running SLURM and using a Lustre file system,’ stated Heck, adding that this has caused some complications – but, ultimately, the university cannot afford to extend the LSF licence to the new system.
‘We hope that once COSMA 6 is in full production we can convert the other two clusters into the SLURM set-up,’ commented Heck. ‘At the moment it is a little more complicated, but that also makes access more complex. The users do not log in to each computer; they access the system via login nodes. Currently we have two sets of log-in nodes; one set is for access to COSMA 4 and COSMA 5, and then we have log-in nodes for COSMA 6.’
HPC at Lawrence Livermore
The SLURM workload manager was first developed at Lawrence Livermore National Laboratory in California around 15 years ago, explained Jacob Jenson, COO and vice president of marketing and sales at SchedMD. ‘The lab was not happy with the resource managers that were available on the market at that time, so they decided to develop their own,’ stated Jenson.
The project originally started with a small team of around seven developers who decided to make their own workload manager through an open-source project under the General Public Licence (GPL). ‘Over the course of about 10 years from 2000 to 2010 the developers went from a team of seven down to two,’ reported Jenson.
At that point the last two developers saw the potential of further development of the software and decided to leave the lab and set up a business of their own. The two developers (Moe and Danny) started the company – SchedMD – which stands for scheduling by Moe and Danny.
‘In the early years, from 2010 to 2012, the company focused primarily on custom development of SLURM. In early 2013 when I joined the company, we started exploring providing commercial support to help fund future SLURM development,’ said Jenson, who explained that although the software is completely open-source, SchedMD does provide commercial support for the platform, helping users to get their implementations up and running as fast as possible. While all users get exactly the same version of the software, paid users get access to support through SchedMD, or one of the partner organisations.
‘SchedMD is the only company that provides level 3 support. There are several companies such as Bull/Atos, HPE, Cray, SGI, Dell, Lenovo, and others that provide level 1 and 2 support for SLURM but for any level 3 issues they all outsource those to us,’ said Jenson.
The paid users help to contribute to further development of the software, which is then made available to the free users through the open-source software model – although setting up advanced features without the help of commercial support can be time-consuming.
Jenson stated that the next version of the SLURM software would introduce a new feature known as federated SLURM, or grid computing, which allows multiple systems to share job allocation.
‘It allows multiple SLURM systems to work together to share jobs. Right now, if you want to submit a job to a system, you have to be on a system and that job goes to the system you are on,’ explained Jenson.
‘With this new feature in place, sites will be able to have all of their systems communicating. Based on how it is configured, the jobs can be routed to the correct systems to meet the organisation’s policy,’ stated Jenson.
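As a rough illustration of the kind of policy-based routing Jenson describes, the short Python sketch below sends each job to one of several cooperating systems. It is not SLURM's federation code or API: the Cluster and Job classes, the route function and the policy itself (pick the eligible system with the most free cores) are all hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    """A hypothetical cluster taking part in a federation."""
    name: str
    free_cores: int
    partitions: frozenset  # illustrative labels such as {"cpu", "gpu"}


@dataclass
class Job:
    """A hypothetical job request."""
    name: str
    cores: int
    partition: str


def route(job: Job, clusters: List[Cluster]) -> Optional[Cluster]:
    """Toy site policy: among clusters that offer the requested partition
    and have enough free cores, choose the one with the most free cores.
    Returns None when no cluster can take the job."""
    eligible = [
        c for c in clusters
        if job.partition in c.partitions and c.free_cores >= job.cores
    ]
    if not eligible:
        return None
    best = max(eligible, key=lambda c: c.free_cores)
    best.free_cores -= job.cores  # reserve the cores on the chosen system
    return best


if __name__ == "__main__":
    # Assumed example systems; names and sizes are made up.
    systems = [
        Cluster("cluster_a", 64, frozenset({"cpu"})),
        Cluster("cluster_b", 128, frozenset({"cpu", "gpu"})),
    ]
    for job in [Job("j1", 100, "cpu"), Job("j2", 32, "cpu"), Job("j3", 16, "gpu")]:
        target = route(job, systems)
        print(job.name, "->", target.name if target else "no eligible system")
```

The point of the sketch is that the user submits from wherever they happen to be, and the choice of system becomes a matter of site configuration rather than of which login node was used. A real site policy would weigh far more than free cores.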
However, as with all factors in HPC management, there is a trade-off between the time spent fixing a problem and the cost of paying for professional support. Many academic centres do not have the resources and so must invest their time to get these systems running.
Commercial companies, by contrast, may want to spend the money to ensure maximum cluster utilisation at all times to maximise profits and return on investment (ROI). For the HPC users at Durham University, there was no choice but to implement an open-source workload manager because they could not cover the cost of licensing LSF on the newest addition to the COSMA HPC set-up.
‘LSF is a good batch system with wonderful features – and we made good use of that – but if you do not have the money you do not have the money and you have to make do with what you have,’ stressed Heck.
‘With SLURM, the benefit in our opinion is that the community is behind its development. It is an active community and many of the supercomputing systems around the world are using SLURM,’ said Heck.
Heck explained that this is important to HPC users, because they do not want to be left ‘high and dry’.
They do not want to invest in a system and then have that system fail, ending up with a lack of support from developers. An active community helps to allay these fears as there are many users invested in the success of the technology.
Members of this community can also help each other, and expertise has started to develop between like-minded users. Heck commented that the ICC was not the only team at Durham using SLURM, as the main Durham HPC facility is also using this technology. Heck’s DiRAC colleagues at Cambridge are also using SLURM, so there is clearly some local expertise being developed among the diverse user community.