
Energy-efficient supercomputing


With a myriad of technologies available to HPC users, it is not always clear which technology provides the most energy-efficient solution for a given application, finds Robert Roe

By increasing the energy efficiency of a supercomputer, scientists can save huge amounts of money over the total lifecycle of the system. With ever-increasing core counts and increasingly large supercomputers, the drive for more computational power comes at a cost. By reducing the amount of energy spent on running these systems, HPC centres and academic institutions can put that investment into other areas that benefit scientific research.

While there has been lots of excitement about the use of new technologies, both in computing hardware and cooling technologies, the focus for most users is still on making the most of the more traditional resources available to them.

Mischa van Kesteren, pre-sales engineer at OCF, states that when dealing with customers the focus is almost always on the utilisation of a cluster and helping them to fit the technologies around the type of applications they are using, rather than just selecting the most efficient technology on paper. ‘I think the main step with a new build is to understand what your workload is going to look like. Energy efficiency comes down to the level of utilisation within the cluster,’ said van Kesteren.

‘There are definitely more energy-efficient architectures, in general, higher core count, lower clock speed processors tend to provide greater raw compute performance per watt but you need to have an application that will parallelise,’ said van Kesteren. ‘If you look at something like general-purpose GPUs, Nvidia likes to talk about how energy-efficient they are, and that is all well and good if you have an application that can use all those hundreds of cores at once.’

‘You need to understand your application as somebody that is coming into this from a greenfield perspective. If your application doesn’t parallelise well, or if it needs higher frequency processors, then the best thing you can do is pick the right processor and the right number of them so you are not wasting power on CPU cycles that are not being used,’ van Kesteren continued.

The drive for energy efficiency in HPC is clear, as it not only reduces the huge power costs but also provides more scientific output for the same economic input. However, energy efficiency is not always the primary concern when designing a new cluster, as many academic centres will focus on getting the most equipment they can for a given budget that fits into the power envelope available in their datacentre.

‘Ultimately computing is burning through energy to produce computational results. You cannot get away from the fact that you need to use electricity to produce results so the best thing you can do is to try and get the most computation out of every watt you use,’ said van Kesteren. ‘That comes down to using your cluster to its maximum level, but then also making sure you are not wasting power.’

Cooling technology can also play a big role in energy efficiency but some of these technologies require a specific infrastructure or datacentre design that is not available to the average HPC user.

‘I think we have only had one or two instances where customers have tried to retrofit water cooling to a datacentre – and it is definitely possible with the right infrastructure partners – but it is a bit of a headache,’ said van Kesteren.

‘It depends at which end of the spectrum you are looking at but I would say that the majority of our customers don’t have a custom-built datacentre, they are people who have re-purposed the machine room to have general-purpose computing and then decided they want a cluster,’ van Kesteren added.

‘They are often still using things like air conditioning in the server room and just standard air-cooled servers. But we also have a rising number of people using water-cooled systems but then that is almost always back of the rack water-cooled rear doors. We also have a few high-end customers that are using on-chip cooling.’

While technologies such as evaporative cooling and immersion cooling can provide large savings in total power used, reducing the power usage effectiveness (PUE) of the datacentre, they require that an organisation has the resources to design or adapt the datacentre for these technologies. In many real-world scenarios this is just not feasible, so a compromise must be made between the engineering cost of building the infrastructure and the return from increased efficiency.
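PUE is the ratio of total facility energy to the energy delivered to the IT equipment, so lowering it means less overhead spent on cooling and power distribution. The scale of the saving can be sketched with a simple calculation; the loads, PUE figures and electricity price below are hypothetical, not taken from the article:

```python
# Illustrative PUE comparison -- all figures are hypothetical.
def annual_energy_cost(it_load_kw, pue, price_per_kwh=0.15, hours=8760):
    """Annual facility energy cost: IT load scaled by PUE over a year."""
    return it_load_kw * pue * hours * price_per_kwh

baseline = annual_energy_cost(500, pue=1.8)  # e.g. air-conditioned machine room
improved = annual_energy_cost(500, pue=1.2)  # e.g. evaporative or immersion cooling
print(round(baseline - improved))            # annual saving from the lower PUE
```

For the same 500 kW of compute, the facility with the lower PUE spends roughly a third less on electricity, which is the return that must be weighed against the cost of re-engineering the datacentre.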

‘Evaporative cooling is really the bleeding edge, it is what some of the really high-end systems in the Top500 are using and I think immersion cooling would also fall under that category. These are the kind of technologies used by massive datacentres – but, at the lower end, there are people with three to five rack HPC clusters. Ultimately those people are still wanting to run very heterogeneous environments where you cannot be restricted to just using water-cooled nodes,’ stressed van Kesteren.

‘In those kinds of situations, you need to have the flexibility that either a rear-door cooling solution, atmospheric cooling or air conditioning offers.’

Efficiency tools

One way to increase the utilisation of a cluster is to tightly control the number of processors being fed power at any one time. When a system is not running at full capacity, software can be used to help manage the power used by powering down certain sections of the compute infrastructure.

‘If customers come to us and they want to improve energy efficiency based on their current estate, the kind of things you want to look at would be some of the features in the scheduling software that they use which can power off compute nodes, or at least put them into a dormant state if the processor you are using supports that technology,’ said van Kesteren. ‘We would look if they have those kinds of features enabled and if they are making the most of them.’

However, for some older clusters that do not support these features and generally provide much less performance per watt than today’s technologies, van Kesteren argues that there is a real financial incentive to starting over with a more efficient system. ‘In this case, maybe they should think about replacing a 200-node system that is 10 years old with something that is maybe 10 times smaller and provides just as much in terms of computing resource,’ said van Kesteren.

‘You can make a reasonable total cost of ownership (TCO) argument for ripping and replacing that entire old system, in some cases that will actually save money over the next three to five years. Sometimes replacing what you have got is the best option, but I think the least invasive way and the first thing that we would look at with customers is: are they being smart with their scheduling software – are there benefits they can get in terms of reducing the power consumption of idle nodes,’ he continued.
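The rip-and-replace TCO argument can be sketched as a comparison of purchase cost plus power cost over the system's remaining life. The node counts, power draws and prices below are invented for illustration and are not OCF's figures:

```python
# Hypothetical TCO comparison: keep an old 200-node cluster, or replace it with
# a 20-node system of equal throughput. All figures are illustrative.
def tco(capex, nodes, kw_per_node, years=5, price_per_kwh=0.15, hours=8760):
    """Total cost of ownership: purchase price plus electricity over the period."""
    power_cost = nodes * kw_per_node * hours * price_per_kwh * years
    return capex + power_cost

keep_old = tco(capex=0, nodes=200, kw_per_node=0.4)       # nothing to buy, high power draw
replace  = tco(capex=400_000, nodes=20, kw_per_node=0.5)  # new hardware, 10x fewer nodes

print(round(keep_old), round(replace))
```

With numbers in this range the replacement pays for itself on electricity alone within the five-year window, which is the shape of the argument van Kesteren makes.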

There is always a balance between gauging how comfortable users are with trying something new and how much expertise they have in-house, notes van Kesteren. ‘What we often end up doing is providing a training package for people because there are some schedulers out there that handle power management better than others.’

OCF works with the Slurm scheduler because it provides ‘a simple but effective power management functionality’ which allows OCF or its customers to trigger a script when it realises there is a node no longer in use. ‘At OCF we have customised that script to power down or put nodes into a dormant state and that works the other way as well when it needs more nodes and it starts to run out then it can be used to spin up nodes in the cloud,’ said van Kesteren. ‘That is the sort of software that we would guide customers towards because of how flexible it is and the expertise that we have with it because we have found that it works in lots of different environments.’

The functionality that allows these scripts comes out of the box with Slurm, but van Kesteren and his colleagues prefer to customise it to suit an individual customer’s environment and requirements. ‘There are some default scripts in Slurm, but I think it is best to modify them to an extent so that they fit your environment,’ said van Kesteren.
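Slurm exposes these hooks as ordinary configuration options. The parameter names in the sketch below are real Slurm options, but the timings, paths and node names are illustrative, and the suspend and resume scripts themselves are the site-customised part that van Kesteren describes:

```
# slurm.conf excerpt -- power-saving hooks (values and paths are illustrative)
SuspendTime=600                                  # seconds a node sits idle before suspension
SuspendProgram=/usr/local/sbin/node_suspend.sh   # site script: power down or sleep the node
ResumeProgram=/usr/local/sbin/node_resume.sh     # site script: wake the node, or spin up a cloud node
ResumeTimeout=300                                # seconds allowed for a node to return to service
SuspendExcNodes=login[01-02]                     # nodes that must never be powered down
```

The same ResumeProgram hook is what allows the cluster to burst into the cloud when local nodes run out, as described above.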

The wider computing market

In December, Super Micro released its second annual ‘Data Centers and the Environment’ report, based on an industry survey of more than 5,000 IT professionals. While this is not focused purely on the HPC market, the findings highlight that energy efficiency is not always a primary focus.

The results demonstrated again this year that the majority of data centre leaders do not fully consider green initiatives in the growing build-out of data centre infrastructure, which increases data centre costs and impacts the environment.

Responses from IT experts in SMBs, large enterprises, and recognised companies showed that the majority of businesses (86 per cent) do not consider the environmental impact of their facilities as an important factor for their data centres.

Data centre leaders primarily noted TCO and return on investment (ROI) as their primary measures of success, with less than 15 per cent saying that energy efficiency, corporate social responsibility, or environmental impact were key considerations. Some 22 per cent of respondents noted ‘environmental considerations’ were too expensive.

The report also found that almost 9 out of 10 data centres are not designed for optimal PUE. It seems that while there are many novel technologies available to datacentre operators most people setting up a new cluster do not see enough ROI for deploying these technologies unless they are at a large scale or they have the benefit of a data centre that is built with the infrastructure to support them. ‘Within HPC you can pretty much split it into academic environments, which is a large part of our customer base, and commercial environments. A lot of academics, and this is changing with the stance on environmental issues in general, but they don’t see the cost of the electricity,’ commented van Kesteren.

‘They are not billed on it and so historically they have been quite unconcerned. They often think about it in terms of “is this rack going to have enough power supplied to it” but not in terms of maximum power budget and at some stage it is just not cost-effective. That is a much more commercial standpoint,’ he continued. ‘In the IT industry, in general, they have a power budget and they spend it, but energy efficiency is not particularly high up on their list of priorities.’

If more energy-efficient technologies are to see widespread adoption, whether processing technologies such as Arm or innovative cooling technologies, the cost of implementing them must be justified. For example, switching to GPUs or Arm processors could save a lot of money over the total life cycle, but this is offset by the cost of porting existing applications. Similarly, a cooling technology may be more efficient, but if it requires investment in the datacentre there are diminishing returns on that energy saving. Ultimately, it needs to be economically viable to be energy-efficient.

‘The first thing is always “can we afford to buy it” and then after that “can we afford to run it?”,’ said van Kesteren. ‘If you take a processor in isolation, then the most energy-efficient processor designs tend to be the ones with lots of cores that are fairly low powered. But the issue with that, in addition to maybe your application being single-threaded, is that you also tend to lose out on memory bandwidth per core because you are squeezing a lot of cores into one space. GPUs especially suffer because they have really high bandwidth on-card memory but the bandwidth from those processors to the main memory is quite poor.

‘Although you have all these cores and they do not use a lot of power you can end up wasting cycles because processors are waiting for information stored on main memory. This is something that has to be taken into consideration when designing a system with lots of energy-efficient cores. It may not always be the most energy-efficient solution from a holistic standpoint when you take into account the kind of memory utilisation profile of the application you are running,’ van Kesteren concluded.
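The bottleneck van Kesteren describes can be captured with a simple roofline-style estimate: runtime is bounded by whichever is slower, the compute or the memory traffic. The hardware figures below are made up for illustration and do not describe any specific processor:

```python
# Roofline-style estimate: is a kernel compute-bound or memory-bound?
# Hardware numbers are illustrative, not for any specific chip.
def runtime_seconds(flops, bytes_moved, peak_flops, bandwidth):
    """Lower bound on runtime: the slower of the compute and memory terms wins."""
    return max(flops / peak_flops, bytes_moved / bandwidth)

# A kernel doing 1 GFLOP over 8 GB of data (arithmetic intensity 0.125 flop/byte)
# on a chip with 5 TFLOP/s of compute but only 100 GB/s to main memory:
t = runtime_seconds(1e9, 8e9, peak_flops=5e12, bandwidth=100e9)
print(t)  # the memory term dominates: the compute term is 400x smaller
```

On numbers like these the many low-power cores spend most of the time waiting on main memory, which is exactly the wasted-cycles scenario described above.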
