A chilled future for HPC
As the HPC industry collectively approaches exascale, the importance of energy efficiency – and maximising the efficiency and performance of cooling technology – becomes paramount to ensuring that the cost of HPC resources does not become prohibitively expensive.
To meet the ambitious targets for exascale computing, many cooling companies are exploring optimisations and innovative methods that will redefine cooling architectures for the next generation of HPC systems. Here, some of the prominent cooling technology providers give their views on the current state and future prospects of cooling technology in HPC.
What has been the biggest development in cooling technology over the last 5 to 10 years?
Rich Whitmore, president and CEO of Motivair, argues that the migration of cooling systems back to the source – more specifically, to the rack level – is one of the largest changes in cooling technology: ‘This has been driven by ever-increasing server and chip heat densities and the more dynamic, rapid shifts between high and low server workloads on a minute-by-minute basis.’
Tom Michalski, senior FAE at Boston, says: ‘We have been using modular, easy to use, direct contact liquid cooling (DCLC) in partnership with CoolIT Systems over the last 5 to 10 years. The benefits of our solutions, such as the Boston ANNA Pascal – as recently featured as part of a liquid cooled cluster at the ISC Student Cluster Competition – are that we are able to maximise server performance and power efficiency in a variety of environments.
'The Boston ANNA Pascal cluster was situated in a mini-rack within the ISC exhibition room – it was mid-June with temperatures in the 30s Celsius, there was no direct air conditioning, and the cluster competition booths were situated behind floor-to-ceiling glass windows, so you can imagine the cooling challenge that we faced.
'The competition only allowed for 3000W of power – so liquid cooling enabled us to decrease the number of fans per server, which are power-hungry, ensuring that we were sending power to components that would benefit more.’
Peter Hopton, founder of Iceotope, focused on the development of advanced coolants – specifically Engineered Fluorocoolants and Liquid Fluoroplastics – and their impact on the HPC market.
'Large open baths can use about 5 to 10 litres of coolant (usually oil) per CPU, whereas at Iceotope we’ve constantly been innovating in this area: in 2012 we were at two litres per CPU; now we’re at 0.8 litres per CPU, and some projects that use liquid-cooling-optimised electronics, such as our work on the EuroExa project, are at 0.32 litres per quad-CPU node,’ said Hopton.
While this type of cooling solution has been seen as expensive in the past, innovations are expected to reduce this cost considerably. This should bring the technology more in line with established HPC cooling technologies.
‘At these volumes, we can have a low-cost, serviceable, safe system that is Total-Liquid-Cooled, occupies a small footprint and can accept high inlet temperatures that enable the elimination of chiller plant. This brings about a lower cost of infrastructure and a smaller electricity bill,’ Hopton added.
Andy Dean, HPC business development manager at OCF, noted that the HPC industry is ready to move on from the traditional water cooled doors used in many HPC systems to more innovative and, in some cases, exotic cooling technologies such as ‘to-the-node’ water cooling.
Dean stated: ‘In the last 10 years we have started to see the adoption of water cooling. At this stage, the vast majority of our multi-rack systems have got to that point. More recently we have started to see this “to-the-node” water cooling. Birmingham was OCF’s first installation of this technology – or at least the first academic installation of water-cooled nodes.’
CoolIT’s CEO and CTO, Geoff Lyon, commented on the scale of manufacturing reducing the cost of high-performance cooling technology.
‘Since 2010/11, high-volume manufacturing of liquid cooling products has become a mature category, in so far as a few vendors have mastered the ability to reliably manufacture high-quality, low-cost liquid cooling assemblies,’ said Lyon. ‘This has transitioned liquid cooling from a novelty to a commonly accepted category of product now being considered for large scale datacentre deployments that increase efficiency, increase density and enable increased performance.’
How do you help users select the correct technology?
On the topic of recommendations for specific technologies, Michalski states that much of the choice comes down to a customer’s preferences and requirements: ‘Some customers may want to keep the cooling efficiency at the highest level, reusing the heat that is produced by the server components, so they would consider using one of the liquid cooling technologies in their environment. This does involve a higher initial cost for the cooling equipment, which some customers won’t accept, instead choosing the standard air cooling approach if their air-conditioning units can cope.’
CoolIT’s Lyon shared this view on customer requirements but went one step further to explain that these conditions can be appraised through collaborative consultation.
‘Things that play into the recommended approach include the PUE or efficiency goals for the datacentre, the chip-level power density, rack power density, available power to the rack, climate and environmental conditions, heat capture targets, energy re-use, labour expense, existing infrastructure and others,’ said Lyon.
While Motivair’s policy is that customer requirements must come first, Whitmore also argues that expert opinion can help to highlight the need for growth within HPC infrastructure: ‘Each customer needs to evaluate their data centre loads today and forecast where those loads may be in the future.
‘The fact that most data centre operators and owners don’t know where their densities will be in two to five years validates their need for a highly scalable and flexible cooling solution. Cooling systems should be server- and rack-agnostic, allowing for multiple refreshes over the life of the facility.’
OCF’s Dean noted the importance of understanding requirements but also stressed the need to properly prepare for potential upgrades in the future, as many users are employing much denser computing solutions: ‘When I started, we were delivering solutions that were 10 to 15 kilowatts a rack, and this has quickly become 20 to 25 kilowatts per rack. Now we are looking at the next generation of processors, and we are looking at 30 kilowatts a rack, and this is only going upward.’
Dean commented that the possibility for very high-density systems such as GPU based clusters could push this as far as 70 kilowatts a rack in extreme examples. It is therefore important to understand a user’s requirements to determine which technology is right for a particular installation.
‘From the other side, it is down to TCO and the total amount of electricity available to the datacentre. If you have a fixed amount of energy – a total number of amps coming into your building – and you can drive the PUE as close as possible to one, then you have more energy left over to use for the HPC system,’ added Dean.
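Dean’s point about PUE and a fixed power feed can be sketched numerically. The figures below are hypothetical, purely to illustrate the relationship PUE = total facility power ÷ IT power:

```python
def it_power_available(facility_kw: float, pue: float) -> float:
    """IT power available from a fixed facility feed at a given PUE.

    PUE = total facility power / IT power, so IT power = facility / PUE.
    """
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return facility_kw / pue

# Hypothetical 1MW feed: improving PUE from 1.5 to 1.1 frees
# roughly 240kW of extra power for the HPC system itself.
baseline = it_power_available(1000, 1.5)  # ~667 kW for IT
improved = it_power_available(1000, 1.1)  # ~909 kW for IT
```

With the feed capped, every point of PUE improvement converts directly into compute capacity, which is why Dean frames cooling choices in terms of the building’s total electrical budget.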
‘The upfront investment in going for one of these more novel approaches is higher, and that is one of the challenges to adoption, as datacentres are often procured and funded separately from the actual bit that goes in them. Novel designs cost more money, but this is offset by the savings in efficiency over the life cycle of the cooling and data centre infrastructure,’ Dean explained.
‘The best way to maximise efficiency savings and to ensure the lowest PUE requires a comprehensive approach from procuring the datacentre to setup, infrastructure and the choice of cooling technology.’
‘As you get more integrated and start to look at warm water cooling and eventually looking at things you can do with this water coming from the nodes, then it needs a much closer integration across the entire business,’ Dean concluded.
Iceotope’s Hopton explained the choices that users face when deciding to use direct liquid cooling technologies. He argues that the main choice between direct liquid cooling technologies is Partial or Total-Liquid-Cooling, with both having their ideal use cases.
‘Partial-Liquid-Cooling is a good fix for cabinets that are under-filled in existing data facilities, or can be installed into facilities that have a low-density – say 5kW/cab – provision for air cooling. The addition of liquid-cooled infrastructure adds cost and complexity, but can enable the use of spare power in the existing facility by filling the rack,’ said Hopton.
‘Total-Liquid-Cooling is much more adaptable: it doesn’t need a clean room and doesn’t care about airflow – one could argue it’s “room neutral” – so it can be installed into any facility where there is power, even grey spaces or closets. This makes it suited to data centres that have stranded capacity, space or efficiency issues, new builds, modules or edge computing.’
What do you think will drive the biggest change in cooling technology during the next five years?
‘The challenge with all these technologies is that datacentres move a lot slower than hardware does,’ explained Dean. ‘You replace your IT every two to five years. There are a lot of examples where we are delivering this brand new kit into a datacentre infrastructure that is 20 to 30 years old.’
Dean said that a node-based or direct contact water cooling approach is fairly straightforward to integrate with existing data centre infrastructure, so he expects that this will be the technology with a lot of potential in the coming years: ‘In HPC we push everything harder, we use more electricity, we create more heat than other sectors, so whereas we readily adopted water-cooled doors over the last five years, I now see that technology moving into the enterprise [sector] and then HPC starting to move to the next generation of cooling technology such as “to-the-node” water cooling,’ said Dean.
According to Boston’s Michalski, the ‘growing demand for high-density servers and increasing thermal design power of server components, such as CPUs and GPUs, means that standard air cooling is no longer efficient.
'Today’s servers can output more than a few thousand watts of heat, which is then getting cooled by air-conditioning – this is a massive waste of power and not a good thing for the environment. This is the reason why most new datacentres have adopted liquid cooling technologies with liquid-to-liquid heat exchangers. This way heat produced by server components is transferred to the water, which can then be reused for heating the office, which results in lower OPEX and is better for the environment.’
Whitmore highlighted the increasing density of HPC systems as a driving force for change in the future. ‘The next generation of chips, such as CPUs and GPUs, will allow computers to generate immense amounts of data in a fraction of the time when compared to today’s computers,’ said Whitmore.
‘These systems will reject heat at levels never seen in data centres before. The trend to big data will continue to drive the adoption of high-performance computers and other dense IT equipment into pre-existing enterprise and colocation datacentres,’ Whitmore said.
The recent installation of BlueBEAR3, the University of Birmingham’s HPC cluster and part of the Birmingham Environment for Academic Research (BEAR), has demonstrated the benefits of upgrading the cooling infrastructure with significant improvements in the electricity used to cool the system.
The new system has improved cooling energy usage by as much as 83 per cent by switching from air cooling to Lenovo’s NeXtScale direct on-chip warm water cooling technology.
The system takes water at up to 45°C into the rear of the server via heat sinks attached to the CPUs, dual in-line memory modules, I/O and other components. Water returning from the components withdraws heat from the system, rising in temperature by about 10°C in the process.
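The heat a water loop carries away follows from the standard sensible-heat relation Q = ṁ·c·ΔT. A minimal sketch using that ~10°C rise (the 1kW-per-node heat load is an illustrative assumption, not a BlueBEAR3 figure):

```python
WATER_SPECIFIC_HEAT = 4186.0  # J/(kg·K), approximate for water
WATER_KG_PER_LITRE = 1.0      # approximate density of water

def flow_rate_lpm(heat_watts: float, delta_t_c: float) -> float:
    """Litres per minute of water needed to carry away `heat_watts`
    at a given temperature rise, from Q = m_dot * c * delta_T."""
    kg_per_s = heat_watts / (WATER_SPECIFIC_HEAT * delta_t_c)
    return kg_per_s / WATER_KG_PER_LITRE * 60.0

# Removing a hypothetical 1kW per node at a ~10°C rise needs
# only about 1.4 litres of water per minute.
per_node = flow_rate_lpm(1000, 10.0)
```

The modest flow rates this implies are part of why direct on-chip loops can use small-bore plumbing at the node level.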
Simon Thompson, research computing infrastructure architect at the University of Birmingham, explained that three key features helped to achieve this energy saving.
Thompson said that there is an energy saving from moving to water over air: ‘There are no chassis fans in the Lenovo system (except power supplies). This is important as other after-market cooling solutions still require fans to recover heat from (e.g.) voltage regulators, memory, IB card, etc.’
‘The water can be up to 45°C inlet temperature. This means that we do not need chilled water, and we can therefore achieve cooling via dry-air coolers: there is no compressor load required to cool the system. We also find that we rarely need to run all of the air-blast fan units we have, each of which is ~500W with a cooling capacity of 25kW. Operational cooling costs are therefore significantly lower, compared to requiring chilled water systems such as a rear door heat exchanger (RDHx).
‘Lenovo was among the first to be doing x86 cooling with direct cool solutions. Also importantly, it is not a rack-scale only design. Part of the big attraction for us is that we can add compute nodes in a modular manner, which is maybe not as easy with some of the large-scale manufacturers who are working at rack-scale design.
‘I can get a single tray with two compute nodes for a research group and add it to my facility. Lenovo has a TCO calculator, which would be appropriate for other sites to gauge how well it might fit their solution. For a few nodes, it will never be cost effective, but as a modular, scalable system from even just a rack’s worth of nodes, it is a highly competitive solution when looking at TCO,’ stated Thompson.
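Thompson’s fan figures translate into a very small cooling overhead; a quick sketch using the ~500W per fan unit and 25kW cooling capacity quoted above:

```python
def cooling_overhead(fan_power_w: float, cooling_capacity_w: float) -> float:
    """Fraction of the cooled IT load spent driving the dry-air cooler fans."""
    return fan_power_w / cooling_capacity_w

# One ~500W air-blast fan unit per 25kW of heat rejection is a 2% overhead,
# i.e. a cooling contribution to PUE of roughly 0.02 - far below what a
# compressor-based chilled-water plant would add.
overhead = cooling_overhead(500, 25_000)  # 0.02
```

This back-of-the-envelope figure is consistent with the article’s reported cooling-energy saving from moving to warm-water cooling, though the exact saving depends on climate and how often the fans actually run.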
When asked about the ability to recreate this saving at similarly sized facilities, Thompson explained that this would be possible but also recommends that for maximum efficiency saving, users ‘would want to look at how they chill their water’.
‘There is also the potential that they may be able to use the low-grade heat, e.g. Central Heating pre-heat and therefore reduce total energy load more. Alternatively, with a large enough solution, you could look at adsorption chillers to generate chilled water for your RDHx from the heat produced from the system,’ concluded Thompson.