
Advancing computing

In 10 years, the peak performance of high-end supercomputers has increased by roughly three orders of magnitude: 10 years ago we were in the teraflop era, while now we are in the petaflop era. The energy efficiency of supercomputers has greatly improved in terms of flops per watt, but not at the same pace: a high-end supercomputer today typically uses between five and 10 times more energy than its counterpart a decade ago.
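
As a rough, back-of-the-envelope illustration of the gap between the two trends, the round figures quoted above (performance up by roughly 1,000 times, power draw up by five to 10 times) imply the following improvement in flops per watt. The numbers are illustrative only, not measurements of any particular system:

    # Illustrative arithmetic only: the factors below are the rough figures
    # quoted in the text, not measurements of any specific machine.
    performance_gain = 1_000                   # peak performance: ~3 orders of magnitude
    power_gain_low, power_gain_high = 5, 10    # power draw: 5-10 times higher

    efficiency_gain_low = performance_gain / power_gain_high    # ~100x flops per watt
    efficiency_gain_high = performance_gain / power_gain_low    # ~200x flops per watt
    print(f"flops per watt improved roughly {efficiency_gain_low:.0f}-{efficiency_gain_high:.0f} times")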

Such an increase in power consumption, combined with the higher cost of energy in most countries and a greater awareness of environmental impact, has led HPC centres to put a high priority on power efficiency. This can be addressed by implementing more energy-efficient IT equipment and infrastructure.

Regarding the HPC centre infrastructure, the largest energy use was usually that of the cooling system (the second was usually electrical losses in uninterruptible power supplies (UPS) and transformers). This was the first motivation to change the way supercomputers are cooled in order to reduce the total cost of ownership (TCO). The second motivation is the increase in power consumption per rack: 10 years ago, 10 kW per rack was a typical value, while nowadays 40 kW per rack is usual. In the future this will probably reach 100 kW per rack, and possibly higher.
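
To see why these densities matter for cooling, a simple sensible-heat balance gives the airflow an air-cooled rack would need. The 12 K air temperature rise across the rack is an assumed value used for illustration only:

    # Rough airflow needed to remove a rack's heat load with air cooling,
    # using the sensible-heat balance P = rho * cp * V_dot * dT.
    # The 12 K air temperature rise across the rack is an assumption.
    RHO_AIR = 1.2       # kg/m^3
    CP_AIR = 1005.0     # J/(kg K)
    DELTA_T_AIR = 12.0  # K, assumed air temperature rise across the rack

    def airflow_m3_per_h(rack_power_w: float) -> float:
        """Volumetric airflow (m^3/h) required to carry away rack_power_w."""
        v_dot = rack_power_w / (RHO_AIR * CP_AIR * DELTA_T_AIR)  # m^3/s
        return v_dot * 3600

    for kw in (10, 40, 100):
        print(f"{kw:>3} kW rack -> ~{airflow_m3_per_h(kw * 1000):,.0f} m^3/h of air")

At 100 kW per rack the required airflow becomes very difficult to deliver through a standard rack footprint, which is one reason why higher densities push towards liquid-cooling.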

In terms of cooling methods, 10 years ago most supercomputers were air-cooled, with free flow of air in the computer rooms. To improve the flow of air, racks were organised to alternate cold and hot aisles. Nowadays, air-cooling is still used but either hot or cold aisles are enclosed in order to avoid re-circulation of air.
Liquid-cooling is used more often, either at the rack level (rear-door heat exchangers) or at the component level (cold plates on the most power-hungry components). In addition, two important trends are gaining momentum: ‘free’ cooling (no chillers) and heat reuse.

There are several challenges: energy efficiency (reduction of TCO), environmental considerations (Green IT) and power density. Another important point for HPC facilities is flexibility: the lifetime of an HPC facility is typically 20 to 30 years, so a facility designed or refurbished today must take into account, as much as possible, the expected requirements of future systems (Exascale and beyond). In addition, the temperature of components is an important point, as operating components at higher temperatures may increase the number of failures and increase power consumption at the IT equipment level.

In this context, many discussions on cooling focus on the components – as mentioned before, air-cooling is still in use, while liquid-cooling is becoming more popular.

For air-cooling, the best ways to organise airflow are still a subject of discussion (hot or cold aisle enclosure). Computer simulations are often used, but measurements are needed to confirm their results, and these measurements often show that the simulations are not as accurate as expected.

For liquid-cooling, rear-door heat exchange is a mature technology that works well for racks up to 40 kW. One benefit is that it enables ‘room-neutral racks’, meaning there is no requirement for computer room air-conditioning. The main limitation of this technology is that it requires inlet water at relatively low temperature – which is, in most cases, incompatible with free-cooling and efficient heat reuse. Direct liquid-cooling of components doesn’t have such drawbacks since it uses liquid at a higher temperature, without impacting the temperature of operation of components (which, in some cases, is lower than with other cooling technologies).
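
The same heat balance, applied to water, gives an idea of the flow a rear-door heat exchanger needs for a 40 kW rack. The 6 K water temperature rise is an assumed design value, not a vendor specification:

    # Water flow a rear-door heat exchanger needs to remove 40 kW,
    # from the sensible-heat balance P = m_dot * cp * dT.
    # The 6 K water temperature rise is an assumed design value.
    CP_WATER = 4186.0     # J/(kg K)
    DELTA_T_WATER = 6.0   # K, assumed rise from door inlet to outlet
    RHO_WATER = 1000.0    # kg/m^3

    def water_flow_m3_per_h(rack_power_w: float) -> float:
        """Water flow (m^3/h) required to carry away rack_power_w."""
        m_dot = rack_power_w / (CP_WATER * DELTA_T_WATER)  # kg/s
        return m_dot / RHO_WATER * 3600

    print(f"40 kW rack -> ~{water_flow_m3_per_h(40_000):.1f} m^3/h of water")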

Implementing a liquid-cooling system involves a lot of plumbing and a close coupling between the facility and the IT equipment, which means a lot of discussion between IT and infrastructure teams, and the implementation of a global system for monitoring and optimisation. In some cases, it may be necessary to combine both cooling technologies: ‘warm’ water for direct liquid-cooling of high heat-production components (like CPUs, memory and accelerators); ‘cold’ water for rear-door heat exchangers to cool other components (like network switches, disks, etc.) that are still air-cooled for (lower) power density reasons.
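
As a purely illustrative sketch of such a combined design, assuming that direct liquid-cooling captures about 80 per cent of the rack's heat (the actual fraction depends on the system), the load could be split between the two loops as follows:

    # Purely illustrative split of a rack's heat between the 'warm' direct
    # liquid-cooling loop and the 'cold' rear-door loop serving the parts
    # that remain air-cooled. The 80% capture fraction is an assumption.
    def split_heat(rack_power_kw: float, direct_capture: float = 0.80):
        warm_loop_kw = rack_power_kw * direct_capture   # CPUs, memory, accelerators
        cold_loop_kw = rack_power_kw - warm_loop_kw     # switches, disks, residual air
        return warm_loop_kw, cold_loop_kw

    warm, cold = split_heat(40.0)
    print(f"warm-water loop: {warm:.0f} kW, rear-door (cold) loop: {cold:.0f} kW")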

In terms of free-cooling and heat reuse: as mentioned before, chillers are normally needed to produce cold (chilled) water. Using warm water-cooling, or even air-cooling when properly designed, makes it possible to use free-cooling – which means no need for chillers. In the warm-water case, the facility water loop is connected directly to heat exchangers in which the water is cooled by the outside air, and possibly by other means (lakes or rivers, for example). In the air-cooling case, outside air is, after filtering, pushed directly into the computer room by fans.
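
A minimal sketch of the free-cooling condition, assuming the dry coolers can only deliver water about 4 K warmer than the outside air (an illustrative ‘approach’ figure), shows why warm-water loops can usually do without chillers while chilled-water loops cannot:

    # Whether free cooling is possible at a given outside-air temperature:
    # dry coolers can only deliver water a few degrees warmer than the
    # outside air (the 'approach'). All temperatures here are assumed examples.
    APPROACH_K = 4.0  # assumed dry-cooler approach temperature

    def free_cooling_possible(outside_air_c: float, required_supply_c: float) -> bool:
        """True if the facility loop can reach its supply temperature without chillers."""
        return outside_air_c + APPROACH_K <= required_supply_c

    # Warm-water direct cooling (e.g. 35 C supply) vs chilled water (e.g. 14 C)
    for supply_c in (35.0, 14.0):
        for outside_c in (-5.0, 10.0, 25.0, 32.0):
            ok = free_cooling_possible(outside_c, supply_c)
            print(f"supply {supply_c:4.0f} C, outside {outside_c:5.1f} C -> "
                  f"{'free cooling' if ok else 'chillers needed'}")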

For most HPC facilities, free-cooling is possible for most of the year. In some cases, chillers are kept for the few days or weeks when free-cooling is not possible. A preliminary study of these conditions is needed when considering free-cooling. Regarding heat reuse, warm water-cooling makes heat reuse (for example, for heating offices) much easier because of the higher water temperature. In the case of air-cooling, this can also be achieved by pushing the air heated by the IT equipment into offices, but it is usually less efficient than using warm water.
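
As a hedged illustration of the heat-reuse potential, essentially all the electrical power drawn by the IT equipment ends up as heat that a warm-water loop can carry away. Assuming a hypothetical 1 MW system and an office heating demand of about 70 W per square metre (both figures are assumptions):

    # Rough heat-reuse estimate: essentially all the electrical power drawn by
    # the IT equipment ends up as heat that a warm-water loop can carry away.
    # Both figures below are assumptions used only for illustration.
    IT_POWER_KW = 1_000              # hypothetical 1 MW system
    HEATING_DEMAND_W_PER_M2 = 70     # assumed office heating demand

    heated_area_m2 = IT_POWER_KW * 1000 / HEATING_DEMAND_W_PER_M2
    print(f"~{heated_area_m2:,.0f} m^2 of office space could in principle be heated")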

Industry is aware of these challenges and provides, or plans to provide, cooling systems suitable for dealing with them. Collaboration and joint work, including R&D and early testing with large sites, make it possible to develop solutions suited to the needs of large sites but also usable at smaller ones.

It is very important to put in place a tool for analysing, monitoring and recording all the operational data of the facility and of the IT equipment. This makes it possible to control and tune existing optimisation strategies and to find new ones (a simple example of such a derived metric is sketched after the list below). It is also worth mentioning the trend towards increasing the temperature in the computer rooms. This leads to savings in terms of cooling but should be considered with care, because:

•  Increasing the room temperature may increase the failure rate of components; and

•  Increasing the temperature of operation of components may increase the power consumption of the components (leading to an overall increase of the power consumption).
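
Returning to the monitoring tool mentioned above, a minimal sketch of one metric such a tool can derive from recorded data is the PUE (power usage effectiveness), the ratio of total facility energy to IT equipment energy. The sample readings below are made up:

    # Minimal sketch of a metric such a monitoring tool can derive from
    # recorded data: PUE (power usage effectiveness), the ratio of total
    # facility energy to IT equipment energy. The sample readings are made up.
    def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
        """Power usage effectiveness over a given period."""
        return total_facility_kwh / it_equipment_kwh

    # e.g. one day of hypothetical readings from the facility and IT meters
    print(f"PUE = {pue(total_facility_kwh=26_400, it_equipment_kwh=24_000):.2f}")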



www.prace-project.eu
