Scientific computing increasingly relies on high-performance computing (HPC) systems to accelerate discovery and obtain results. To get that performance, users are demanding larger HPC systems that are generally denser, to meet space constraints and interconnect requirements, and that consume far greater amounts of power. For example, the fastest supercomputer in 1997 (Sandia National Laboratories’ ASCI Red – 1.8 Tflops) used 0.8MW, while the fastest supercomputer in 2012 (Sequoia of DOE/NNSA/LLNL – 16,325 Tflops) delivers roughly 9,000 times the performance and uses almost 10 times more power, at 7.89MW.
According to this trend, the first exascale supercomputer could require more than 100MW to work, as highlighted in the 2008 DARPA Exascale Study. The increase in power consumption means, in turn, that more and more heat is being generated with less space available to dissipate it. This combination of energy consumed and heat generated poses an efficiency and cooling problem.
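To make the comparison concrete, the short Python sketch below works out energy efficiency in Gflops per watt from the figures quoted above, and the power an exaflop machine would draw at a given efficiency. It takes Sequoia's power draw as 7.89MW (7,890kW); the function names are illustrative only, and the calculation is simple arithmetic, not a model of any real system.

```python
# Figures quoted in the text: (performance in Tflops, power in MW).
# Sequoia's power is taken as 7.89 MW (7,890 kW).
SYSTEMS = {
    "ASCI Red (1997)": (1.8, 0.8),
    "Sequoia (2012)": (16_325.0, 7.89),
}


def gflops_per_watt(tflops: float, megawatts: float) -> float:
    """Energy efficiency in Gflops per watt."""
    return (tflops * 1e3) / (megawatts * 1e6)


def power_for_exaflop_mw(efficiency_gflops_per_watt: float) -> float:
    """Power in MW needed to sustain 1 Eflops at the given efficiency."""
    exaflop_in_gflops = 1e9  # 1 Eflops = 10^9 Gflops
    watts = exaflop_in_gflops / efficiency_gflops_per_watt
    return watts / 1e6


if __name__ == "__main__":
    for name, (tflops, mw) in SYSTEMS.items():
        eff = gflops_per_watt(tflops, mw)
        print(f"{name}: {eff:.5f} Gflops/W, "
              f"exaflop at this efficiency: {power_for_exaflop_mw(eff):,.0f} MW")
```

On these figures, efficiency improved from about 0.002 Gflops/W to about 2.07 Gflops/W, yet an exaflop system built at Sequoia's efficiency would still draw roughly 480MW. Staying within 100MW would take about 10 Gflops/W.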
What is the size of the problem?
The issue of energy efficiency is one of the major obstacles that we will need to overcome on the road to exascale – HPC systems that draw more than 100MW are simply not feasible. At the same time, this problem affects organisations at different levels. The first difficulty is an economic one: not only is the energy demand of servers on the rise, but so is the price per kWh. This means higher bills, which in the largest HPC installations have already reached seven figures.
The second issue is the availability of energy. Megawatt-scale installations require dedicated facilities, such as substations or power generators, to be built nearby. And then we have the issue of cooling. Extracting an enormous amount of heat from HPC systems effectively and efficiently is critical. In most current data centres, the energy required to extract that heat and move it outside the building can account for between 30 and 40 per cent of the total energy budget. Moreover, energy consumption and cooling issues are becoming mainstream problems, with the power density of non-HPC data centres passing the 30kW per rack mark.
What are the solutions?
We have to identify the areas of intervention, from the processor, accelerator or memory up to the entire data centre. In each ‘area’ there is work under way to improve energy efficiency. Companies such as Intel, IBM, Nvidia and AMD, as well as research centres, are all trying to make processors more efficient in terms of flops/watt and are even considering radical or revolutionary approaches at an architectural level. One such approach is to increase the level of parallelism dramatically; another is to use more efficient instruction set architectures (ISAs) by reusing knowledge and solutions developed for mobile, handheld and embedded computers. Other organisations are improving the efficiency of GPUs – Intel and Nvidia, for example – or memory technologies, as in the case of the HMC consortium. Manufacturers like Eurotech are contributing by making systems more efficient for better system and data centre PUEs. Cooling companies are optimising the cooling systems, and data centre design companies are looking for overall optimisation and ideal geographic positioning.
Of course, we must not forget that software plays a part in making a data centre more efficient overall. It is well known that the current sustained performance achieved by HPC applications is very low and rarely exceeds 30 per cent of the peak performance made available by the hardware platform. This implies that future HPC systems and data centres will require a hand-in-hand collaboration of software and hardware to better exploit the computing performance at an affordable cost. The approach should, therefore, be holistic in terms of CPU/GPU balance, cooling, software, etc. in order to gain the maximum energy efficiency and respond to customer needs.
It is also important to break down the energy entering a data centre according to the way it is consumed. Doing so makes clear what percentage of the energy is useful and feeds the IT equipment, and what percentage is used for ancillary services, like cooling.
Understanding the components of energy ‘waste’ can help achieve a better energy balance at data centre level. For instance, we know that in an average data centre 50 per cent of the energy is utilised for cooling. Acting on cooling means impacting a large part of data centre energy consumption. For this reason, public authorities are defining policies and measures that will stimulate good management practices for green data centres – the ‘Code of Conduct’ introduced recently by the European Commission, for example.
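The breakdown described above is usually summarised as power usage effectiveness (PUE): total facility energy divided by the energy delivered to the IT equipment. The sketch below derives a PUE from the share of energy spent on cooling. As a simplifying assumption, everything that is not cooling or other declared overhead is counted as IT load; the function name and parameters are illustrative.

```python
def pue(cooling_fraction: float, other_overhead_fraction: float = 0.0) -> float:
    """PUE = total facility energy / IT equipment energy.

    Both arguments are fractions (0..1) of the total energy entering the
    data centre. Whatever remains is assumed to reach the IT equipment.
    """
    if cooling_fraction < 0 or other_overhead_fraction < 0:
        raise ValueError("fractions must be non-negative")
    it_fraction = 1.0 - cooling_fraction - other_overhead_fraction
    if it_fraction <= 0.0:
        raise ValueError("overheads leave no energy for the IT equipment")
    return 1.0 / it_fraction


if __name__ == "__main__":
    # Cooling shares taken from the figures quoted in the text.
    for share in (0.30, 0.40, 0.50):
        print(f"cooling share {share:.0%} -> PUE {pue(share):.2f}")
```

With cooling at 50 per cent of the total, and no other overhead, the PUE is 2.0: every watt of computing costs a second watt of ancillary services. At 30 to 40 per cent the PUE falls to roughly 1.4 to 1.7, which shows why acting on cooling has such a large effect.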
Which solution will improve cooling?
The best cooling solution ultimately depends on the operators’ specific needs. Some may be willing to accept higher computer room temperatures to reduce cooling costs and handle server faults in software. Others may opt for some form of liquid cooling or may try to use free cooling in cold climates.
Technology availability is another factor that influences choice: while some technologies or techniques are well established, like hot aisle and cold aisle containment, others are definitely less mainstream, like adsorption chillers, fuel cells or geothermal cooling. Ultimately, the chosen method for cooling HPC systems should minimise the use of energy, by consuming less in the first place and by recovering what is consumed, while maximising the system density. It is not easy to find this balance, however.
In the long run, I think that liquid cooling is here to stay and will spread beyond the HPC niche into the general server market within the next three to four years. It’s worth remembering that even though the first car engines were air cooled, nowadays no automotive manufacturer is considering using air-cooled engines. I believe it is only a matter of time before liquid cooling becomes an indispensable ingredient for cost-effective HPC and data centres.