Why supercomputers are becoming cool machines
By the time this issue of Scientific Computing World leaves the printers, Europe’s first high-performance cluster to be cooled by total immersion in mineral oil will have started operations at the University of Vienna.
The official inauguration will take place later, in the presence of local dignitaries. But the Netherlands-based integrator ClusterVision, which provided the machine, is already working on another total immersion machine for a customer in the UK. Of a slightly different design, this cluster is due to become operational by the end of the summer.
The energy needs of Exascale, as discussed in John Barr’s article on page 22, are driving a search for high-tech ways of reducing power consumption, including: low-power chips from mobile and embedded applications; redesigning the way data is moved on the chips and in the machine; and rescheduling jobs to be energy rather than compute efficient.
But electricity is expensive in Europe today, and that cost will only rise in future. Such considerations are already driving European supercomputer owners to seek out energy efficiency. Cambridge University’s Wilkes machine scored second highest in the most recent Green500, for example. The sense of urgency is perhaps not so acute in other parts of the world where electricity is cheaper, but even there data centres are being moved to sit adjacent to hydropower or other convenient and cheap sources of power.
Europe does not make its own CPUs, but it does have expertise in embedded processors, which have to draw low power, and this, according to Paul Arts, director of HPC research and development for Eurotech, has led to some fruitful cross-fertilisation. An international company based in northern Italy, Eurotech has a presence both in the embedded-systems sector and in high-performance computing, where its systems have been some of the most energy-efficient on the market. Eurotech Aurora systems took both the first and second slots in the June 2013 Green500 list.
But Eurotech’s approach is not total immersion; rather, over the past six years it has developed a cooling system that delivers warm water to the processor. According to Arts, contact cooling is widely used within the Eurotech group for applications other than HPC, in embedded computing for example: ‘So the competence for cooling and thermal designs is far from new to the group’. But he stressed a further advantage to the technological solution Eurotech had chosen: ‘We want to build a compact machine. Through compactness we can create high speed. With immersion cooling, you have to create space between the components [for the coolant to flow between them] and you have to have something on top of them to create more heat-exchange surface. In our case, the coolant does not have to flow between the components and the cold plates are attached to the components via a thermal interface material that conducts the heat. It is efficient cooling in a smaller space.’
At present, the design point for Eurotech is to obtain ‘free cooling’ to the outside air, which means that the coolant water can be delivered at up to 50°C. ‘In the end, the only temperature that’s interesting is the junction temperature of the silicon itself. It depends on what components we are talking about, but if we are talking about CPU and embedded industrial silicon, we are talking about 105 degrees. For consumer silicon, it can be 100 degrees.’
The headroom can be used not so much for overclocking as to take advantage of the higher-power CPUs available from Intel. He remarked laconically: ‘Overclocking will bring warranty issues’. Eurotech is investigating energy reuse, inside a building for example, but there the water temperature may have to be higher – around 60 to 65 degrees – and ‘you may have to buy extended temperature range components for this.’
Arts stressed the importance of considering power conversion in assessing energy efficiency, and here again, the technology developed by the embedded systems side of the company can offer advantages: ‘Our goal is to make power conversion as efficient as possible – we share a lot of information with our colleagues in embedded. Power conversion in embedded, especially for portable devices, is critical. Many of the power conversion steps on a board – going from 48V to 12V and from 12V to 3.3V and even down to 0.9V – all these steps have quite a bit of inefficiency. So the goal is not only to keep the compute elements cool but also the power elements if we are to reach energy efficiency.’ More efficient power conversion also has the advantage of leading to more compact machines, he added.
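The reason these conversion steps matter is that their losses compound: the efficiency of a chain of converters is the product of the per-stage efficiencies. The short sketch below illustrates the arithmetic with assumed per-stage figures (they are illustrative, not Eurotech's measured values):

```python
# Illustrative cascaded power-conversion losses.
# The per-stage efficiencies below are assumed for the example,
# not measured figures from any vendor.
stage_efficiencies = [
    ("48V -> 12V", 0.95),
    ("12V -> 3.3V", 0.93),
    ("3.3V -> 0.9V", 0.90),
]

overall = 1.0
for stage, eff in stage_efficiencies:
    overall *= eff
    print(f"after {stage}: {overall:.1%} of input power remains")

# Even modest per-stage losses compound: with these figures, roughly a
# fifth of the input power is dissipated as heat before reaching the silicon.
print(f"total conversion loss: {1 - overall:.1%}")
```

With these assumed figures, around 20 per cent of the delivered power never reaches the compute elements at all, which is why Arts treats converter efficiency as inseparable from cooling.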
ClusterVision’s approach to making its machine compact, within the context of total immersion cooling, has been to join forces with Green Revolution Cooling to design the first ever skinless supercomputer – removing the chassis and unnecessary metal parts that would obstruct oil flow, thereby also keeping down the cost of the initial investment.
According to ClusterVision, the solution almost eliminates power consumption for air-cooling – cutting it to just five per cent of a conventional system’s consumption. In turn, this cuts total energy consumption by about half, and also reduces the initial capital outlay by removing the need for equipment such as chillers and HVAC units. Another area of saving is reduced current leakage at the processor level in the submerged solution, resulting in less wasted server power.
‘Power efficiency in high-performance computing is of growing concern, due to technological challenges in the ongoing race to Exascale and, far more importantly, growing concerns on climate change. With this reference, we set the stage for a new paradigm,’ said Alex Ninaber, technical director at ClusterVision.
The ClusterVision machine will be used by Austrian research organisations which are collaborating on the Vienna Scientific Cluster (VSC-3) project. The VSC-3 cluster is designed to balance compute power, memory bandwidth, and the ability to manage highly parallel workloads. It consists of 2,020 nodes based on a Supermicro motherboard, each fitted with two eight-core Intel Xeon E5-2650 v2 processors running at 2.6GHz. The smaller compute nodes have 64GB of main memory per node, whilst the larger nodes have 128GB or 256GB. The interconnect system is based on Intel’s True Scale QDR80 design. Software includes the BeeGFS (formerly known as FhGFS) parallel file system from the Fraunhofer Institute for Industrial Mathematics (ITWM). The VSC-3 cluster is managed using Bright Cluster Manager from Bright Computing.
On the other side of the Atlantic, a different submerged cooling solution has emerged as a ‘proof of concept’ announced by 3M, in collaboration with Intel and SGI. This uses two-phase immersion cooling technology. SGI’s ICE X, the fifth generation of SGI’s distributed memory supercomputer, and the Intel Xeon processor E5-2600 hardware, were placed directly into 3M’s Novec engineered fluid.
According to 3M, the two-phase immersion cooling can reduce cooling energy costs by 95 per cent, and reduce water consumption by eliminating municipal water usage for evaporative cooling. Heat can also be harvested from the system and reused for heating and other process technologies such as desalination of sea water.
In common with the other cooling systems that use a heat-transfer fluid other than air, the 3M technique reduces the overall size of the data centre – the company estimates that the space required will be reduced tenfold compared to conventional air cooling. The partners also believe that their immersive cooling will allow for tighter component packaging – allowing for greater computing power in a smaller volume. In fact, they claim that the system can cope with up to 100 kilowatts of computing power per square metre.
‘Through this collaboration with Intel and 3M, we are able to demonstrate a proof-of-concept, to reduce energy use in data centres, while optimising performance,’ said Jorge Titinger, president and CEO of SGI. ‘Built entirely on industry-standard hardware and software components, the SGI ICE X solution enables significant decreases in energy requirements for customers, lowering total cost of ownership and impact on the environment.’
Beyond the plumbing, the next step will be energy-literate sophistication in the way the cluster runs its jobs. Eurotech believes that it can improve the plumbing by constant engineering improvement, so that it can reduce the cost of a water-cooled supercomputer to less than that of an air-cooled machine – even when cost savings in terms of equipment to reject heat to the outside environment are taken out of the equation. But using expertise from the other side of its business, in embedded systems, Eurotech is starting to provide high-frequency measurement of temperature and power at the nodes within the cluster. With real data on how much power different applications consume and how that power consumption is distributed, software writers will have the opportunity to build applications that can influence the power usage. ‘For management of our nodes, we tend to use low-power processors; and on the indirect tasks of an HPC – for example measuring the temperature sensors – we use very efficient compute modules that we “borrow” from our colleagues [on the embedded side],’ Arts said.
‘If you look at the cross-links we have in-house, then you can see HPC technologies ending up in embedded and the “nano-pc” technologies ending up in HPC’. High-performance computers, he continued, ‘have many sensors on different nodes, and you have to transport that data.’ Building on the cross-disciplinary expertise, Arts believes it will be possible ‘to read out data across the whole machine – to make the developers aware of the energy and temperature across the machine. This feedback is the first step to energy-efficient programming’. The idea of programming for energy rather than compute efficiency is being promoted by many people and, he continued: ‘What we as manufacturers want is to give people tools to work with’.
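The shift Arts describes – from compute-efficient to energy-efficient programming – can be sketched in a few lines. Given measured (power, runtime) profiles for an application at different core counts and clock settings, the most energy-efficient configuration is not necessarily the fastest. All figures below are hypothetical, purely to illustrate the trade-off:

```python
# Hypothetical sketch of energy-aware configuration choice.
# The (power, runtime) profiles are invented for illustration; in practice
# they would come from per-node telemetry of the kind Eurotech describes.
profiles = [
    # (label, average node power in watts, runtime in seconds)
    ("16 cores @ 2.6 GHz", 350.0, 1000.0),
    ("16 cores @ 2.0 GHz", 260.0, 1250.0),
    ("8 cores @ 2.6 GHz",  210.0, 1800.0),
]

def energy_joules(power_w, runtime_s):
    # Energy consumed = average power x elapsed time.
    return power_w * runtime_s

fastest = min(profiles, key=lambda p: p[2])
greenest = min(profiles, key=lambda p: energy_joules(p[1], p[2]))

print(f"fastest:  {fastest[0]} ({fastest[2]:.0f} s)")
print(f"greenest: {greenest[0]} "
      f"({energy_joules(greenest[1], greenest[2]) / 1000:.0f} kJ)")
```

In this made-up example the down-clocked run takes 25 per cent longer but consumes the least energy overall – exactly the kind of decision that becomes possible once developers can see power and temperature data across the machine.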
Arts concluded: ‘What I also want to say is that we have a big research community in Europe that is focused on energy-efficient high-performance computing. As an integrator of the hardware, I see myself as an enabler of the community, so it can push the limits. I think this is possible within Europe and we are very strong. With European projects – such as Prace – we are able to bring new technologies into this field. Europe has a very strong team working towards energy efficient solutions. I am very proud of that.’
Taking up the intelligent power challenge
The demand for intelligent power management, monitoring, and control is growing as data centre power consumption continues to rise. Adaptive Computing’s new Moab HPC Suite – Enterprise Edition 8.0 – introduces several power management capabilities that enable HPC administrators to achieve the greatest efficiency. Two such features are power throttling, and clock-frequency control.
With power throttling, Moab manages multiple power management states, allowing administrators to place compute nodes in active, idle, hibernation and sleep modes, which helps to reduce energy use and costs. Moab also minimises power usage through active clock frequency control, which provides greater control over processor consumption and memory workloads, enabling administrators to create an equitable balance that saves energy and maintains high levels of performance.
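The logic behind this kind of power management can be illustrated with a generic idle-node policy: the longer a node sits idle, the deeper the power state it is placed in. The states and thresholds below are illustrative only – they are not Moab's actual API or defaults:

```python
# Generic sketch of an idle-node power-state policy of the kind a workload
# manager such as Moab can apply. State names and thresholds are assumed
# for illustration, not taken from Adaptive Computing's documentation.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    idle_minutes: float
    state: str = "active"

def choose_state(idle_minutes: float) -> str:
    # Progressively deeper (lower-power, slower-to-wake) states.
    if idle_minutes < 5:
        return "active"
    if idle_minutes < 30:
        return "idle"        # clocks reduced, quick to resume
    if idle_minutes < 120:
        return "sleep"       # suspend-to-RAM
    return "hibernate"       # suspend-to-disk, lowest power

nodes = [Node("n001", 2), Node("n002", 12), Node("n003", 45), Node("n004", 600)]
for node in nodes:
    node.state = choose_state(node.idle_minutes)
    print(node.name, "->", node.state)
```

The trade-off in any such policy is wake-up latency versus idle power: deeper states save more energy but delay job starts, which is why administrators, not defaults, set the thresholds.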
Asetek’s RackCDU D2C is a ‘free cooling’ solution that captures between 60 per cent and 80 per cent of server heat, reducing data-centre cooling cost by more than half, and allowing 2.5x-5x increases in data-centre server density. D2C removes heat from CPUs, GPUs, and memory modules within servers, using water as hot as 40°C, eliminating the need for chilling to cool these components.
Chilling is the largest portion of data-centre cooling OpEx and CapEx costs. With RackCDU D2C, less air needs to be cooled and moved by computer room air handler (CRAH) or computer room air conditioning (CRAC) units. Further, liquid-cooled servers need less airflow, making them more energy-efficient.
RackCDU is capable of returning water from the data centre at temperatures high enough to enable waste heat recycling. Data centres choosing this option recover a portion of the energy running their servers – further increasing energy cost-savings, reducing carbon footprint, and resulting in cooling energy reuse efficiencies (ERE) below 1.0.
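Energy Reuse Effectiveness (ERE) is defined as total facility energy minus reused energy, divided by IT equipment energy; it drops below 1.0 once more energy is reused than the facility adds in overhead. A worked example with assumed figures (not Asetek data) makes the metric concrete:

```python
# Worked Energy Reuse Effectiveness (ERE) example. All figures assumed.
# ERE = (total facility energy - reused energy) / IT equipment energy
it_energy_kwh = 1000.0
cooling_and_overhead_kwh = 200.0
total_energy_kwh = it_energy_kwh + cooling_and_overhead_kwh

# Hot-water return allows some server heat to be recycled elsewhere;
# assume 65 per cent of the IT energy is recovered as useful heat.
reused_energy_kwh = 0.65 * it_energy_kwh

ere = (total_energy_kwh - reused_energy_kwh) / it_energy_kwh
print(f"ERE = {ere:.2f}")  # below 1.0 once reuse exceeds the overhead
```

With these numbers the ERE is 0.55: the reused heat more than pays back the facility overhead, which is what an ERE below 1.0 expresses.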
The challenge facing the developers of liquid cooling solutions is to make a system that is easy to service, robust, and cost-effective. Other issues include warranty and insurance. Clustered Systems’ design addresses all of these requirements.
The design consists of a series of cold plates made out of multi-port tubing (MPT), brazed permanently into rack-mounted manifolds which distribute coolant through a metering device. The rack holds up to 192 cold plates and can cool up to 200kW.
Heat is conducted via heat risers (aluminium blocks work well) from hot components to a single plane, usually the tops of the DIMMs, and thence to the cold plates through a compliant, thermally conductive interface. The latter compensates for the height and co-planarity variances present in all baseboards.
When a server blade is inserted into the rack, a cold plate slides under the lid. A few turns on a bezel-mounted crank presses the cold plate firmly onto the heat risers.
Traditional data centres use chiller-based systems that use an average of 50 per cent of all data-centre power. CoolIT’s warm-water cooling eliminates or drastically reduces the need for chilled water supply. By using direct-contact liquid cooling (DCLC), the dependence on fans and expensive air conditioning and air handling systems is drastically reduced. This enables over 45kW densities per rack, low power use, and access to significantly higher performance.
Integrating direct contact liquid cooling initially increases the basic server cost; however, this increase is quickly offset by several factors:
- High density solutions require less standard data centre equipment (racks, switches, raised-floor, etc) lowering overall CapEx;
- 25-30 per cent decrease in OpEx thanks to reduced chilled-water requirements when using warm-water cooling. The average ROI for CoolIT’s DCLC system is 0-6 months; and
- CoolIT’s modular and scalable Rack DCLC systems optimise the server environment for a highly efficient data centre and provide immediate and measurable CapEx and OpEx benefits.
Eaton-Williams, part of the CES Group, provides energy-efficient water cooling solutions that help drive down power consumption and drive up performance.
A leading product in Eaton-Williams’ HPC offering is the ServerCool CD6 Cooling Distribution Unit (CDU), which is widely used with ultra-high-density supercomputers. The high-performance, customer-configurable CDU has built-in redundancy and communications capabilities from Modbus through to SNMPv3, SSH-CLI and an HTTPS web server. It is used by eight of the world’s top 10 HPC manufacturers.
Designed to minimise energy use, the CDU rejects 305kW of heat using only 4kW of power in less than a square metre of floor space. End customers include many world-class universities as well as leading scientific and financial institutions running high-density applications.
Eaton-Williams’ data centre cooling portfolio offers energy efficiencies, lower emissions, reduced carbon footprints and built-in redundancy, all of which help to maximise ROIs. Eaton-Williams has also pioneered free cooling technologies, including the world’s first zero-carbon data centre in Iceland for Verne Global.
LiquidCool Solutions (LCS) is a technology development firm with patents surrounding cooling electronics by total immersion in a dielectric fluid. LCS technology can be used to cool electronics of any shape and size. For rack-mounted servers the dielectric fluid is pumped from a central station through a manifold mounted on the rack into sealed IT devices, flooding each chassis and slowly flowing over and around the circuit boards and internal components via directed flow. Once the coolant exits the enclosure it is circulated outside the data centre, where the heat is captured for commercial reuse or rejected to the atmosphere by a commercially available fluid cooler.
LCS licenses its IP to OEMs looking for a cooling solution that saves energy, saves space, enhances reliability, operates silently, and can be surprisingly easy to maintain in the field. That LCS cooling technology can dissipate 100 kilowatts per rack, and costs less, is an added benefit.
Calyos provides advanced two-phase cooling solutions for high-performance computing servers. Using a breakthrough passive and energy-free capillary pump, Calyos solutions transfer the wasted heat outside the server, thanks to the full vaporisation of a low-pressure two-phase fluid circulating inside a sealed circuit. These solutions significantly outperform full liquid cooling whilst having the same form factor, contrary to immersion cooling.
Adaptable to liquid- or air-cooled racks, these silent, high-efficiency solutions reduce the energy requirements of HPC infrastructures. Fans and pumps are no longer required in the server.
In addition, the very low thermal resistance allows the use of high-temperature cooling fluid (>45°C), reducing the need for data-centre chilled water and enabling energy reuse.
The Coolcentric family of rear-door heat exchangers comprises passive, liquid-cooled heat exchangers close-coupled to the rear of the IT enclosure. The heat exchangers are designed to help customers address today’s data-centre challenges of increasing IT loads, aging infrastructure, and the need to expand.
The Coolcentric heat exchangers bring heat removal as close to the heat source as possible, providing the ultimate containment solution and eliminating the need to ‘cool’ your data centre. Coolcentric heat exchangers are flexible, attaching to any manufacturer’s rack; efficient, reducing data-centre cooling energy by up to 90 per cent; and space-saving.
Coolcentric heat exchangers, High Density RDHX-HD (40kW), Standard RDHx (20kW), and Low Density RDHx-LD (10kW), can provide sensible heat removal, cooling rack loads from 5kW to 40kW.
Motivair manufactures ultra-high-efficiency Chilled Doors for HPC applications. These active rear doors remove up to 45kW from a 42U x 600mm rack, using variable-speed fans, a water control valve, and PLC controls with remote communication. This performance is achieved using 65°F water while maintaining a 75°F room temperature (intake and discharge air).
Using 65°F chilled water eliminates any possibility of condensation or the need for condensate pumps. The total Chilled Door fan power requirement of 840 watts at 45kW capacity is more than offset by the reduction in server fan power, independently measured at 1kW by a major server manufacturer in its test lab. There is therefore a small net power reduction for an HPC rack fitted with a Motivair Chilled Door.
The use of 65°F chilled water also reduces the power cost of a dedicated chiller by 30 per cent, resulting in an extremely efficient system. Any building’s central chilled-water supply can be tempered from 45°F to 65°F by using a Motivair cooling distribution unit (CDU).