Bathe your computer in baby oil
One path to better energy efficiency in high-performance computing (HPC) appears to lead back to the future. In the early 1980s, when Seymour Cray was designing the iconic Cray 2, he tackled the issue of heat generation by submerging the electronics in liquid. Today, 30 years on, several companies are offering modern versions of liquid cooling in an effort to improve both the compute performance and the energy economics of modern HPC.
Cray chose an inert fluorocarbon coolant, 3M’s Fluorinert. Today a British company, Iceotope, is offering convective cooling with 3M’s environmentally friendly Novec, while two US companies are cooling computers with a liquid not dissimilar to baby oil.
The idea that future computers may be bathed in baby oil may seem surprising, but more surprising still is that concerns about electricity consumption appear to be relatively recent. Until the past five years or so, the concern of most supercomputer centres was to squeeze the maximum number of flops they could out of their machines. Even today, efficiency in HPC is often regarded as a matter of keeping utilisation levels at or above 90 per cent, regardless of the power cost. This opens the way to a software-based approach to energy efficiency, whereby scheduling software can shut down nodes that are not doing any processing or run jobs so as to minimise power consumption, and application software too can be written to make the most efficient use of the machine.
But in the longer term, the route to energy efficiency in HPC may lie in switching the underlying chip technology. Nvidia’s GPU technology has shown how unconventional chips can find an important place in HPC, setting a precedent for inherently energy-efficient chips, available cheaply as commodity items, such as those used in mobile phones or tablet computers. Thus, instead of HPC being the trailblazer of new compute technology, it may be the beneficiary of technology developed for the mass consumer market.
Nonetheless, Richard Barrington, business development director of UK-based Iceotope, warns there is a risk that as chips become more efficient, people will buy more, increase the density of their systems and end up consuming as much, if not more, power. Currently, he said, the industry ‘spends vast amounts of money blowing air at the problem’. At Iceotope, however, ‘We have taken a long hard look at this and done a lot of work using CFD to understand how could we use a liquid more efficiently – “harvesting” the heat rather than disposing of it. We recognise that heat, potentially, has a value. So, rather than see it as a waste product, if we think about it holistically, we can think of heat as an asset to be redeployed.’
A distinctive aspect of the Iceotope approach is its modular design, he said. ‘We stick the PCB, the motherboard, inside a sealed container and inside this box we have the 3M Novec liquid which hyper-convects the heat.’ One of the problems with the old Cray design was that its heat transfer properties depended on the fluid boiling off. In contrast, Iceotope’s fluid ‘does not phase-change into a gas; we are working in a liquid form,’ according to Barrington.
Keeping the circuit totally immersed in a liquid avoids the problem of thermal shocking that can arise in the traditional cooling paradigm as cold air is blown over hot processors. ‘By immersing the motherboard into Novec we are looking to maintain a constant temperature and present the heat to the outside casing efficiently and this reduces the server energy load by 20 per cent as you have no fans,’ he continued. ‘We also provide a cabinet with a separate sealed water system and, using CFD, we are trickling the water using gravity to a heat exchanger at the bottom of the cabinet.’ The heat is then available to be reused as office heating or to be rejected to the environment via passive cooling.
The beta systems, fitted out with lots of sensors, are at academic facilities for final testing. Iceotope will start shipments to customers from the beginning of September. Barrington expects the system to cut the overall energy bill in half, but adds that there are also savings in infrastructure costs ‘so you can buy more compute for the same budget’.
Bathing in oil
Christiaan Best, CEO of Texas-based Green Revolution Cooling, also stressed the importance of the cooling infrastructure. ‘If you think about it, you have a lot of really smart people designing computers and motherboards and local contractors doing the cooling system. It’s quite a juxtaposition when you have them both in the same room.’ He noted that studies have found that 45 per cent of the cost of a data centre has nothing to do with the computing and believes that there has been ‘a race to the bottom’ on the infrastructure side. Whoever can build the cheapest gets the contract – but, he added: ‘The technology is unchanged for 50 years and that can’t be sustained. There is so much going on, on the computing side; and there is so little going on, on the other side. It’s a question of not if, but when people switch to liquid cooling’. He too has the larger goal of finding a way to recapture that heat, citing one installation in Sweden where the plan is to repurpose the ‘waste’ heat to provide heating for the building – ‘it is all being repurposed. That’s the future.’
In contrast to the modular approach adopted by Iceotope, Green Revolution Cooling submerges the entire server in liquid: ‘It’s quite literally: take your server out of the cardboard box, replace the thermal paste between the CPU and the heatsink with a foil. Remove the fans. Remove the spindle-type hard drives. And dump it right in.’ In some cases, he noted, foil rather than thermal paste is already in use, eliminating that step. The system also necessitates the use of solid-state drives.
The working fluid is also different: ‘It’s safe to call it baby oil – it’s very, very close but doesn’t have the same fragrances. If it’s good enough for your child, it’s good enough for your computer.’ The fluid has to be pumped round – it is forced convection. Similar oil-based systems have been used in power electronics for many years, he said, and its dielectric strength is better than that of air. ‘Removing heat can be done in many ways – you can pump it into your district heating. In Texas, you pump it into warm to hot water and have an evaporative cooler’.
Green Revolution Cooling shipped its first system to the Texas Advanced Computing Center in 2010 and now has installations at five of the Top 100 supercomputing sites. But despite this, Christiaan Best said: ‘There are people who are not looking to save money or energy. There are some purchasers for whom energy efficiency is not the highest priority. We can provide 60 per cent cheaper facilities and halve the power costs, but there are some people who will look you straight in the eye and say “we don’t care”.’ He is working on a timeframe of five to 10 years before such systems are widely adopted: ‘Two per cent of US electricity is associated with data centres, and half of that is due to cooling. It’s not if, but when.’
Chad Attlesey remembers the problems of working with the liquid-cooled Cray 2 – there was a risk that the coolant would evaporate too fast and arcing would trigger plasma fires: ‘What I did in my basement was to look at single-phase solutions that are very easy to maintain.’ He too uses a non-volatile oil: ‘The liquid is food-grade, so you can drink it – although you don’t go too far from the bathroom afterwards. It biodegrades in the environment – it’s a green solution.’
Now president and chief technology officer of Hardcore Computer, based in the city of Rochester, Minnesota, Attlesey also believes it will take time for the significance of the new generation of liquid cooling technologies to get through to everyone. ‘Liquid cooling is far more effective than air cooling [but] we have something that is considered a disruptive technology. We are nibbling round the edges.’ In his view, the energy efficiency case is compelling: ‘We can take 80 per cent out of the cooling cost of the data centre’. But he stressed that the capital cost of building the data centre is also reduced – much less specialised infrastructure is required as there is no need for raised floors, for example. In addition, it is possible to get higher densities – ‘you can put equipment closer together and components closer together. These combine to allow for a much smaller, much less expensive data centre.’
Attlesey also cited improvements in compute speed: ‘Latency is an issue in HPC and in banking and finance.’ With more conventional cooling, ‘I can overclock the processors, but I sacrifice reliability and energy efficiency’. With liquid submersion cooling: ‘you get everything – we can overclock processors because we can keep them cool. We get much better mean time to failure, and you get energy efficiency and other side benefits.’
In fact, he said, ‘There is such a long list of value propositions we have a hard time focusing on what is most important to a given customer. There is no concern about humidity or environmental controls when you have liquid cooling, so you can do installations in harsh environments. Fire suppression is already built into the system. We’re using standard FR4 circuit boards off the shelf – we have done years of material compatibility testing.’ Attlesey believes that he can even run submerged cooling of spindle drives – ‘I have a patent on that’ – but there has been no significant customer requirement for it. Another issue is what you do with the heat: his system pumps the liquid, but the energy consumption ‘is a nit compared to the use of fans. Here in Minnesota, it gets cold in the winter time, so we can reuse the heat for office in-floor heating to get additional efficiencies out of the system.’
He is reluctant to make general claims for Hardcore’s system because he believes that assessments have to be made on an installation by installation basis, but he was willing to cite potential savings of 80 per cent or more in cooling costs and, because ‘these are half or more of data centre costs, there are savings of up to 50 per cent in running costs. Hopefully soon everyone will see the light – this will truly be a green solution and they get a lot more performance in the end.’
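The arithmetic behind those claims is straightforward to check. A minimal sketch, assuming (as Attlesey suggests) that cooling accounts for half or somewhat more of running costs and that immersion cooling removes 80 per cent of the cooling bill:

```python
# Back-of-the-envelope check of the savings cited in the article.
# Assumed figures: cooling is 50-62.5% of running costs ("half or more"),
# and liquid submersion cuts the cooling portion by 80%.

def overall_saving(cooling_share, cooling_cut=0.80):
    """Fraction of total running costs saved by cutting the cooling bill."""
    return cooling_share * cooling_cut

for share in (0.50, 0.625):
    print(f"cooling at {share:.1%} of costs -> {overall_saving(share):.0%} overall saving")
```

With cooling at half of running costs the overall saving is 40 per cent; the quoted ‘up to 50 per cent’ corresponds to cooling being roughly five-eighths of the bill, which is consistent with ‘half or more’.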
More efficient memories
Bathing computers in liquid, whether Hardcore’s or Green Revolution’s baby oil or Iceotope’s refrigerant, may be disruptive technology, but it is surprising to hear a major multinational vendor such as Samsung stress that it too needs to work hard to get the message of energy efficiency across.
At the beginning of May, Samsung held its second Semiconductor CIO Forum in Europe during which, among other things, it introduced new developments in its Green Memory initiative – in particular its 20-nanometer class DRAM technology in combination with advanced enterprise SSD. In many commercially important computing applications, such as virtualisation and web hosting, memory utilisation ‘is very power sensitive’, according to Peyman Blumstengel, strategic business development, who specialises in Green Memory for the company. The CPUs and graphics tend to be larger consumers of power in HPC applications, he added, but even here ‘HPC sets high expectations in the power consumption of memory. Space and power consumption matters. In HPC, every Watt saved in memory can go to the CPU; every Watt they can save matters to them.’
He noted: ‘Samsung is at the beginning of the whole value chain going from component provider to the end users. We have a technology leadership that we need to translate into a language so that the end user says “Yes, that is something I can experience in my own applications”, in terms of time or cost savings.’ Samsung is working with its partners, he said, to show how the investment the company has made in green memory can be passed on to end users.
Thomas Arenz, Samsung’s associate director for marketing communication EMEA, added: ‘We are able to offer power savings on the memory side of 40 to 67 per cent. The second thing is cooling, getting rid of the heat. We are able to take 30 per cent of that pain out of the box, because we generate 30 per cent less heat.’ But he too felt that there was ‘some kind of awareness problem in getting this information on the radar screen of the people who are deciding the power specification of the systems.’ In many universities, for example, the end users do not even see the power bill. ‘The awareness simply isn’t where it should be,’ he said. Like others, he predicted that exascale developments would bring energy efficiency centre-stage and Samsung will be presenting ‘what we think is possible on the memory side on the path to exascale’ at ISC’12, the supercomputing conference in Hamburg in June.
Towards the middle of May, at its GPU Technology Conference in San Jose, California, Nvidia announced the latest in its Tesla product line for high-performance computing. The Kepler 20, which will be available later this year, offers a three-fold improvement in compute performance per Watt over earlier generations. Sumit Gupta, senior director of Nvidia’s Tesla GPU Computing HPC business unit, predicted that the performance per Watt of the new Kepler chip meant that it would be possible to build a 1 petaflop machine with just 10 racks of servers and a power consumption of 400 kW. He said: ‘This puts it within the reach of any university in the world – for their general computing needs.’
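Gupta’s two figures imply a system-level efficiency that can be derived directly (the 1 petaflop and 400 kW numbers are his; the rest is simple arithmetic):

```python
# Implied efficiency of the hypothetical 10-rack petaflop system described.
peak_flops = 1e15       # 1 petaflop/s
power_watts = 400e3     # 400 kW
gflops_per_watt = peak_flops / power_watts / 1e9
print(f"{gflops_per_watt} GFLOPS per Watt")  # 2.5
```

That works out to 2.5 GFLOPS per Watt for the whole system, which gives a sense of the scale of the three-fold generational improvement he cites.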
Nvidia has been a trailblazer for alternative chip architectures being considered for HPC, according to Simon McIntosh-Smith, head of Micro-Electronics Research at Bristol University in the UK. He is one of the driving forces behind the Energy Efficient High Performance Computing (EEHPC) network. This grouping of researchers and developers in academia and industry wants to apply technology from the embedded space to high-performance computing. The processors in mobile phones and tablet computers have always had to be energy efficient, so as not to drain the batteries in hand-held devices.
With Linux now pervasive, according to Dr McIntosh-Smith, ‘it is easier today for people to move from one architecture to another than it has been in the past 20 to 30 years. And what’s the obviously fast-growing space with lots of engineering effort and development going into it? Mobile processors. We’re all buying our smart phones and our tablets, so there are billions of energy-efficient, faster and faster processor chips.’ He cited the European Mont-Blanc project to build a supercomputer using ARM processors, but noted that there are lots of other, smaller projects going on to try other things out, including a PRACE project employing Texas Instruments DSP processors.
Using such chips will require massive parallelisation, which has implications for legacy software, but Dr McIntosh-Smith sees this as a quantitative rather than qualitative issue: ‘The software question is massive and mostly unanswered regardless of any hardware solution we can find for the future.’ Even were it possible to build energy-efficient supercomputers in the future using entirely x86 architectures, ‘we are still looking at millions and millions of cores – that will be the least parallel system we will have to deal with.
‘At the other end of the extreme, systems that are entirely GPU with no host – so very, very lightweight cores – might have one or two orders of magnitude more parallelism. Most people are expecting some hybrid, but we have got to work out how to harness millions of cores – going lightweight makes that only slightly harder.’
But will novel architectures really make a significant impact on energy efficiency in supercomputing in the short term? Dr McIntosh-Smith noted: ‘One of the first companies to be doing 64-bit ARM chips is AppliedMicro in the USA. They are doing a server chip with four heavyweight 64-bit ARM cores – this is a serious quad-core server chip, with multiple memory interfaces and multiple I/O ports. The whole chip running flat out consumes 10 to 12 Watts.
‘That’s about an order of magnitude more efficient than a quad-core chip you can buy today. It feels like there is a factor of 10 to be had already from the first attempt at using these more efficient architectures. They are calling their chip X-Gene and are planning to have first hardware towards the end of this year.
‘I think the next Mont-Blanc stands a chance of being in the Top500. It may not be high in the Top500 and it might be a prototype, but there’s a good chance it will be there within 12 months.
‘We are then looking for major systems vendors to get behind the technology in a serious way, so in several years perhaps,’ he concluded.
Software solutions to energy efficiency
One of the simplest ways to improve energy efficiency is to switch off idle nodes. Accordingly, Altair’s PBS Works middleware has been providing a ‘Green Provisioning’ power management facility for the past four or five years – but mainly in applications such as animation, which can be ‘bursty’ in terms of its demands on the system, rather than in HPC where managers try to run the machines at 99 per cent utilisation. Recently, however, Bill Nitzberg, CTO of PBS Works, has been seeing ‘some opportunities in traditional HPC, so people are paying attention.’
But the issue is one of ‘How do we go from just turning machines off that are idle, to smart predictive scheduling based on power? We have built a second generation version of our green provisioning solution and are deploying it to Government customers with some success.’ The software, he said, ‘plays the game of Tetris with your jobs. With any workload there are holes and normally you will fill those holes with small jobs. But sometimes there will be holes that you can’t fill and these will be of reasonable size and so we enhanced our system to look for backfill holes and turn the nodes off. So the system is running at peak utilisation for its workload.’ He cited one customer whose simulations suggest that it might save about $1,000 a day in power costs.
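PBS Works itself is proprietary, but the ‘backfill hole’ idea Nitzberg describes can be sketched in a few lines. The following is only an illustration, with invented function names and thresholds, not Altair’s code: a node’s idle gap before its next scheduled job is worth acting on when it is long enough to justify a power cycle and no queued job is short enough to fill it.

```python
# Illustrative sketch (not PBS Works code) of backfill-aware power-down:
# nodes whose idle "hole" before the next reserved job cannot be filled
# by any queued job are candidates for being switched off.
# All names and thresholds here are hypothetical.

MIN_HOLE_TO_POWER_OFF = 3600  # seconds; assume a power cycle pays off past 1h

def nodes_to_power_off(now, next_start_by_node, shortest_queued_job):
    """Return nodes whose idle gap cannot be backfilled by any queued job."""
    off = []
    for node, next_start in next_start_by_node.items():
        hole = next_start - now
        # A hole is unusable if even the shortest waiting job will not fit,
        # and worth acting on only if it exceeds the power-off threshold.
        if hole >= MIN_HOLE_TO_POWER_OFF and shortest_queued_job > hole:
            off.append(node)
    return off

# Example: no queued job runs for less than 2.5 hours, so the 1.5-hour
# and 2-hour holes on n2 and n3 cannot be backfilled; they can be shut down.
idle = nodes_to_power_off(
    now=0,
    next_start_by_node={"n1": 600, "n2": 7200, "n3": 5400},
    shortest_queued_job=9000,
)
print(idle)  # ['n2', 'n3']
```

In the real scheduler the decision would also weigh job walltime estimates and node restart latency; the point of the sketch is simply that the system keeps running at peak utilisation for its workload while unusable holes go dark.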
But as technology develops, he said, ‘we can expect to move from simply turning things off to more interesting stuff: smart scheduling to optimise the use of power – gathering profile data on jobs to assess their power consumption. We want to take that data and run in more optimal ways.’ In mid-April, Altair announced an extension of its partnership with SGI to integrate its power-management scheduler on SGI systems.
Nitzberg concluded: ‘I see our role as to look more holistically and take advantage of the fact that you are running a lot of jobs, and shuffle things around so that we still meet all of your needs. So all your work gets done in the morning, but you use less power doing it.’
Adam Carter, of the Edinburgh Parallel Computing Centre, agrees that having good power management – powering down the bits that are not being used – is important. But he sees a wider role for software beyond middleware and scheduling jobs, so that end-user application software also has a part to play in energy efficiency. He has recently been involved in a research project to build a small-scale parallel computer for data intensive problems – applications that are of increasing importance in scientific computing.
The idea was to balance the flop rate – the speed of calculation – with the I/O rate rather than concentrate solely on the computational speed. ‘If your code is not compute-bound, you are wasting compute time and energy if you’ve got a system with high-power CPUs that are waiting for data from memory,’ he said. ‘My argument would be that, as important as getting a high flops/Watt rate, is to make sure that all the other components of the machine are in balance for the sorts of problems for which you are using the machine. This is more important than trying to maximise the flops per Watt ratio.’
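Carter’s balance argument can be made concrete with a roofline-style calculation. The figures below are hypothetical, not from EPCC’s machine: a code whose arithmetic intensity (flops per byte moved) falls below the machine balance is memory-bound, and the CPUs sit waiting for data.

```python
# Illustrative roofline-style balance check (all figures are made up).
# Attainable performance is capped either by peak compute or by how fast
# memory can feed the CPUs at the code's arithmetic intensity.

def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Peak performance actually reachable at a given arithmetic intensity."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

peak, bw = 500.0, 50.0  # hypothetical node: 500 GFLOPS peak, 50 GB/s memory
data_intensive = attainable_gflops(peak, bw, 0.25)  # e.g. streaming kernel
compute_bound = attainable_gflops(peak, bw, 20.0)   # e.g. dense linear algebra
print(data_intensive, compute_bound)  # 12.5 500.0
```

At 0.25 flops per byte the hypothetical node delivers only 12.5 of its 500 GFLOPS – the rest of the power budget is spent on CPUs waiting for memory, which is exactly the waste Carter describes for data-intensive problems.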
But his conclusion is not entirely optimistic: ‘Our biggest finding was that it’s much harder than you might think. You need to change the software as well as the hardware. There is work to be done in re-writing software so it sits on these architectures.’