Power and the processor
Over the years many proprietary processors, notably Cray’s vector processors, have been designed to deliver the compute power in HPC systems.
This started to change in the 1990s. At the SuperComputing conference in Reno in 1989, Eugene Brooks of the Lawrence Livermore National Laboratory talked about swarms of ‘killer micros’ replacing these expensive but powerful HPC processors. Since then we have seen a slow but inexorable transition away from specialist proprietary processors to clusters of commodity systems using mainstream technologies.
The first TOP500 list (www.top500.org) was published in June 1993, and was dominated by processors from the likes of Cray, Fujitsu, NEC and Convex. Even Intel’s presence on the list was down to the i860, and not the x86 family. Just five years later the transition towards mainstream processors was well under way, with more than half of the systems on the TOP500 list now using Sun’s SPARC, SGI’s MIPS, DEC’s Alpha and IBM’s POWER processors – and there was a single system using Intel’s Pentium Pro.
After a further five years the overall picture was similar, but the main players were Intel with the x86, HP’s PA-RISC and IBM’s POWER processors. But vector processors had not gone away, and the DEC Alpha and SGI MIPS still featured. By June 2008 a significant change had taken place in the world of HPC processors, with the list now dominated by the x86 family of processors from both Intel and AMD – these processors powered more than 80 per cent of the systems on the list. In the most recent TOP500 list, published in June 2014 at the ISC conference in Leipzig, Intel powers 427 systems, AMD drives another 31 with their flavour of the x86 architecture, while only 42 systems use alternative processors. Given that the HPC industry has always pushed the limits of what is possible, and adopted weird and wonderful technologies in order to gain a performance edge, the dominance of HPC systems by the x86 architecture is quite remarkable.
The transition that started 25 years ago now seems to be complete – so is it game over for anyone other than Intel in the HPC processor industry? Absolutely not. The game is still very much alive, and the forces that have shaped the industry over the last 25 years – and other issues – will continue to drive change over the next 25 years.
One of the drivers for change that has led to Intel’s strong position in the HPC market has been price – or to be more precise, price/performance. Intel doesn’t just build chips for PCs or database servers and hope that these can be successful in HPC. Over the years many technologies and features have been introduced to the x86 family to make these processors more in tune with the needs of the HPC market. Whereas early x86 processors did not even have a floating point unit, a modern x86 processor looks much like a multiprocessor Cray supercomputer on a chip. A combination of the HPC advances made by Intel (and, to a lesser extent, by AMD), together with the low price driven by high volume, has put Intel in the position it is in today in HPC.
The economics driven by high volume, which have been Intel’s friend in recent decades, may turn out to be its enemy in the future, as very high volumes of processors are today deployed in mobile devices – smart phones and tablets – a market segment dominated by ARM, and not by Intel. An even larger market segment is embedded computing, where more exotic technology is often deployed and x86 and ARM compete with DSPs and FPGAs.
Another driver for change that is now at the top of everyone’s thinking (but was not even on anyone’s radar when the first TOP500 list was published) is power consumption. The first system on the list that provided power consumption information was the Earth Simulator in Japan, which was installed in 2002. It went straight in at the number one spot (by almost a factor of four over second place), stayed there for two years (a lifetime in the world of HPC), and used 3,200 kW of power. This was a particularly power-hungry system: the IBM BlueGene/L system that topped the list in June 2005 drew only 716 kW, while some systems in the top 10 positions on the list used less than 100 kW of power. Fast forward to June 2014 and the top system, China’s Intel Xeon Phi based Tianhe-2 (MilkyWay-2), requires 17,808 kW, with systems as low as 78th position on the list drawing more than 1 MW (1,000 kW) of power. Bearing in mind that a small town requires a 10 MW power supply, it is clear that the trend of HPC systems using more and more power has to change.
Couple the need to reduce power consumption with the mass market economics of processors for mobile devices and embedded computing, and there is potential for future HPC systems to be driven by an evolution of technology that is today deployed outside of mainstream computing and HPC.
What next for Intel?
Intel is well aware of the changes going on in the HPC industry, and is working hard to ensure that it can maintain its strong position in the HPC market.
Future generations of Intel’s Xeon processor will continue to be major components in many HPC systems, but cannot be the only processing technology used if the HPC industry is to respond to the power consumption challenges it faces.
Combining its own developments with technology acquired from Cray and QLogic, Intel has announced its Omni Scale Fabric, an integrated network designed to meet the needs of the HPC community. Omni Scale will be integrated in future Xeon and Xeon Phi processors. The benefits of increased integration (e.g. faster communication, lower power use) are clear, but some HPC users are concerned that they may be locked into an Intel-only architecture.
What about Nvidia?
Nvidia is the current leader in accelerated computing, with its HPC business having grown by 40 per cent in 2013. Hundreds of universities teach CUDA (Nvidia’s programming model for GPUs), and there are hundreds of thousands of CUDA developers, while OpenACC and OpenCL provide alternative programming approaches for GPUs. Nvidia is also building ARM SoCs that include an integrated GPU, and is working in the OpenPOWER consortium to develop more powerful chips for analytics incorporating OpenPOWER processors and Nvidia GPUs. In addition, most vendors planning to deploy ARM in HPC are collaborating with Nvidia.
What about ARM?
For ARM chips to make a breakthrough in HPC it is important that they deliver a significant advantage over Intel offerings in terms of performance per watt, while also delivering on a number of key issues such as 64-bit support and strong floating point performance. A number of products and initiatives are now promoting work towards these goals, including Applied Micro’s X-Gene family, HP Moonshot and Nvidia’s Tegra product family (incorporating ARM processors and an Nvidia GPU).
What about IBM?
IBM has had good success in HPC with its POWER servers and power efficient BlueGene family. In order to meet the needs of future generations of HPC users, IBM has opened up the POWER architecture through licensing it, and related technologies, to members of the OpenPOWER consortium. This move brings a wide range of skills to the POWER architecture, and will result in new processor variants being developed, beyond those that would have fed the interests of IBM on its own. Members of the consortium with a special interest in HPC include Mellanox and Nvidia, while Altera also sees this as being an opportunity to expand the use of FPGAs in HPC.
Are vector processors dead?
Industry trends may indicate that future HPC processors are more likely to be commodity, mobile or embedded components, but NEC believes that there is a future for modern variants of vector processors. The SX-ACE is a modern implementation of NEC’s vector architecture that delivers an order of magnitude better power efficiency compared with the previous SX-9 system. One of the key features of a vector system is fast access to data, and the SX-ACE offers 64 GB/s memory bandwidth per core, which is well matched to the 64 Gflop/s performance of each vector processing core.
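The balance between memory bandwidth and compute can be expressed as bytes of memory traffic available per floating point operation. The short Python sketch below is only an illustration of that arithmetic using the two SX-ACE figures quoted above; the function name bytes_per_flop is hypothetical and is not taken from any NEC material.

```python
def bytes_per_flop(bandwidth_gb_per_s: float, peak_gflops: float) -> float:
    """Machine balance: bytes of memory bandwidth per floating point operation."""
    if peak_gflops <= 0:
        raise ValueError("peak_gflops must be positive")
    return bandwidth_gb_per_s / peak_gflops


# Per-core figures quoted in the article for the SX-ACE.
SX_ACE_BANDWIDTH_GB_S = 64.0
SX_ACE_PEAK_GFLOPS = 64.0

sx_ace_balance = bytes_per_flop(SX_ACE_BANDWIDTH_GB_S, SX_ACE_PEAK_GFLOPS)
print(f"SX-ACE balance: {sx_ace_balance:.2f} bytes per flop")
```

A value of one byte per flop means each core can, at peak, fetch one byte from memory for every operation it performs. Plugging in the figures for any other processor gives a like-for-like comparison.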
What about accelerators?
Accelerators have been used for number crunching since the early days of HPC, but only by a very small percentage of users. The first accelerators included array processors from FPS, Numerix, and CSPI, while more recent devices included IBM’s Cell processor and the Clearspeed CSX600. There are now two dominant technologies in this growing segment, which is driven by the need for increased compute power and a reduced electricity bill. GPUs have seen growing use in recent years as compute accelerators, with ease of use improving through software tools such as CUDA, OpenACC, and OpenCL. Nvidia is the dominant GPU supplier to the HPC industry, with the new kid on the block ironically being Intel, with its Xeon Phi Many Integrated Core architecture that has a lower cost of entry for porting applications (although tuning for optimal performance is not dissimilar to tuning for GPUs). The next generation of the Xeon Phi family will be self-hosting, so the term ‘accelerator’ may no longer be appropriate.
What about more radical approaches?
It may take a breakthrough using radical technology to deliver the low power consuming, high compute power, processors required by future generations of HPC systems. There are several potential technologies that are currently being tried and tested.
Adapteva’s Epiphany multicore architecture offers up to 4,096 cores per chip, delivering 5.6 Tflop/s (single precision) at an energy efficiency of 70 Gflop/s per watt. On the Green500 (a variant of the TOP500 list that is focused on the most power efficient HPC systems), the top system in June 2014 delivered just 4.4 Gflop/s per watt. So although the Adapteva technology is early in its evolution, the approach is one that can deliver significant value as the industry seeks to radically reduce the power consumption of HPC systems.
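To put the two efficiency figures quoted above into perspective, the Python sketch below estimates the power a hypothetical one Exaflop/s system would draw at each efficiency. This is a rough illustration only. It assumes efficiency stays constant at scale and ignores memory, interconnect and cooling. It also sets a single precision chip-level claim against a measured whole-system Green500 result, so the comparison flatters the chip. The function name power_megawatts is made up for this example.

```python
def power_megawatts(target_flops: float, gflops_per_watt: float) -> float:
    """Power in MW needed to sustain target_flops at the given efficiency."""
    if gflops_per_watt <= 0:
        raise ValueError("gflops_per_watt must be positive")
    watts = target_flops / (gflops_per_watt * 1e9)
    return watts / 1e6


EXAFLOP = 1e18                  # floating point operations per second
GREEN500_TOP_JUNE_2014 = 4.4    # Gflop/s per watt, figure quoted in the text
EPIPHANY_CLAIMED = 70.0         # Gflop/s per watt, single precision, quoted in the text

efficiency_ratio = EPIPHANY_CLAIMED / GREEN500_TOP_JUNE_2014
green500_mw = power_megawatts(EXAFLOP, GREEN500_TOP_JUNE_2014)
epiphany_mw = power_megawatts(EXAFLOP, EPIPHANY_CLAIMED)

print(f"Efficiency ratio: {efficiency_ratio:.1f}x")
print(f"Exaflop/s at Green500-leading efficiency: {green500_mw:.0f} MW")
print(f"Exaflop/s at claimed Epiphany efficiency: {epiphany_mw:.1f} MW")
```

Under these assumptions the gap is roughly a factor of 16: a couple of hundred megawatts against something in the low tens.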
An alternative extreme multicore implementation is the MPPA MANYCORE architecture from Kalray. The MPPA-256 processor has 256 cores per chip that can deliver 11 Gflop/s per watt on a single precision matrix multiply. A 3.6 GHz commodity processor delivers half of this performance but uses three times as much power.
FPGAs offer tremendous potential for high performance within a very low power budget, but at the cost of programming complexity. Oskar Mencer of Maxeler Technologies claims that they could build what he calls an ‘Exascale equivalent’ system in just a handful of cabinets. This would be a system tailored for a specific class of applications, rather than a general purpose HPC system.
FPGA manufacturer Altera has been promoting the use of OpenCL to generate host code as well as FPGA kernel code, a process that could make FPGAs more easily accessible to the general-purpose HPC market, although there is still a lot of work to be done in this area. In order to provide more functionality for HPC users, Altera is also developing floating-point arithmetic engines and DSP blocks that can be included in an FPGA-based processor design.
Microsoft has been experimenting with FPGAs to support its Bing search engine. The Microsoft Research Group built a system called Catapult that adds an Altera FPGA with 8 GB of RAM to each of 1,632 standard servers. The FPGAs handle only one part of the search process, that of ranking pages that match a search. The experiment has been a great success, almost doubling performance at an increased system cost of only 30 per cent, and an increased power budget of 10 per cent, so the system will go live in 2015. Microsoft sees this work not as a one-off, but as a demonstration of the potential for FPGAs to deliver cost effective accelerated computing. Even Intel acknowledges the value of FPGAs, by offering a hybrid Xeon/FPGA on a single chip, admitting that for certain tasks an FPGA can outperform a Xeon processor by a factor of 10.
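The Catapult figures quoted above translate into gains in performance per unit of cost and per watt. The Python sketch below works through that arithmetic. It treats ‘almost doubling’ as a factor of exactly 2.0, which is an assumption and an upper bound, so the real gains would be slightly lower. The helper name relative_gain is hypothetical.

```python
def relative_gain(performance_factor: float, resource_factor: float) -> float:
    """Change in performance per unit of resource (cost or power)."""
    if resource_factor <= 0:
        raise ValueError("resource_factor must be positive")
    return performance_factor / resource_factor


PERFORMANCE_FACTOR = 2.0   # assumption: upper bound for 'almost doubling'
COST_FACTOR = 1.30         # system cost up 30 per cent, from the text
POWER_FACTOR = 1.10        # power budget up 10 per cent, from the text

perf_per_cost_gain = relative_gain(PERFORMANCE_FACTOR, COST_FACTOR)
perf_per_watt_gain = relative_gain(PERFORMANCE_FACTOR, POWER_FACTOR)

print(f"Performance per unit cost: up to {perf_per_cost_gain:.2f}x")
print(f"Performance per watt: up to {perf_per_watt_gain:.2f}x")
```

On these assumptions the FPGA-equipped servers deliver up to about 1.5 times the performance per unit of cost and about 1.8 times the performance per watt of the unmodified servers.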
Texas Instruments has been dabbling in HPC for a few years, with its DSPs being deployed in the energy efficient nCore BrownDwarf supercomputer alongside ARM processors. TI also supplies chips that combine 4 ARM cores and an integrated DSP for HP’s Project Moonshot.
While we are looking at radical approaches, the wackiest one worthy of a mention is D-Wave, which builds systems based on quantum computing. This will never be a mainstream technology (after all, it runs at very close to absolute zero, or -273.15 degrees Celsius), but it is able to solve some classes of problem (such as discrete optimisation problems) much quicker than traditional technologies by analysing many potential solutions at the same time. D-Wave systems won’t replace the majority of HPC systems, but as the technology matures it could provide an important component of large scale HPC infrastructures.
It took decades from the first discussion on killer micros to today’s position where commodity processors dominate the HPC landscape, so we should not expect the next technology transition to happen overnight. In the long term, the commodity processors that drive the internet and giant databases are unlikely to meet the compute power per watt requirements of future HPC systems. But will they continue to provide the backbone of HPC systems with support from accelerators? Or will a more radical approach win the day? Programming complexity will make it hard for some of the emerging technologies to make a quick breakthrough to mainstream HPC, but anyone who thinks that Exascale systems built from an evolution of today’s commodity processors and accelerators will be easy to program is fooling themselves.
The majority of today’s HPC systems are clusters of x86 servers, with each processor having a handful of cores. This will not be the case in 10 years’ time – but the question about which technology will be dominant has not yet been answered. Some of the solutions that are attractive from a performance per watt perspective (FPGAs, DSPs, massively parallel devices) are not widely considered today because they are difficult to program. Perhaps the next big breakthrough in HPC will not be in hardware, but will be in software tools that make some of the more exotic, but energy efficient, devices more accessible to the average HPC programmer.