Robert Roe surveys the processor market that underlies the next generation of HPC systems
High-performance computing is on the brink of its biggest transition in two decades. Ever since Thomas Sterling and Donald Becker built the first Beowulf cluster at NASA in 1994, mainstream high-performance computing has tended to consist largely of clusters of server nodes networked together, with libraries and programs installed to allow processing to be shared among them. This represented a major shift away from the proprietary, vector-based processors and architectures that previously had almost defined the term ‘supercomputer’ because, initially at least (and certainly in the classical Beowulf configuration), the servers in the clusters were normally identical, commodity-grade hardware.
Although the hardware may be more specialised today, the clusters are still dominated by the general-purpose x86 processor, in marked contrast to the vector machines that appeared in the early 1970s and dominated supercomputer design from then until the 1990s, most notably the machines developed by Seymour Cray. The rapid fall in the price-to-performance ratio of conventional microprocessor designs led to the demise of the vector supercomputer by the end of the 1990s. So the scene today is of massively parallel supercomputers with tens of thousands of ‘off-the-shelf’ processors.
But now the landscape is going to change again. This time the shift will be driven not only by ‘technology-push’ – in the shape of new processors – but also by ‘demand-pull’ as the nature of HPC changes to more data-centric computing. And there is an external constraint as well: power consumption.
The change in the nature of computing was recognised by the report, in 2014, of a Task Force on High Performance Computing set up by the Advisory Board to the US Department of Energy (DOE), which stated that data-centric computing will play an increasingly large role in HPC. The report recommended that ‘Exascale machines should be developed through a co-design process that balances classical computational speed and data-centric memory and communications architectures to deliver performance at the one-to-ten Exaflop level, with addressable memory in the Exabyte range.’ To further these objectives it recommended setting up a programme to be managed by the National Nuclear Security Administration (NNSA) and the Office of Science (both part of the DOE).
The DOE’s FastForward2 project aims to accomplish this through public-private partnerships to accelerate the development of critical component technologies needed for extreme-scale computing. The project awarded $100 million in contracts to five US companies: IBM was tasked with memory research; Intel, Nvidia, and Cray were all separately awarded contracts for node research; while AMD was given both memory and node research.
Diversity of processors
The change in processor technology, according to Steve Conway, vice president of high-performance computing at the market research company IDC, means that users are now looking to develop more specialised tools for specific jobs. So the general-purpose x86 technology that has dominated HPC for the last 15 years or so is now being joined by accelerators and other, less mainstream technologies such as DSPs and FPGAs.
Roy Kim, group product manager in Nvidia’s Tesla HPC business unit, stated: ‘It’s an interesting time right now for HPC. A few years ago, Nvidia GPUs were the only accelerators on the market and a lot of customers were getting used to how to think about and how to program GPUs. Now if you talk to most industry experts, they pretty much believe that the path to the future of HPC, the path to Exascale, will be using accelerator technology.’ The drive is to find different ways to accelerate the parallel sections of code, as this is seen as the area that can provide the largest gain in performance.
HPC feels the Power of IBM
The move from commodity HPC hardware to a more specialised model is exemplified by IBM’s Power processor. This is a reduced instruction set computer (RISC) processor (the acronym originally stood for Performance Optimization With Enhanced RISC) and so very different from the general-purpose x86. It has a system-on-a-chip design, integrating processors, memory, and networking logic into a single chip. Currently four out of the top 10 supercomputers in the Top500 list are IBM BlueGene machines, all with the Power BQC processor.
Its prominence in HPC is largely down to two factors. The first is IBM’s success in the US DOE’s Coral programme. The DOE brought together several of its national laboratories in the joint Collaboration of Oak Ridge, Argonne, and Livermore (hence the name, Coral) to coordinate investments in supercomputing technologies, streamline procurement, and reduce costs, with the aim of developing supercomputers that will be five to seven times more powerful when fully deployed than today’s fastest systems. Two out of the three systems procured through the Coral programme, costing around $325 million, will make use of a combination of IBM Power architecture, Nvidia’s Volta GPU, and Mellanox’s interconnect technologies.
These partnerships reflect a highly strategic move, one that is particularly clear in the case of IBM and Nvidia. While Nvidia was developing its NVLink interconnect technology, which will enable CPUs and GPUs to exchange data five to 12 times faster than they can today, IBM set to work with Nvidia to integrate the technology with its own CPUs. The forethought and planning that went into this integration demonstrate a partnership that has been in the making for some time.
This exemplifies, in HPC, a second factor in the growth of interest in the Power processor and its associated architecture more generally: IBM’s efforts to promote a wider ‘ecosystem’ of developers as well as other hardware manufacturers around the Power architecture, through the creation of the OpenPOWER Foundation. IBM’s acknowledgement that it needs others in the wider computing community to write software for its hardware perhaps reflects the success of Nvidia, which put an immense amount of effort into building up a community of users of its Cuda language for GPUs. It is significant, therefore, that earlier this year IBM recruited Sumit Gupta from Nvidia, with a specific mission to expand the ecosystem of developers writing applications and software for the Power architecture.
The OpenPOWER Foundation itself was founded in 2013 as an open technical membership organisation. The intention is to open the Power architecture to give the industry the ability to innovate across the full hardware and software stack, to simplify system design, and to drive an expansion of enterprise-class hardware and software.
A third factor, not directly related to HPC as narrowly defined, is that IBM is aiming the Power system at a much wider range of users than HPC by itself. It aims to include commercial and enterprise data centres and hyperscale users (the search engines and social media providers, for example) among the users of the technology. So there will be a wider user base.
Intel and ARM
One significant driver for change in the current HPC market is energy consumption – an area where it is hard to compete with Intel. The economics of Intel’s advantage are largely due to the sheer volume of its processors installed in desktops, servers, and clusters across the world.
While the technology may not have been created for HPC – early x86 processors did not even have a floating point unit – over the years, many features have been introduced to make these processors more suited to the needs of the HPC market. That, together with the low price driven by high volume, has given Intel a seemingly insurmountable lead in the CPU market, which is mirrored in the HPC sector. Any new architectural advances or new technologies will have to generate a skilled user base with coding skills that can effectively scale software – something that has been built up around x86 CPUs over the past 20 years.
However, if this race were driven purely by the economics of high-volume sales, then it might be ARM – which is known primarily for processors in mobile devices such as tablets and smartphones – that begins to pull ahead, as these markets are starting to dwarf the other consumer computing markets. ARM has made admirable progress in HPC. It develops the architecture and instruction set, and then licenses the chip designs and the ARM instruction set to third parties, who develop the products.
ARM produces very energy-efficient processors that deliver excellent power/performance. As the HPC market moves towards the Exascale era, it will need to make significant energy savings compared to today’s technology. The European Mont-Blanc project is one attempt to build an Exascale HPC architecture based around the ARM processors, largely because of their energy consumption. The project was granted an additional €8 million to continue its research until 2016.
The need for energy efficiency and the drive towards data-intensive compute operations have encouraged people to look even further outside the traditional HPC environment to find a solution to increasing power needs. FPGAs and DSPs offer tremendous potential for high-performance computing accelerators as they can operate within a much lower power budget than a typical GPU. The cost, however, is the complexity of programming them.
FPGA manufacturer Altera has been promoting the use of OpenCL to generate host code as well as FPGA kernel code. The process could make FPGAs more easily accessible to the general-purpose HPC market, although there is still a lot of work to be done in this area. In order to provide more functionality for HPC users, Altera is also developing floating-point arithmetic engines and DSP blocks that can be included in an FPGA-based processor design.
FPGAs took a step closer to the mainstream of high-performance computing when FPGA manufacturers began to introduce specific tools for developing C and C++ codes. Altera and Xilinx, two of the largest FPGA manufacturers, opted to do this through the use of OpenCL. Mike Strickland, director of the computer and storage business unit at Altera said: ‘The problem was that we did not have the ease of use; we did not have software friendly interface back in 2008. The huge enabler here has been OpenCL.’
The recent acquisition of Altera by Intel has generated further interest in FPGA technology, but Addison Snell, CEO of Intersect360 Research, believes this acquisition is focused on the hyperscale computing market rather than on Intel trying to position FPGAs for HPC. Snell said: ‘We watched it [the hyperscale computing market] grow and mature to the point that we have now taken that out of our high-performance computing methodology and really established it as a separate hyperscale market. It has grown and matured to the point that it has its own market dynamics that behave differently from other enterprise, but also other high performance or scale-sensitive applications.’
The issue with FPGA technology is that it necessitates the optimisation of data movement and computation on an application-specific basis. In a general-purpose HPC market that runs tens or even hundreds of applications across a cluster in a fairly short period, it is unlikely that HPC centres will want to switch off even a few nodes while they are optimised for a new application – let alone an entire cluster.
‘FPGAs have been around a long time, and we have seen them in selected areas, but they have always been most successful in deployments that are focused on single applications, or a few applications. They have also been best deployed in areas that are highly scalable and generally for applications that are text or integer-based. They are not as strong on floating point applications that use fractional arithmetic,’ concluded Snell.
The RISC (reduced instruction set computing) processor architecture was originally developed back in the 1980s around a simplified instruction set that could potentially provide higher performance as it is capable of executing instructions using fewer microprocessor cycles per instruction.
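The trade-off can be illustrated with the standard processor performance equation: execution time equals instruction count multiplied by cycles per instruction, divided by clock frequency. The short Python sketch below is only an illustration; the function name and all of the figures in it are invented for the example and do not describe any real processor.

```python
def execution_time(instruction_count: float,
                   cycles_per_instruction: float,
                   clock_hz: float) -> float:
    """Return run time in seconds: instructions x CPI / clock frequency."""
    if clock_hz <= 0:
        raise ValueError("clock_hz must be positive")
    return instruction_count * cycles_per_instruction / clock_hz


# Hypothetical figures, chosen only to show the shape of the argument.
# A complex-instruction design: fewer instructions, more cycles for each.
complex_time = execution_time(1.0e9, 4.0, 1.0e9)

# A reduced-instruction design: more instructions, fewer cycles for each.
reduced_time = execution_time(1.3e9, 1.5, 1.0e9)

print(f"complex instruction set: {complex_time:.2f} s")
print(f"reduced instruction set: {reduced_time:.2f} s")
```

With these made-up numbers the simpler instruction set needs 30 per cent more instructions yet still finishes sooner, because each instruction takes far fewer cycles. Real designs depend on the compiler, the memory system, and the workload.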
The early work was done at two US universities: Stanford University and the University of California, Berkeley. Stanford would go on to commercialise its work as the MIPS architecture, while Berkeley’s RISC evolved into the SPARC architecture. Developed by Sun Microsystems and introduced in 1987, the first implementations of SPARC were based on 32-bit operations and initially designed to be used in Sun Microsystems server and workstation systems, replacing Motorola processors. SPARC International was eventually set up to license the technology in the hope of encouraging the development of the processor ecosystem. The SPARC architecture was licensed to several manufacturers, including Texas Instruments, Atmel, Cypress Semiconductor, and Fujitsu.
SPARC processors have seen many implementations since those early days, but Fujitsu has had perhaps the most success with the RISC-based SPARC architecture to date. The latest in its line of SPARC processors – the SPARC64 VIIIfx 8C 2GHz – was used in the K computer, which as of the June 2015 release of the Top500 is still ranked as the fourth most powerful supercomputer in the world.
The Amdahl spectrum
In HPC, the days are long over when users could just wait for the next CPU to deliver a 50 per cent increase in application performance. Now they must look to more innovative, architectural advances which require an understanding of parallelising code and how to map that efficiently to specific accelerator technologies.
Rajeeb Hazra, VP of Intel’s architecture group and GM technical computing, said: ‘We are starting to plateau, not as a company with a product, but as an industry on how quickly we can build performance with the old techniques, by just increasing the frequency.’ Interestingly, for a representative of what is regarded as a hardware company, Hazra placed his emphasis on software, stressing that the key to increasing performance is first to identify clearly which sections of code can be parallelised and which cannot.
Hazra said: ‘Amdahl said the world is not all highly parallel or highly serial; there is a spectrum and the amount of performance gain you can get through parallelism is gated by the proportion of how much is parallel and how much is serial in an application.’
He concluded: ‘There are many applications that are highly parallel and there are some that cannot easily be parallelised. So what we need is a family of solutions that covers this entire spectrum of applications and this is what we call the Amdahl spectrum.’
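The relationship Hazra describes is usually written as Amdahl’s law: for a fixed-size problem in which a fraction p of the work can be parallelised, the speedup on n processors is 1 / ((1 - p) + p / n). The Python sketch below is a minimal illustration of that formula. The function names and the sample fractions are assumptions chosen for the example, not figures from Intel.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup predicted by Amdahl's law for a fixed-size problem."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def speedup_limit(parallel_fraction: float) -> float:
    """Upper bound on speedup as the number of workers grows without limit."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie between 0 and 1")
    if parallel_fraction == 1.0:
        return float("inf")
    return 1.0 / (1.0 - parallel_fraction)


# Illustrative parallel fractions only; real applications must be profiled.
for p in (0.50, 0.90, 0.99):
    for n in (16, 1024):
        print(f"p={p:.2f}  n={n:5d}  speedup={amdahl_speedup(p, n):7.2f}  "
              f"limit={speedup_limit(p):7.2f}")
```

Under this model, a code that is 90 per cent parallel can never run more than ten times faster, however many processors are added, which is why the serial portion matters as much as the accelerator chosen for the parallel portion.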
The future of HPC is no longer a monoculture of clusters of commodity hardware but rather a highly diverse ecosystem, populated by different processor technologies, different architectures, and different software solutions. It may be messy, but it will be interesting.