The path to an energy-efficient exascale supercomputer
The International Supercomputing Conference (ISC17) closed on 22 June 2017 in Frankfurt, Germany, ending an eventful week for a growing community energised about how to advance the sector. Highlights included the latest product announcements, updates on machine learning, GPU accelerators, the conclusion of the sixth iteration of the computing cluster competition and – of course – the Top500 supercomputer list.
China’s Sunway TaihuLight climbed to the top of the supercomputing list, achieving 93 quadrillion operations per second (93 Pflops) on approximately 15 megawatts of power. Even with Sunway TaihuLight’s achievement, it is clear that we may be living in a post-Moore’s law world where processor performance growth is slowing overall.
The next question, when you compare the Sunway TaihuLight’s Top500 list achievement to the goal of exascale computing – a system that will run one billion billion calculations per second – is how the first exascale supercomputer will keep energy consumption low while sustaining far more computational power.
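To put that gap in numbers, the figures quoted above can be run through a quick back-of-the-envelope calculation. The sketch below is illustrative only: it takes the 93 Pflops and roughly 15 megawatts quoted for Sunway TaihuLight at face value and assumes, purely for the sake of argument, that energy efficiency stays constant as a machine is scaled up. The variable names are ours.

```python
# Back-of-the-envelope sketch using the figures quoted in the article.
# Assumption: energy efficiency stays constant when scaling up, which
# real systems will not do exactly.

TAIHULIGHT_FLOPS = 93e15   # 93 Pflops on Linpack
TAIHULIGHT_WATTS = 15e6    # approximately 15 megawatts
EXASCALE_FLOPS = 1e18      # one billion billion calculations per second

# Efficiency of the current list leader, in Gflops per watt
gflops_per_watt = TAIHULIGHT_FLOPS / TAIHULIGHT_WATTS / 1e9

# Power an exaflop machine would draw at that same efficiency
exascale_megawatts = EXASCALE_FLOPS / (gflops_per_watt * 1e9) / 1e6

print(f"TaihuLight efficiency: {gflops_per_watt:.1f} Gflops per watt")
print(f"Exascale at that efficiency: {exascale_megawatts:.0f} megawatts")
```

On these assumptions an exaflop machine would draw around 160 megawatts, which is why efficiency, and not just speed, dominates the exascale discussion.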
The Top500 list contestants all compete using the Linpack benchmark, which solves a dense system of linear equations using IEEE Standard 754 floating-point arithmetic; most of the work consists of dense matrix–matrix multiplications, and the result is measured in floating-point operations per second (flops). Currently, some mavericks within the supercomputing community are striving to shift the focus to methods that produce consistent, reproducible results, while also looking at whole applications to give a better idea of real-world performance.
Diversifying benchmarks for real-world performance
While a flops-based approach keeps pushing managers of supercomputing centres onwards in one dimension of complexity, other lists have emerged to complement the Top500. For example, the Green500 list looks at Linpack flops-per-watt for energy efficiency.
At this year’s ISC17 another complementary benchmark to the High Performance Linpack (HPL), known as the High Performance Conjugate Gradients (HPCG) benchmark, entered its seventh year.
HPCG placed the Sunway TaihuLight system in fifth place on its list of 110 entries – and Japan’s Riken/Fujitsu K Computer at number one.
To date HPCG, which measures performance that is more representative of how today’s scientific calculations perform, has been run on many large-scale supercomputing systems in Europe, Japan and the US.
Last year, in a peer-reviewed paper published in the International Journal of High Performance Computing Applications, Jack Dongarra, director of the University of Tennessee’s Innovative Computing Laboratory, US, who has been involved in Linpack’s development since 1993, and two colleagues analysed the performance of HPCG in comparison to HPL.
The team concluded that their preliminary tests show that HPCG exhibits performance levels that are far below the levels seen by HPL, one of the main reasons being the so-called memory wall. Still, HPCG scales equally well when compared with HPL.
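The memory wall is easier to see in code. The sketch below is a toy, unpreconditioned conjugate gradient solver for a one-dimensional Laplacian, written in plain Python; it is not the HPCG reference code, which is considerably more elaborate, and the function names are ours. What it does show is the pattern such a benchmark exercises: every iteration streams through whole vectors while doing only a handful of arithmetic operations per element, so memory traffic rather than arithmetic tends to set the pace.

```python
def laplacian_matvec(x):
    """Return A @ x for the tridiagonal matrix (-1, 2, -1), without storing A."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        y[i] = 2.0 * x[i]
        if i > 0:
            y[i] -= x[i - 1]
        if i < n - 1:
            y[i] -= x[i + 1]
    return y


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite A given only as matvec."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x, with x = 0 initially
    p = list(r)          # search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(p)                       # one sparse matrix-vector product
        alpha = rs_old / dot(p, Ap)          # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Build a right-hand side from a known solution, then recover it.
x_true = [float(i + 1) for i in range(20)]
b = laplacian_matvec(x_true)
x_solved = conjugate_gradient(laplacian_matvec, b)
max_error = max(abs(a - c) for a, c in zip(x_solved, x_true))
print(f"largest error: {max_error:.2e}")
```

By contrast, the dense matrix–matrix multiplications at the heart of HPL reuse each value fetched from memory many times, which is one reason the two benchmarks report such different numbers on the same machine.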
‘HPCG, in addition to HPL, is a good benchmark and should be run on every new system in addition to running HPL. HPCG shows a different characteristic of the system that is benchmarked and should be a good addition,’ said Robert Henschel, chair of the Standard Performance Evaluation Corporation’s High-Performance Group (SPEC/HPG).
‘I disagree with the statement that HPCG measures “real application performance”,’ said Horst Simon, deputy laboratory director and chief research officer at Lawrence Berkeley National Laboratory, US, and co-editor of the biannual TOP500 list. ‘Since HPCG is mostly determined by this fundamental speed of the machine, it will correlate with HPL in the foreseeable future.’
Then there is the HPC Challenge (HPCC) benchmark, sponsored by the US Department of Energy (DOE), the National Science Foundation and DARPA. It comprises seven tests, including HPL, Fast Fourier Transform, STREAM and communication bandwidth and latency. The HPCC benchmark looks at computational performance as well as memory-access patterns.
‘Benchmarks like HPCC, HPL and HPCG are important and allow users to draw conclusions about the absolute best performance of a system, but they may not be representative of real-world workloads,’ said Henschel.
SPEC/HPG benchmarks offer complementary metrics to HPCG that enable analysis of how whole applications behave, giving a more in-depth view of real-world performance. SPEC/HPG benchmarks usually focus on parts of a large system or single nodes.
SPEC/HPG maintains three benchmarks: SPEC MPI2007, SPEC OMP2012 and SPEC ACCEL. Each of the benchmarks addresses different ways that scientific applications can be parallelised. SPEC/HPG members include AMD, HPE, IBM, Intel, Nvidia and Oracle, as well as a host of associate universities.
SPEC ACCEL contains codes that make use of accelerators, such as GPUs or specific processors, to speed up performance of scientific applications in fields such as medicine, astrophysics, molecular dynamics, weather and fluid dynamics.
‘All SPEC/HPG benchmarks are designed to measure the performance of real applications, not just a benchmark kernel or an algorithm. From our point of view, this gives users a more realistic picture of how applications are going to perform on one system compared to another, or how much of an advertised performance boost of a new processor is actually visible in application performance,’ said Henschel.
In comparison, HPCC is a benchmark that uses very low-level benchmark kernels such as HPL and STREAM tests.
‘Those benchmarks measure only small parts of what a scientific application would normally need to do during its runtime on a supercomputer. In contrast, SPEC ACCEL contains complete real-world applications, measuring the full execution cycle of a scientific application,’ said Henschel.
With computational performance, one of the most important things to know is how much time is spent accessing each level of memory, including registers, cache, DRAM, mass storage and all levels in between. With this level of detail you can forecast performance once you know the scale of the problem.
‘Yet, that is usually the first thing benchmarks discard: they fix the size of the problem!’ said John Gustafson, currently a visiting scientist at the A*STAR (Agency for Science, Technology and Research) in Singapore. Gustafson is an accomplished expert on supercomputer systems and creator of Gustafson’s law in computer engineering.
According to Gustafson, the original Linpack was a fixed-size benchmark that did not scale. By his account, persuading Jack Dongarra to switch to a ‘weak scaling’ model – for which Gustafson’s law applies instead of Amdahl’s law – helped the TOP500 list to endure for 25 years. Since the 1980s, the benchmark’s definition has become more goal-oriented, amenable to parallel methods and less subject to cheating.
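The difference between the two laws is easy to show with their textbook formulas. The sketch below is ours, not taken from the benchmark: the 5 per cent serial fraction and the processor counts are arbitrary illustrative values, and the two laws define the serial fraction against different baselines (the single-processor run for Amdahl, the parallel run for Gustafson), so the comparison is indicative only.

```python
def amdahl_speedup(serial_fraction, processors):
    """Fixed problem size: the serial part caps the achievable speed-up."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def gustafson_speedup(serial_fraction, processors):
    """Problem size grows with the machine ('weak scaling')."""
    return processors - serial_fraction * (processors - 1)


SERIAL_FRACTION = 0.05  # illustrative assumption, not a measured value

for n in (16, 1024, 65536):
    fixed = amdahl_speedup(SERIAL_FRACTION, n)
    scaled = gustafson_speedup(SERIAL_FRACTION, n)
    print(f"{n:>6} processors: Amdahl {fixed:8.1f}x, Gustafson {scaled:10.1f}x")
```

With a fixed-size problem the speed-up can never exceed 20 times for this serial fraction, however many processors are added, whereas under weak scaling it keeps growing almost in step with the machine. That is the property that lets a scaled benchmark remain meaningful as systems get larger.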
From floats to a posit-based approach
This year a peer-reviewed research paper titled Beating Floating Point at its Own Game: Posit Arithmetic was published in the journal Supercomputing Frontiers and Innovations. The paper’s authors believe this data type has the potential to revolutionise the supercomputing community’s approach and attitudes to performance, both of applications and the systems they are run on.
‘Benchmarks should always be goal-based, but usually they are activity-based,’ said Gustafson. ‘Which is where you get silly metrics like “floating point operations per second” that do not correlate well with getting a useful answer in the smallest amount of time.’
In the paper, Gustafson and his co-author, Isaac Yonemoto from the Interplanetary Robot and Electric Brain Company in California, US, conclude that the ‘posit’ data type can act as a direct drop-in replacement for IEEE Standard 754 floats, yet have higher accuracy, larger dynamic range and better closure – without any need to reduce transistor size and cost.
In short, posits could soon render floats obsolete and further steer the community away from one-dimensional benchmarks altogether.
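This article does not spell out how a posit is laid out, so the sketch below follows the encoding in the Gustafson and Yonemoto paper as we understand it: a sign bit, a variable-length ‘regime’ run of identical bits, up to es exponent bits and then whatever fraction bits remain. The function name decode_posit and its parameters are ours, the result is converted to an ordinary Python float purely for illustration, and the details should be checked against the paper before being relied on.

```python
def decode_posit(bits, n=8, es=0):
    """Decode an n-bit posit, given as an unsigned integer, into a float.

    Assumed layout: sign | regime run | terminating bit | es exponent bits | fraction.
    """
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("inf")  # the paper treats this single pattern as +/- infinity

    negative = bool(bits >> (n - 1))
    if negative:
        bits = (-bits) & mask  # negative posits are stored in two's complement

    body = format(bits, f"0{n}b")[1:]            # everything after the sign bit
    first = body[0]
    run = len(body) - len(body.lstrip(first))    # length of the regime run
    k = run - 1 if first == "1" else -run        # regime value

    rest = body[run + 1:]                        # skip the bit that ends the run
    exp_bits = rest[:es].ljust(es, "0")          # missing exponent bits count as 0
    frac_bits = rest[es:]

    e = int(exp_bits, 2) if es else 0
    f = int(frac_bits, 2) / (1 << len(frac_bits)) if frac_bits else 0.0

    value = 2.0 ** (k * (1 << es) + e) * (1.0 + f)
    return -value if negative else value


for pattern in (0b00000001, 0b01000000, 0b01010000, 0b01111111, 0b11000000):
    print(f"{pattern:08b} -> {decode_posit(pattern, n=8, es=0)}")
```

The length of the regime run is what gives the format its tapered character: values near one keep most of their bits for the fraction, while very large and very small values trade fraction bits for range.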
In another experiment by Gustafson, when comparing posit-based arithmetic with floats, the posit approach again came out on top. Gustafson ran random data through a standard Fast Fourier Transform (FFT) algorithm. Then he inverse transformed it and compared it with the original signal.
‘For a 1024-point FFT and a 12-bit analogue-to-digital convertor data I was able to get back the original signal, exactly, every bit, using only a 21-bit posit representation,’ says Gustafson. ‘That’s something even 32-bit floats cannot do. I can preserve 100 per cent of the measurement information with fewer bits.’
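The shape of that round-trip test can be sketched in a few lines. The code below is not Gustafson’s experiment: it uses Python’s built-in double-precision complex numbers rather than posits or 32-bit floats, a textbook recursive radix-2 FFT, and a fixed random seed, and it assumes that ‘getting the signal back’ means rounding each restored sample to the nearest integer. All names are ours.

```python
import cmath
import random


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. No 1/n scaling."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def round_trip(samples):
    """Forward transform, inverse transform, then round back to integers."""
    spectrum = fft(samples)
    restored = fft(spectrum, inverse=True)
    n = len(samples)
    return [round(value.real / n) for value in restored]


random.seed(0)  # fixed seed so the run is repeatable
adc = [random.randrange(0, 4096) for _ in range(1024)]  # 12-bit samples
recovered = round_trip(adc)
exact = recovered == adc
print("signal recovered exactly:", exact)
```

Swapping the arithmetic inside fft for a narrower number format, and then checking whether the comparison still holds, is the essence of the test described above.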
In April of this year, Paul Messina, director of the US Exascale Computing Project (ECP), presented a wide-ranging review of ECP’s evolving plans for the delivery of the first exascale machine – which has now moved its launch to 2021 – at the HPC User Forum in Santa Fe, US.
Messina stated that from the very start the exascale project has steered clear of flops and Linpack as the best measure of success. This trend has only grown stronger with attention focused on defining success as performance on useful applications and the ability to tackle problems that are intractable on today’s petaflops machines.
For well over 25 years, leading voices in the supercomputing community have urged users to measure systems by their capabilities to solve real problems. A few of these individuals include Messina, Gustafson and Horst Simon.
Back in 2014, Simon said in an interview that calling a system exa-anything was a bad idea, because it becomes a bad brand, associated with buying big machines for a few national labs; therefore, if exaflops are not achieved, this will likely be seen as a failure, no matter how much great science can be done on the systems being developed.
‘My views on naming anything ‘exa’ are still the same,’ said Simon. ‘However, what has changed is that we now have a well-defined exascale computing project in the US. This project includes a significant number of exascale applications – in the order of 24.’
These applications range from cosmology to genomics and materials science.
Europe is also working towards a supercomputing ecosystem effort known as EuroHPC. In July 2017, on a European Commission blog, an established voice in this field said the European supercomputing community faces a weak spot in technologies, such as the development and commercialisation of domestic computer or CMOS chips and processor technologies.
This view came from Wolfgang Marquardt, scientific director and chairman of the board of directors of Forschungszentrum Jülich (Jülich Research Centre), in Germany, home to a supercomputing centre and one of Europe’s largest research centres.
‘A single nation cannot make significant progress in this endeavour, and we need to work together to advance in this field,’ said Marquardt.
For Europe’s EuroHPC, or any exascale effort, the success factors that should take precedence have shifted from traditional metrics, according to Marquardt.
‘In my opinion, there are more appropriate ways to discuss the power of supercomputers, rather than rigid benchmarking lists: energy-efficiency, scalability and adaptability for a variety of different frontier applications have become more important parameters than the obvious efficiency metrics, such as the number of cores, or the peak performance on some standard test suite.’
Leading thinkers in the US tend to agree on this approach.
‘What is most important is that they all have to measure progress on a metric that makes sense for the application. So there is no push to get some artificial results that may not make sense scientifically. Instead the application developers need to demonstrate a factor of a hundred improvement over 2017’s state-of-the-art in their chosen metric,’ said Simon.
But is there an approach or benchmark that will be able to meet the demands of future applications and their users, or will something new be needed for the ‘exa-age’ of supercomputing?
‘There is no overall best benchmark. These benchmarks are not suited to actually make a purchase decision for a machine. You will need to first define what you want to accomplish with a supercomputer,’ said Simon.
The questions to ask include whether the benchmark is for a single application or for a very diverse workload, and whether it is for a small number of users or for many; these are critical factors for managers of supercomputing systems. In many cases, one benchmark will be suitable for one problem, but not another.
‘For an actual procurement, the Sustained Petascale Performance (SPP) is much more useful, but it needs to be tailored to the individual requirements,’ said Simon.
The Sustained Petascale Performance metric tool is used on the Blue Waters system at the University of Illinois at Urbana-Champaign, US. It helps its users get a more detailed understanding of each application’s performance, workload and the overall continual performance of the entire system.
To posit-operation-per-second processors and beyond
In an email interview, Dongarra emphasised that to get an idea of the best benchmark to work towards, the DOE’s current exascale goal is a good guide: an application that can run 50 times better than on today’s 20 Pflop systems, running under load at between 20 and 30 megawatts of power and with less than one fault per week.
According to Robert Henschel, the SPEC ACCEL single-node benchmark is applicable to the future exascale system, but it would only evaluate a small section of a system – so a very comprehensive analysis, but not at scale. Posits could reset the community’s approach, but first they have to overcome natural scepticism and the processors currently being manufactured.
Simon said that in his opinion the main obstacle is that the global hardware computing industry is now a $350 billion enterprise. It will be very hard to move that market towards innovative concepts like posit-based architectures, even if they are probably better.
Marquardt also said, ‘The concept of posits is interesting but – at least at this point in time – is not expected to be of relevance for the next generation of supercomputers.’
Perhaps a breakthrough innovation will not come from one of the larger players, such as big chip manufacturers Intel or Nvidia, but from a smaller start-up on the bleeding edge of technology.
Start-up semiconductor company REX Computing, based in the US, is developing a novel low-power processor chip called ‘Neo’. Built on a 28-nanometre process, it is touted by its creators as offering up to 25 times better energy efficiency for supercomputers and digital signal processing than conventional CMOS chips.
REX Computing’s initial test chip was produced last year and uses a custom-designed IEEE compliant floating point unit that is being sampled by early customers.
The team at REX are also experimenting with posits and see great potential in them. A processor variant using posits is in production under contract with A*STAR.
‘We are a very small team, but are punching outside our weight class,’ said Thomas Sohmers, CEO of REX Computing. ‘For a start-up like REX, we want to cater to early adopters and customers that have the absolute highest requirements for their systems, which is a niche too small to base major product decisions on for the big guys.’
With $2 million in funding they have already developed a new processor architecture, created silicon chip units and the initial software. In comparison, the typical cost for a 28-nanometre process node for traditional semiconductor companies ranges from $30 million to $250 million.
‘A number of start-ups in the past decade raised tens of millions of dollars without ever producing a working chip,’ said Sohmers. ‘While those big companies may look at posits as a risky proposition, we see opportunity in being the first to offer innovative solutions to those early adopters.’
To date, the Neo general-purpose float-based processor is achieving 128 single precision and 64 double precision gigaflops per watt in tests. ‘In comparison, that’s more than double what you see at the top of the latest energy-efficient Green500 supercomputer list,’ said Gustafson.
According to Sohmers, the latest Intel ‘Knights Landing’ Xeon Phi chip, made on a 14 nanometre process, has a theoretical peak performance of about 10 double precision Gflops per watt.
The current theoretical peak performance on their Neo chip is better. On an older 28 nanometre process, the Neo performs 32 double precision Gflops per watt, 26 Gflops per watt for a DGEMM benchmark (designed to measure the sustained floating-point computational rates of a single node) and 25 Gflops per watt for an FFT – a very communication-intensive function. This will widen advantages over x86-type processors for 64-bit operations.
‘We are showing three to 25 times better energy efficiency while we have a huge (us on 28 nanometres versus Intel’s 14 nanometres) process technology disadvantage. For our production chip, which is roughly on a par with Intel’s 14 nanometre, our numbers would just about double,’ said Sohmers.
Based on these conservative estimates, a 32-bit REX-type processor based on posits, rather than a 64-bit processor based on floats, could achieve 60 billion real-world operations per second per watt.
Scaled up, this is the energy-efficient exascale computer that Sohmers and Gustafson envision. ‘With a 20-megawatt power budget, yes, you’re definitely beyond exascale at that point,’ said Gustafson.
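The arithmetic behind that claim is straightforward to check. The sketch below simply multiplies out the figures quoted in this article: the 60 billion operations per second per watt estimate, and the 20 to 30 megawatt range cited for the DOE goal. It assumes ideal scaling, with every watt in the budget going to processors that sustain the quoted rate, and so ignores memory, interconnect and cooling overheads; the function names are ours.

```python
EXA = 1e18                    # operations per second
POWER_BUDGETS_MW = (20, 30)   # range quoted for the DOE exascale goal


def required_gops_per_watt(target_ops, megawatts):
    """Efficiency needed to reach target_ops within the power budget."""
    return target_ops / (megawatts * 1e6) / 1e9


def system_ops(gops_per_watt, megawatts):
    """Aggregate rate if every watt delivered gops_per_watt (ideal scaling)."""
    return gops_per_watt * 1e9 * megawatts * 1e6


for mw in POWER_BUDGETS_MW:
    needed = required_gops_per_watt(EXA, mw)
    print(f"{mw} MW budget needs {needed:.1f} giga-operations per second per watt")

projected = system_ops(60, 20)  # posit-based estimate at a 20 MW budget
print(f"60 per watt at 20 MW: {projected / EXA:.1f} exa-operations per second")
```

At 20 megawatts the break-even point is 50 billion operations per second per watt, so the 60 billion estimate would clear an exa-operation with some room to spare, provided the ideal-scaling assumption held.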
However, as mentioned, some in the community have their doubts, which are based on the current dominance of the larger players in the semiconductor industry. ‘That may have been possible in the late 1980s when the industry moved to the IEEE floating point standard, but at that time, the market was much smaller and floating point arithmetic was indeed faulty and counterproductive for software development,’ said Simon.
But, according to Gustafson, IEEE 754 floats are obsolete: it’s just that the world doesn’t know it yet. ‘The small companies have early-mover advantage and the big companies have amazing resources to apply, but are always conservative. That’s where the revolutionary fun is – and always has been. Very much like the disruption of parallel computing in the 1980s,’ said Gustafson.
Large and established chip manufacturers are still squeezing as much as they can out of CMOS technology by investing in FinFET (fin field-effect transistor) and seven-nanometre scale transistors.
‘The established companies won’t lift a finger until they see their market share threatened by an upstart; and sometimes, not even then. With the belief that these initially risky ideas will gain more mainstream adoption once they are proven out as being viable... it would only be at that time that the rest of the industry would be practically forced to change.’
REX Computing’s innovation is to take a lot of unnecessarily complex logic out of its processor’s hardware design through the use of ‘scratchpads’. The company has written unique code that gives exact latency guarantees for all operations and memory access, allowing a compiler to handle all of the memory management in software rather than hardware.
‘While it sounds simple and obvious, the actual algorithms and compilation techniques we are using are very unique, and up until us doing it, many said it would be impossible,’ said Sohmers.
As for REX Computing’s IEEE-float-compliant Neo processor, evaluation units have been in use by early customers since May 2017. The company plans to sample 16 nanometre-scale chip units in spring of 2018, with larger volume availability in the last quarter of 2018.
Sohmers said, ‘Depending on our results with the posit project, we expect to have evaluation units available for a variant of our processor replacing the IEEE float unit available in spring 2018.’
Based on their current posit-based simulations, they are very confident that they will exceed 60 Gflops per watt with their first production chip next year, which has one potential ‘peta-scale’ supercomputer installation in the pipeline for 2019. This shows the potential for a reasonably priced exascale supercomputer by 2020 using Neo chips.
Back in the late 1990s, quips Gustafson, the goal was a ‘tera-ops’ machine, staying clear of flops and Linpack. But it wasn’t long before the supercomputing community said, ‘Yeah, yeah, sure. So does it get a Tflop on Linpack?’
This cycle repeated itself in the 2000s with the peta-scale computing goal: Pflops became the flavour of the decade. Exascale will probably meet the same fate, with the first questions being about Eflops. ‘It’s just too much fun to plot trend lines for a benchmark that is older than dirt,’ said Gustafson, who remains undaunted about the potential for posits.
‘I did a quick scan of my email and found 40 entities working on making posit arithmetic real at the hardware level. Most are start-up companies, but also national laboratories, universities and companies like IBM, Intel, Qualcomm, Samsung, Google, Microsoft and Nvidia.
‘Mostly, the feedback I’ve gotten: When can I have it? I want it now! Frankly, I’d be surprised if people are still using IEEE 754 floating point in 2027.’
In the supercomputing chip race, perhaps the surprise will come from a smaller country or start-up that will develop paradigm-shifting solutions first, and drag the race onto a new path. Nonetheless, the past has shown that any big idea takes time to bring about change; the clock is ticking.