The current buzz in the world of high-performance computing is exascale systems. The next major milestone is HPC systems that will be capable of executing an exaflop – a billion billion floating point operations per second, writes John Barr.
Extrapolating the performance delivered by the fastest systems in the world in recent years suggests that an exascale system could be built in 2018. However, the approach that has often led to the development of the next generation of supercomputers – more of the same but bigger and faster – is no longer tenable. Additional compute power cannot be achieved by cranking up the processor clock speed because this also cranks up the power consumption. The best option for an exascale system is to use a massive number of very efficient processors that have very low power consumption, augmented by a small number of higher-performing processors for tackling segments of applications that do not scale well. Steve Oberlin, CTO for accelerated computing at GPU manufacturer Nvidia, explains that the industry has had it easy in recent years, with CMOS scaling giving greater density of transistors on a chip and faster clock speeds as well as power reductions. Oberlin believes that by moving to simpler, more efficient processor architectures and by focusing on issues such as data locality, the industry can stay on the linear performance projection, even if we are approaching physical limits of scaling CMOS devices.
When will the first exascale system be operational?
Before this question can be answered, we really need to understand exactly what an ‘exascale system’ is. If we mean a system that is as usable for a broad range of scientific applications as current petascale systems are, and that consumes less than 25 MW of power, then the general consensus is that we will have to wait until well into the next decade. If our definition is more liberal, and includes any system whose theoretical peak performance exceeds one exaflop/s – ignoring whether it can stay up long enough to run real applications, and how much power it consumes – then many HPC experts think that we will see the first exascale system around 2019.
However, it may even be a challenge to keep the machine operational long enough to run the Linpack benchmark and demonstrate a sustained performance in excess of one exaflop/s. Alex Ramirez, manager of the heterogeneous architectures group at the Barcelona Supercomputing Center, suggests that the first exascale systems may be motivated by politics or national interests, and may be unreliable and have an unreasonable power requirement – possibly up to 100 MW.
Producing an early exascale system could have significant benefits, even if the system’s usefulness is limited. It could be a statement of intent by the country that funded such an exercise, or the companies that developed the key technologies, pulling more skilled staff and further investment into the programme to develop a more usable, second generation of exascale systems. Also, having a full-scale testbed would allow the HPC industry to run experiments that demonstrated which aspects of the system were up to the job, and which could benefit from taking a new approach.
Professor Thomas Lippert, director of the Institute for Advanced Simulation and head of Jülich Supercomputing Centre, points out that the first petaflop/s system arrived earlier than expected (at the Los Alamos National Laboratory in 2008) and used IBM’s Cell processors – a technology that is no longer available in HPC systems. He suggests that the first exascale system may also use exotic technologies to hit the performance target.
At the Big Data and Extreme Computing meeting in Fukuoka, Japan, earlier this year, Japan’s Office for Promotion of Computing Science announced a collaboration between a number of Japanese computer vendors and research institutes that planned to build an exascale system by the end of this decade. With an investment in excess of $1 billion, and a target power consumption of less than 40 MW, the system will use what they call an extreme SIMD architecture with thousands of processing elements per chip, including on-chip memory and interconnect. This architecture is aimed at solving computationally intensive applications such as N-body, MD and stencil applications. The motivation for Japan to do this is, however, unclear, as the commercial impact of technologies leveraging the K computer (which was the world’s fastest computer in 2011) has, so far, been limited.
As already noted, it is impossible to reach exascale just by doing more of the same but bigger and faster. Power consumption is the largest elephant in the room, but it is not alone. In many areas progress towards exascale systems and applications will not be by incremental change, but by doing things differently. The main issues that must be addressed before exascale systems become a reality include:
The most efficient large-scale HPC system today is Tsubame 2.5 at the Tokyo Institute of Technology, which has a peak performance of 5.6 petaflop/s and consumes 1.4 MW. If the current system were scaled to an exaflop/s, it would consume 250 MW, which is at least an order of magnitude too much.
Exascale systems will have millions of processor cores, and exascale applications will have billions of parallel threads.
It is generally accepted that exascale systems will be heterogeneous, with the computation being handled by highly parallel, low power devices such as the Intel Xeon Phi or Nvidia GPU accelerators.
The majority of the power consumed by supercomputers today is not used to handle computations, but is used to move data around the system. A higher level of integration for components such as interconnect and memory will both speed computation and reduce power consumption.
Exascale systems will use so many components that it is unlikely that the whole system will ever be operating normally. The hardware, system software and applications must cope with both degraded and failed components.
Programming methodologies and applications
There are two schools of thought regarding the programming methodologies required to build exascale applications. Some HPC experts think that it is feasible to extend today’s MPI plus OpenMP plus an accelerator programming model for exascale. Others believe that a radical rethink is required, and that new methods, algorithms, and tools will be required to build exascale applications.
There is a serious lack of parallel programming skills both at the entry level and at the very high end. As most mobile phones and tablets, and all computers, are now multicore devices, all programmers should be taught the rudiments of parallel programming. Today, this happens only at a small number of universities, but there is a growing number of entry-level parallel programming courses being taught. The challenges of programming systems with thousands or millions of cores are far more complex than programming a simple multicore device, but most high-end supercomputer sites have to train their own staff, as only a handful of universities or research facilities provide this level of training.
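The 250 MW figure quoted above for power consumption is a straightforward linear extrapolation from Tsubame 2.5's peak numbers. A minimal sketch of that arithmetic follows, assuming that power scales linearly with peak performance (a simplification) and taking the 25 MW budget mentioned earlier as the target. The function and variable names are illustrative only.

```python
EXAFLOP = 1e18  # floating point operations per second


def scaled_power_mw(peak_flops, power_mw, target_flops=EXAFLOP):
    """Linearly extrapolate a system's power draw to a target peak performance."""
    return power_mw * (target_flops / peak_flops)


# Tsubame 2.5 figures as quoted in the article: 5.6 petaflop/s peak, 1.4 MW.
tsubame_peak_flops = 5.6e15
tsubame_power_mw = 1.4

exascale_power_mw = scaled_power_mw(tsubame_peak_flops, tsubame_power_mw)

# Energy efficiency today versus what a 25 MW exascale system would need.
# The 25 MW budget is the usability threshold cited earlier in the article.
budget_mw = 25
current_gflops_per_watt = tsubame_peak_flops / (tsubame_power_mw * 1e6) / 1e9
required_gflops_per_watt = EXAFLOP / (budget_mw * 1e6) / 1e9

print(f"Extrapolated power at one exaflop/s: {exascale_power_mw:.0f} MW")
print(f"Ratio to a {budget_mw} MW budget: {exascale_power_mw / budget_mw:.0f}x")
print(f"Current efficiency: {current_gflops_per_watt:.0f} gigaflop/s per watt")
print(f"Required efficiency: {required_gflops_per_watt:.0f} gigaflop/s per watt")
```

On these peak figures the gap is a factor of ten in energy efficiency, which is why more of the same hardware is not an option.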
Bill Kramer, who leads the @Scale Science and Technologies unit at the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign, believes that the biggest challenges the industry faces in moving to exascale computing are the fact that Moore’s Law no longer delivers regular power reduction, the cost of moving data, and memory and I/O capabilities being out of step with advances in compute power.
In order to minimise the amount of data moved around a system (an activity that consumes more power and takes more time than actually processing the data), application writers should consider if all of their data structures really need to use double-precision floating point, as the use of single-precision data could halve an application’s memory requirement and data transfer time.
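The memory saving is easy to see in a small experiment. The sketch below uses Python's standard `array` module to store the same number of values as C doubles and as C floats; the element count is an arbitrary, hypothetical problem size. It also shows the price paid: a single-precision value carries only about seven significant decimal digits, so the stored number differs slightly from the original.

```python
from array import array

n = 1_000_000  # hypothetical number of values in an application's data set

doubles = array("d", [0.0]) * n  # C double: 8 bytes per element
singles = array("f", [0.0]) * n  # C float: 4 bytes per element

double_bytes = doubles.itemsize * len(doubles)
single_bytes = singles.itemsize * len(singles)

print(f"double precision: {double_bytes / 1e6:.0f} MB")
print(f"single precision: {single_bytes / 1e6:.0f} MB")

# The trade-off: storing a value in single precision rounds it.
x = 0.1
x_single = array("f", [x])[0]
print(f"rounding error for 0.1 in single precision: {abs(x_single - x):.2e}")
```

Whether that loss of precision is acceptable depends on the algorithm, which is why the decision has to be made data structure by data structure rather than globally.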
Unless there are applications that can exploit such a system, there is no point in building an exascale machine. The number of applications that can use the full capabilities of petascale systems today is relatively small, but the programming and application-design skills are improving fast as supercomputing centres around the world focus on application porting, tuning, and training. The jury is out on what an exascale programming environment will look like.
Some HPC experts such as Kramer believe that there won’t be radical changes to the programming models used for high-end HPC systems, while others including Mark Parsons, executive director of the Edinburgh Parallel Computing Centre at the University of Edinburgh, believe that in order to develop applications for exascale systems the industry must invent new methods, algorithms, and tools. A number of the open-source scientific codes that are being run on NCSA’s Blue Waters petascale system are being updated to cope with the latest high-end HPC technologies, and Kramer is confident that some of these codes can make the transition to exascale systems. But he goes on to say that the programming model needs to become more flexible, and that the compute-synchronise approach needs to change in order to cope with jitter in the system.
Thomas Sterling, professor of informatics and computing at the Indiana University (IU) School of Informatics and Computing, also serves as associate director of the PTI Center for Research in Extreme Scale Technologies (CREST). He believes that, throughout this decade, there will be two approaches to developing applications for exascale systems. In the near term, many users want to avoid disruption to existing codes and therefore look for ways to evolve these codes towards exascale. But in the longer term, Sterling believes that there will be a paradigm shift in exascale applications, similar to those seen as vector, SIMD, and cluster systems appeared, requiring a refactoring of algorithms and applications to support dynamic, adaptive behaviours and to improve both efficiency and scaling. Sterling also points out that the TOP500 list (the ‘pop charts’ for supercomputers) has two very distinct regions. The performance range across the lower 80 per cent of the list is less than a factor of three, while the top 20 per cent has a performance range of almost two orders of magnitude. So while we are looking towards exascale and talking about petascale, most users are still content with terascale applications.
Dieter Kranzlmüller is professor of computer science at the Ludwig-Maximilians-Universität (LMU) Munich and member of the board of the Leibniz Supercomputing Centre (LRZ) of the Bavarian Academy of Sciences and Humanities. Kranzlmüller reports that the recent Extreme Scaling Workshop concluded that there are many technical issues that still require work, if codes are to be effective at petascale. Hybrid codes (using both MPI and OpenMP) tend to be slower than pure MPI codes, but they can scale better to a large number of cores. Pinning threads to cores (which can be beneficial to performance) can also be a bad idea unless the programmer really knows what he or she is doing, while parallel I/O remains a challenge for many applications. Kranzlmüller thinks that more radical approaches such as PGAS (partitioned global address space) languages are worth further investigation. Ramirez of BSC agrees that PGAS languages should be considered to assist fault tolerance and scalability, and to avoid data replication. He goes on to say that, due to their massive size, exascale systems will always suffer from degraded components, and that the performance jitter caused by this means that the dynamic reallocation of workload is important if high performance is to be maintained. But he sees a huge barrier to the adoption of new technologies such as PGAS. As long as MPI programs continue to work, users will not demand PGAS from vendors, and vendors will not support it. To move to PGAS requires collaboration between users, software suppliers and hardware suppliers. But there is a serious risk that this may not be generally accepted until it is too late.
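To see why degraded components make dynamic reallocation of workload matter, consider a toy model. The sketch below is not MPI or PGAS code, and all names in it are hypothetical. It simply compares the completion time of a fixed, up-front division of work against a shared queue from which each worker pulls its next task when it becomes free, with one of four workers running at a quarter of normal speed.

```python
import heapq


def static_makespan(task_costs, speeds):
    """Assign tasks round-robin up front; the run ends when the slowest worker finishes."""
    finish = [0.0] * len(speeds)
    for i, cost in enumerate(task_costs):
        worker = i % len(speeds)
        finish[worker] += cost / speeds[worker]
    return max(finish)


def dynamic_makespan(task_costs, speeds):
    """Each worker takes the next task from a shared queue as soon as it is free."""
    # Heap of (time at which the worker becomes free, worker index).
    free_at = [(0.0, worker) for worker in range(len(speeds))]
    heapq.heapify(free_at)
    makespan = 0.0
    for cost in task_costs:
        now, worker = heapq.heappop(free_at)
        done = now + cost / speeds[worker]
        makespan = max(makespan, done)
        heapq.heappush(free_at, (done, worker))
    return makespan


tasks = [1.0] * 64                # 64 equal tasks, one time unit each at full speed
speeds = [1.0, 1.0, 1.0, 0.25]    # three healthy workers and one degraded worker

static_time = static_makespan(tasks, speeds)
dynamic_time = dynamic_makespan(tasks, speeds)

print(f"static allocation:  {static_time:.0f} time units")
print(f"dynamic allocation: {dynamic_time:.0f} time units")
```

In this model the static split takes 64 time units, because everyone waits for the degraded worker to finish its equal share, while the shared queue finishes in 20. Real systems add communication and scheduling overheads that this sketch ignores.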
When the HPC industry has gone through paradigm shifts in the past, the impact of these changes has been mainly limited to the hard-core HPC industry. However, HPC is today a strategic asset for many companies beyond traditional HPC users, so major changes to applications will have far-reaching effects.
Lippert believes that some applications can evolve to exascale, while others will need to consider new approaches. The HIGH-Q Club at Jülich supports applications that can fully exploit the compute, memory and/or network capabilities of the centre’s 458,000-core IBM BlueGene/Q system. The club now has 12 members, with a further 10 working on qualification.
Applications today, even those running on the fastest supercomputers delivering in excess of 10 petaflop/s, assume that the system will always operate correctly. But exascale systems will use so many components that it is unlikely that the whole system will ever be operating normally. So software must track the state of the system and pass information about failed (or poorly performing) components to applications, which in turn must be built to operate correctly in such an uncertain environment. According to Parsons, handling the lack of resilience of not only computation, but also communication and storage, will be a major issue for exascale systems. Ramirez thinks that while current petascale applications should be able to run on exascale systems, a new generation of applications that use fault-tolerant algorithms is required to enable resilient applications to scale to the full size of the machine. Lippert proposes a different approach, suggesting that virtualisation of compute, memory, interconnect, and storage could hide reliability issues from exascale applications. However, no hypervisor support has yet been announced that could make this a reality.
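What building an application to survive failures can mean in practice is illustrated by checkpoint/restart, one widely used resilience technique (chosen here for illustration; it is not specifically the approach advocated by the experts quoted above). The sketch below is a minimal, single-process model with hypothetical names: it saves its state at intervals, and when a simulated component failure occurs it rolls back to the last saved state and redoes the lost work.

```python
class ComponentFailure(Exception):
    """Raised to simulate a hardware component failing mid-computation."""


def run_with_checkpoints(steps, checkpoint_every, fail_at):
    """Sum the integers 1..steps, rolling back to the last checkpoint on failure.

    fail_at is a set of step numbers at which a failure is injected once.
    Returns (result, steps executed including redone work).
    """
    pending_failures = set(fail_at)
    checkpoint = (0, 0)  # (last completed step, accumulated state)
    step, state = checkpoint
    executed = 0
    while step < steps:
        try:
            nxt = step + 1
            if nxt in pending_failures:
                pending_failures.discard(nxt)  # each failure happens only once
                raise ComponentFailure(f"failure during step {nxt}")
            state += nxt
            step = nxt
            executed += 1
            if step % checkpoint_every == 0:
                checkpoint = (step, state)  # save a consistent state
        except ComponentFailure:
            step, state = checkpoint  # discard work done since the checkpoint
    return state, executed


result, executed = run_with_checkpoints(steps=100, checkpoint_every=10, fail_at={37, 95})
print(f"result: {result}, steps executed: {executed}")
```

The answer is still correct, but 110 steps were executed to complete 100 steps of useful work. As failures become more frequent, that overhead grows, which is one reason fault-tolerant algorithms are being sought.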
Ramirez thinks that the most valuable outcome of the push to exascale is not the high-end systems themselves, but the low-cost, low-power capabilities that the required technology advances will bring: petascale systems in a single rack with a power draw of 20 kW, and terascale capabilities on portable devices. These systems will deliver high value to society, especially in healthcare, where doctors will be able to deliver real-time diagnosis rather than waiting for weeks to access expensive specialist systems.
Sterling is convinced that we will see the first exascale system before the end of the decade. But, he says: ‘The question is not “will we have an exascale system?”, but “will we have the right one?”.’ It is worth bearing in mind that the first teraflop/s machines, like the Cray T3E and Intel’s ASCI Red system that were operational in the late 1990s, seemed unbearably complicated and difficult to program, and we now have devices like the Intel Xeon Phi and Nvidia K20 GPU accelerators that routinely deliver a sustained teraflop/s. So, however tough the problems seem to be, the HPC industry will overcome them and, in time, the challenges of exascale will be solved and we will soon be looking towards zettascale machines. Perhaps a sign of the times is that the chief architect of the Cray T3E was Steve Oberlin, who is now CTO at Nvidia.
It is worth remembering that the world is a naturally parallel place, so while many current algorithms may not cope with the billions of threads that exascale systems may require, a new breed of applications that do not compress the natural parallelism of the universe may be able to succeed.
Mike Bernhardt, former publisher of The Exascale Report, and now a marketing evangelist at Intel, says: ‘We are indeed taking some big steps into a new parallel universe. With the impressive breakthroughs in cooling technology and processor fabric integration, building an exascale machine is something we can do today, albeit not in any practical or affordable fashion. That will, of course, improve. But what would we do with such a machine, other than run some benchmarks? The biggest hurdle to picking up traction in this new parallel universe is developing an exascale-level architecture and the new programming models needed to support a new generation of parallel applications.’