The long road
Bill Feiereisen, senior scientist and software architect in Intel’s exascale architecture pathfinding group
There are many technologies that we are investigating beyond the current state of the art. In aggregate, they will contribute to improving power efficiency beyond its already lean values. Among them are alternative processor architectures that concentrate on extreme parallelism while stripping out any unnecessary functionality; fine-grained power management across the processors, the memory systems and the system architecture itself; and the system software that directs it all. But in the spirit of co-design there is much work on the applications and algorithms themselves. For example, one of the largest uses of power is simply moving data around the machine. This can be minimised by alternative application and algorithm strategies.
Exascale computers will require parallelism in the extreme, managing as many as one billion computing threads of execution, and it is not clear that current programming methods will be able to manage computations of this size. The Tianhe-2 machine in Guangzhou, China, for example, contains more than three million execution cores, and its largest computations must already manage data communication and movement across this large distributed space. Since individual cores are not expected to become more powerful, the increased performance must come from increased parallelism; by this arithmetic an exaflops machine will have about a billion cores. Current top computers provide a number of schemes to manage computation and data movement – message passing, threading, as well as capabilities built into some higher-level languages – but these will be severely challenged when required to scale a further 30 times beyond what is already necessary on Tianhe-2, the largest machine on the planet. There is much work in the community to define new programming models and to make them useful by also providing the tools that allow a programmer not only to write programs, but to debug and tune them.
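A back-of-the-envelope check on these figures, using Tianhe-2’s roughly 34 petaflops of sustained Linpack performance and 3.12 million cores; the threads-per-core figure is an assumed illustrative parameter, not a published specification:

```python
# Rough arithmetic behind the scaling claims above.
# Tianhe-2 figures are approximate; threads-per-core is an assumption.
EXAFLOPS = 1e18            # target: 10^18 flops
TIANHE2_FLOPS = 33.9e15    # ~33.9 petaflops sustained (Linpack)
TIANHE2_CORES = 3.12e6     # ~3.12 million cores

scale = EXAFLOPS / TIANHE2_FLOPS      # ~30x beyond Tianhe-2
cores = TIANHE2_CORES * scale         # ~10^8 cores at today's per-core speed
threads = cores * 10                  # ~10^9 threads, assuming ~10 hardware
                                      # threads per core (illustrative)
print(f"scale: {scale:.0f}x, cores: {cores:.1e}, threads: {threads:.1e}")
```

With roughly ten hardware threads per core, the hundred-million-core extrapolation lands on the billion-thread figure quoted above; a billion *cores* follows instead if each core sustains only about one gigaflop per second.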
Each of these barriers can be overcome individually, but the greater challenge will be to overcome them together, because they interact. Mastering just one of the challenges will not solve the overall problem. I cannot emphasise enough how important the concept of co-design is. Computing at this scale is one of the most complex multi-disciplinary problems in the engineering and scientific community.
Dave Turek, vice president of advanced computing at IBM
The efficiency of power usage is an important challenge, but the dominating issue revolves around the cost of the system. If one were to try to keep the memory per core on an exascale system on a par with what people are accustomed to today, the cost of the system would likely be viewed as unaffordable. The infrastructure needed to handle the data associated with a system of this size, as well as other subsystems, will explode in cost unless fundamental compromises are made to the overarching design relative to what users are used to. Finally, the reliability of the system, in terms of both hardware and software, will prove daunting, as the sheer scale of the components will present observable failure rates outside the bounds of what most users would expect.
Whenever technology is expected to increase in performance or capability by orders of magnitude, the impact of scaling becomes very pronounced and daunting. One cannot simply deploy more of the current standard in order to get to exascale, because cost and space will render the approach untenable. Consequently, invention and innovation are required across all aspects of the system simultaneously.
All of these barriers can be overcome but at what cost? One of the real cost issues associated with exascale has to do with the possibility that changes will be required to the programming models used today. This has the potential to put a tangible burden on users in terms of migrating and optimising current codes as well as the training required for new programming models. Most of the companies pursuing exascale have, therefore, made a real effort to eliminate or minimise the impact on programming models.
Looking to the future, someone could achieve exascale today with enough money, space and power, but that would likely be a system of limited utility and shaky reliability at best. I think a practical system (with capabilities that extend beyond just a Flop-centric design) is likely anywhere from five to eight years away. There are unlikely to be radical developments that will make exascale a fait accompli, but every day there are incremental changes in all aspects of technology that make the ultimate goal feasible.
Pete Beckman, director of the Exascale Technology and Computing Institute at Argonne National Laboratory
There are layers of challenges that stand between us and exascale, and while the industry is focused on reaching that goal, we still have a long way to go. At the very top of what needs to be addressed is programming. A great deal of uncertainty still exists about how to express billion-way parallelism, as our current models were designed around the assumption that, in essence, ‘equal work is equal time’. Scientists divide their applications into portions, distribute those portions across the entire machine, and then assume that each one will execute in the same amount of time as all the others. The problem is that we know exascale machines are not going to deliver such uniform execution, due to factors such as power management, load imbalance and thermal issues, and our programming paradigms give us no good way to express these massive amounts of parallelism.
Right now, OpenMP does not provide a solution to that, and while MPI is perfect for sending messages between nodes, it doesn’t solve the paradigm either. We need to figure out how to express parallelism and, at the same time, build a runtime that handles the load balancing – we can’t simply assume that everything will be equally partitioned. There are fantastic ideas floating around the industry, but one thing we’ve learned in the scientific community is that it can take 10 years or more to move to a stable environment of languages, compilers and tools.
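The cost of the ‘equal work is equal time’ assumption can be seen in a toy simulation: in a bulk-synchronous step, every rank waits for the slowest one, so per-rank variability erodes efficiency as rank counts grow. The jitter model below is an illustrative assumption, not a measurement of any real machine:

```python
import random

def bsp_efficiency(ranks: int, steps: int = 50, jitter: float = 0.5,
                   seed: int = 0) -> float:
    """Fraction of ideal speed achieved when each synchronised step
    costs max(rank times) rather than the mean rank time."""
    rng = random.Random(seed)
    ideal = elapsed = 0.0
    for _ in range(steps):
        # Each rank's work takes 1.0 plus up to `jitter` of random slowdown.
        times = [1.0 + rng.random() * jitter for _ in range(ranks)]
        ideal += sum(times) / ranks   # average: perfectly balanced machine
        elapsed += max(times)         # reality: wait for the slowest rank
    return ideal / elapsed

for n in (10, 1_000, 10_000):
    print(f"{n:>6} ranks: {bsp_efficiency(n):.3f} of ideal")
```

The more ranks in the synchronisation, the closer each step’s cost gets to the worst case, which is why a runtime that rebalances work dynamically, rather than assuming equal partitions, becomes essential at scale.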
Vendors will need to come together to promote a single programming environment that can be stable across all platforms. MPI is a good example of companies coming together to agree on a single standard that everyone is confident in. However, in this space, vendors are still competing with their own unique technologies and approaches. Nvidia has the Cuda programming model, for example. OpenACC is another option, but it is currently in a state of change. And people are attempting to roll OpenCL features into OpenMP. The ideal would be for the industry to agree on a standard, and that universities teaching courses on scientific computing would all use this same model. This would take us a step closer to exascale.
Being able to do more than just a stunt exascale run by 2018 seems like a stretch of the imagination. Someone could buy hardware and demonstrate something very large, and there’s a lot of pride that comes with that, but it wouldn’t impact scientists who want to move to the next level of their work. We’re further away from exascale than I would have hoped, and part of that delay is down to funding. A more realistic goal for reaching that level of compute is 2020. There will be the early adopters who use heroic efforts to add massive parallelism to their code but, until we have a stable programming environment that can be purchased from any vendor, we will struggle to have large communities of scientists using these platforms.
John Goodacre, director of technology and systems, CPU group at ARM
It’s easy to see the main exascale hurdle simply as the need to deliver an increase in operations per watt, and a number of people in the community are simply looking at the most power-efficient implementations of the technology that can replace the old components in their existing system: a new GP-GPU fabricated in the latest 28nm process; a low-power ARM processor; a low-power interconnect. Approaching the challenge through such incremental changes, however, is unlikely to achieve the target efficiencies. The main hurdle is moving our thinking outside of current system architectures, and beyond the constraints imposed by those technologies.
We need to consider all aspects of the system: the choice of fabrication technology; nanotechnology integration; the design of the silicon system, not just in terms of the processing, memory and I/O of a packaged part, but how that part could be optimised into a sub-system of parts to deliver the required compute density. The reliability and manageability of such a system can’t be ignored as the compound effect of mean time between failures (MTBF) cripples a system containing millions of parts. Then all this hardware innovation must be designed in harmony with the software. You can’t assume an application that scales to a few thousand cores today will scale to a million and beyond, and you can’t assume that the latency and bandwidth expectation of that software will simply work if you reduce the costs of communication.
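The compound MTBF effect mentioned above is simple arithmetic: under the usual assumption of independent components with constant failure rates, the rates add, so the system MTBF is the component MTBF divided by the part count. The component figures below are illustrative assumptions:

```python
def system_mtbf_hours(component_mtbf_hours: float, n_components: float) -> float:
    """MTBF of N independent components, each failing at a constant rate.
    Failure rates add, so the system MTBF is the component MTBF / N."""
    return component_mtbf_hours / n_components

# Illustrative: parts rated at 1 million hours MTBF (~114 years each).
part_mtbf = 1e6
for n in (1e3, 1e6, 1e7):
    print(f"{n:10.0e} parts -> system MTBF {system_mtbf_hours(part_mtbf, n):8.2f} h")
```

Even with parts that individually fail once a century, a million-part system sees a failure roughly every hour, which is why reliability and manageability cannot be bolted on afterwards.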
In the next 12 months we’ll see the first compute devices that adopt a low-power processor integrated design, along with integration of the latest connectivity I/O. These integrated devices will be able to demonstrate the efficiency savings that system-on-chip (SoC) design and I/O integration can bring to a system: the removal of abstract interfaces between processing and I/O, and a reduction in Watts per operation. The realisation of the now hugely increased relative cost of communication will drive the next phase of holistic system design, and we’ll see both optimism about more efficient compute at increased density, and pessimism that applying all the component-level optimisations will not be enough to reach exascale.
The challenge is that the community is full of highly specialised individuals. The processor design engineer knows little about the interface characteristics of the latest connectivity I/O, and the algorithm writer knows little about the advanced memory models that could be enabled by nanotechnology integration of new memory types supported by 3D IC integration. Once a holistic view can be applied to the problem, I predict that the scalability challenges of exascale will start to fall into place; the order-of-magnitude benefits that each aspect of the system brings will, added together, overcome the challenges.
Sanjay Bhal, focused end equipment manager, HPC and cloud at Texas Instruments, and Arnon Friedmann, business manager, DSP at Texas Instruments
It is apparent that the evolution of a generally accepted architecture such as x86 will not reach the power efficiency required to power an exascale computer. Architectural approaches such as GPUs have been tried, but so far have not shown a path to the necessary scale of power efficiency. Current architectures could be extended to exascale, but such a machine would take around 100 megawatts of power and is therefore not cost effective. Investment in the system architecture is required to enable scalability at a lower power cost. This includes research into new types of memory, interconnect and I/O technology.
In terms of raising GFlops/Watt by orders of magnitude, it does seem that we are on the right track with new acceleration ideas. This leads directly into the next challenge: data access. While processors are gaining higher-order multicores and larger numbers of compute elements, feeding data to these computing behemoths is becoming increasingly challenging. New memory technologies like HMC and Wide I/O may help mitigate the problem, but as they are new it is unclear yet whether they will provide the answer. As we continue scaling away from the processing cores, the interconnect is the next challenge. Again, as we hit unprecedented numbers of cores operating together, we are going to need orders-of-magnitude improvements in interconnects, and it remains to be seen whether we know the way forward. Simply scaling InfiniBand or GigE may not be sufficient.
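One way to quantify the data-feeding problem is machine balance: bytes of memory bandwidth available per floating-point operation. The node figures below are assumed for illustration only, not taken from any specific product:

```python
def machine_balance(bandwidth_gb_s: float, gflops: float) -> float:
    """Bytes of memory bandwidth available per floating-point operation."""
    return bandwidth_gb_s / gflops

# Illustrative nodes: compute tends to grow much faster than bandwidth.
today = machine_balance(bandwidth_gb_s=200, gflops=1_000)      # 0.2  B/flop
future = machine_balance(bandwidth_gb_s=1_000, gflops=20_000)  # 0.05 B/flop

# A streaming kernel needing, say, 8 bytes per flop runs at only a small
# fraction of peak on either node -- and the fraction shrinks over time.
print(f"fraction of peak at 8 B/flop: {today/8:.3f} -> {future/8:.4f}")
```

On these assumed numbers, the bandwidth-bound fraction of peak falls from 2.5 per cent to 0.6 per cent across the two nodes, which is the sense in which feeding the compute elements, not building them, becomes the binding constraint.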
The final aspect is the software. Current MPI-based programming methods may not provide the efficiencies needed to hit the coming wave of exascale and beyond computing machines. Combine this with the new programming models required for the accelerators being built to achieve the necessary GFlops/Watt and the view of supercomputing software for exascale becomes quite murky.
Achieving exascale will also present new challenges that we cannot even predict at this point. As we get closer, those challenges should crystallise and solutions will then be found. The most likely timeframe is probably 2020-2023. There is a strong push to get there by 2020, but this would necessitate more investment than we currently see in the community. It also depends on the efficiency that people are willing to accept for exascale: one could build a very inefficient machine in a shorter timeframe if there were the willingness to supply vast amounts of power to such a system.
Dr Thomas Schulthess, professor of Computational Physics and director at the Swiss National Supercomputing Center (CSCS)
The industry at large is in a state of confusion, because exascale as a goal is not well defined. Simply extrapolating from sustained tera- to peta- and on to exaflops in terms of the High-Performance Linpack (HPL) benchmark produces machines that may not be useful in applications. Since these machines will be expensive to produce and operate, they will need a real purpose, which HPL no longer represents well.
We have two fundamental problems: the frequency of individual processor cores no longer increases; and moving data over macroscopic distances costs time (latency) and energy. The former translates into an explosion of concurrency and the latter requires algorithms that minimise data movement. Critically, we don’t have good abstractions or adequate programming models for the type of architectures required to address these challenges. It will thus be difficult to develop application codes that use exascale systems effectively.
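The data-movement problem can be made concrete with the kind of order-of-magnitude energy figures often quoted in exascale reports; the picojoule numbers below are illustrative assumptions, not measurements:

```python
# Illustrative energy costs in picojoules (assumed orders of magnitude only).
PJ_PER_FLOP = 20         # a double-precision operation on chip (assumed)
PJ_PER_BYTE_DRAM = 250   # moving one byte from off-chip DRAM   (assumed)

def movement_fraction(bytes_moved: float, flops: float) -> float:
    """Fraction of the energy budget spent moving data vs computing."""
    move = bytes_moved * PJ_PER_BYTE_DRAM
    compute = flops * PJ_PER_FLOP
    return move / (move + compute)

# A kernel streaming 8 bytes from DRAM per flop spends nearly all its
# energy on data movement:
print(f"{movement_fraction(bytes_moved=8, flops=1):.1%} of energy is movement")
# A communication-avoiding version moving 0.05 bytes per flop brings
# that below half:
print(f"{movement_fraction(bytes_moved=0.05, flops=1):.1%}")
```

This is why algorithms that minimise data movement, rather than flops, are central to any exascale programme.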
These challenges can be overcome if the HPC community approaches exascale in new ways. Irrespective of architectural direction, a massive investment in software and application code development will be required. Only if we leverage investments in other areas that face the same challenges, such as mobile devices and the gaming industry, will we be able to sustain the path to exascale. Technologies from the gaming industry in particular will attract recent graduates to the HPC developer community – our field will need many new developers.
The recently announced Open Power Consortium is something to watch. It will bring fresh competition to the market for latency-optimised cores and open new avenues in hybrid multi-core design. This architecture could rapidly develop beyond the accelerator model of today’s CPU-GPU systems, providing key innovation toward exascale for several application domains. We will see the first exascale machines by the end of this decade, although not traditional supercomputers designed for HPL. Several scientific domains, such as climate or brain research, require machines of this scale – it will happen if there is a real need and a path to solution.
We should focus on science areas that require 100- to 1000-fold performance improvements over what is available today, and design supercomputers specifically to solve their problems. Problem owners, i.e. the domain science communities, have to take charge and the HPC industry at large should provide the necessary support. This is how large, expensive-to-operate scientific research infrastructures are built. Furthermore, if we adopt this approach and the purpose of a particular exascale system is well understood and articulated, fewer concerns will be raised about development and operational costs – questions keep being raised about power bills of supercomputers, while hardly anybody discusses the power consumption of the Large Hadron Collider.
William Gropp, director of the Parallel Computing Institute, deputy director for research at the Institute for Advanced Computing Applications and Technologies, and Thomas M. Siebel Chair in Computer Science at the University of Illinois Urbana-Champaign
In order to reach exascale, a more explicit focus on programming for performance is required; any idea that we can delegate that problem to programming models or software tools is misguided at best. We have never been able to achieve this, and there is no evidence that we will be able to in the future. People have been trying this approach for a long time, and occasionally we do see demonstration cases that work, but in general it has been a very hard process, particularly when the adjustments needed to improve the speed of codes are at odds with keeping those codes clear. Exascale systems are only going to make this situation far more complicated, because they will require more specialised data structures to be optimised.
As an industry, we’ve been plodding along pretending that someone else will deal with the performance issues, but we need to recognise that performance is part of correctness, rather than something we merely hope for. An underperforming code represents a big problem, and just as we have programming constructs to help us with correctness, we need performance constructs as well. This would also address many of the performance ‘surprises’ people see today – whether OS noise and performance irregularities, or interactions between different programming systems, such as between MPI and OpenMP. In the case of MPI and OpenMP, there’s really nothing wrong with either of these as they stand, but the tools that were supposed to work on top of them never materialised. Replacing them is not the only way to go, and in fact wouldn’t necessarily address the real problem: performance. Apart from those of us who find it intellectually stimulating, parallel programming is never done for fun! It’s done because it has become a necessity, and it doesn’t make sense that greater emphasis hasn’t been placed on building tools for this area.
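One reading of ‘performance constructs’ is machinery that makes a blown performance expectation fail loudly, just as an assertion does for a violated invariant. A minimal sketch, where the decorator name and time budget are hypothetical illustrations rather than any existing tool:

```python
import functools
import time

def perf_assert(budget_seconds: float):
    """Treat a blown time budget as an error, the way `assert` treats a
    violated invariant -- performance as part of correctness."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            elapsed = time.perf_counter() - start
            if elapsed > budget_seconds:
                raise AssertionError(
                    f"{fn.__name__} took {elapsed:.4f}s, budget {budget_seconds}s")
            return result
        return wrapper
    return decorate

@perf_assert(budget_seconds=1.0)   # deliberately generous budget for the demo
def dot(xs, ys):
    return sum(x * y for x, y in zip(xs, ys))

print(dot(range(1000), range(1000)))
```

A real construct would live in the language or runtime rather than a wrapper, but the principle is the same: the performance contract is stated next to the code and violated contracts are errors, not surprises discovered in production runs.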
Of course, there are tools that attack some of these problems, for example by rewriting a lot of code. Domain-specific languages are an attempt to do this by enabling programmers to express what they want to do at a higher level of description. Because of their narrower focus, these higher-level languages are less general purpose than MPI. While this offers the compiler more knowledge about what is going on, it may still have trouble discovering what it needs to do. These are undoubtedly steps in the right direction, but they are only steps – few tools are interoperable or general purpose enough.
The other problem with domain-specific languages is the word ‘domain’. I prefer to look at them as data structure specific languages, because that’s essentially what they are. Matlab, for example, does not apply to any single scientific domain. Rather, it’s a matrix language. Regular grid languages can also apply to any scientific domain that requires a regular grid.
We’ll be on the right road as soon as we can learn to make these tools interoperable, and can view them as languages that handle parts of algorithms and data structures needed for a particular part of a calculation. But we need to make performance part of programming in order to get there.
Bill Dally, chief scientist and SVP of Research at Nvidia
Power efficiency is challenging because the magnitude of the gap to be closed is large and the amount of gain that we can expect from better semiconductor processes is much smaller than in the past. Today the most energy-efficient supercomputers – those at the top of the Green 500 list – are based on Nvidia Kepler GPUs and have a power efficiency of about 2Gflops/W. To get to exascale within 20MW (a stated goal), we must achieve 50Gflops/W, a 25-fold improvement. It’s as if we had to improve the efficiency of a car that gets 20mpg to get 500mpg.
To make this 25-times gap even more difficult to close, the gains we now get from process improvements have been greatly reduced. Back in the days of voltage scaling, a new generation of process technology gave about a 60 per cent reduction in energy. Today, each new generation gives only about a 20 per cent reduction. Over the three generations between now and exascale, process technology will give only about a 2.2-times improvement, leaving about 12-times to be achieved by other means.
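The two gaps compose multiplicatively, and the arithmetic is easy to check; note that the naive compounding of 20 per cent per generation lands close to, though not exactly on, the 2.2-times figure quoted:

```python
# The efficiency gap: an exaflop within a 20 MW budget.
target_gflops_per_w = 1e18 / 20e6 / 1e9   # = 50 Gflops/W
gap = target_gflops_per_w / 2.0           # vs ~2 Gflops/W today -> 25x

# Process technology: a ~20% energy reduction per generation multiplies
# efficiency by 1/0.8 = 1.25x per generation.
process_gain = 1.25 ** 3                  # ~1.95x over three generations
remaining = gap / process_gain            # ~12-13x from architecture,
                                          # circuits and software
print(target_gflops_per_w, gap, round(process_gain, 2), round(remaining, 1))
```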
Programming with massive parallelism is likewise challenging because it requires a change to how programmers think and program. Today a large supercomputer, like the Titan machine at Oak Ridge National Laboratory, requires roughly 10 million threads – independent pieces of work – to execute in parallel to keep busy. An exascale machine will require 10 billion threads to keep busy. This thousand-fold increase in parallelism requires rethinking how many applications are written. It will require what is called ‘strong scaling’ where we increase the parallelism more rapidly than we increase problem size.
I am optimistic, however, that we will rise to the challenge of programming with massive parallelism by creating better programming tools that automate much of the task of mapping an abstract parallel program to a particular machine architecture. Such tools will use auto-tuning to find optimal mappings, enabling the 1000-fold increase in parallelism without burdening the programmer. Ultimately, the gaps of energy efficiency and parallel programming will be closed by a myriad of small steps. Improved technologies are being reported every year in the circuits, architecture, and parallel programming conferences.
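Auto-tuning in its simplest form is an empirical search over implementation parameters, timing each candidate on the target machine and keeping the fastest. A toy sketch, where the chunked-sum kernel and candidate sizes are illustrative stand-ins for a real tunable kernel and its blocking factors:

```python
import time

def chunked_sum(data, chunk):
    """Sum `data` in fixed-size chunks -- a stand-in for a tunable kernel
    whose best blocking factor depends on the machine."""
    total = 0.0
    for i in range(0, len(data), chunk):
        total += sum(data[i:i + chunk])
    return total

def autotune(data, candidates):
    """Time every candidate chunk size and return the fastest one."""
    best, best_time = None, float("inf")
    for chunk in candidates:
        start = time.perf_counter()
        chunked_sum(data, chunk)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = chunk, elapsed
    return best

data = list(range(100_000))
best = autotune(data, candidates=[64, 512, 4096, 32768])
print("fastest chunk size on this machine:", best)
```

Production auto-tuners search far larger parameter spaces with smarter strategies than brute force, but the principle is the one described above: the tool, not the programmer, discovers the mapping from abstract program to machine.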