The compute equivalent of the Mars landing
‘Moving to exascale is the compute equivalent of the Mars landing and for me it’s just as exciting.’ This comment by Franz Aman, chief marketing officer at SGI, mirrors the general industry attitude that we not only face considerable work in hardware, but need a fundamentally new approach to software. Clearly, today’s software just won’t scale up to run on millions of cores and it will also have to execute on the innovative hardware that systems will require once we reach exascale.
Taking an interesting approach are Thom H. Dunning, director at the US National Center for Supercomputing Applications (NCSA), and his team. When the Cray 1 was first introduced, computational scientists and engineers had to restructure and rewrite their applications to take advantage of its vector capabilities. These codes performed well, with some tweaking, on future Cray computers but within a decade and a half computing technology was moving toward parallel systems. We are now at the end of the ‘traditional’ parallel computing era and computational scientists and engineers are once again faced with the task of rewriting and re-designing their applications for new computing technologies.
What has really changed in the past 30 years is the intent of computer designers. In the 1970s, some of them, like Seymour Cray, focused on improving the performance of algorithms common in scientific computing, including frequently used operations such as the linked triad. With the switch to commodity systems in the 1990s, the connection to scientific computing (a small market) was lost. Researchers not only had to restructure and rewrite their codes for new, ‘massively parallel’ computing technologies, but also had to deal with the limitations of commodity hardware, such as longer memory latencies and bandwidths that were low relative to the speed of the CPU’s arithmetic units.
There is a dramatically increasing gap between a computer’s peak performance and what is realised in real science and engineering applications. This is also true for the Linpack benchmark, which is no longer a reasonable proxy for the performance of a computing system on many, if not most, science and engineering applications. The situation is further complicated by the fact that researchers, in an attempt to address the most pressing problems, are developing software that is extremely complex. In fact, program performance is limited by characteristics of the computing system other than CPU speed, and by the ability of software such as compilers to automatically find parallelism and opportunities for reducing latency. For data-intensive applications, I/O performance is paramount. For other applications, such as agent-based modelling, the size of the memory is critical; and in many instances, it is a combination of factors that limits performance.
The level of effort required to take full advantage of multicore and especially heterogeneous computing is intimidating, Dunning’s team adds. Scaling applications to tens or hundreds of thousands of compute cores often requires rethinking of the algorithms and, at times, even the fundamental computing approach. Effectively using many-core heterogeneous processors poses even greater problems. Although the advantages of employing modern computing technologies have been amply demonstrated by the research community, only well financed companies can typically afford to rewrite and revise their large base of legacy applications unless absolutely forced to by the competition.
The need to retool
For exascale software, the limiting factor is human: how quickly people can be trained in new techniques, says SGI’s Aman. We must reskill programmers and this is a generational effort. What we learn in university is what we generally use for a lifetime; it’s difficult to retool. By and large, we still write large-scale software in older-generation languages. ‘When we talk to the engineering team at NASA, which runs some of the largest supercomputers, we find that they do most programming with Fortran, with some PHP around the edges.’
As you get further away from CPUs, it’s more difficult to figure out how the software should be written to get the maximum out of all the layers: the OS, compilers and applications. In software, we’re not necessarily writing applications for the hardware five years away; we’ll figure out later what to do with it.
Another problem with exascale, adds Aman, is that we must expect something to break ‘every five minutes’. The mean time between failures (MTBF) of a memory chip might be a million hours, and at the desktop/laptop level an error is not likely in the computer’s lifetime, but in an exascale system with petabytes of memory, do the math: a DIMM will fail catastrophically every other hour. The software can’t just stop and restart a huge job that’s been running for days or longer. We must build in resiliency where the system can detect that a memory chip is going bad and exclude it from the pool. The software must also be more resilient to other, similar hardware failures; we must eliminate all single points of failure. And let’s not forget errors in thousands upon thousands of disk drives.
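The arithmetic behind that estimate can be sketched in a few lines of Python. Only the one-million-hour MTBF comes from the text; the 8 PB total memory, the 16 GB module size and the helper names `module_count` and `system_mtbf_hours` are assumptions chosen for illustration, and the model assumes identical parts that fail independently at a constant rate.

```python
# Back-of-the-envelope failure arithmetic for a large memory system.
# Only COMPONENT_MTBF_HOURS comes from the article; the sizes are
# illustrative assumptions.

COMPONENT_MTBF_HOURS = 1_000_000      # quoted MTBF of one memory part
TOTAL_MEMORY_BYTES = 8 * 10**15       # assumed: 8 PB of system memory
MODULE_BYTES = 16 * 10**9             # assumed: 16 GB per memory module


def module_count(total_bytes: int, bytes_per_module: int) -> int:
    """Number of modules needed to hold total_bytes (rounded up)."""
    return -(-total_bytes // bytes_per_module)


def system_mtbf_hours(component_mtbf_hours: float, component_count: int) -> float:
    """MTBF of a system built from identical, independent components.

    With a constant failure rate per component the rates add up, so the
    system MTBF is the component MTBF divided by the component count.
    """
    if component_count <= 0:
        raise ValueError("component_count must be positive")
    return component_mtbf_hours / component_count


modules = module_count(TOTAL_MEMORY_BYTES, MODULE_BYTES)
print("modules in the large system:", modules)
print("hours between failures (large system):",
      system_mtbf_hours(COMPONENT_MTBF_HOURS, modules))
print("hours between failures (2-module laptop):",
      system_mtbf_hours(COMPONENT_MTBF_HOURS, 2))
```

Under these assumed sizes the system holds 500,000 modules, which gives one expected failure roughly every two hours, while a two-module laptop would expect one in about half a million hours.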
Most applications today run in lockstep and are synchronised, but this approach won’t work on exascale machines with millions of cores, says Dr David Henty, HPC Training and Support at the EPCC (Edinburgh Parallel Computing Centre) in the UK. Obviously, we have to break problems down into many smaller parts. Furthermore, it’s not just how programs are written: standard operating systems aren’t designed to switch rapidly among large numbers of tasks, nor are they ‘parallel aware’. That is, they might be able to de-schedule a job if it’s waiting for data from disk, but aren’t aware of the fact it might be waiting for messages from another processor.
In the past 15 years, he adds, we’ve been programming machines in essentially the same way. A big challenge will be using multiple programming models in the same program. The community standardised first on MPI and then on threading models such as OpenMP, and this has allowed us to move forward during this time. Now we need new standards addressing how to program accelerators, for example with directives, although we also need standards for these directives, and those are only just emerging.
Compilers try different approaches
Today’s compilers typically take a conservative approach that is certain to work and therefore don’t examine multiple approaches. Compilers can presently ask the user for information about special cases, such as with directives, but in a large program running on an exascale system, you won’t know what information the compiler needs. While techniques for programming a few thousand cores are still working adequately, they won’t scale up to exascale. A number of projects are addressing this, one of which is CRESTA (Collaborative Research into Exascale Systemware, Tools and Applications), which is based at the Edinburgh Parallel Computing Centre (EPCC). It is investigating intelligent compilers with built-in program tuning: the compiler examines the code, tries different approaches, runs them and then picks the best result. As well as being built into a compiler, this ‘intelligence’ could also be implemented in a higher-level program that auto-tunes the code by compiling and running many versions.
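The ‘try several versions, run them, keep the best’ idea can be illustrated with a toy auto-tuner. This is a minimal sketch and not CRESTA’s method: the three `sum_squares_*` variants and the `autotune` helper are hypothetical names invented here, and a real auto-tuner would vary compiler flags, loop tilings or algorithm choices rather than three ways of writing a Python loop.

```python
import time
from typing import Callable, Dict, Sequence, Tuple


# Three hypothetical implementations of the same computation.
def sum_squares_loop(values: Sequence[int]) -> int:
    total = 0
    for v in values:
        total += v * v
    return total


def sum_squares_generator(values: Sequence[int]) -> int:
    return sum(v * v for v in values)


def sum_squares_map(values: Sequence[int]) -> int:
    return sum(map(lambda v: v * v, values))


VARIANTS: Dict[str, Callable[[Sequence[int]], int]] = {
    "loop": sum_squares_loop,
    "generator": sum_squares_generator,
    "map": sum_squares_map,
}


def autotune(variants: Dict[str, Callable], data, repeats: int = 5
             ) -> Tuple[str, Dict[str, float]]:
    """Time every variant on data and return (fastest name, best timings).

    Each variant must produce the same result as the first one; a variant
    that disagrees is treated as an error, not as a candidate.
    """
    reference = None
    have_reference = False
    timings: Dict[str, float] = {}
    for name, func in variants.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            result = func(data)
            best = min(best, time.perf_counter() - start)
        if not have_reference:
            reference, have_reference = result, True
        elif result != reference:
            raise ValueError(f"variant {name!r} gives a different result")
        timings[name] = best
    return min(timings, key=timings.get), timings


winner, measured = autotune(VARIANTS, list(range(50_000)))
print("fastest variant on this machine:", winner)
```

Taking the best of several repeats reduces the effect of timing noise; which variant wins depends on the machine and interpreter, which is exactly why the choice is measured rather than assumed.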
As for writing programs, adds EPCC’s David Henty, scientists don’t always apply the same scientific rigour to the behaviour of their software that they do to their academic research. This opinion is backed by a Princeton study of scientific programming trends presented last November at SC11 which states: ‘…scientists have hypotheses on which portions of their code are hot [where considerable execution time is spent], but more often than not do not test these hypotheses. Consequently, scientists may not be targeting the important sections of their program.’
Thus, he continues, when developing programs, snap decisions shouldn’t be made. It’s very important to understand the limitations and problems in your code. What is it in your program that hinders scaling? It’s often not what you think. Are there load imbalances? Synchronisation issues? Too many messages needed? You need to do experiments in the software to convince yourselves of what’s actually going on. Learn to use tools, such as performance analysis utilities, to get a good understanding. Take advantage of knowledge of new techniques before fully committing to a method. Then try incremental methods to make the code work faster, often with a mixture of OpenMP and MPI or implementing specific routines in a new model such as using accelerators or new languages like UPC or co-array Fortran.
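Testing a hot-spot hypothesis need not be elaborate. The sketch below uses Python’s standard `cProfile` and `pstats` modules on a made-up model; the functions `setup_grid`, `smooth`, `run_model` and `profile_report` are hypothetical stand-ins, and a production HPC code would use the profiling tools that match its own language and MPI or OpenMP runtime.

```python
import cProfile
import io
import pstats


def setup_grid(n):
    """Build an n x n grid of floats (runs once)."""
    return [[float(i + j) for j in range(n)] for i in range(n)]


def smooth(grid):
    """One averaging sweep over the interior points (runs every step)."""
    n = len(grid)
    out = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return out


def run_model(n=60, steps=20):
    """Toy simulation: set up a grid, then smooth it repeatedly."""
    grid = setup_grid(n)
    for _ in range(steps):
        grid = smooth(grid)
    return grid


def profile_report(func, *args, top=5):
    """Profile func(*args) and return a text report sorted by own time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("tottime").print_stats(top)
    return buffer.getvalue()


# The report lists the functions where time was actually spent, which can
# then be compared with where the author expected it to be spent.
print(profile_report(run_model))
```

Sorting by `tottime` ranks functions by the time spent in their own bodies, so a routine that merely calls the expensive one does not mask the real hot spot.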
What’s happening today?
Meanwhile, what are ISVs doing today to deal with these issues? To find out, I spoke with one of the largest suppliers of scientific software, The MathWorks. Jos Martin, principal software engineer, comments: ‘We’re not utterly focused on exascale because that’s too big for our customers at this time. Instead, we’re one step back but we face the same issues.’ He adds that scientists have no interest in writing software that talks directly to large clusters and that his job is to add language features in Matlab to make life easier.
As a specific instance, he refers to the PARFOR parallel FOR loop in the Parallel Computing Toolbox and also points to the Princeton study of scientific computing.
It states: ‘The most dominant numerical computing language in use was Matlab – more than half the researchers programmed with it… Only 11 per cent of researchers utilised loop-based parallelism, where programmer-written annotations enable a parallelising system to execute a loop in parallel. The most common form of loop-based parallelism was the use of the parfor construct in Matlab, which enables execution of multiple iterations of a loop in parallel and requires that the iterations access/update disjoint memory locations.’
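The loop-based parallelism the study describes can be sketched in Python as a rough analogue; this is not Matlab’s parfor and makes no claim about how the Parallel Computing Toolbox is implemented. The names `simulate_case` and `parallel_for` are hypothetical. The key property is the one quoted above: each iteration reads only its own index and writes only its own result slot.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate_case(case_index: int) -> float:
    """Stand-in for one independent loop iteration (made-up workload).

    It depends only on its own index and shares no mutable state with
    other iterations, so iterations can run in any order.
    """
    x = 0.0
    for k in range(1, 1001):
        x += (((case_index + 1) * k) % 7) / k
    return x


def parallel_for(body, n_iterations: int, max_workers: int = 4) -> list:
    """Run body(i) for i in range(n_iterations) on a pool of workers.

    Executor.map returns results in iteration order, so result i always
    lands in slot i regardless of which worker finished first.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(body, range(n_iterations)))


results = parallel_for(simulate_case, 8)
print(len(results), "independent iterations completed")
```

Threads are used here only to keep the sketch simple; in CPython, CPU-bound pure-Python iterations generally need `ProcessPoolExecutor` (with a picklable, module-level body function) to gain real speed from multiple cores.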
Martin continues by emphasising that we must learn a new way to program and that accelerators, such as GPUs, have changed the model. For instance, we rely heavily on fundamental program constructs such as memory allocation, heaps and stacks to do interesting things. GPUs, however, don’t have many of these constructs. We also realise that we are running on many threads with many constraints. GPU vendors have been trying to relax these constraints, but they also face hardware limitations. Everyone is looking for easy ways to write these programs.
One key member of the Numerical Algorithms Group’s training team, Dr Ian Bush, comments: ‘There is very obviously a gap between where academic institutions finish and where efficient use of supercomputing resources can begin. In essence, there are two pieces needed to bridge this gap. First, programmers need to be taught about the range of tools and techniques, such as OpenMP and MPI, that are available to make optimal use of HPC. The second is to teach them the ways of deciding when and how to apply which tool.’
There is no doubt that hardware innovations are happening at a rapid pace, but they also present configuration and management challenges that demand a different approach, says Matthijs Van Leeuwen, CEO of Bright Computing. System admins often fall back on familiar ways with their new clusters, using cluster management toolkits and heavy scripting to build and then manage their systems and these new technologies, and work their way through a learning curve in the process. Unfortunately this practice robs them of the significant productivity gains that could be realised by using an integrated solution, which would drastically reduce the time needed to set up and use their clusters. Instead, they needlessly sap their own productivity and system performance moving forward. There is also a huge opportunity cost here: the vast amount of time they spend scripting and keeping these tools synchronised is usually at the expense of focusing on other priorities that could take more advantage of the advances in hardware.
The hardest part is re-architecting software to use parallel algorithms instead of serial algorithms, comments Sumit Gupta, director of Tesla Product Marketing at Nvidia. This is a common task that the developer must do, whether targeting CPUs, GPUs or even FPGAs. Auto-parallelising compilers can help, for example the recently announced OpenACC GPU directive compilers from PGI, Cray and CAPS.
But even for these compilers to be effective, developers have to at least re-architect their software to use data structures that expose parallelism to the compiler. This is something done by big research labs, major ISVs and companies like oil and gas firms where performance is critical. This is why these are also the first to adopt GPUs; their code is already prepared for the massive parallelism that GPUs offer.
The challenge lies with the vast majority of developers, for whom the best way forward is to adopt OpenACC GPU directives. This method does not require major code changes, but everything the developer can do to expose parallelism in the data and in the algorithms yields greater speedups.
With the end of frequency scaling and the rise of heterogeneous computing, much software is being left behind with performance stagnating, says Oliver Pell, VP of Engineering at Maxeler Technologies. Increasing numbers of companies are rewriting their code to embrace heterogeneous computing, but this is a process that really requires expert understanding of both the application and hardware that is being targeted.
Companies that write and run their own software can evaluate the costs of changes compared to the business value or TCO benefits, but for ISVs the situation can be less straightforward. It might not be possible to charge extra for a version of their software that takes advantage of heterogeneous computing, so they are left with just the cost side which makes it unattractive to invest in that area.
For the end user, maximising performance is increasingly going to be a reason to need control over your own software rather than using third-party options that don’t fully exploit the capabilities of your hardware. On the other hand, vendors who are able to supply integrated software and hardware solutions will see their customers benefiting significantly from the greater performance these solutions can deliver.
A temporary performance gap
The NCSA team summarises the entire matter nicely: For the next several years, we will see an increasing gap between the peak performance of a computer and the realised performance on real science and engineering applications. The number one computers on the Top500 list will still be touted by the institutions that deploy them, but scientists and engineers may grow frustrated when trying to use them to do their work. The advances in computing power as written on paper or with simplistic measures will not match those ‘on the ground’.
Eventually, however, investments in software – especially if those investments increase as planned at the US Department of Energy and National Science Foundation, also in the US – will begin to decrease the gap. As this happens, the fidelity of computational models will dramatically increase and it will enable computational scientists and engineers to model the complex, real-world systems of paramount importance to society.
1. Prakash Prabhu et al., ‘A Survey of the Practice of Computational Science’, Proc. 24th ACM/IEEE Conference on High Performance Computing, Networking, Storage and Analysis (SC11), Nov. 2011.