How to get useful work done by a supercomputer
Robert Roe explores the efforts made by top HPC centres to scale software codes to the extreme levels necessary for exascale computing.
Tuning applications to run efficiently on massively parallel computers was a focus of the last day of the conference. The session reflected concerns raised elsewhere that current supercomputers and next-generation exascale computers will deliver only one or two per cent of their theoretical peak performance when they run real applications, as reported in Exascale: expect poor performance.
The Jülich Supercomputing Centre (JSC) in Germany is tackling the issues of software scalability and portability by setting up what it calls ‘the High-Q Club’, Dr Dirk Brömmel, a research scientist at JSC, told the session entitled ‘Application extreme-scaling experience of leading supercomputing centres’. ‘We want to spark interest in tuning and scaling codes,’ said Brömmel. ‘The work does not stop there; the ultimate goals of the programme are to encourage our users to try and reach exascale readiness.’
Another aspect of this strategy is not just scaling the codes themselves but also developing APIs and tools to create applications and effectively optimise how code is run on a cluster.
Tuning codes for Coral
Bronis de Supinski, chief technology officer at the Livermore Computing Center, part of the US Lawrence Livermore National Laboratory (LLNL), highlighted the software issue facing the US national labs and described LLNL’s strategy for tuning software applications that will run on systems well in excess of 100 petaflops, as part of the US Coral programme.
Supinski stated that LLNL’s plan to combat the increasing complexity of programming at extreme scale was to use a programming tool developed at Livermore called RAJA. ‘The idea of RAJA is to build on top of new features of the C++ standard.’ He also explained that LLNL would continue to use Open MPI to provide inter-node parallelism and OpenMP to address node-level performance.
Supinski said: ‘The main thing that we are looking at is improving application performance over what we are getting on Sierra, Sequoia, and Titan.’
‘Application performance requirements are what we really care about. We termed them in what we call figures of merit; we targeted speedup over current systems of 4x on scalable benchmarks and 6x on throughput benchmarks,’ stated Supinski.
Data and memory grow in importance
Supinski explained that examples of ‘figures of merit’ (FOM) include the number of years simulated per day and the number of particles pushed per second. This gives some indication of how priorities are changing at many leading HPC centres. As computing moves into a more data-centric era, system operators are less concerned with raw FLOPS and much more interested in application performance: the amount of science achieved, rather than the sheer volume of numbers crunched per second.
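As a rough illustration of how a figure of merit differs from a FLOPS rating, the sketch below computes an application-level rate (particles pushed per second) and compares the resulting speedup with the 4x and 6x targets quoted above. The function names and the particle counts and run times are hypothetical, chosen only to make the arithmetic visible; they are not LLNL benchmark data.

```python
def figure_of_merit(work_units: float, wall_seconds: float) -> float:
    """Return an application-level rate, e.g. particles pushed per second."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return work_units / wall_seconds


def speedup(new_fom: float, baseline_fom: float) -> float:
    """Ratio of the new figure of merit to the baseline figure of merit."""
    return new_fom / baseline_fom


# Hypothetical runs: the same 2 billion particle pushes on two systems.
baseline_fom = figure_of_merit(2.0e9, 10.0)   # current system, 10 s
candidate_fom = figure_of_merit(2.0e9, 2.0)   # new system, 2 s

ratio = speedup(candidate_fom, baseline_fom)

# Targets quoted in the article: 4x (scalable) and 6x (throughput).
meets_scalable_target = ratio >= 4.0
meets_throughput_target = ratio >= 6.0

print(f"speedup: {ratio:.1f}x")
print(f"meets 4x scalable target: {meets_scalable_target}")
print(f"meets 6x throughput target: {meets_throughput_target}")
```

Nothing in this calculation refers to floating-point operations: the metric only rewards finishing more of the application's own work per unit time.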
This is consistent with LLNL’s requirements for its new system, Sierra: although a minimum peak performance of 100 petaflops was specified, the requirements focus much more on memory.
‘We did have a peak performance figure in there, but that is a very low bar. We will actually pretty well exceed that,’ Supinski said. ‘We also asked for an aggregate memory of 4 PB and what we really care about is that we have at least one GB per MPI process. It turns out hitting the four petabytes was the most difficult requirement that we had.’ He went on to explain that memory budgets and memory pricing were a hindrance in meeting this requirement.
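To put the two memory figures side by side, the short calculation below shows how many MPI processes a 4 PB aggregate memory could host if each received exactly the 1 GB minimum. It is a back-of-the-envelope sketch that assumes decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes) and ignores memory used by the operating system; the helper name is hypothetical.

```python
# Assumption: decimal (SI) units; the article does not say which convention was used.
PB = 10**15
GB = 10**9

aggregate_memory = 4 * PB          # requested aggregate memory
min_per_process = 1 * GB           # minimum memory per MPI process

# Upper bound on MPI processes if each gets exactly the minimum.
max_processes = aggregate_memory // min_per_process


def memory_per_process(aggregate_bytes: int, n_processes: int) -> float:
    """Average memory available to each MPI process, in bytes."""
    if n_processes <= 0:
        raise ValueError("n_processes must be positive")
    return aggregate_bytes / n_processes


print(f"max MPI processes at 1 GB each: {max_processes:,}")
print(f"GB per process with 2 million processes: "
      f"{memory_per_process(aggregate_memory, 2_000_000) / GB:.1f}")
```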
‘In my opinion, it is not power or reliability that are the exascale challenges: it’s programmability of complex memory hierarchies,’ Supinski said.
As increasingly large datasets are run on HPC systems, aggregate memory and memory per MPI process become increasingly important to application-centric computing. Supinski said: ‘We don’t care about FLOPS rate, what we care about is that you are actually getting useful work done and so the FOM are in terms of application metrics.’
European efforts to scale codes
Although the Europeans may not have the investment to install new systems at the same rate as their American counterparts, they must also look towards the exascale era of supercomputing. The High-Q Club is an example of the preparation taking place to focus application users on scaling their codes so that they can run across all of the JSC’s IBM Blue Gene racks.
Brömmel said: ‘The trend towards much higher core counts seems inevitable. Users need to adapt their programming strategies.’ Brömmel is heavily involved with software optimisation at the JSC, organising the JUQUEEN Porting and Tuning Workshops and also initiating the High-Q Club.
He went on to explain that the JSC had initiated the programme to build up a collection of software codes, from several scientific disciplines, that can be successfully and sustainably run on all 28 racks of Blue Gene/Q at JSC. This means that software must scale massively – effectively using all 458,752 cores with up to 1.8 million hardware threads.
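The core and thread totals follow directly from the machine's layout. The sketch below reproduces them, assuming the commonly published Blue Gene/Q configuration of 1,024 nodes per rack, 16 compute cores per node and four hardware threads per core; those per-rack and per-node figures are not stated in the article and are used here as assumptions.

```python
# Assumed Blue Gene/Q layout (not given in the article).
NODES_PER_RACK = 1024
COMPUTE_CORES_PER_NODE = 16
HW_THREADS_PER_CORE = 4

racks = 28  # number of racks at JSC, from the article

total_nodes = racks * NODES_PER_RACK
total_cores = total_nodes * COMPUTE_CORES_PER_NODE
total_threads = total_cores * HW_THREADS_PER_CORE

print(f"nodes:            {total_nodes:,}")
print(f"cores:            {total_cores:,}")
print(f"hardware threads: {total_threads:,}")
```

Under these assumptions the totals match the article's 458,752 cores and roughly 1.8 million hardware threads.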
So far 24 codes have been given membership of the High-Q Club. They range from physics, neuroscience, molecular dynamics and engineering to climate and earth sciences. Two more have been accepted into the programme, although work on optimising these codes has not yet finished.
Centres of excellence
Developing understanding and expertise around a series of codes on a particular computing platform enables the researchers at an HPC centre to share in the experience and knowledge gained. This sentiment is shared by LLNL, which has chosen to set up a centre of excellence to aid in the development and porting of applications to Sierra, which is based on the IBM Power architecture and Nvidia GPUs.
Supinski explained that the centre of excellence (COE) would be a key tool in exploiting these technologies to make the most of the next generation of HPC applications. ‘That’s a real key aspect of how we plan to support our applications on this system. This will involve a close working relationship between IBM and Nvidia staff, some of whom will be actually located at Livermore and Oak Ridge working directly with our application teams.’
The CORAL programme is a procurement process set up by the US DOE to provide 100+ petaflop machines to three US national laboratories: Argonne, LLNL and Oak Ridge. As these systems will be some of the largest in the world, they will provide a platform for software development in the US, enabling researchers to scale code to levels not possible in the past.
IBM won two of the CORAL contracts, to provide Oak Ridge National Laboratory’s (ORNL’s) new system, ‘Summit’, and Lawrence Livermore National Laboratory’s (LLNL’s) new supercomputer, ‘Sierra’, while a combination of Intel and Cray won the third bid, for the ‘Aurora’ system at Argonne National Laboratory.
Supinski said: ‘The way our RFP was structured we said we will procure the three systems but we will need two different system architectures. So we decided to choose the two systems that could provide the best overall value for the DOE.’
Investing in multiple architectures gives the DOE’s national laboratories the flexibility of two different supercomputing platforms. Applications that scale more effectively on GPUs, for example, will likely be better suited to the IBM system, which makes use of Volta GPUs and NVLink.
However, the decision to use GPUs has been made possible by features such as the device constructs in OpenMP 4.0. ‘This was key to LLNL’s decision to utilise the IBM system,’ said Supinski. ‘Device constructs are a pretty straightforward way to mark your code and say “go run this over on the GPUs” which allows application developers to better parallelise their code,’ he concluded.
A key requirement in the selection of these new systems for the DOE was portability of code, because many of these applications are not designed to run on only one system. Developing APIs and software tools that make the optimisation and porting process easier ultimately enables more science to be completed, as resources can be used more efficiently.