
The new realism: software runs slowly on supercomputers

No supercomputer runs real applications faster than five per cent of its design speed. Robert Roe and Tom Wilkie report on recalibrating expectations of exascale, and on efforts to tune software to run faster

The speed with which supercomputers process useful applications is more important than rankings on the Top500, but exascale computers are going to deliver only one or two per cent of their theoretical peak performance when they run real applications. Both the people paying for, and the people using, such machines need to be realistic about just how slowly their applications will run.

Tuning applications to run efficiently on massively parallel computers, and the inadequacy of the traditional Linpack benchmark as a measure of how real applications will perform, were persistent themes throughout the ISC High Performance Conference and Exhibition in Frankfurt in July.

Bronis de Supinski, chief technology officer at the Livermore Computing Center, part of the US Lawrence Livermore National Laboratory, told the conference: ‘We don’t care about the Flops rate, what we care about is that you are actually getting useful work done.’

But according to Jack Dongarra from the University of Tennessee: ‘You’re not going to get anywhere close to peak performance on exascale machines. Some people are shocked by the low percentage of theoretical peak, but I think we all understand that actual applications will not achieve their theoretical peak.’

Dongarra’s judgement was echoed in a workshop session by Professor Michael Resch, director of the HLRS supercomputer centre in Stuttgart, Germany. The HLRS was not targeting exascale per se, he said, because the centre’s focus was on the compute power it could actually deliver to its users.

According to Resch, simple arithmetic showed that if an exascale machine achieved a sustained performance of only one to three per cent of peak, it would deliver 10 to 30 petaflops. So buying a 100 petaflop machine that was 30 per cent efficient – which should be achievable, he claimed – would deliver the same compute power, for a much lower capital cost and about one tenth of the energy cost of an exascale machine.
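Resch's arithmetic can be checked in a few lines. The figures below are purely illustrative, using the percentages quoted above; the `sustained` helper exists only for this sketch:

```python
# Back-of-the-envelope comparison following Resch's argument.
# All numbers are illustrative; real machine costs and power
# draws will of course vary.

def sustained(peak_pflops, efficiency):
    """Sustained performance = theoretical peak x fraction delivered."""
    return peak_pflops * efficiency

exascale_peak = 1000.0          # 1 exaflop = 1,000 petaflops
exa_low = sustained(exascale_peak, 0.01)
exa_high = sustained(exascale_peak, 0.03)

modest_peak = 100.0             # a 100 petaflop machine
modest = sustained(modest_peak, 0.30)

print(f"Exascale at 1-3% sustained: {exa_low:.0f}-{exa_high:.0f} PF")
print(f"100 PF machine at 30% sustained: {modest:.0f} PF")
```

The 100 petaflop machine at 30 per cent efficiency matches the top end of the exascale machine's sustained range, which is the core of Resch's cost argument.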

The Jülich Supercomputing Centre (JSC) in Germany is tackling the issues of software scalability and portability by setting up what it calls ‘the High-Q Club’, Dr Dirk Brömmel, a research scientist at JSC, told the session entitled ‘Application extreme-scaling experience of leading supercomputing centres’. ‘We want to spark interest in tuning and scaling codes,’ said Brömmel. ‘The work does not stop there; the ultimate goals of the programme are to encourage our users to try and reach exascale readiness.’

At the Lawrence Livermore National Laboratory, effort is going into developing APIs and tools to create applications and to optimise how code runs on a cluster. De Supinski said that LLNL’s plan was to use a programming tool developed at Livermore called Raja. ‘The idea of Raja is to build on top of new features of the C++ standard. The main thing that we are looking at is improving application performance over what we are getting on Sierra, Sequoia, and Titan. Application performance requirements are what we really care about.’

The US Coral programme means that applications will be running on systems with a Linpack performance well in excess of 100 petaflops within the next couple of years, and de Supinski highlighted that the US national labs will take account of memory requirements as much as processing speed in their approach to the problem of tuning software applications to run on the new hardware.

De Supinski said: ‘We did have a peak performance figure in there, but that is a very low bar. We will actually pretty well exceed that.’ He continued: ‘We also asked for an aggregate memory of 4 PB and what we really care about is that we have at least one GB per MPI process. It turns out hitting the four petabytes was the most difficult requirement that we had.’ He went on to explain that memory budgets and memory pricing were a hindrance in achieving this requirement.
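The two memory requirements together imply a ceiling on rank counts. A rough back-of-the-envelope, assuming binary units (1 PB = 1024² GB) and exactly the 1 GB minimum per process – a hypothetical sketch, not a statement of the actual Coral configuration:

```python
# How many MPI ranks can 4 PB of aggregate memory host if every
# rank must get at least 1 GB? (Binary units assumed.)
GB_PER_PB = 1024 ** 2
aggregate_gb = 4 * GB_PER_PB       # 4 PB expressed in GB
min_gb_per_rank = 1
max_ranks = aggregate_gb // min_gb_per_rank
print(f"up to {max_ranks:,} MPI ranks at 1 GB each")  # just over 4 million
```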

‘In my opinion, it is not power or reliability that are the exascale challenges: it’s programmability of complex memory hierarchies,’ de Supinski said.

Dongarra issued his warning about how poor the performance of supercomputers actually is, compared to the theoretical performance as measured in the Top500 list, while discussing the results of a different benchmark for measuring the performance of supercomputers: the High Performance Conjugate Gradients (HPCG) benchmark. As had been widely expected, when the Top500 list was announced on the opening day of the conference, Tianhe-2, the supercomputer at China’s National University of Defence Technology, had retained its position as the world’s No. 1 system for the fifth consecutive time. Tianhe-2 also took first place in the alternative metric for measuring the speed of supercomputers, the HPCG, announced on the last day of the full conference.

The bi-annual Top500 list uses the widely accepted Linpack benchmark to monitor the performance of the fastest supercomputer systems. Linpack measures how fast a computer solves a dense n by n system of linear equations.
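A toy, single-node illustration of what Linpack measures is easy to sketch: time a dense solve and convert the standard LU operation count, (2/3)n³ + 2n², into a flop rate. This is only a sketch of the idea – the real HPL benchmark is a carefully tuned distributed-memory code:

```python
import time
import numpy as np

def linpack_style_rate(n):
    """Time a dense n x n solve and report an approximate flop rate
    plus the scaled residual (HPL also checks a residual, to confirm
    the computed answer is actually usable)."""
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)
    t0 = time.perf_counter()
    x = np.linalg.solve(A, b)
    elapsed = time.perf_counter() - t0
    flops = (2 / 3) * n**3 + 2 * n**2   # standard LU operation count
    residual = np.linalg.norm(A @ x - b) / np.linalg.norm(b)
    return flops / elapsed, residual

rate, res = linpack_style_rate(1000)
print(f"~{rate / 1e9:.1f} Gflop/s, relative residual {res:.1e}")
```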

But the tasks for which high-performance computers are being used are changing, and so future computations may not be done with floating-point arithmetic alone. Consequently, there has been a growing shift in the practical needs of computing vendors and the supercomputing community, leading to the emergence of other benchmarks, as discussed by Adrian Giordani in ‘How do you measure a supercomputer’s speed?’ (SCW June/July 2015, page 22).

The HPCG (High Performance Conjugate Gradients) benchmark project is one effort to create a more relevant metric for ranking HPC systems. HPCG is designed to exercise computational and data access patterns that more closely match a broad set of important applications, and to give computer system designers an incentive to invest in capabilities that will have an impact on the collective performance of these applications. Its creators suggest that Linpack and HPCG might best be seen as ‘bookends’ of a spectrum: the likely speed of a real application lies somewhere between the two, and the closer the two benchmark results are, the more balanced the system.
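A minimal conjugate-gradient solver sketches the kernel HPCG is built around – a simplified illustration, not the benchmark's reference code. Each iteration is dominated by a sparse matrix-vector product and a handful of vector updates, so few flops are performed per byte moved, which is why HPCG rewards memory bandwidth rather than raw floating-point peak:

```python
import numpy as np

def cg(matvec, b, tol=1e-10, max_iter=1000):
    """Conjugate gradients for a symmetric positive-definite system,
    given only a function computing the matrix-vector product."""
    x = np.zeros_like(b)
    r = b - matvec(x)          # initial residual
    p = r.copy()               # initial search direction
    rr = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)         # the memory-bound sparse matvec
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

# Example: the 1-D Laplacian (tridiagonal), a classic CG test problem.
n = 100
def laplacian(v):
    out = 2.0 * v
    out[:-1] -= v[1:]
    out[1:] -= v[:-1]
    return out

b = np.ones(n)
x = cg(laplacian, b)
print("residual norm:", np.linalg.norm(laplacian(x) - b))
```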

Although the Europeans may not have the investment to install new systems at the same rate as their American counterparts, they are also looking towards the exascale era of supercomputing. The High-Q Club is one example of the preparation taking place to focus application users on scaling their codes so that they can run across all of the JSC’s IBM Blue Gene racks.

Brömmel said: ‘The trend towards much higher core counts seems inevitable. Users need to adapt their programming strategies.’ He is heavily involved with software optimisation at the JSC, organising the JUQUEEN Porting and Tuning Workshops and initiating the High-Q Club.

He went on to explain that the JSC had initiated the programme to build up a collection of software codes, from several scientific disciplines, that can be successfully and sustainably run on all 28 racks of Blue Gene/Q at JSC. This means that software must scale massively – effectively using all 458,752 cores with up to 1.8 million hardware threads.
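The core and thread counts quoted for JUQUEEN follow directly from the Blue Gene/Q configuration of 1,024 nodes per rack, 16 compute cores per node and four hardware threads per core:

```python
# Blue Gene/Q configuration figures for JUQUEEN.
racks = 28
nodes_per_rack = 1024
cores_per_node = 16
threads_per_core = 4

cores = racks * nodes_per_rack * cores_per_node
threads = cores * threads_per_core
print(cores)    # 458752
print(threads)  # 1835008, i.e. ~1.8 million hardware threads
```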

So far 24 codes have been granted membership of the High-Q Club, spanning physics, neuroscience, molecular dynamics, engineering, and climate and earth sciences. Two more have been accepted into the programme, although work on optimising these codes has not yet finished.

Developing understanding and expertise around a series of codes on a particular computing platform enables the researchers at an HPC centre to share in the experience and knowledge gained. This sentiment is shared by LLNL, which has chosen to set up a centre of excellence to aid in the development and porting of applications to Sierra, a system based on the IBM Power architecture and Nvidia GPUs.

De Supinski explained that the centre of excellence (COE) would be a key tool in exploiting these technologies to make the most of the next generation of HPC applications. ‘That’s a real key aspect of how we plan to support our applications on this system. This will involve a close working relationship between IBM and Nvidia staff, some of whom will be actually located at Livermore and Oak Ridge working directly with our application teams.’

About the authors

Dr Tom Wilkie is the editor for Scientific Computing World. 

You can contact him at tom.wilkie@europascience.com

Find us on Twitter at @SCWmagazine.

Robert Roe is a technical writer for Scientific Computing World, and Fibre Systems.

You can contact him at robert.roe@europascience.com or on +44 (0) 1223 275 464.