Tooling up for exascale
As the HPC industry prepares for the next major milestone in HPC performance, exascale programmers must look to new tools and frameworks that will allow them to write the software needed to exploit a billion billion (10^18) calculations per second.
However, it is not just a case of installing more hardware and running today’s applications. Much of the code written today will not work on exascale systems without considerable optimisation or large rewrites, because of the highly parallel nature of exascale computing systems and their complex hierarchical memory architectures.
Mark Bull, application architect at the Edinburgh Parallel Computing Centre, explains that this paradigm shift in HPC programming is down to the amount of parallelism that needs to be exploited to reach exascale.
‘The only way we are getting more compute power is to add more cores. The consequence of that is more explicit parallelism in applications,’ said Bull.
‘Modern architectures are very hierarchical: we have distributed nodes, each of which has multi-core processors with hyper-threads, SIMD units and potentially GPUs in them as well – so there is a big hierarchy of parallelism,’ Bull continued.
To fully appreciate this change in programming models, it is important to understand the history of HPC programming. In the past, the industry relied on increasing clock speeds to accelerate computing performance, but Dan Holmes, applications consultant in HPC Research at the Edinburgh Parallel Computing Centre, highlighted that much of this comes down to the power requirements of the underlying hardware.
Holmes said: ‘Trying to increase the clock speed, more than we currently are, can only get us to around 4GHz – the speed of many gaming PCs or workstations on the market today. Eventually the thing gets so hot that it melts the silicon it is made of, because you cannot dissipate the heat out of it fast enough. There is a limit on how fast each processor can be, and therefore to get more capability you need more of them.
‘You want the electrical distance between them to be as short as possible for power reasons and so you end up with small clumps of processors which are then grouped together in bigger clumps, and so on.’
The result is a drastic increase in performance due to the increasingly parallel nature of computing architectures, but this performance comes at a price: ‘If you have got more processors then you have problems with them communicating with each other,’ said Holmes.
The result is an increasingly complex hierarchy of reasonably small processors working in tandem. This requires more intelligent programming to address distributed memory and highly parallel computing architectures that we see in HPC today.
One concern for HPC programmers working towards exascale is that these extremely large systems present different problems at each order of magnitude, as Holmes explained.
‘It is relatively easy to scale something to four or 10 processors, but once you get past 100 you start to see some of the problems with a naïve algorithm,’ said Holmes. ‘Having solved those, you encounter a new set of problems at a thousand processors, because different levels of parallelism show up and some algorithms work less well than they did before. At 10,000 processors you see another set of problems, and so on.’
Mark Bull gave an example of this with MPI libraries. He explained that most MPI libraries are implemented in such a way that every process must store at least some information about every other process.
‘Today people are running MPI programs with a million processes, and that’s OK if you want a few bytes of storage for each one of a million processes – that consumes a few megabytes.’
‘If we want to go to a hundred million or a billion processes, which is what we are going to have to do to get beyond exascale, then that is a few gigabytes, and that quickly consumes your entire memory. It just gets out of control – we have to deal with that somehow,’ stressed Bull.
The EPCC is running two projects aimed at solving some of the challenges of exascale computing. The first, Epigram, which recently finished after three years of research, was an EC-funded FP7 project on exascale computing with the aim of preparing the message-passing and PGAS programming models for exascale systems by fundamentally addressing their main current limitations.
The idea is to introduce disruptive concepts that fill the technological gap between petascale and exascale programming models in two ways. First, innovative algorithms are applied in both message passing (MP) and the partitioned global address space (PGAS) model: to provide fast collective communication in both MP and PGAS, to decrease memory consumption in MP, to enable fast synchronisation in PGAS, to provide fault-tolerance mechanisms in PGAS, and to explore potential strategies for fault tolerance in MP.
The project also aimed to combine the best features of MP and PGAS by developing an MP interface that uses a PGAS library as its communication substrate. The idea is to use PGAS to overcome some of the shortcomings and overheads associated with using MPI for HPC message passing.
Bull highlighted that PGAS and MPI are similar in intent but use different ways of addressing memory across a system: ‘They have the same sort of flavour because they support some global address space which means that any process can read and write memory locations everywhere. This gives you the essential remote read and remote write functionality.’
Bull explained that the difference is MPI starts with a view of the machine as lots of separate bits of distributed memory: ‘Rather than there being partitions of a global address space, there are lots of separate address spaces. You send messages in between those address spaces rather than reading and writing directly to them.’
‘If you don’t know when you write the program where the data that you want is going to be at any one time then coding both sides of a two-sided message is difficult. You want to know where the data is going to be ahead of time.’
This also relates to the issue of memory usage that Bull referred to earlier, where each process requires information about every other process. Two-sided message passing could become a hindrance in the future, so the EPCC is exploring other methods of communication, such as PGAS.
‘The big advantage of the PGAS or single sided approach is that you can essentially do data dependent accesses anywhere in the machine,’ said Bull. ‘There are certain applications that can benefit from that or at least it makes them more tractable to implement.’
The second project, Programming Model INTERoperability ToWards Exascale (Intertwine), builds on the recently finished Epigram project. It takes the lessons learned from Epigram and other projects and investigates the use of APIs, with a particular focus on interoperability between the different programming frameworks used in HPC. The project covers MPI, GASPI, OpenMP, OmpSs, StarPU and PaRSEC, each of which has a project partner with extensive experience in its API design and implementation.
‘The difference is we have expanded that to include not just MPI and PGAS, but also some of the node-level programming models, such as OpenMP, and some of the newer runtime programming models, like OmpSs, for example,’ said Bull.
For many HPC programmers, it is clear that no single programming framework will be the ‘silver bullet’ that solves the challenges of exascale computing. While some projects are investigating this, Bull and Holmes think that even if a silver-bullet programming model could be created, it would take potentially years of work to port applications and to make sure the framework worked effectively across a wide range of supercomputing platforms.
Instead, Intertwine has taken a view of using the current tools but adapting them to better suit exascale software development.
Holmes stated: ‘It would be lovely to have a plan-A solution of “let’s throw the whole lot of what we currently do in the bin and use this magic silver bullet programming model which is easy to use and it translates into really efficient code”.
‘This thing doesn’t exist, so we need a plan B until it does, and Epigram and Intertwine are focusing on this plan B,’ added Holmes. ‘Let’s modify the things we have got and take an incremental approach – tweaking these things as we go, so we can use a combination that makes sense for a particular machine and a particular application.’
However, the EPCC and Europe are not the only organisations trying to solve this challenge. The US and Asia are working on their own projects, such as Legion, which is being developed at Stanford with funding from the US Department of Energy’s ExaCT Combustion Co-Design Center and the Scientific Data Management, Analysis and Visualization (SDMAV) programme.
Mark Bull admitted that nobody quite knows which model will win out in the end: ‘As an industry, it is time to hedge our bets with programming, just as we have seen with recent deployments at the BSC, where supercomputers are adopting various next-generation technologies, as it is unclear exactly what exascale programming might look like.’
However, a solution must be found, as Holmes expects the level of parallelism to keep rising – and that will be true even for HPC programmers who are not looking at exascale computing: ‘In 10 years’ time, every machine is going to need this level of programming. Unless we can find some silver bullet, every programmer is going to have to deal with these issues day to day,’ he concluded.