Next-generation software: who will write it?

As HPC systems and their software increase in complexity, users need access to more specialised programming models and an ecosystem of shared knowledge and experience to scale software effectively. Parallel programming skills are in relatively short supply, but they are becoming more important as new processor and accelerator technologies increase the levels of parallelism in HPC systems.

At the same time, these new types of processors are also leading to a diversification of compute architectures, thus splitting the experience and coding knowledge-base. HPC users may develop skills around a particular compute architecture and its associated programming model, only to discover that if they move to a different institution, or if they stay put but their organisation upgrades, they have to learn entirely new skills.

At the end of 2014, the market-research company Intersect360 Research surveyed a number of HPC centres as part of a report written for the US Council on Competitiveness. The results showed that software scalability was viewed as one of the largest barriers to a 10x increase in the scalability, and hence the productivity, of high-performance computing.

Addison Snell, chief executive officer of Intersect360 Research, gave his view on how this diversification of architectures may impact HPC in the future. He said: ‘The stress point is that it takes us back to specialisation. What the cluster era gave us, in the 1990s, was this notion of portability or commonality. If I ported an application to run on someone’s Linux cluster, it would run on any Linux cluster.

‘This takes us back and away from this. You are really going to have to choose an architecture that is going to be good for your application, or you are going to have to maintain that application across multiple architectures over time,’ concluded Snell.

To get a sense of the scale of the problem, the US Lawrence Livermore National Laboratory uses large, integrated physics programs that contain millions of lines of source code and tens of thousands of loops, in which a wide range of complex numerical operations are performed. Although these are intended for military applications in the US nuclear weapons programme, there are civilian applications of high-performance computing – for example in the oil and gas industry – that also employ software with hundreds of thousands if not millions of lines of code. Any changes in hardware or parallel programming methods will make it very difficult to achieve high performance without disruptive platform-specific changes to this type of application software.

Lessons from GPUs

GPUs have been widely adopted in HPC over the past five years, and much of this is due to the large-scale grassroots effort that Nvidia has put into developing its Cuda programming model.

Driving adoption of the technology required significant investment of both money and time, but Nvidia made smart decisions about partnering with many universities to encourage adoption of the technology by academics. This was in addition to working with research centres and university HPC centres across the globe to expand the Cuda user base.

Roy Kim, group product marketing manager at Nvidia, said: ‘There are different classes of developers within HPC; they are not all the same. There are some developers that really want to get their hands on and tune for performance on the GPU. There are others who want to get performance quickly and get on-ramp as soon as possible and really focus more on the science of their application.

‘So we have OpenACC and Cuda. Cuda is really targeting the first set of developers and OpenACC is targeting the second set.’

Kim said: ‘Parallel programming is hard. It’s not easy to have someone think about hundreds or even thousands of threads in parallel. That is the challenge that the modern HPC developer has. Cuda solves a big chunk of that programming issue, which is why it became such a pervasive programming model within HPC. It made HPC programming easier.’

Kim concluded: ‘OpenACC really offloads a lot of the burden of parallel programming to the compiler and the compiler does most of the heavy lifting for the developer. For developers that want to get acceleration on a GPU quickly, OpenACC is the right path.’
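
As a rough illustration of the directive-based approach Kim describes, the C++ sketch below annotates an ordinary loop with a single OpenACC directive. The function name saxpy and the choice of data clauses are assumptions made for this example, not anything supplied by Nvidia. A compiler without OpenACC support simply ignores the pragma and runs the loop serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes y = a * x + y element by element.
// The directive asks an OpenACC compiler to parallelise the loop and to
// manage the host/device copies named in the data clauses. The loop body
// itself is unchanged from the serial version.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const float* xp = x.data();  // raw pointers keep the data clauses simple
    float* yp = y.data();

    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

The equivalent Cuda version would need an explicit kernel, explicit memory transfers, and a launch configuration, which is the extra control (and extra work) that Kim associates with the first class of developer.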

Beyond Cuda and OpenACC

The problem facing HPC developers today as they look at software scalability is compounded as computer technology moves closer towards exascale. While most users can make use of tools such as Cuda and OpenACC today, if they wish and if they have the skills and knowledge, those at the most extreme end of HPC must look to methods that can take them beyond the current levels of parallelism and node performance.

Whether they are the US National Laboratories or corresponding institutions in Europe and Asia, the operators of the top supercomputers all face a similar challenge: scaling and supporting applications at an unprecedented scale.

This is driving development of programming models that can support the technology and promote increased parallelism and portability of code. Programming models are typically designed to make increasing performance easier for developers, but rapidly changing processor architectures and the increasing complexity of platforms that will support exascale applications are significant barriers to the design of future implementations of these models.

This is compounded by the fact that the next generation of supercomputers will increasingly rely on concurrency and complex memory hierarchies while maintaining a sufficient level of interoperability with today’s applications.

Intel: preserve the legacy code

Perhaps Intel has the easiest job in maintaining its established user base. This is because it is extending languages, models, and development tools from the Xeon CPU family across to the Xeon Phi.

Rajeeb Hazra, VP of Intel’s architecture group and GM of technical computing, said: ‘There is a lot of legacy code; there is a lot of knowledge about how to write such code that has been built up in the software industry; and we did not want to lose that. It would be akin to burning your entire library and starting anew.’

Hazra also explained that, because programmers use similar techniques on both platforms, such as threading and vectorisation, to parallelise code, investigating speed-up on Xeon Phi would not be a wasted exercise: the improvements would still benefit performance on the Xeon CPU, even if the speed-up of the code on Xeon Phi ended up being insufficient.

Hazra said: ‘That is a huge economic gain for companies that have applications and do not have the money to throw away on months of work’. He also stressed that this meant programmers could use Xeon CPU as a development environment for Xeon Phi code using AVX instructions for example.
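
The two techniques Hazra mentions, threading and vectorisation, can be sketched in a few lines of C++. The use of OpenMP directives here is an assumption made for illustration (the article does not say which tools Intel's customers use), and the function name dot is hypothetical. The point is that the same annotated source can be built for either a Xeon CPU or a Xeon Phi, and a compiler with OpenMP disabled ignores the pragma.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dot product of two equally sized vectors.
// "parallel for" spreads iterations across threads, "simd" asks the
// compiler to vectorise each thread's share, and the reduction clause
// combines the per-thread partial sums safely.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    double sum = 0.0;

    #pragma omp parallel for simd reduction(+ : sum)
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Work spent restructuring a loop so that it threads and vectorises cleanly in this way carries over to whichever of the two processors the code finally runs on.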

Nevertheless, application developers still have a lot of work ahead of them if they wish to maintain codes through the next generation of HPC, as it will increasingly rely on parallelism and concurrency to achieve performance gains.

How to scale software?

Kim said: ‘Software scalability is a key issue that the industry is grappling with.’ He explained that supercomputers are getting wider or more parallel rather than getting faster through increased processing power.

Kim said: ‘There is scalability within a server node and there is scalability across nodes. Within a node, having accelerators like GPUs gives the node high performance and lots of parallelism and that is where you are using things like OpenACC or Cuda.

‘MPI has been around for a long time and there is some overhead in terms of both processing and memory footprint but that is pretty lightweight, so it is the HPC developer’s tool of choice for scaling across multiple nodes,’ Kim concluded.

Snell said: ‘For clusters, everything was really converted over into MPI. That was not the dominant programming model before we went to clusters, and people went through a painful conversion process to get applications over into MPI. Even today, a lot of applications either are not in MPI or they do not scale as well with MPI as they did with other models.’

Snell continued: ‘How does MPI evolve? Even without accelerated components like Xeon Phi or a GPU, just having a multicore processor on the x86 side there is a question of whether the existing MPI programming model is sufficient to get enough parallelism capabilities out of those chips.’

Tuning applications

At one of the sessions on the last day of the ISC High-Performance conference in Frankfurt this year, tuning applications to run on massively parallel computers was the focus. Bronis de Supinski, chief technology officer at the Livermore Computing Center, part of the US Lawrence Livermore National Laboratory, highlighted the LLNL’s strategy to deal with the programming conundrum that it faces with its upcoming supercomputer, Sierra, funded through the US DOE’s Coral programme. Supinski said: ‘The main thing that we are looking at is improving application performance over what we are getting on Sierra, Sequoia, and Titan.’

Supinski, who also chairs the OpenMP Language Committee, explained that the LLNL was planning to use a combination of MPI to provide inter-node parallelism, and OpenMP to address node-level performance. While these tools will be used to support the concurrency and node performance, Supinski remarked that, on top of using these tools, the LLNL had developed a programming tool called RAJA. ‘The idea of RAJA is to build on top of new features of the C++ standard.’ The RAJA abstraction layer is designed to simplify porting C/C++ code by reducing developer disruption – which helps to support the need for interoperability between these new systems and older applications.
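
To give a flavour of how an abstraction layer of this kind can reduce disruption, the C++ sketch below separates a loop body, written once as a lambda, from the policy that decides how the loop is executed. It is not RAJA's actual API: the namespace sketch, the policy tags sequential and openmp_threads, and the helper scale are all hypothetical names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

// Hypothetical execution-policy tags, named for illustration only.
struct sequential {};
struct openmp_threads {};

// Serial back end: a plain loop over [begin, end).
template <typename Body>
void forall(sequential, std::size_t begin, std::size_t end, Body body) {
    for (std::size_t i = begin; i < end; ++i) body(i);
}

// Threaded back end: same loop, with an OpenMP directive that is
// ignored if the compiler is not building with OpenMP enabled.
template <typename Body>
void forall(openmp_threads, std::size_t begin, std::size_t end, Body body) {
    #pragma omp parallel for
    for (std::size_t i = begin; i < end; ++i) body(i);
}

}  // namespace sketch

// Application code: the loop body is written once; only the policy
// argument changes when the code moves to a different platform.
template <typename Policy>
void scale(Policy policy, double factor, std::vector<double>& data) {
    assert(!data.empty());
    double* p = data.data();
    sketch::forall(policy, 0, data.size(),
                   [=](std::size_t i) { p[i] *= factor; });
}
```

Calling scale(sketch::sequential{}, 2.0, v) and scale(sketch::openmp_threads{}, 2.0, v) runs the same body through different back ends. Supporting a new architecture would then mean adding one more overload of forall rather than editing each of the tens of thousands of loops in an application.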

Supinski said that, although the LLNL did have a target peak performance figure, it was not the primary objective of developing the new system: ‘That is a very low bar. We will actually pretty well exceed that,’ he said.

‘We asked for an aggregate memory of 4 PB and what we really care about is that we have at least one GB per MPI process. It turns out, hitting the four petabytes was the most difficult requirement that we had.’ He went on to explain that memory budgets and memory pricing were a hindrance in achieving this requirement. ‘In my opinion, it is not power or reliability that are the exascale challenges, it’s programmability of complex memory hierarchies,’ Supinski said.
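
The two memory figures in the quote above imply a rough ceiling on the number of MPI processes. The short C++ calculation below works it out, assuming decimal units (1 GB as 10^9 bytes, 1 PB as 10^15 bytes). The function and constant names are hypothetical, and the resulting figure is an inference from the quoted numbers rather than something Supinski stated.

```cpp
#include <cassert>
#include <cstdint>

// Decimal units are an assumption; the article does not say whether
// decimal or binary prefixes were meant.
const std::uint64_t kGigabyte = 1000000000ULL;          // 10^9 bytes
const std::uint64_t kPetabyte = 1000000ULL * kGigabyte;  // 10^15 bytes

// Largest number of processes that can each be given per_process_bytes
// out of aggregate_bytes of memory.
std::uint64_t max_processes(std::uint64_t aggregate_bytes,
                            std::uint64_t per_process_bytes) {
    assert(per_process_bytes > 0);
    return aggregate_bytes / per_process_bytes;
}
```

With these assumptions, max_processes(4 * kPetabyte, kGigabyte) gives 4,000,000, so an aggregate of 4 PB supports at most about four million MPI processes at 1 GB each.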

Supinski said: ‘We don’t care about FLOPS rate, what we care about is that you are actually getting useful work done.’ This is the kind of view that will become more and more familiar as computing moves to a more data-centric model. In many cases, future systems will be more reliant on data than pure number-crunching power, because of the increasing size of datasets. In turn, this will drive memory size and the development of the complex memory hierarchies that are required to handle the flow of that data.

Is Open Source the solution?

It is far too early to predict accurately which programming models will be most popular for each architecture. The Sierra system will be one of the largest supercomputers in the world and will be based on IBM’s OpenPower system using Nvidia Volta GPUs. The work done on these IBM systems will likely pave the way for HPC centres that want to make use of the IBM systems in later years.

The US national labs are in a unique position when it comes to the supercomputer market. Because they are in the vanguard of introducing new HPC machines using innovative hardware, they must also try new technologies and develop methods to scale and sustain their software applications without the help of the larger community.

Although the hardware suppliers may always be competitors to each other, Steve Conway, research vice president for high-performance computing at IDC, suggests that the increasing complexity of the new programming models will drive collaboration among HPC software developers. They will rely on tools and knowledge provided by their colleagues to help drive software scalability forward in HPC.

Conway said: ‘Ecosystems are going to be increasingly partnering with, or at least driven by, the open source community. The Intel ecosystem will be very heavily dependent on Open Source. OpenPower, as the name implies, will be as well.’

He explained that this could eventually produce a software ecosystem whereby you would have specific vendors distributing their own versions ‘that are consistent with the open source versions’ but that add functionality through proprietary features or tools.

New technologies split the knowledge-base

In this way, it seems increasingly likely that HPC developers will need to rely on a community-driven software ecosystem for software scalability in the future. However, it is uncertain how many such communities, and of what size, the wider HPC community can support, as each new community effectively splits the knowledge-base into factions that support a specific technology platform.

This was the view espoused by Snell, who warned that while specialised hardware enables these increases in parallelism, it is also splitting the code base between Intel and IBM, or between those who use Nvidia GPUs and those who use Xeon Phi.

But even this does not account for all the hardware technologies available today. Newer technologies – one might think of ARM and FPGAs – and those without a sufficiently strong user-base behind them may find difficulty in supporting HPC users adapting to the changing landscape of software programming at increased scale and parallelism.

‘Now you have to have specialised tools for the various architectures. Ultimately they can be incredibly useful but without adoption they will fail,’ said Snell.

This may bring smaller technologies or groups of users together and drive them towards more established open source communities, so that they can share in the knowledge provided by the developer community.

Conway said: ‘Even with the software stack as it is, it is becoming too onerous for one single organisation to take that on. The requirements are just exploding. You need more hands. You really need a whole community, in the order of Linux, to move things forward.

‘I think we are moving into a world of reference architectures that correspond to specific ecosystems,’ Conway concluded.

The HPC community needs to cultivate and maintain a skilled base of systems engineers, application developers, and software scientists to drive the development of the next-generation tools that will be needed. But Conway stressed that one of the issues facing the HPC community was the size of the workforce and the level of its development.

As the HPC market searches for the optimal strategy to reach exascale, it is clear that the major roadblock to improving the performance of applications will be the scalability of software, rather than the hardware configuration – or even the energy costs associated with running the system.

Robert Roe is a technical writer for Scientific Computing World, and Fibre Systems.

You can contact him at robert.roe@europascience.com or on +44 (0) 1223 275 464.

Find us on Twitter at @SCWmagazine, @FibreSystemsMag and @ESRobertRoe.
