Skip to main content

Scaling software for HPC

Embracing functional programming and a better understanding of serial progamming challenges is driving a paradigm shift in HPC coding, writes Eugenia Bahit

The difference between serial and parallel programming is enormous and increases when the objective is to take it to a high-performance ecosystem. The training process cannot focus only on learning specific technologies: scaling for HPC supposes a change in the way of analysing, planning and building software. 

Scientists must ask themselves how to analyse and plan a parallel software for HPC. What is needed to build it, and where to learn about it by researching the available bibliography, consulting with qualified experts, and differentiating between the challenges that scientists and researchers may face and those that software engineers should tackle.

From sequential to parallel computing: shifting the paradigm

Sequential programming seeks to make logical connections along the code instructions. The better connected the code instructions are, the better performance the algorithm has. However, parallel programming requires a change in perspective, forcing researchers and engineers to a paradigm shift.

Dr Peter Pacheco, Professor Emeritus at the University of San Francisco, confirms that a paradigm shift is needed since "the problems that arise in the development of parallel programs are fundamentally different from the problems that arise in serial programming" and offers race conditions as an example.

Focusing on scientists who are just trying to solve their research problem, Dr Christopher Woods, Head of Research Software Engineering at the University of Bristol, indicates that “writing a program needs to stop being seen as writing a recipe” and recommends that researchers “view the process as assembling a data processing factory, having to specify what tasks need completing, and what depends on what”. He concludes that “the paradigm shift is a much greater embrace of functional programming ideas such as map-reduce and the use of collective operations or already-parallelised libraries, such as the automatic parallelisation of Fortran”.

It becomes necessary to understand that in parallelisation, the computational instructions inside a parallel code block remain sequential. Even the main process that manages the parallel blocks is sequential, and only independent tasks can be executed simultaneously. Notwithstanding, it is no longer just a question of creating logical connections among instructions but accurately dividing the proper code blocks. So, a first step for scaling a program from sequential to parallel computing may involve localising the code blocks that can be executed independently. However, this task is not so easy because it is not only about identifying independent blocks. It requires a previous understanding of the problem that is desirable to solve using parallelism to decide how to split and parallelise it.

From a computer science perspective, it is worth asking whether a methodological consensus or a scientific methodology exists to analyse and solve a problem from a parallel computing context. An accepted methodology approach – still in force nowadays – was given by Dr Ian Foster in 1995. This four-step methodology allows for isolating the hardware-dependent aspects from those that are not. So, the first two steps, partitioning and communication, focus on concurrency and scalability. In contrast, the last two, agglomeration and mapping, focus on performance matters that largely depend on the hardware configuration.

Notwithstanding, Dr Woods assures that “only experts with a deep knowledge of a problem have a chance of writing efficient performing code if they manage communication and scheduling themselves”, so a more straightforward methodology for researchers would be “to identify the tasks to perform, describing the data that needs passing from one task to the next, and the dependencies between them” and finally, “delegating communication and scheduling to the computer”.

Identifying the problem and deciding how to split it up

To identify the problem to solve using parallelism, the Lawrence Livermore National Laboratory (LLNL) suggests using a profiling tool to localise the code blocks where the program spends most of its time or where most of the instructions are executed. It also proposes determining which code blocks are significantly slower.

When the problem is known and the accurate code blocks where the issues reside are identified, it is time to split it. This procedure is known as partitioning and can be achieved by dividing the data, a process called domain decomposition, or splitting the tasks, which is called functional decomposition.

From a parallel computing perspective, domain decomposition is applied by splitting the data among the cores so each core executes the same tasks on different data blocks. This kind of parallelism is known as data parallelism. Functional decomposition, instead, is applied by splitting the tasks among the cores so that each core executes different tasks on the same data blocks. This kind of parallelism is known as task parallelism.

For example, if it was necessary to perform one task on a large quantity of data, it could be split into smaller data blocks so that each could be processed in a different core. If more than one task is needed, one of them could be the more expensive one, where the program spends most of its time. This task could be a bottleneck, and a better solution could be to split the tasks so that one core executes that costly task and another executes the rest. However, what will those independent tasks do to merge their results? This is where more complex aspects of parallelism, such as communication and synchronisation, begin to be necessary.

Facing complex matters of the parallelisation

A common requirement is that two or more tasks need to merge their results. This demands communication and synchronisation between each process. 

Most authors and experts consulted seem to agree that a design goal is to minimise communication among tasks. Therefore, a balance in the decomposition degree might be desirable since a higher decomposition degree results in a higher level of tasks, so more communication will be needed among them.

Since synchronisation is not just about making an ordered list of tasks, Dr Pacheco explains that “more powerful parallel programs are written using explicit instructions for parallelism”, so writing effective parallelisation requires indicating – at a low level – what core executes what task and when it must do it. This could be faced by using specific libraries, which will be discussed later

A more familiar matter is identifying the critical sections where a nondeterministic problem could be produced if two or more threads or processes try to modify the same data at—more or less—the same time. Those sections require deciding the best mutual exclusion approach to avoid race conditions.

Grappling with mutual exclusion can include implementing methods such as lock. However, Dr Subodh Kumar, in his book Introduction to parallel programming, suggests using compare and swap (CAS) synchronisation instead of a lock-based one to reduce overheads. Overhead is a common problem involved in parallelism. It refers to the situation where a task demands a high consumption of the available resources. If a thread is locked because another one is still executing, it must wait before it can continue, producing a thread overhead. According to Dr Kumar, it can be avoided by using a CAS synchronisation instead of a lock-based one.

Another complex matter is deciding the better load balance, which will mean choosing how many tasks or data blocks will be sent to each core. In data parallelisation, it might seem easy to determine how much data should be managed by each core. It makes sense to think that perhaps it would be a question of splitting the data blocks into equal parts—such as n-bytes or n-words per data block—but how can it be done to split tasks into equal parts? Unfortunately, the answer is not easy. Consulted on this matter, Dr Pacheco explains: “when there is no clear equivalence between the amount of work associated with two different tasks, one possible approach may be to execute the two tasks sequentially, finding parallelism within each task. A second approach, if there are many inherently sequential tasks, might be to create a repository for the tasks and have the threads or processes select tasks from it.”

Dr Woods emphasises that for non-experts, it is too complex to do it manually, so he recommends that non-experts, “express the tasks and their dependencies, and then let the task-based parallel library do the load balancing.”

A further particular concern is modularity. Dr Pacheco reinforces this fact for a parallel program: “In general, the complexity of parallel program development is much greater than the complexity of serial program development. So it’s essential that techniques such as modular design be employed. It is far more difficult to debug a subroutine that’s being executed by multiple threads or processes. So a programmer will benefit tremendously from the existence of well-defined, independent blocks of code.”

Finally, another matter to consider is choosing between GPU or CPU programming. The selection cannot be random since the difference is substantial. Dr Pacheco explains the performance differences obtained in SIMD and MIMD systems, allowing the programmer to make that choice: “Programming SIMD systems is very different from programming MIMD systems, and since GPUs are composed of SIMD systems, to obtain reasonable performance, it’s essential to be aware of the differences. For example, in MIMD programming, it’s often the case that when different threads or processes execute different conditional branches, there is very little (if any) degradation in performance. However, in a SIMD system, even if the threads or processes are executing tasks requiring similar amounts of work, only one branch can be executed at a time, and the work may be serialised.”

Therefore, that difference could mean CPU programming might be a better alternative for algorithms with heavy logical statements.

Parallel and HPC technology

HPC infrastructure is mainly constructed over GNU/Linux systems. Most of them implement a RedHat Enterprise Linux distribution. Hence, having a knowledge base about Linux commands and essential GNU tools is indispensable.

Among the chosen programming languages for parallel and HPC are the classic ones such as C/C++ and Fortran, scientific-oriented languages such as Julia, and scripting languages such as Python. Knowing which compilers are present on the HPC infrastructure is crucial for compiled languages. GNU, Intel and Nvidia compilers are the most common.

Some helpful parallel programming middlewares for HPC are:

  • OpenMP: a compiler directive for a shared-memory model, which provides a relatively easy way to parallelise a serial program through threading creation. 
  • MPI: an interface which facilitates message-passing for distributed systems. It is implemented by several language libraries such as C, C++, Fortran, Python and Julia, among others.
  • OpenCL: an open standard which defines a programming language (C99-based) and a low-level API for parallel programming of supercomputer accelerators (CPU and GPU) and has been implemented by different vendors, such as Intel and Nvidia.
  • CUDA: a GPU programming proprietary platform by Nvidia.

Some frameworks, such as charm++ (recommended by Dr Woods) for parallel programming in C++, or Apache Hadoop, to perform the map-reduce operations—mentioned by Dr Woods—are eligible for a high-level data parallelisation.

Finally, to run a finished work on an HPC environment, it is necessary to submit and schedule the job, so knowing the basic commands to use a workload manager is desirable. In the last few years, Slurm Workload Manager, formerly known as Simple Linux Utility for Resource Management (SLURM) has increased in popularity and is widely used across academic and enterprise HPC systems.

Learning and training for HPC

Beyond the official documentation for each technology, there is a wide range of options to learn and improve HPC skills. The following lines offer a brief from online resources and books to specialised training centres.

  • A must-read book to learn to design parallel software is Designing and Building Parallel Programs by Dr Ian Foster. It is available for free at
  • The classic book An Introduction to Parallel Programming by Dr Peter Pacheco, is one of the best options for learning the fundamentals of parallel programming and its practical aspects.
  • The online book Computer Systems Fundamentals by professor Michael S. Kirkpatrick from the James Madison University (JMU) is an excellent option to deepen the fundamentals of parallel systems. It is available for free at
  • For all those who want to go deeper into parallel algorithm design, the new book Algorithms: Parallel and Sequential by professor Umut A. Acar from Carnegie Mellon University is an excellent choice. It is available for free at
  • The website of Dr Christopher Woods offers excellent free-access course choices focused on high-level parallel programming:
  • Centred on technical and practical HPC aspects and with a strong focus on scientists and researchers, the Partnership for Advanced Computing in Europe (PRACE) has a specialised training program offered by ten training centres around Europe:
  • The Danish e-infrastructure Cooperation (DeiC) has gathered an exhaustive list of best practices to code and run software on HPC infrastructures. It is available for free at


  1. Barney, B., & Frederick, D. (n.d.). Introduction to Parallel Computing Tutorial. HPC at Lawrence Livermore National Laboratory (LLNL). Retrieved 26 August 2022, from 

  2. Pacheco, P. (2011). How do we write parallel programs? In An Introduction to Parallel Programming (1st ed., p. 8). Morgan Kaufmann.





Read more about:


Media Partners