HPC PROJECTS: MULTICORE
Multiple cores multiply programming
The transition from applications written for sequential execution to those that can take advantage of multicore architectures has taken on enormous importance and brought with it some challenging problems. Paul Schreier examines some of the tools that are available to help programmers parallelise their code
HPC has the potential to benefit enormously from the trend of placing multiple cores on each processor chip. Suddenly, in virtually every desktop machine, multiple cores are available to speed up large tasks that can be broken into multiple parallel threads.
Software, however, hasn’t been keeping pace with this hardware trend. For instance, market research company IDC has heard from some users, for the first time, that legacy code sometimes runs more slowly after they upgrade to the latest multicore hardware. Most of today’s code isn’t optimised for multicore operation, and in some cases the overhead of passing data among cores degrades performance.
Meanwhile, the entire industry has recognised the importance of multicore code, and a number of companies have jumped in with tools that help developers create code that exploits the potential of today’s hardware and that can scale up along with HPC systems.
Speed-up depends on the application
How much of a performance increase is possible? On this topic, many people refer to Amdahl’s Law, which states that the speed-up of a program using multiple parallel processors is limited by the fraction of the program that must run sequentially. For example, if 95 per cent of a program can be parallelised, the theoretical maximum speed-up is 20x, no matter how many processors are used. This exercise points out the need to focus programming improvements on those parts of a code that can actually benefit from parallelisation, and it shows that not every code will benefit from a multicore architecture.
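To make the arithmetic behind that 20x figure concrete, here is a minimal C++ sketch of Amdahl’s Law. The function names amdahl_speedup and amdahl_limit are illustrative choices, not part of any library; p is the fraction of run time that can be parallelised and n is the number of processors.

```cpp
#include <cassert>

// Amdahl's Law: speed-up on n processors when a fraction p of the run
// time can be parallelised and the remainder (1 - p) stays sequential.
double amdahl_speedup(double p, double n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1.0);
    return 1.0 / ((1.0 - p) + p / n);
}

// Upper bound as n grows without limit: 1 / (1 - p).
double amdahl_limit(double p) {
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

With p = 0.95, amdahl_limit gives the 20x ceiling quoted above, while amdahl_speedup(0.95, 8) comes out at a little under 6x, which shows how quickly the sequential 5 per cent starts to dominate.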
Start at the chip level
How can we exploit parallelism in smaller configurations? In the HPC world, how can we take advantage of parallelism at the node level? These are some of the issues we face today, says James Reinders, a multicore evangelist at Intel. Five years ago, scientists were more concerned with distributed computing, but today there’s a surge of interest in how to improve performance on a single node. For this purpose, Intel offers a number of multicore programming tools and will soon be introducing even more.
Start with Intel’s Parallel Studio, which is an add-on for Microsoft’s Visual Studio that adds C/C++ parallelism. Tools in the Studio augment the included C++ compiler: with Parallel Composer you can introduce threads, compile and debug code; with Parallel Inspector you find threading and memory errors; Parallel Amplifier helps in tuning code; and Parallel Advisor Lite identifies areas that can benefit from parallelism.
Intel Threading Building Blocks (TBB) provides an alternative to the OpenMP API, which has traditionally been a leader in the HPC area. OpenMP was originally designed for code running on vector supercomputers and for improving performance in loops, but, according to Reinders, it’s starting to show its age. For a more flexible scheme, TBB is a runtime-based parallel programming model for C++ code that has been released as open source. It is said to help programmers exploit multicore hardware without having to be threading experts, because they specify logical parallelism instead of threads. Breaking a program into separate function blocks and assigning a separate thread to each one is a solution that typically doesn’t scale well, whereas TBB provides an abstraction layer for the programmer, allowing logical sequences of operations to be treated as tasks that are allocated to individual cores dynamically by the library’s runtime engine.
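The idea of handing out tasks to cores dynamically, rather than tying one thread to one function block, can be sketched in standard C++. This is not TBB code and makes no claim about how TBB’s runtime is implemented; parallel_for_chunks is a hypothetical helper written for illustration, and it uses a simple shared counter where a production runtime would use something more sophisticated.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Call body(i) for every i in [0, n). The range is cut into chunks
// ("tasks"); worker threads claim the next unclaimed chunk from a shared
// atomic counter, so a core that finishes early simply takes more chunks.
template <typename Body>
void parallel_for_chunks(std::size_t n, std::size_t chunk, Body body) {
    assert(chunk > 0);
    // hardware_concurrency() may return 0 if unknown, so fall back to 1.
    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::atomic<std::size_t> next{0};

    auto worker = [&] {
        for (;;) {
            std::size_t begin = next.fetch_add(chunk);  // claim a chunk
            if (begin >= n) return;                     // nothing left
            std::size_t end = std::min(n, begin + chunk);
            for (std::size_t i = begin; i < end; ++i) body(i);
        }
    };

    std::vector<std::thread> pool;
    for (unsigned w = 1; w < workers; ++w) pool.emplace_back(worker);
    worker();  // the calling thread takes chunks as well
    for (auto& t : pool) t.join();
}
```

The caller describes only the logical work, for example doubling every element of a vector v with parallel_for_chunks(v.size(), 64, [&](std::size_t i) { v[i] *= 2; }), and never says which thread handles which element. The body must be safe to run concurrently for different values of i.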
Fig 1: Intel’s Parallel Amplifier improves parallel performance by reducing the time that threads have to wait on locks
Another tool from Intel is the Thread Checker, something Reinders labels an ‘exciting’ tool because it can directly detect possible causes for a deadlock or a data race condition. Data races are ‘horrible’ to debug, because they’re similar to an intermittent fault in a hardware design, and weeks can be spent searching for a data race condition. They’ve been a major barrier to the creation of good parallel code.
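For readers who have not met a data race, the following C++ sketch shows the classic case of several threads incrementing one shared counter, written in its corrected form. The function name count_with_threads is an illustrative assumption; the comments mark where the race would arise if the counter were an ordinary variable.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Each thread adds per_thread increments to one shared counter.
// If counter were a plain long, the read-modify-write steps of different
// threads could interleave and silently lose updates: a data race, which
// is undefined behaviour in C++ and may only show up intermittently.
// std::atomic makes every increment indivisible, so the total is exact.
long count_with_threads(int threads, long per_thread) {
    assert(threads > 0 && per_thread >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, per_thread] {
            for (long i = 0; i < per_thread; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& th : pool) th.join();
    return counter.load();
}
```

The racy variant often produces the right answer on a lightly loaded machine, which is exactly why such bugs can take weeks to find without a checking tool.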
Two more tools that should become available this year are the result of acquisitions by Intel. The first is Ct technology, which Intel acquired with RapidMind. Ct is forward-scaling; it lets a single-source application work consistently on multiple multicore and manycore processors with different architectures, instruction sets, cache architectures, core counts, and vector widths without requiring developers to rewrite programs. Ct technology is built on the C++ language to provide a simple, portable data-parallel programming API that results in simpler and more maintainable code. Finally, Ct technology is designed to prevent parallel programming bugs such as data races and deadlocks. It guards against these problems by prompting developers to specify computations in terms of composable, deterministic patterns close to the mathematical form of their problem, not in terms of low-level parallel computation mechanisms. Ct then automatically maps the high-level, deterministic specification of the computation onto an efficient implementation, eliminating the risk of race conditions and non-determinism.
Another upcoming product for node-level parallelism is Cilk++, based on technology acquired from Cilk Arts last year. This extension to C++ is designed to provide a simple, well-structured model that makes development, verification and analysis easy. With it, programmers typically don’t need to restructure programs significantly in order to add parallelism. With the Intel Cilk++ SDK, programmers insert three Cilk++ keywords into a serial C++ application to expose parallelism. The resulting Cilk++ source retains the serial semantics of the original C++ code. Consequently, programmers can still develop their software in the familiar serial domain using their existing tools. Serial debugging and regression testing remain unchanged.
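The fork-join style that Cilk++ exposes through its keywords can be approximated in standard C++. The sketch below is not Cilk++ code: it uses std::async as a stand-in for spawning a call and future::get as a stand-in for synchronising, and the names fib_serial and fib_parallel, along with the depth cut-off, are assumptions made for the example.

```cpp
#include <cassert>
#include <future>

// Serial reference version.
long fib_serial(int n) {
    return n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

// Fork-join version. depth limits how many recursion levels fork a new
// task, which keeps the number of threads small (roughly 2^depth at most).
long fib_parallel(int n, int depth = 3) {
    assert(n >= 0);
    if (n < 2) return n;
    if (depth <= 0) return fib_serial(n);
    // "spawn": the first recursive call may run on another thread
    auto left = std::async(std::launch::async, fib_parallel, n - 1, depth - 1);
    // the parent carries on with the second call itself
    long right = fib_parallel(n - 2, depth - 1);
    // "sync": wait for the spawned call before combining the results
    return left.get() + right;
}
```

Deleting the two marked lines of parallel plumbing leaves the ordinary recursive function, which is the sense in which such code keeps its serial semantics and can still be tested against the serial version.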
Similar help is coming from other companies. For instance, FASThread is an interactive compiler add-on integrated into an IDE, currently Visual Studio (with plans for an Eclipse version and also a command-line version). At this point, Visual Studio C/C++ is the only supported compiler, with Intel C/C++ and GNU C/C++ in the works. This tool was developed by Nema Labs, a spin-off from research led by Professor Per Stenstrom at Chalmers University of Technology in Gothenburg, Sweden.
Stenstrom states that the long-term vision of the technology is to shield platform-dependent optimisations from the software developer so that he can focus on software innovation while the FASThread technology unlocks the performance of present and future multicore platforms, whether homogeneous or heterogeneous. The capabilities added are twofold. FASThread first guides the developer to clean up a program from dependences so that it can be automatically parallelised and tested. This includes, for instance, removing all data dependences that could cause data races at run-time. FASThread then parallelises the sequential source code and generates a semantically equivalent parallel version of the program using a particular parallelisation API supported by the compiler for the target system. This parallelised version of the original code is directly fed into the C/C++ compiler for the target system to generate an optimised and parallelised binary.
In tests run on a set of nine applications from various scientific areas, the applications ran two times faster on a quad-core machine after only a 15-minute session with FASThread. For most of them, the developer had to interact with the tool to remove dependences from the code before parallelism could be unlocked. After a total of eight hours of work, it was possible to increase performance two to five times on an eight-core machine.
A decade of experience
‘When it comes to parallel programming, it’s easy to do something that looks right, but it’s difficult to be sure it is right and will do the same thing under all conditions,’ says Andrew Jones, VP of HPC Business for the Numerical Algorithms Group. That company supplies both the NAG Parallel Library and the NAG Library for SMP and multicore. The latter – which is nothing new at all and has been available for a decade – is said to be the largest commercial numerical algorithm library developed to harness the performance gains from the shared memory parallelism of Symmetric Multi-Processors (SMP) and multicore processors. It has more than 1,600 algorithms, with many specifically tuned to run significantly faster on multisocket and multicore systems. The NAG Parallel Library has been specifically developed to enable applications to take advantage of distributed memory parallel computers with ease. The library components hide the message passing details to maximise their modularity.
‘We strongly urge people to use prepackaged routines such as these where other people have done the difficult work of dividing up the tasks in an optimal way,’ says Jones. He adds that simply by linking in the NAG libraries you can get an immediate 2x or 4x speedup, assuming that the computationally intensive parts of a program use the libraries.
In NAG’s support program for HECToR (the UK’s national supercomputing facility), the company is running optimisation projects on a number of codes. One example is CASTEP, a key materials science code. It was enhanced with band-parallelism to allow the code to scale to more than 1,000 cores. Using NAG technology, the speed of CASTEP on a fixed number of cores was improved by four times on the original, representing a potential cost saving of $3m of computing resources over the remainder of the HECToR service.
Text-based languages express parallel code using special notation that creates parallel tasks, but managing these multithreaded applications can be a challenge. The situation is far different with LabView, a graphical programming language from National Instruments. Applications are developed as if drawing a block diagram on paper, and LabView’s dataflow nature means that any time there is a branch in a wire (a parallel sequence on the block diagram), the underlying LabView executable tries to execute in parallel. Automatic multithreading has been natively supported in LabView since 1998, and later versions have refined this process.
Fig 2: LabView’s Real-Time Execution Trace Toolkit
By default, LabView automatically multithreads an application into various tasks that are then load-balanced by the OS across available processor cores. The OS scheduler typically does a good job of this, explains Tristan Jones, technical marketing team leader for National Instruments UK and Ireland. In some applications, though, it might be desirable to assign a task to its own dedicated core. Doing so allows the remaining tasks to share the other processor resources and ensures that nothing interferes with the time-critical process. For this, the Timed Loop and Timed Sequence structures in LabView provide a processor input that allows programmers to manually assign available processors to handle the structure’s execution. In addition, a number of function blocks, such as matrix multiplication, are optimised for parallel operation.
LabView also provides a number of utilities to help programmers maximise performance. Figure 2 shows how the Real-Time Execution Trace Toolkit is being used to perform post analysis on the execution of an application running on a real-time multicore system. ‘CPU 0’ has been selected in the ‘highlight CPU mode’ drop down, which means that the parts of the application which do not execute on CPU 0 are greyed-out, thereby allowing the developer to optimise the execution.
As with any of the methods discussed in this article, it’s difficult to get specific figures about the performance increase a given technique or programming tool brings, because everything is so application dependent. However, Jones points to an application at the Max Planck Institute in Munich, where researchers applied data parallelism to a LabView program that performs plasma control of the ASDEX tokamak. The program performs computationally intensive matrix operations in parallel on eight CPU cores to maintain a 1ms control loop. Lead researcher Louis Giannone notes: ‘In the first design stage of our control application programmed with LabView, we obtained a 20x processing speedup on an octal-core processor machine over a single-core processor while reaching our 1ms control loop rate requirement.’
Another scientific language that has added parallelisation tools is Matlab from The MathWorks. That company’s Parallel Computing Toolbox lets you solve computationally and data-intensive problems using Matlab and Simulink on multicore machines. Parallel-processing constructs such as parallel for loops and code blocks, distributed arrays, parallel numerical algorithms and message-passing functions let you implement task and data-parallel algorithms at a high level without programming for specific hardware and network architectures. Converting serial Matlab applications to parallel apps requires few code modifications and no programming in a low-level language. By annotating code with keywords such as ‘parfor’ (parallel for loops) and ‘spmd’ (single program multiple data) statements, programmers can exploit the task and data parallelism offered by different sections of an algorithm.
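The single-program-multiple-data pattern itself is not tied to Matlab. As a rough C++ analogue, and not a rendering of the toolbox’s API, the sketch below has every worker run the same code on its own slice of the data, identified only by an index. The function name spmd_sum and the variable rank are illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// SPMD-style sum: every worker runs the same code, distinguished only by
// its index rank, and handles its own contiguous slice of the data.
double spmd_sum(const std::vector<double>& data, unsigned workers) {
    assert(workers > 0);
    std::vector<double> partial(workers, 0.0);
    std::vector<std::thread> pool;
    for (unsigned rank = 0; rank < workers; ++rank) {
        pool.emplace_back([&data, &partial, rank, workers] {
            std::size_t begin = data.size() * rank / workers;
            std::size_t end = data.size() * (rank + 1) / workers;
            // Each worker writes only its own slot, so no locking is needed.
            partial[rank] = std::accumulate(data.begin() + begin,
                                            data.begin() + end, 0.0);
        });
    }
    for (auto& t : pool) t.join();
    // Combine the per-worker results on the calling thread.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

A high-level construct such as ‘spmd’ spares the programmer this bookkeeping of slices, threads and the final combining step.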
Down at the OS level
Some ongoing efforts at improving multicore performance might not bear fruit in the immediate future, but hold great long-term promise. Barrelfish, for instance, is a new OS being built from scratch in a collaboration between ETH Zurich and Microsoft Research, Cambridge.
The researchers are exploring how to structure an OS for future multi- and manycore systems. The motivation behind the project is two-fold: first, the rapidly growing number of cores, which leads to scalability challenges; second, the increasing diversity in computer hardware, requiring the OS to manage and exploit heterogeneous hardware resources.
Senior Microsoft researcher Tim Harris explains that Barrelfish involves placing a separate OS kernel on each core, with the kernels communicating through explicit messages over HyperTransport. ‘This is very much a research OS. We’re using it as a laboratory for prototyping new ideas and concepts. We learn what works, then we talk with the product groups who examine how and when to commercialise the technology.’ He adds that there are fewer opportunities for end users to tune algorithms to a particular machine; it must fall to the OS to identify cores for parallel phases and to decide which jobs run on which cores. ‘If we’re successful, we won’t need a special HPC OS; we could handle all workloads.’