From CPU to GPU
The kids’ bedroom may be a far cry from the science laboratory, but technology that once painted realistic 3D graphics for computer games and animated films could be the answer to scientists’ prayers for greater processing power at reduced cost. Organisations as varied as Nvidia, the Barcelona Supercomputing Centre and even Intel are seriously investigating the potential advantages of deploying graphics processing units (GPUs) for complex simulations and data analysis.
If you look beneath the surface, the transition of GPU technology from special effects to scientific research is not so surprising. Both scientists and computer graphics designers need to process large volumes of data, very quickly, while modelling sets of many interrelated equations to produce results that are as realistic and reliable as possible. The fluid dynamics equations used to model the swirl of dust as a racing car speeds away into the distance are not so far away from the fluid dynamics equations used to model air flow in the atmosphere when predicting climate change.
There is no question that scientists are demanding the ability to process larger volumes of data in shorter periods of time. Advances in laboratory techniques now mean experts are presented with more data than their single- or dual-core desktop PCs know what to do with, and mathematical simulations must run in more detail than ever before.
‘Everyone’s compromising the problems they can solve today due to limitations in the computational power,’ says Ryan Schneider, CTO of Acceleware, which produces high-performance computing solutions for scientific applications using Nvidia’s graphics processing units. Some experts still question whether current GPUs are suitable for widespread application – there are potential issues with both the way they are programmed, and the way they store data – but if they do fulfil the potential they have shown in areas such as medical image processing and seismic data analysis, GPUs could be a ready-made solution to these problems. Computing experts in the past had hoped that this acceleration could be achieved by spreading the load over clusters of many multi-core CPUs, but GPU advocates believe that the majority of this computing will be performed on GPUs in the future, with a few CPU cores acting as managers to control the majority of the intensive work being farmed out on the GPUs.
The GPU is unlikely to make the CPU redundant: the latter has a greater memory capability and is more efficient at orchestrating and managing the different jobs being performed by the computer, and at making decisions on how best to handle a problem. Instead, GPUs will act as highly efficient processing farms, to which the CPU offloads data so that repetitive calculations can be performed at a very fast rate. The processed data would then return to the CPU, which would use the results to form a solution to the problem.
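The division of labour described above can be sketched in a few lines of Python. This is only an illustration of the pattern, not real GPU code: a thread pool stands in for the GPU, and the names `repetitive_kernel` and `offload`, along with the chunk size and the calculation itself, are hypothetical choices made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def repetitive_kernel(chunk):
    """The simple, repetitive calculation that is farmed out to each worker."""
    return [x * x + 1.0 for x in chunk]


def offload(data, workers=4, chunk_size=1000):
    """Manager role: split the data, farm out the chunks, gather the results."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # The thread pool is a stand-in for the GPU (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        processed = list(pool.map(repetitive_kernel, chunks))  # input order is kept
    # Back with the manager: stitch the partial results into one list.
    return [y for chunk in processed for y in chunk]


data = [float(i) for i in range(10_000)]
result = offload(data)
solution = sum(result)  # the manager uses the processed data to form the answer
```

The manager code never performs the heavy calculation itself; it only decides how to divide the work and what to do with the results, which is the role the article assigns to the CPU.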
The GPU’s power comes from the sheer number of processing cores on each chip. Acceleware currently uses GPUs with 128 cores, but Ulrich Duvent, a software developer for GPU-Tech in France, says that some GPU chips can run up to 320 computations at the same time. In addition, computers may be built with more than one GPU chip, allowing hundreds, and possibly thousands, of computations to run at the same time. Due to their established commercial use, GPUs are also readily available and relatively inexpensive compared to the number of multi-core CPUs you would need to achieve a similar number of cores.
‘GPUs are effective at performing rapid calculations due to their pipeline, which allows up to 320 computations at the same time on a single GPU. The CPU can do parallel computations, but this is limited by the number of cores available,’ explains Duvent. ‘In addition to GPUs containing more processing cores than CPUs, their memory is faster due to a larger bus width.’ This means they can transfer information to and from their memory more quickly than CPUs – a process that otherwise could create a bottleneck for some applications. ‘The frequency (GHz) of CPUs is at a phase where it’s not evolving much, whereas the GPU frequency is currently rapidly increasing,’ he adds, suggesting that GPUs will be able to process data at even faster rates in the future. These benefits are leading many scientists to choose GPUs over clusters of multi-core CPUs.
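Duvent’s point about bus width can be made concrete with the standard rule of thumb that peak memory bandwidth is the number of bytes moved per transfer multiplied by the number of transfers per second. The sketch below applies that rule; the bus widths and transfer rate are hypothetical figures chosen only for illustration and are not taken from any product mentioned in this article.

```python
def peak_bandwidth_gb_per_s(bus_width_bits, transfers_per_second):
    """Peak theoretical bandwidth: bytes per transfer times transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_per_second / 1e9


# Hypothetical figures for illustration only (not from any datasheet).
TRANSFER_RATE = 1.0e9  # one billion transfers per second on both buses

narrow_bus = peak_bandwidth_gb_per_s(64, TRANSFER_RATE)   # 8.0 GB/s
wide_bus = peak_bandwidth_gb_per_s(384, TRANSFER_RATE)    # 48.0 GB/s
ratio = wide_bus / narrow_bus                             # 6.0
```

At the same transfer rate, a bus six times as wide can in principle move six times as much data per second. Real-world throughput is usually lower than this theoretical peak.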
Acceleware offers solutions for three key markets at the moment: electromagnetic simulations (of the performance of mobile phone antennas, for example), seismic data processing for oil and gas exploration, and biomedical imaging. The company hopes to add new application areas to its repertoire in the near future. ‘The techniques are very similar, so we can move into additional markets very easily,’ stated Rob Miller, vice president of product marketing and development at Acceleware. Within the realm of biomedical imaging, Acceleware’s combination of hardware and software is used to compile the data collected from CT scanners and MRI imagers into a meaningful image that could be analysed to determine the movement of a drug in the human body, and its efficacy at the targeted location, during drug trials.
Miller claims that cost-equivalent CPU clusters would have taken 1.5 hours per subject to reconstruct these images, whereas the Acceleware system could transform the data within just three or four minutes. ‘Instead of waiting until next week for the results, it’s down to a few hours,’ he explains. There are many similar results in all three application areas, he says, including examples of mobile phone manufacturers dropping their simulation times from 10 hours to just 15 minutes. In these cases, the advantage isn’t just the speed of the simulations; the additional power also allows the manufacturers to analyse important parameters involved in the design of the equipment, such as the interaction of the electromagnetic waves with the surrounding environment, that otherwise would have taken too long using traditional technologies.
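Converting Miller’s figures into speed-up factors shows the scale of the claimed gains. The short calculation below uses only the times quoted above; the helper name `speedup` is introduced here purely for illustration.

```python
def speedup(before_minutes, after_minutes):
    """How many times faster the new run is than the old one."""
    return before_minutes / after_minutes


# Image reconstruction: 1.5 hours (90 minutes) down to three or four minutes.
imaging_low = speedup(90, 4)    # 22.5
imaging_high = speedup(90, 3)   # 30.0

# Mobile phone simulation: 10 hours (600 minutes) down to 15 minutes.
antenna = speedup(600, 15)      # 40.0
```

On these numbers the claimed gains lie between roughly 20 and 40 times, which is what turns a result expected next week into one available within hours.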
Not everyone, however, is convinced.
James Reinders, the director of marketing for Intel’s software development division, believes that behind the hype of GPU processing there are fewer successful applications than would be expected: ‘GPUs offer high peak processing powers, but so far they’ve had difficulty producing many real-world applications,’ he says. ‘The opportunity to offload some work is there, but we are still concerned about the power consumption, memory bandwidth and the limits of structure-level parallelism. GPUs don’t fundamentally change these equations.’
[Image caption: CST used Acceleware’s GPU-powered technology to accurately model the electromagnetic signals given off by phones.]
Reinders believes the core factor that limits GPU efficacy is the difficulty of writing code to run on both GPUs and CPUs, which these new high-performance hybrid systems would require. At the most fundamental level, GPUs and CPUs use different instruction sets – the basic machine code that underlies all the operations performed on the computer. This makes it difficult to write code that can be translated to both instruction sets.
In some cases, the CPUs and GPUs even store numbers in a different format – either to the single- or double-precision floating point standard – meaning it is difficult to transfer data between the two chips without inaccuracies creeping in. These errors may not make a huge difference for small graphical simulations, but they could be intolerable for precise scientific calculations.
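The effect of moving between the two formats can be reproduced on any desktop machine. The sketch below is an illustration rather than GPU code: it uses Python’s standard `struct` module to round ordinary double-precision numbers to single precision, then adds 0.1 one million times in each format. The helper names `to_single` and `sum_in_single` are hypothetical.

```python
import struct


def to_single(x):
    """Round a double-precision float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_in_single(values):
    """Accumulate a sum, rounding to single precision after every addition."""
    total = 0.0
    for v in values:
        total = to_single(total + to_single(v))
    return total


values = [0.1] * 1_000_000           # the exact answer would be 100,000
double_total = sum(values)           # ordinary double-precision arithmetic
single_total = sum_in_single(values)

double_error = abs(double_total - 100_000.0)
single_error = abs(single_total - 100_000.0)
conversion_error = abs(to_single(0.1) - 0.1)  # lost in a single conversion
```

A single conversion loses only a tiny amount, but the rounding repeats at every step of a long calculation. In this run the single-precision total drifts visibly away from 100,000, while the double-precision total stays within a small fraction of it.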
Another fundamental difference that complicates the programming is their architecture. CPUs and GPUs store and transfer data to memory and cache in different ways. The code for a program would need to account for this, but again, it is difficult to produce code that can easily move the data backwards and forwards between the different processors when this is the case.
‘With multi-core processing, each core has the same instruction set. It is very easy to produce code, without worrying where the data ends up if the same code runs the same everywhere,’ says Reinders. ‘Dealing with different components that have different capabilities is not something that compilers can traditionally target. If you’re writing a simple program it’s pretty easy to fetch memory from the main CPU, send it to the GPU, and then fetch it back again. But when you have to keep this up, it’s very difficult to program and to debug.’
The bad news is, most processing chips used to accelerate computations suffer from the same limitations. Many computer scientists had hoped that FPGA boards, or cell processors, could also be used, in addition to GPUs, to accelerate different tasks, but Reinders believes these too would suffer the same problem.
It’s debatable whether these issues are fundamental to the hardware itself, or whether they can be solved with innovative software solutions that deal with the complications of communicating with the different types of processors directly, ironing out these difficulties for the user. These solutions would allow the user to program in a more natural fashion. In the past, many scientists were daunted by the idea of programming on more than one CPU core, but innovative solutions from the likes of Interactive Supercomputing and The MathWorks have now eased this process, and it’s possible the same will be true for heterogeneous architectures using GPUs and CPUs.
Both GPU-Tech and Acceleware are taking steps towards this. Acceleware provides combined software and hardware solutions that have taken the difficulties out of the initial implementation, and GPU-Tech provides software libraries that already contain the code for the challenging steps in the programming.
[Image caption: Acceleware provides both hardware and software solutions to ease the use of GPUs for computing tasks.]
‘It can be difficult if you have no knowledge of graphics programming,’ concedes GPU-Tech’s Duvent, ‘as it requires that the data be effectively managed and that one has a knowledge of the graphics card architecture and addressing modes for effective programming. This is done using low-level assembly languages or via a graphics API like DirectX or OpenGL.’
‘Our Ecolib Libraries and API help with this by providing access to GPU computing without necessarily having to learn the low level programming languages available. The API allows developers to address the GPU in standard C++ code using the operators available and takes care of the data management on the GPU. More experienced developers can also use scripts to program but they need a bit of GPU programming knowledge like how to organise and access data to optimise performance.’
Some of the problems, however, are unquestionably due to the difficulties of housing consumer hardware designed for the desktop PC – which may be bulky and energy inefficient – in the streamlined servers used for high-throughput scientific and financial calculations. Hardware for these servers must also be reliable and capable of correcting, or at least acknowledging, errors when they occur – a capability that would have been pointless in commercial GPUs used for computer games, according to Simon McIntosh-Smith, VP of customer applications at ClearSpeed. To solve these problems ClearSpeed has developed its own accelerator chip – in some ways similar to a GPU, but designed for professional clusters rather than desktop PCs. The chip includes error-correcting memory, which can detect and often correct memory corruption. The chip also has more cores than a GPU, but it runs them at a slower rate, which reduces the heat output of the device.
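How memory can detect and repair its own corruption is easiest to see in a small example. The sketch below uses the classic Hamming(7,4) code, which protects four data bits with three parity bits and can correct any single flipped bit. It is a textbook scheme chosen for illustration and should not be read as a description of ClearSpeed’s design; the function names are hypothetical.

```python
def hamming_encode(data_bits):
    """Encode 4 data bits as a 7-bit word laid out as p1 p2 d1 p4 d2 d3 d4."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming_correct(word):
    """Return (data_bits, position): position is the repaired bit, 0 if none."""
    c = list(word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # 1-based index of the corrupted bit
    if position:
        c[position - 1] ^= 1         # flip the corrupted bit back
    return [c[2], c[4], c[5], c[6]], position


original = [1, 0, 1, 1]
stored = hamming_encode(original)

corrupted = list(stored)
corrupted[4] ^= 1                    # simulate one bit flipping in memory

recovered, repaired_at = hamming_correct(corrupted)
```

Here `recovered` matches `original` and `repaired_at` is 5, the position of the flipped bit. A code of this size can only repair one flipped bit per word, which is consistent with the observation above that corruption can often, but not always, be corrected.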
Intel, too, has decided to solve these problems by building a new accelerator chip from scratch. The company is working on a new type of chip – a hybrid that combines elements of both GPUs and CPUs – to solve the problems with both the hardware and the software.
‘The first thing we’re doing is rejecting the idea that we should introduce something difficult to program into people’s computers – the solutions just don’t tend to catch on in the long term,’ says Reinders. ‘Whatever we introduce, we want programming models that are easy to use and that preserve people’s existing investments in software.’ The first result of this work will be seen in the Larrabee multi-core chip, to be unveiled at some point in 2009.
Whether traditional GPUs catch on in the meantime remains to be seen. The Barcelona Supercomputing Centre (BSC) – home to the most powerful supercomputer in Europe at the time of writing – is currently investigating a heterogeneous approach, using GPUs, multi-core CPUs, cell processors and FPGAs, for its next upgrade in three years’ time. If the BSC does integrate GPUs into its architecture, it could be an important proof of concept that may convince smaller outfits to follow.
Ultimately, it’s likely that processing accelerators, in whatever form, will play an important role in tomorrow’s supercomputing. ‘It feels like they will become ubiquitous,’ predicts Simon McIntosh-Smith, of ClearSpeed. ‘There will be room for all the different types of accelerators.’