
Accelerating the development of HPC


Robert Roe investigates how accelerators are driving the design of new architectures in HPC to solve the challenges of exascale computing

Since their introduction to HPC, accelerators have seen increasing success as more people have developed the knowledge and skills to use them and programmers have adapted their previously serial code to run more efficiently in parallel.

Two of the three next-generation US national laboratory supercomputers, funded through the US Department of Energy (DOE), will make use of Nvidia GPUs tightly integrated within the IBM Power architecture.

In addition, the DOE’s FastForward2 project, funded in conjunction with the US National Nuclear Security Administration (NNSA), aims to develop extreme scale supercomputer technology still further. The project awarded $100 million in contracts to five US companies which included Nvidia, AMD, and Intel.

These programmes, and similar projects in Europe, demonstrate that accelerators are a critical tool for HPC as it approaches the exascale era. Parallelising code and running it on power-efficient accelerators has become a necessity if we are to reach the ambitious power-consumption targets set out by the US and European governments.

Roy Kim, group product manager in Nvidia’s Tesla HPC business unit, stated: ‘It’s an interesting time right now for HPC. A few years ago, Nvidia GPUs were the only accelerators on the market and a lot of customers were getting used to how to think about and how to program GPUs. Now if you talk to most industry experts, they pretty much believe that the path to the future of HPC, the path to exascale, will be using accelerator technology.’

An ecosystem, not just a GPU

A large part of the success of GPUs has revolved around Nvidia and its ability to generate a large skill-base around its GPU programming language, Cuda. Not only did it have a head start due to its dominant position in the consumer GPU market, but it also invested heavily in higher education. This enabled the development of academic programmes such as the Cuda centres of excellence, in addition to GPU centres and other activities to increase adoption of the technology.

The end result of all this activity is that Nvidia managed to generate a software ecosystem based around students and computer scientists who understand Cuda and how to make the best use of Nvidia GPUs, which helped to spur technology adoption.

Kim said: ‘The higher education community really fuels the rest of the industry. Whether it’s HPC or enterprise computing, the next wave of developers comes from higher education and we recognise that. We have these academic programmes as well as a lot of resources invested, to ensure that parallel programming is taught on GPUs.’

Kim stressed that much of Cuda's success stems from making parallel programming easier, so that developers could spend their time optimising applications or creating new code rather than grappling with the concepts of parallel programming.

However, Nvidia is not the only horse in the race today, as companies like AMD and Intel set their sights firmly on the server and HPC markets.

Jean-Christophe Baratault, senior business development manager for HPC GPU computing at AMD, explained that AMD has made significant improvements in energy efficiency and memory bandwidth in recent years. This can be seen in the AMD-powered system at the top of the Green500 in the most recent list, published last year. The Green500 ranks the world's top supercomputers by energy efficiency rather than pure computational power.

APU rather than GPU?

Baratault said: ‘Another benefit is performance: double-precision performance, which is very important for HPC; as is the memory bandwidth, because I would say that eight out of ten applications are memory bandwidth-limited. On top of that, we have very large frame buffers, up to 16 GB, so there is no equivalent today per GPU.’

One thing unique to AMD is its full OpenGL acceleration. Although OpenGL is mainly used for professional 3D graphics rendering, a growing number of HPC applications, for both number crunching and 3D rendering, can make use of this technology.

‘This is unique; it is only with our boards,’ said Baratault. ‘It is not applicable to all of the HPC applications for sure, but I think with this unique functionality we can address some specific workloads.’

One thing that Baratault was keen to stress was AMD’s commitment to HPC, which has been revitalised in recent years. ‘We showed the world last year with the Green500 that we have the most energy-efficient GPU; we have not yet talked about potential future products, but it has been highlighted during the AMD Financial Analyst Day in New York a few weeks ago that, for the new GPU architecture, the focus will be on energy-efficiency.’

At that meeting, the company announced its roadmap for new CPUs and GPUs, but it also announced that a new 64-bit accelerated processing unit (APU) would be coming to commercial laptops this year.

‘The big question is more on how do you save energy? Will the PCIe bus be a solution to address the challenges of exaflop computing – because you have to send the data, and we are talking about terabytes of information?’ said Baratault.

‘We think that the future is based on system-on-chip like the APU that AMD has,’ he said.

Baratault went on to explain that the first APUs were designed strictly for the consumer gaming market, but the newly announced Carrizo is the first 64-bit x86 APU, due for release in consumer laptops later this year.

Baratault stated: ‘The reason why it is going to be interesting is because you will be able to take your existing code in OpenCL and start preparing your code using this Carrizo APU as the test vehicle. So that, whether you are an academic or an ISV, you can be prepared for what I think is going to be the big revolution at AMD that we announced during our Financial Analyst Day, a multi-teraflops HPC APU.’

Open frameworks spur technology adoption

All manufacturers of processors, not just accelerators, are facing similar challenges. As the industry meets the hard limits of materials science, it cannot continue to shrink semiconductor technology at the same rate as before. Companies have thus been forced to look for more innovative solutions.

Rajeeb Hazra, vice president of Intel’s architecture group and general manager of technical computing, said: ‘We are starting to plateau, not as a company with a product but as an industry, on how quickly we can build performance with the old techniques, by just increasing the frequency.’

The steady increases in clock speed over previous generations of CPUs, which gave software developers a free performance boost simply by running their software on the latest chips, are now coming to an end. To continue to increase performance, processor manufacturers must exploit parallelism within application code.

Hazra stressed that the key to increasing performance is first to clearly identify which sections of code can be parallelised and which cannot. Intel is developing a suite of products that can address both the highly parallel code with its Xeon Phi, and the serial code with its Xeon CPUs.

Hazra went on to give an example of ray tracing used in graphics rendering and visualisation: ‘Each ray does not rely on information from another, so each can be computed in parallel,’ he said.
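The independence Hazra describes is what makes ray tracing 'embarrassingly parallel'. A minimal sketch in Python (the shading function here is a hypothetical stand-in, not Intel's code) shows the key property: because each ray depends only on its own inputs, serial and concurrent evaluation give identical results.

```python
from concurrent.futures import ThreadPoolExecutor

def shade_ray(pixel):
    # Hypothetical stand-in for real shading work: the result depends only
    # on this pixel's own ray, never on a neighbouring ray.
    x, y = pixel
    return (x * 31 + y * 17) % 256

pixels = [(x, y) for y in range(4) for x in range(4)]

# Because rays are independent, they can be computed in any order, on any core.
serial = [shade_ray(p) for p in pixels]
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(shade_ray, pixels))
assert serial == parallel
```

In a real renderer the workers would run on separate cores or accelerator lanes; the point is that no synchronisation between rays is ever needed.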

Hazra said: ‘Amdahl said the world is not all highly parallel or highly serial, there is a spectrum and the amount of performance gain you can get through parallelism is gated by the proportion of how much is parallel and how much is serial in an application.’
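Amdahl's observation can be written down directly: if a fraction p of a program can be parallelised, the speedup on n processors is 1 / ((1 - p) + p/n). A short illustrative calculation:

```python
def amdahl_speedup(p, n):
    """Speedup when a fraction p of the work is spread over n processors.

    The serial fraction (1 - p) gates the achievable gain, however
    large n becomes.
    """
    return 1.0 / ((1.0 - p) + p / n)

# A 95%-parallel code on 10,000 processors still only reaches about 20x,
# because the 5% serial portion dominates.
print(round(amdahl_speedup(0.95, 10_000), 1))  # 20.0
```

This is exactly the spectrum Hazra refers to: the more serial an application, the less an accelerator can do for it.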

Hazra concluded by highlighting that the years of work Intel has put into developing its Xeon CPUs directly address the serial portion of code; the company has now developed its many-core architecture to address the more parallel sections.

‘What we have done for years with Xeon, is provide a fantastic engine for the serial and the moderately parallel workloads. What we have done with Xeon Phi is extend that surface to the very highly parallel workloads.’

Hazra said: ‘There are many applications that are highly parallel and there are some that cannot easily be parallelised. So what we need is a family of solutions that covers this entire spectrum of applications, and this is what we call Amdahl spectrum.’

The need for a skill base

However capable a technology, if it is not adopted by the broadest spectrum of users it will fail to gain traction in a fiercely competitive market. This is made harder because, with new processor technology, applications must be rewritten and a skill base needs to develop around the architecture, which takes time.

While Nvidia has a head start with its own programming language and tools, AMD and Intel have decided to back open programming frameworks. As Baratault explained, this gives application developers some freedom, as they are not necessarily locked in to one hardware platform. But the overwhelming reason is that it speeds up adoption of the new technology: the hardware manufacturers tap into an already established base of skilled programmers, along with the code and optimisation techniques they have developed. It also means that new developers face a less steep learning curve as they adapt to the new technology.

AMD has opted for OpenCL, an open framework designed to support many different heterogeneous computing platforms, from GPUs to FPGAs and DSPs. Baratault explained that he thought the adoption of OpenCL would be a big benefit to AMD.

‘Code portability is a key topic; it means that a user is not tied down to one vendor; it is also the best way to leverage programmers’ expertise,’ said Baratault.

‘We see more than 1,000 ongoing OpenCL projects; it is growing because it is an ecosystem supported by various hardware vendors that have seen the benefits in shooting for such a programming framework.’

Intel has perhaps the easiest job in this regard, as it is extending languages, models, and development tools from the Xeon CPU family across to the Xeon Phi. Hazra said: ‘There is a lot of legacy code; there is a lot of knowledge about how to write such code that has been built up in the software industry; and we did not want to lose that. It would be akin to burning your entire library and starting anew.’

Hazra also explained that because programmers were using similar techniques, such as threading and vectorisation to parallelise code, investigating speedup on Xeon Phi would never be a wasted exercise because the improvements would still impact CPU performance.

Hazra said: ‘That is a huge economic gain for companies that have huge applications and do not have the money to throw away on months of work’. He also stressed that this meant programmers could use Xeon CPU as a development environment for Xeon Phi code using AVX instructions for example.

The question of PCIe

The question raised by Jean-Christophe Baratault of AMD about the long-term future of the PCIe bus has troubled the HPC community for as long as it has been contemplating the challenges of exascale computing. As applications increase in scale and complexity, there is an ever-increasing need to drive more data to the processors, whether CPU, GPU or accelerator. But data movement has its own energy cost. This has led accelerator manufacturers to come up with new strategies to integrate accelerators more efficiently into the compute architecture – in essence moving the accelerators closer to the data, or at least giving them more access to it.

Nvidia was the first to announce its own version of this technology, called NVLink, a fast interconnect between the CPU and GPUs that allows them to move data more efficiently.

Kim said: ‘NVLink is going to give you five to 12 times more bandwidth than what is available today through PCI express, and so that again gives you the ability to move lots of data to where the computing engines are.’
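To put those figures in perspective, here is a back-of-the-envelope comparison. The bandwidth numbers are illustrative assumptions, not vendor specifications: roughly 16 GB/s for a PCIe 3.0 x16 link, and 80 GB/s at the low end of the range Kim quotes for NVLink.

```python
# Time to move 1 TB of data across each link, in seconds.
DATA_GB = 1_000
PCIE_GB_PER_S = 16     # assumed PCIe 3.0 x16 effective bandwidth
NVLINK_GB_PER_S = 80   # assumed NVLink bandwidth (5x PCIe, per Kim's figures)

pcie_s = DATA_GB / PCIE_GB_PER_S
nvlink_s = DATA_GB / NVLINK_GB_PER_S
print(f"PCIe: {pcie_s:.1f} s, NVLink: {nvlink_s:.1f} s")  # PCIe: 62.5 s, NVLink: 12.5 s
```

At terabyte scale, the minute-versus-seconds gap per transfer illustrates why feeding the accelerators, not raw flops, is the bottleneck the interconnect vendors are chasing.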

He went on to explain that the next generation of IBM Power processors would also integrate with NVLink, which was a big part of the IBM bid for the CORAL project.

Kim said: ‘The next generation of IBM’s Power processors are going to integrate with the technology in their CPU, and so if a customer deploys a Power system with GPUs in the next-gen set of systems you will get a high-performance interconnect between CPU and GPU.’

Intel has also been working on its own interconnect technology. It acquired Cray’s interconnect business in 2012. So far, very little is known about the new interconnect, other than that it will be called Omni-Path, and that it will be optimised for HPC deployments.

Intel’s Raj Hazra explained that Intel would be announcing more on this technology at ISC this year, and he explained some of the rationale behind investing in this technology.

Hazra said: ‘What we are working on is Omni-Path, an interconnect tuned for HPC, and integrating that with our processor itself. Interconnects are extremely important in hyperscale systems; they take traffic back and forth, and that is how you get aggregate compute or aggregated parallelism.’

He explained that this provides benefits including increased energy-efficiency, higher memory bandwidth, and increased compute density due to the more tightly integrated components. He pointed out that this enables a much more efficient use of resources.

Hazra said: ‘Today the PCI Express card has to have its own memory, but when you integrate that into the CPU then it can use system memory, and it is much more efficient to schedule traffic. This means that you can architecturally innovate with many more degrees of freedom when you are closer to the CPU than when you are in just an I/O device.’

Hazra concluded: ‘We believe it is a real game changer integrating a high performance interconnect fabric in with the CPU and we can do this because we have the ability based on our Moore’s Law advantage.’

Never before has the HPC community had such a choice in the type of processors and different technology platforms available on this scale. It is impossible to tell which ones will ultimately see the most widespread adoption – indeed the future may be one where several different architectures co-exist – but this competition helps to generate the innovation needed to reach exascale.

About the author

Robert Roe is a technical writer for Scientific Computing World and Fibre Systems.

You can contact him at robert.roe@europascience.com or on +44 (0) 1223 275 464.

Find us on Twitter at @SCWmagazine, @FibreSystemsMag and @ESRobertRoe.