The processing challenge
Mainstream processors are designed to be general-purpose all-rounders, and the number-crunching needs of scientists and engineers have often gone beyond what standard systems can deliver. Over the years, many special-purpose platforms have been used to provide the required compute cycles. Some have been entire systems, such as Cray parallel vector supercomputers, which delivered very high performance at a very high price. An alternative has been to boost the compute capability of a standard system by adding a coprocessor or accelerator.
In the 1980s, array processors were the size of a fridge, cost tens of thousands of dollars and delivered a mighty 12 megaflops, providing a low-cost route to boosting the floating point performance of the minicomputers of the day. Along the way, other devices have found limited traction, including Intel’s i860 processor, Digital Signal Processors (DSPs) and Field Programmable Gate Arrays (FPGAs). This philosophy holds true today, but the format, performance and cost have changed radically. Today’s devices are PCIe cards the size of a book, offer a peak performance in excess of one teraflop, and cost only a few thousand dollars – note that the first system to deliver one teraflop was built by Intel’s now-defunct Supercomputing Systems Division in 1996, cost $55 million and filled 76 cabinets with 9,072 Pentium Pro processors.
The problem today is further complicated by the high power consumption of electrical components. The peak performance of the fastest supercomputer is expected to advance from one petaflop to one exaflop during this decade, but the power consumed by the system must be constrained if system operation is to be affordable. An improvement in compute power delivered per watt consumed of around a factor of 100 is required if exascale systems are to be feasible. One approach to delivering more compute power per watt is to use a very large number of relatively low-performance, low-power-consuming processors that can deliver better aggregate performance per watt than a small number of high-performance, high-power-consumption processors.
The most widely used compute accelerator today is Nvidia’s family of GPGPUs (General Purpose Graphics Processing Units). The company has recently launched a new family of GPUs, the Tesla K20 and high-end K20X. Intel has also joined the battle with the launch of its Xeon Phi family. Though there are other options, the vast majority of compute accelerators sold during 2013 will include Nvidia K20/K20X or Intel Xeon Phi components.
The market opportunity for K20 and Phi is more than just high-end supercomputers, also covering departmental systems and HPC workstations. The drive for their adoption is the need for more compute performance while consuming less power. The barrier to much wider adoption is software – both the complexity of programming these devices, and the lack of availability of a broad portfolio of applications. At the very high end there are more people with the right skills, and people willing to put up with programming pain – while in the mid-range and on the desktop, people just want to get their job done. They don’t care how many cores a system has or what the underlying architecture is; they just want it to work – and fast.
The big fight in 2013
In the blue corner, weighing in at 1.011 teraflops and boasting 60 Pentium-derived cores, each with a 512-bit-wide SIMD unit, is the Intel Xeon Phi 5110P, whose father, Xeon, powers many of today’s supercomputers. While in the green corner, weighing in at 1.31 teraflops and powered by 2,688 single precision and 896 double precision cores is Nvidia’s Tesla K20X, the next generation of the most popular accelerator used in supercomputers today.
The table below shows the technical details, but does not, perhaps, tell the whole story, which will be explored in eight gruelling rounds.
Note that the Intel Xeon Phi performance figures reported are for the pre-production SE10P that uses 61 cores instead of 60, and a slightly higher clock speed than the production Xeon Phi 5110P, so the equivalent figures for the 5110P will be marginally lower.
Round one: peak double precision (DP) performance
If you look at the specifications and do the maths, you see that these devices have similar peak performance but achieve it in very different ways. The K20X edges this round, winning it 10-9.
Round two: peak single precision (SP) performance
SP performance is also important as not all calculations require the accuracy that DP offers – and both of these devices offer higher SP performance than DP. Another benefit of using SP is that the size of data is halved, so the effective memory and memory bandwidth are doubled. When a compute problem has been optimised it often becomes a data management problem – and if SP is used, you have only half as much data to manage. Many application areas can use SP calculations, at least partially, including bioinformatics, electronic design automation (except for SPICE – Simulation Program with Integrated Circuit Emphasis), seismic analysis, defence and weather forecasting. Many supercomputer applications today use mixed precision to derive the benefits of SP where possible, but also to maintain the accuracy of their results. The K20X wins this round comfortably, but without a knock-down, so once again the score is 10-9.
Round three: memory and bandwidth
As was noted above, data management is crucial to supercomputer applications, so the greater memory size and memory bandwidth of the 5110P edges this round for Intel, 10-9.
Summary of rounds one to three
Great theoretical performance is all very well, but that needs to be translated into real performance, and many applications rely on tuned mathematical libraries. DGEMM and SGEMM are double and single precision versions of general matrix multiply functions. The STREAM benchmark is a synthetic benchmark program that measures sustainable memory bandwidth and the corresponding computation rate for simple vector kernels.
It is no surprise that the K20X wins on the matrix multiply tests, but the greater memory bandwidth of the 5110P failed to deliver a winning STREAM score – so perhaps Intel is fortunate to be only marginally behind after the first three rounds.
Round four: power consumption
The power consumption of these two devices is close (Intel 225 watts, Nvidia 235 watts), so this round is a draw, 10-10. Indeed, it is likely that the design targets of both devices were to deliver one teraflop of peak performance inside a 250 watt power envelope.
Round five: price
The RRP of the 5110P is $2,649, while Nvidia says pricing for the K20 family is up to its channel, though it expects prices in the range of $4,000 to $4,500 – so Intel wins this round 10-9.
However, it is not as simple as that. Nvidia’s price was set in a market where it had little effective competition, and there is a mass of Cuda software already developed that runs on the device. As the incumbent vendor with a mature product, Nvidia is able to charge a premium. The company also leverages its GPU technology in commodity graphics cards, so it has the volume to support more competitive prices if its position in HPC is threatened by the Xeon Phi. It is anticipated that Nvidia will drop its prices as Xeon Phi grows market share.
It is also worth noting that the price of high-end supercomputers is often very flexible. Many sales that achieve a prominent place in the Top500 or Green500 lists are seen as being strategic by the vendors, so they will bend over backwards to win them.
Round six: programming approaches
This is perhaps the most important round – as there is no point in having a very powerful computer if no programs are available for it – but it is also the most complex one to call.
Nvidia GPUs are generally programmed using Cuda, which provides extensions to the standard C, C++ and Fortran languages to support the programming of an accelerator that exploits a high degree of parallelism. Cuda was introduced in 2006 and is attracting 2,000 downloads a day, so is a pervasive but proprietary approach, although it could be claimed that Cuda is a de facto standard.
Intel has a strong family of software development tools used to program its mainstream Xeon product line. These tools can also be used to quickly build applications for the Xeon Phi. The big question is how efficiently most applications will run on the Xeon Phi without being re-architected for an accelerator model. The answer is that a few applications run very efficiently with relatively little optimisation work, but efficient execution for most applications on the Phi requires significant effort – a similar amount of effort to that required to achieve efficient execution on a GPU. If retargeting applications at the Xeon Phi using Intel’s tools were easy, there would be a large portfolio of Phi-optimised applications out there – but this is not the case, which suggests that it is not quite as easy as Intel has been suggesting.
So the key issues for this round appear to be the de facto standard of Cuda for Nvidia, and the widely used Intel tools that support the OpenMP standard and have been retargeted for the Phi. Listening to experienced HPC industry professionals, it appears to be almost a religious debate rather than a reasoned technical discussion. Time will tell.
Intel makes a big thing of describing the Xeon Phi as a coprocessor, not an accelerator. But both K20X and Phi feature as PCIe cards in systems that use standard x86 processors, and most developers will take existing applications and offload the computationally intensive parts to the accelerator/coprocessor, so it may be a moot point.
The Xeon Phi has three modes of operation – only two of which are available on Nvidia GPUs. The common approach is offload: the main program runs on a normal processor and offloads computationally intensive portions of code and related data to the accelerator. This is the model that has been used since the days of array processors. The alternative modes that Intel offers are symmetric (where the workload is shared between the normal processor and what Intel calls the coprocessor) and many-core only (where the whole program runs on the Phi). The symmetric approach can also be used with GPUs, and the many-core only approach may be of limited use because of the amount of memory available on the coprocessor.
Many HPC applications spend much of their time executing library functions, so the availability of highly-tuned mathematical libraries is very important. The K20X has broad library support, but the functions available for the Phi in offload mode are, as yet, limited, although this will surely change during 2013.
Nvidia has also been working with the HPC industry to provide a standard approach for programming accelerators. This has resulted in the OpenACC initiative, which uses compiler directives that specify loops and regions of code in standard C, C++ and Fortran to be offloaded from a host CPU to an accelerator, providing portability across operating systems, host CPUs and accelerators. OpenACC is supported by supercomputer vendor Cray and compiler companies CAPS and PGI as well as Nvidia, and is being discussed for incorporation into a future version of the mainstream OpenMP standard.
There are arguments in favour of Nvidia’s Cuda/OpenACC approach, and Intel’s focus on its standards supporting tool chain. But users don’t care about the details of such debates – they want tools that work and are easy to use. When OpenACC becomes part of OpenMP and users can build a single version of an application that can (with minimal tuning) run effectively on both platforms, then the industry will have achieved something worthwhile. For now, it’s 10-10.
Round seven: applications
Many popular HPC applications have already been tuned for Nvidia GPUs, and while similar work is underway for the Xeon Phi, it is some way behind, so Nvidia wins this round 10-9. Key application areas are computational chemistry, materials science, climate and weather, physics and CAE. Key applications that are already accelerated on Nvidia GPUs include Amber, Charmm, Gromacs, Lammps, Namd, DL_POLY, QMCPack, Quantum Espresso, Chroma, Denovo, Ansys Mechanical, MSC Nastran and Simulia Abaqus. Intel is working with Accelrys, Altair, Ansys, CD-Adapco and MSC Software, among others. This work is important, but Intel is playing catch-up.
Round eight: the Top500 list
The Top500 list describes the 500 fastest computers in the world, and is published twice a year at the US and European supercomputing conferences, SC and ISC respectively. Three years ago only seven of the Top500 systems used accelerators – either IBM PowerXCell 8i or Clearspeed CSX600. Since then, the number of accelerated systems has grown to 62. This rapid growth has been driven by Nvidia, which supports 50 of these 62 systems, while there are seven pre-production Xeon Phi systems in the latest list. Of the top 100 systems, Xeon Phi features in four systems while Nvidia GPUs support 13 systems – including the fastest computer in the world, the Cray XK7 Titan system at Oak Ridge National Laboratory, which uses the K20X.
An alternative metric is the Green500 list that ranks the most energy-efficient supercomputers. The upper reaches of that list are dominated by IBM BlueGene/Q systems, but top ranked is the National Institute for Computational Sciences’ Beacon system at the University of Tennessee that uses Xeon Phi. Third on the list is the Titan system – with an AMD GPU-accelerated entry at number two.
So Nvidia lands some piercing blows with its strong position on the Top500 list, including the top spot, while Intel counters with several new entries for pre-production versions of Xeon Phi, including top position on the Green500 list. Nvidia wins this round 10-9.
There is little doubt that Intel’s participation in this market segment will validate the approach and help grow the market for Intel, Nvidia and others. And although Nvidia may lose market share to Intel through 2013, the potential market will grow more quickly than Nvidia loses share – so both companies will have a good year.
What does the fight scorecard look like?
Both of the combatants have won important rounds, but neither has yet landed a knock-out blow. Nvidia is the incumbent and there is a large body of GPU software available, while Intel is, well, Intel. Its argument for using the same software platform as mainstream Xeon processors will be persuasive to many, even if it is not entirely convincing. Bottom line – who will the winner be? In 2013, both will be winners. Intel will take market share from Nvidia, but the overall accelerator market will grow, so a year from now we will see more Nvidia GPU and more Intel Xeon Phi systems in the Top500 list. A rematch in a year’s time should prove even more interesting.
With more than 30 years’ experience in the IT industry, initially writing compilers and development tools for HPC platforms, John Barr is an independent HPC industry analyst specialising in the technology transitions towards exascale.