HPC: predictions for 2014


John Barr looks into his crystal ball to see what is in store for HPC in the course of this year

A year ago, this annual look at trending topics and technologies in HPC concluded that the most important issue was applications – and the tools, standards, and skills required to build them. That didn’t change during 2013 and, with many technology roadmaps focused on consolidation during 2014, applications are worth a more detailed look, as are other topics (including processor technologies and HPC in the cloud) that will also be pertinent for HPC in 2014.

HPC applications
HPC applications are more complex than most enterprise or consumer applications. Whereas the enterprise space is dominated by relatively few, very widely used applications, the HPC landscape is different: HPC is driven by many applications in each of more than a dozen vertical segments. Parallelism and heterogeneity within HPC systems at chip, system, cluster, and cloud scale mean that – in order to deliver the best performance – applications must be tuned for specific architectures.

HPC applications are run on a wide range of platforms. At the low end, a user may be sitting in front of a multi-processor, multi-core, GPU-accelerated workstation. In the mid-range, applications may be run on a departmental cluster, whereas, at the high end, a supercomputer owned by an enterprise, or a cloud-based HPC facility, may be used. The range of target architectures, job management tools, and application licensing regimes means that it can be difficult to make the best use of this complex environment. For independent software vendors (ISVs), the cloud is both an opportunity and a threat. The cloud enables ISVs to deliver a broader, more flexible set of offerings to their customers, but it also makes it easier for a user to explore alternative options.

One of the issues where progress is needed to enable more flexible use of a range of HPC platforms is software licensing. While some companies delivering commercial HPC applications are taking an enlightened approach and providing flexible licensing to meet the evolving landscape of HPC platforms, some well-established companies are more concerned with protecting their short-term revenue streams – which makes fully exploiting the potential of HPC more difficult, more costly, and less appealing to some users. Significant progress on flexible licensing for HPC applications can be expected from many ISVs during 2014.

HPC tools and standards
OpenMP is a high-level approach that allows programmers to parallelise codes for multicore and multiprocessor systems. OpenACC takes this one step further, using pragmas to tell the compiler which code segments should be accelerated. Although OpenACC is blazing the trail for accelerator development tools, the theory is that, once the best approach has been identified, this (or something like it) will be supported in OpenMP. It is important for the HPC industry to have a widely used standard approach to programming accelerated systems, so it is hoped that the major participants continue to drive towards this goal as fast as is reasonably possible.

OpenMP and OpenACC offer an evolutionary strategy for moving serial codes to parallel systems, and parallel codes to accelerated systems. Once the hot spots in a code have been identified, adding pragmas in those areas can help the compiler to optimise these time-consuming sections of code. Although this strategy won’t always give the best performance without additional tuning, it does mean there is a very low barrier to entry into the world of accelerated computing.
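To make the approach concrete, here is a minimal sketch (in C, with an illustrative dot product standing in for a profiled hot spot) of how a single OpenMP pragma parallelises a time-consuming loop. A compiler built without OpenMP support simply ignores the pragma and the loop runs serially, but correctly:

```c
#include <stddef.h>

/* Hot spot identified by profiling: a dense dot product.
 * One OpenMP pragma parallelises the loop across cores; the
 * reduction clause tells the compiler to combine the per-thread
 * partial sums safely. Without OpenMP, the pragma is ignored
 * and the code still produces the same result, just serially. */
double dot(const double *x, const double *y, size_t n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

This is the low-barrier entry point the text describes: the serial code is unchanged apart from the directive, and further tuning (scheduling clauses, data layout) can be layered on later.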

The accelerated computing market is evolving rapidly, while standardisation is generally delayed until a market is more stable. A standard way of building applications that use accelerators (most likely with OpenMP) would provide a benefit to users, in that they would only have to port a code once and could then target whatever accelerator was the flavour of the day. Having said that, tuning would be required to get the very best out of each target architecture – but at least an application would be portable between accelerated architectures. Progress is being made (the latest OpenMP standard – OpenMP 4.0 – includes support for accelerators that is similar to that offered in the initial version of OpenACC), but faster progress would help the market for accelerators grow, and make life easier for users.

This year, 2014, will see a number of important compiler vendors producing OpenMP 4.0-compliant compilers (OpenMP 4.0 being the first version of the standard to include accelerator directives). While the underlying philosophy of what OpenMP offers for accelerators is the same as that provided by OpenACC, there are differences in implementation that may make it difficult to reconcile the latest version of OpenACC (which relies on greater compiler automation) with an easy evolution of OpenMP. Perhaps the battle between OpenMP and OpenACC will be fought not within the HPC industry, but in the mobile world, where the likes of Nvidia (with the Tegra K1, programmed with OpenACC) and Texas Instruments (with the Keystone 2, programmed with OpenMP) target a market with the potential for sales of billions of devices during 2014.
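A sketch of the same loop annotated both ways shows how close the two philosophies are, and where the style differs. The function names and data sizes here are illustrative; a compiler supporting neither directive set ignores the pragmas and runs both loops serially on the host, producing identical results:

```c
#include <stddef.h>

/* OpenACC: the programmer marks the region and the data
 * movement; the compiler decides how to map the loop onto
 * the accelerator. */
void saxpy_acc(float a, const float *x, float *y, size_t n)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* OpenMP 4.0: 'target' offloads the region, while 'teams
 * distribute parallel for' spells out the mapping onto the
 * device more explicitly than OpenACC does. */
void saxpy_omp(float a, const float *x, float *y, size_t n)
{
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The loop bodies are identical; the divergence the text worries about lies entirely in the directives and their clauses, which is exactly what makes reconciliation look feasible in principle but fiddly in practice.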

One of the issues relating to standardisation is that both Intel and Nvidia think that they are the incumbent. In a way, this is true for both companies. Intel is the incumbent, in that a large number of parallel applications targeted at Xeon use OpenMP; while for Nvidia, the majority of users targeting accelerators today use CUDA. So, while in the long term both companies may support a standard programming model for accelerators, in the short term they both have to be pragmatic and consider next quarter’s numbers. The next step for OpenMP will be a refinement of the current support for accelerators, probably two years away, by which time there is a danger that OpenACC will have diverged so much that it is difficult to reconcile the two. In the meantime, shared memory between mainstream processors and accelerators will make the OpenMP approach easier to work with. OpenACC needs to demonstrate that it is working to advance accelerator programming capabilities of future versions of OpenMP, if it is to avoid being marginalised.

CUDA and OpenCL work closer to the metal, with CUDA being Nvidia’s proprietary development environment and OpenCL supporting a wide range of platforms (including Nvidia GPUs). Compiler company PGI produced a CUDA compiler for x86 systems; the company has since been acquired by Nvidia and continues to support the x86 version, so although CUDA is proprietary, it does offer a level of code portability. The low-level options (CUDA and OpenCL) often deliver better performance, but it can be hard work for a programmer to get it right. The high-level options (OpenMP and OpenACC) are much easier to use, but may not deliver the same level of performance out of the box.

Back in the 1980s, I worked for a software company that provided compilers and development tools for the HPC industry, with a strong focus on array processors (the accelerators of the day). Writing in the company magazine about the hardware platforms emerging around 30 years ago, I noted: ‘More sophisticated software tools will be needed to drive the machines efficiently, but software technology does seem finally to be catching up with the array processor hardware technology.’ In retrospect, I was being rather optimistic. Accelerator hardware today may cost a fraction of what it did way back when, and be the size of a book instead of the size of a fridge, but the promise of easy, automated, programmability remains just out of reach. Maybe next year.

HPC skills
Benchmarking of HPC applications is important for two reasons. First, it aids understanding of performance and can help focus new ‘tuning’ work in the right place. Second, it can identify the best architecture for a specific application. But to deliver the best performance for a given application, and to understand how to write meaningful benchmarks, the user often has to become an HPC expert in addition to being a domain expert – something that many scientists and engineers either don’t want to do, or don’t have the time to do.
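As a minimal illustration of the first point – timing a candidate hot spot to see where tuning effort should go – the sketch below uses the standard C clock() to time a naive matrix-vector product. Both the kernel and the single-shot measurement are illustrative; a serious benchmark would use a wall-clock timer, warm the caches, and report the best of several repeated runs:

```c
#include <time.h>

/* Candidate hot spot: a naive matrix-vector product,
 * y = A*x, with A stored row-major as an n*n array. */
static void matvec(int n, const double *a, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += a[i * n + j] * x[j];
        y[i] = s;
    }
}

/* Time one run of the kernel using the standard C clock();
 * returns elapsed processor seconds. clock() is coarse, so
 * for small n a real benchmark would loop the kernel many
 * times and divide. */
static double bench_matvec(int n, const double *a, const double *x,
                           double *y)
{
    clock_t t0 = clock();
    matvec(n, a, x, y);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Even a harness this small answers the two questions the text raises: where the time goes, and (run on two machines) which architecture suits the application better.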

HPC systems are becoming more complex, exploiting multi-core, multi-socket systems in clusters that are supported by heterogeneous accelerators. While some universities are doing an excellent job at promoting HPC skills to both their students and industry, parallel programming is still a black art for many programmers. There is a significant skills gap, with many organisations training their own HPC staff as they cannot find skilled HPC programmers. More HPC training is required for industry, academia, and even in schools to excite the next generation of programmers to consider HPC as a career that can be both fun and rewarding.

Processors and accelerators
One change expected for 2014 is the arrival of 64-bit ARM processors, which will drive interest in ARM for both enterprise and HPC users. While x86 is a great processor family, it is relatively expensive and does not offer the same opportunities for collaboration in an open ecosystem that ARM does. Licensees of the 64-bit ARM core designs include AMD, Broadcom, Samsung, and STMicroelectronics. Although ARM is already being used for HPC by the Mont-Blanc project, which is funded by the European Commission, some people in the industry think it more likely that ARM processors will be used in a supporting role, such as in storage systems.

According to leading HPC analyst firm Intersect360, 85 per cent of accelerated systems use Nvidia GPUs, while four per cent use Intel Xeon Phi. At the high end, 37 of the systems in the TOP500 list use Nvidia GPUs, while 12 use Xeon Phi. The Knights Landing variant of Intel’s Xeon Phi won’t be with us until 2015, and Nvidia remains tight-lipped about when the next generation Maxwell GPU will appear – but the expectation is that Maxwell will ship during 2014. In addition to being rumoured to be four times faster than Nvidia’s current Kepler GPU, the next-generation Maxwell device will augment the GPU with ARM processor cores (a development from Project Denver), and will also offer Unified Virtual Memory. In the meantime, Nvidia’s Tesla K40 GPU (which at 12 GB has double the amount of memory of its predecessor) broadens the potential market for GPU-accelerated applications. Nvidia also has the Tegra K1 which offers a high-performance, heterogeneous architecture with a very low power consumption. Although this was developed for the mobile market, it will be deployed in new HPC architectures.

In 2013, Imagination Technologies acquired MIPS Technologies – remember that MIPS processors were the engines that powered many of SGI’s most successful HPC systems. During my research for this article, an off-the-wall idea was put to me: the idea of a multicore MIPS64 processor integrated with an Imagination Technologies GPU being targeted at the HPC market. Both architectures are supported by OpenCL, so perhaps it’s not a daft idea! Another, different architecture being used in HPC is TI’s Keystone 2, which combines ARM and DSP cores in a package that is well suited to the emerging needs of handling the data deluge generated by the ‘internet of things’ and other big data sources.

HPC in the cloud
My recent article ‘HPC in the cloud – is it real?’ (Scientific Computing World Oct/Nov 2013 page 38) identified many offerings for HPC in the cloud, and a great deal of interest from users, but few companies have yet committed the majority of their HPC workload to the cloud. But I think that this will change in 2014, as many of the hurdles to exploiting the cloud for HPC applications are overcome. Although the use of cloud-based HPC facilities for production runs is still at an early stage, a significant number of companies are exploring how to exploit HPC in the cloud in order to meet peak demand, or to avoid complicated and time-consuming HPC procurements.

But HPC in the cloud is not the best answer for all users. There are still privacy, trust, security, regulatory, and intellectual property issues that must be addressed, and a cloud-based solution will not tick these boxes for all organisations. As more HPC-tuned cloud facilities start to offer a diverse range of technologies and business models, more users will be able to take advantage of the flexibility of using the cloud for their HPC workloads. This will not only move many users from in-house facilities to the cloud, but will bring new users to HPC – users who have neither the skills nor the interest to procure and maintain complex, accelerated clusters – but who can benefit from running their applications on these types of systems.

Other HPC issues
There is lots of hype surrounding the HPC industry, related to both products and trends. Perhaps 2014 will be the year when we stop believing some of the hype and take a more realistic perspective. One example of hype is that the industry will deliver a programmable, affordable Exascale system that operates with a reasonable power consumption by the end of the decade.
Although progress has been made, and there are now a handful of systems whose peak performance exceeds 10 Petaflop/s, the current power consumption is still a factor of 20 too high, and product roadmaps do not project that this will be solved by the end of the decade. At the other end of the scale, very good HPC systems can be built using low-cost components (e.g. last year’s processors) instead of always going for the latest, greatest, most complex and expensive devices.

The consolidation of HPC and big data technologies will continue during 2014. Many HPC vendors have invested in high-end storage capabilities, and the challenges of big data will bring new opportunities for the HPC industry beyond traditional HPC market segments. But big data isn’t just about storing the data; it’s about processing it as well. Many of today’s big data solutions were developed outside of traditional HPC market segments, but use a typical HPC approach of batch-processing large lumps of data. An emerging approach is to analyse streaming data in flight, when a response is required within milliseconds, or even microseconds, or the moment is gone. Some of the large internet companies, such as Google, PayPal, and Twitter, are exploring these techniques in order to analyse their customers’ behaviour.

The rapid evolution of the internet of things will bring many new opportunities to exploit HPC tools and techniques in embedded or distributed environments.