The rise of AMD and Arm as credible competitors to Intel, together with the use of other processing technologies such as IBM Power9, is driving new competition for market share in the HPC CPU market.
With accelerator technologies available to combine with or supplement them where necessary, there has not been this much choice in CPU technologies for HPC systems for a number of years. This can complicate matters when choosing a new system, or when deciding to update or write new code, as careful benchmarking and exploration of the technological possibilities are needed. However, the situation also provides an opportunity for many niche applications to find the right technology, as there are more options available, each of which has its own specialisation.
Examples of this can be found across the processor market. AMD, for example, has announced a new 64-core server CPU which has set new records in the SPECrate and High Performance Linpack (HPL) benchmarks. While the results are notable, the chip also requires liquid cooling and uses considerable power.
IBM, on the other hand, is working on increasing the options for memory bandwidth. At the Hot Chips 2019 conference earlier this year, the company highlighted its plans for IBM Power9 Advanced I/O, which is claimed to deliver more than 600GB/s sustained memory bandwidth. These 14nm chips will support OpenCAPI 4.0 and an Open Memory Interface (OMI), which provides a higher-bandwidth alternative to DDR.
Intel is pushing towards AI specialisation with its Neural Network Processor for Training, or NNP-T, developed by the AI team it gained through the 2016 acquisition of Nervana Systems, a startup company specialising in deep learning. Intel revealed new details of its Nervana neural network processors at the Hot Chips event, where it also presented details on hybrid chip packaging technology, Intel Optane DC persistent memory and chiplet technology for optical I/O.
Arm has a number of CPUs being developed in conjunction with its partners, such as the Cavium ThunderX2, which has already been announced in systems such as Sandia National Laboratories’ ‘Astra’ system, built by Hewlett Packard Enterprise (HPE).
Arm and Fujitsu are also developing a processor, currently known as the Arm64fx, for the RIKEN ‘Post K’ supercomputer. At the HPC User Forum in April this year, Satoshi Matsuoka, head of the RIKEN Center for Computational Science, gave a presentation on the Post K computer and the new processor developed in conjunction with Arm and Fujitsu.
‘Compared to other processors that are not HPC-optimised but optimised for things like web workloads, this is a processor that is totally HPC optimised. It has 1TB/s memory bandwidth and it has an on-die network, which is essentially a 400GB network integrated onto the die,’ stated Matsuoka.
‘It is similar to Cascade Lake in terms of FLOPS, but it has a much higher memory bandwidth and it also has various AI supports such as FP16 and FP8. But it is still a general-purpose CPU, it is not a GPU,’ Matsuoka added. ‘It runs Red Hat, for example, right out of the box. It runs Windows too.’
Matsuoka also stressed that the design is intended to be very energy efficient. As it is based on the Arm chip design, this is to be expected, but Matsuoka stated that ‘in some of the benchmarks, we have seen an order of magnitude improvement of per watt performance on things like CFD applications on a real chip - this is not simulated.’
This could be potentially huge in terms of sustained performance for many real HPC applications, although we will have to wait a bit longer to see how the processor performs in the final Post K system once it has been completed.
‘We take this chip and we build the largest supercomputer in the world. I cannot disclose the exact number of nodes yet but it will be the largest machine ever built, with more than 150,000 nodes,’ commented Matsuoka. ‘What is important is not so much the FLOPS. We all know that for real HPC applications it is the bandwidth that counts. The machine has a theoretical bandwidth of more than 150 petabytes per second, which is about an order of magnitude bigger than any other machine today.’
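The aggregate figure Matsuoka quotes follows from the per-node numbers given earlier in his talk. As a rough check, the Python sketch below multiplies the quoted 1TB/s of memory bandwidth per processor by the quoted lower bound of 150,000 nodes. It assumes one processor per node and decimal units, neither of which is stated in the presentation, and the function name is purely illustrative.

```python
# Back-of-the-envelope check of the aggregate memory bandwidth quoted above.
# Inputs are the figures quoted by Matsuoka. The node count is a lower bound
# ("more than 150,000"), so the result is a lower bound too.
# ASSUMPTIONS: one processor per node; decimal units (1 PB = 1,000 TB).

TB_PER_PB = 1000


def aggregate_bandwidth_pb_per_s(nodes: int, tb_per_s_per_node: float) -> float:
    """Sum of per-node memory bandwidth across the whole machine, in PB/s."""
    return nodes * tb_per_s_per_node / TB_PER_PB


if __name__ == "__main__":
    total = aggregate_bandwidth_pb_per_s(nodes=150_000, tb_per_s_per_node=1.0)
    print(f"Aggregate memory bandwidth: at least {total:.0f} PB/s")
```

Under those assumptions the total comes to 150 petabytes per second, which matches the ‘more than 150 petabytes per second’ in the quote.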
Rather than focusing purely on double-precision FLOPS, the Post K system will use the Arm64fx processor and the Tofu-D network to sustain extreme bandwidth on real applications such as seismic wave propagation, CFD and structural codes. Post K is expected to deliver more than 100 times the performance of the previous system for some key applications. The system will also include big data and AI/ML infrastructure.
The push towards AI
Intel’s announcement of the Nervana processors suggests that the company is pushing to capitalise on the huge growth in AI. The company announced two Nervana chips: the NNP-T (Neural Network Processor for Training) for training networks and the NNP-I for inferencing.
Intel Nervana NNP-T is built to train deep learning models at scale, prioritising two key real-world considerations: training a network as fast as possible and doing it within a given power budget. The chip, named ‘Spring Crest’, provides 24 tensor processors arranged in a grid with a core frequency of up to 1.1GHz, 4 x 8GB of HBM2-2400 memory and 60MB of distributed on-die memory.
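To put the memory configuration in context, the sketch below works out the total HBM2 capacity and a rough theoretical peak bandwidth from the ‘4 x 8GB of HBM2-2400’ figure. The capacity comes straight from the numbers above. The bandwidth estimate rests on two assumptions that are not in the article: that each HBM2 stack has a 1,024-bit interface, and that ‘2400’ refers to 2.4Gbit/s per pin. The function names are illustrative only.

```python
# Rough capacity and peak-bandwidth estimate for 4 stacks x 8GB of HBM2-2400.
# ASSUMPTIONS (not stated in the article): each HBM2 stack exposes a 1024-bit
# interface, and "2400" means a data rate of 2.4 Gbit/s per pin.

HBM2_BUS_WIDTH_BITS = 1024  # assumed interface width per stack


def total_capacity_gb(stacks: int, gb_per_stack: int) -> int:
    """Total on-package memory capacity in GB."""
    return stacks * gb_per_stack


def peak_bandwidth_gb_per_s(stacks: int, gbit_per_s_per_pin: float) -> float:
    """Theoretical peak bandwidth in GB/s across all stacks."""
    total_gbit_per_s = stacks * HBM2_BUS_WIDTH_BITS * gbit_per_s_per_pin
    return total_gbit_per_s / 8  # convert bits to bytes


if __name__ == "__main__":
    print(f"Capacity: {total_capacity_gb(4, 8)} GB")
    print(f"Peak bandwidth: {peak_bandwidth_gb_per_s(4, 2.4):.1f} GB/s")
```

This is a theoretical ceiling rather than a measured figure, and sustained bandwidth on real workloads would be expected to be lower.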
Intel also claimed that to account for future deep learning needs, the Intel Nervana NNP-T is built with flexibility and programmability so it can be tailored to accelerate a wide variety of workloads – both existing ones today and new ones that will emerge in the future.
The Intel Nervana NNP-I is built specifically for the inference market and aims to introduce deep learning acceleration, leveraging Intel’s 10nm process technology with Ice Lake cores to deliver high performance per watt for data centre AI inferencing workloads.
The chip, named ‘Spring Hill’, uses much less power than ‘Spring Crest’, at an estimated 10-50 watts as opposed to the 150-250 watt power envelope of Spring Crest. The NNP-I chip provides on-die SRAM, is built using Intel’s 10nm process technology and features dual-core processors and 12 Inference Compute Engines (ICE), which provide high-bandwidth memory access, a programmable vector processor and large internal SRAMs for power efficiency.
In a blog post leading up to the Hot Chips conference, Naveen Rao, vice president and general manager, Artificial Intelligence Products Group at Intel, commented on the need to specialise architectural development to suit AI workloads.
‘Data centres and the cloud need to have access to performant and scalable general-purpose computing and specialised acceleration for complex AI applications. In this future vision of AI everywhere, a holistic approach is needed—from hardware to software to applications,’ stated Rao.
‘In an AI-empowered world, we will need to adapt hardware solutions into a combination of processors tailored to specific use cases – like inference processing at the edge – and customer needs that best deliver breakthrough insights. This means looking at specific application needs and reducing latency by delivering the best results as close to the data as possible,’ added Rao.
Spring Hill can be added to any modern server that supports M.2 slots. According to Intel, the device communicates over the M.2 slot as a PCIe-based card rather than via NVMe.
Intel’s goal with NNP-I is to provide a dedicated inference accelerator that is easy to program, has short latencies, allows fast code porting and includes support for all major deep learning frameworks.
These are some of the first steps taken by Intel to compete in the AI and ML markets. The company announced its Intel Xe line of graphics cards earlier this year but, beyond that, we have not seen anything from Intel that could rival Nvidia’s hold on the AI market. The development of these processors, specifically designed to process neural networks, could help to build a foundation for Intel, allowing the company to gain a foothold in this market.
David Yip, OCF’s HPC and storage business development manager, thinks the rise of GPU technology means that HPC and AI development go hand in hand. The increase in AI provides added benefit to the HPC ecosystem.
‘There is a lot of co-development, AI and HPC are not mutually exclusive. They both need high-speed interconnects and very fast storage. It just so happens that AI functions better on GPUs. HPC has GPUs in abundance, so they mix very well.’
He also noted that AI is bringing new users to HPC systems who would not typically be using HPC technology. ‘Some of our customers in universities are seeing take-up by departments that were previously non-HPC orientated, because of that interest in AI. English, economics and humanities – they want to use the facilities that our customers have. We see non-traditional HPC users, so in some ways, the development of AI has done HPC a service,’ added Yip.
Yip noted that the fastest system on the Top500 in an academic setting is based in Cambridge. ‘It is just over 2.2Pflops and you have to go back to about 2010 to get that kind of performance at the top of the Top500. It is almost a decade ago, so there is a difference in these very large systems, but we do eventually see this kind of performance come down.’
Leaving a mark
AMD has had a strong year with notable contract wins in the US for the Department of Energy National Laboratories. It will deliver CPUs in the ‘Perlmutter’ system for Lawrence Berkeley National Laboratory (Berkeley Lab) and CPU and GPU components for the ‘Frontier’ system at Oak Ridge National Laboratory.
But AMD is not just striving for these large-scale deals with the US government labs, as it has released a number of processors in its EPYC line which are delivering very competitive performance. Following on from the ‘Naples’ 7001 line of CPUs, the company has now released a new HPC-focused CPU in the ‘Rome’ EPYC 7002 generation.
This new processor, known as the EPYC 7H12, has set records for performance in recent benchmark tests carried out by Atos. The tests focused on SPECrate benchmarks as well as High Performance Linpack (HPL), the latter being used in the Top500 to rank the fastest supercomputers. With a clock speed of 2.6 GHz, a 15 per cent increase over the EPYC 7742 processor, and a power draw of around 280 watts, it is closer in power consumption to a high-end datacentre GPU than to a CPU typically used in HPC. The chip also requires liquid cooling.
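The clock speed comparison can be checked with some simple arithmetic, sketched below in Python. The 2.6 GHz clock and the 64-core count are taken from the article. The 2.25 GHz base clock for the 7742 and the figure of 16 double-precision floating-point operations per core per cycle are assumptions added for illustration, and the function names are hypothetical.

```python
# Clock uplift and theoretical peak for a 64-core, 2.6 GHz processor.
# ASSUMPTIONS (not stated in the article): the 7742 has a 2.25 GHz base
# clock, and each core can retire 16 double-precision FLOPs per cycle.

ASSUMED_7742_BASE_GHZ = 2.25
ASSUMED_DP_FLOPS_PER_CYCLE = 16


def clock_uplift_percent(new_ghz: float, old_ghz: float) -> float:
    """Percentage increase in clock speed from old_ghz to new_ghz."""
    return (new_ghz / old_ghz - 1.0) * 100.0


def peak_gflops(cores: int, ghz: float, flops_per_cycle: int, sockets: int = 1) -> float:
    """Theoretical double-precision peak in GFLOPS at the given clock."""
    return sockets * cores * ghz * flops_per_cycle


if __name__ == "__main__":
    uplift = clock_uplift_percent(2.6, ASSUMED_7742_BASE_GHZ)
    per_socket = peak_gflops(64, 2.6, ASSUMED_DP_FLOPS_PER_CYCLE)
    two_socket = peak_gflops(64, 2.6, ASSUMED_DP_FLOPS_PER_CYCLE, sockets=2)
    print(f"Clock uplift: {uplift:.1f} per cent")
    print(f"Peak per socket: {per_socket:.1f} GFLOPS")
    print(f"Peak per two-socket node: {two_socket:.1f} GFLOPS")
```

Under these assumptions the uplift comes out slightly above 15 per cent. The peak figures are theoretical ceilings at the base clock, so measured HPL results would be expected to fall below them.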
The Atos benchmarks made use of Atos’ BullSequana Enhanced Direct Liquid Cooling system combined with the AMD EPYC 7H12 processor. The results of the Atos measurements currently top the best published results for two-socket nodes on four SPECrate benchmarks. Additionally, the system has set a new record for the HPL benchmark on an AMD EPYC CPU, with an 11 per cent increase in performance. These benchmarks aim to measure how hardware systems perform under compute-intensive workloads based on real-world applications.
‘We’re extremely proud that our BullSequana has achieved these world-record results. Our unique Enhanced Direct Liquid Cooling system provided the most efficient environment for achieving such performance of the AMD EPYC processor,’ said Agnès Boudot, senior vice president, Head of HPC and Quantum at Atos. ‘Our BullSequana equipped with the latest AMD chip, provides our customers with the highest available performance for HPC and AI workloads, with an optimised TCO, to support them in going beyond the limits of traditional simulation.’
‘Taking on the processing challenges of the world’s highest-performing systems meant creating a solution up to the task, which AMD achieved with the 2nd Generation AMD EPYC processor,’ said Scott Aylor, corporate vice president and general manager, AMD datacentre solutions. ‘When paired with Atos’ BullSequana and their own impressive capabilities and customer relationships, we can deliver a whole new range of possibilities to address the processing needs of the modern datacentre.’