The challenges of exascale development

Robert Roe takes a look at challenges facing the first generation of exascale supercomputers’ development

With the recent announcement that the first scheduled US exascale system will be AMD-based, owing to delays to the Intel-developed Aurora system, it is clear that the challenges of developing these record-breaking HPC systems are causing setbacks to the original exascale timeline.

Intel has struggled to deliver the 7nm processors that will be used in the Aurora system, announcing delays earlier this year. It was thought that this might hold up Aurora, and that has now been confirmed in an InsideHPC interview with US Under Secretary for Science Paul Dabbar, who commented: ‘They’re getting very close to the first machines that are going to be delivered next year. And the first one is going to be at Oak Ridge.’

This statement makes it clear that it will be the Oak Ridge system Frontier which is delivered first, with the Aurora system at Argonne further delayed until Intel can overcome its manufacturing challenges to deliver the 7nm chips.

It is now understood that the Intel 7nm chips for PC users will shift to late 2022 or early 2023 while server-grade CPUs are now planned to launch in late 2023. The company’s 7nm GPU codenamed Ponte Vecchio will also be delayed until 2023.

Chip rivals such as Nvidia and AMD are both delivering 7nm chips which have been manufactured by TSMC.

Dabbar noted: ‘We are in discussions with Intel about that. I think we’re feeling good about the overall machine. I can’t go through exactly all the different options that we’re looking at for the Argonne machine, but we have a good degree of confidence that not very long after the Oak Ridge machine – that will be delivered as part of a plan to have at least one (exascale) machine up in 2021 – but not long after that we will have the Aurora machine, the Argonne machine, also.

‘The details are still being identified about exactly what we’re going to go through with Intel and their microelectronics. But we have confidence that the machine will also be delivered, and will be delivered right behind Oak Ridge. Our major partners have different components of the hardware and software stack,’ said Dabbar.

‘One of the things that will be occurring with the exascale programme is the Shasta software stack that is … already developed to a large degree by HPE Cray. That’s something they’re developing as part of running their system. They actually have deployed early versions of that on some of their other machines that are not exascale, so they already have the earlier versions that have been de-risked.’

Dabbar stressed that a big part of the deployment process is focused on developing the integration between hardware and software. ‘The software stack that is going to be riding on top of the hardware is going to be integrated. So a lot of the discussion today, and part of the deployment, is that layering of the operating system on top of the hardware.’

While it had been thought that the Intel-based system might be delayed ever since the problems with the 7nm chips were announced earlier this year, this is the first confirmation that the exascale system developed by Intel will be late. This highlights the difficulty of developing systems that must deliver performance and reliability at an unprecedented scale.

Calculated risks

Professor Satoshi Matsuoka, director of Riken Center for Computational Science, commented on the development of the Fugaku system and some of the challenges that the teams faced to deliver the hardware and software updates that made the system possible.

While it may not deliver the performance of the US systems scheduled to be delivered two years from now, the Fugaku supercomputer does highlight the development efforts of Japanese researchers and Fujitsu, who developed their own Arm-based processor and other technologies, such as the Tofu networking system used in Fugaku.

Matsuoka said that, early in development, Riken saw the need to accommodate emerging technologies such as AI and machine learning (ML), which require mixed precision to accelerate performance. ‘We have been very flexible in accommodating these types of emerging applications, and that is also due to the fact we are designing our own hardware from scratch.

‘Designing any hardware is a balancing act, and then you try to leapfrog by using advanced technologies. In our case, that was another thing that we were able to do because we are a national project and we are trying to achieve this unobtainable goal. 

‘Otherwise, it would not be possible by the standard development processes of a single company,’ he said.

Matsuoka gave standard CPU development cycles as an example, noting that while a speedup of 10 to 20 per cent per generation is fine for a commercial company making incremental gains to support its existing customer base, it was not sufficient to realise the ambitions of the Fugaku supercomputer.

‘It is harder to take risks, so you make advances but you make baby steps,’ said Matsuoka.

‘Of course, if you look at the holistic lifespan of hardware architecture you will see significant advances, but that is still bound by the limitations of having to accommodate existing users.’

Commercial companies have to manage risk more carefully, because the risk of developing new technology falls solely on the organisation. In the case of Riken, the development of the processor was subsidised by the Japanese government and Fujitsu.

‘In our case, we are a national project, we can, and in fact, we are obligated to take risks because the government are financing it. If we fail, then we may never be able to do it again but it is not like a company will go under. We are expected to take these risks and so we adopted or developed technologies that are otherwise still not used or utilised in existing CPUs such as high bandwidth memory (HBM). This has resulted in this leadership performance which others, to date, have not been able to follow,’ added Matsuoka.

Moonshot drives technology development

While the development of exascale technologies is clearly much more challenging than standard HPC development, Matsuoka notes that the goals can be achieved through carefully planned partnerships.

‘Commercial technologies have skills that we do not have. We cannot do detailed chip design. Fujitsu has been doing that for 50 to 60 years, if you count the mainframe days, so they have a wealth of experience, and the engineers and systems. Those are their assets, they are one of the few remaining companies left that can design server grade general-purpose CPUs.

‘It was a great collaboration with Riken and other universities spearheading innovations, and these elements being realised as a tangible product. Unless you have a chip that is ultra-reliable then the machine will crash every second. If you leave it to academia to make a chip it would be very likely that this would happen,’ said Matsuoka.

‘We do not have the capability to develop a chip that has this level of reliability but Fujitsu has been doing that for years. It was a fantastic combination of skillsets that allowed this moonshot-like development to come to fruition,’ added Matsuoka.

The Riken centre has relied on Sparc-based silicon for its previous supercomputers. The K computer, for example, the predecessor to the Fugaku supercomputer at Riken today, was based on Sparc processor technology developed at Fujitsu. Switching from one architecture to another that uses a completely different instruction set architecture (ISA) might suggest that much of the work involved replacing old technology with new, but there is significant technology transfer from the K computer to Fugaku.

‘If you look at the processor architecture, what the processor has to do on a very basic level is to understand the instruction set. Instruction set architectures are like languages. You can convey the same meaning but there are different languages that you may choose to use,’ said Matsuoka.

‘However, if a language is only spoken by a thousand people, then you do not have the knowledge and resources that would be available in a more common language.’

‘While Sparc, at one time, was very mainstream, in the heyday of RISC – maybe 20 to 30 years ago – the importance of Sparc diminished over time, and it has been replaced with x86 and other technologies,’ noted Matsuoka. ‘Fujitsu stuck with Sparc because they had considerable assets in Sparc-based codes.’

When planning the initial stages of Fugaku around ten years ago, Riken and its partners were looking to adopt a new ISA for the next generation of supercomputers.

‘It would be a big risk and cost for Fujitsu to switch to another ISA, but, on the other hand, if they stuck to Sparc they would be diminished into obscurity,’ stressed Matsuoka.

‘For HPC it is critically important that we achieve generality, because we have gone mainstream.’

Matsuoka also noted that the ISA could have been x86 but ‘there are some issues with x86 both at the system level and also from a licensing point of view, which made it almost impossible’.

He continued: ‘Arm was the obvious choice because it had already become a standard in the embedded space. In the end, it was decided that we would use Arm and the rest is history.’

The legacy of the K computer

With this large change in technology and all the development that went into the new system and its associated technologies, what lessons or knowledge has been taken forward and helped the development of this new system?

From the perspective of Professor Matsuoka, there has been a significant transfer of technology, which benefits scientists and also makes future technology development easier.

Matsuoka states: ‘Then the question is, is there any legacy of K embedded into the machine or the architecture? The answer from the ISA perspective is, of course, no, because it is a completely different language, so to speak, but there are many other remnants that exist.

‘The most pronounced is the fact that when Fujitsu designs its chips they divide it into the front end and the back end. The back end is fairly oblivious to the instruction set, because every language basically does the same thing. The back end of Fujitsu has been inherited over many years. It has been used for K, of course with evolutions, with gradual evolutions to add features but basically, it is the same back end.

‘For Fugaku, the back end is an evolution of the design but there are commonalities with K’s back end. It is not a direct re-utilisation; execution is much wider. There are memory enhancements and many other enhancements, but the baseline architecture is the same,’ he said.
