Innovation at scale

Robert Roe explores advances in CPU, accelerator and networking hardware that is being designed to support exascale.

Creating the technology stack for exascale has taken years of innovation beyond simple iterative improvement of technology. In many cases, new computing architectures, networking systems, accelerators and processors are being constructed and optimised to deliver efficient exascale computing.

The drive towards exascale has often focused on delivering the highest possible raw computational power. The standard measure of exascale has generally been an exaflop, or the ability to perform 10^18 floating-point operations per second. But this really just scratches the surface of what is required to support an exascale system. Scientists require sustained application performance on real scientific codes.

Driving application performance at exascale requires a combination of computational power, I/O and memory bandwidth, and increases in energy efficiency that can make these future systems viable.
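To give a sense of why energy efficiency matters at this scale, the short Python sketch below works out how many gigaflops per watt a system would need to reach one exaflop within a given power budget. The power budgets are hypothetical values chosen purely for illustration and are not taken from any system discussed here.

```python
EXAFLOP = 10**18  # floating-point operations per second


def required_efficiency_gflops_per_watt(flops: float, power_megawatts: float) -> float:
    """GFLOPS per watt needed to deliver `flops` within the given power budget."""
    watts = power_megawatts * 1e6
    return flops / watts / 1e9


# Hypothetical power budgets, for illustration only.
for budget_mw in (20, 30, 40):
    efficiency = required_efficiency_gflops_per_watt(EXAFLOP, budget_mw)
    print(f"{budget_mw} MW budget: {efficiency:.1f} GFLOPS/W required")
```

The smaller the power budget, the higher the efficiency each processor and accelerator has to deliver, which is one reason low-power designs feature so heavily in exascale roadmaps.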

The European Processor Initiative (EPI) is an ongoing project funded by the European Commission, whose aim is to design and implement a roadmap for a new family of low-power European processors for extreme-scale computing that includes high-performance big-data and a range of emerging applications. 

The EPI technology stack includes a General Purpose Processor (GPP) research stream and an accelerator stream which supports the development of multiple accelerator technologies. This is being co-developed to deliver a European-based HPC platform for exascale computing. 

The GPP is based on the Arm ISA while the accelerator is based on RISC-V. Etienne Walter of Atos, EPI Phase 2 General Manager, commented: ‘The continuation of phase one, but more clearly it’s about finishing the work initiated in phase one with the first generation of processors. It’s about improving global performance, and also the security of the chips. It’s rather similar also for the accelerator. We have several improvements in mind, we are changing the foundry technology, and we will have some denser and more complete chips for the accelerator.’

The decision to codevelop both the GPP and accelerator technologies in the same project hugely increases the complexity, but also provides a range of potential benefits. For example, EPI organisers have full control over the design specification of this hardware stack and can thus optimise both components around their given objectives. This means that each technology is optimised together. This could potentially provide better performance and efficiency as components have been designed in tandem to support the same goals.

Walter noted that this gives certain advantages that support the European Commission’s objectives: ‘It was really a benefit because we can see in the competitive landscape, we need very general-purpose processors, we need the accelerator to have the computing power that is needed for many application today and a growing number in the future. It makes sense to work on having both and trying to make them work together as well as possible.’

‘It would certainly have been simpler to address only one because of course you would have less discussion and probably a simpler consortium,’ Walter continued. ‘But I expect a long-term benefit from working on both aspects. For instance, the GPP stream is based on the Arm ISA. That’s one fundamental choice. The accelerator work is more based on RISC-V ISA. We expect that, in the future, we will have more RISC-V in the GPP. So we have the kind of cross-fertilisation for the GPP stream and that is just one benefit of working together.’

Walter was also careful to state that there is no single choice or architectural design that can support the entire HPC ecosystem. There has to be a careful selection of trade-offs. In the case of EPI, the project has chosen a platform where the GPP stream focuses on developing a stable CPU that can support legacy x86 applications. On the accelerator side, it is proposing multiple types of acceleration technology that can each be used to maximise the benefit for a certain subset of applications.

‘There is no optimal solution for all applications, it’s really not possible. So we have to consider different combinations and so this is why we really work on one side to ensure legacy and consistency with the Arm ISA and ecosystem,’ stated Walter. ‘Here, we can benefit from ecosystems that already exist. We can see the Fugaku system in Japan, for instance. We have quite significant systems running the Arm ISA.’

The EPI project builds on the existing Arm ecosystem for HPC but also on previous European research projects such as the series of Mont-Blanc projects, which investigated the use of Arm for HPC starting in 2011. ‘We have the experience from the work done within the Mont-Blanc project working on the Arm ISA. It has proved really easy to port some x86 applications onto the Arm ISA. Of course, you need to recompile the code, but, in general, we have had very little trouble doing that. So it’s quite a limited effort when compared to taking a new programming model into account. It’s not the same level,’ added Walter.

Exascale networking

CPU and accelerator technologies are just one aspect of the exascale puzzle. As the parallelism in exascale systems will be vastly larger than anything seen today, it will be a significant challenge to deliver the I/O bandwidth needed to support application performance. This is a challenge not only for exascale HPC but also for other markets, including AI and traditional data centre applications.

Hewlett Packard Enterprise and Ayar Labs recently signed a multi-year strategic collaboration to accelerate the networking performance of computing systems and data centres by developing silicon photonics solutions based on optical I/O technology. This was soon followed by news that Ayar Labs had secured $130m in additional funding from Boardman Bay Capital Management, Hewlett Packard Enterprise (HPE) and Nvidia, as well as multiple new and existing financial investors which include GlobalFoundries and Intel Capital.

Silicon photonics will be used to enhance the networking capabilities and support future requirements for high performance computing (HPC), artificial intelligence (AI), and cloud computing architectures. The technology also has the potential to reduce the amount of energy used in data centres and large computing systems.

Hugo Saleh, Ayar Labs senior vice president of commercial operations, stated: ‘Within the press release, we talked about a few things. One is a future design of HPE Slingshot architecture, which has its genesis back at Cray. Today it is their high-end, Ethernet-like, networking solution that is targeted for HPC. We’re also working with HPE on advanced architectures where we’re talking about the composability of disaggregated resources, with an intelligent software stack.’

Solving problems for extreme-scale HPC

The silicon photonics designed by Ayar Labs could be used to create architectural designs that can support different configurations of hardware. ‘When we talk about the different markets we serve, we like to think of this whole cloud space and the focus there mostly is on disaggregated architectures or pooled and composable resources,’ Saleh said. ‘The reality is that it also applies to HPC. For AI and HPC, the focus is on glueless fabrics and memory semantic fabrics. AI especially is on the glueless side,’ Saleh continued. ‘So think about systems that may want to interconnect 64, 128 or 256 CPUs, seamlessly. We’re not talking about traditional, large, Xeon class CPUs, these could be smaller accelerators, or AI engines, where you want to be able to create a mesh.

‘In the AI space, I like to simplify it as: we’re trying to replicate the human brain. You’ve got a bunch of nodes, maybe each node doesn’t compute a lot, but it does a very specialised function. Then you have a lot of connections, the synapses between all those different nodes. And you need those nodes to be firing at high bandwidth, very low latency and low power to be able to create and solve these large AI problems,’ Saleh added.

The partnership between HPE and Ayar Labs aims to develop capabilities that leverage optical I/O, which is a silicon photonics-based technology that uses light instead of electricity to transmit data, to integrate with HPE Slingshot or other future networking products.

‘Whether you’re talking about HPC, or disaggregated computing, there is a real limiter on I/O,’ said Saleh. ‘In HPC, it’s usually referred to as a memory bottleneck. It’s not a memory capacity issue, it’s the ability to move the data out of memory DIMMs into the CPU and back. The other bottleneck that’s been seen and talked about quite a bit is the bottleneck on the GPU. Between the CPU and GPU transferring the data and then again, between the GPU itself and the memory.’

These bottlenecks are a growing concern for scientists and researchers using HPC and AI systems as they have the potential to limit application performance.
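One common way to reason about this kind of memory bottleneck is the roofline model, which is a general performance model and not something specific to Ayar Labs or HPE. The Python sketch below assumes a hypothetical processor with a made-up peak compute rate and memory bandwidth, and shows how a kernel that performs few operations per byte moved is capped by bandwidth, not by compute.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gb_s: float, flops_per_byte: float) -> float:
    """Roofline estimate: the lower of the compute ceiling and the memory ceiling."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


def is_memory_bound(peak_gflops: float, bandwidth_gb_s: float, flops_per_byte: float) -> bool:
    """True when memory traffic, not compute, limits the kernel."""
    return bandwidth_gb_s * flops_per_byte < peak_gflops


# Hypothetical machine: 2,000 GFLOP/s peak and 200 GB/s memory bandwidth.
PEAK_GFLOPS = 2000.0
BANDWIDTH_GB_S = 200.0

# Hypothetical kernels with low and high arithmetic intensity (flops per byte).
for name, intensity in (("streaming kernel", 0.25), ("dense kernel", 20.0)):
    rate = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, intensity)
    bound = "memory" if is_memory_bound(PEAK_GFLOPS, BANDWIDTH_GB_S, intensity) else "compute"
    print(f"{name}: about {rate:.0f} GFLOP/s attainable ({bound}-bound)")
```

Under these assumed numbers, raising the bandwidth lifts the ceiling for the streaming kernel directly, while the dense kernel sees no benefit. That is why I/O improvements matter most to data-hungry codes.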

‘What we do at Ayar Labs is an attempt to change the physical domain that data is transmitted,’ noted Saleh. ‘Going from electricity, voltages and currents, to photons. And we do that coming straight out of the socket. So it’s not a transceiver at the back of the server, it’s not a mid-board optics. We design chiplets that sit inside of the package, that are nearly abutted to the CPU, memory, GPU or accelerator. We’re agnostic to the host ASIC. Then we transmit photons and light outside of the package for your high speed, low power I/O.’

Ayar Labs first demonstrated this technology at Supercomputing 2019, the annual US conference and exhibition. ‘We have a full test rig. We first demonstrated our technology to the HPC community at Supercomputing 2019 in Denver. Since then we’ve made two public announcements that are the projects we’re doing with Intel. So Intel has themselves demonstrated an FPGA with our photonics inside of it, transmitting massive amounts of data at much lower power,’ stated Saleh.

This technology could substantially increase the memory bandwidth available to future HPC and AI systems. Each chiplet delivers the equivalent of 64 PCIe Gen 5 lanes, or up to two terabits per second of I/O performance. The system uses standard silicon fabrication techniques along with disaggregated multi-wavelength lasers to achieve high-speed, high-density chip-to-chip communication with energy use in the picojoule-per-bit range.
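The two-terabit figure can be sanity-checked with some simple arithmetic. The Python sketch below assumes the comparison refers to the raw PCIe Gen 5 signalling rate of 32 gigatransfers per second per lane in one direction, which the article does not specify. The energy-per-bit values are placeholders to show how picojoules per bit translate into watts and are not Ayar Labs figures.

```python
PCIE_GEN5_GT_PER_S = 32.0          # raw signalling rate per lane, per direction
PCIE_GEN5_ENCODING = 128.0 / 130.0  # 128b/130b line-encoding efficiency


def aggregate_tbps(lanes: int, gt_per_s: float = PCIE_GEN5_GT_PER_S, encoding: float = 1.0) -> float:
    """Aggregate one-direction bandwidth in terabits per second."""
    return lanes * gt_per_s * encoding / 1000.0


def io_power_watts(tbps: float, picojoules_per_bit: float) -> float:
    """Power drawn when moving `tbps` terabits per second at a given energy per bit."""
    return tbps * 1e12 * picojoules_per_bit * 1e-12


raw = aggregate_tbps(64)
usable = aggregate_tbps(64, encoding=PCIE_GEN5_ENCODING)
print(f"64 lanes, raw rate: {raw:.3f} Tb/s")
print(f"64 lanes, after 128b/130b encoding: {usable:.3f} Tb/s")

# Placeholder energy-per-bit values, for illustration only.
for pj in (1.0, 5.0, 10.0):
    print(f"{pj:.0f} pJ/bit at 2 Tb/s: {io_power_watts(2.0, pj):.0f} W")
```

At these rates every picojoule per bit costs roughly two watts per chiplet, which is why energy per bit is the figure of merit for this kind of link.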

Ayar Labs developed its technology alongside GlobalFoundries as part of its monolithic silicon photonics platform.

‘We worked with GlobalFoundries on developing a monolithic process, one that lets you put electronics and optics on the same chip,’ Saleh said. ‘A lot of traditional optics are separate; we have it all combined into one and that simplifies our customer’s life when they’re packaging all these components – it reduces power, it reduces costs and reduces latency.’

GF Fotonix is GlobalFoundries’ next-generation monolithic platform, which is the first in the industry to combine its 300mm photonics features and 300GHz-class RF-CMOS on a silicon wafer. The process has been designed to deliver performance at scale and will be used to develop photonic compute and sensing applications. Ayar Labs also helped GF develop an advanced electro-optic PDK that will be released in Q2 2022 and will be integrated into electronic design automation (EDA) vendor design tools.

Case study: NTU scientists boosting traffic control AI by 200 per cent


A team of scientists at NTU has adopted Gigabyte’s G242-P32 server and the Nvidia Arm HPC Developer Kit to incubate a ‘high-precision traffic flow model’ – a smart traffic solution that can be used to test autonomous vehicles and identify accident-prone road sections for immediate redress.

The Nvidia-Arm-based solution gives the project a 200 per cent boost in efficiency, thanks to the cloud-native processor architecture that ‘speaks’ the same coding language as the roadside sensors, the high number of CPU cores that excel at parallel computing, the synergy with GPUs that enables heterogeneous computing and the ISO certifications, which make the resulting model easily deployable for automakers and government regulators alike.

Dr Chi-Sheng Shih, professor and director at the Graduate Institute of Networking and Multimedia at National Taiwan University (NTU), is leading a team of scientists to develop a ‘high-precision traffic flow model’ of Taiwan’s roads and highways. The benefits of such a model are twofold. One, developers of autonomous vehicles and ADAS can conduct simulations to test their creations, while government regulators can run safety checks before greenlighting a new product.

Two, existing ‘accident-prone road sections’ – locations which exhibit a higher frequency and greater severity of vehicular accidents – can be quickly identified, so steps can be taken to prevent more accidents and save lives. The model is already being tested on roads in northern and central Taiwan. The team is in talks with Tier IV, a deep-tech startup based in Japan, about incorporating the finished product into Autoware, the world’s leading open-source software project for autonomous driving; this will pave the way for broader adoption and the possibility of commercialisation.

How are Dr Shih and his team developing the model? First, three or four sensor packets, each composed of a lidar and three cameras, are installed along a stretch of road around 100 to 200 metres long. During each batch of testing, the sensors gather data from the traffic flow for around two hours. Data points include the number of vehicles, vehicular speed, the distance between each vehicle, etc. Then, the data is taken back to the computer lab to be processed. The end result is a highly precise computer model that shows intricate details about the traffic flow; it is a kind of digital twin that can be used for mobility simulation and modelling, which is a key component of a smart traffic solution.
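As a rough illustration of the kind of data points described above, the Python sketch below summarises a single snapshot of vehicles along a monitored stretch of road. The `Vehicle` record, its field names and the sample values are all hypothetical. This is not the NTU team's pipeline, only a minimal example of deriving vehicle count, mean speed and inter-vehicle distance from detections.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Vehicle:
    position_m: float  # distance along the monitored stretch, in metres
    speed_kmh: float   # measured speed, in kilometres per hour


def summarise(snapshot: list[Vehicle]) -> dict:
    """Derive count, mean speed and gaps between consecutive vehicles."""
    ordered = sorted(snapshot, key=lambda v: v.position_m)
    gaps = [b.position_m - a.position_m for a, b in zip(ordered, ordered[1:])]
    return {
        "vehicle_count": len(ordered),
        "mean_speed_kmh": mean(v.speed_kmh for v in ordered) if ordered else None,
        "gaps_m": gaps,
    }


# Hypothetical detections from one instant on a 200-metre stretch.
snapshot = [
    Vehicle(position_m=120.0, speed_kmh=40.0),
    Vehicle(position_m=10.0, speed_kmh=50.0),
    Vehicle(position_m=45.0, speed_kmh=60.0),
]
print(summarise(snapshot))
```

Repeating such a summary over two hours of snapshots would give the time series of flow, speed and spacing that a traffic model could be fitted to.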

‘Our goal is to serve as the Qianliyan and Shunfeng’er of autonomous vehicles,’ says Dr Shih, citing two deities from Chinese mythology known for their far-seeing eyes and all-hearing ears. Not only can the computer model improve the positioning accuracy and safety of self-driving cars, it can also be used to analyse and fine-tune traffic flow, which is beneficial for all vehicles, autonomous or otherwise.

In 2021, Dr Shih’s team welcomed a valuable new member: Nvidia’s Arm HPC Developer Kit, an integrated hardware and software platform for creating, evaluating, and benchmarking HPC, AI, and scientific computing applications. At the core of this comprehensive solution is Gigabyte Technology’s G242-P32, a G-Series GPU Server powered by a single Arm-based Ampere Altra processor.

Its contribution to the research project has been remarkable. By Dr Shih’s estimates, development time has been reduced by at least half; in other words, the team now works at 200 per cent of its previous efficiency. The scientists have taken to calling Gigabyte’s Nvidia-Arm-based solution a ‘machine learning multicooker’ – an all-in-one solution that can train the AI, develop the computer model, transfer the data, and more. It is a real boon to the advancement of the traffic flow model, and it has made the team’s work considerably easier.

How have Gigabyte’s G242-P32 and the Nvidia Arm HPC DevKit been able to accomplish all this? The four main benefits can be summarised as follows:

Arm processors are ‘cloud-native’, meaning they follow the same RISC architecture as the computer chips used in roadside devices.

The Ampere Altra CPU has an immense number of cores – up to 80 in a single processor – making it eminently suitable for parallel computing.

The DevKit is outfitted with dual Nvidia A100 GPUs, which complement the CPU through a process known as heterogeneous computing. What’s more, the 8-channel 512GB DDR4 memory provides the necessary bandwidth to handle the high data transfer rate.

The Arm solution observes the ISO 26262 safety standards, which means computer models developed with Arm can be easily deployed by companies and institutes in the auto industry.


Credit: Greenbutterfly/Shutterstock

15 December 2021

Credit: Larich/Shutterstock

08 September 2021
