Intelligent interconnects

Robert Roe looks at how interconnect technology is helping to change the landscape of high-performance computing

In recent years, the movement of data – rather than the sheer speed of computation – has become a major bottleneck to large scale or particularly data intensive HPC applications. This has driven bandwidth increases in interconnect technology, effectively widening the pipe that data uses to travel across the network.

While the HPC community will still see more bandwidth increases in the future – most likely to 200Gb/s, which will be introduced in the next two years – interconnect developers are also investigating new technologies that further increase the performance of a network by reducing the latency of communications: the time taken for a message to be passed across the network.

These developments are taking shape in two distinct directions. Hardware-accelerated smart switches are on the horizon, with commercial products from Mellanox and Bull due to ship in 2016 and 2017 respectively. There is also discussion around interconnect technology from Calient that uses purely optical switches. This could be used to produce a reconfigurable supercomputing architecture, although these products are only in the development stage at this time.

Developing technology specifically for HPC

Mellanox has been at the forefront of HPC interconnect technology for a number of years. In the latest Top500, the twice-annual list of the fastest supercomputers, based on the LINPACK benchmark, around 47 per cent of the systems use Mellanox InfiniBand interconnects. It should be noted that the LINPACK benchmark is used to determine the raw FLOPs performance of an HPC system, rather than its ability to pass data across the network. However, the appearance of Mellanox on this list indicates the prevalence of Mellanox technology in high-end HPC systems.

Unsurprisingly, Mellanox aims to continue the dominance that it has established in recent years. At the US Supercomputing conference, held in Austin, Texas during November 2015, Mellanox announced its first smart switch, based on 100Gb/s EDR InfiniBand and called Switch-IB 2.

Gilad Shainer, vice president for marketing at Mellanox Technologies, explained that there have been several key shifts in the development of HPC technology. The first was the shift from SMP nodes to cluster-based supercomputing: ‘The ability to take off-the-shelf servers and connect them to a supercomputer requires a very fast interconnect technology,’ commented Shainer.

The next shift in HPC was the move from single fast cores to multicore or many-core architectures. This increase in parallelism and the introduction of accelerator technologies further increased HPC’s reliance on interconnects, as computation was being split into increasingly smaller chunks, run in parallel across a network of servers.

‘We could not get any more performance from just increasing the frequency of CPUs,’ commented Shainer.

‘The idea of running everything on the CPU got us to the performance walls of today. Now we are moving to the exascale era and that means another technology change that revolves around co-design to create synergy.’

The Mellanox solution to overcoming these performance barriers is to implement intelligent switching, where the computational overheads of managing the movement of this data are moved to the switch itself, rather than being processed on the CPU. The first commercially available products in this area are the Mellanox Switch-IB 2 and the upcoming Bull interconnect, codenamed Bull Exascale Interconnect (BXI). Both aim to move MPI communication directly onto the switch silicon – freeing up the CPU to focus on computation.

The reasoning behind this development is that collective operations, such as summing the results of a series of calculations performed in parallel on different nodes, typically require multiple communication transactions across a network – each with its own associated latency. By offloading these collective operations to the interconnect, developers can provide a significant reduction in latency, as the operations are managed by the switch and not the CPU.

Shainer stated: ‘Today, when you run collective operations on the server for MPI, the server needs to have multiple transactions to complete just one collective operation. Anything you do on a server will not be able to overcome this. Moving the management and execution of collective operations to the switch enables the switch to complete MPI operations with one transaction over the network.’
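
As a rough illustration of Shainer’s point, the sketch below counts the communication rounds needed to sum one value per node. It assumes the host-based collective uses a recursive-doubling exchange, a common textbook scheme that is not confirmed here as the one any vendor uses, and it takes the ‘one transaction’ figure for the offloaded case directly from the quote above. The per-message latency of one microsecond is a hypothetical value chosen only to make the arithmetic visible.

```python
# Toy latency model for a sum (allreduce) across num_nodes servers.
# Assumptions: recursive doubling on the hosts; one network transaction
# when the switch executes the collective; payload time is ignored.

PER_MESSAGE_LATENCY_S = 1e-6  # hypothetical per-message latency (1 microsecond)
OFFLOADED_TRANSACTIONS = 1    # figure taken from the quote, not measured


def host_based_rounds(num_nodes: int) -> int:
    """Rounds of pairwise exchange for recursive doubling: ceil(log2(n))."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, using integers only.
    return (num_nodes - 1).bit_length()


def collective_latency(rounds: int, per_message_latency_s: float) -> float:
    """Total latency if every round costs one message latency."""
    return rounds * per_message_latency_s


if __name__ == "__main__":
    for nodes in (16, 1024, 16384):
        host = collective_latency(host_based_rounds(nodes), PER_MESSAGE_LATENCY_S)
        switch = collective_latency(OFFLOADED_TRANSACTIONS, PER_MESSAGE_LATENCY_S)
        print(f"{nodes:>6} nodes: host-based {host * 1e6:.0f} us, "
              f"offloaded {switch * 1e6:.0f} us")
```

In this simplified model the host-based cost grows with the logarithm of the node count while the offloaded cost stays flat, which is why the gap widens as systems scale. Real switches, topologies and message sizes will change the absolute numbers.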

Making use of internal expertise

While Mellanox has been developing this technology for many years, the entry of the European supercomputer provider Bull into the interconnect market may seem surprising. But as its CTO, Jean-Pierre Panziera, explains, the company has significant experience in developing ASICs (application-specific integrated circuits) of the kind used for the hardware acceleration of its upcoming BXI interconnect.

‘We have developed ASICs in the past that were meant for extending the capacity of SMP nodes; this is a project that was both for HPC and the enterprise markets,’ explained Panziera.

Panziera stressed that the main motivation behind Bull’s decision to develop its own interconnect technology stems from the same principle as Mellanox – co-design. ‘The second point is that we have a demand and we are working very closely with our customers. This project specifically was a co-development with our main partner CEA (the French Alternative Energies and Atomic Energy Commission). We had a partner, a use case and we had the technology, so we looked at what would make the most sense.’

While Bull has not developed interconnect technologies before BXI, the company is using the experience it has gained from producing ASICs in previous projects. ASICs are custom-built circuits designed for specific applications – such as offloading MPI processes. The advantage of using these custom integrated circuits is that they are designed for use in one application. As such, the logic elements can be fine-tuned so that the computational power and energy requirements are exactly what is needed – providing high performance and very low energy consumption for a single application when compared to a more general-purpose CPU.

Panziera said: ‘This is something that has always been in the genes of Bull, it is this experience in mastering these technologies.’

The management and execution of MPI operations in the BXI interconnect is handled by a custom-built ASIC that acts as an MPI engine – effectively a custom-built computer used to process MPI communications – in a similar fashion to Mellanox’ approach.

Panziera said: ‘HPC now has become a synonym with parallelism and a high degree of parallelism. Applications will often need to use thousands or even tens of thousands of nodes for your application. This puts a stress on the interconnect. Almost all applications are standardised around the MPI library for communications, for moving data between processes, between nodes.’

When asked about the potential impact of offloading MPI operations onto the switch, Panziera explained that interconnect performance is governed by two main areas, bandwidth and latency.

Panziera said: ‘It varies a lot from one application to another. If you think about the different components of performance for your application, it might be bandwidth. That is quite easy to achieve at the technology level. It is the width of the pipe in a link between the different components. If your application is some kind of LINPACK application, you do send some information across the system, but it is relatively small compared to the computation you are doing – here the impact of the interconnect will be noticeable but it will be small.’

He explained that when you start to scale applications to run across larger systems – thousands or tens of thousands of nodes – then the importance of the interconnect increases significantly. Panziera said: ‘If you have an application that is trying to push data, the latency and the number of messages that you are able to exchange on your network becomes crucial to application performance.’

‘It is really when you are pushing the application to a high level of parallelism, to a high level of performance, that you will see the most impact on your application. We think that for some applications it will be 10 to 20 per cent, and if you are really pushing it to the extreme – where you are trying to scale to the maximum, you could see a difference of up to a factor of two,’ concluded Panziera.
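
The two components Panziera describes can be illustrated with a simple first-order model in which the time to move a message is the latency plus the message size divided by the bandwidth. The sketch below assumes a 100Gb/s link, matching the EDR figure mentioned earlier, and a hypothetical latency of one microsecond; neither value describes BXI or any specific product.

```python
# First-order transfer-time model: time = latency + size / bandwidth.
# LATENCY_S is a hypothetical value; BANDWIDTH_BPS assumes a 100Gb/s link.

LATENCY_S = 1e-6        # assumed per-message latency in seconds
BANDWIDTH_BPS = 100e9   # assumed link bandwidth in bits per second


def transfer_time(message_bytes: int, latency_s: float, bandwidth_bps: float) -> float:
    """Seconds needed to deliver one message under the simple model."""
    return latency_s + (message_bytes * 8) / bandwidth_bps


def latency_share(message_bytes: int, latency_s: float, bandwidth_bps: float) -> float:
    """Fraction of the transfer time that is pure latency."""
    return latency_s / transfer_time(message_bytes, latency_s, bandwidth_bps)


if __name__ == "__main__":
    for size in (8, 4096, 1 << 20):
        share = latency_share(size, LATENCY_S, BANDWIDTH_BPS)
        print(f"{size:>8} bytes: latency is {share:.1%} of transfer time")
```

Under these assumptions, small messages are dominated almost entirely by latency, while large transfers are governed by bandwidth. That is consistent with the idea that applications exchanging many small messages at scale stand to benefit most from latency reductions.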

While Bull is developing its own interconnect technology, Panziera stressed that the company would still offer Mellanox’ InfiniBand solutions with the latest generation of its supercomputing platform named ‘Sequana’.

Panziera said: ‘It is not something that you can dictate to customers, if customers think that InfiniBand is better then we will offer InfiniBand to these customers.’

Reconfigurable supercomputing

While most technology development is an iterative process, sometimes a disruptive technology emerges that threatens to overturn the status quo. The US-based Calient is aiming to do just that with the use of its purely optical switches for HPC interconnect.

Daniel Tardent, VP of marketing at Calient, said: ‘We build a switch that is a pure optical layer circuit switch based on 3D MEMS technology using micromirrors. This is switching at its most simple level; we are not doing any kind of packet-based switching whatsoever.’

While this technology is still in the early stages of development, Calient’s switch concept offers the prospect of redefining supercomputing architectures – enabling HPC users to reconfigure their supercomputers from a shared pool of resources. Calient’s proposed solution uses single mode fibres for a switch system based on 320 input/output ports, which could support any combination of compute elements.

Tardent explained: ‘In between those two planes are micromirrors that can be adjusted by applying electrostatic voltages onto the back of the tiny mirrors. By angling the mirrors on both sides of the matrix, you can bounce light from any input fibre to any output fibre, which gives you a 320-port purely optical based switch.’
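
The behaviour Tardent describes, where any input fibre can be steered to any output fibre, amounts to a one-to-one mapping between ports. The sketch below models that mapping in software. The class and method names are hypothetical and do not correspond to any Calient interface; only the 320-port figure comes from the article.

```python
# Toy model of an optical circuit switch as a one-to-one port mapping.
# Names are illustrative only; no real switch API is being represented.


class OpticalCircuitSwitch:
    def __init__(self, num_ports: int = 320):
        self.num_ports = num_ports
        self._in_to_out: dict[int, int] = {}  # input port -> output port

    def _check_port(self, port: int) -> None:
        if not 0 <= port < self.num_ports:
            raise ValueError(f"port {port} is outside 0..{self.num_ports - 1}")

    def connect(self, in_port: int, out_port: int) -> None:
        """Set up a circuit; each input and each output carries one circuit."""
        self._check_port(in_port)
        self._check_port(out_port)
        if in_port in self._in_to_out:
            raise ValueError(f"input {in_port} is already connected")
        if out_port in self._in_to_out.values():
            raise ValueError(f"output {out_port} is already in use")
        self._in_to_out[in_port] = out_port

    def disconnect(self, in_port: int) -> None:
        """Tear down the circuit on an input port, if one exists."""
        self._in_to_out.pop(in_port, None)

    def route(self, in_port: int):
        """Return the output port light from in_port reaches, or None."""
        return self._in_to_out.get(in_port)


if __name__ == "__main__":
    switch = OpticalCircuitSwitch()
    switch.connect(0, 319)   # e.g. a server on port 0, a GPU shelf on port 319
    print(switch.route(0))
    switch.disconnect(0)
    switch.connect(0, 12)    # re-point the same input at a different resource
    print(switch.route(0))
```

Because the model holds only a port-to-port mapping and never inspects the traffic, it also reflects why a circuit switch of this kind can be indifferent to the protocol carried over each connection.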

Tardent was quick to distance himself from conventional interconnect providers. He explained that Calient was not trying to compare itself to companies with packet-based switching because the company was offering a different approach to supercomputing. He said: ‘We can re-architect the distribution of compute resources to deliver optimum efficiency. We are fundamentally going to re-architect the relationships of how things are connected to give you the best performance.’

Tardent explained that there are three main areas that set Calient’s optical switch apart from the competition. ‘Because it is pure physical layer, pure light, we are completely agnostic to bit-rate and protocol. It doesn’t matter if it is 10Gb, 40Gb, 100Gb, 400Gb – as long as it is single mode light we can switch it.’

The other key factor for HPC is the extremely low latency of a purely optical system. ‘Once you set up a connection the latency through the switch is only 30 nanoseconds,’ Tardent confirmed.

Today the relationships between GPUs and the other logic elements of an HPC system are fixed. Once an architecture is constructed, you can swap out individual components but the overall structure remains the same. Calient’s vision of supercomputers in the future removes the fixed nature of the architecture; an optical switch could match any combination of GPUs with servers, storage, or other components as they are needed.

If the systems could be managed and reconfigured with minimal latency and effort, then this concept could pave the way for a new adaptable supercomputing architecture – underpinned by a very flexible interconnect technology.

Tardent commented: ‘One of the things that make the Calient style of optical circuit switch work in this scenario is the extremely low latency. Even if you were going to think about doing this with a packet-based technology, it just doesn’t have the latency required to make this work.’

Another aspect of this technology that could be useful to high-end HPC systems is the inherent redundancy offered by a reconfigurable architecture. Using the Calient switch, connections to a failed piece of hardware could be rerouted to another resource in the pool, as long as there were spare resources available. However, this technology is still in development and it is not clear how the switch will be reconfigured. At this time, reconfiguring the system may need to be done manually, based on pre-determined scripts that would be run by an operator. ‘The other option is that the resource management systems in the facility can drive the reconfiguration of the optical switch,’ commented Tardent.

‘Either way the reconfiguration time is going to be in the order of less than a second to reconfigure the whole thing.’
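
The rerouting idea can be sketched as a simple reassignment from a pool of spares. This is only an illustration of the concept. The function, job names and resource names are all hypothetical, and the article does not say how Calient or a resource manager would actually drive the switch.

```python
# Toy failover: replace a failed resource with a spare from a shared pool.
# All names are hypothetical; this is not a real resource-manager interface.


def reroute_failed(assignments: dict, failed: str, spares: list):
    """Return (new_assignments, remaining_spares) with `failed` swapped out.

    Raises RuntimeError when the pool has no spare resources left.
    """
    if failed not in assignments.values():
        return dict(assignments), list(spares)  # nothing to reroute
    if not spares:
        raise RuntimeError("no spare resources available")
    replacement = spares[0]
    new_assignments = {
        job: (replacement if resource == failed else resource)
        for job, resource in assignments.items()
    }
    return new_assignments, list(spares[1:])


if __name__ == "__main__":
    current = {"job-a": "gpu-0", "job-b": "gpu-1"}
    updated, remaining = reroute_failed(current, "gpu-1", ["gpu-7", "gpu-8"])
    print(updated, remaining)
```

The function returns new structures rather than modifying its inputs, so an operator script or a resource manager could inspect the proposed mapping before applying it to the switch.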

While it is still too early to tell which technology will be the most successful, it is clear that interconnects will play an increasingly important role in improving HPC application performance. Whether by developing entirely new technologies, or by increasing the intelligence of the switch, HPC users must address the performance bottlenecks that face an increasingly data-intensive industry.

‘If you want to overcome performance barriers, then you need to look into the entire application, the entire framework of what you want to execute. You need to use every active component in the data centre as part of the application,’ concluded Shainer.