Connections changing in HPC
‘I don’t know the network of the future, but it’s something Ethernet.’ With this play on a quote by Seymour Cray about programming languages, John Taylor, VP of technical marketing at Gnodal, puts his finger on one of the major trends in HPC interconnect – the re-emergence of the perennial favourite, Ethernet.
When computers started getting larger and more powerful, and able to handle enormously large problems, it soon became apparent that the Ethernet designed for office and enterprise applications was becoming a bottleneck in terms of latency. In response to this, a number of technologies were developed and they eventually converged on what we today know as InfiniBand.
The major benefit of InfiniBand, in a nutshell, is RDMA (remote direct memory access). Instead of passing data packets through the operating system, as is done with Ethernet, data is transferred directly from memory to memory. Such transfers need no CPU involvement, cache activity or context switches, and they proceed in parallel with other system operations. The result is latency far lower than is possible with standard Ethernet, a feature very important for today’s sophisticated HPC architectures.
RDMA on Ethernet
Meanwhile, engineers have also found a way to use RDMA techniques in conjunction with Ethernet. The two ‘flavours’ available today are RoCE (RDMA over Converged Ethernet) from Mellanox Technologies and iWARP (Internet Wide Area RDMA Protocol) from Intel. Intel acquired its InfiniBand technology early this year, when it made big news by purchasing it from QLogic, the only competitor to Mellanox in this market; Intel’s True Scale fabric is based on this technology. Interestingly, a few months later Intel also purchased interconnect technology from Cray, including the Gemini interconnect as well as the upcoming Aries, an interconnect chipset follow-on to Gemini with a new system topology called Dragonfly. Aries is designed to work in Cray’s next-generation supercomputer, the XC30, previously code-named ‘Cascade’. When it ships next year, the Aries XC interconnect will plug directly into the on-chip PCI-Express 3.0 controllers on the Xeon E5-2600 processors, which will let Aries speak directly to any other device with a PCI-Express 3.0 port, such as a CPU or a GPU.
Returning to the RDMA products shipping today, what’s the key difference between InfiniBand and iWARP? Basically, iWARP needs TCP (Transmission Control Protocol) to operate. Most Ethernet solutions run TCP on the server CPU, which means that iWARP consumes CPU cycles and does not completely bypass the CPU; in a sense, it’s a software emulation of RDMA.
Some interesting comments about the QLogic acquisition come from Intel’s Joe Yaworski, who handles product marketing for the company’s Fabric Products division and came to Intel as part of the acquisition of the QLogic InfiniBand technology, where he was director of marketing for HPC programs. He maintains that InfiniBand was originally designed for data centres and not HPC, whereas the InfiniBand-based True Scale was designed from the ground up for HPC. It uses a PSM (Performance Scaled Messaging) library to handle very small messages efficiently, which he feels is important in HPC, where 90 per cent of messages are 256 bytes or smaller and there can be tens of millions of them in a large cluster. He adds that True Scale has excellent performance on collectives, the parallel-computing operations in which data is simultaneously sent to or received from many nodes.
While you may think of Intel as a chip vendor, in the networking space it primarily sells adapters and switches, and at the time of writing it does not sell InfiniBand silicon. Note, though, that the PSM is executed by a Xeon processor. Intel has not moved interconnect technology onto processors at this time, but Yaworski notes that the ultimate goal is to drive the fabric closer and closer to the processor. This will leverage continually improving processor performance and lead to better performance through even lower latency, reduced component count and lower costs, as well as improved power consumption. As for future plans in this regard, he sees the fabric replacing the PCI bus as the means by which processors communicate, but ‘that’s all I can say today.’
InfiniBand still the speed king
In dealing with InfiniBand, we face a new set of acronyms. An InfiniBand link is a serial link operating at one of several possible data rates: single data rate (SDR), double data rate (DDR), quad data rate (QDR), fourteen data rate (FDR) and enhanced data rate (EDR). The per-lane signalling rate in each direction is 2.5 Gb/s for SDR, 5 Gb/s for DDR, 10 Gb/s for QDR, 14.0625 Gb/s for FDR and 25.78125 Gb/s for EDR. The InfiniBand Trade Association has also published a roadmap that includes faster future schemes, among them HDR (High Data Rate) and NDR (Next Data Rate); Mellanox has announced its intention to achieve 200 Gb/s by 2016.
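The per-lane figures are signalling rates; the usable data rate is lower because of line encoding. As a rough sketch in Python, assuming the common 4x link width and the encodings usually cited for each generation (8b/10b for SDR, DDR and QDR; 64b/66b for FDR and EDR; neither is stated in this article), the arithmetic looks like this. The function and table names are illustrative only.

```python
# Per-lane signalling rates in Gb/s, as quoted in the article.
LANE_RATE_GBPS = {
    "SDR": 2.5,
    "DDR": 5.0,
    "QDR": 10.0,
    "FDR": 14.0625,
    "EDR": 25.78125,
}

# Assumption (not from the article): line-encoding efficiency per generation.
ENCODING_EFFICIENCY = {
    "SDR": 8 / 10,
    "DDR": 8 / 10,
    "QDR": 8 / 10,
    "FDR": 64 / 66,
    "EDR": 64 / 66,
}


def link_rates(generation, lanes=4):
    """Return (signalling rate, data rate) in Gb/s for a link of `lanes` lanes."""
    signalling = LANE_RATE_GBPS[generation] * lanes
    data = signalling * ENCODING_EFFICIENCY[generation]
    return signalling, data


if __name__ == "__main__":
    for gen in LANE_RATE_GBPS:
        signalling, data = link_rates(gen)
        print(f"{gen}: 4x signalling {signalling:.2f} Gb/s, data {data:.2f} Gb/s")
```

Under these assumptions a 4x FDR link signals at 56.25 Gb/s, which is where the ‘56 Gb/s’ label comes from, and a 4x DDR link signals at 20 Gb/s.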
However, there’s not always agreement on how much speed is actually needed. Consider some comments from Intel’s Yaworski. He states that bandwidth is actually the least important performance factor and refers to a study done at QLogic, which is now being repeated. The study looked at a large number of MPI apps in areas such as oil and gas, molecular dynamics and other sciences, and profiled their MPI requirements. It found that all these apps ran fine on 20 Gb/s DDR and that QDR was overkill. More critical are the message rate, latency and collectives performance.
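One way to see why bandwidth can matter less than latency for small messages is a first-order model in which transfer time is a fixed latency plus message size divided by bandwidth. The Python sketch below is illustrative only: the model, the 1 µs latency, the nominal 20 and 40 Gb/s rates and the function name are assumptions, not figures from the QLogic study.

```python
def transfer_time_us(message_bytes, bandwidth_gbps, latency_us=1.0):
    """First-order estimate of one-way transfer time in microseconds.

    Assumed model: fixed latency plus serialisation time (size / bandwidth).
    """
    bits = message_bytes * 8
    bits_per_us = bandwidth_gbps * 1e3  # 1 Gb/s is 1,000 bits per microsecond
    return latency_us + bits / bits_per_us


if __name__ == "__main__":
    # Compare a small MPI-style message with a large bulk transfer.
    for size in (256, 1_000_000):
        ddr = transfer_time_us(size, 20)  # nominal DDR rate
        qdr = transfer_time_us(size, 40)  # nominal QDR rate
        print(f"{size} bytes: DDR {ddr:.3f} us, QDR {qdr:.3f} us, "
              f"speed-up {ddr / qdr:.2f}x")
```

In this model, doubling the bandwidth shortens a 256-byte transfer by only about 5 per cent, while a 1 MB transfer finishes almost twice as fast. That is consistent with the argument that applications dominated by small messages gain little from the faster link.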
To show what is possible with InfiniBand, consider one of Mellanox’s latest products, the Connect-IB adapter, which is now sampling. In what the firm refers to as ‘world-record performance results’, the Connect-IB dual-port 56 Gb/s FDR InfiniBand adapter achieves throughput of more than 100 Gb/s using PCI Express 3.0 x16 and more than 135 million messages per second – 4.5 times higher than previous or competing solutions. The Connect-IB line consists of single- and dual-port adapters for PCI Express 2.0 x16; each port supports FDR 56 Gb/s InfiniBand with MPI ping latency of <1μsec.
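The host bus matters for figures like these. As a back-of-the-envelope check in Python, assuming the commonly quoted per-lane rates and encodings for PCI Express (5 GT/s with 8b/10b for 2.0, 8 GT/s with 128b/130b for 3.0; these are not stated in this article) and ignoring protocol overhead, the raw bandwidth of an x16 slot can be compared with the combined line rate of two FDR ports. The names used are illustrative.

```python
# Assumption (not from the article): per-lane transfer rate in GT/s and
# line-encoding efficiency for each PCI Express generation.
PCIE = {
    "2.0": (5.0, 8 / 10),
    "3.0": (8.0, 128 / 130),
}


def pcie_bandwidth_gbps(generation, lanes=16):
    """Raw one-direction slot bandwidth in Gb/s, ignoring protocol overhead."""
    rate_gtps, efficiency = PCIE[generation]
    return rate_gtps * efficiency * lanes


if __name__ == "__main__":
    dual_port_fdr_gbps = 2 * 56  # two 56 Gb/s FDR ports, as quoted in the article
    for gen in PCIE:
        slot = pcie_bandwidth_gbps(gen)
        verdict = "enough" if slot >= dual_port_fdr_gbps else "not enough"
        print(f"PCIe {gen} x16: {slot:.1f} Gb/s, {verdict} for "
              f"{dual_port_fdr_gbps} Gb/s of port bandwidth")
```

Under these assumptions a PCI Express 2.0 x16 slot tops out at about 64 Gb/s, while a 3.0 x16 slot offers roughly 126 Gb/s. That is consistent with the headline throughput being quoted for PCI Express 3.0 x16.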
In addition, it will soon be possible to use InfiniBand over long distances and thus extend its use beyond a single data centre network, to deliver higher performance to local, campus and even metro applications. The MetroX TX6100, supporting distances of up to 10km at 10 Gb/s, will be available in 1Q 2013; the TX6200, supporting up to 10km at 40 Gb/s, will be available in 3Q 2013.
Ethernet activity heating up
While only a few companies dominate the InfiniBand market, there is much more activity in Ethernet for HPC. Indeed, Ethernet has a long history in HPC – it’s only in the last few years that InfiniBand has had more systems in the Top500 than Ethernet, reports Gnodal’s Taylor.
The benefits of Ethernet are obvious. First, it’s basically free at the server level and soon all servers will have 10G Ethernet, a trend being forced by the enterprise sector where Ethernet is still the ubiquitous networking scheme. In addition, you also have a far wider choice of peripherals in areas such as storage. Another benefit is that many people are familiar with the technology and there’s a large skill base, which is important given the high demand for HPC experts. Finally, there’s the fact that the market for Ethernet devices and peripherals is larger and so prices are generally lower. Speeds are also going up. 100G Ethernet is starting to arrive in metropolitan area networks and less so at the device level, and the IEEE is starting work on a standard for 400 Gb/s.
To implement RDMA, Gnodal developed its own ASIC, called the Peta, so that it can apply its own technology to avoid congestion at low latency. This device is used in the company’s GS family of switches. The trick, explains Taylor, is that it obeys the Ethernet standard at the device level, but once a packet enters a Gnodal network, the Peta converts it to another protocol. The ASIC supports 72 devices at 10G or 18 devices at 40G, and it’s possible to put multiple ASICs in one product. The company specs single-switch latency of 150ns, which Taylor claims is three times better than that of any other Ethernet vendor. They also get 66ns between two switches, and it’s possible to build switches with as many as 64k ports.
Because of RDMA, Ethernet performance tends to be comparable to InfiniBand in terms of latency. Intel, for example, recently published a white paper comparing Ethernet-based iWARP RDMA with InfiniBand when running PAM-CRASH software from the ESI Group. It states that: ‘In lab testing, simulations of a car-to-car frontal crash gave nearly identical results between Ethernet networking with iWARP technology versus InfiniBand.’ The iWARP system used the NE020 10 Gb/s Ethernet server cluster adapter with a Gnodal GS7200 switch, while the InfiniBand system used a Mellanox QDR ConnectX-2 adapter (note: not the firm’s latest ConnectX-3 56 Gb/s adapter) with a Mellanox IS5030 switch.
Which to use?
There’s no doubt that Ethernet is gaining ground on InfiniBand in terms of latency. But the selection of which scheme to use is typically based on far more than pure latency, comments Gilad Shainer, VP of market development at Mellanox. System architects must look at the total system view. What do you want? Are you building upon an existing Ethernet storage infrastructure? How important is scalability? Does price play a role? How easy will it be to manage the infrastructure?
For large-scale infrastructures, says Shainer, InfiniBand clearly has a competitive edge over Ethernet. According to a report from Mellanox, InfiniBand is the most used interconnect on the Top500 list, connecting 224 systems – 20 per cent more than the 189 Ethernet-connected systems. By segment, it connects 52 per cent of the Top100, 53 per cent of the Top200, 50 per cent of the Top300, 46 per cent of the Top400 and 45 per cent of the Top500. Further, FDR InfiniBand is the fastest growing interconnect technology on the Top500, with a 2.3 times increase in the number of systems versus the June 2012 list. Finally, InfiniBand connects 43 per cent of the world’s most powerful petaflop systems on the list.
Shainer also points to Microsoft’s Azure cloud platform, where the software giant states that with the help of InfiniBand it has 33 per cent lower costs per application and 90 per cent efficiency in a virtualised environment, approaching that of a physical environment. On the other hand, anyone already committed to Ethernet, whether to maintain compatibility with existing infrastructure or because of a big investment in Ethernet technology, expertise and management tools, will be attracted to RDMA over that fabric. You can get close to InfiniBand-like latency with a much gentler learning curve and take advantage of the ubiquitous Ethernet ecosystem.
All this is leading to increased use of Ethernet-based interconnects in HPC systems. In his analysis of the Top500, Gnodal’s Taylor acknowledges that Gigabit Ethernet is now being squeezed by InfiniBand as the number of servers required to enter the Top500 increases. InfiniBand is still the ‘flavour of the month’, he notes, while custom interconnects are growing and occupying more places in the Top10. He predicts that 10 Gb/s Ethernet will follow the Gigabit Ethernet trend line and gain in popularity as it becomes ‘free’ on servers, and that it will be able to scale thanks to emerging standards and Ethernet fabrics that provide high utilisation.
Best of both worlds
If you’re undecided as to which to go with, note that with iWARP, InfiniBand and RoCE, LAN and RDMA traffic can pass over a single wire.
A further interesting fact is that today you can purchase hybrid adapter cards that can support multiple schemes. So it’s possible, for example, to establish an InfiniBand connection while at the same time using Ethernet peripherals and with no changes required to Ethernet-based network equipment.
In the case of Mellanox, its Virtual Protocol Interconnect (VPI) enabled adapters make it possible for any standard networking, clustering, storage and management protocol to seamlessly operate over any converged network leveraging a consolidated software stack. With auto-sense capability, each port can identify and operate on InfiniBand, Ethernet or Data Center Bridging (DCB) fabrics.