Computer - reconfigure yourself!
Imagine HPC hardware that could optimise its own configuration – even down to the architecture of the CPU running the algorithms – based on the application. Such a virtual processor could accelerate often-used algorithms by several hundred per cent compared to a standard CPU. It sounds wild, but it is possible with a type of IC known as an FPGA (field-programmable gate array). The underlying technology has been around for decades, but it has been almost the exclusive domain of engineers writing code in specialised hardware-description languages such as VHDL or Verilog. However, the pieces are starting to come together to help average HPC users tap into this enormous power.
First, though, what is an FPGA? As shipped, this class of IC is simply a collection of unconnected logic elements, interconnect and memory. When the device is on the workbench or even inside a piece of equipment, you can download a bitstream that configures these elements for one or more specific tasks, and they can be reprogrammed any number of times. In fact, a single large FPGA can implement thousands of function blocks sometimes called cores (and these cores are not to be confused with CPU cores). The dominant suppliers of FPGAs are Xilinx and Altera. As for device size, Altera’s Stratix IV E devices hold 2.5 billion transistors that implement as many as 680K logic elements, 22.4 Mbits of RAM plus 1,360 18x18-bit multipliers.
Once programmed, these become data-flow devices that do not respond to a sequence of software instructions; rather, the algorithm is in effect hard-wired into the device. You push in the raw data on one side, and the result of the algorithm comes out the other. And, because the internal logic is dedicated to the task, they consume comparatively little power (a GPU might need 200W; an FPGA can draw far less than 25W). Further, they are fast, accelerating some operations by a factor of a hundred or more. Don’t be misled by that number, though. Only key portions of a given algorithm are accelerated, so the speed increase for the total application is more modest. Another concern is the bottleneck due to memory transfers: data from the application must be transferred to and from the FPGA, and that takes time.
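The gap between kernel speed-up and whole-application speed-up can be estimated with Amdahl's law. The sketch below is illustrative only: the function name, the 90 per cent accelerated fraction and the 5 per cent transfer overhead are assumptions chosen for the example, not figures from any system described in this article.

```python
def overall_speedup(accelerated_fraction, kernel_speedup, transfer_fraction=0.0):
    """Estimate whole-application speed-up (Amdahl's law plus a transfer term).

    All times are normalised so the original run time is 1.0.
    accelerated_fraction: share of the original run time moved to the FPGA.
    kernel_speedup: how much faster that share runs on the FPGA.
    transfer_fraction: extra time spent moving data to and from the FPGA,
        expressed as a share of the original run time (an assumption here).
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    if transfer_fraction < 0:
        raise ValueError("transfer_fraction must not be negative")

    remaining_cpu_time = 1.0 - accelerated_fraction
    fpga_time = accelerated_fraction / kernel_speedup
    new_time = remaining_cpu_time + fpga_time + transfer_fraction
    return 1.0 / new_time


if __name__ == "__main__":
    # Hypothetical case: 90% of the run time is accelerated 100x.
    print(round(overall_speedup(0.9, 100), 2))
    # Same case with transfers costing 5% of the original run time.
    print(round(overall_speedup(0.9, 100, transfer_fraction=0.05), 2))
```

Under these assumed numbers a 100x kernel yields roughly a 9x application, and closer to 6x once the transfer cost is included, which is why the unaccelerated code and the data movement deserve as much attention as the kernel itself.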
Maxwell in Scotland
To study new levels of computational performance for real-world industrial applications on FPGA-based HPC systems, the EPCC (the supercomputing centre at the University of Edinburgh) founded FHPCA – the FPGA High Performance Computing Alliance. It is funded by Scottish Enterprise and builds on the skills of several industrial partners.
The first fruit of this alliance is Maxwell, an HPC system that consists of 32 blades housed in an IBM Blade Center. Each blade contains one Xeon processor and two FPGAs, which in turn are connected by a subsystem that enables all 64 FPGAs to be connected together in an 8x8 toroidal mesh. Maxwell is intended to demonstrate the feasibility of running computationally demanding applications on an array of FPGAs, and it is also being used by researchers around the world as a test bed for tools and techniques to port applications to such systems.
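To make the 8x8 toroidal mesh concrete, here is a minimal sketch of how neighbours and hop counts work in such a topology. The row-major numbering of the 64 FPGAs and both function names are assumptions made for illustration; the article does not describe how Maxwell actually numbers or wires its devices.

```python
def torus_neighbours(index, rows=8, cols=8):
    """Return the four neighbours (up, down, left, right) of a node.

    Nodes are numbered row by row from 0 (an assumed convention).
    Edges wrap around, which is what makes the mesh toroidal.
    """
    if not 0 <= index < rows * cols:
        raise ValueError("index outside the mesh")
    row, col = divmod(index, cols)
    return [
        ((row - 1) % rows) * cols + col,  # up, wrapping to the last row
        ((row + 1) % rows) * cols + col,  # down, wrapping to the first row
        row * cols + (col - 1) % cols,    # left, wrapping to the last column
        row * cols + (col + 1) % cols,    # right, wrapping to the first column
    ]


def torus_hops(a, b, rows=8, cols=8):
    """Return the minimum number of hops between two nodes on the torus."""
    row_a, col_a = divmod(a, cols)
    row_b, col_b = divmod(b, cols)
    row_gap = abs(row_a - row_b)
    col_gap = abs(col_a - col_b)
    # In each dimension, go whichever way round the ring is shorter.
    return min(row_gap, rows - row_gap) + min(col_gap, cols - col_gap)


if __name__ == "__main__":
    print(torus_neighbours(0))
    print(torus_hops(0, 63))
    print(max(torus_hops(0, n) for n in range(64)))
```

Because the edges wrap, every node has exactly four neighbours and no node is more than eight hops from any other in an 8x8 arrangement; the opposite corner of the grid is only two hops away.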
FHPCA has come up with some interesting findings. ‘While we learned early that it is moderately straightforward to create an FPGA-based machine,’ comments Dr Mark Parsons, commercial director of the EPCC, ‘we also learned about the complexities of programming FPGAs. We don’t today have the tools that really meet the needs of general-purpose programmers. We’ve seen that you almost always need VHDL expertise to get code running properly. Even so, we’re still very positive about this technology and firmly believe that FPGAs have a place in HPC.’
Maxwell has achieved impressive results in several demonstration projects. Financial-option pricing based on the Black–Scholes equations has been accelerated by more than 300x per node, taking less than a minute to do what previously took four hours; 3D video frames now need 10 seconds rather than a full minute to analyse; oil and gas simulations run more than 5x faster per node than on a cluster of 3GHz Xeon processors.
Comparison of conventional processors, conventional accelerators and FPGA accelerators. Image courtesy of Nallatech.
Similar benchmarks come from XtremeData, which sells FPGA hardware for integration into servers. That firm indicates that when a system exploits three levels of parallelism (task-level, instruction-level and data-level), an FPGA operating at 200MHz can outperform a 3GHz processor by an order of magnitude or more while requiring only a quarter of the power. In bioinformatics, for instance, FPGAs have demonstrated 100x acceleration over conventional processors while running genome sequence-analysis algorithms such as BLAST. Medical-imaging algorithms for 2D and 3D CAT image reconstructions have routinely shown at least 10x acceleration. Common signal-processing algorithms such as the fast Fourier transform show performance factors of 10x over the fastest CPUs.
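The clock-speed comparison hides how much parallelism such a result implies. The back-of-envelope sketch below uses only the 200MHz, 3GHz, 10x and quarter-power figures quoted above; the function names are hypothetical, and it assumes that a unit of "work" is comparable on both devices.

```python
def work_per_cycle_ratio(speedup, cpu_clock_hz, fpga_clock_hz):
    """How much more work per clock cycle the FPGA must complete.

    If the FPGA finishes 'speedup' times sooner while ticking more slowly,
    each of its cycles has to do proportionally more useful work.
    """
    if speedup <= 0 or cpu_clock_hz <= 0 or fpga_clock_hz <= 0:
        raise ValueError("all arguments must be positive")
    return speedup * (cpu_clock_hz / fpga_clock_hz)


def energy_ratio(power_ratio, speedup):
    """FPGA energy per job relative to the CPU (energy = power x time)."""
    if power_ratio <= 0 or speedup <= 0:
        raise ValueError("all arguments must be positive")
    return power_ratio / speedup


if __name__ == "__main__":
    # 10x faster overall at 200MHz versus 3GHz.
    print(work_per_cycle_ratio(10, 3e9, 200e6))
    # A quarter of the power for a tenth of the time.
    print(energy_ratio(0.25, 10))
```

On those figures the FPGA must complete about 150 times as much work per clock cycle as the processor, which is only plausible when all three levels of parallelism are exploited, and each job would use roughly one-fortieth of the energy.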
Trending to the mainstream
It’s only recently that such power has become available to HPC users. In the past 15 years, says Allan Cantle, president of Nallatech, the industry has moved away from customised massively parallel platforms, such as those offered by Cray and Silicon Graphics, to clusters of industry-standard servers. Meanwhile, several key trends have emerged that are making FPGAs much more viable in commercial HPC. First, FPGA vendors have encouraged such activities: Altera has been actively working with partners such as XtremeData, Mitrionics and SRC Computers, as described later in this article; Xilinx had a research project for three years to help vendors of hybrid systems incorporate its chips, but this effort was recently put on the back burner because of the economic downturn. Even so, Xilinx chips are being used in a number of accelerator products.
One important step in moving FPGAs to the mainstream was when AMD and Intel opened up their CPU sockets to coprocessors from other companies. AMD’s program for its Opteron processors through open HyperTransport links or through PCI Express is dubbed Torrenza; roughly two years ago, Intel – which, until then, had closely guarded who had access to its Front Side Bus (FSB) – opened up its FSB CPU socket to third-party devices. Intel has meanwhile set up the QuickAssist Technology Community, a virtual organisation of ISVs, IHVs, embedded OEMs and developers who are committed to simplifying accelerator use on Intel architecture platforms.
This, in turn, allowed other suppliers to develop bus-based boards with FPGAs and even modules that drop directly into CPU sockets on motherboards or servers and thereby allow direct chip-to-chip communications. Finally, a number of firms are working on software that will allow programmers to write FPGA algorithms in C and other familiar languages and eliminate the need for them to have expertise in hardware-description languages.
Boards and drop-in modules
FPGA boards and modules are becoming available from multiple sources. One hardware partner for the FHPCA mentioned earlier is Nallatech, which provides PCI-X and PCI Express cards as well as modules that fit into the Xeon FSB. The module specs a system-memory access time of 110ns and host communication bandwidth of 8GB/s peak (5GB/s sustained), and it also has up to 256GB of directly coupled server memory. This module is supplied as part of an integrated platform based on the Intel Xeon MP 7300 4-socket server.
The latency of getting data in and out of a bus card can limit the effectiveness of accelerators, argues Nallatech’s Allan Cantle. The ideal situation is to have no latency and infinite bandwidth, and this target can be more easily approached by plugging the FPGA into a Xeon processor socket, where it can see system memory directly (‘zero copy processing’), rather than using a bus card that operates in I/O memory space. System-memory access using a socket module is roughly 100ns on the FSB, whereas the best case using a PCI card with DMA is 4 to 5μs. It’s also advantageous, he adds, if the application avoids ‘memory thrashing’ by reducing the number of transactions to system memory. Note, however, that the view that plug-in modules are the best approach is not universally shared; the company Alpha Data specialises in bus-based FPGA boards, which are attractive for industrial applications.
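A rough calculation shows why both the access latency and the number of transactions matter. The sketch below takes the roughly 100ns and 4μs figures quoted above (4μs being the lower end of the best case) and makes a deliberately pessimistic assumption: that transactions are issued one after another with no overlap, and that bandwidth is ignored. The function names, the one million data items and the batch size of 64 are hypothetical.

```python
import math

SOCKET_LATENCY_S = 100e-9   # approximate FSB socket-module access time
PCI_DMA_LATENCY_S = 4e-6    # lower end of the quoted PCI-with-DMA best case


def total_latency(transactions, latency_s):
    """Time spent purely on access latency, assuming strictly serial accesses."""
    if transactions < 0 or latency_s < 0:
        raise ValueError("arguments must not be negative")
    return transactions * latency_s


def batched_transactions(items, items_per_transaction):
    """Number of transactions needed when several items share one access."""
    if items < 0 or items_per_transaction <= 0:
        raise ValueError("items must be >= 0 and items_per_transaction > 0")
    return math.ceil(items / items_per_transaction)


if __name__ == "__main__":
    items = 1_000_000  # hypothetical workload
    for name, latency in (("socket", SOCKET_LATENCY_S), ("PCI DMA", PCI_DMA_LATENCY_S)):
        one_by_one = total_latency(items, latency)
        batched = total_latency(batched_transactions(items, 64), latency)
        print(f"{name}: {one_by_one:.3f}s unbatched, {batched:.4f}s in batches of 64")
```

Under these assumptions a million individual accesses cost about a tenth of a second through the socket and about four seconds over the bus, while grouping the same data into fewer, larger transactions shrinks both figures, which is the point behind avoiding memory thrashing.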
Nallatech provides several software tools for helping create a virtual processor in an FPGA. The first is DIME-C, which is based on a subset of ANSI-C, and the company’s tools provide all the APIs necessary to access the FPGA through simple function calls from the host processor. Another approach is to use tools such as the Impulse C compiler from Impulse Accelerated Technologies. Once the C code is functional, you can compile it directly to a bitstream and run it on a chosen FPGA.
In addition, it is possible to use high-level algorithmic tools such as Simulink from The MathWorks together with the Xilinx System Generator. You first build and debug DSP systems in Simulink using the Xilinx Blockset for tasks such as signal processing, error correction, maths, memories and digital logic; the Blockset also can import Matlab functions, such as to create control circuits. From the Blockset you can generate VHDL or Verilog code within Simulink and then download the bitstream to the FPGA.
Another well-known supplier of FPGA modules is XtremeData, whose line of in-socket accelerators is based on Altera FPGAs and comes in several varieties depending on the socket they plug into. The XD2000F replaces an Opteron chip in an AMD Socket F, and it contains two Stratix II FPGAs. The XD2000i plugs into the Intel FSB socket of any Xeon DP system and features three Stratix III FPGAs. The XD1000 fits into an Opteron 940 socket and comes with one Stratix II FPGA. For support beyond regular compilers, these modules work with Altera’s DSP Builder, which has a block that reads Simulink model files and generates VHDL files.
Tier 1 vendors get the FPGA bug
Just recently, XtremeData joined with HP to deliver what it believes are the first standard servers with in-socket FPGA acceleration to be qualified by a Tier 1 server vendor. The module is available for the rack-mounted HP ProLiant DL165 and DL185 servers.
Rather than supply modules, Convey Computer has developed the HC-1, a 2U rack dual-socket computer with one Intel Xeon dual-core processor and several custom processors based on a Xilinx Virtex 5 FPGA. In use, instructions executed by the coprocessor appear as extensions to the x86 instruction set; applications can contain both x86 and coprocessor instructions in a single instruction stream. The system memory is based on eight controllers supporting 16 DDR2 memory channels, so the system provides more than 80GB/s of bandwidth. Convey software environments are based on Linux, and application development takes place using industry-standard Fortran, C and C++ tools.
Another supplier in this niche is DRC Computer Corp, which has been shipping its Xilinx-based Reconfigurable Processing Unit since 2006. Its RPU family modules plug into an Opteron socket and interface with the HyperTransport bus, the DDR memory and other motherboard resources. It supplies RPUs to a number of customers including Celoxica (hardware-accelerated market data technology), systems houses such as Synective Labs and X-ISS, XLBiosim (for biosimulation applications) and Cray.
For its part, Cray (which was formed in 2000 when Tera Computer Company acquired the assets of Cray Research and changed its own name to Cray Inc) now supplies an optional RPU blade in its Cray XT5h system. The module plugs into an open socket in a multi-way Opteron system, and a 6.4GB/s direct connection between the processor and the Cray SeaStar2+ interconnect network further reduces latency.
A name even more closely associated with Cray is SRC Computers, which was founded by Seymour Cray to come up with a new type of supercomputer using off-the-shelf microprocessors. His previous company, Cray Computer Corp, had run into financial difficulties, while Cray Research, the firm he had founded earlier, was merged into Silicon Graphics (just recently purchased by Rackable Systems), which spun it off into a separate business unit and then sold it to Tera Computer. Sadly, Cray died following an automobile crash just weeks after founding SRC.
While most FPGA modules plug into a CPU socket, SRC’s Series I MAP, which is based on Altera’s Stratix II FPGA, plugs into the DIMM memory slots of a computer motherboard; in this way, it can be used with either AMD or Intel-based commodity motherboards. Users program the Series I using ANSI standard C or Fortran with the help of the Carte programming environment.
When it comes to software support for programming FPGAs, several names appear often. One of these is Mitrionics, which introduces the concept of a hardware-independent Mitrion Virtual Processor (MVP) that completely separates software from the hardware. The Mitrion software development kit comes with a compiler for Mitrion-C, a high-level language designed to let programmers take advantage of ultra-fine-grained parallelism and synchronisation together with tightly coupled CPU/FPGA coprocessing. When porting code to other FPGA-based systems, the programmer need only change the details pertaining to the specifics of the machine organisation. Mitrionics assembles all the necessary hardware and software and sells these as a development platform.
Another frequently mentioned company is Impulse Accelerated Technologies, which also favours the hardware-independent approach. Its Impulse C and CoDeveloper tools include a software-to-hardware compiler, parallel optimiser and platform support packages for various FPGAs and FPGA-based platforms.
One thing you won’t hear much about right now is major software applications being sold in FPGA-aware versions. Dr Parsons from the FPGA High Performance Computing Alliance says: ‘We’ve spoken with ISVs about making their codes FPGA-compatible, but we found little activity. ISVs are customer-driven, and there’s little customer demand at this time. Another thing that makes it difficult is that there is no standardisation in terms of memory interfaces, so you can’t write code that works across different types of FPGA accelerators, even across a single vendor’s product range. So, for the time being, FPGA acceleration remains the domain of custom codes.’ But for those codes, the benefits can be enormous.