Collaborating on AI and ML processors

Achronix and BittWare collaborate on a new FPGA accelerator card designed for cloud, AI and machine learning applications, writes Robert Roe

Increased demand for artificial intelligence (AI) and machine learning (ML) applications is driving demand for accelerator devices that can support these new workflows. This, in turn, is spurring innovation in the accelerator market as companies design products to meet these new demands.

In October BittWare announced a strategic collaboration with Achronix to introduce the S7t-VG6 PCIe accelerator product – a PCIe card sporting the new Achronix 7nm Speedster7t FPGA. This new generation of accelerator products offers a range of capabilities, including low-cost and highly flexible GDDR6 memory that aims to offer HBM-class memory bandwidth, high-performance machine learning processors and a new 2D network-on-chip for high-bandwidth and energy-efficient data movement.

‘BittWare has a 30-year track record of successfully designing and deploying advanced processing technologies for demanding applications,’ said Jeff Milrod, president of BittWare. ‘Achronix is bringing fresh approaches, architectures and implementations to the FPGA market that we are excited to leverage with the introduction of our S7t card. We will now be able to offer leading memory bandwidths, as well as high-speed storage, network and host interfaces, while achieving new levels of price/performance and energy efficiency. I am confident that the innovations in the Speedster7t, combined with BittWare’s extensive experience and depth of accelerator card IP, will provide compelling platforms for data centre, cloud infrastructure and enterprise solutions.’

‘We are delighted to collaborate with BittWare and the wider Molex group to launch the new VectorPath S7t accelerator card,’ said Robert Blake, CEO, Achronix. ‘Market response to the new Achronix Speedster7t FPGA family has been overwhelmingly positive. In order to provide our customers with the ability to rapidly evaluate and go into volume production with the Speedster7t devices at card and server-level, we needed a partner with the deep design expertise and logistical scale needed to supply our growing global customer base. BittWare, as the leader in FPGA-based PCIe cards and servers, was a clear choice.’

Innovation for high-performance applications

Designed for ML and AI applications, this card could offer an alternative to Nvidia for these workloads by aligning FPGA technology with memory and networking interfaces that are more readily used by the wider HPC community. The card features a QSFP-DD (double-density) cage, and the board supports up to 1x 400GbE or 4x 100GbE using the 56G PAM4-enabled Speedster7t device. An additional QSFP port supports 2x 100GbE, and a 4x OCuLink connector supports NVMe-attached storage. Sixteen channels of GDDR6 graphics DRAM handle high-bandwidth memory requirements, providing up to 512GB/s.

The increased focus on memory-intensive applications and networking options in this card makes it clear that it is designed to tackle demanding, high-end computing problems such as cloud, ML and AI workloads. Combining these features with reprogrammable logic and modern memory and networking interfaces provides an easier-to-use FPGA design that the company hopes can attract new users to the technology. Steve Mensor, VP of sales and marketing at Achronix, commented: ‘We worked together collaboratively to design this product; it is targeted at accelerated edge and cloud computing applications. It is intended to be a high-volume product for enterprise applications.

‘All of the compiler tools from Achronix, as well as all of the board-level tools from BittWare, make it a complete product, so companies that are new to this will be able to jump in right away and start designing accelerator-type applications very quickly,’ added Mensor. ‘It was a partnership that was formed over the last year or so as a collaboration of these two companies to build this product.’

Craig Petrie, VP of marketing at BittWare, part of the Molex Group, highlighted the partnership between the two companies and the benefits of being part of the Molex group of companies.

‘We are an amalgamation of two companies who focused on FPGA technology, Nallatech and BittWare. Those two companies have merged under the Molex Group, and we are the Molex arm for HPC and data centre processing with FPGA technology,’ said Petrie.

‘We have got a 30-year track record going back to the beginning of when FPGAs were invented. Coupling this new product to the Molex organisation, we can take this product to a global customer base and we have got the global infrastructure and resources of Molex to let us qualify, validate and support the complete lifecycle for customers that are deploying high volumes of FPGA products.’

Petrie’s comments highlight the company’s plans for this card to compete in high-volume, high-end markets such as HPC. The backing of parent company Molex could help customers to justify adopting this new technology, as it will be supported by a well-established company with a global customer base.

Today’s high-bandwidth applications can easily overwhelm the routing capacity of a conventional FPGA’s bit-oriented programmable-interconnect fabric. The Speedster7t architecture instead uses a high-bandwidth, two-dimensional network on chip (NoC) that spans horizontally and vertically over the FPGA fabric, connecting to all of the FPGA’s high-speed data and memory interfaces. This high-speed network running over the FPGA programmable-logic fabric could help the card to compete with other accelerator technologies such as GPUs.

The Speedster7t NoC supports high-bandwidth communication between interfaces and custom acceleration functions in the programmable-logic fabric. Each row or column in the NoC is implemented as two 256-bit, unidirectional, industry-standard AXI channels operating at 2 GHz, which equates to 512 Gbps of data traffic in each direction.
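As a rough check on what a channel of this width can carry, the short Python sketch below multiplies the 256-bit channel width by a clock rate. It assumes a 2 GHz channel clock and one full-width transfer per cycle; both are simplifying assumptions for illustration, the names are hypothetical, and the result is a theoretical peak rather than a measured figure.

```python
# Back-of-envelope peak bandwidth for one NoC row or column (illustrative only).
CHANNEL_WIDTH_BITS = 256         # from the article: 256-bit AXI channels
CHANNELS_PER_ROW_OR_COLUMN = 2   # from the article: two unidirectional channels
CLOCK_HZ = 2_000_000_000         # ASSUMPTION: 2 GHz clock, one transfer per cycle


def channel_bandwidth_gbps(width_bits: int, clock_hz: int) -> float:
    """Theoretical peak bandwidth of a single channel, in gigabits per second."""
    return width_bits * clock_hz / 1e9


per_channel_gbps = channel_bandwidth_gbps(CHANNEL_WIDTH_BITS, CLOCK_HZ)
per_row_or_column_gbps = per_channel_gbps * CHANNELS_PER_ROW_OR_COLUMN

print(f"Per channel (one direction): {per_channel_gbps:.0f} Gbps")
print(f"Per row or column (both directions): {per_row_or_column_gbps:.0f} Gbps")
```

Under these assumptions each direction peaks at 512 Gbps, the same figure the article quotes for a single GDDR6 memory controller.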

Machine learning processors

The new card also features a large array of programmable math compute elements, organised into new machine learning processor (MLP) blocks. Each MLP is a highly configurable, compute-intensive block with up to 32 multiplier/accumulators (MACs) that support integer formats from 4 to 24 bits and various floating-point modes, including native support for TensorFlow’s bfloat16 format as well as the highly efficient block floating-point format, which dramatically increases performance for ML applications.
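For readers unfamiliar with the two number formats mentioned above, the following Python sketch illustrates the general ideas. It is not Achronix code and says nothing about how the MLP hardware is implemented: the function names, the four-value block and the 8-bit mantissa width are all hypothetical choices made for the example. bfloat16 keeps the upper 16 bits of a 32-bit float, while block floating point stores small integer mantissas that share one exponent across a block of values, so the multiply-accumulate work can be done with integer arithmetic.

```python
import math
import struct


def to_bfloat16(x: float) -> float:
    """Round x to bfloat16 precision (top 16 bits of its float32 encoding).

    Uses round-to-nearest-even; NaN handling is omitted for brevity.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # rounding bias for the dropped half
    bits &= 0xFFFF0000                    # keep sign, exponent, 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def block_quantize(values, mantissa_bits=8):
    """Convert a block of floats to signed integer mantissas and one shared exponent."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0] * len(values), 0
    # Choose the exponent so the largest magnitude fits in the mantissa range.
    shared_exp = math.frexp(max_abs)[1] - (mantissa_bits - 1)
    limit = (1 << (mantissa_bits - 1)) - 1
    scale = 2.0 ** shared_exp
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in values]
    return mantissas, shared_exp


def block_dequantize(mantissas, shared_exp):
    """Recover approximate float values from mantissas and the shared exponent."""
    return [m * 2.0 ** shared_exp for m in mantissas]


block = [1.0, 0.5, -0.25, 0.03]
mantissas, shared_exp = block_quantize(block)
print(to_bfloat16(3.14159))                     # 3.140625
print(mantissas, shared_exp)                    # [64, 32, -16, 2] -6
print(block_dequantize(mantissas, shared_exp))  # [1.0, 0.5, -0.25, 0.03125]
```

The smallest value in the block loses the most precision (0.03 becomes 0.03125), which is the usual trade-off of sharing one exponent: cheap integer arithmetic in exchange for reduced accuracy on values much smaller than the block maximum.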

‘These features and the tight integration of MLP blocks with embedded memory blocks eliminate the traditional delays associated with FPGA routing, ensuring that machine learning algorithms can be run at the maximum performance of 750 MHz,’ added Petrie. ‘This combination of high-density compute and high-performance data delivery results in a processor fabric that delivers the highest usable FPGA-based tera-operations (TOps) per second.’

Critical for high-performance compute and machine learning systems is high off-chip memory bandwidth to source and buffer many high-bandwidth data streams. To achieve the needed level of bandwidth, Speedster7t devices include hard GDDR6 memory controllers to support high-bandwidth memory interfaces. With each GDDR6 memory controller capable of supporting 512 Gbps of bandwidth, and up to eight GDDR6 controllers in each device, the card can support an aggregate GDDR6 bandwidth of 4 Tbps.
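The memory figures quoted in this article can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the text (512 Gbps per controller, eight controllers, sixteen GDDR6 channels); the variable names are hypothetical, decimal units are used throughout, and the even split of bandwidth across channels is an assumption.

```python
# Cross-check of the GDDR6 bandwidth figures quoted in the article.
CONTROLLER_GBPS = 512    # per-controller bandwidth, gigabits per second
NUM_CONTROLLERS = 8      # maximum GDDR6 controllers per device
NUM_CHANNELS = 16        # GDDR6 channels on the card

aggregate_gbps = CONTROLLER_GBPS * NUM_CONTROLLERS        # gigabits per second
aggregate_tbps = aggregate_gbps / 1000                    # terabits per second
aggregate_gbytes_per_s = aggregate_gbps / 8               # gigabytes per second
# ASSUMPTION: bandwidth is shared evenly across all channels.
per_channel_gbytes_per_s = aggregate_gbytes_per_s / NUM_CHANNELS

print(f"Aggregate: {aggregate_gbps} Gbps (about {aggregate_tbps:.1f} Tbps)")
print(f"Aggregate: {aggregate_gbytes_per_s:.0f} GB/s")
print(f"Per channel: {per_channel_gbytes_per_s:.0f} GB/s")
```

Eight controllers at 512 Gbps give 4,096 Gbps, which rounds to the 4 Tbps headline figure and converts to the 512GB/s quoted earlier for the card's sixteen GDDR6 channels.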

Mensor also addressed the concerns of users that, in the past, FPGAs were only suitable for a limited number of applications and could not provide the general usability of technology such as a GPU: ‘I do not think it was that FPGAs were only suitable for certain applications, it was more that FPGA economics mean that they can be relatively pricey and they have a power profile which meant that you could often do something in a standard product or an ASIC, and then the power could be dramatically lowered,’ he said.

‘In these types of applications today, where you get into high compute requirements and not just simple programmability, now you are getting into the type of solutions that are consuming hundreds of watts – and if there is any level of inefficiency, like with a CPU, now you are going to be on the other end of the scale where the acceleration factor makes the FPGA much more power-efficient,’ added Mensor.

Petrie also added to the reasons behind the broader direction for FPGA technology. ‘There have been a number of inhibitors in the adoption of FPGA technology and these are being addressed; Steve has already touched on some of them.

‘These things were relatively expensive and that relegated them to extremely high-performance applications. If you wanted to play with an FPGA you basically couldn’t. Making this technology much cheaper and more ubiquitous has been really important,’ said Petrie.

‘The other thing that has been an inhibitor to adoption is programmability. There have been innovations such as the NoC and they [Achronix] have hardened a lot of the IP that users would normally have to have written themselves. The IP is now ready to go and that provides a major step up for new users in particular,’ Petrie concluded.
