A strategy for HPC that goes beyond HPC
The boundaries are blurring. Once, high-performance computing was split into two camps: high-performance technical computing and high-performance business computing. Now the advent of big data is making the distinction untenable, a development accelerated by the convergence of HPC and the cloud, as analysed by Robert Roe on page 18.
While HPC is being pulled in this direction by external market forces, it became clear at the US Supercomputing Conference, SC14, held this year in New Orleans in late November, that the technologies underpinning technical high-performance computing are now changing in response. Paradoxically, the announcement of the largest US Government investment in technical supercomputing for many years will transform business computing.
It was the announcement of an HPC strategy that goes beyond HPC.
The US Government initiative, reported on page 20, will foster the development of compute technologies for HPC, but these are not intended solely to achieve ever-faster machines as milestones on the road to Exascale technical computing. These technologies are also explicitly intended to transform a much wider, and financially more important, sector of the economy: enterprise computing. ‘There are game-changing elements to what we are doing,’ Ken King, general manager of OpenPower Alliances at IBM, told Scientific Computing World.
European companies appear to have reached parallel conclusions too. Bull, for example, emphasised how it intends to widen the scope of its operations by taking technologies developed for the HPC market and reaching out to enhance its position in IT for the enterprise.
One of the public, and perhaps rather glib, justifications for investing in exascale has been that it would make petaflop computing cheaper, more accessible and more widespread – bringing powerful computational techniques within the reach of even quite modest engineering companies. Now it appears that the ramifications of Exascale reach more widely still – beyond technical high-performance computing into business and commercial applications.
The joint Collaboration of Oak Ridge, Argonne, and Lawrence Livermore (Coral) was established in late 2013 to streamline procurement and reduce the costs of investing in the development of supercomputing – procurement contracts being a long-standing method by which the US Government provides ‘hidden’ subsidies for its high-tech industry. The US Department of Energy (DoE) chose a consortium including IBM, Nvidia, and Mellanox, to build two supercomputers for the Oak Ridge and the Lawrence Livermore National Laboratories. Argonne’s machine will be announced later.
But the event’s full significance lies elsewhere than in the niche application of supercomputing. Instead, the partners in the winning consortium see it as a way to open up the world of enterprise computing to their new technologies that, in their view, offer a way to master the swelling volume of data that commercial companies have to cope with – not only in technical applications such as engineering simulation and design, but also in commercial applications such as business intelligence.
Commenting on the DoE announcement, both Sumit Gupta, general manager of accelerated computing at Nvidia, and David Turek, vice president of technical computing OpenPower at IBM, stressed the importance of the design chosen for Oak Ridge and Livermore not just for scaling up to ever faster and more powerful machines, but also for ‘scaling down’, so to speak.
Turek maintained that he had always been slightly sceptical of the line of argument that Exascale would inevitably deliver cheap petascale computing: ‘It’s easy to say but hard to do,’ he commented. In particular, IBM had found that its Blue Gene programme had offered very limited economies for smaller systems.
The fundamental lesson was, he said: ‘You have to pay attention to it from the beginning. We’re making it explicit and real.’ The Coral project was designed to be a one-node construct and economies of scale ‘in both directions’ were built in from the outset. ‘We didn’t want to have to say to customers: “You have to buy a rack of this stuff”.’
Sumit Gupta from Nvidia also focused on the wider implications of the technology for applications outside the specialist area of high-performance computing. ‘Accelerators in high-performance computing are clearly well established today – GPUs are mainstream,’ he said. But he sees the partnership with IBM as a way for Nvidia GPUs to make the transition to enterprise markets. IBM, he continued, ‘knows about data centres and is the preferred provider for many in enterprise computing. We have opened our GPU out, using NVLink, to other processors,’ he pointed out, ‘and the partnership with IBM takes GPUs into the mainstream DB2 market.’
David Turek made the same point – that this was not a technology being developed for a niche application in supercomputing but had wider ramifications across the whole of business and enterprise computing: ‘Coral is within the mainstream of our strategy. We have an eye to Coral as a way to serve our business needs.’
Ken King, general manager of OpenPower Alliances at IBM, elaborated on the theme, stressing that data-centric computing rather than number-crunching was at the heart of the new vision. With the explosion of data in the market, he said: ‘How are companies going to be able to process that data? You need innovation up and down the stack and you’re not going to get that with a closed structure.’
The answer, he continued, was to build systems that minimised the movement of data, for example by building compute into the storage. He also cited the need to get GPUs and CPUs working together, and to manage the workflow so as to achieve increased performance with minimal data movement. Nvidia’s NVLink interconnect technology will enable CPUs and GPUs to exchange data five to 12 times faster than they can today.
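To give a feel for what a five- to 12-fold faster link could mean in practice, the following is a rough sketch only. It assumes that ‘faster’ translates directly into link bandwidth and ignores latency and protocol overhead. The baseline bandwidth and the dataset size are illustrative assumptions rather than figures from IBM or Nvidia, and the function names are hypothetical.

```python
# Rough estimate of CPU-GPU transfer time under the speed-ups quoted above.
# BASELINE_GB_PER_S and DATASET_GB are illustrative assumptions,
# not figures taken from IBM, Nvidia, or the article.

BASELINE_GB_PER_S = 16.0   # assumed bandwidth of today's link (hypothetical)
DATASET_GB = 1024.0        # assumed working set of about 1 TB (hypothetical)
SPEEDUPS = (1, 5, 12)      # 1 = today's link; 5 and 12 bracket the quoted range


def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to move size_gb across a link, ignoring latency and overhead."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s


def estimate(size_gb: float = DATASET_GB,
             baseline: float = BASELINE_GB_PER_S) -> dict:
    """Map each speed-up factor to the estimated transfer time in seconds."""
    return {s: transfer_seconds(size_gb, baseline * s) for s in SPEEDUPS}


if __name__ == "__main__":
    for factor, seconds in estimate().items():
        print(f"{factor:>2}x link: {seconds:6.1f} s to move {DATASET_GB:.0f} GB")
```

Whatever baseline is assumed, the estimate scales the same way: the time spent moving a given volume of data falls in proportion to the speed-up, which is why reducing or accelerating data movement sits at the centre of the data-centric argument.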
The combination of innovative solutions and minimal movement of data was, he claimed, a compelling strategy, and that was the way in which IBM, in partnership with Mellanox and Nvidia, had approached the Coral bid.
But he stressed that the solutions were not just for the likes of the Livermore National Laboratory: ‘Small companies are going to have data analysis problems. It’s a market-changing statement we’re making with this.’
Like IBM, the European supercomputer manufacturer Bull believes that high-performance computing is changing, and it appears to have come independently to very similar conclusions about its future direction. Just as IBM had emphasised data-centric computing, so Bull’s strategy is to combine exascale and big data, offering the capabilities of both numerical computing and the analysis of large amounts of data.
Claude Derue, Bull’s IT services marketing director, stressed that, in future, there would be a need to tailor computer systems to the specific needs of the customers, much more than had been done hitherto. Atos, he continued, had IT expertise in many vertical markets while Bull had the expertise in technology. ‘We are a step forward compared to other vendors. The future of HPC will be to fit with the vertical market needs and Bull plus Atos are in a unique position to provide this.’
Bull’s strategy, again thinking on parallel lines to IBM, is to widen the scope of its operations by taking technologies developed for the HPC market and reaching out to enhance the combined company’s position in IT for the enterprise. ‘There is a double opportunity,’ Derue concluded – both for HPC and the enterprise IT sector.
Bull’s announcement was an affirmation of confidence in the company’s high-performance computing business, following the take-over of Bull by Atos earlier this year. Derue was positive about the development: ‘We are at the beginning of a new story.’ Bull used to be predominantly a European company, he said, but following the merger: ‘We can rely on the Atos organisation to deliver around the world. Atos is a clear asset for Bull HPC.’
The announcement by Bull has five major components: an open exascale supercomputer, code-named Sequana; a matching software stack, known as the bullx supercomputer suite; a new fast interconnect, code-named BXI; a range of servers with ultra-high memory capacity, known as the bullx S6000 series; and a set of services to assist customers to develop their applications and make the most of exascale.
The new-generation BXI interconnect is intended to free the CPU from the overhead of handling communication – communication management is coded into the hardware, according to Derue. The ultra-high memory capacity servers, the bullx S6000, are intended to address applications – for example genomics – that require in-memory data processing. The first model to become available is fully scalable up to 16 CPUs and 24 TB of memory.
Sequana is deliberately designed to be compatible with successive generations of different technologies (CPUs and accelerators) and can scale to tens of thousands of nodes. It will take advantage of Bull’s liquid cooling systems in order to ensure energy efficiency, and the first version will be available in 2016. Derue said: ‘We are paving the way to Exascale. With our solution, 100 petaflops systems are possible.’
But in all this, Bull too has its eye on scaling in both directions. It is interested in providing powerful computing cheaply to smaller enterprises.
Because Sequana is modular in concept, designed as a group of building blocks, customers will find it easy to deploy and to configure for their own needs, Derue said. But it has also been conceived as a platform that can integrate different types of technologies, so, Derue continued, it should enable smaller customers to take advantage of modern technologies.
The flexibility to tailor systems to the customer’s preferences is one of the selling points of Eurotech’s ‘Hive’ (High Velocity) system. It is so-called not only because of the ‘high velocity’ computing it offers but also because, as it is encapsulated in Eurotech’s distinctive ‘brick’ format, a computer consisting of many of these elements somewhat resembles a beehive.
The Hive is an addition to Eurotech’s Aurora line of supercomputers, offering the possibility not just of Intel and Nvidia but also of Arm processor technology in a very energy-efficient water-cooled system.
The concept had been introduced at ISC’14 in Leipzig in the summer, but now, according to Eurotech’s Giovanbattista Mattiussi, it has been translated into a proper product.
The idea, he said, is to extend the company’s product line so the Hive will be available in several versions: CPU only; CPU plus accelerator (which could be either a GPU or the Intel Xeon Phi ‘co-processor’); and an extreme accelerated version which would include Arm-based processors, in particular the Applied Micro X-Gene 64-bit realisation of the Arm architecture.
Hive offers a stripped-down architecture to get more performance but with lower energy consumption, he said. The system has a new cold plate derived from industrial refrigeration that is cheaper and lighter than previous versions.
According to Mattiussi, the company is working with partners to define the configurations that will be appropriate for different applications.
The market segments the company has its eye on include high-energy physics (QCD), bioinformatics, molecular dynamics, CAE, machine learning, finance, GPU-based rendering, and seismic migration.
Managing innovation – collaboration centre stage
How will innovation for Exascale be managed in future? Perhaps the most significant part of the announcement in November that a consortium of IBM, Nvidia, and Mellanox had won the orders for two new US supercomputers was that it was a consortium, rather than a single company, that had won the bid. This raises an interesting question: if a company the size of IBM cannot develop exascale technology by itself, can other computer companies offer credible exascale development paths unaided?
IBM has decided that the key point in its strategy is to open up its Power architecture as a way of fast-tracking technological innovation – collaboratively, rather than by one company going it alone. IBM briefings during the week of SC14 understandably had an air not just of the cat having got the cream but rather the keys to the whole dairy. Together with Nvidia’s Volta GPU and Mellanox’s interconnect technologies, IBM’s Power architecture won the contracts to supply the next-generation supercomputers for the US Oak Ridge National Laboratory and the US Lawrence Livermore National Laboratory.
On the Friday before the US Supercomputing Conference, SC14, opened in New Orleans in late November, the US Government had announced it was to spend $325m on two new supercomputers, and a further $100m on technology development, to put the USA back on the road to Exascale computing (see page 20).
Although $325m is now coming the consortium’s way, Ken King, general manager of OpenPower Alliances at IBM, stressed that: ‘From our perspective, more important than the money is the validation of our strategy – that’s what’s getting us excited.’ As Sumit Gupta, general manager of accelerated computing at Nvidia, put it in an interview: ‘IBM is back. They have a solid HPC roadmap.’
The decision marks a turn-around in IBM’s standing; its reputation was tarnished in 2011 when, after four years of trying, it pulled out of a contract to build the Blue Waters system at the US National Center for Supercomputing Applications (NCSA) at the University of Illinois. Originally awarded in 2007, the contract was reassigned to Cray, which fulfilled the order.
At SC14, the consensus was that the announcement was an endorsement of IBM’s decision to open up its Power architecture to members of the OpenPower Foundation and thus build a broad ‘ecosystem’ to support the technology. Gupta pointed out that IBM could have tried to go it alone, but decided to partner with Nvidia and Mellanox via the OpenPower Foundation, and work with them on the bid. ‘Opening the Power architecture – this is the new roadmap and validates what we have done together. When given a fair choice, this is the preferred architecture.’
The fact that both Oak Ridge and Livermore chose the same architecture was seen as a powerful endorsement of this technology development path, particularly as the two laboratories were free to choose different systems because they are funded from different portions of the US Department of Energy (DoE) budget – Oak Ridge from the Office of Science and Livermore from the National Nuclear Security Administration.
David Turek, vice president of Technical Computing OpenPower at IBM, pointed out that Livermore has no accelerator-based applications but is now choosing heterogeneity and, he claimed, it was the application engineers at Oak Ridge who were pressing most strongly for the system.
The third member of the Collaboration of Oak Ridge, Argonne, and Lawrence Livermore (Coral) project, Argonne National Laboratory, is also funded by the Office of Science within DoE and is therefore constrained to choose a different system from Oak Ridge’s. The Argonne announcement has been deferred into the New Year.
The delay has prompted speculation that Argonne too would have preferred the Power-based solution. After all, Argonne’s current machine is an IBM Blue Gene/Q – called ‘Mira’ – that already uses 16-core PowerPC A2 processors. But the laboratory was constrained by the purchasing rules to opt for another choice.
Cray is not participating in the Coral bidding process, so it is not clear to which alternative provider Argonne can turn. However, Paul Messina, director of science for the Argonne Leadership Computing Facility, said: ‘There were more than enough proposals to choose from.’ The Argonne machine will use a different architecture from the combined CPU–GPU approach and will almost certainly be like Argonne’s current IBM machine, which uses many small but identical processors networked together – an approach that has proved popular for biological simulations. While the Coral systems would perform at about 100 to 200 petaflops, Messina thought that their successors would be unlikely to be limited to 500 petaflops but that a true Exascale machine would be delivered by 2022, although full production-level computing might start later than that.
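As a back-of-envelope check on those figures, the short sketch below works out how far systems of 100, 200, and 500 petaflops sit from Exascale, taking one exaflop as 1,000 petaflops. The function name is illustrative, and the calculation deals only with the headline performance numbers quoted above.

```python
# Gap between the quoted performance figures and Exascale.
# One exaflop is 10**18 floating-point operations per second,
# i.e. 1,000 petaflops.

PETAFLOPS_PER_EXAFLOP = 1000.0


def factor_to_exascale(petaflops: float) -> float:
    """Return how many times faster an exaflop machine is than the given one."""
    if petaflops <= 0:
        raise ValueError("performance must be positive")
    return PETAFLOPS_PER_EXAFLOP / petaflops


if __name__ == "__main__":
    # 100 and 200 bracket the Coral range; 500 is the intermediate figure quoted.
    for pf in (100, 200, 500):
        print(f"{pf} petaflops -> Exascale is {factor_to_exascale(pf):.0f}x away")
```

On these numbers, the Coral machines would sit between a fifth and a tenth of the way to an exaflop.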
Gupta’s view that opening up the Power architecture was the new roadmap was echoed by IBM’s David Turek.
He said: ‘We could not have bid for Coral without OpenPower. It would have cost hundreds of millions of dollars and taken us years. Why waste time and money if we could leverage OpenPower to within five per cent of its performance peak? We have lopped years off our plan.’ And in that accelerated development pathway, OpenPower ‘is critical to us’.
He cited the tie-up with Mellanox: although IBM has smart people in networking, he said, by itself it did not command enough expertise. Mellanox had unveiled its EDR 100Gb/s InfiniBand interconnect in June this year, at ISC’14 in Leipzig, and this will have a central role in the new Coral systems. However, Brian Sparks from Mellanox pointed out that the company intends to have a stronger interconnect available for Coral than EDR: ‘200G by 2017 is on our roadmap.’
IBM announced the ‘OpenPower Consortium’ in August 2013 and said it would: open up the technology surrounding its Power Architecture offerings, such as processor specifications, firmware, and software; offer these on a liberal licence; and use a collaborative development model. However, Turek said, IBM had not outsourced innovation to OpenPower: ‘The bulk of innovation is organic to IBM.’