Supercomputing - the reality behind the vision
A casual observer from outside the high-performance computing (HPC) community, watching our events, news sites and discussions, might easily conclude that HPC is about getting as much money as possible from your funding agency or board, and then buying the most Flops capacity (crudely, ‘calculations per second’) possible. It’s even better if this is done by choosing a computer system that is in some way unique – we like ‘serial number 1’. We then proudly issue press releases declaring it the biggest supercomputer in [xyz] – where the category [xyz] is carefully chosen and defined such that your supercomputer is at the top of the pile in [xyz].
This game of getting the most Flops possible has been boosted in recent years by the emergence of new processors, with a greater proportion of silicon devoted to calculating units: graphics processing units (GPUs), especially from Nvidia; and Intel’s Xeon Phi (which my brain still defaults to calling MIC because I’ve known it as that for so long). GPUs and Phi (is the plural Phis?) promise maybe an order of magnitude more Flops for a given dollar or power budget than traditional processors.
So, a big budget, a data centre full of racks with as many cores as possible, and plenty of GPU/Phi cards wedged in to get that Flops capacity [score?] as high as possible. Now what?
Well, this pile of silicon, copper, optical fibre, pipework, and other hardware makes an imposing monument that politicians can cut ribbons in front of and eager managers can give tours around. But something else is needed to make that pile of processed sand, metal and supporting gubbins into the powerful multi-science instrument that the funding agency sought, or the engineering design capability that convinced the company management. That something else is a complex ecosystem of system architecture, software, and people.
Well-designed and well-implemented system architecture is required to make sure the Flops engines (whether GPU, Phi or CPU) can do useful work. I’m not going to delve into that here, except to say it is the art of balancing the desires of capacity, performance and resilience against the frustrations of power, cooling, dollars and space. Characteristics such as having most of the Flops promise residing in GPUs or Phi co-processors, larger-than-average scale, or ‘serial number 1’ status all make this more interesting.
But even perfectly architected hardware is powerless without software. Software is the magic that enables the supercomputer to do scientific and engineering simulations. Of course, it is not really magic, even if it sometimes seems that way. Software is a complex collection of applications (maths, science and engineering knowledge crafted into bits), middleware (to make the entire ecosystem chug along smoothly) and tools (to fix it when it doesn’t). In fact, whisper it loudly, software is infrastructure – yes, infrastructure. It needs investment to create and maintain, it takes time to build and usually provides capability for a multitude of use cases and hardware platforms. Software can [should] be a highly engineered asset that, in many cases, is worth far more than the lump of tin that usually attracts the ‘infrastructure’ label.
Application software encapsulates some existing understanding of the relevant maths, science and engineering of a problem or system. This virtual knowledge engine is combined with an understanding of the hardware and cooperating software resources (e.g. communication libraries) into a set of methods and processes that enable a user to study and predict the behaviour of the [science/engineering] problem or system, or to test that encapsulated understanding.
Hopefully, the keen-eyed reader will have noticed the critical word in that preceding paragraph. It was ‘user’. Delivering science insight or engineering results from this powerful tool of hardware and software requires users. In fact, it requires an ecosystem of people: the scientists/engineers who understand how to apply the tool effectively; computational scientists and HPC software engineers to develop and optimise the application software; HPC experts to design, deploy and operate the hardware and software systems; and professionals to develop an HPC strategy, match requirements with solutions, procure capabilities, and ensure a productive service.
Just as we need a roadmap for hardware technology and a recognition that software needs long-term investment, we also need a long-term plan for the people. We need to invest in this part of the ecosystem too. The component units (that’s us lot) have a long preparation time (education) together with a plethora of exits-from-useful-service (from the predictable, such as retirement, to the unpredictable and fast-acting, such as a better job offer). And, because both the demand for HPC and the complexity of HPC are growing, we need more people with a greater variety of skill sets. If we want people of the best capability, and in sufficient numbers, then we will have to invest in developing them and in funding them appropriately once they are in place.
Getting an HPC capability to deliver the best science or engineering is harder than just maximising Flops/dollar or Flops/Watt – or, to put it another way, science/dollar is not the same as Flops/dollar. But when the ecosystem of hardware, software and people is properly resourced and balanced, our casual outside observer might not see HPC at all – just an incredibly powerful scientific instrument, or a capability-defining engineering design and validation tool.
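The gap between Flops/dollar and science/dollar can be sketched with toy arithmetic. All the numbers below are invented purely for illustration (the peak-Flops-per-dollar figures, budget, and achieved-efficiency fractions are assumptions, not measurements): a system with ten times the peak Flops per dollar delivers far less than ten times the useful work if the application sustains only a small fraction of that peak.

```python
# Hypothetical comparison: peak Flops/dollar versus sustained (useful) Flops.
# Every number here is an invented assumption for illustration only.

budget_dollars = 1_000_000

cpu_peak_flops_per_dollar = 1e7   # assumed peak Flops per dollar, CPU system
gpu_peak_flops_per_dollar = 1e8   # assumed 10x the peak per dollar, GPU-heavy system

cpu_efficiency = 0.40  # assumed fraction of peak the application sustains on CPUs
gpu_efficiency = 0.05  # assumed fraction of peak the application sustains on GPUs

# Sustained application Flops = budget * peak-per-dollar * achieved efficiency.
cpu_sustained = budget_dollars * cpu_peak_flops_per_dollar * cpu_efficiency
gpu_sustained = budget_dollars * gpu_peak_flops_per_dollar * gpu_efficiency

print(f"CPU system sustained: {cpu_sustained:.2e} Flops")
print(f"GPU system sustained: {gpu_sustained:.2e} Flops")
print(f"Actual advantage: {gpu_sustained / cpu_sustained:.2f}x (peak promised 10x)")
```

With these made-up figures the GPU-heavy system still wins, but by 1.25x rather than the 10x the peak numbers promised – and closing that gap is exactly the job of the software and people parts of the ecosystem.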
Andrew Jones is VP of The Numerical Algorithms Group’s HPC expertise, services and consulting business. He is active on Twitter as @hpcnotes.