The Exascale Computing Project's Doug Kothe discusses the lasting impact of exascale development
Can you tell our readers about yourself and your role at the Exascale Computing Project (ECP)?
So in terms of my current role, yes, I'm the director of the Exascale Computing Project. It is a seven-year, almost $2 billion project that started in 2016. I joined the project at the beginning, actually a year before, in 2015, after I had led a five-year effort to build what we call a virtual nuclear reactor, essentially a new high-fidelity simulator of operating reactors. That experience was something that positioned me well to run ECP.
At ECP I initially stood up all the applications in the project, and there are 24 of them, plus some other efforts. For the first two years, Paul Messina at Argonne National Lab was the inaugural director of ECP. I actually took over in the fall of 2017, so I've been on point for five years. I'm also dual-hatted here at Oak Ridge National Lab. I'm now the associate laboratory director for our Computing and Computational Sciences Directorate (CCSD). My passion is in building applications to address problems of worldwide interest.
How will exascale development impact the wider HPC community?
The Advanced Scientific Computing Research (ASCR) Office in the Office of Science and the Advanced Simulation and Computing Office, part of the National Nuclear Security Administration (NNSA), both realised that they needed to make a concerted investment in software. As application and software developers, we couldn't agree more. Often these sorts of investments are driven by the passion and commitment of the scientists. In other words, in the past, we had not developed this area as much as we should have.
At ECP our sponsors realised that they needed to make a heavy investment in software, many years before the arrival of a new system. This is a huge investment in software that's not just boutique software for one system. Essentially, these software tools and technologies will be the scientific and engineering tools for our nation and, in many cases, the world for decades to come.
A good application can live for decades through many systems. It's essentially like constructing a large scientific instrument. So in this case, I like to think of our applications as the beginning of a new app store for the nation. And our software stack is a new dynamic OS. This stuff is going to be around long past when I retire. It's just like when the DOE builds a large neutron source or a large light source to be a nationwide scientific instrument.
You mention that investment has changed? What is different about the ECP?
At the Department of Energy, there have always been investments in software development, but generally, at least in my career, never in such a concerted and integrated way. By concerted, I mean ample investment for innovation, agility and trial and error.
We're all about agile software development, which means, frankly, sometimes you fail early and often. The other thing is to bring all the activities together under one roof. There were some growing pains and culture clashes initially, but that has given us a huge return on investment.
The fascinating thing with ECP is that we put together this huge project, and we've got around 85 different teams working with each other. We developed an inherent codependency that I've never seen before. We are working with our sponsors right now to ensure that this codependent ecosystem is sustained well beyond ECP, and I'm very confident that's going to happen.
Can you give an example of this codependency?
At ECP we were afforded the opportunity to build integrated teams. And in some cases, we forced it, because certain domain scientists work in their own bubble. That doesn't mean they are not very successful, but the whole thing about ECP is bringing people together. This has led to substantial dividends. As an example, we were building some abstraction layers, one out of Sandia called Kokkos, and one out of Lawrence Livermore called RAJA, that help to demystify and to some extent hide the complexity of heterogeneous hardware.
Many of our application teams didn't know about that development, because they were at other labs or other institutions, or in some cases, didn't care, because they didn't think it was going to help them.
You highlighted heterogeneous hardware. How has this trend affected the development of exascale?
Accelerators are here to stay. And I'm going to call it an accelerator, not a GPU, simply because what we're seeing now is hardware designed to accelerate certain operations. GPUs are probably misnamed because they were designed for graphics and to accelerate integer operations. But they do darn well with integer, logical and floating point operations, which is where scientific computing comes in.
ECP recognised that accelerated node computing is here to stay. Whether it's your laptop, your desktop, a cluster down the hall, the cloud or Frontier, a node is going to be an eclectic mix of hardware. If you don't have software that recognises how to lay out data, and how to utilise that hardware to exploit all these floating point operations, you're going to be in trouble.
I don't want to imply that going from one piece of hardware to another is easy. The whole porting exercise is a contact sport. But the point is, we've designed our software to compartmentalise the pieces that we know can be accelerated. They have been separated out with certain data structures that are more amenable to these accelerators. It doesn't mean we won't have to make some mods or changes.
But in many cases, we have re-architected the software to be much more agile and flexible. To give you an example, the Kokkos abstraction library from Sandia National Lab basically handles all your data structures for you. You might say, "I would like a three-dimensional 64-bit floating point array." I hand that to Kokkos and it looks at the hardware and says, "Okay, I know how to do this; I know how to lay it out."
To some extent, application developers aren't used to this. The mindset can often be "I can do this, I know how to do this, I don't need you". This has become more of a codependent situation where you need to take advantage of the effort put into certain libraries. There's not enough time and resources for you to do it well, or at all. We're making a strong push to show folks what we've done. Not necessarily that we are finished. But here's a really good start. You can see what we've done, you can add to the stack, and there are capabilities to add components to develop it further. It's a real thrill, a really exciting development.
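The layout idea described above for Kokkos can be sketched in a few lines of plain Python. This is only an illustration and not the Kokkos API: the `View3D` class, its `device` argument and the `fill` helper are hypothetical names invented for this sketch. It assumes the common convention that a CPU-style backend prefers the last index to vary fastest in memory, while a GPU-style backend prefers the first index to vary fastest. The application code indexes the array the same way in both cases; only the hidden memory layout changes.

```python
from array import array


class View3D:
    """Hypothetical stand-in for a portable 3D array of 64-bit floats.

    The caller asks for a shape; the class picks the memory layout
    based on the target device (an assumption made for this sketch).
    """

    def __init__(self, n0, n1, n2, device):
        self.shape = (n0, n1, n2)
        if device == "gpu":
            # "Left" layout: the first index is contiguous in memory.
            self.layout = "left"
            self.strides = (1, n0, n0 * n1)
        else:
            # "Right" layout: the last index is contiguous in memory.
            self.layout = "right"
            self.strides = (n1 * n2, n2, 1)
        # Flat storage of 64-bit floats, initialised to zero.
        self.data = array("d", [0.0]) * (n0 * n1 * n2)

    def _offset(self, i, j, k):
        # Map a logical (i, j, k) index to a position in flat storage.
        s0, s1, s2 = self.strides
        return i * s0 + j * s1 + k * s2

    def __getitem__(self, idx):
        return self.data[self._offset(*idx)]

    def __setitem__(self, idx, value):
        self.data[self._offset(*idx)] = value


def fill(view):
    """Application code: identical regardless of the chosen layout."""
    n0, n1, n2 = view.shape
    for i in range(n0):
        for j in range(n1):
            for k in range(n2):
                view[i, j, k] = 100 * i + 10 * j + k


cpu_view = View3D(2, 3, 4, device="cpu")
gpu_view = View3D(2, 3, 4, device="gpu")
fill(cpu_view)
fill(gpu_view)
```

Both views return the same value for the same logical index, even though the underlying flat storage is ordered differently. That separation between what the application asks for and how the data is laid out is the point being made about handing data structures to a library.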
What do you hope is the lasting impact of the ECP?
Our Department of Energy sponsors empowered leadership, myself included, to make tough decisions about funding and integration. That doesn't mean "here's a bunch of taxpayer funds, go forth and do great things". We are reviewed, scrutinised and advised all the time, and we should be because we're empowered with US taxpayer funds, and we have to do the best we can. The point is, being part of a large project, we were given the flexibility to make tough decisions.
That experience has been indispensable, and I feel that, as a result of that experience, we've not only developed a current next-generation group of fantastic software and application developers, but I think we've also developed a group of new scientific leaders. I'm excited as I watch these young kids, as I call them, go out and lead other big endeavours. I'll probably do it from the golf course, frankly, but it's been fantastic.
Doug Kothe is the Director for the Exascale Computing Project and Associate Laboratory Director, Computing and Computational Sciences Directorate at Oak Ridge National Laboratory.