
Asynchronous tasking: Solving complex engineering challenges at exascale

Martin Berzins

Martin Berzins is a Professor of Computer Science at the University of Utah’s School of Computing and the SCI Institute.


Martin Berzins, a researcher in computational science, discusses the development of large-scale, task-based software for solving complex engineering problems on extreme-scale computers. 

Drawing on decades of experience spanning applied mathematics, computer science, and engineering, Berzins explains how the UINTAH framework and the Kokkos library enable high-performance, portable simulations across diverse architectures, including the latest exascale supercomputers.

Berzins is a multidisciplinary researcher in computational science whose work spans applied mathematics, computer science, and engineering. His research focuses on the development of partial differential equation software for solving challenging engineering problems across a wide range of applications on extreme-scale computers.

He is a Professor of Computer Science in the School of Computing and the Scientific Computing and Imaging Institute at the University of Utah, and also a Visiting Professor at the University of Leeds. In 2003, he joined the University of Utah, where he served as Associate Director (2003–2005) and then Director (2005–2010) of the School of Computing. From 2005 to 2014, he was Co-Editor-in-Chief of Applied Numerical Mathematics. In 2012, he became Recipient Program Manager of the US Army Research Laboratory Collaborative Research Alliance in Multiscale Multidisciplinary Modeling of Electronic Materials (MSME), bringing together nine universities to advance electronic materials by design. Since 2013, he has been the Computer Science lead for the DOE NNSA PSAAP2 Carbon Capture Multidisciplinary Simulation Center at the University of Utah.

Can you tell me about your role at the University of Utah and how you became involved in this HPC project?

Berzins: I sit somewhere between applied mathematics and computer science, and I have always worked on mathematical problems in computer science. I started when I was around 15. In 2003, I moved to the University of Utah after spending several years at the University of Leeds. It was simply time to do something different.

Utah was attractive due to the Scientific Computing and Imaging Institute (SCI), founded by Chris Johnson, who invited me to come. Back in 1998, I had spent a sabbatical there, working with the group as they began tackling very large-scale engineering problems. One of the key ideas that emerged was pioneered by Steven Parker, a graduate student of Chris’s at the time. Steve developed a framework called UINTAH, a task-based code for solving large engineering simulation problems.

The name comes from the Uinta mountain range, which sits about 40 miles behind me as I speak. It is the only east-west mountain range in the continental United States. The word itself is a Native American term meaning “running water by trees,” which captures the feel of that landscape.

UINTAH evolved from work on viewing programs as collections of tasks, combined with a runtime that schedules them. Steve Parker went on to Nvidia, where many people from Utah now work. After he left, I took on the responsibility of further developing UINTAH. My focus was to make the execution of tasks asynchronous. Originally, tasks were executed in a fixed order. By moving to asynchronous scheduling, we ensured there was always work ready to run, avoiding idle waiting for messages or data.

This idea is at the core of your question about portability and performance. Structuring a program as a set of tasks and carefully specifying their dependencies means you never stall in parallel execution. You always have something else to execute. That is the central principle behind UINTAH. It is not unique to UINTAH. Similar ideas have been incorporated into other codes, such as Charm++ from the University of Illinois, led by Sanjay Kale, which has been in development since the late 1980s. Task-based programming has roots dating back to the 1960s. But it has never been mainstream. It is more difficult to work with than traditional models.
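To make the idea concrete, here is a deliberately simplified sketch in C++. It is illustrative only, not UINTAH or Charm++ code, and every task name is invented: each task carries its work and a dependency count, and the scheduler simply executes whatever is ready, so finishing one task can unlock others.

```cpp
// Toy illustration of a task graph and a ready queue. A production runtime
// (UINTAH, Charm++, ...) adds distributed memory, GPUs, priorities, and load
// balancing on top of this skeleton, but the ready-queue idea is the same.
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

struct Task {
    std::function<void()> work;
    std::vector<int> successors;   // tasks unlocked when this one finishes
    int unmet_deps = 0;            // inputs not yet available
};

// Run whatever is ready; completing a task may make its successors ready.
void run(std::vector<Task>& tasks) {
    std::queue<int> ready;
    for (int i = 0; i < static_cast<int>(tasks.size()); ++i)
        if (tasks[i].unmet_deps == 0) ready.push(i);

    while (!ready.empty()) {
        int t = ready.front(); ready.pop();
        tasks[t].work();                        // could be dispatched to any core or GPU
        for (int s : tasks[t].successors)
            if (--tasks[s].unmet_deps == 0) ready.push(s);
    }
}

int main() {
    // A (halo exchange) feeds B (stencil) and C (source term); both feed D (update).
    std::vector<Task> g(4);
    g[0].work = [] { std::puts("A: exchange ghost cells"); };
    g[1].work = [] { std::puts("B: apply stencil"); };
    g[2].work = [] { std::puts("C: evaluate source term"); };
    g[3].work = [] { std::puts("D: update solution"); };
    g[0].successors = {1, 2};  g[1].successors = {3};  g[2].successors = {3};
    g[1].unmet_deps = 1;  g[2].unmet_deps = 1;  g[3].unmet_deps = 2;
    run(g);
}
```

The point is that the application only describes tasks and their dependencies; when one task is blocked in a real run, an asynchronous scheduler picks another that is ready instead of idling.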

So why is it not widely adopted if it works so well? 

Berzins: Because for many simpler engineering applications, conventional codes written with MPI or OpenMP scale well enough. If communication patterns are straightforward, or if you have enough manpower to brute-force the problem, you do not need a more complex system. It is only when applications become very complicated that manual control of communication breaks down. That is when task-based runtimes really shine.

The problems we work on are indeed very complex. For example, we often model coupled fluid and solid dynamics, as well as extreme events such as fires and explosions. Our colleagues in chemical engineering here in Utah have given us many challenging problems. One example is a real accident in Utah where a semi truck carrying 30,000 pounds of explosives overturned on a dangerous highway bend. The truck caught fire and, within minutes, exploded, leaving a crater 60 feet across and 20 feet deep. Fortunately, the area was evacuated, but the blast was massive. We modelled that event, simulating the truck, the explosives, the packaging, the fire, and the detonation process. This problem required a task-based approach to handle all the complex physics at scale.

The more complicated the physics, the more attractive task-based programming becomes. But it is more complex to implement. You have to manage how tasks access data carefully. Two tasks might safely read the same data, but if one writes while another reads, you have a hazard. The UINTAH runtime manages this with a data warehouse, which tracks all variable accesses and ensures correctness.
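As a rough sketch of how declared accesses make those hazards manageable, here is a small C++ example. The names are invented for illustration and are not UINTAH’s actual API: each task states which variables it reads and which it writes, and the scheduler turns a potential read/write race into an explicit ordering.

```cpp
// Illustrative only: tasks declare the variables they read ("requires") and
// write ("computes"); a scheduler can then order a reader after the writer
// instead of letting them race. Not the real UINTAH data warehouse API.
#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct TaskDecl {
    std::string name;
    std::vector<std::string> requires_;  // variables read (underscore: "requires" is a C++20 keyword)
    std::vector<std::string> computes;   // variables written
};

int main() {
    std::vector<TaskDecl> tasks = {
        {"computePressure", {"density", "temperature"}, {"pressure"}},
        {"advectMomentum",  {"pressure", "velocity"},   {"velocity"}},
    };

    // Record which task produces each variable, then derive the orderings:
    // a task that reads a variable must run after the task that writes it.
    std::map<std::string, std::string> producer;
    for (const auto& t : tasks)
        for (const auto& v : t.computes) producer[v] = t.name;

    for (const auto& t : tasks)
        for (const auto& v : t.requires_)
            if (producer.count(v) && producer[v] != t.name)
                std::printf("%s must wait for %s (variable \"%s\")\n",
                            t.name.c_str(), producer[v].c_str(), v.c_str());
}
```

In UINTAH itself, the data warehouse performs this bookkeeping at scale, tracking all variable accesses so that safe concurrent reads can proceed while write/read conflicts are never allowed to race.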

How does asynchronous tasking differ from other models?

Berzins: The runtime also monitors which tasks are ready, which are waiting on communication, and whether tasks should run on CPUs or GPUs. This separation of concerns is key: the runtime system handles execution, while the application developers focus only on writing physics tasks and their dependencies.

By contrast, in MPI, you explicitly manage communication. You send messages and wait for responses. With asynchronous tasking, you relinquish control over when and where execution occurs. The runtime decides. For developers used to MPI, that can be difficult to accept.
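For comparison, here is a minimal sketch of the MPI style described above, where the programmer posts the communication and waits on it explicitly; the compute step is a hypothetical placeholder. Build with an MPI compiler wrapper (e.g. mpicxx) and run with at least two ranks.

```cpp
// Explicit message passing: rank 0 sends a halo, rank 1 waits for it before
// computing. If the data is late, rank 1 simply sits in MPI_Wait.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<double> halo(1024, 0.0);
    if (rank == 0) {
        MPI_Send(halo.data(), static_cast<int>(halo.size()), MPI_DOUBLE,
                 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Request req;
        MPI_Irecv(halo.data(), static_cast<int>(halo.size()), MPI_DOUBLE,
                  0, 0, MPI_COMM_WORLD, &req);
        // Ideally useful work would overlap the communication here...
        MPI_Wait(&req, MPI_STATUS_IGNORE);   // ...but this rank idles if the data is late.
        // stencil_update(halo);             // hypothetical compute step
    }
    MPI_Finalize();
    return 0;
}
```

Under an asynchronous task runtime that wait effectively disappears from the application: the task that needs the halo simply does not become ready until the data arrives, and the scheduler runs other ready tasks on that node in the meantime.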

Around 2010 to 2012, we hit a wall: our code was not scaling because too much time was spent in MPI waits. Tasks were sitting idle, waiting for data. At that point, I told my graduate students, Justin Luitjens and Qingyu Meng, that the runtime had to be asynchronous.

They redesigned it, and from that point forward, UINTAH scaled cleanly across Department of Energy supercomputers. That work was carried forward by many students and collaborators, including Alan Humphrey, John Holmen (now at Oak Ridge), Mark Garcia (formerly at Argonne), and Alan Sanderson, with strong support from colleagues at Intel, Argonne, and the Kokkos group at Sandia.

With that runtime, we were early adopters of new supercomputers, often running at full scale while other groups were still struggling to adapt. For example, we were among the first to run effectively on Aurora.

What changes at exascale? How can scientists adapt their applications to best suit the larger-scale resources?

Berzins: It is less about the number of nodes and more about the power of each node. Machines like Aurora and Frontier have fewer nodes than older CPU-only systems, but each node is far more powerful, with multiple CPUs and GPUs. The imbalance between compute and communication grows. Communication speeds have improved, but not enough to keep up with GPU performance. GPUs can perform hundreds of floating-point operations in the time it takes to move a single word from main memory. That means GPUs are often starved for data. So you need careful runtime management to keep these powerful nodes busy without being bottlenecked by communication or memory transfers.
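A back-of-envelope calculation with rough, assumed numbers (not any vendor’s specification) illustrates why that ratio runs into the hundreds.

```cpp
// Assumed figures for illustration only: roughly 50 TFLOP/s of double-precision
// compute fed by roughly 2 TB/s of memory bandwidth.
#include <cstdio>

int main() {
    const double flops_per_s = 50e12;               // assumed peak FP64 rate
    const double bytes_per_s = 2e12;                // assumed memory bandwidth
    const double words_per_s = bytes_per_s / 8.0;   // 8-byte double words
    std::printf("flops available per word moved: %.0f\n",
                flops_per_s / words_per_s);         // ~200
}
```

With ratios like that, any kernel doing only a few operations per word loaded is bandwidth bound, which is why keeping data resident on the GPU and overlapping transfers with other work matters so much.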

This is where Kokkos comes in. Around 5 to 7 years ago, we began thinking systematically about portability to GPUs. Kokkos is a DOE-developed library that abstracts both execution and memory management. With Kokkos, you can write one code that runs on CPUs, Nvidia GPUs, AMD GPUs, and Intel GPUs without modification. The library restructures loops and memory layouts to suit each backend.
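To give a flavour of that single-source style, here is a minimal Kokkos sketch; it is illustrative and not code from UINTAH. The same loop compiles for host threads, CUDA, HIP, or SYCL depending on the backend Kokkos was built with.

```cpp
// One source, many backends: Kokkos decides where the View memory lives and
// how the parallel_for is mapped (CPU threads, NVIDIA, AMD, or Intel GPUs).
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int N = 1 << 20;
        Kokkos::View<double*> x("x", N), y("y", N);   // allocated in the backend's memory space
        const double a = 2.0;

        // A single parallel pattern with a label; no CUDA/HIP/SYCL code in sight.
        Kokkos::parallel_for("axpy", N, KOKKOS_LAMBDA(const int i) {
            y(i) = a * x(i) + y(i);
        });
        Kokkos::fence();   // wait for the asynchronous backend to finish
    }
    Kokkos::finalize();
    return 0;
}
```

Only the build configuration changes between machines; the application source stays the same, which is what makes a million-line code maintainable across CPU and GPU systems.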

For UINTAH, combining asynchronous tasking with Kokkos was crucial. The tasking side provides resilience against delays and ensures that there is always work ready; Kokkos makes that execution portable across architectures. Together, they allow our simulations to run efficiently on every DOE exascale machine we have tested.

We chose Kokkos because it was one of the first serious efforts in portability, developed at Sandia, while a parallel project called RAJA emerged from Lawrence Livermore. Both programming models have since been adopted widely. We knew the Kokkos developers and began collaborating with them approximately ten years ago. They even added support for asynchronous tasking at our request, though relatively few people use it. 

Over time, their implementation evolved and we adapted our code accordingly; the collaboration has been excellent throughout. Their long-term goal is to integrate Kokkos into the C++ standard library, which would greatly expand adoption.

Performance portability does mean you may not achieve the absolute best performance possible, because vendor-specific tuning is required for optimal performance. However, for million-line scientific codes, achieving efficient execution everywhere is far more valuable than squeezing out the last few per cent on a single architecture. DOE’s Exascale Computing Project recognised this and focused heavily on porting major codes to GPU-based systems using frameworks like Kokkos. It was an outstanding project that brought together national labs and delivered real results.

There is also a social dimension. Scientific software represents billions of dollars of investment across DOE, NSF, DOD, and other agencies. Rewriting those codes wholesale is not feasible. Most groups do the minimum necessary to keep things running, and inertia is a strong force. That is why new models take time to gain widespread adoption. Typically, it takes a decade for today’s cutting-edge approaches to become mainstream in production codes. So by running at exascale now, we gain a kind of ten-year head start.

Looking ahead, there is also interesting work at the intersection of task-based runtimes, AI, and quantum computing. For example, researchers at RIKEN in Japan have proposed task-based approaches as a way to bridge classical and quantum computations, where latencies are unpredictable. Similarly, AI and engineering codes could both be expressed as tasks, allowing them to integrate naturally within the same runtime. This is an area of future opportunity.

Finally, self-adaptivity is vital. On any parallel machine, scalability collapses as soon as you wait for data. The asynchronous runtime eliminates that bottleneck. It makes UINTAH self-adaptive, automatically adjusting to communication delays, machine load, and network topology. That adaptability has been the key to running effectively on all the largest DOE machines.
