
Harnessing HPC: Cambridge’s approach to AI and computing at scale

Deepak Aggarwal, Principal HPC Systems Manager at the University of Cambridge

Credit: Deepak Aggarwal

Deepak Aggarwal is the Principal HPC Systems Manager at the University of Cambridge, where he leads the storage portfolio for Research Computing Services and oversees the AI Research Resource (AIRR), a national service supporting cutting-edge artificial intelligence research. 

With a background in physics and hands-on experience setting up HPC clusters from scratch, Aggarwal brings both a user’s perspective and deep technical expertise to his role. At Cambridge, he manages complex infrastructures that serve an extraordinarily diverse user base, from clinical researchers and physicists to computational chemists, astronomers, and even humanities scholars who are increasingly adopting AI-driven approaches. Beyond Cambridge, he contributes to the UK HPC community as Secretary of the HPC SIG, fostering collaboration and knowledge-sharing across institutions.

The rise of GPUs has transformed the HPC landscape, reshaping how systems are built and how researchers interact with them. Where once CPUs dominated, today’s clusters are increasingly GPU-centric, primarily driven by the explosive growth of AI and machine learning. Aggarwal and his team deliberately maintain hardware diversity across NVIDIA, Intel, and AMD GPUs to support both AI-driven workloads and traditional HPC codes that are more difficult to port. This shift has profound implications for infrastructure: it demands new models of accessibility, user onboarding, and workload optimisation, ensuring that researchers from all disciplines can take advantage of advanced resources without steep technical barriers. In this conversation, Aggarwal discusses the challenges and opportunities of this transition, the role of community in advancing HPC, and how Cambridge is shaping the next generation of computational research infrastructure.

Can you tell me about your role at Cambridge University?

Aggarwal: I'm currently working as a Principal HPC Systems Manager at the University of Cambridge, within Research Computing Services (RCS), part of University Information Services. I lead the storage portfolio for our research computing services and oversee a national service called AIRR, the AI Research Resource, for which we manage a system called Dawn. I also provide infrastructure support to the broader research community, both within Cambridge and for external users.

Additionally, I serve as Secretary of the UK HPC SIG community. This group comprises individuals from academia who manage HPC services within their universities, and we organise quarterly meetings where I assist with planning and delivery.

How does the UK HPC SIG community help its members?

Aggarwal: The main objective of the SIG community is to share knowledge. For example, if Cambridge develops a solution for our users, it may also apply to other universities, thereby avoiding the need to reinvent the wheel in isolation. At these quarterly meetings, a structured programme is in place, where people give talks on a wide range of topics, from cluster installation to new AI tools for managing support tickets. The aim is always to share knowledge, and the community benefits greatly from it.

We also have an active Slack channel and email list. For instance, if someone is considering new storage hardware or cooling solutions, they can ask the community for feedback before approaching vendors, ensuring honest opinions. These meetings are deliberately informal; nothing is recorded so that people can speak openly about technical and professional challenges, including career pathways in professional services compared to academic roles.

On a day-to-day basis, user requirements drive our services. I manage three types of storage: high-performance storage for HPC workloads, project storage for teams, and tape facilities for long-term data archiving and backups. These services exist because of demand from various user categories, not because we designed them in isolation.

Our main HPC service is CSD3, a large cluster comprising a mix of CPU-based and GPU-based compute nodes that span multiple GPU vendors: NVIDIA, Intel, and AMD. This ensures users can run their workloads on the most suitable hardware. For example, Dawn is entirely Intel GPU-based and supports national AI research projects.

What is the makeup of the user base at Cambridge?

Aggarwal: Our user base is very diverse, spanning the Clinical School, physics, computational chemistry, and astronomy, as well as the humanities, which increasingly utilise AI tools that require GPUs.

We offer various access methods to accommodate individuals with different skill levels. Experienced users can connect via Linux terminals, while beginners can use Open OnDemand, a browser-based UI that makes submitting jobs as simple as running applications on a laptop.
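
To make the contrast concrete, the sketch below shows the terminal-based route; Open OnDemand effectively generates and submits an equivalent batch script from the values a user types into its web form. It assumes a Slurm-style scheduler, and the partition name, resources, and workload are hypothetical placeholders rather than Cambridge-specific settings.

```python
# Illustrative only: submitting a job the "terminal" way on a Slurm-style scheduler.
# The partition, resources, and workload below are hypothetical placeholders.
import subprocess

batch_script = """#!/bin/bash
#SBATCH --job-name=example
#SBATCH --partition=gpu          # hypothetical partition name
#SBATCH --gres=gpu:1             # request one GPU
#SBATCH --time=00:30:00

python train.py                  # the user's existing workload, unchanged
"""

# Write the batch script to disk and hand it to the scheduler;
# sbatch prints the ID of the queued job.
with open("example.sbatch", "w") as handle:
    handle.write(batch_script)
subprocess.run(["sbatch", "example.sbatch"], check=True)
```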

Supporting diversity in skills means meeting users where they are. We do not require everyone to learn Linux before using HPC. Instead, we provide tools that allow researchers to run their workloads immediately, essentially treating HPC as an extension of their laptop. This approach is aligned with the government’s roadmap, which emphasises rapid access to computational resources rather than lengthy training requirements.

At Cambridge, gauging user requirements can be challenging due to the community's large and diverse nature. Sometimes solutions arise from administrative challenges rather than direct requests. For example, we had different onboarding processes for internal, industry, and national users. With the launch of AIRR, which allows any UK researcher to access resources at Cambridge (Dawn) or Bristol (Isambard-AI), we needed a unified system.

We adopted a federation model using MyAccessID from GÉANT and the open-source project management tool Waldur. This allows users to authenticate with their home institution’s credentials, while giving PIs control over project membership. It simplifies onboarding, ensures proper offboarding at project end, and removes the burden of managing external credentials. Both Cambridge and Bristol now use this model for AIRR.
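
As a rough illustration of how home-institution login works, the sketch below shows the token-exchange step of a standard OpenID Connect authorisation-code flow, the kind of federated sign-in an identity proxy provides. Every URL, client identifier, and value here is a hypothetical placeholder, not a real MyAccessID or Waldur endpoint.

```python
# Generic OpenID Connect token exchange, illustrating home-institution login via an
# identity proxy. All endpoints and credentials are hypothetical placeholders.
import requests

TOKEN_ENDPOINT = "https://proxy.example.org/oidc/token"   # hypothetical proxy endpoint

def exchange_code_for_tokens(auth_code: str) -> dict:
    """Swap the one-time authorisation code (returned after the user signs in with
    their home institution's credentials) for ID and access tokens."""
    response = requests.post(
        TOKEN_ENDPOINT,
        data={
            "grant_type": "authorization_code",
            "code": auth_code,
            "redirect_uri": "https://portal.example.org/callback",  # hypothetical portal
            "client_id": "example-portal",                          # hypothetical client
            "client_secret": "stored-securely-elsewhere",
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()   # contains id_token, access_token, expiry, etc.
```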

Has the shift from CPU to GPU impacted your role?

Aggarwal: The shift from CPUs to GPUs has transformed HPC. When I started, everything was CPU. My first large cluster consisted of 80 per cent CPU and 20 per cent GPU, and we struggled to utilise those GPUs effectively. NVIDIA invested heavily in training and libraries, which made adoption easier. Over time, the balance flipped, and our new systems are now GPU-centric.

AI has accelerated this trend. Large language models and AI workloads require GPUs, and researchers want quick access. While AI codes are relatively easy to port between GPU types, traditional scientific HPC codes, such as CFD, molecular dynamics, or MPI workloads, are more complex to migrate, especially legacy codes written decades ago. This creates challenges in avoiding vendor lock-in.
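
Much of that portability comes from framework-level abstraction: the same model code runs on whichever accelerator the framework detects at runtime. The sketch below assumes a recent PyTorch build; the xpu backend (Intel GPUs, as on Dawn) is only present in builds with Intel GPU support, and AMD ROCm builds reuse the "cuda" device name.

```python
# Minimal sketch of vendor-agnostic AI code in PyTorch: the device is chosen at
# runtime and the model code itself never changes. Assumes a recent PyTorch build;
# the xpu backend is only available when Intel GPU support is compiled in.
import torch

def pick_device() -> torch.device:
    if torch.cuda.is_available():                                   # NVIDIA CUDA (and AMD ROCm builds)
        return torch.device("cuda")
    if getattr(torch, "xpu", None) and torch.xpu.is_available():    # Intel GPUs
        return torch.device("xpu")
    return torch.device("cpu")                                      # fall back to CPU

device = pick_device()
model = torch.nn.Linear(1024, 1024).to(device)    # identical model code on any backend
batch = torch.randn(8, 1024, device=device)
print(device, model(batch).shape)
```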

We deliberately maintain hardware diversity with NVIDIA, Intel, and AMD GPUs to prevent dependence on a single vendor. We are also working on projects like federated container services, which abstract away GPU differences and automatically optimise workloads for whatever hardware is available. This heterogeneity reduces complexity for users while promoting sustainability and flexibility for providers.

Ultimately, the HPC community is in transition. AI has brought new users from non-traditional fields, making accessibility and flexibility more important than ever. At the same time, we must continue supporting traditional HPC users, many of whom depend on highly optimised, domain-specific codes that remain CPU and MPI-heavy. Balancing both is the challenge ahead.

"There are various kinds of communities. We are building infrastructures that act as a bridge between researchers and the computational power they need. Our biggest users come from the Clinical School, but we also have users from physics, computational chemistry, astronomy, and many other fields. We even see increasing use from the humanities, particularly with the rise of AI tools in their disciplines. They also need high-end GPUs for their workloads.

In principle, any user who requires computational power, regardless of their HPC experience, can utilise the services. Those experienced with Linux can SSH in and run jobs directly from the terminal. For those without Linux knowledge or HPC terminology, we provide a browser-based service called Open OnDemand. It has a simple interface: you log in, fill in details, and submit a job much as you would run a Windows application on your laptop.

How does this impact services and training, particularly for new users?

Aggarwal: Today, there is a strong push from new users running AI workloads on laptops, whether Linux or Windows. The challenge is to transition them to HPC without a steep learning curve. Supporting diversity in skills means meeting people where they are, not forcing them to learn everything we know. We want them to use their existing skills with the computational power we provide.

Some users may choose to learn more, but for those who just want to run workloads, we do not want the learning curve to delay them. The government’s current roadmap also emphasises providing resources as quickly as possible so that researchers can run workloads from day one. It should feel like a replacement for their laptop: take the code, move it to HPC, and run it.

This shift has expanded the community. We now have many users from non-computational backgrounds, including the humanities. Five or ten years ago, nobody expected that. The AI boom has driven demand and brought in these new groups, which means we must be flexible in enabling access, just like public cloud providers. You log in, spin up resources, and run jobs as if you were on your laptop, even though they are actually running on a cluster. At Cambridge, we have been doing this successfully.

As for how I got into HPC, I began my professional journey in 2011 after completing my master’s degree in physics. I joined a nuclear fusion institute in India as a nuclear analyst, working on ITER, the International Thermonuclear Experimental Reactor in France. I ran my simulations on a small HPC system at the institute, which was my first exposure to clusters.

Later, there was a strong push to develop computational facilities. In 2016, I joined a team setting up a large cluster called the Butterfly Machine. I learned how to set up an HPC cluster from procurement to commissioning, benchmarking, porting user applications, and training researchers to run jobs. I worked closely with users, helping them understand HPC terminology and how to transition from workstations.

In my current role, I do not work directly with users as much, as it is more about developing solutions, while the RCS community handles engagement. However, because I was a user before becoming an admin, I understand their pain points and how they communicate. That background has been invaluable.

At the institute, when we launched the cluster, we grew from 50 users to 200 in three years, out of a total staff of 500. Many researchers stopped buying personal workstations once they saw the benefits of shared resources. One big challenge was engaging isolated users who were hesitant to seek help. Some were very active, but others stayed quiet. To reach them, I started an HPC newsletter and published around 40 issues, one every month.

Deepak Aggarwal is the Principal HPC Systems Manager at the University of Cambridge
 
