Our recent roundtable brought together those working in scientific research to discuss their data storage needs. With scientific research generating vast amounts of data, choosing the right storage infrastructure – whether on-premises, hybrid, or cloud-based – is becoming increasingly complex. Legal and regulatory requirements, AI-driven data processing, environmental sustainability, and resource scarcity are all key factors shaping the future of data management.
There are often multiple governing factors behind any given storage set-up, as Nigel Berryman, CRUK’s Head of IT and Scientific Computing, explains: “We’re primarily driven by value, but historically we also got our fingers burned when we bought a proprietary solution. As a result, we stick to open-source standards. We generally find a product that we like and then we’ll stick with it for a few years. We’ll then jump to another product if we need to. The last time we changed was because the product we were using was discontinued, so we switched to another vendor.
“We have an annual budget that is cyclical with the cross-charging. We cross-charge users for our storage and cluster usage, which we try to do at cost, without making an internal profit. We have to forecast what we’re going to need in the next year. That judgement is now becoming more difficult with the push towards AI.
“We’ve got a small GPU cluster now and we can’t add GPUs fast enough; as soon as we add them, more people are using them. It’s getting very difficult to get ahead of the curve. Congestion is a common complaint.
“So, storage is less of an issue at the moment. With NVMe, performance issues have gone away and now it’s really just about capacity; we can add extra capacity with spinning disks quite cheaply, and it’s then down to the users to decide whether their data needs to sit on spinning disk or NVMe.”
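The at-cost cross-charging Berryman describes is straightforward to sketch in code. The tier names, costs and usage figures below are purely illustrative assumptions, not CRUK’s actual rates; the point is simply that recharge rates are set so that forecast costs are recovered without any internal margin.

```python
# Illustrative sketch of at-cost cross-charging for tiered storage.
# All tier names, costs and usage figures are hypothetical examples,
# not CRUK's actual rates.

# Forecast annual running cost per tier (hardware amortisation, power, support).
forecast_cost_gbp = {
    "spinning_disk": 40_000,   # cheap bulk capacity
    "nvme": 90_000,            # high-performance flash
}

# Forecast usage per tier, in terabyte-years.
forecast_usage_tb = {
    "spinning_disk": 2_000,
    "nvme": 300,
}

# At-cost recharge rate: recover the forecast cost exactly, no internal profit.
rate_per_tb = {
    tier: forecast_cost_gbp[tier] / forecast_usage_tb[tier]
    for tier in forecast_cost_gbp
}

def annual_charge(usage_tb: dict) -> float:
    """Charge a research group for its forecast storage footprint."""
    return sum(rate_per_tb[tier] * tb for tier, tb in usage_tb.items())

if __name__ == "__main__":
    for tier, rate in rate_per_tb.items():
        print(f"{tier}: £{rate:.2f} per TB-year")
    print(f"Example group (50 TB disk, 5 TB NVMe): "
          f"£{annual_charge({'spinning_disk': 50, 'nvme': 5}):.2f}")
```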
The university is a 'complex research environment'
Tore H. Larsen, Chief Research Engineer HPC, Simula Research Laboratory, says his set-up is based on personal experience more than anything else. “I stuck to what I know,” he says. “In the 2010s, while working in the seismic industry, I adopted a certified CXFS design from my time at Silicon Graphics, using FDR InfiniBand. The eX3 infrastructure has a 200Gbit/s HDR InfiniBand backbone for BeeGFS/BeeOND RDMA, but with failover to the nodes’ 25/100Gbit/s Ethernet. We currently plan to migrate to NDR as the backbone. On a subset of nodes, we have PCIe networking from Dolphin Interconnect Solutions, which is also used as the BeeGFS failover network. We collaborated with several Norwegian companies in the HPC interconnect space, most prominently Dolphin Interconnect Solutions, NumaScale and ScaleMem.”
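The failover Larsen describes relies on BeeGFS clients reading a priority-ordered list of network interfaces (the connInterfacesFile), trying the RDMA-capable links first and dropping back to Ethernet when they are unavailable. As a minimal sketch, with hypothetical interface names and an example output path, such a list could be generated from whatever interfaces a node actually exposes:

```python
# Sketch: build a priority-ordered interface list for a BeeGFS client,
# preferring InfiniBand (RDMA) and falling back to Ethernet.
# The interface names and output path are assumptions for illustration.
import os

PREFERRED_ORDER = ["ib0", "ib1", "eth0", "eth1"]  # hypothetical NIC names

def available_interfaces() -> set:
    """Interfaces present on this node, as listed under /sys/class/net (Linux)."""
    return set(os.listdir("/sys/class/net"))

def build_priority_list(preferred=PREFERRED_ORDER) -> list:
    present = available_interfaces()
    return [iface for iface in preferred if iface in present]

if __name__ == "__main__":
    ordered = build_priority_list()
    # BeeGFS expects one interface name per line, highest priority first.
    with open("/tmp/connInterfacesFile.example", "w") as f:
        f.writelines(name + "\n" for name in ordered)
    print("Interface priority:", ordered)
```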
Deepak Aggarwal, Principal HPC Systems Manager, University of Cambridge, says that one size doesn’t fit all when it comes to scientific research. “We have three different storage services,” he expands. “One of the tiers is the high-performance storage, which is the scratch file system that we have on all HPC servers.
“Another one that we are testing on Dawn is NVMe-based, which is very high-throughput, low-latency storage for AI users. This will allow AI researchers coming through the AIRR programme to get the performance they need for large-scale AI workloads.
“The university is a complex research environment with a wide variety of use cases. For example, we have projects that don’t need much compute power, but they do need storage space. For these people, we provide a resilience tier, using Dell PowerScale, where data integrity and replication are key.”
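A tiered offering like the one Aggarwal outlines is, in effect, a routing decision per project. The sketch below illustrates that decision with made-up criteria and tier labels, not Cambridge’s actual policy.

```python
# Hypothetical tier-selection helper for a three-tier storage service:
# scratch (parallel file system), NVMe (AI throughput/latency) and a
# resilient, replicated tier. Criteria and names are illustrative only.
from dataclasses import dataclass

@dataclass
class ProjectNeeds:
    capacity_tb: float        # how much space the project needs
    needs_low_latency: bool   # e.g. large-scale AI training workloads
    needs_replication: bool   # long-lived data that must survive failures
    compute_heavy: bool       # whether it runs substantial HPC jobs

def choose_tier(p: ProjectNeeds) -> str:
    if p.needs_low_latency:
        return "nvme"        # high-throughput, low-latency flash
    if p.needs_replication or not p.compute_heavy:
        return "resilient"   # replicated tier for storage-centric projects
    return "scratch"         # default high-performance scratch file system

if __name__ == "__main__":
    ai_project = ProjectNeeds(200, needs_low_latency=True,
                              needs_replication=False, compute_heavy=True)
    archive = ProjectNeeds(500, needs_low_latency=False,
                           needs_replication=True, compute_heavy=False)
    print(choose_tier(ai_project), choose_tier(archive))
```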
Jonas Lindemann, HPC Director, Lund University, has seen storage decisions change over the years. “Storage has been something of a pain historically,” he says, “but we have gone from building our own file systems and storage servers to working with more vendor-complete solutions. We ended up on IBM Spectrum Scale, which we’ve found to be a very flexible storage system. We use it for both sensitive and non-sensitive data.”