
Grappling with the growth of scientific data

It’s no surprise to readers of Scientific Computing World that scientific data is increasing exponentially. And ever-advancing storage technology is making it easier and cheaper than ever to store all this data (vendors will soon be shipping 840TB in a single 4U enclosure). So what’s missing? How about: how to keep track of all that data? How to find what you are looking for in these multi-petabyte ‘haystacks’? How to share selected data with your colleagues for collaborative research, and then make it available to support the mandate that published results must be reproducible? How to ensure the consistency and trustworthiness of scientific data, selective access, provenance, curation and availability in the future? How to find data that was created years or decades ago but is needed now? And how to identify and remove data that’s no longer needed, to avoid accumulating useless ‘data junkyards’?

Metadata is the key

The solution has been around for decades: it’s metadata. Metadata, or data about data, lets scientists find the valuable data they are looking for. Metadata especially helps find value in data that’s been created by others, no matter when or where. Without rich metadata, scientists increasingly risk spending their time just looking for data, or worse, losing it – instead of exploiting that data for analysis and discovery.

Physicists are the high priests of metadata, and astronomers their first disciples

In addition to inventing the World Wide Web to support its amazing work, big science physics pioneered the use of metadata to manage the moving, processing, sharing, tracking and storing of massive amounts of data among global collaborators. Physicists have been using metadata to manage really big data for decades, developing their own bespoke metadata and data management tools with each new project. Cern actually developed three separate metadata systems to manage the two storage systems used in their ground-breaking LHC work that famously captured 1PB of detector data per second in search of the elusive Higgs boson.

So when NASA needed to keep track of all the data coming from the Hubble Space Telescope, it consulted the physicists on the BaBar experiment at the Stanford Linear Accelerator Center (SLAC), and applied their metadata-based techniques to astronomy. Data collected from Hubble over the decades is meticulously annotated with rich metadata so future generations of scientists, armed with more powerful tools, can discover things we can’t today. In fact, because of rich metadata, more research papers are being published on decades-old archived Hubble data than on current observations.

General solutions to managing metadata

So what if your organisation isn’t part of a multi-billion dollar, multinational big science project with the resources to build a custom system for managing metadata? The good news is that there are a couple of broadly available and generally applicable metadata-oriented data management systems already used by hundreds of scientific organisations: iRODS and Nirvana. These ‘twin brothers from different mothers’ were both invented by Dr Reagan Moore (a physicist of course!), formerly with General Atomics and the San Diego Supercomputer Center, and now with the Data Intensive Cyber Environments (DICE) research group at the University of North Carolina. iRODS is the Integrated Rule-Oriented Data System, an open source project developed by DICE. Reagan Moore discussed the system in his article ‘How can we manage exabytes of distributed data?’ on the Scientific Computing World website in March 2014.

Nirvana is a commercial product developed by the General Atomics Energy and Advanced Concepts group in San Diego. It grew out of a joint effort with the San Diego Supercomputer Center on its Storage Resource Broker (SRB).

(‘Taking action on big data’ is a recurrent theme for North Carolina, as Stan Ahalt discusses in his article on these pages. Ahalt is director of the Renaissance Computing Institute (RENCI), professor of computer science at UNC-Chapel Hill, and chair of the steering committee for the National Consortium for Data Science (NCDS).)

How they work

These systems have agents that can mount pretty much any number and kind of file or object-based storage system, and then ‘decorate’ their files with rich metadata that is entered into a catalogue that sits on a standard relational database such as Postgres or Oracle. GUI or command-line interfaces are used for querying and accessing the data. Data can then be discovered and accessed through an object’s detailed attributes, such as creation date, size, frequency of access, author, keywords, project, study, data source, and more. All this data can reside on very different, incompatible platforms crossing multiple administrative domains, yet it is tied together under a single searchable global name space. Background processes within this federation move data from one location to another, based on policies or events, to coordinate scientific workflows and data protection, much like the systems at Cern. These systems can also generate audit trails, track and ensure data provenance and data reproducibility, and control data access: exactly what’s needed to manage and protect scientific data.
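To make the catalogue idea concrete, here is a minimal sketch in Python, using the standard library’s sqlite3 module as a stand-in for Postgres or Oracle. It is an illustration only, not the schema or API of iRODS or Nirvana. The table names, the functions build_catalogue, register and find, and every path and attribute value are hypothetical. Each file gets a logical path in one shared name space, a pointer to wherever it physically lives, and any number of attribute/value pairs that can then be searched.

```python
import sqlite3


def build_catalogue():
    """Create an in-memory catalogue (hypothetical schema, for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE data_object (
            id                INTEGER PRIMARY KEY,
            logical_path      TEXT UNIQUE NOT NULL,  -- name in the global name space
            physical_location TEXT NOT NULL          -- where the bytes actually live
        );
        CREATE TABLE metadata (
            object_id INTEGER NOT NULL REFERENCES data_object(id),
            attribute TEXT NOT NULL,
            value     TEXT NOT NULL
        );
        CREATE INDEX idx_metadata ON metadata(attribute, value);
        """
    )
    return conn


def register(conn, logical_path, physical_location, **attributes):
    """Add one object to the catalogue and 'decorate' it with metadata."""
    cur = conn.execute(
        "INSERT INTO data_object (logical_path, physical_location) VALUES (?, ?)",
        (logical_path, physical_location),
    )
    object_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO metadata (object_id, attribute, value) VALUES (?, ?, ?)",
        [(object_id, name, str(value)) for name, value in attributes.items()],
    )
    return object_id


def find(conn, **criteria):
    """Return the logical paths of objects matching every attribute=value given."""
    if not criteria:
        rows = conn.execute(
            "SELECT logical_path FROM data_object ORDER BY logical_path"
        )
        return [row[0] for row in rows]
    # One clause per criterion; an object qualifies only if all of them matched.
    clauses = " OR ".join("(m.attribute = ? AND m.value = ?)" for _ in criteria)
    params = [x for name, value in criteria.items() for x in (name, str(value))]
    sql = f"""
        SELECT o.logical_path
        FROM data_object AS o
        JOIN metadata AS m ON m.object_id = o.id
        WHERE {clauses}
        GROUP BY o.id, o.logical_path
        HAVING COUNT(DISTINCT m.attribute) = ?
        ORDER BY o.logical_path
    """
    rows = conn.execute(sql, params + [len(criteria)])
    return [row[0] for row in rows]


if __name__ == "__main__":
    catalogue = build_catalogue()
    # Two files on unrelated storage systems, under one name space (made-up values).
    register(catalogue, "/lab/survey/run_001.dat", "s3://bucket-a/run_001",
             project="survey", author="kim", year=2009)
    register(catalogue, "/lab/survey/run_002.dat", "file:///mnt/tape7/run_002",
             project="survey", author="lee", year=2014)
    print(find(catalogue, project="survey", author="lee"))
```

The point of the design is that a search never touches the storage systems themselves. Only the catalogue is queried, which is why files on incompatible platforms can appear side by side in one result. A production system would layer access control, audit logging and policy-driven data movement on top of a catalogue like this.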

Metadata is the future of scientific data management

Scientific big data, and the metadata-based techniques that manage it, are no longer the reserve of big science. Higher sensor resolution across a growing number of sequencers, cameras, microscopes, scanners and instruments of all types is driving a deluge of data across all of science. Fortunately, robust tools are readily available for effectively managing all this data. Now it’s up to you to use them!
