Drowning in data: a European perspective
From seismometers that measure every sigh and tremble of the Earth to high-definition scans of mediaeval manuscripts, researchers are applying digital techniques to create data pools of unprecedented depth. Our potential for understanding our world and ourselves has never been greater, thanks to digital data.
But this digital bonanza raises challenges – not least that, far from spreading and joining to form an ocean, our pools of data are instead growing deeper. Researchers and digital engineers are becoming so busy shoring up the ‘wells’ of data, ensuring they don’t collapse under their own weight, that we run the risk of losing a broader perspective. Like the scientist who over a lifetime of specialisation learns more and more about less and less, we risk ending up knowing everything about nothing.
Hyperbole, perhaps, but the challenges of managing exponentially growing data volumes while maintaining our ability to cross-reference subjects from different domains are real enough. Fortunately, researchers are on the case. In Europe, the last decade has seen great efforts to collect, catalogue, curate and preserve our new digital heritage. European research policy has emphasised the importance of ‘research infrastructures’ – trans-national digital laboratories to underpin this new era of data-driven discovery – and initiatives like the European Strategy Forum on Research Infrastructures (ESFRI) have resulted in significant and successful endeavours to shore up the wells across a range of disciplines.
Around 10 years ago, big data was the preserve of high-energy physics and astronomy. It is testament to our digital ingenuity that the problems of managing and sharing vast data sets can now be found in any and every modern field of research. That the study of the Earth’s climate is one of these is probably no surprise. The European Network for Earth System Modelling (ENES) is a European research infrastructure that brings together around 20 climate research and modelling centres to better understand the climate and our impact upon it. ENES has created a standardised environment for the preservation and exchange of tens of petabytes of simulation and satellite observation data.
What ENES attempts for the sky, the European Plate Observing System (EPOS) aims to do for the ground beneath our feet. EPOS is integrating the activities of a large number of Earth research infrastructures across Europe and has particular challenges in the assimilation of data from the ever-increasing network of high-capacity sensors – so-called broadband seismometers – across geologically active parts of the continent. International data standards help, of course, but the problem is still one of assimilating and managing tens of millions of individual data files in a dynamic environment.
Perhaps the most interesting impact of the digital research revolution is in the humanities. Over the last decade, the digitisation of speech and the scanning and digital interpretation of texts and manuscripts have created a wealth of data and research opportunities. CLARIN is the European Common LAnguage Resources and Technology Infrastructure (the ‘T’ is silent), an organisation spanning nine countries that preserves and provides access to digital language data collections. Twenty years ago, there could have been no CLARIN; now there are more than 30 centres together managing petabytes of rare spoken language data. One of CLARIN’s biggest challenges is in the assimilation and cross-referencing not of data but of metadata, ‘the data about the data’. Language is one of the richest, most diverse dimensions of human culture, and capturing and describing it in ways that can be harmonised, correlated and reasoned about is no easy matter.
Though they cover three different domains, the above projects have one thing in common: they are keystones in the European Data project, EUDAT. EUDAT is the largest, most significant ‘horizontal’ infrastructure project in the ESFRI roadmap, an infrastructure of infrastructures that aims to bring together the discipline-specific activities of initiatives like ENES, EPOS and CLARIN and find common ground among them. EUDAT’s first goal is to tap into the physical infrastructure of some of Europe’s leading computing and data centres to create a digital preservation network of connected disk and tape to counteract the increasing risk of losing something important.
With that done, the next question is how many of the software services offered by different disciplines are actually common and could be provided in a generic way. Providing those common services is EUDAT’s goal – and the benefits aren’t just economic, of course. If we can standardise the underlying infrastructure and services, sharing the content becomes so much easier. The World Wide Web tells us that.
And Europe is not an island. All these activities, whether generic or discipline-specific, are international. Science, in its broadest possible sense of ‘systematised knowledge’, is today a global endeavour, and the challenges of digital science are too great to be tackled in corners. The Research Data Alliance (RDA), a new coordination body for global research data, hopes to emulate the very best of the Internet Engineering Task Force (IETF), doing for the nitty-gritty of global research data sharing what the IETF has done for the internet: ‘wouldn’t it be better if this just worked the same way here, here and here?’ The RDA wants to bring together data practitioners from across the spectrum to sit down with problems like this and solve them, one small step at a time, steadily reducing the barriers between our wells of data.
Rob Baxter is software development group manager at EPCC, University of Edinburgh