CRISP: tackling the data deluge in international science
‘Big data’ has become a buzzword – misused and misunderstood by many. For scientific facilities, however, it is a definite reality.
Scientific research is based on data. Detectors and instruments are improving at a fantastic rate and producing more and more data. The Large Hadron Collider at CERN, even before its current upgrade, was producing 200,000 DVDs’ worth of raw data a second. It’s not the only one: the European Synchrotron Radiation Facility in Grenoble is doubling its data production to one or two petabytes a year; and the Square Kilometre Array radio telescope project is expected to produce 400 gigabytes a second.
This presents great opportunities. Scientific data is valuable, and research is only as good as the data that informs it. However, IT systems are struggling to cope with this data deluge. CERN, for example, has to throw away most of the data produced by the LHC, as we are simply unable to capture it. If CERN was able to find the Higgs Boson with only a fraction of the data produced by the LHC, imagine what we could achieve if we were able to capture more, or all, of that data.
In recent years, European physicists have pooled resources through the CRISP project (Cluster of Research Infrastructures for Synergies in Physics) to address the data deluge and to seize the opportunities for innovation and new breakthroughs that overcoming it would bring. CRISP brings together 11 research facilities, including the ESRF and CERN, which collaborate to develop and streamline information technology. This isn’t simply about finding the most powerful technological solutions. With pressure on funding and science budgets, it’s also about minimising cost, freeing up budget for true discovery, innovation, and research. As the project enters its final year, key issues of concern have been identified and new collaborations and projects are starting to take major steps to address them.
The obvious place to start is the initial collection of data and, in particular, the search for ways to automatically select only the most relevant data to be recorded for analysis. In this area, CRISP’s partners are looking at data pre-processing solutions, developing an architecture that can assess whether data is useful and throw away what is not, filtering data and prioritising it so we don’t have to record as much.
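To make the idea of filtering and prioritising concrete, here is a minimal sketch in Python. It is not CRISP’s architecture: the `Event` type, its `score` field, the `threshold` and the `capacity` are all hypothetical, standing in for whatever relevance measure and storage budget a real facility would use.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Event:
    # Hypothetical detector reading. 'score' is an assumed relevance
    # measure; events are ordered by score alone.
    score: float
    payload: bytes = field(compare=False)


def select_events(stream, threshold, capacity):
    """Keep at most `capacity` events scoring at least `threshold`,
    preferring the highest scores. Everything else is discarded."""
    kept = []  # min-heap: the lowest-scoring kept event sits at kept[0]
    for event in stream:
        if event.score < threshold:
            continue  # judged not useful, so never recorded
        if len(kept) < capacity:
            heapq.heappush(kept, event)
        elif event.score > kept[0].score:
            # Storage budget is full: evict the weakest kept event.
            heapq.heapreplace(kept, event)
    return sorted(kept, reverse=True)  # highest priority first
```

The threshold plays the part of the "is this useful?" test, while the fixed-size heap does the prioritising, so that only the most promising events are recorded when more pass the test than can be stored.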
However, it is not just capturing the data that is problematic. The data deluge has ramifications the whole way through the supply chain.
Transport of data is a major issue for today’s scientists. Modern researchers often require the use of a variety of facilities for their research, analysing samples or taking measurements using a number of different instruments. With limited access to high-demand instruments, and significant pressure on researchers’ time, the preference is for short visits to each instrument, so that scientists can turn up, gather their data and take it away for analysis. However, the reality is not so simple. Burning the vast streams of measurement data to a disk or USB stick would take longer than capturing it in the first place.
The alternative is a system that allows scientists to have their data follow them around ‘virtually’ so they can access it from wherever they are working. However, at present, scientists access different data from different facilities with different usernames and passwords. It’s a dilemma that should be familiar to any internet user trying to keep on top of all their email, online shopping and social media accounts.
To address this, CRISP is helping to standardise data access through a common identity system, or federated identity, allowing identities to transcend facilities or countries, and providing scientists with a single online identity or login to access all their data in one place – regardless of where it’s from.
But it’s not just a scientist’s own data that is difficult to access. Getting hold of previous research conducted at each facility is essential to ensure research teams can build on existing work without wasting time and effort re-doing old experiments. However, this requires a more intelligent approach to the way historical data is archived, making it more easily searchable and retrievable. Here again CRISP is aiming to make a difference through the development and availability of accurate and thorough metadata.
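As a rough illustration of why metadata matters for retrieval, the sketch below searches a small catalogue of archived datasets. The record fields (`facility`, `instrument`, `year`, `keywords`) are assumptions made for the example, not a CRISP metadata schema, and keywords are assumed to be stored in lower case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    # All field names are illustrative assumptions, not a real schema.
    dataset_id: str
    facility: str
    instrument: str
    year: int
    keywords: frozenset  # assumed to hold lower-case strings


def search(catalogue, facility=None, keyword=None, since=None):
    """Return the records matching every criterion that was supplied."""
    results = []
    for record in catalogue:
        if facility is not None and record.facility != facility:
            continue
        if keyword is not None and keyword.lower() not in record.keywords:
            continue
        if since is not None and record.year < since:
            continue
        results.append(record)
    return results
```

With records described this consistently, a team could ask for, say, every dataset from one facility tagged with a given keyword since a given year, rather than repeating an experiment whose results already sit in an archive.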
Progress has been made through CRISP, but there are still obstacles to overcome and issues to be resolved, and many of them are non-technical. Certain types of research data, such as medical and satellite records, need to be handled with special sensitivity and legal processes that aren’t yet in place, while different countries have very different data-protection and data-privacy laws, which also need to be taken into account.
For the latter of these concerns, help may be at hand from a somewhat unlikely source. The global controversy and debate sparked by revelations over PRISM – the mass electronic surveillance and data-mining programme – is pushing EU politicians to standardise data protection laws across Europe. The benefits to European science may be inadvertent but could be significant. However, this is but one part of the data deluge puzzle, and it is only through CRISP’s broad remit, covering the entire data management cycle and bringing together the best minds from European scientific research facilities, that we will deliver the advances that will underpin future scientific discovery.
Laurence Field is a researcher in CERN’s information technology department and the IT and data management topic leader within the CRISP project