FEATURE

Transforming discovery through integrated, comprehensive data management

Dan Bedard describes the progress that the integrated Rule-Oriented Data System is making in removing the data management roadblocks that inhibit the wider research use of data

By now the readers of Scientific Computing World will have heard plenty about the promise of big data and the challenges that inhibit realisation of that promise. Technologies that generate data – such as distributed sensor networks, social media, and low-cost genome sequencers – enable researchers to study problems at unprecedented levels of detail. Studies that will transform our understanding of issues such as climate change, intergenerational poverty, and debilitating diseases are now within our reach. To unlock the promise of these prolific data sets, however, we need to address several tricky data management problems.

The realities of modern-day scientific research complicate data management. Scientific problems are multidisciplinary and messy; data sets are often distributed among many institutions, each with its own storage technologies and data management practices. And while collaborative research requires data sharing, research on sensitive personal data requires security and privacy protection. For example, social science data often includes health and education records that must be anonymised or secured. In addition, the available analytic methods – the tool belt of data processing and analysis – continue to evolve, compelling researchers to preserve data for an essentially unlimited lifespan.

As head of the iRODS Consortium, based at the University of North Carolina’s Renaissance Computing Institute (RENCI), I see our members grapple with these competing forces every day. The data management questions asked by research and business organisations include:

  • Can we find a consistent, sensible mechanism for accessing and administering our data, which spans departments and institutions?
  • What tools can we use to organise and explore subsets within our data? And
  • How do we implement policies that ensure the integrity, security, privacy, and efficient processing of our data?

The iRODS Consortium was founded to sustain the integrated Rule-Oriented Data System (iRODS), free open-source software that provides policy-based management of unstructured data (i.e. files). iRODS presents a standard interface to data that is spread across multiple file systems and object stores, enabling a multitude of web clients, command line tools, and APIs to access the user’s data. Files in iRODS are associated with system and user-level metadata in a central, indexable catalogue. The iRODS rule engine implements data management policies for access control, retention, and any automated task imaginable across a data grid. To enable broad collaboration, iRODS deployments can be federated, a process that allows different data sets with independently defined management policies to appear to function as a single entity.

How are organisations using iRODS to take control of their data?

In June, users from industry, academic research centres, and government gathered to share their experiences at the seventh annual iRODS User Group Meeting, hosted by the iRODS Consortium in Chapel Hill, North Carolina.

Jon Nicholson of the Wellcome Trust’s Sanger Institute, based at Hinxton, near Cambridge in the UK, explained how the institute is using iRODS to manage petabytes of genome sequence data. After sequencing, the aligned data files are annotated with metadata indicating parameters such as the study ID and whether or not human DNA is included in the sequence. Using iRODS rules, the researchers automate several critical tasks.

Checksums, an error-detection technique used in data transfers, are calculated on the data and stored as metadata. iRODS uses these checksums to verify periodically that the data has not been corrupted. Human DNA is automatically separated from non-human DNA and moved to a secured storage location. Replicas of the data are stored in multiple locations to protect against data loss due to equipment failure.
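The checksum step can be illustrated with a minimal sketch in plain Python. This is not iRODS code or the Sanger Institute's implementation: the choice of SHA-256, the in-memory `catalogue` dictionary, and the function names are all assumptions made for illustration. It shows the general idea of recording a checksum as metadata at ingest and recomputing it later to detect corruption.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register(path: Path, catalogue: dict) -> None:
    """Record the file's checksum as metadata at ingest time."""
    # 'catalogue' is a hypothetical stand-in for a metadata catalogue.
    catalogue[str(path)] = {"checksum": sha256_of(path)}


def verify(path: Path, catalogue: dict) -> bool:
    """Recompute the checksum and compare it with the stored value."""
    return sha256_of(path) == catalogue[str(path)]["checksum"]


def demo() -> tuple:
    """Ingest a file, verify it, corrupt it, and verify it again."""
    catalogue = {}
    with tempfile.TemporaryDirectory() as tmp:
        data_file = Path(tmp) / "sample.dat"
        data_file.write_bytes(b"ACGT" * 1000)
        register(data_file, catalogue)
        before = verify(data_file, catalogue)
        data_file.write_bytes(b"ACGA" * 1000)  # simulate silent corruption
        after = verify(data_file, catalogue)
    return before, after
```

Calling `demo()` verifies the file once while it is intact and once after it has been overwritten, so the second check fails. In a real deployment the periodic verification would be scheduled by the data management system rather than called by hand.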

Once the data is stored, researchers query the metadata catalogue to locate data of interest. For example, they can retrieve data by its study ID. The Sanger Institute divides projects into ‘iRODS Zones’ with distinct data management policies; federation of the zones allows controlled data-sharing between projects.
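A metadata query of this kind can be sketched as a simple filter over annotated records. The example below is a toy in-memory model in plain Python, not the iRODS query interface; the `DataObject` class, the attribute names `study_id` and `human`, and the file paths are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """A file path plus its attribute-value metadata (hypothetical model)."""
    path: str
    metadata: dict = field(default_factory=dict)


def find_by_metadata(catalogue, **criteria):
    """Return paths of objects whose metadata matches every criterion."""
    return [
        obj.path
        for obj in catalogue
        if all(obj.metadata.get(key) == value for key, value in criteria.items())
    ]


# Illustrative records only; the names and values are invented.
CATALOGUE = [
    DataObject("/zoneA/run1.cram", {"study_id": "S100", "human": True}),
    DataObject("/zoneA/run2.cram", {"study_id": "S100", "human": False}),
    DataObject("/zoneA/run3.cram", {"study_id": "S200", "human": True}),
]
```

For instance, `find_by_metadata(CATALOGUE, study_id="S100")` selects the two files annotated with that study, and adding `human=True` narrows the result to one. The point of the design is that users locate data by what it is, rather than by where it happens to be stored.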

Data federation through iRODS will be a critical capability for eMedLab, a collaborative bio-research project funded at £9 million by the UK’s Medical Research Council to provide a shared offsite data centre that supports ‘Data-Driven Discovery for Personalised Medicine’. The eMedLab collaboration includes iRODS Consortium members – the Sanger Institute and University College London – as well as The Francis Crick Institute, Queen Mary University of London, and the European Bioinformatics Institute.

In a related presentation at the User Group Meeting, Vic Cornell from DataDirect Networks (DDN), which is also an iRODS Consortium member, discussed how Imperial College London (ICL) uses iRODS to comply with data management policies required for publicly-funded research in the UK. ICL plays a lead role in UK MED-BIO, a collaboration that includes partners from the Institute of Cancer Research, the European Molecular Biology Laboratory-European Bioinformatics Institute, the Universities of Oxford, Swansea and Nottingham, the MRC Clinical Sciences Centre, and the non-profit organisation MRC Human Nutrition Research. The project seeks to bring together the data, infrastructure, and expertise needed ‘to enable major advances in understanding the aetiopathogenesis of chronic human diseases’. The proof-of-concept system demonstrates the utility of iRODS in implementing mandated policies, such as those that:

  • Maintain associations between data sets and unique persistent identifiers;
  • Ensure that data sets are preserved for a prescribed period of time following the last access; and
  • Guarantee that archived data sets are not altered or corrupted.
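The retention requirement in the list above, preservation for a prescribed period following the last access, comes down to a date comparison. The sketch below is plain Python rather than an iRODS rule, and the ten-year period and function names are assumptions chosen only for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical retention period; the real figure is set by the funder's policy.
RETENTION = timedelta(days=3650)


def retention_expiry(last_access: datetime, retention: timedelta = RETENTION) -> datetime:
    """Return the earliest moment at which the data set may be removed."""
    return last_access + retention


def may_delete(last_access: datetime, now: datetime,
               retention: timedelta = RETENTION) -> bool:
    """True only once the retention period since the last access has elapsed."""
    return now >= retention_expiry(last_access, retention)
```

Because the clock starts at the last access, every new access pushes the expiry date forward. A policy engine would therefore need to record access times as metadata and evaluate this check before any deletion is allowed.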

The projects highlighted here are but a small slice of the data-intensive research endeavours underway using iRODS. In genomics alone, there are nation-scale sequencing studies spinning up in the United States, Canada, Australia, Japan, South Korea, Singapore, Thailand, Kuwait, Qatar, Israel, Belgium, Luxembourg, and Estonia. Collaboration between institutions and disciplines is increasingly commonplace, and yet there will never be a one-size-fits-all data management plan for all organisations and disciplines. Also, organisations and research teams will continue to need long-term data preservation methods, so that data sets can be revisited and reanalysed as new analysis techniques emerge.

At the iRODS Consortium, we feel privileged to play a role in enabling these new data-driven efforts that will help fulfil the promise of big data. We know that data management roadblocks can be overcome through painstaking effort and close attention to the needs of our users. More importantly, we believe the transformative discoveries that result will bring us closer to solving important problems related to human health, climate change, environmental sustainability, poverty, and more.

Dan Bedard is executive director of the iRODS Consortium, housed at RENCI at the University of North Carolina, Chapel Hill. For more information on iRODS or the consortium, please visit irods.org.
