How can we manage exabytes of distributed data?

With the exabytes of data that are being generated today, it has become essential to integrate networking technology and data management technology to manage the movement and storage of data. Policy-based data-management systems provide a way to proceed. They represent perhaps the latest stage in the evolution of data-management systems from file-based systems, to information-based systems, and now to knowledge-based systems.

File-based systems focused on the management of bits, and provided a standard I/O interface for reading and writing files. Information-based systems added support for information about the files, including provenance, descriptive, and structural information – stored as metadata. Knowledge-based systems add support for procedures that either extract or generate information, and enable the processing of data within the storage environment.

For 16 years, the Data Intensive Cyber Environments (DICE) group at the University of North Carolina at Chapel Hill has been developing data-management systems called data grids – software that makes it possible to organise distributed data into sharable collections, while enforcing access controls. The original system, the Storage Resource Broker (SRB), focused on ensuring consistency across all operations performed in a distributed environment. Implemented as middleware, the SRB was installed where data would be stored.

Applications included: the BaBar High Energy Physics project, which moved two petabytes of data between Palo Alto, California and Lyon, France; the US National Optical Astronomy Observatory, which managed the migration of data from telescopes in Cerro Tololo, Chile, to archives in Illinois; and the United Kingdom’s e-Science data grid. The SRB provided a standard I/O interface, while managing metadata about the distributed files. The applications managed hundreds of millions of files.

Despite SRB’s success in managing data and information, users requested the ability to modify consistency constraints and implement multiple types of data-management policies. A requirement from the UK e-Science data grid, for example, was to create a collection in which files were permanently managed and could never be deleted. At the same time, administrators needed to be able to replace corrupted files, and users needed to be able to update their own files. This implied the need to manage at least three different constraints on data deletion within the same system: no deletion allowed; deletion by administrator; and deletion by file owner.
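The three deletion constraints can be pictured as a per-collection setting that is consulted before any delete is carried out. The following is a minimal sketch under stated assumptions, not SRB or iRODS code: the names DeletionPolicy, may_delete and the collection paths are hypothetical and chosen only for illustration.

```python
from enum import Enum


class DeletionPolicy(Enum):
    """The three deletion constraints named in the text (hypothetical names)."""
    NO_DELETION = "no deletion allowed"
    ADMINISTRATOR = "deletion by administrator"
    FILE_OWNER = "deletion by file owner"


def may_delete(policy, user, owner, is_admin=False):
    """Return True if `user` may delete a file owned by `owner`."""
    if policy is DeletionPolicy.NO_DELETION:
        return False
    if policy is DeletionPolicy.ADMINISTRATOR:
        return is_admin
    # FILE_OWNER: only the user who owns the file may delete it.
    return user == owner


# Hypothetical collections, each governed by a different constraint.
collection_policies = {
    "/archive/permanent": DeletionPolicy.NO_DELETION,
    "/archive/curated": DeletionPolicy.ADMINISTRATOR,
    "/home/alice": DeletionPolicy.FILE_OWNER,
}
```

The point of the sketch is that the constraint is data attached to a collection rather than logic fixed in the software, so one system can hold all three kinds of collection side by side.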

The DICE group developed a policy-based system to extract knowledge about management policies from the software, and apply the knowledge via computer-actionable rules. Effectively, every software-encoded consistency constraint was replaced by a policy-enforcement-point. Actions by clients were trapped at the policy-enforcement-points. By searching the rule base, an appropriate rule could then be identified, which controlled the execution of a workflow that applied the required management policy. This meant that the knowledge needed to manage the system could be captured in computer-actionable rules. The system was no longer restricted to managing files and static representations of information. Instead, a data-management system could use rules to control the system and dynamically change the rules in a rule base.
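The mechanism described above (trap an action at a policy-enforcement-point, search the rule base, run the workflow of the matching rule) can be sketched in a few lines. This is an illustrative sketch only and does not reproduce the iRODS rule language or its enforcement-point names: Rule, RuleBase, trap and the "post_ingest" label are all assumptions made for the example, and the workflow simply returns the steps it would run.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Rule:
    enforcement_point: str                  # where the client action is trapped
    condition: Callable[[dict], bool]       # does this rule apply to the action?
    workflow: Callable[[dict], List[str]]   # steps that apply the policy


class RuleBase:
    def __init__(self) -> None:
        self.rules: List[Rule] = []

    def add(self, rule: Rule) -> None:
        """Rules can be added while the system is running."""
        self.rules.append(rule)

    def trap(self, enforcement_point: str, context: dict) -> Optional[List[str]]:
        """Search the rule base and run the first matching rule's workflow."""
        for rule in self.rules:
            if rule.enforcement_point == enforcement_point and rule.condition(context):
                return rule.workflow(context)
        return None  # no policy defined for this action


rule_base = RuleBase()
rule_base.add(Rule(
    enforcement_point="post_ingest",  # hypothetical enforcement-point name
    condition=lambda ctx: ctx["collection"].startswith("/archive"),
    workflow=lambda ctx: [f"checksum {ctx['path']}", f"replicate {ctx['path']}"],
))

steps = rule_base.trap(
    "post_ingest",
    {"collection": "/archive/2012", "path": "/archive/2012/run1.dat"},
)
```

Because the rules live in a data structure rather than in compiled code, changing a management policy amounts to changing the contents of the rule base.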

The integrated Rule-Oriented Data System (iRODS) was developed over the past seven years, and has replaced the SRB. Within iRODS, policies can be enforced for preservation (authenticity, integrity, chain of custody, original arrangement, retention, disposition); for data publication in a digital library (descriptive metadata annotation, arrangement, creation of presentation versions such as image thumbnails); for sharing in a data grid (access controls, distribution, caching); for reproducible data-driven research in a processing pipeline (workflow procedures, workflow provenance, workflow re-execution); and for validating assessment criteria (repository trustworthiness, compliance with regulations).

Today, viable data-management systems automate the enforcement of management policies within storage controllers, the execution of administrative tasks such as data migration, and the validation of assessment criteria. They capture knowledge, and automate processing of data within workflow pipelines. The automation of these tasks corresponds to the creation of knowledge procedures that can be applied by a policy-based data-management system.

Through policy-based data-management systems, it will be possible to implement feature-based indexing of data collections. Discovery of data can be driven by the presence of desired features within the data set, instead of descriptive metadata. This requires the ability to apply a procedure to the data, determine whether the desired feature is present, and build an associated index. Policy-based systems can control the execution of the associated procedures.
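Feature-based indexing as described here has three steps: apply a procedure to each data set, test whether the feature is present, and record the result in an index. The sketch below illustrates those steps under simple assumptions: the collection is an in-memory dictionary, and the detectors has_peak and is_empty are hypothetical stand-ins for real feature-extraction procedures.

```python
from collections import defaultdict


def build_feature_index(collection, detectors):
    """Map each feature name to the set of data sets in which it was found."""
    index = defaultdict(set)
    for name, data in collection.items():
        for feature, detect in detectors.items():
            if detect(data):  # apply the procedure to the data itself
                index[feature].add(name)
    return index


# Hypothetical collection: file name -> measured values.
collection = {
    "sensor_a.dat": [0.1, 0.3, 9.7, 0.2],
    "sensor_b.dat": [0.2, 0.1, 0.3, 0.2],
    "sensor_c.dat": [],
}

# Hypothetical feature detectors; real ones would be domain-specific.
detectors = {
    "has_peak": lambda values: any(v > 5.0 for v in values),
    "is_empty": lambda values: len(values) == 0,
}

index = build_feature_index(collection, detectors)
```

A query for data sets containing a peak then becomes a lookup of index["has_peak"], with no reliance on descriptive metadata having been supplied for the files.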

Through policy-based data-management systems, it will be possible to link virtual collections to virtual networks, and access data by name instead of network location. A data-management system can be integrated with software-defined networking technology, such as OpenFlow-enabled routers, and dynamically define the network path that is used to access a file. If a file is replicated within the logical collection across multiple storage locations, the request for a file can be automatically routed to the closest copy.
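Access by name rather than by location can be pictured as a lookup followed by a choice: the logical name resolves to a list of replicas, and the request goes to whichever replica is cheapest to reach. The sketch below assumes a hypothetical replica catalogue and a hypothetical table of path costs (for example, hop counts) as seen from the client. It does not use any OpenFlow API, and the site names are invented for illustration.

```python
def closest_replica(logical_name, replica_catalog, path_cost):
    """Return the storage location of the replica with the lowest path cost."""
    replicas = replica_catalog[logical_name]
    return min(replicas, key=lambda location: path_cost[location])


# Hypothetical catalogue: logical file name -> storage locations holding a copy.
replica_catalog = {
    "/zone/survey/image_001.fits": ["site-a", "site-b", "site-c"],
}

# Hypothetical cost of reaching each site from the requesting client.
path_cost = {"site-a": 12, "site-b": 3, "site-c": 7}

target = closest_replica("/zone/survey/image_001.fits", replica_catalog, path_cost)
```

The client only ever supplies the logical name; which copy serves the request is decided by the catalogue and the current network costs.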

These applications imply that policy-based systems will become pervasive, and migrate into storage controllers and into the internet. The knowledge required for processing or transferring data can be captured as procedures that are automatically applied under policy-based control.

Reagan W Moore is lead developer of iRODS and principal investigator for the DataNet Federation Consortium at the University of North Carolina at Chapel Hill
