Data lakes and cloud computing
For research and development organisations, the rise of instrument and process automation is leading to a phenomenal increase in the amount, variety and complexity of scientific data that is gathered. All this data needs to be made available so it can be integrated into projects and new scientific approaches, both now and in the future. The requirement for this data to be useable has been growing over the past decade and is reaching a critical point.
Instrument data is driving new science and, as organisations move to large image-based and high-density data structures to support work such as phenotypic screening, the data types used are advancing from the simple text formats of old. To ensure these new data types are (re)useable in R&D and are consumable by existing and emerging technologies such as Artificial Intelligence (AI) and machine learning, the data has to be accessible, clean and adequately tagged with metadata. These high-value ‘data lakes’ can become silted up and quickly turn into swamps if data is not properly tagged with all relevant contextual information – projects, tested molecules, results, downstream use, conclusions, derived data, related data etc.
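As a rough illustration of what "adequately tagged" could mean in practice, the sketch below models a lake record as a pointer to the raw data plus a metadata dictionary, and reports which records are missing contextual tags. The class name, the tag names and the example URIs are all hypothetical, chosen only to mirror the kinds of context listed above; a real lake would define its own schema.

```python
from dataclasses import dataclass, field

# Hypothetical set of required contextual tags, loosely based on the
# kinds of context named in the text (projects, molecules, results, use).
REQUIRED_TAGS = ("project", "tested_molecules", "results", "downstream_use")


@dataclass
class LakeRecord:
    """One item in the lake: a pointer to the raw data plus its metadata."""
    record_id: str
    uri: str  # location of the raw file (illustrative value only)
    metadata: dict = field(default_factory=dict)


def missing_tags(record, required=REQUIRED_TAGS):
    """Return the required tags that are absent or empty on a record."""
    return [tag for tag in required if not record.metadata.get(tag)]


def swamp_report(records):
    """Map record_id -> missing tags, for under-tagged records only."""
    report = {}
    for rec in records:
        gaps = missing_tags(rec)
        if gaps:
            report[rec.record_id] = gaps
    return report


records = [
    LakeRecord("img-001", "s3://example-bucket/plate1.tiff",
               {"project": "P-17", "tested_molecules": ["CMPD-1"],
                "results": "hit", "downstream_use": "SAR model"}),
    LakeRecord("img-002", "s3://example-bucket/plate2.tiff",
               {"project": "P-17"}),
]

print(swamp_report(records))
```

A report like this could be run whenever data is ingested, so that gaps in context are caught before the untagged data accumulates.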
Designing data lakes and keeping them in good health requires constant effort, but cloud technologies such as object storage (S3) and adaptive indexing technologies (NoSQL databases, triple stores) will help. While some people think of a data lake, or even data itself, as a static picture once it has been captured, in reality data needs to be continually enriched and augmented with learnings. Often, informatics organisations consider the data as the record – and in some cases it is – but it does not have to be cast in stone and ‘stored’. Intellectual property (IP) records can be captured and stored in separate systems, while the working data is held in other data structures and ‘put to work’.
Enrichment is a hot topic in the pharma informatics domain. We have seen the emergence of many tools that all essentially do the same thing: make data more consumable or discoverable by scientists and computers. Semantic enrichment or natural language processing has been around for many years and has shown good benefits particularly in the healthcare domain, where it is used to extract and normalise data from clinical trials.
In Pharma R&D, the enrichment approach is gaining traction with the prevalence of new technologies and commercial offerings. Ontological, taxonomical and semantic tagging are set to become mainstream as the technology and application integration becomes easier and vendors deploy their tools in the cloud.
A corporate data lake must be defined and viewed as the place to go to find, search, interrogate and aggregate data – making it easier for data scientists to investigate and build data sets for their work. Find and search are two separate concepts here: find is when you know what you are looking for; search is when you do not, and want to explore the data.
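The difference between the two access patterns can be sketched in a few lines. In this minimal example, 'find' is a direct lookup by a known identifier, while 'search' filters on metadata and is paired with a simple facet count that suggests how to narrow an exploration. The catalogue contents, identifiers and function names are invented for illustration and do not describe any particular product.

```python
# Toy catalogue; every identifier and field value here is made up.
CATALOGUE = {
    "EXP-1001": {"assay": "phenotypic screen", "project": "P-17",
                 "instrument": "imager"},
    "EXP-1002": {"assay": "binding", "project": "P-17",
                 "instrument": "plate reader"},
    "EXP-1003": {"assay": "phenotypic screen", "project": "P-22",
                 "instrument": "imager"},
}


def find(record_id):
    """Find: the caller already knows which record they want."""
    return CATALOGUE.get(record_id)


def search(**criteria):
    """Search: return the ids of all records matching the field values."""
    return sorted(
        rid for rid, meta in CATALOGUE.items()
        if all(meta.get(key) == value for key, value in criteria.items())
    )


def facets(field_name, record_ids=None):
    """Count the values of one field, to suggest ways to narrow a search."""
    ids = CATALOGUE if record_ids is None else record_ids
    counts = {}
    for rid in ids:
        value = CATALOGUE[rid].get(field_name)
        counts[value] = counts.get(value, 0) + 1
    return counts


print(find("EXP-1002"))
hits = search(assay="phenotypic screen")
print(hits, facets("project", hits))
```

Here the facet counts give a scientist who does not yet know what they are looking for a way to see what the lake contains before committing to a precise question.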
A data lake must be integrated into all systems that are part of the data lifecycle, crudely: creation, capture, analysis and reporting, so that all aspects of the R&D data landscape can be consumed and leveraged, re-indexed and continually enriched. A data lake should not be viewed as a regulatory or intellectual property (IP) store – it needs to be a living ecosystem of data and indices that adapts to the needs of the science and business.
Pharma is looking to shift to a situation where it can be much more data-driven. But first, data must be discoverable for scientists, data scientists and the applications they use. These data jockeys need access to vast quantities of highly curated data to do their jobs – and data lakes are likely the best answer.
AI and related tools such as deep learning, augmented intelligence and machine learning all need much the same input as data scientists do: lots of well-annotated data. Adding more tags and metadata to a set of data is something that sits at the heart of what a true data lake should be – and the impact could be far reaching. The data volumes are huge and this leads to a couple of issues. Where should this data be stored? And how can it be made searchable? This is where the cloud helps.
Whilst searching is often discussed in a macro sense – Google-type searching for example – the questions that scientists want to answer are not always ‘keyword’ or phrase based. Scientific questions are far more intricate and need more than just typical text indices: they require fact-based searching and relationship-based searching too.
This requirement means data must be treated as a living organism and structured in a way that can handle tricky questions. Each of the ‘index’ types needs to be aware of the others so you can jump between concepts, while remaining easy to update when new data types are introduced.
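One way to picture index types that are aware of each other is to key them all on the same entity identifiers, so a query can start in one index and hop into another. The sketch below is a toy, in-memory illustration under that assumption: a text index, a fact index and a relationship index (in the spirit of a triple store) share ids, and a single question crosses all three. The class, the predicate name `tested`, the attribute `ic50_nM` and all data values are hypothetical.

```python
from collections import defaultdict


class LinkedIndices:
    """Three toy indices that share entity ids so queries can hop between them."""

    def __init__(self):
        self.text = defaultdict(set)    # word -> entity ids mentioning it
        self.facts = {}                 # (entity id, attribute) -> value
        self.edges = defaultdict(set)   # (subject, predicate) -> object ids

    def add_text(self, entity_id, text):
        for word in text.lower().split():
            self.text[word].add(entity_id)

    def add_fact(self, entity_id, attribute, value):
        self.facts[(entity_id, attribute)] = value

    def add_relation(self, subject, predicate, obj):
        self.edges[(subject, predicate)].add(obj)

    def text_search(self, word):
        return set(self.text.get(word.lower(), set()))

    def related(self, subject, predicate):
        return set(self.edges.get((subject, predicate), set()))

    def fact(self, entity_id, attribute):
        return self.facts.get((entity_id, attribute))


def potent_compounds_in(idx, word, max_ic50_nm):
    """Text hit -> 'tested' relationship -> numeric fact filter."""
    hits = set()
    for experiment in idx.text_search(word):
        for compound in idx.related(experiment, "tested"):
            value = idx.fact(compound, "ic50_nM")
            if value is not None and value <= max_ic50_nm:
                hits.add(compound)
    return sorted(hits)


idx = LinkedIndices()
idx.add_text("EXP-1", "Cytotoxic effect seen at high dose")
idx.add_text("EXP-2", "No effect observed")
idx.add_relation("EXP-1", "tested", "CMPD-A")
idx.add_relation("EXP-1", "tested", "CMPD-B")
idx.add_relation("EXP-2", "tested", "CMPD-C")
idx.add_fact("CMPD-A", "ic50_nM", 40.0)
idx.add_fact("CMPD-B", "ic50_nM", 900.0)
idx.add_fact("CMPD-C", "ic50_nM", 10.0)

print(potent_compounds_in(idx, "cytotoxic", 100.0))
```

Because each index is a separate structure joined only by ids, a new data type could be supported by adding another index without rebuilding the existing ones, which is the updatability the paragraph above asks for.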
This is not easy, but rapid progress is being made through the deployment and use of cloud storage, semantic enrichment, alternative data structures, data provisioning, data ingestion, analysis tools and AI. All these technologies have a part to play, and how heavily each is used depends on the questions being asked of the data. The cloud is the best way to leverage these technologies in a cost-effective and consumable manner – vendors just need to make sure their applications are prepared.
Paul Denny-Gouldson is VP, Strategic Solutions, at IDBS