Controlling your data

Given that one of the overarching aims of any R&D – or data-driven organisation – is to maximise the amount of useful information and knowledge that can be derived from vast quantities of disparate data, how do today’s software-fuelled labs make sense of and contextualise that data, not just for today, but so it remains relevant and insightful years down the line?

Longevity comes up in many discussions. ‘The key is to think about the data lifecycle, so you can then build in sufficient control and data management,’ suggests Nick Lynch, Pistoia Alliance investment lead and one of the founders of the organisation. ‘How is that data generated, where does it go, what do you want to understand from it, and how might it be used in the future?’

Scientific data is not disposable: it has a life beyond its initial creation, and this is a key consideration for labs that are setting up or upgrading their data management systems. ‘Whether that data is generated at the R&D stage, or at later-stage clinical trials, having complete oversight and control of every aspect of that data is imperative, so it will still be accessible, relevant and usable in the context of future experiments and analysis, and especially re-analysis using AI/machine learning methods, perhaps 20 years down the line.’

FAIR principles

‘The concept of controlled data should also fit in with the foundational principles of FAIR data: findability, accessibility, interoperability, and reusability,’ Lynch continues. ‘I would add “quality” to those principles.’ The prevalence of high-throughput and high-content workflows, and the breadth of data now generated, means that labs can’t just look at their data in terms of its endpoints, but must have systems in place to accurately manage all of the metadata. That means documenting which cell lines have been used and which supplier they came from, who carried out the experiments, the source of reagents and consumables, and the maintenance of any equipment. Context and control go hand-in-hand, he suggests. ‘Only then can you truly compare your data with that from future or past experiments.’
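As a minimal sketch of what capturing that context might look like in practice, the Python snippet below models one experiment's metadata and reports which fields were left empty. The class name, field names and sample values are hypothetical, chosen only to mirror the items listed above; they are not taken from any Pistoia Alliance tool or standard.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class ExperimentMetadata:
    """Hypothetical record of the context around one experiment."""
    cell_line: str = ""
    cell_line_supplier: str = ""
    operator: str = ""
    reagent_sources: str = ""
    equipment_maintenance_ref: str = ""


def missing_fields(record: ExperimentMetadata) -> List[str]:
    """Return the names of metadata fields that were left empty."""
    return [
        f.name
        for f in fields(record)
        if not getattr(record, f.name).strip()
    ]


# Illustrative values only: two fields filled in, three left blank.
record = ExperimentMetadata(cell_line="HeLa", operator="A. Scientist")
print(missing_fields(record))
```

A check of this kind, run when a result is recorded rather than years later, is one simple way to flag incomplete context at the point of creation.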

Set in place systems that can effectively husband and provide access to all data and metadata, and you will have the quality of data needed to exploit AI and ML tools and algorithms that can further identify patterns and generate new insights from data streams originating from different sources. ‘I think there are two aspects to consider when bringing in AI/ML,’ Lynch notes. ‘These are, “what can I do to ensure my data is of high enough quality to feed into these algorithms?” And “what can AI/ML – perhaps more accurately described as augmented intelligence – then do with that data to help build models that I can be confident are relevant?” If your data isn’t up to scratch, complete and accessible, there is no point.’

Human responsibilities

The role of the scientist, and their experience and skills in designing experiments, is also critical and shouldn’t be underestimated, but scientists must also understand the need to collect all of the contextual data and metadata around an experiment. ‘It’s very much a human responsibility to make sure that data and metadata are correct at the point of creation,’ Lynch noted. ‘You don’t want to have to try and fix data terminology or language somewhere down the line to make it fit the required format. Not only would you then run a risk of reducing the usability of that data, but human error may come into play. It makes good sense to have set in place enterprise-wide data standardisation – this also marries with the findable and interoperable principles of the FAIR data guidelines.’

The Pistoia Alliance is developing a FAIR toolkit to help companies adhere to FAIR principles, and encourage the use of best practice and learning throughout an organisation. ‘The term FAIR is actually quite wide-reaching, so you need to bring it down to practical levels for day-to-day operation. It’s not necessarily about reinventing anything, but more about making sure that everyone is aware of and implements best practice.’ The Pistoia Alliance is working with many of the other FAIR initiatives, including the IMI FAIRplus project, to get the best outcomes.

Unified Data Model

The initiative’s Unified Data Model (UDM) supports experiment language standardisation, so that it becomes possible to share that data both within an organisation and with third parties. Standardisation of vocabularies and ontologies reduces the likelihood that data is misinterpreted, and increases confidence in that data, Lynch indicates. ‘You are also improving your efficiency, both day-to-day and for the longer term, because the process of data acquisition, storage, utility and reporting is more seamless.

‘We are working with some of the world’s biggest biopharma companies, including Roche and GSK, to develop a unified data format that will hopefully go some way to helping people exchange data and experimental information. This will also work alongside initiatives such as the Allotrope Foundation, to help make the process of standardisation seamless across disciplines.’

How do companies start to put in place an infrastructure that will support data control and the FAIR principles? ‘It’s probably worth starting with a kind of cartoon view of the data lifecycle, including who you may need to share that data with,’ Lynch said. ‘Just that outline or sketch gives you a more complete view of who your key stakeholders are, in what formats you will need to deliver that data to them, and what data may also need to be brought in from third-party partners, contract research organisations (CROs) or from the public domain. From there, you can start to derive some idea of how to store your data, make it more interoperable, and ease data exchange and analysis to drive scientific decision making.’

Achieving quick wins

Implementing an underlying infrastructure is just as much a business change as it may be a technology or human-oriented change, Lynch noted. ‘It will almost certainly not be a quick fix but quick wins can be achieved. Software and data investment must be paralleled with an investment in educating personnel.’

Breaking the process down into manageable chunks and making small, quick wins during early implementation of new software and processes can help to get people on board, and make the value of change immediately evident. ‘Whether that means deploying UDM or using the FAIR toolkit, these stepwise changes will show short-term benefits and encourage wider adoption,’ Lynch notes.

Increased quality of data

Understanding what a lab expects to do with its data and how scientific information is captured and then managed to allow that use is fundamental to the concept of data control, according to Jabe Wilson, consulting director for text and data analytics at Elsevier. It's all part of this same basic concept of long-term utility that inevitably filters through every discussion. ‘That understanding should go hand-in-hand with setting in place tools that can help improve the quality of data, and ideally apply standardised taxonomies and dictionaries. Increased quality of data will increase confidence in its utility, improve interoperability, and also help users derive more contextual relevance. That assurance of data relevance and quality means that AI and machine learning can be exploited to derive meaning from patterns and in-depth analyses.’

Another principle that may be overlooked is that of timeliness, Wilson continues. ‘This is about making sure that your systems for controlling data can also allow that data to be processed quickly so that it is available in the right form, and in real time.’ Laboratories need to embrace the concept of data science and its potential to accelerate R&D and product development, Wilson suggests, referring to comments by Novartis’ chief digital officer Bertrand Bodson, who in an interview shortly after his appointment in 2018 stated, ‘We already rely on data, but how can we unlock its power to drive more of our decisions, so that we can get better drugs to patients faster?’ In reality that’s the bottom line for any pharma company, and for any R&D-driven industry.

‘Making the most of data will likewise be a driver of competitiveness and ultimately success, aiding faster, more insightful decision making,’ Wilson points out. ‘To do that, companies need complete control of their data, so that they can easily find it, understand it, mine it and analyse it collectively.’ Lack of control of data can therefore impact on competitiveness. ‘One pharma company that has a better handle on their data than another will be better informed for making decisions on pipeline, and ultimately may get to market sooner.’ Better data means more informed decisions, and so lower attrition rates.

Data wrangling

It may seem obvious, but most labs are way off that ability to usefully exploit every piece of data. As Bodson said in the same interview: ‘Our data scientists probably spend 80 per cent of their time right now on data wrangling to get the data in good shape, which is really a pain.’

This is not an uncommon bottleneck, Wilson notes, but companies may be slow to understand that investing in new tools – or ensuring full application of existing tools – that can help to get data in the right shape, will pay off in the long run. Novartis is investing in building a platform to organise its data and ensure that it is fit for purpose, findable and accessible – we come back to FAIR Data principles – but that investment should start at the level of the scientists who generate and use that data, so that they understand how and why they should record and annotate their results, Wilson said.

Avoiding the need to ‘fairify’ data

‘If you are carrying out an experiment you obviously make sure that it will deliver results for immediate use, but you should also ensure that you capture, record and make accessible every bit of data that may be relevant to future use of those results. This will then make your data assets far more usable, both across a business, and also between partners or service providers, such as contract research organisations (CROs). Build this concept in from the ground up and you won’t have the literal, time- and effort-related costs of having to ‘fairify’ your data at a later stage.’

Wilson acknowledges that the complexities of data control are not the same for every industry. In life sciences and pharma in particular, diverse, content-rich and high-throughput technologies for biology and chemistry generate vast quantities of disparate data, potentially across many different disciplines. ‘You have to ensure that you retain all of your data in a usable format, and that, if necessary, it will pass regulatory muster today and potentially years down the line.’

The need for data standardisation comes up in any discussion on data management or control. Wilson said: ‘Standardised taxonomies, dictionaries and ontologies make it possible to interrogate and compare data from one source, experiment or scientist, alongside data from any other sources. Make sure your identifiers and synonyms are aligned, whether you are referring to protein or gene identifiers, or describing diseases or phenotypes.’
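To make the point about aligned identifiers concrete, here is a minimal Python sketch, assuming a small hand-built synonym table. The table, function names and lookup approach are illustrative assumptions and do not represent Elsevier's tooling or any published ontology; a real system would draw on a curated, versioned vocabulary. The two gene examples use widely known aliases (HER2 for ERBB2, p53 for TP53).

```python
from typing import Dict, Optional

# Illustrative synonym table (an assumption, not a curated ontology):
# lower-cased alias -> preferred gene symbol.
GENE_SYNONYMS: Dict[str, str] = {
    "her2": "ERBB2",
    "erbb2": "ERBB2",
    "p53": "TP53",
    "tp53": "TP53",
}


def normalise_gene(label: str) -> Optional[str]:
    """Map a free-text gene label to its preferred symbol, or None if unknown."""
    return GENE_SYNONYMS.get(label.strip().lower())


def same_gene(a: str, b: str) -> bool:
    """True only when both labels resolve to the same preferred symbol."""
    na, nb = normalise_gene(a), normalise_gene(b)
    return na is not None and na == nb


print(same_gene("HER2", " erbb2 "))  # labels from two different sources
print(same_gene("p53", "ERBB2"))
```

Returning None for unknown labels, rather than guessing, keeps unmapped terms visible so they can be reviewed and added to the shared dictionary.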

Elsevier has applied its expertise in this area to develop a suite of data tools that it is making available to the industry. The organisation’s PharmaPendium gateway gives customers the ability to search regulatory documents and data from FDA and EMA, and add insight to in-house R&D, without adding to the complexity of data husbanding. The cloud-based Entellect platform has been developed to deliver harmonised and contextualised data that can then be exploited for advanced AI-driven analytics, Wilson explained.

Entellect effectively links and manages disparate in-house and third party data, and gives users access to Elsevier’s own databases and content collections for pharma R&D. It is then possible to leverage off-the-shelf data analysis applications or, through Elsevier’s Professional Services division, develop custom analytics applications.

At its most basic level Entellect acts as an integration tool so that users can make their own data ‘fair’. But it is also an open platform, Wilson explained: ‘Entellect is also API-friendly, so users can interrogate the system and pull out information that they need.’

‘Users have immediate access to diverse data, from drug chemical information to clinical trials data, and can use that data in any way they like. It’s a hugely powerful tool that gives people ultimate control over their data and how it is applied to generate meaningful intelligence.’

Wilson develops the idea of data science as an art. ‘You need to provide your scientists with the tools they need to be very intuitive and flexible about the work they do, but with the discipline to know how to collect, manage and control that data. This may need quite deep cultural changes so that it will eventually become second nature.’
