What is the future of data-centric biotechnology?

Biology’s complexity is emergent: its intricate and novel properties, patterns and behaviours arise from the interactions of simpler component parts. These emergent properties are not directly predictable from the properties of individual components and often require a holistic understanding of the system’s interactions and dynamics. This means that merely harvesting huge amounts of data is not enough: the data must be comprehensive and interconnected enough that the interactions at the heart of biology can be identified.

Here I’d like to explore the different aspects of the data we will need to generate in order to achieve a meaningful understanding of the biological systems we work with.

Data volume and variety

The future of data-centric biotechnology probably looks fundamentally different from how biology looks and operates today. Much of the current focus is on gathering large and varied datasets (think multi-omic data from large numbers of patient groups). These large “observational” datasets can be extremely valuable for specific parts of the drug discovery process, such as target identification, but are unsuited to other stages of bringing a drug to market. So while it is clearly necessary to have large datasets in order to gain insight into biology, it’s not sufficient. For example, the validation of targets relies in part on the perturbation of clinically relevant cellular models: we need to understand how these cells behave as we disrupt the expression or activity of our target of interest. The data here are not merely observations but are actively generated as part of an experiment.

Multidimensional data from experiments

Living systems are not static: genomes stay mostly fixed, but every other aspect of an organism or cell line will change dramatically over time, depending on a wide range of factors relating to the conditions the system is subjected to. This means that to build a meaningful model of how a system behaves, we need data that effectively describe that dynamism. Sheer volume of data doesn’t cut it for this. If you gather all your data under a limited set of conditions, the result can be woefully inadequate or, at worst, actively misleading. Instead, we need to generate datasets that span large numbers of dimensions, dimensions that correspond to all the changes in conditions and timings that might be relevant to that biological system. This, combined with gathering as much data as possible for each set of conditions, will bring about a step change in our understanding of the biology we work with.

Data quality

So we need multidimensional, dynamic data, and we need lots of it. But all of this is still useless if we don’t get the basics right. Quality of data can be measured in many ways, but I’d argue there are two main aspects we need to pay most attention to. First, we have to measure the right things. Often the temptation is to measure what is easiest to measure: looking for our keys under lamp-posts, not in the dark and tangled bushes where they’re more likely to be. This will lead us to make beautifully sophisticated models that are also completely irrelevant. Second, we must make sure that our assays are as high-quality as we can make them. According to the NIH Assay Guidance Manual, expenditure on assays equates to more than a third of the pre-clinical outlay for an example drug discovery program, so ensuring that these assays give the cleanest data possible is paramount.


Data context

Large, varied, dynamic, high-quality data. It sounds like a utopian vision, but it’s still missing a key component if those data are going to have enduring value. All data is produced for a reason and has a place within a broader landscape; without this context it is utterly worthless. Consider: the key output from a critical experiment could be a CSV file with 96 numbers in it. It could hold within it the information that identifies a fantastic lead compound for an as-yet untreated disease. But by itself, with no context, it is just a collection of seemingly random numbers.

On a less dramatic note, this is the fate of all data that leaves the scientist who generated it without metadata describing how and why the data was produced. Ideally, we’d capture as much context as possible, so that future, as-yet-unimagined AI-driven analyses can make the fullest possible use of today’s data.

The future: data-centric biotechnology

We recently ran some research that found a staggering 43% of R&D leadership have low confidence in the quality of their experiment data. This is concerning because it doesn’t just demand we improve our means of recording experiment data; it also demands we perform experiments that generate higher-quality data in the first place. It follows that to understand this data correctly we also require a high level of granularity about how it was created: metadata about the experimentation itself should be collected wherever possible.

Some companies today already exhibit many of the characteristics of a future-facing approach to data gathered from biological systems. Think of companies like Recursion and Insitro, which have built whole automated platforms around this. Fully digitized, they are built to systematically create a greater understanding of biological systems.

They give us a glimpse of what the future may look like: the routine generation of high-quality, large, varied, multidimensional data, in the full context of rich metadata. Data that provides the foundation for AI, and a step change in our ability to understand and work with biological systems.

Markus Gershater, Chief Science Officer, Synthace

