More data, fewer tools, more powerful software

In September Louis Culot became chief executive officer of BioData, a Digital Science company. He talked to Siân Harris about the informatics challenges presented by the growing volumes of laboratory data.

What are the challenges, as you see them?

Having data and information in a system other than the notebook is a challenge. Laboratory management systems, for example, don't generally create a snapshot in time; they tend to present a dynamic picture of the current state. Things like instrument calibration dates might or might not be kept, and they have to be synchronised with when the data were generated.

You have one system that is your authority. Sometimes it is easier to snapshot data into the notebook if the dataset or other record is not huge. Where the data is huge, it makes more sense to link and make sure that the repository where the data is located is maintained to the same standard. But that should be done judiciously since it adds cost and maintenance complexity.

I think there will be a move to fewer, more capable applications that are integrating various disciplines and elements of the lab process. I think there will be an increase in the amount of data stored and shared - in fewer, more capable tools, with not so many silos.

What trends have you seen with experimental data?

One of the largest changes, which is still underway, is the movement towards data entry at the point of capture rather than at the end of the day. The other big trend is the increase in storage, which means that everything can be captured and analysed later.

Data quantity is a practical challenge. My past role was in genomics, where datasets can be huge and are currently kept somewhat in check since costs escalate quickly with the number of subjects. One of the challenges in that field is getting statistical power at cost, and as costs continue to fall the data storage and analysis problems continue to rise. Once you have the raw data, the bigger challenge is annotating and making it searchable. Science has to explain what it is you interpreted.

There are three big challenges that the research community is facing that digital systems can help address. First, there is research productivity. There is now much more of a focus on what kind of output we're getting for tax dollars, foundation, or commercial investment.

Another big challenge that’s recently been a major focus is reproducibility. Studies have found that, conservatively, more than 70 per cent of certain types of studies were not reproducible. This could be for a few reasons, but the one we’re primarily focused on is transparency.

Experimental protocols can vary in their sensitivity to inputs – there are examples where people can't get something to work unless they use a particular reagent from a particular vendor, or where the result depends on verifying that the sample is intact and pure after extraction. The importance of this is not always clear from the methods section of papers but should have been captured in the laboratory notebooks. If we get transparency into the process we believe we can have a significant impact on this problem.

The other thing with transparency is the lack of laboratory control of materials. For example, in the USA recently there have been cases of samples of smallpox, avian flu, and other biohazardous materials, some decades old, being found in the corners of labs. So, control of materials and reproducibility can actually go hand-in-hand.

What are your thoughts on open data?

The bottom line is that open data is great. It helps with reproducibility. However, going beyond the paper to the experiment presents an extra challenge to scientists. They have to record their data with an eye to how it is going to be presented and read by others. This means that they have to pay more attention to the reader who might want to reanalyse the raw data.

Like most trends, it will have to catch on in the culture. At the moment, scientists feel that their lab notebooks are their property. I hear scientists say 'yes, people can read my notebook but they won't know what they are reading'. This isn’t a problem for electronic systems, since virtually all of them let scientists keep their records private. But it is a challenge if we’re asking them to open them and make them part of the paper.

I've never heard them say that they don't want to be scooped but I think that is a fear too. Or maybe there is something they have observed in their experiments that they want to take further themselves, so don’t want to make public.

Our focus is on life sciences, mainly biology and chemistry. We are fairly biology heavy right now because that is the area that has been the least addressed. As far as organising data goes, there isn’t a massive difference between subjects, but you do have to understand the type of data.

Dr Siân Harris is the editor of SCW’s sister publication, Research Information. A longer version of this article will be published in the October/November 2014 issue of Research Information.
