
How to make your data AI/ML-ready


Scientific Computing World gathered a panel of leading experts to discuss the impact that effective data management can have on driving research and laboratory efficiency, as well as enabling future capabilities. Our discussion soon shifted to emerging challenges, namely the importance of preparing data and systems for use with AI/ML tools.

 

Read the full report from our panel here


Building a foundation for AI/ML integration

The use of AI/ML with mass spec data is emerging, but much of the data-structure groundwork described above needs to be in place before any such tool can be used.

“There's a lot of talk about whether data is the asset or the model is the asset,” says Sebastian Klie, CEO of biotech company Targenomix (now part of Bayer). “In my opinion, it is actually the reasoning and decisions we make and take as scientists that is the real asset. This is what translates to business impact – data and models allow us to be faster and more accurate at taking business decisions.

“The foundation of the value chain of centralised analytical data is a single source of truth: a data set that is versioned correctly, so that any decision we make is always traceable.

“AI is definitely coming to our field. Presently, we are using classical machine learning models that are consuming the data that is integrated in centralised storage. For AI, we are seeing new challenges in ‘data readiness’, relating to the semantic data layers we need to have in place to give models the ‘right context’.

“In drug discovery, it’s all about the DMTA (Design Make Test Analyse) cycles – we need to own and optimise them in order to complete them faster to generate outcomes.”

Lisa M Bacco, Principal Scientist at animal health drug company Zoetis, agrees that data preparation is key. “There’s a difference between a data lake and a database, and a constant debate over unstructured versus structured data and how we leverage the advantages of both in the context of mass spec analysis,” she says. “There are so many layers between the acquisition of the ion signals within a sample, the association of those ions with a molecule (a metabolite, drug or peptide), and then its association with a biological pathway. Structured data is essential for organising and analysing the raw signals, while unstructured data is crucial for interpretation.

“To that end, we work really closely with a team of data scientists, engineers and digital biologists who are working tirelessly to generate a comprehensive ‘biolake’ and exploration tools to democratise the access and interpretation of all omics and non-omics derived data. These efforts will allow us to interrogate large data sets using ML models and other AI tools to action our data.

“As far as AI/ML is concerned, we’re looking at it from a biological perspective. How can we use AI or ML to identify biomarkers of disease? In order to have that information, we need comprehensive data provenance, including the origin of the mass spec file, the subject details, and all associated metadata. Additionally, we must have the outcome data for each subject or patient.

“Without that outcome data, ML models cannot be trained effectively for biomarker discovery. You need that whole picture to fully integrate AI and machine learning into answering biological questions. As access to ML and AI tools extends beyond bioinformaticians and data scientists, it’s important to socialise the impact and limits that incomplete data can have in biomarker discovery applications. Right now, a lot of people don’t understand that.

“The assumption is that comparing healthy tissue to diseased via proteomics analysis can reveal biomarkers of disease. In reality, that’s not true. Those are putative, measurable changes associated with a disease, but how does that translate to a true biomarker or classification fingerprint, which is where machine learning really shines? You need that outcome data for a more complete picture.”

 

Data quality and hybrid approaches

At LifeMine Therapeutics, Genomic Informatics Lead Kevin McConnell is already applying AI/ML, but not in isolation. “We’re using DreaMS, an AI tool that helps us generate embeddings of MS2 spectra as a way of enabling searches of metabolomics data,” he explains. “By using the embeddings, we are able to help predict the chemical structure – or at least identify the chemical class that those genes are making. It has its limitations, but we are able to use AI/ML to look at mass spec data in this way. We were able to train that model on large amounts of unlabelled MS2 data, then bring in smaller annotated (labelled) data sets that included known chemical classes. By combining those two sources of information, we were able to come up with a prediction model with good accuracy.

“We’ve also implemented clustering techniques that use similarities in MS2 embeddings to identify groups of similar structures within a sample, known as molecular networks, in metabolomics data. You never get just one molecule; you get a family of molecules, which aids confidence in identification.”
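As an illustration of the kind of embedding-based grouping McConnell describes, here is a minimal sketch. It assumes MS2 embeddings are already available as vectors (the function names are illustrative, not part of DreaMS): spectra whose embeddings exceed a cosine-similarity threshold are linked into connected components, a simple stand-in for molecular networking.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between embedding vectors (rows)."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T

def molecular_networks(embeddings, threshold=0.8):
    """Group spectra whose embeddings are similar into connected
    components ('molecular networks') using a simple union-find."""
    sim = cosine_similarity_matrix(embeddings)
    n = len(embeddings)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Link every pair of spectra above the similarity threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

In practice the threshold (and the embedding model itself) determines how tight each molecular family is; real tools typically also cap the number of edges per spectrum.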

Prashi Jain, Director of Drug Discovery at Iterion Therapeutics, says not all AI tools are equal. “Their effectiveness can vary depending on the specific data class,” she says. “Right now, AI models are most useful for providing us with initial predictions regarding specific compound characteristics, but we still heavily rely on experimental validation of those predictions.

“Having the right data sets is crucial for training models and making accurate predictions. Generally, the more data you have for training and testing, the better the outcomes. That said, it’s not always clear how much data is ‘enough’ or at what point your predictions become truly dependable.”

Birthe Nielsen, a consultant with the Pistoia Alliance (a global, not-for-profit alliance of life science companies, vendors, publishers, and academic groups), emphasises the need for well-structured data before considering AI/ML tools. “AI cannot replace structure and semantics,” she says. “There's lots of talk about how AI can be extremely powerful, but only if it’s fed well-structured, well-annotated and contextualised data. Otherwise, you simply won’t be able to trust the output. Annotated early-stage data, such as the MS runs, could become searchable and comparable across projects.

“We get asked if clients can’t just use LLMs instead of ontologies as, on the face of it, they appear much cheaper and easier to implement. It’s important to understand that these two approaches are complementary – LLMs benefit from structured data being there in the first place.

“A lot of companies don’t like to share data, but there are opportunities for federated learning, whereby data can be used to train AI models without the underlying detailed data being visible. We have run a few projects to see if we can develop this further.

“Should all data be shared? No, of course not, but I think that the pharma companies sit on a lot of data that could be shared and could be really valuable, especially for smaller companies or academia. Being a bit more open with data could help train these language models. Again, having an industry-wide standard brings the advantages of being able to bring in public datasets and so on.”
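The federated-learning idea Nielsen mentions can be sketched with federated averaging: each site trains on its own data, and only model weights, never the raw data, leave the site. The example below is a generic illustration using a simple linear model and gradient descent; it does not represent any specific Pistoia Alliance project.

```python
import numpy as np

def local_update(weights, X, y, lr=0.01, epochs=5):
    """One client's training pass; the raw data (X, y) never leaves the site."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # mean-squared-error gradient
        w -= lr * grad
    return w

def federated_average(weights, client_data, rounds=20):
    """FedAvg: each round, every client trains locally, then only the
    resulting weight vectors are pooled and averaged centrally."""
    w = weights
    for _ in range(rounds):
        local = [local_update(w, X, y) for X, y in client_data]
        w = np.mean(local, axis=0)
    return w
```

The privacy benefit is that the coordinator only ever sees weight vectors; production systems typically add secure aggregation or differential privacy on top, since even weights can leak information.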

 

Niche AI tools and data sources

Lars Rodefeld, a scientific consultant who spent 27 years at Bayer CropScience, has seen good examples of these smaller, application-specific AI tools popping up. “There’s a nice little start-up in Münster at the university called Chem Innovation,” he says. “They have been able to put GC/MS [gas chromatography / mass spectrometry] data together from different sources and build an AI forecasting model for structures. For lower-mass compounds, such as those found in flavours, fragrances, or oil refining, it works pretty well.

“We have used AI forecasting to reduce the number of data points we need to collect. In our water solubility project, for example, we might be looking at 10,000 data points being collected every year in our laboratory, each time doing a seven-point calibration measurement in the first instance.

“With forecasting, we’ve been able to get that down to a three-point calibration, effectively doubling the throughput of the machine and the person operating it.
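To make the calibration-reduction idea concrete, here is a sketch on synthetic numbers. It assumes a linear instrument response over the working range; the forecasting model that justifies which calibration points can safely be dropped is not shown, and all values are illustrative.

```python
import numpy as np

def fit_calibration(concentrations, responses):
    """Least-squares calibration line: response = slope * conc + intercept."""
    slope, intercept = np.polyfit(concentrations, responses, 1)
    return slope, intercept

def predict_concentration(response, slope, intercept):
    """Invert the calibration line to read concentration from a response."""
    return (response - intercept) / slope

# Seven-point vs three-point calibration on the same (noise-free,
# synthetic) standard series.
conc7 = np.array([0.5, 1, 2, 5, 10, 20, 50])
resp7 = 3.0 * conc7 + 0.2            # assumed linear response
conc3 = conc7[[0, 3, 6]]             # forecast-guided subset of standards
resp3 = resp7[[0, 3, 6]]

s7, i7 = fit_calibration(conc7, resp7)
s3, i3 = fit_calibration(conc3, resp3)
```

When the response really is linear, the three-point fit recovers the same line as the seven-point one; the throughput gain Rodefeld describes comes from trusting a forecast that the response will stay linear, and measuring only enough points to confirm it.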

“What I would really love to see is that we finally get to a place where we have all LC/MS, MS, or NMR data talking to other systems that are used for structure forecasting. You might not need the raw data, perhaps just the fingerprint or the Fourier-transformed data – but that should be enough to feed an LLM. Existing instrumentation only really offers incremental forecasting, as it doesn’t yet have the reputation of being more precise.

“We do need shared platforms – those that get data out of proprietary instrumentation formats – that will really drive innovation through shorter test cycles.”


Targenomix’s Klie says niche areas of research may not be so AI-ready. “Generally in our field, most of us don’t have enough data to train huge models and develop complex architectures from scratch,” he says. “We use embeddings from existing LLMs in the genomic and proteomic space in order to put the entities under investigation, whether chemical structures, genes, or protein sequences, into context.

“These architectures have been developed and trained by companies using vast amounts of publicly available data.”

 

Other barriers to AI/ML adoption

As well as the need for the underlying data to be at a sufficient level of preparedness, there are still many other bumps in the road that need to be navigated before the widespread adoption of AI/ML models in this field.

“AI/ML tools are models; they’re not absolute truth,” says LifeMine’s McConnell. “Additional investigation will always be required to get to the real, true details.

“Where we’ve seen AI really work well is in building on top of a foundation model, which traditionally needs a large data set, using smaller, high-quality, well-annotated data sets. The underlying foundational data doesn’t need the same depth of labelling, but your smaller, well-curated data sets can be used to fine-tune the foundation model and uncover new discoveries.”


At Targenomix, they take a similar approach, as Klie explains: “We use LLMs trained on huge amounts of data and refine them to our tasks. This is referred to as transfer learning – a machine learning technique in which a model pre-trained on one task is ‘re-used’ as the foundation for a second, related task in which we are interested. So far, this has worked quite well for us and served our needs. However, there is more of a problem with explainable AI. In many cases, those models that we use are black boxes, often due to their complexity, and we need to invest a lot of time and effort to make them explainable and understandable.

“Another example: we might build a model for use in a particular applicability domain, but those projects are finite. Now, if I want to move that model to work on novel data that potentially follows a different distribution or structure, I need to be able to trust and understand the model before I know I can use it in this new area.

“As humans, we need to understand that when we make decisions based on models, we need to make sure we are doing so within the applicability domain of that model – i.e. the data space it was trained on, where the model’s accuracy or confidence is high. It is dangerous to trust ‘black box’ models, because they will often still provide us with outputs even when applied to data they haven’t yet seen.”
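The transfer-learning pattern Klie describes can be sketched generically: freeze a pre-trained feature extractor and train only a small task-specific head on top of it. Everything below (the random stand-in for "pre-trained" weights, the logistic head, the two-class task) is illustrative and does not reflect any actual Targenomix model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained model: a frozen projection mapping raw
# inputs to embeddings (in practice, an LLM's learned layers).
W_pretrained = rng.normal(size=(8, 3))

def embed(X):
    """Frozen feature extractor; its weights are never updated."""
    return np.tanh(X @ W_pretrained)

def train_head(X, y, lr=0.5, epochs=200):
    """Fit only a small logistic-regression head on the frozen embeddings."""
    Z = embed(X)
    w = np.zeros(Z.shape[1])
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-(Z @ w)))          # predicted probabilities
        w -= lr * Z.T @ (p - y) / len(y)        # logistic-loss gradient
    return w

def predict(X, w):
    return (1 / (1 + np.exp(-(embed(X) @ w))) > 0.5).astype(int)
```

Because only the small head is trained, the labelled data requirement drops from what the full model would need to what a logistic regression needs, which is the practical appeal of reusing pre-trained embeddings in data-poor niches.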


There is also the human problem in terms of users needing to change habits – something that Iterion's Jain has seen. “There is often some hesitation among laboratory scientists when it comes to adapting to new systems and practices,” she says. “If the process of integrating AI/ML requires the original data to be restructured, annotated or stored in a standardised format, the scientists who generated the data need to be part of that process to ensure fidelity. 

“These steps are sometimes viewed by scientists as ‘administrative duties’ that take time away from experimental work. However, without proper data curation at the point of generation, it becomes significantly harder for bioinformaticians to ingest, standardise and analyse the data for downstream use, including model training and long-term reuse."

 

Effective data curation is fundamental to success with AI/ML tools. Our panel also discussed the importance of identifying data challenges, and how to overcome them effectively from the outset.

To read more, download the full report here.
 
