Supercharging the bioinformatics data silo
Sophia Ktori explores the role bioinformatics software plays in helping scientists to make sense of complex scientific data.
A key ongoing issue for organisations involved in drug discovery research is how to manage chemical and biological data in combination.
Abraham Wang, head of marketing at Collaborative Drug Discovery (CDD), said: ‘Drug discovery and development inevitably combines chemistry and biology, but there has traditionally been no intuitive, data-rich way to hold and interrogate data on both types of entity in one repository.’
This results in what Wang calls ‘a data silo’ between biologists and chemists: ‘Our customers tell us there are plenty of software systems that cater for the chemists who are synthesising compounds, and separate systems for the biologists who are running the assays to test the efficacy of the compounds. But you end up with an artificial division of data because the one won’t work alongside the other.’
CDD’s flagship CDD Vault offers a complete informatics platform and ELN for housing, managing and querying all compound and experimental data and metadata, in one space. ‘Our strengths have traditionally been in managing small molecules and structure activity relationships,’ Wang said. ‘But we also realised an ever-growing need for a platform that could also handle biologics, and this represented an important step in our evolution.’
The goal was to fill this gap between the management of chemical and biological data. ‘We wanted to develop the Vault into an interconnected platform for biological and synthetic molecule data management and querying – after all, 50 per cent of new drugs being developed today are biologics,’ Wang noted. ‘It’s about bringing chemists and biologists together, not keeping them apart.’
With enhancements to the platform released at the end of 2021, CDD Vault users can now register and analyse their biologic entities both alongside, and in combination with, synthetic molecules. ‘It’s a great start,’ Wang said. ‘Our customers who have been using CDD Vault for compound management can now register and manage their plasmids, antibodies, peptides, proteins, nucleic acids and even mixtures and complex entities, such as antibody drug conjugates that combine an antibody with a linker and a synthetic compound.’
Key data is stored for each molecule in a particular registration entity – say, nucleotide, amino acid, mixture – and properties such as molecular weight and other compositional information are then automatically generated by the Vault for each registration. ‘Within the next few months, we are going to add some key additional features, including the capability to carry out sequence searches and visualise plasmids,’ Wang added.
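As a rough illustration of the kind of compositional property that could be derived automatically when a nucleotide sequence is registered, the Python sketch below computes sequence length, base counts and GC content. The function name and the returned fields are hypothetical and are not taken from CDD Vault. Molecular weight is left out because it would additionally need a table of residue masses.

```python
from collections import Counter


def nucleotide_composition(sequence: str) -> dict:
    """Return simple compositional properties for a DNA sequence.

    Hypothetical helper for illustration only; not part of CDD Vault.
    """
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("sequence is empty")

    # Reject anything other than the four unambiguous DNA bases.
    unexpected = set(seq) - set("ACGT")
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")

    counts = Counter(seq)
    gc_fraction = (counts["G"] + counts["C"]) / len(seq)
    return {
        "length": len(seq),
        "base_counts": {base: counts[base] for base in "ACGT"},
        "gc_fraction": gc_fraction,
    }


if __name__ == "__main__":
    # Properties are derived from the sequence alone, with no manual entry.
    print(nucleotide_composition("ATGCGC"))
```

In a registration system, a function of this kind would presumably run once at registration time, so that every entry of the same type carries the same derived fields.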
The flexibility of the CDD platform means users can also either set up one single Vault for all of their entities, biological and chemical, or establish separate Vaults for, say, cell lines, nucleic acids, antibodies and compound libraries, so that specific entry fields can be set for each type of entity. Wang said: ‘What’s important is the Vaults can be cross-interrogated, so there is no loss of intelligence by having separate Vaults.’
As far as CDD is aware, this combined depth of management, degree of flexibility and interoperability isn’t available elsewhere in a single platform for both synthetic molecules and biologics. Ultimately, it allows organisations to retain and manage all their key data, and importantly, capture and retain metadata associated with experiments – with connection to the design, synthesis and testing of synthetic and biologic molecules – on a single platform, to help ensure there is no loss of either content or context.
Wang said: ‘Data is a lab’s most valuable asset, and by using CDD Vault, organisations can keep their chemical and biological data relevant, clean and accessible. We all understand that spreadsheets and paper notebooks can’t be searched or cross-referenced easily, and it’s also hard to keep these types of files up to date. With CDD Vault, organisations now have what we call a single source of truth immediately available. Everyone is looking at the same, current data, which facilitates decision-making and collaboration in real time.’
CDD Vault gives users an interconnected network view of compounds, biologics and experimental data across the ELN and registration system. Users can register every entity and reference how it is used in experiments, which gives them a deeper understanding of, and greater confidence in, assay results. ‘You never lose track of the relationship between your entities and experiments,’ Wang pointed out.
Importantly, CDD works with other software providers to help establish the Vault as an integral part of a lab’s informatics ecosystem, said Wang: ‘Biotech companies, and especially the smaller ones, aren’t likely to have a complete in-house infrastructure necessary to carry out the end-to-end discovery, optimisation and development as a seamless workflow. It’s also unrealistic to expect any single vendor to do everything for them. No one software can do everything, so any drug discovery pipeline will likely involve working with a number of vendors to cover all the bases.’
CDD thus works with a range of partners to integrate the Vault with complementary software platforms. ‘Our strengths are in storing, managing and mining data, and allowing organisations to collaborate easily and securely,’ said Wang. ‘But the drug discovery and development pipeline requires a whole raft of capabilities, and no one vendor can offer all of that, so we have established partnerships with multiple vendors, so users can integrate CDD with these other specialty applications.’
It’s what Wang calls ‘a best of breed’ approach: ‘You take the best in class for compound management, which is CDD Vault. You then make a connection, via the CDD API, with what we consider are the best platforms for, say, inventory management, lead optimisation, or analytics. If our customers choose to work with these providers, relevant data can be pulled directly out of Vault into these platforms, and then resulting data imported directly back into the Vault. It also means all of this new data is available, as it’s generated, in the Vault.’
Partners in this ecosystem include Certara, SarVision, DataWarrior, Elixir, Knime, Schrodinger, PostEra, Microsoft (CDD Vault’s ELN directly integrates with Microsoft Office), Titian and Optibrium.
Wang said: ‘So while we don’t have the resources to be everything for everyone, we integrate CDD Vault as part of an informatics environment with these key providers and their platforms. Our customers can leverage the technologies they need for their research and pipeline development, with CDD Vault at the centre of their data management.’
Another major challenge for drug discovery, and biotech generally, is that generating data is hugely resource-intensive, explained Matt Segall, CEO at Optibrium: ‘Big pharma has dedicated departments for screening and will generate large amounts of data, but biotechs may have to be very selective about what they measure for which compounds, on a cost basis, particularly if they’re outsourcing to contract research organisations.’
Even for big pharma, measuring all the properties and activities of interest for every compound of interest is cost-prohibitive, and so there will inevitably be incomplete, or sparse, data, he continued: ‘Another challenge is that data is typically noisy. Biology is messy, and measurements may be subject to experimental variability. When experimental errors creep in, you can really waste a lot of time and resources either pursuing a hypothesis that turns out to be incorrect, based on incorrect data, or – perhaps catastrophically – discarding a promising potential compound because of false negative results.’
The potential of AI to help make informed decisions on compounds is thus huge, Segall noted. ‘There’s a real appetite in biotech for AI. And again, AI isn’t just a topic for big pharma; we have a wide range of organisations talking to us about intelligent solutions for their specific challenges.’ Optibrium is pioneering predictive modelling as an aid to decision analysis in drug discovery, and has developed AI-based platforms for small molecule design, optimisation and data analysis, and what it describes as Augmented Chemistry. The company’s Cerella platform harnesses a unique deep learning approach to help overcome limitations or gaps in drug discovery data, and ultimately reduces costs and speeds drug discovery cycles.
‘We’ve developed products and platforms that bring AI within the reach of even niche biotechs,’ Segall noted. ‘Cerella effectively helps to highlight high-quality compounds with confidence, and prioritise compounds and experiments. The platform exemplifies how we can offer state-of-the-art, turnkey AI solutions that are intuitive, affordable, and that can have a significant impact on drug discovery timelines, decision-making, and ultimately, we hope, success.’
Using deep learning imputation, Cerella looks at all of this very sparse and very noisy and messy data that is generated experimentally, and essentially fills in all the blanks, to find otherwise unrealised opportunities. ‘It’s far more than you could even imagine doing with a conventional cheminformatics approach,’ Segall stated.
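To make the idea of ‘filling in the blanks’ concrete, the Python sketch below imputes missing values in a small compound-by-assay table. It deliberately swaps in a much simpler technique, a similarity-weighted average over other compounds, and is not the deep learning method Cerella uses. The data layout, function names and similarity measure are all assumptions made for illustration.

```python
from typing import List, Optional

Row = List[Optional[float]]


def similarity(a: Row, b: Row) -> Optional[float]:
    """Similarity based on assays both compounds have measured.

    Returns None when the two compounds share no measured assay.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None
    mean_abs_diff = sum(abs(x - y) for x, y in shared) / len(shared)
    return 1.0 / (1.0 + mean_abs_diff)


def impute(matrix: List[Row]) -> List[List[float]]:
    """Fill each missing cell from compounds that were measured in that assay."""
    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue  # keep experimental values untouched
            donors = [
                (other[j], similarity(row, other))
                for k, other in enumerate(matrix)
                if k != i and other[j] is not None
            ]
            if not donors:
                raise ValueError(f"assay {j} has no measured values")
            weighted = [(v, w) for v, w in donors if w is not None]
            if weighted:
                total = sum(w for _, w in weighted)
                filled[i][j] = sum(v * w for v, w in weighted) / total
            else:
                # No overlapping assays with any donor: fall back to the mean.
                filled[i][j] = sum(v for v, _ in donors) / len(donors)
    return filled


if __name__ == "__main__":
    # Rows are compounds, columns are assays; None marks an unmeasured value.
    data = [
        [6.0, 7.0, None],
        [6.0, 7.0, 5.0],
        [4.0, 3.0, 8.0],
    ]
    print(impute(data))
```

The point of the sketch is that relationships between assays, and not only chemical structure, carry information: a compound that behaves like another in two assays is assumed to behave similarly in a third.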
‘And it doesn’t just “dump” all of that data onto a desktop,’ Segall noted. ‘Cerella is far more proactive. It can tell you if you have missed compounds that might fulfil specific sets of criteria – and highlight compounds with high probability that they’re going to achieve your objectives – and you can then validate those propositions experimentally. You get a much broader view of the molecules you are exploring, with biological and experimental context. This is something the cheminformatics space won’t achieve … It’s meeting chemistry and biology in the middle of their respective spaces.’
Traditional cheminformatics techniques are based on visualisation of data, and analyses, such as structure-activity relationship (SAR) analysis, to make sense of that data. This provides an understanding of the relationship between the structure of a compound and, say, its activity against a target. Cerella can relate that to the biological relationships between the different things being measured, and the overall outcome.
‘And it brings all that together and understands all those relationships, both chemical and biological, to make much more accurate predictions, potentially even in the much broader context of previous projects, and other information that might be tucked away in a database that hasn’t been looked at for years,’ Segall said. ‘It’s a proactive approach that can help inform potentially new directions for projects. And that’s where the promise of AI can leverage much more value from that data.’
And for small biotech, which may only have a few projects in its pipeline, the ability to maximise intelligence from valuable screens at any stage is critical, Segall noted: ‘One of the biggest challenges these companies face is how to use the data they do have more effectively, to make decisions in the course of the project, and avoid those missed opportunities. AI can help in the decision-making process to give confidence that you are actually running the most valuable experiments, and that the resulting data is going to add the most information to make those better decisions.’
Importantly, Cerella doesn’t require what Segall describes as ‘a bunch of expert Python programmers, huge libraries and a team of data scientists’ to work with it: ‘We’ve implemented Cerella as a cloud-based platform, so essentially, it plugs into your data source, acquires the data you give to it, securely and safely, and cleans that data.’ And while the task of data cleaning can otherwise be incredibly time consuming and tedious, ‘Cerella does all that data cleaning, and then prepares the data for modelling,’ Segall continued. ‘It builds and validates the models – of course, you can look at those results and carry out further validation – and then automatically fills in the blanks and makes this very, very rich data accessible in a very, very intuitive way.’
This means Cerella can be interrogated using simple questions, for example, to find compounds that may have activity against a particular target. It will also suggest compounds that may not have yet been tested in that assay.
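A question of this kind, which untested compounds are likely to meet an activity criterion, can be sketched in Python as a ranking by probability. The sketch assumes each prediction comes with a mean and a standard deviation and that the uncertainty is Gaussian. The class, field and function names are hypothetical and do not describe Cerella’s interface.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Prediction:
    compound_id: str
    mean: float      # predicted value, for example a pIC50
    std: float       # predicted uncertainty (standard deviation)
    measured: bool   # True if an experimental value already exists


def prob_at_least(pred: Prediction, threshold: float) -> float:
    """P(value >= threshold), assuming Gaussian uncertainty."""
    if pred.std <= 0:
        return 1.0 if pred.mean >= threshold else 0.0
    z = (threshold - pred.mean) / (pred.std * math.sqrt(2.0))
    return 0.5 * (1.0 - math.erf(z))


def suggest_untested(
    preds: List[Prediction], threshold: float, min_probability: float
) -> List[Tuple[float, str]]:
    """Rank unmeasured compounds by their probability of meeting the criterion."""
    scored = [
        (prob_at_least(p, threshold), p.compound_id)
        for p in preds
        if not p.measured
    ]
    hits = [(prob, cid) for prob, cid in scored if prob >= min_probability]
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    predictions = [
        Prediction("CPD-1", mean=7.0, std=0.5, measured=False),
        Prediction("CPD-2", mean=8.0, std=0.5, measured=False),
        Prediction("CPD-3", mean=5.0, std=0.5, measured=False),
        Prediction("CPD-4", mean=9.0, std=0.2, measured=True),
    ]
    for prob, cid in suggest_untested(predictions, 7.0, 0.8):
        print(f"{cid}: {prob:.3f}")
```

Ranking by probability rather than by predicted value alone means a confident, moderately active prediction can outrank a highly uncertain one, which is one way to decide which compounds are worth sending for experimental validation.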
Cerella doesn’t just have utility at the level of screening, Segall pointed out: ‘We can use the platform to predict in vivo response, either in preclinical in vivo models, or potentially in human clinical trials. Through a collaboration with AstraZeneca, for example, we’ve demonstrated how Cerella can use in vitro ADME data to predict in vivo pharmacokinetics of a compound. And that’s an incredibly powerful capability. Or, the ability to help predict human safety outcomes using preclinical data, as another example.’
While Segall isn’t suggesting throwing out a particular compound based purely on a prediction, ‘even though it will be a higher quality prediction than conventional QSAR models,’ he acknowledged, ‘what it does do is inform you there’s a higher potential risk and informs what experiments you should do to mitigate that risk.’
A key part of Cerella is that the AI explains its reasoning. ‘It’s one thing to have a black box that spits out an answer and you have to sort of trust it, but actually understanding why it’s made a prediction is really important,’ Segall concluded. ‘This means scientists can validate in their own minds whether a prediction makes sense, formulate hypotheses, and derive new questions to answer or future avenues for experimentation.’