Sophia Ktori completes her two-part series on the use of artificial intelligence in healthcare research
Artificial intelligence isn’t all about smarter, more accurate diagnostics, predictive or personalised medicine, points out Lee Harland, founder and CSO at Cambridge, UK-based SciBite. There are far more fundamental applications. ‘There’s a phenomenal amount of data coming out of laboratory machines, and this is causing data management problems that are undermining the achievable value of that data.’
The availability of vast amounts of increasingly detailed data has, in parallel, spawned something of a philosophical change in how scientists perceive that data, Harland suggests. ‘Lab work and clinical practices have traditionally been very hands-on, and data was a bit of an inconvenience.’ But now that so much R&D and clinical testing is automated, data has become completely central to results and decision making. ‘Speak to the pharmaceutical industry and it’s evident that data is now at the very heart of what they do, and much of their remit is to generate data that drives product development.’
Bringing in AI
Using AI to clean up and make sense of messy text-based information means that AI algorithms used downstream to analyse that information can then make full use of what is put in front of them, Harland continues. ‘In a drug discovery environment, being able to access and use information on the biology of a drug target, its potential side effects, comparisons with other molecules in the same class, and structural data depends on the clarity and interoperability of data available in databases, and generated by instrumentation.’
From the user’s perspective, and starting from first principles, the first stage is knowing where to find your data. Sounds simple, but Harland cites one company that acknowledged holding 148 different databases. ‘And even once you know where your data is, you must know what to search for,’ he says. ‘The painkiller acetaminophen, for example, is known as paracetamol in the UK, but in the US its everyday name is Tylenol. Search for the wrong term and your software may not recognise that the two mean the same thing, and two different people may obtain two sets of very different results.’
SciBite is addressing these first-layer issues by using AI as the foundation for tackling the problem of synonymy – paracetamol vs Tylenol – and going far beyond the capabilities of traditional search engines, which treat words as just strings of letters with no intrinsic meaning. Harland said: ‘Many drugs have multiple trade names, for example, and every gene has many different names. It’s one thing generating software that can recognise multiple terms, but with AI we train machines to understand that Tylenol and paracetamol may be different words, but they are, in effect, the same drug, and once you can train a machine to recognise what a word “is”, then those words start to have meaning.’
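The synonymy problem Harland describes can be illustrated with a minimal sketch. The tiny dictionary below is a hand-made, hypothetical stand-in for an ontology: the identifiers (DRUG:0001 and so on) and the function name are assumptions made for illustration, not SciBite’s data or API. It shows only the basic idea of mapping every synonym to one canonical identifier, so that two differently worded searches resolve to the same concept.

```python
# Hypothetical mini-ontology: one canonical ID per concept, many synonyms.
# The IDs and entries are illustrative assumptions, not a real vocabulary.
SYNONYMS = {
    "DRUG:0001": {"paracetamol", "acetaminophen", "tylenol"},
    "DRUG:0002": {"sildenafil", "viagra"},
}

# Invert the ontology into a lookup from lower-cased term to canonical ID.
TERM_TO_ID = {
    term: concept_id
    for concept_id, terms in SYNONYMS.items()
    for term in terms
}


def normalise(term):
    """Return the canonical ID for a term, or None if it is unknown."""
    return TERM_TO_ID.get(term.strip().lower())


# Two users searching with different names reach the same concept.
same_concept = normalise("Tylenol") == normalise("paracetamol")
```

A plain string search would treat ‘Tylenol’ and ‘paracetamol’ as unrelated; here both resolve to the same identifier before any search is run.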
Understanding scientific meaning
SciBite’s AI algorithms can be thought of as the plumbing that underpins text-based systems, such as a laboratory notebook or assay registration system. They allow software to understand scientific meaning, Harland suggests.
‘Text itself is pretty much useless to a computer,’ he said, ‘but turn that text into data, and suddenly it becomes usable content. Think about a published scientific paper, which may contain a huge amount of scientific information, but is completely unusable from an analytical perspective. It’s just a collection of words in a specific order.
‘Take that same paper and run it through our software, however, and it outputs data that is interoperable with other datasets, and can be turned into more structured, machine-readable data that will work with downstream analytical algorithms.’
At the heart of SciBite’s semantic analytics software suite is TERMite (term identification, tagging and extraction), a named entity recognition (NER) and extraction API, which scans document text in real time, at about two million words a second.
‘It can find a word, let’s say Viagra, and it knows that Viagra is a drug. By looking at word usage and proximity of words in that document, the system can then figure out what Viagra is used for, and then apply that knowledge to find and extract information on Viagra from millions of other documents.’
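As a rough illustration of what named entity recognition produces, the sketch below tags text against a small vocabulary and emits structured records. It is a deliberately simple dictionary-and-regex approach, not a description of how TERMite works internally; the vocabulary, field names and function name are all hypothetical.

```python
import re

# Hypothetical vocabulary: surface form -> (entity type, canonical name).
VOCAB = {
    "viagra": ("DRUG", "sildenafil"),
    "sildenafil": ("DRUG", "sildenafil"),
    "erectile dysfunction": ("DISEASE", "erectile dysfunction"),
}

# Longest terms first so multi-word names win over any shorter overlap.
_PATTERN = re.compile(
    r"\b(" + "|".join(
        re.escape(t) for t in sorted(VOCAB, key=len, reverse=True)
    ) + r")\b",
    re.IGNORECASE,
)


def tag_entities(text):
    """Return one record per vocabulary hit, with type and character offsets."""
    records = []
    for match in _PATTERN.finditer(text):
        entity_type, canonical = VOCAB[match.group(1).lower()]
        records.append({
            "text": match.group(1),
            "type": entity_type,
            "canonical": canonical,
            "start": match.start(1),
            "end": match.end(1),
        })
    return records


hits = tag_entities("Viagra is prescribed for erectile dysfunction.")
```

Each hit is now a record with a type and a canonical name rather than a run of characters, which is the step that lets downstream software treat the sentence as data.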
TERMite has been developed as a system that pharma companies can ‘plug into’ their existing analytical software, Harland explains.
‘Just as you find a spellchecker inside word processing software, our TERMite application can sit inside scientific data applications, making them instantly more intelligent.’
SciBite has also generated more than 100 ontologies containing many millions of synonyms across topics including genes, drugs, diseases and adverse events, all of which are delivered through TERMite. The firm’s TExpress software, which also works with TERMite-processed data, goes a step further and can find and extract semantic patterns of biomedical notation within sentences, such as text that describes how a specific gene defect leads to a certain disease.
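The idea of a semantic pattern, such as ‘gene defect leads to disease’, can be sketched with a simple template over known entity names. This is an assumption-laden toy, not TExpress: the gene and disease lists, the relation verbs and the function name are all hypothetical, and a production system would work over tagged entities rather than raw regular expressions.

```python
import re

# Hypothetical vocabularies; a real system would draw these from ontologies.
GENES = ["CFTR", "HTT"]
DISEASES = ["cystic fibrosis", "Huntington's disease"]
RELATION_VERBS = ["causes", "cause", "leads to", "lead to"]


def _alternation(terms):
    # Longest first so "leads to" is tried before any shorter alternative.
    return "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))


# Pattern: <gene> ... <relation verb> ... <disease>, all within one sentence.
PATTERN = re.compile(
    rf"\b(?P<gene>{_alternation(GENES)})\b[^.]*?"
    rf"\b(?P<verb>{_alternation(RELATION_VERBS)})\b[^.]*?"
    rf"(?P<disease>{_alternation(DISEASES)})",
    re.IGNORECASE,
)


def extract_relations(text):
    """Return (gene, relation, disease) triples found in the text."""
    return [
        (m.group("gene"), m.group("verb").lower(), m.group("disease"))
        for m in PATTERN.finditer(text)
    ]


triples = extract_relations(
    "Mutations in CFTR cause cystic fibrosis. "
    "An expanded repeat in HTT leads to Huntington's disease."
)
```

The output is a list of gene, relation and disease triples that can be loaded into a database or graph, rather than sentences that a person has to read.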
‘Over the next few years, as we work to improve the quality of our data even further, AI will be able to ask more sophisticated questions, such as “why” someone is looking at changes to features in cells. When we get to this point, AI will be able to add even further depth of insight, because it understands the “why” of that question. In the cell recognition example, this might be because we are looking for compounds that can treat cells by generating the changes we are looking for. And then we can start to use AI to look for similarities in the biology of how different compounds work.’
Costs and serendipity in drug discovery
The drug discovery process represents a huge financial and resource drain on the industry, and has historically endured a high candidate attrition rate, comments Andrew Hopkins, CEO of UK-based Exscientia. ‘Traditional drug discovery operations account for about 35 per cent of the total cost of bringing a drug to market, and you may have to run 20 drug discovery projects, each one costing $15-20 million, even in the early, preclinical stage, just to get one molecule that will ultimately stand a chance of FDA approval.’
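To put Hopkins’ figures in perspective, the arithmetic can be worked through directly. The sketch below uses only the numbers quoted above; the final step, which divides by the 35 per cent share to estimate a total cost, is an assumption made here for illustration and not a figure given by Hopkins.

```python
# Figures quoted by Hopkins; treating them as exact is a simplification.
PROJECTS_PER_CANDIDATE = 20
COST_PER_PROJECT_MUSD = (15, 20)   # low and high estimates, in $ million
DISCOVERY_SHARE = 0.35             # discovery as a fraction of total cost

# Discovery spend needed to yield one candidate with a chance of approval.
discovery_low, discovery_high = (
    PROJECTS_PER_CANDIDATE * cost for cost in COST_PER_PROJECT_MUSD
)

# Assumption: if that spend is the whole 35 per cent discovery share,
# the implied total cost to market follows by division.
total_low = discovery_low / DISCOVERY_SHARE
total_high = discovery_high / DISCOVERY_SHARE

print(f"Discovery spend: ${discovery_low}-{discovery_high} million")
print(f"Implied total:   ${total_low:.0f}-{total_high:.0f} million")
```

On these numbers, the early-stage work alone comes to $300-400 million for each molecule that stands a chance of approval.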
There has always been a large element of serendipity in the early stages of drug discovery to find promising ‘hits’, he notes, but the experience and insight of the scientist driving each project shouldn’t be underestimated. Exscientia has built an AI-driven drug design platform that automates the design, in silico assessment and optimisation of potentially millions of compounds against specified targets, to select the most promising candidates for further development. Steered by what Hopkins terms ‘seasoned [human] drug hunters’, the platform’s algorithms learn from the existing wealth of experimental, structural and ‘omics’ data that is already available on targets, diseases and compound activity, and from new experimentally derived data that bolsters the learning dataset even further. Through this process the platform can design and then optimise candidate structures against designated targets, through design-make-test cycles.
It’s a project-focused process that Hopkins maintains is faster than traditional high throughput screening-based approaches, and is significantly more likely to generate candidates that will ultimately succeed in the clinic. ‘Exscientia’s starting point was the premise that algorithmic methods can improve design efficiency through evolutionary approaches. What we asked was: how can we increase the efficiency and success rate of searching chemical space to design and optimise better drug candidates?’
Marrying human intuition with AI
Traditional drug design is founded on human interpretation of available data, the formulation of a hypothesis, and the proposal of chemical structures that may have the predicted properties against the desired target, Hopkins continues. ‘This is a largely intuitive process, where you may make up to a couple of thousand molecules to solve individual problems.’
The firm’s AI-driven platforms can effectively design and pre-evaluate millions of compounds to predict efficacy, selectivity and ADME – absorption, distribution, metabolism and excretion – against any selected targets. It’s an active learning approach, rather than a deep learning approach, Hopkins says. ‘Active learning methods are about asking which experiment will provide us with the most information to answer a question.’ By asking the right questions, the platform can learn faster and generate a better design process.
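The principle behind active learning can be shown with a toy example of uncertainty sampling, one common active learning strategy. Everything here is hypothetical: candidates are reduced to a single integer property, the ‘assay’ is a hidden threshold, and the model is a simple decision boundary. It is a sketch of the general idea of choosing the most informative experiment, not of Exscientia’s platform.

```python
def run_experiment(x):
    """Hypothetical assay: stands in for a real lab measurement."""
    return x >= 62  # hidden truth the learner must discover


def active_learning(pool, budget):
    """Uncertainty sampling for a 1-D threshold model.

    Each round, query the unlabelled point closest to the current
    decision boundary, i.e. the one the model is least sure about.
    """
    lo, hi = min(pool), max(pool)          # assume lo is inactive, hi is active
    unlabelled = set(pool) - {lo, hi}
    queries = []
    for _ in range(budget):
        if not unlabelled:
            break
        boundary = (lo + hi) / 2           # current model
        # Ties are broken in favour of the smaller candidate.
        x = min(unlabelled, key=lambda p: (abs(p - boundary), p))
        unlabelled.discard(x)
        queries.append(x)
        if run_experiment(x):
            hi = min(hi, x)                # smallest known active
        else:
            lo = max(lo, x)                # largest known inactive
    return (lo + hi) / 2, queries


pool = list(range(101))                    # 101 candidate "compounds"
estimate, queries = active_learning(pool, budget=7)
```

In this toy run, seven well-chosen experiments pin the activity boundary down to between candidates 61 and 62, where testing the pool exhaustively would take up to 99.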
‘Full-stack’ drug discovery capabilities
To expand its in-house laboratory capabilities, Exscientia recently acquired UK biophysics specialist Kinetic Discovery, which has added protein engineering, biophysical screening and structural biology expertise to Exscientia’s own drug design, pharmacology and computational platforms. Exscientia had been working with Kinetic Discovery through an ongoing drug discovery partnership with Evotec, and says the company is a perfect fit with its existing in-house capabilities.
In combination with a recently constructed laboratory at expanded premises on the Oxford Science Park, the acquisition of Kinetic Discovery has effectively transformed Exscientia into a ‘full stack’ AI-driven drug discovery firm that can go from gene to clinical candidate for any druggable target selected, Hopkins claims. ‘With the Kinetic Discovery acquisition, we now have in-house capacity to develop any assays, solve our own crystal structures and be in a strong position to build our own internal portfolio of drug candidates,’ he notes.
‘We spent the first five years focused on technical and market validation of the approach in real-world drug discovery projects with the industry, and we are now in a position to scale the platform up.’ And with four AI-designed preclinical candidates now being developed by partners and in-house, Hopkins anticipates that the first of these will enter the clinic during 2019, adding a further layer of validation to the platform.
The success of the pharma and biopharma industries – and the drugs and diagnostics that they develop – thus ultimately relies on the experimental data that they generate, the analysis and interpretation of that data, and subsequent decisions made.
Tobias Kloepper, CEO at Aigenpulse, suggests it’s a workflow that should incorporate all relevant experimental and relational data generated enterprise-wide at all points in discovery and development.
Aigenpulse has developed a modular, machine learning-driven platform that puts all of that experimental data and metadata in context. It applies analytical algorithms to underpin key questions with data, and enable efficient human interpretation and decision making. Historically, this level of information exploitation has not been practical, because key data often falls by the wayside. ‘Scientists haven’t had adequate computational tools,’ Kloepper comments. ‘They may store experimental output in Excel spreadsheets or flat-files. Merging multiple files in different formats becomes challenging and it is difficult, at that stage, to ask contextual questions of such data.’
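The file-merging problem Kloepper describes can be sketched in a few lines. The two ‘exports’ below, their column names and the common schema are all invented for illustration and say nothing about how the Aigenpulse platform is built; the point is only that once differently formatted outputs share one schema, contextual questions become simple queries.

```python
import csv
import io
import json

# Two hypothetical exports of the same kind of result, in different formats.
CSV_EXPORT = "Sample ID,Conc (nM)\nS1,12.5\nS2,40\n"
JSON_EXPORT = '[{"sample": "S3", "concentration_nM": 7.1}]'


def from_csv(text, source):
    """Map spreadsheet-style column headers onto a common schema."""
    return [
        {"sample": row["Sample ID"],
         "concentration_nM": float(row["Conc (nM)"]),
         "source": source}
        for row in csv.DictReader(io.StringIO(text))
    ]


def from_json(text, source):
    """JSON records already use the common field names; add provenance."""
    return [{**record, "source": source} for record in json.loads(text)]


merged = (
    from_csv(CSV_EXPORT, "plate_reader.csv")
    + from_json(JSON_EXPORT, "lims.json")
)

# With one schema, a contextual question becomes a one-line query.
above_10 = [r["sample"] for r in merged if r["concentration_nM"] > 10]
```

Keeping a source field on every record preserves the provenance that is usually lost when spreadsheets are merged by hand.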
More than just digitising data
Addressing the deficit is much more than just digitising experimental results, Kloepper continues. Digitising data is a fundamental goal for every industry, from retail to manufacturing. ‘But in the life sciences you also have to keep up with a constantly evolving research environment that is developing new assays and new ways of using existing assays. Any intelligent digital platform needs to be able to evolve alongside that research process.’
The Aigenpulse platform employs machine learning to engage that analytical process so that scientists can have more confidence in their interpretation of those analyses. It’s a concept for which about 80 per cent of the work involves getting the data into the right structured, contextual environment and the other 20 per cent is the machine learning to derive insight from that data, Kloepper notes. The two must go hand in hand, and there’s little point in asking your algorithm to answer a question if you don’t have reliable high-quality data available and structured appropriately.
Provide your algorithms with a robust dataset and the job becomes more seamless and accurate. Importantly, the Aigenpulse platform can work with data in several formats across a company, from laboratory information management systems (LIMS), electronic laboratory notebooks (ELNs) and other data repositories, to results of sample analyses captured using proprietary software or output in proprietary data formats. ‘As well as contextualising that data, the Aigenpulse platform is designed to remove noise from such data, which allows the software to more accurately model patterns.’
Often such processes are about 95 per cent automated, but interfaces built into the software allow scientists to validate data to support the algorithm, Kloepper states. ‘Our control vocabularies harmonise data, spanning gene expression datasets, common assays such as ELISA and FACS, or mass spectrometry analyses.’ The software then looks at the structured information from every perspective, including disease, targets and compounds, so that the algorithm can learn and derive answers to specific questions set by scientists and biostatisticians.
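A minimal sketch of vocabulary-based harmonisation with a human validation step might look like the following. The vocabulary entries, record fields and function name are hypothetical and are not taken from the Aigenpulse platform; the sketch shows only the general pattern of mapping free-text labels onto preferred terms and setting aside anything unrecognised for a scientist to check.

```python
# Hypothetical controlled vocabulary: preferred term -> accepted variants.
CONTROLLED_VOCAB = {
    "ELISA": {"elisa", "enzyme-linked immunosorbent assay"},
    "FACS": {"facs", "fluorescence-activated cell sorting"},
    "mass spectrometry": {"mass spectrometry", "mass spec", "ms"},
}

VARIANT_TO_TERM = {
    variant: term
    for term, variants in CONTROLLED_VOCAB.items()
    for variant in variants
}


def harmonise(records):
    """Split records into harmonised ones and ones needing human review."""
    harmonised, needs_review = [], []
    for record in records:
        key = record["assay"].strip().lower()
        term = VARIANT_TO_TERM.get(key)
        if term is None:
            needs_review.append(record)    # leave for a scientist to validate
        else:
            harmonised.append({**record, "assay": term})
    return harmonised, needs_review


raw = [
    {"sample": "S1", "assay": "Elisa"},
    {"sample": "S2", "assay": "enzyme-linked immunosorbent assay"},
    {"sample": "S3", "assay": "Mass Spec"},
    {"sample": "S4", "assay": "qPCR"},     # not in the vocabulary
]
clean, review = harmonise(raw)
```

Most records pass through automatically, and only the unrecognised label is routed to a person, which mirrors the split between automation and validation described above.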
Persistence of analytics
Try and do that with other platforms, or when all you have is PDFs and Excel files, and there will always be issues with data mapping and data matching, says Satnam Surae, chief product officer at Aigenpulse. ‘We enable that persistence of analytics, which extends to running a machine learning pipeline on all of the data available to the whole company if necessary, or scaling things down to the level of individual experiments. Importantly, by retaining the data in the state it was output and at the time it was derived, you can compare models at different time points.’
The Aigenpulse platform provides scientists with a web interface for their data and analytical results, which reduces complexity and optimises usability. ‘Scientists can pull up the bits of data that they want, select the model or method that they want to run, put in the parameters, and click to set the analysis running,’ Surae adds. ‘The back end does all the work, and the output is displayed at the front end, in the most appropriate form, and in a matter of a few clicks.’
The Aigenpulse platform can be integrated into existing IT infrastructure on clients’ premises or in the cloud, and it can be precisely configured to match the requirements of each client. ‘Our aim is to help scientists derive greater insights into their research, through their data generated, and support them to ultimately develop better drug candidates faster and with less attrition,’ Kloepper states.
Concerns that intelligent software will put jobs at risk are unfounded, he believes. ‘AI isn’t going to make scientists redundant, but what it will do is help scientists find the best answers to the questions that they ask. In the next 10 to 20 years we will see every scientist being able to use machine learning algorithms as routinely as they carry out common laboratory assays today.
‘Scientists are very open to new technologies, and AI-driven tools will enable them to be more data driven in their decisions. And, ultimately, this will help industry develop more effective, safer drugs, faster.’