Software advances speeding drug discovery


Sophia Ktori explores the role cheminformatics software providers play in supporting scientists in academia and industry to better understand the properties of potential new drugs.

EMBL-EBI training course. Credit: Jeff Dowling

OpenEye Scientific develops and offers large-scale molecular modelling applications and toolkits, primarily aimed at drug discovery and design, ‘although our tools are used by a wider industrial sector, from large pharma and small biotechs through to agrochemicals and materials scientists,’ commented Ashutosh Jogalekar, associate director – applications science at OpenEye Scientific.

OpenEye Scientific was established some 25 years ago, to build on the founders’ conviction that shape and electrostatics are the two key factors that are responsible for molecular recognition, which, in turn, drives how drugs work, ‘… because most drugs are small molecules that interact with proteins,’ Jogalekar continued. ‘So, being able to accurately model shape and electrostatics was really the foundational principle of the company.’

Cloud computing has been a significant driver of speed and scale as the need to work with huge numbers of compounds, and volumes of disparate data, has increased, Jogalekar continued. ‘We now  have access to almost unlimited processing  power, and we can use platforms such as Amazon Web Services (AWS) to recruit hundreds of CPUs and GPUs, if required, to help us search through a billion compounds in just a matter of hours.’ That’s been really transformative, Jogalekar commented. ‘Running our solutions on AWS means we can look at solving modelling queries and investigations that would previously have been impossible due to hardware or  compute time limitations.’  

A one-stop shop for drug discovery  

OpenEye Scientific’s flagship solution, Orion, a molecular design platform, is – as far as the company is aware – the only cloud-native, comprehensive molecular modelling and cheminformatics platform, which Jogalekar described as ‘a one-stop shop where scientists can come and do all kinds of cheminformatics and modelling calculations on compounds, including large-scale virtual screening, calculation of physicochemical properties and molecular dynamics’. Orion provides users with an integrated web-based solution for designing, calculating, viewing and analysing in chemical space, and offers a dedicated platform for managing data and applications. This single platform removes the need to swap between apps, and also means that data doesn’t have to be transferred from one tool to another, but can remain in the Orion platform. ‘Orion effectively wraps up all the other applications that OpenEye Scientific has developed into a single solution, which includes tools for calculating shape and electrostatics,’ Jogalekar continued, ‘along with at least a dozen other applications that can be used to carry out all kinds of other cheminformatics protocols.’

Orion presents a set of workflows, which OpenEye Scientific has termed ‘floes’. ‘So, one floe will calculate key properties of compounds, such as their solubility for instance, while another can search for compounds that are similar to an existing patented compound of interest. All of these floes are available, essentially at the click of a button. All the user has to do is log in to  their account, upload critical inputs and go from there.’  

Solving data integration and curation issues and providing the tools that will help to rapidly evolve machine learning and AI tools are, respectively, the main challenges and future-focused goal for the evolution of cheminformatics across fields, he suggested.  

‘People are going to want to visualise and analyse not just more data, but more diverse data from chemistry, biology and pharmacology, say, at the same time. And so here at OpenEye we are particularly interested in machine learning, because we have all these tiers of data available, and we want to be able to see the correlations between them. That will make it easier to answer critical questions, such as how a  virtual screening ties in with the end result of your downstream workflows. And from  our perspective, while this is a challenge, if  you have an integrated platform where all  of the rigorously validated data is available,  in the right format, then developing and  applying those kinds of machine learning  tools just becomes much easier.’  

But at the same time, there is a real drive to consider technical complexity, and make the use of tools much simpler, so that additional complexity in the chemical space can be addressed, he suggested. ‘The drug discovery space, for example, is moving beyond the traditional classes of small molecules, into compounds that have completely novel modes of activity. These compounds present a different level of complexity, and gaining access into biological space will obviously be very significant.’

Open source chemistry resources

The European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI) maintains a range of freely available cheminformatics resources that allow users to share data, and to run and analyse the results of complex queries in the chemical space.

‘Probably the best known of these resources is ChEMBL, an open data resource of binding, functional and bioactivity data,’ explained Andrew Leach, head of chemical biology and head of industry partnerships at EMBL-EBI. ‘But our suite of resources also includes SureChEMBL, a searchable database that contains information extracted from patent documents, together with ChEBI, a dictionary of small chemical molecular entities, and UniChem, which gives users the ability to cross-reference chemical structures across different databases.’

‘Containing data on more than 2 million compounds and 1.4 million assays, ChEMBL is widely considered to be the world’s leading expertly curated resource of its type that is completely open, and can be used without restriction by anyone in the scientific community,’ Leach commented. ‘Its utility ranges from basic searches for compounds that have specific properties and activity against particular targets, to the development of new AI algorithms and machine learning tools.’ Launched more than 10 years ago, the ChEMBL database is now on its 29th release, and is derived from data in more than 80,000 published documents.

… relevant to the life sciences sector. ChEBI includes information about the function, or role, of the compound in its biological context. Importantly, many external resources link to ChEBI due to its careful curation, the availability of a stable identifier for each entry, and its ontology, which can be combined with other ontologies to enable reasoning, Leach suggested.

Additionally, EMBL-EBI provides a resource known as UniChem, which allows scientists to link chemical structures across different databases. This includes external resources that may not be focused on small molecules, but which contain some small-molecule information. This capability relies upon the ability to unambiguously define chemical structures using the International Chemical Identifier (InChI), which resolves the question of whether two compounds represented in different resources are the same molecule. ‘By deriving the InChI, you can be confident that any compound with that InChI, wherever you find it, is the same. It’s a very simple idea but also really quite powerful, which UniChem uses to easily allow users to navigate to data from different resources on the same compound.’
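The cross-referencing idea Leach describes can be illustrated with a small sketch. This is not UniChem’s implementation or API: the source names and local identifiers below are invented for illustration, and only the InChI strings (for ethanol, methanol and water) are real. The sketch simply groups records from different sources under the InChI they share.

```python
from collections import defaultdict

# Hypothetical records: (source name, local identifier, InChI string).
# Source names and local identifiers are invented for this example;
# the InChI strings are the standard ones for the named molecules.
RECORDS = [
    ("source_a", "A-001", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),  # ethanol
    ("source_a", "A-002", "InChI=1S/CH4O/c1-2/h2H,1H3"),          # methanol
    ("source_b", "B-17", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),   # ethanol
    ("source_c", "C-9", "InChI=1S/H2O/h1H2"),                    # water
    ("source_c", "C-4", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),    # ethanol
]


def build_index(records):
    """Map each InChI to the {source: local identifier} pairs that carry it."""
    index = defaultdict(dict)
    for source, local_id, inchi in records:
        index[inchi][source] = local_id
    return index


def cross_reference(index, inchi):
    """Return every source entry recorded for the given InChI (empty if none)."""
    return dict(index.get(inchi, {}))


INDEX = build_index(RECORDS)
ethanol_links = cross_reference(INDEX, "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol_links)
```

Because the identifier is derived from the structure itself, identical strings mean identical compounds, so a plain dictionary lookup is enough to move between sources in this toy setting.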

Challenges do still exist with respect to managing the diversity and volume of data that is being produced in the chemical space, Leach further pointed out. ‘There  are new data types being produced using  the latest experimental methods, so there  is a continual challenge with respect to  managing and integrating these different  kinds of data.’ So there will always be a  need for expert curation, ‘especially when  it comes to interpreting information from  multiple sources, which can be challenging to interpret or might be ambiguous.’  

The small molecule sector can, nevertheless, perhaps learn from other biological data resources, Leach suggested. ‘The biological community already deposits large volumes of data directly into the relevant databases. The type of data in ChEMBL is much more heterogeneous than (say) DNA sequence data; in addition, it is often generated in individual academic labs, which may not have the necessary informatics systems nor expertise to undertake the necessary processing and validating steps. Nevertheless, it is important for us to continually explore how we might address this challenge and indeed, we continue to  work with a number of laboratories who  directly deposit data into ChEMBL.’  

EMBL-EBI is funded by its member state governments and other external funders, which include UK Research and Innovation  (UKRI) and the UK’s Research Councils,  as well as the European Commission, the US National Institutes of Health and the  Wellcome Trust. The Wellcome Trust has been the most significant funder of the ChEMBL family of databases, Leach noted.  

EMBL-EBI is also a key partner in large-scale collaborative projects, he continued. ‘We are involved in a large public-private collaboration called EUbOPEN, which is funded by the Innovative Medicines Initiative (IMI). This project aims to create a publicly available chemogenomics dataset that will comprise a set of compounds representative of a wide diversity of bioactivities and also to deliver at least 100 open-access chemical probes. The ChEMBL team is also involved in a number of other IMI collaborations, the NIH-funded Illuminating the Druggable Genome project, Open Targets and BioChemGraph, which aims to integrate data from ChEMBL, the Protein Data Bank in Europe, and the Cambridge Structural Database (CSD).’

Increasing model reliability  

Alvascience is a young cheminformatics software company, established three years ago, which offers a suite of desktop tools designed to help streamline the entire quantitative structure-activity relationship (QSAR)/quantitative structure-property relationship (QSPR) process, from data curation to the deployment of prediction models.

Cheminformatics and in silico techniques have been growing in focus for the past two decades, suggests  Matteo Bertola, Alvascience co-founder and head of software development. ‘Cheminformatics has traditionally been viewed as something of a niche field within the pharma/biotech sector but, in fact,  it’s critical for many sectors of discovery and R&D,’ he explained. ‘Just about every  agrochemical, food, oil and gas, pharma  or materials science organisation has a  cheminformatics department because  these organisations all need to work with molecules, and try to understand  the chemistry of foodstuffs and crop  products, drugs, petro- and speciality chemicals, and new materials.’  

Organisations today expect to work with software that allows them to screen huge numbers of molecules and create reliable models to test specific properties or evaluate a prediction, Bertola noted. ‘They may commonly have some endpoint  that was acquired experimentally, on which  they want to build a mathematical model  that allows them to test new molecules against that endpoint, and rule out or rule  in molecules they may want to take to the next level.’  
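The workflow Bertola outlines, fitting a mathematical model to an experimentally measured endpoint and then using it to rule candidate molecules in or out, can be sketched in a few lines. This is a deliberately minimal illustration, not Alvascience’s method: it assumes a single numeric descriptor, an ordinary least-squares line, and synthetic training values invented so that the endpoint equals twice the descriptor plus one. All names are hypothetical.

```python
from statistics import mean

# Synthetic training set: (descriptor value, measured endpoint).
# The numbers are invented so that endpoint = 2 * descriptor + 1 exactly.
TRAINING = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]


def fit_line(points):
    """Ordinary least-squares fit of endpoint = slope * descriptor + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, descriptor):
    """Predict the endpoint for a molecule from its descriptor value."""
    slope, intercept = model
    return slope * descriptor + intercept


def triage(model, candidates, threshold):
    """Rule each candidate in (True) or out (False) against an endpoint threshold."""
    return {name: predict(model, value) >= threshold
            for name, value in candidates.items()}


MODEL = fit_line(TRAINING)
CANDIDATES = {"mol_low": 0.5, "mol_high": 5.0}  # hypothetical new molecules
DECISIONS = triage(MODEL, CANDIDATES, threshold=6.0)
print(MODEL, DECISIONS)
```

Real QSAR models typically use many descriptors and validated statistical or machine learning methods; the single-descriptor line here only shows the shape of the fit, predict and triage loop.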

One of the main challenges, Bertola suggests, is ‘how to create models that are reliable, and explain how the molecule will behave in the real world, so you can test structures with some degree of  confidence.’ Another challenge is how to explore the sheer size of chemical space that is available and find the best molecules that fit the required properties and, ultimately, functionality. 

With these goals and challenges as starting points, Alvascience has built a suite of desktop QSAR and cheminformatics tools to aid the complete workflow. Data curation is often the first step of a QSAR pipeline,  and the firm’s alvaMolecule platform has been developed to allow users to analyse, visualise, curate and standardise a molecular dataset.   

‘In the next step, alvaDesc calculates molecular fingerprints and thousands of molecular descriptors in an efficient way,’ Bertola continued. ‘Molecular descriptors are also key components for the development of models to predict given endpoints, so we’ve developed alvaModel to enable users to generate  QSAR/QSPR regression or classification models to predict the endpoint you need.  

‘The software, making use of genetic algorithms, can search for high-performing models by selecting the descriptors from those previously calculated in alvaDesc. Once your models have been created, you can share them with your colleagues, who can use alvaRunner to apply the models to new molecular datasets. In this way, you do not need to use other software for applying models as alvaRunner provides a single solution.’ alvaModel and alvaRunner can thus be used together to build and deploy QSAR/QSPR regression and classification models.

Alvascience also offers a tool for de novo molecular design, called alvaBuilder, which has been developed as user-friendly software that lets users generate new molecules with a set of desired properties, starting from a defined training set. ‘The suite of tools effectively addresses this concept of a cheminformatics pipeline, starting with data curation, and then getting to the deployment of a model,’ added Bertola.

All Alvascience’s tools are offered solely as desktop solutions, available for Windows, Linux and macOS. ‘We’ve stayed away from the cloud as many customers don’t want to share any of their data with third parties,’ Bertola pointed out, ‘but we are not ruling out possible expansion into cloud offerings in the future.’

Case study: Empowering chemistry research with Synthia

The latest iteration of Merck’s flagship retrosynthesis design software, Synthia, has been developed to give chemists the freedom to generate a synthetic route for just about any desired chemical compound. It does this while meeting specific criteria for that synthetic pathway, from the availability and cost of starting materials, to the avoidance of hazardous reaction steps or intermediates.

‘Synthia is uniquely founded on more than 100,000 reaction rules that have been coded by expert chemists,’ explained Dr Ewa Gajewska, chemistry development manager at Merck KGaA, Darmstadt, Germany, who has played a key role in development of the Synthia retrosynthesis software.

Whereas automated retrosynthesis platforms commonly utilise data extracted from the literature, Gajewska notes, Synthia is unique in that it combines expert knowledge with algorithms that strategise over multiple steps, as well as modules and filters based on computational methods. ‘The latest release promises to further speed the derivation of efficient, reproducible synthetic pathways,’ she said. ‘We have, for example, witnessed the power of Synthia retrosynthesis software deployed globally by researchers in designing synthetic routes for potential Covid-19 therapeutics.’

Retrosynthesis is not a novel concept. Chemists looking to synthesise a compound will commonly start at the final molecule and backtrack through reactions to identify starting compounds, or building blocks.

While this has traditionally been a time-consuming, trial-and-error process to find a route that is feasible in the lab, and at a larger scale, the evolution of AI algorithms and massive computing power has led to the development of in silico solutions that can evaluate pathways in parallel, dramatically cutting the time required to create synthetic routes that meet all the process requirements and endpoints.

Synthia retrosynthesis software can rapidly evaluate and define dozens or more of the most likely practicable synthetic pathways for a molecule, whether that molecule has been previously described, or is completely novel. The platform can also be used to design new, faster, more efficient or more cost-effective synthetic approaches for existing drugs, for example, to match increased demand. 

‘Taking into account chemical features, such as stereochemistry and the need to protect certain functional groups along the synthetic pathways, users can also hone down the number of potential pathways by applying conditions that, for example, may specify starting reagents, mandate cost restrictions or limit the number of reaction steps,’ Gajewska stated. ‘Because our database of reaction rules was coded by the expert chemists based on the underlying reaction mechanisms and considers stereo- and regiochemistry, as well as the context information, chemists can be confident that the pathways proposed are chemically feasible, and actionable in real-world settings.’

Training algorithms to computationally contrive efficient organic synthetic pathways has, in fact, been a challenge for decades, Gajewska commented, and Synthia is the culmination of efforts to develop a platform that could take into account real-life constraints chemists face at the bench. ‘We’ve been able to harness the power of computers to embrace natural, reaction-related constraints, in parallel with other user-defined filters, to define the best routes in the shortest timeframe, which are ultimately based on chemical logic.’ These constraints may include, for example, finding the shortest, enantioselective route that utilises building blocks available in the user’s inventory.
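The general idea of working backwards from a target to building blocks under user constraints can be shown with a toy search. This is a sketch only and makes no claim about how Synthia works internally: the reaction rules, molecule names and the `plan` function are all hypothetical, and the two constraints modelled are a cap on the number of reaction steps and the requirement that every starting material is in the user’s inventory.

```python
# Hypothetical reaction rules: product -> alternative sets of precursors.
# Molecule names are placeholders, not real compounds.
RULES = {
    "target": [("int_A", "block_1"), ("int_B",)],
    "int_A": [("block_2", "block_3")],
    "int_B": [("int_C",)],
    "int_C": [("block_1", "block_2")],
}


def plan(molecule, inventory, max_steps):
    """Return (steps, route) for the route with the fewest reactions, or None.

    A route is a list of (product, precursors) pairs, and every leaf of the
    route must be a building block held in `inventory`.
    """
    if molecule in inventory:
        return 0, []          # already on the shelf: nothing to make
    if max_steps == 0:
        return None           # step budget exhausted
    best = None
    for precursors in RULES.get(molecule, []):
        steps, route = 1, [(molecule, precursors)]
        for precursor in precursors:
            sub = plan(precursor, inventory, max_steps - steps)
            if sub is None:
                break         # this disconnection cannot be completed
            steps += sub[0]
            route += sub[1]
        else:
            if best is None or steps < best[0]:
                best = (steps, route)
    return best


full_inventory = {"block_1", "block_2", "block_3"}
print(plan("target", full_inventory, max_steps=5))
```

With all three building blocks available the two-step route through `int_A` wins; if `block_3` is removed from the inventory, the search falls back to the three-step route through `int_B`, and tightening `max_steps` to 2 leaves no feasible route.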

A foundation on native ‘human’ understanding of each reaction mechanism allows Synthia retrosynthesis software to capture all of the nuances of chemical reactivity. ‘For example, the software considers the electronic and steric effects of the substituents that influence the reaction outcome,’ Gajewska noted. ‘Every reaction rule is also accompanied with the list of functional groups that are incompatible with these reactions, and the ones that will need protection.’

This marriage between expert human knowledge, algorithms and computational methods is unique, Gajewska continued. ‘And it’s this combination of human-derived expertise and algorithms that allows the platform to accommodate such a wide range of user-defined expectations.’ Synthia retrosynthesis software was originally developed at Grzybowski Scientific Inventions (GSI), and was acquired by Merck KGaA, Darmstadt, Germany, in 2017. Since then, continued R&D has resulted in the addition of new computational and search algorithms and the expansion of the reaction rules database. ‘New synthetic methodologies are continually added,’ Gajewska pointed out. ‘With the latest release, users will find improved pathway visualisation features that really help the chemist to choose the most promising pathway and start the synthesis.’ 

New features of the latest Synthia retrosynthesis software release include:

• Certification to ISO27001 standard for an independently verified information security management system (ISMS)

• Single sign-on

• Incorporates 845 new reaction rules, plus integration with DeepMatter’s SPRESI database, which contains 4.6 million reactions and 700,000 literature references for known reactions

• Interface allows users to search according to preset parameters, or customise and save their own parameters

• Filter results to favour pathways using preferred reactions, starting materials or intermediates

• Screen results to remove pathways with unwanted reactions, focus on interesting intermediates, or limit the number of protecting groups needed

• Visualise results, including detailed schematic view of individual pathways, and structures, reaction names and conditions

• Easily build and save lists of molecules or functional groups that are frequently favoured

• Connect with other scientific tools, using enhanced APIs for an integrated workflow

Offered as a web-based platform via licence, Synthia retrosynthesis software also gives users the flexibility to drive the retrosynthesis manually, rather than rely on completely automated synthetic design. This means chemists can chart a course through pathways, starting with their end product and working backwards through the stages, much as chemists did before in silico tools, but with the wealth of reaction details, references and other annotations for each reaction step they select available at their fingertips.