Making sense of genomic data

Genomics and proteomics data have changed the face of disease and therapeutics research, drug discovery and development, finds Sophia Ktori 

There is an increasing drive to identify disease-associated genetic variants, genes and the proteins for which they code, in order to understand the underlying causes of disease. This knowledge can help assess whether the proteins are tractable targets for traditional small-molecule therapies as well as for biologics, that is, peptide/protein or nucleic acid-based therapies.

‘One of the major bottlenecks now, especially for small biotech companies, is how to make sense of the wealth of genomic data that is available in the public domain, whether individually published studies, patient cohort-based genome-wide association studies (GWAS) or population-scale initiatives such as the UK Biobank,’ explained Ellen McDonagh, informatics science director at Open Targets. ‘We know that therapies are more likely to be successful and achieve regulatory approval if there’s underlying genetic evidence for the disease-related target on which they act, so understanding the link between genes and disease is key. However, while there is now a huge amount of data in the public domain, much of the information is ‘messy’, not standardised, and there is little harmonisation of terminology. This means that looking for or comparing data has become a real issue.’

Open Targets is an international public-private partnership with a portfolio of both large-scale experimental approaches and informatics projects to support target identification, prioritisation and validation. The initiative ultimately aims to help industry develop safer, more effective drugs and reduce the cost of drug development. Its informatics projects collate global human genetics and functional genomics data into open source resources such as the Open Targets Platform and the Open Targets Genetics portal.

‘The Open Targets Platform integrates public domain data to facilitate target identification and prioritisation, and the Open Targets Genetics portal effectively harnesses large-scale datasets, based on genome-wide association studies (GWAS) and functional genomics data, to link variants to genes and disease phenotypes,’ McDonagh explained. ‘So, for example, our Locus-to-Gene (L2G) machine learning pipeline predicts the most likely causal gene for GWAS loci (to identify the most likely gene associated with the trait in question). The model utilises different datasets and considers several features including which genes are the closest to the locus as well as whether the variant is also associated with changes to the protein expression level.’ The L2G pipeline produces a score which gives an estimate of how likely a gene is to be the causal gene at that locus. L2G scores above a certain threshold are integrated into the Platform as evidence for the association between a gene and a trait or disease.
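The thresholding step described above can be pictured with a minimal sketch. It is not the Open Targets L2G pipeline: the locus names, gene names, scores and the 0.5 cut-off below are all made up for illustration, and the `GeneScore` class and `prioritise` function are hypothetical helpers. The sketch only shows the idea of keeping, for each locus, the candidate genes whose score exceeds a chosen threshold, ranked from most to least likely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneScore:
    """One candidate gene at one GWAS locus (illustrative values only)."""
    locus: str
    gene: str
    score: float  # assumed to be a model output between 0 and 1


def prioritise(scores, threshold=0.5):
    """Group genes by locus, keeping only scores strictly above the threshold.

    The default threshold is an arbitrary placeholder, not the value
    used by Open Targets.
    """
    kept = {}
    for item in scores:
        if item.score > threshold:
            kept.setdefault(item.locus, []).append(item)
    # Rank the surviving genes at each locus from highest to lowest score.
    return {
        locus: [g.gene for g in sorted(items, key=lambda g: g.score, reverse=True)]
        for locus, items in kept.items()
    }


# Hypothetical scores for two loci.
scores = [
    GeneScore("locus_1", "GENE_A", 0.82),
    GeneScore("locus_1", "GENE_B", 0.10),
    GeneScore("locus_1", "GENE_C", 0.55),
    GeneScore("locus_2", "GENE_D", 0.30),
]

print(prioritise(scores))
```

In this toy example only GENE_A and GENE_C at locus_1 pass the cut-off, so locus_2 contributes no evidence at all.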

Open Targets is a consortium of partner institutions and companies. EMBL’s European Bioinformatics Institute (EMBL-EBI) and the Wellcome Sanger Institute are the key research and informatics partners that work through Open Targets to create, access and assimilate data and help build the Platform and Genetics portal. Industry partners Bristol Myers Squibb, GSK and Sanofi also help drive the evolution of the resources, giving industry’s view of what is needed from a data perspective, and are involved in some of Open Targets’ experimental and informatics projects.

The Open Targets Platform and Genetics portal are founded on data from large-scale GWAS and major population cohorts, such as the UK Biobank and FinnGen. Both the Platform and Genetics resources are regularly updated as new studies and data become available. This is done partly through automated pipelines and partly by manual curation, McDonagh added. ‘It depends on the data type. So, for example, genome-wide association studies come through the GWAS Catalog, and this already includes systems that allow organisations submitting their studies to automatically upload that data. But quite a bit of the mapping of diseases or traits to standardised ontology terms may have to be manually undertaken. Our own Platform team brings in data from 22 different resources, ranging from genetic evidence for a target-disease association to evidence from mouse models.’ This all helps to build a picture of the evidence for a target-disease association, so that users can understand the relevance, applicability and tractability of a target protein.

Importantly, the Open Targets resources, code and datasets are open source, and freely available to download, query and manipulate, McDonagh noted. ‘We also regularly update the resources by incorporating new public datasets and results from our data providers, and introducing new tools. The Platform has five updates a year.’ The L2G machine learning tool in the Genetics portal, for example, was introduced in 2020.

As part of the Platform, Open Targets operates, in collaboration with Europe PubMed Central, a natural language processing (NLP) pipeline that pulls in data on the relationships between diseases, genes and drugs from 39 million papers and other publications. ‘We work closely with Europe PubMed Central on that project,’ McDonagh noted. ‘The NLP pipeline automatically grabs scientific publications associated with a particular entity, say, a gene target, and these are then shown on the relevant gene, or disease, or drug pages of the Platform. And that can be really, really useful information, especially for targets for which there is sparse public data, and for which there are no known drug candidates, or existing GWAS data. So we can basically look at the whole bibliography of research and identify potentially novel genes of interest that may represent new targets for diseases.’
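To give a feel for what linking publications to entities can involve, here is a deliberately simple sketch. It is not the Open Targets or Europe PubMed Central pipeline, whose methods are not described here. It assumes a plain dictionary lookup: the gene and disease vocabularies, the example sentences and the `find_terms` and `cooccurrences` helpers are all hypothetical. It counts how often a known gene name and a known disease name appear in the same sentence.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical vocabularies; a real system would use curated dictionaries
# or trained named-entity recognition models.
GENES = {"GENE_A", "GENE_B"}
DISEASES = {"rheumatoid arthritis", "asthma"}


def find_terms(sentence, vocabulary):
    """Return the vocabulary terms that occur as whole words in the sentence."""
    lowered = sentence.lower()
    return {
        term
        for term in vocabulary
        if re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered)
    }


def cooccurrences(sentences):
    """Count (gene, disease) pairs mentioned together in one sentence."""
    counts = Counter()
    for sentence in sentences:
        genes = find_terms(sentence, GENES)
        diseases = find_terms(sentence, DISEASES)
        for pair in product(genes, diseases):
            counts[pair] += 1
    return counts


# Made-up sentences standing in for text mined from publications.
sentences = [
    "GENE_A expression is elevated in rheumatoid arthritis.",
    "Variants near GENE_A and GENE_B were linked to asthma.",
    "GENE_A was again implicated in rheumatoid arthritis.",
    "No gene is mentioned in this sentence about asthma.",
]

counts = cooccurrences(sentences)
for (gene, disease), n in counts.most_common():
    print(gene, disease, n)
```

Pairs that recur across many publications could then be surfaced on the relevant gene or disease page; a production pipeline would also need to handle synonyms, abbreviations and ambiguous names, which this sketch ignores.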

One of the challenges that the Open Targets resources are addressing is standardisation of terminology. This can be a major headache for organisations trying to search for equivalent or complementary data. The resource maps disease terms to a standardised ontology, the Experimental Factor Ontology (EFO), which has been developed at EMBL-EBI. ‘We work closely with the Samples, Phenotypes and Ontologies Team (SPOT) to develop new terminology under the EFO umbrella, and this gives users a standardised search and compare foundation, at either a broad or a more honed level,’ McDonagh noted. ‘That ontology will greatly help users find and compare, for example, targets with genetic evidence for an association with rheumatoid arthritis. But also, scientists can widen their catchment and look at targets associated with autoimmune diseases more generally, or filter down for more specificity, and look at one particular form of arthritis.’
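The widening and narrowing that McDonagh describes follows naturally from a hierarchical ontology. The sketch below uses a tiny, made-up hierarchy and made-up target annotations to show the idea. It does not reproduce the real structure or identifiers of EFO, and the `descendants` and `targets_for` helpers are hypothetical. Searching on a broad term collects targets annotated to that term and to every more specific term beneath it.

```python
from collections import deque

# Toy parent -> children hierarchy (illustrative only, not taken from EFO).
CHILDREN = {
    "autoimmune disease": ["rheumatoid arthritis", "type 1 diabetes"],
    "rheumatoid arthritis": ["seropositive rheumatoid arthritis"],
}

# Hypothetical targets annotated directly to each disease term.
TARGETS_BY_TERM = {
    "rheumatoid arthritis": {"GENE_A"},
    "seropositive rheumatoid arthritis": {"GENE_B"},
    "type 1 diabetes": {"GENE_C"},
}


def descendants(term):
    """Return the term itself plus every term below it in the hierarchy."""
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in CHILDREN.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def targets_for(term):
    """Collect targets annotated to the term or to any of its descendants."""
    found = set()
    for t in descendants(term):
        found |= TARGETS_BY_TERM.get(t, set())
    return found


print(sorted(targets_for("autoimmune disease")))
print(sorted(targets_for("rheumatoid arthritis")))
print(sorted(targets_for("seropositive rheumatoid arthritis")))
```

Querying the broad term returns all three hypothetical targets, while the most specific term returns only the one annotated directly to it, which is the behaviour a shared ontology makes possible across otherwise inconsistent datasets.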

The experimental and informatics projects led by Open Targets feed into its platforms. Through the PROTACtable genome project, for example, which Open Targets reported in 2021, scientists from the consortium’s member organisations established a framework for assessing whether human proteins could potentially be targeted by a particular class of drug known as proteolysis targeting chimeras (PROTACs). The methodology used to define the PROTACtable genome was based on an approach developed by a group at GSK to explore the tractability of small molecule drugs. The PROTAC workflow is incorporated into the Open Targets Platform, adding to the information that can help inform which modality could be used to modulate a given target to treat a disease.

‘We have a range of ongoing informatics projects,’ McDonagh noted. ‘These include a project looking to build networks of interactions between proteins and a project which is looking at assessing the safety of potential targets.’ The outputs from these projects are added as features to the Platform, contributing information beyond the evidence for a given target-disease association, to help users build therapeutic hypotheses, for example on whether an interacting protein would be a more suitable and safer target in the treatment of the disease. These informatics and experimental projects provide novel data as well as key information on what is useful for drug discovery, and help to steer the direction of both the Platform and the Genetics portal, McDonagh said.