Software helps unlock the potential of genomics research
Today’s DNA sequencing technologies make it possible to sequence whole human genomes quickly and cost-effectively. Sequencing initiatives are generating vast volumes of data that – theoretically – give scientists a starting point to drill down into individual patient genomes in the hunt for disease-related variants, and also to mine huge collections of public datasets to aid our understanding of the genetic basis of disease, unpick disease mechanisms, identify drug and diagnostic targets, and stratify patients for clinical trials and personalised medicine.
In practice, analysing this wealth of genomic data, in the context of associated biological and clinical data, is challenging. Gene variants identified through genotyping studies are stored in variant call format (VCF) files, but deriving patterns and insight from these files and connecting disparate data types isn’t necessarily intuitive. And with relational datasets generated through large public and private initiatives (containing potentially millions of variants from many thousands of individuals) there are immediate issues associated with scale, as well as with how one can formulate the right queries.
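For readers who have not opened a VCF file, a minimal sketch of reading one data line may help. VCF is tab-delimited text with eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), optionally followed by a FORMAT column and one genotype column per sample. The example line, the sample names and the `parse_vcf_line` helper below are invented for illustration; a production pipeline would normally rely on a dedicated parser rather than hand-rolled code.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: list[str]
    genotypes: dict[str, str]


def parse_vcf_line(line: str, sample_names: list[str]) -> Variant:
    """Parse one VCF data line (not a '#' header line)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    # Columns 6-8 are QUAL, FILTER and INFO; FORMAT (column 9) names the
    # colon-separated keys used in each sample column that follows.
    format_keys = fields[8].split(":")
    gt_index = format_keys.index("GT")
    genotypes = {
        name: sample.split(":")[gt_index]
        for name, sample in zip(sample_names, fields[9:])
    }
    return Variant(chrom, int(pos), ref, alt.split(","), genotypes)


# Invented example line: one heterozygous and one homozygous-reference sample.
line = "1\t12345\t.\tA\tG\t50\tPASS\tDP=30\tGT:DP\t0/1:14\t0/0:16"
variant = parse_vcf_line(line, ["patient_1", "patient_2"])
print(variant.pos, variant.alt, variant.genotypes)
```

Even this toy case shows why pattern-finding is awkward: each line describes one variant across all samples, so any question about a patient, a gene or a disease means reshaping the data first.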
In contrast with relational databases, graph databases can help to transform large-volume unstructured data into actionable knowledge, explains Alicia Frame, senior director, graph data science at Neo4j. ‘In the case of genomic research, the key problem is how to integrate the large volumes of highly heterogeneous data and gain maximum insight,’ she says. This holds whether the goal is diagnosis, personalised therapy or drug development, she stresses. ‘Graph databases are an ideal way to represent biomedical knowledge and offer the necessary flexibility to keep up with scientific progress. Using graph databases, a well-designed data model and query can deliver in seconds what previously took days of manual analysis.’
Graph platforms are effectively a way of representing and storing data as connected concepts, Frame explains. ‘You can think of the graph as built on nodes that are concepts and then the relationships that connect them,’ she says. ‘In ‘everyday speak’, we might well consider the nodes as nouns. So, in the genomics or bioinformatics space, these ‘nouns’ are the genes, chemicals, diseases, variants and phenotypes. And then, of course, the relationships between them are effectively the verbs, which connect the concepts. It’s – kind of – a real-world systems biology model.’
Under the Neo4j platform, the data is stored in the same way that the ‘nouns’ and ‘verbs’ relate to each other in biology, says Frame, so getting the data you want back out is very intuitive. In a relational database, where everything is stored as rows and columns, you need to join the data – and that means spending a lot of time thinking about how the computer stores that data and trying to map how to connect it. Cypher, Neo4j’s query language, lets a domain expert query far more naturally for patterns in the data. ‘So the user can literally ask the database to find chemicals that bind to receptors for particular genes that are associated with a particular disease,’ says Frame. This makes it easy to express a ‘mental model’, phrase questions naturally and retrieve the relevant information from the underlying database.
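To make the ‘nouns and verbs’ idea concrete, the sketch below holds a toy graph in plain Python and answers the question Frame describes. The gene, receptor, chemical and disease names, the relationship types (BINDS, ENCODES, ASSOCIATED_WITH) and the Cypher text in the opening comment are all invented for illustration; they are not taken from any real Neo4j schema or dataset.

```python
# A hypothetical Cypher pattern for the same question might read:
#   MATCH (c:Chemical)-[:BINDS]->(:Receptor)<-[:ENCODES]-(g:Gene)
#         -[:ASSOCIATED_WITH]->(:Disease {name: $disease})
#   RETURN DISTINCT c.name
# The Python below walks the same pattern over an in-memory toy graph.

# Relationships stored as (source, TYPE, target) triples: noun, verb, noun.
EDGES = [
    ("chem_A", "BINDS", "receptor_1"),
    ("chem_B", "BINDS", "receptor_2"),
    ("gene_X", "ENCODES", "receptor_1"),
    ("gene_Y", "ENCODES", "receptor_2"),
    ("gene_X", "ASSOCIATED_WITH", "disease_D"),
]


def sources(rel_type: str, target: str) -> set[str]:
    """Nodes with an outgoing rel_type relationship to target."""
    return {s for s, r, t in EDGES if r == rel_type and t == target}


def targets(source: str, rel_type: str) -> set[str]:
    """Nodes reached from source by a rel_type relationship."""
    return {t for s, r, t in EDGES if s == source and r == rel_type}


def chemicals_for_disease(disease: str) -> set[str]:
    """Chemicals binding a receptor encoded by a gene linked to disease."""
    found = set()
    for gene in sources("ASSOCIATED_WITH", disease):
        for receptor in targets(gene, "ENCODES"):
            found |= sources("BINDS", receptor)
    return found


print(chemicals_for_disease("disease_D"))
```

The traversal reads in the same order as the spoken question, which is the property Frame attributes to pattern-based querying.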
‘If you’ve ever worked with a relational database, you have to typically join data across lots of tables,’ she says. ‘The more complex the query, the more complex it is to join the proverbial dots in the table. The more joins you have, the slower it is and the more difficult it is to write the query,’ Frame acknowledges. ‘Use a labelled property graph model based on nodes and relationships and there is no need to consider joins between tables, because the data is already joined.’ It also becomes intuitive to add new data as it is derived.
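For comparison, here is a rough sketch of how Frame’s example question (chemicals that bind receptors for genes associated with a disease) might look in a relational layout, using Python’s built-in `sqlite3` module. The table names, column names and rows are invented; the point is only that each hop between concepts becomes an explicit JOIN through a link table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE chemical_receptor (chemical TEXT, receptor TEXT);
    CREATE TABLE gene_receptor     (gene TEXT, receptor TEXT);
    CREATE TABLE gene_disease      (gene TEXT, disease TEXT);

    INSERT INTO chemical_receptor VALUES ('chem_A', 'receptor_1'),
                                         ('chem_B', 'receptor_2');
    INSERT INTO gene_receptor     VALUES ('gene_X', 'receptor_1'),
                                         ('gene_Y', 'receptor_2');
    INSERT INTO gene_disease      VALUES ('gene_X', 'disease_D');
""")

# Chemical -> receptor -> gene -> disease: three hops, two joins across
# three link tables, each spelled out by the query author.
query = """
    SELECT DISTINCT cr.chemical
    FROM chemical_receptor AS cr
    JOIN gene_receptor AS gr ON gr.receptor = cr.receptor
    JOIN gene_disease  AS gd ON gd.gene = gr.gene
    WHERE gd.disease = ?
"""
rows = conn.execute(query, ("disease_D",)).fetchall()
chemicals = [row[0] for row in rows]
print(chemicals)
conn.close()
```

Adding a further concept, such as phenotypes or variants, would mean another table and another join in every query that touches it.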
Open-source and user-friendly
Graph databases also make it much easier to build applications for every end user – think, again, of clinicians and researchers – and, at the back end, it becomes relatively easy for the person building the graph to maintain the resource, update it and deliver it to those end users.
Neo4j has focused on making the open-source platform easily accessible and user-friendly for novices and smaller initiatives. ‘For the community edition, we offer the database, plugins for data science and visualisation tools,’ explains Frame. ‘If you are a researcher or an individual, you can download our database and our software from our website for free. In fact, many groups start there.’ The pivot point between the free, open-source version and the commercial enterprise platform will depend on the volume of data and the number of people who will be using the system, she adds. ‘One of the primary differences between the free community version and the enterprise system is parallelisation. The community platform will use up to four cores, whereas users of the enterprise platform can tap unlimited numbers of cores for faster computation when datasets are really huge and speed is important.’
In fact, many public genomic datasets are already encoded as graph databases. ‘The NCBI, for example, has downloadable graph representations of many of its public databases,’ Frame says. ‘We also have a ‘graphs for good’ programme, through which we offer the commercial, enterprise software for free to nonprofits, charities, researchers and academics in order for them to do their research; we also license the database and the plugins to drug discovery companies such as Novo Nordisk.’
The most obvious – although not the only – challenge associated with managing and analysing genomics data is its scale, comments Ignacio (Nacho) Medina, CTO of Zetta Genomics and founder of the open-source computational biology (OpenCB) platform. OpenCB is a bioinformatics suite that is designed to allow genotypic data management and analysis on a scale relevant to the massive sets of genome sequencing results that the research and clinical communities are generating. Medina describes the platform as a full stack open-source software solution, enabling large-scale genomic data storage, indexing, analysis and visualisation.
Scalable genomics research
The need for a dedicated, genomics-focused platform became increasingly evident to Medina more than a decade ago with the emergence of next-generation sequencing technologies and with the application of genotyping – not just for basic disease research, but also in clinical settings for potential applications in disease diagnosis and the development of personalised medicine.
As the first scalable solution enabling genotypes – recorded in variant call format – to be stored in a variant database, OpenCB is a high-performance solution for indexing and analysing many hundreds of thousands of samples, he believes.
Medina, who has been Head of the Computational Biology lab on the HPC team at the University of Cambridge since 2015, conceptualised and founded the OpenCB project while working in Spain during 2012. Within a few years, the platform was gaining the attention of some major genomics research initiatives. ‘At first, it was just a prototype – very small – but this was enough to raise the attention of EBI, the University of Cambridge and Genomics England in 2015, which adopted and contributed significantly to its development,’ he says. During this period, Medina remained the platform’s architect and led its design and development. ‘Today, OpenCB also includes a metadata and clinical database, fine-grained security management and a knowledge database, representing a complete genome data interpretation platform,’ Medina notes.
As an open-source platform, OpenCB is accessible and free-of-charge for any organisation looking to manage and analyse genomics data in a non-regulated setting. In 2019, Medina spun Zetta Genomics out of Genomics England and the University of Cambridge to commercialise the OpenCB technology as XetaBase – a regulated, clinically validated and technically supported data architecture and software solution that is applicable for clinical genomics data management and evaluation at large scale.
‘Zetta Genomics is, effectively, the commercial venture established to extend the scope of OpenCB, and XetaBase – OpenCB’s commercial name – was created and launched in 2020,’ says Medina. ‘XetaBase is now becoming a certified platform that meets the regulatory requirements for data in clinical settings, while also addressing the need for customer support and implementation skills ‘built-in’. It’s offered as a software and through a service model, so we provide updates, fixes and training, along with ongoing support.’
Medina remains the CTO of Zetta Genomics, which is now also the main contributor to OpenCB. In June 2021, Zetta won £2.5 million in VC seed-funding. This investment is being focused on growth, on improving performance and stability, and on implementing new analyses. Some is also going towards enhancing the company’s partnership network as it expands from the UK to open both Spanish and US offices. Resource is also being focused on talent: securing additional team members with software, development and commercialisation expertise. Importantly, the OpenCB and XetaBase data architecture supports regulatory governance for clinical and genomic data management, including NHS digital security and privacy policies.
‘Regulatory and security issues aside, clinical labs face particular challenges with respect to how you deal with patients’ genotyping test data,’ Medina explains. These challenges relate to the sheer numbers of tests that are performed and the volumes of data generated but, also, the almost inevitable shortfall in human resources to analyse all the data for each patient in the hunt for a gene variant that might be the pathogenic cause of a disease.
Another challenge that the OpenCB platform and XetaBase address is one of data sharing between scientists. Typically, if a clinician identifies a new disease-related variant that explains pathogenesis and disease symptoms, that finding may stay buried in the clinician’s notes.
‘In some cases they can submit or publish their findings but, even if that happens, it can take as long as 12 to 18 months for peer review and publication,’ says Medina. ‘Clinicians really need to be able to share their findings – with all of the patient data-related regulations in place – across hospitals. With the new federation feature, XetaBase will finally address that need to make findings available within minutes, not months.’
XetaBase is cloud-hosted and this simplifies data management and scalability, with a huge emphasis on making data secure and, effectively, available in real time.
‘You may have several gigabytes of genotypic and other contextual data and metadata per patient,’ explains Medina. ‘The server for our platforms runs in the cloud and so this fact allows customers to easily scale to their needs, [supporting] tens or hundreds of thousands of patients in some cases, while we take care of and provide all the services that they need for the platform.’
Importantly, the OpenCB platform is built on a fileless infrastructure. ‘Other solutions rely on a file-based system, but then how can you easily search across, say, 20,000 files to look for a disease-related variant that matches that of your patient?’ he asks. ‘In OpenCB, in contrast, all of the genomic variant data is aggregated in one indexed database. The largest example we have is Genomics England – for which there are about 140,000 whole genomes in one single installation, accounting for about 300 terabytes of data,’ says Medina. ‘And this fileless system means that, despite this massive volume and breadth of data, we can scan the whole database within minutes or scan any patient or the entire family in a few seconds.’
In fact, the OpenCB architecture makes it possible to include hundreds of different pieces of information relevant to each genetic variant and still query the whole platform. ‘One analogy we can use to help explain this is Google,’ says Medina. ‘When you search for something on Google, the system doesn’t search through the one trillion pages of content individually. Rather, Google has every page indexed so, when you query Google, you query that index and it takes just milliseconds.
‘We have done something similar with OpenCB. We take all of the billions of mutations from large datasets and put them into one index on the system to enable incredibly fast analyses.’
And, of course, this speed matters whether the query seeks insight into one patient, such as a search for other patients with the same mutation, or comes from a disease researcher querying different variants across all of the different samples, Medina adds.
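The difference between scanning files and querying an index can be sketched in a few lines. This is a toy illustration of the general idea only, not OpenCB’s actual storage design, which is not described in detail here; the sample names, variant keys and function names are invented.

```python
from collections import defaultdict

# Hypothetical per-sample "files": sample name -> set of variant keys
# written as chrom:pos:ref>alt.
SAMPLE_FILES = {
    "sample_1": {"1:12345:A>G", "7:55500:C>T"},
    "sample_2": {"1:12345:A>G"},
    "sample_3": {"X:99000:G>A"},
}


def scan_files(variant: str) -> set[str]:
    """File-based approach: open every sample to answer one question."""
    return {name for name, variants in SAMPLE_FILES.items()
            if variant in variants}


def build_index(files: dict[str, set[str]]) -> dict[str, set[str]]:
    """Aggregate all samples once into variant -> carrying samples."""
    index = defaultdict(set)
    for name, variants in files.items():
        for variant in variants:
            index[variant].add(name)
    return dict(index)


INDEX = build_index(SAMPLE_FILES)


def lookup(variant: str) -> set[str]:
    """Indexed approach: one dictionary lookup, however many samples."""
    return INDEX.get(variant, set())


print(lookup("1:12345:A>G") == scan_files("1:12345:A>G"))
```

Both functions return the same samples, but the scan touches every file on each query, while the index pays the aggregation cost once and answers later queries with a single lookup.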
The ultimate vision is for platforms such as OpenCB and XetaBase to help reduce drug development times, increase the speed of disease diagnosis and aid decision-making for patients.
‘My goal for the next five years is to demonstrate we can have a significant impact on research and healthcare, and realistically help to reduce drug development times by potentially years,’ says Medina. ‘We also want to enable researchers to communicate their findings in a secure way, so that they can reanalyse data and ensure no patient is forgotten.’