Data dissection

Developments in bioinformatics are putting pressure on those tasked with data analysis and interpretation, writes Sophia Ktori

Next generation sequencing (NGS), high content screening, metabolomics, and other emerging biological disciplines are allowing scientists to drill down into the mechanisms of disease and drug activity at the level of individual biomolecules, genes and interacting pathways. Along with that capability comes the generation of huge and disparate datasets that need to be integrated, analysed and interpreted in context.

‘Customers in the bioinformatics space are asking us for help managing their complex and changing scientific data, especially for NGS and biologics discovery,’ comments John Stalker, product manager, Platform for Science, at Core Informatics. ‘Huge volumes of NGS data need to be stored and analysed, and bioinformaticians use a variety of vendors and tools to generate and analyse their data. Our goal is to enable them to do the analysis they want to do, in the way that they want to do it.’

Core’s approach has been to develop a multi-tenant cloud-hosting infrastructure that allows users to generate their own, configurable environment for tracking and integrating these different processes. Platform for Science (PFS) provides a unique framework for bioinformatics data management, and can integrate disparate data sources and instrumentation. ‘When you model your data in our system, that data becomes exposed in our application programming interfaces (APIs), scientific data management system (SDMS), and other areas, so that you can immediately start coding against it and integrating with other systems and tools,’ Stalker comments.

PFS is a cloud-hosted platform as a service (PaaS), which contrasts with single-point software as a service (SaaS) solutions. ‘Whereas SaaS are offered as individual pieces of software, our platform represents a foundation on top of which you can build applications. It underpins the other products that we have to offer, including laboratory information management system (LIMS), electronic laboratory notebook (ELN), and SDMS capabilities.’

No more closed systems

Core Informatics’ vision for supporting the effective use of complex and huge biological datasets is to move away from ‘closed, black box systems’ that are difficult to get data into and out of, Stalker continues: ‘We are just about to launch version 5.2 of our PFS environment. This will include a new API built on OData, the OASIS-standard Open Data Protocol for building RESTful APIs. It’s really nice and simple to program. We are actively looking for similarly user-friendly and open solutions for our expanding family of APIs.’
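To give a flavour of why OData is considered simple to program against, the sketch below composes a query URL from the protocol's standard system query options ($filter, $select and $top). It is an illustration only: the service root, the entity set name and the field names are hypothetical placeholders, not the actual PFS API.

```python
from urllib.parse import quote, urlencode


def build_odata_url(service_root, entity_set, filter_expr=None, select=None, top=None):
    """Compose an OData query URL from standard system query options."""
    options = {}
    if filter_expr:
        options["$filter"] = filter_expr
    if select:
        options["$select"] = ",".join(select)
    if top is not None:
        options["$top"] = str(top)

    url = f"{service_root.rstrip('/')}/{entity_set}"
    if options:
        # Keep '$', ',' and quotes readable; spaces are percent-encoded as %20.
        url += "?" + urlencode(options, quote_via=quote, safe="$,'")
    return url


# Hypothetical service root, entity set and fields, for illustration only.
example_url = build_odata_url(
    "https://example.invalid/odata/",
    "SEQUENCE_SAMPLE",
    filter_expr="Organism eq 'Homo sapiens'",
    select=["Name", "Length"],
    top=10,
)
print(example_url)
```

Because the query is an ordinary URL, any HTTP client or scripting language can issue it, which is a large part of the protocol's appeal for integration work.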

Core is working with the industry to develop user-friendly apps for the PFS, says Stalker. ‘As an example, we have built a plugin for the Geneious solution from Biomatters, which effectively allows our platform to operate as a document repository. Using the plugin you can query against and annotate your sequences directly and then bring them back into PFS without having to worry about importing and exporting to and from the LIMS and Geneious. It’s all done seamlessly through the API, and this is where our platform technology really shines, as it’s easy to add these integrations. We have additional partnerships ongoing to develop plugins or apps that will directly link the functionality within our PFS with other bioinformatics platforms and tools.’

Core Informatics is effectively fostering what Stalker describes as an ecosystem of developers, and a community that can collaborate to build solutions on top of the PFS infrastructure. ‘Everyone is trying to move away from standalone analysis tools that companies have to buy licences for and install in-house. We want these capabilities offered as a service, on demand. This is what we are developing within the Platform for Science environment.’

Sifting through the public repository

It’s not just in-house research that is creating vast quantities of biological data. Expanding capabilities in multiple biological disciplines have resulted in huge volumes of experimental and analytical data sitting in the public domain. The commercial sector is looking to exploit this repository to help inform and direct its own disease and drug research and development, notes Jaqui Hodgkinson, VP product development biology and preclinical products at Elsevier R&D Solutions.

‘The failure rate in drug discovery is very high, and there is a lot that we can learn from viewing published data contextually alongside in house and collaborative research,’ Hodgkinson notes. ‘However, there are few tools that can filter through millions of published works according to precise search parameters. We have developed Pathway Studio as an open and flexible platform that gives customers the ability to search, mine and model data in exactly this way.’

Pathway Studio is founded on a proprietary natural language-processing text-mining tool that can scan potentially hundreds of thousands of full-text articles in just a few hours for any gene, protein, biological concept or pathway, disease or drug response/interaction of interest. The software allows users to build maps of interacting pathways, as well as map, analyse and visualise complex disease mechanisms, gene networks and drug response profiles.

‘Users can search for key concepts within relevant data sets, perhaps for gene and protein interactions, or inhibitory activities of drugs on gene expression, using simple search terms. Powerful statistical tools also allow users to import and analyse their own experimental data to help identify possible cause and effect on biological pathways, and to model the effects of gene expression and protein-protein interactions on disease,’ Hodgkinson explains. ‘You could, for example, use microarrays to compare analytical samples from sick and healthy individuals, and then import the resulting data into the platform to identify major differences between the datasets, and the networks that are involved.’

Supporting biomarker and target discovery

Pathway Studio can provide new insights into the molecular basis of disease, and supports target and biomarker discovery programs and the pathway analysis of clinical and experimental data, Hodgkinson suggests. ‘We have been particularly interested to see how the platform is also being exploited at the clinical level, by physicians, to help direct diagnosis and treatment for individual patients. It’s not what the software was designed for, but we do know that a number of hospitals are using Pathway Studio as a clinical tool.’

Complementing Pathway Studio, Elsevier offers a comprehensive text mining portal, which allows users to identify every paper in the publisher’s dataset that mentions a specific protein, drug, gene, or even cell type or disease, and also to build content sets for further modeling and analysis. ‘You can mine full texts as well as abstracts, and this often identifies research on a drug target or gene that might otherwise have been missed.’

Intuitive and user friendly

The ultimate goal is to use underlying information to map as many common identifiers as possible, but in an intuitive and user-friendly way, Hodgkinson notes. ‘We have, for example, invested huge amounts of time building a taxonomy around many different cell types, so that users don’t have to type 20 different keywords to ensure that their search covers every synonym that may commonly be used for one cell type. Our goal is to provide the tools that will help researchers and clinicians derive the maximum useful information from existing molecular, cellular, translational and clinical research, and to help them find whether published data have been reproduced or validated and so whether they are reliable.’
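The idea of a synonym taxonomy can be sketched in a few lines. The example below is a toy illustration and not Elsevier's implementation: the taxonomy entries are hand-written and hypothetical, and the matching is a simple case-insensitive, whole-word search that expands a single query term to every synonym recorded for the same cell type.

```python
import re

# Hypothetical, hand-written taxonomy: canonical cell type -> known synonyms.
TAXONOMY = {
    "T cell": ["T cell", "T-cell", "T lymphocyte", "T-lymphocyte"],
    "macrophage": ["macrophage", "histiocyte"],
}

# Reverse index so that any synonym resolves to its canonical entry.
CANONICAL = {
    synonym.lower(): name
    for name, synonyms in TAXONOMY.items()
    for synonym in synonyms
}


def expand(term):
    """Return every synonym recorded for the cell type the term belongs to."""
    name = CANONICAL.get(term.lower())
    return TAXONOMY[name] if name else [term]


def mentions(text, term):
    """True if the text mentions the term or any synonym (optionally plural)."""
    pattern = "|".join(re.escape(synonym) for synonym in expand(term))
    return re.search(rf"\b(?:{pattern})s?\b", text, flags=re.IGNORECASE) is not None


print(mentions("Activated T lymphocytes were isolated.", "T-cell"))
print(mentions("Mast cells were stained.", "T cell"))
```

The whole-word match matters here: a naive substring search for ‘t cell’ would wrongly hit ‘mast cell’, which is the kind of false positive a curated taxonomy is meant to avoid.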

At the end of the day, utilising experimental data in the context of existing data is largely about predicting cause and effect, rather than providing definitive answers. But herein lies another big data problem. How do you search for and filter out reliable, reproducible studies and experimental results, either to corroborate new results and drive the direction of disease, target or drug R&D, or even to instruct patient diagnosis and therapy at the clinical level? This is an important concept, Hodgkinson stresses.

The ability to identify and therefore discard less safe experimental data is a key requirement for researchers who are sifting through published research. ‘How to ensure the provenance, reliability and reproducibility of data is an issue that comes up time and time again. What can or can’t we trust? Our tools allow users to identify easily how long ago research was conducted, and whether it has ever been replicated, when and by whom.’

Making accurate predictions

The higher the quality and relevance of prior data, the more likely it is that predictions based on it will be accurate. ‘This is going to be an exciting area moving forwards,’ Hodgkinson continues. ‘How do we get to the point where we can really start to build a reliable inference base that allows researchers to make confident predictions?’

Elsevier continues to add new functionality to its tools, and is also working with a number of genomics providers on further platform integration. ‘Through these collaborations we hope to make it possible for researchers to step directly back and forth between Pathway Studio and bioinformatics tools, such as Illumina’s BaseSpace Correlation Engine, for example,’ Hodgkinson suggests. ‘Users identify their genes and proteins of interest, and can then take them directly into Pathway Studio to see which other disease processes and pathways may be linked. It’s an area of development that we are accelerating to allow users to combine our platform with their bioinformatics tools directly.’

Distributed research

The move towards distributed research is shaping the industry and market for bioinformatics-related tools, comments Jens Hoefkens, director of informatics at PerkinElmer. ‘Many large pharma companies are outsourcing research to contract research organisations (CROs) and collaborating with academics and biotechs. There is a huge need for informatics solutions that can support cross-organisation management, utilisation and analysis of NGS, high content screening and other sources of bioinformatics data, as they are generated by different instrumentation and migrated between partners in varying formats.’

This means that you have technical challenges not only with respect to giving and controlling access to information, but also with respect to ensuring that the software is easy to use, Hoefkens continues. ‘It’s an issue whichever industry you are in, and is especially relevant in the bioinformatics space, which encompasses a broad range of experimentation and data types. You may be willing for your collaborators to have access to data on your software, but what you don’t want is to have to spend weeks training individuals to use it.’ The imperative to have easily accessible and user-friendly solutions thus becomes more critical as the breadth and extent of bioinformatics data continues to grow, and new software is required to make sense of it, he continues. ‘Our customers also want to work with fewer vendors across disciplines within their R&D environment, so the pressure is increasing for vendors to develop flexible platforms that can support cross-discipline research, data sources and data formats.’

To support collaboration and manage data breadth and depth, PerkinElmer is making a comprehensive move into the cloud. ‘We already have the capability to support streaming a wide variety of data into the cloud in real time, from NGS and imaging data, to clinical trials and high content phenotypic data. But we also appreciate that this will be a stepwise progression for some customers, and so we offer a hybrid strategy where some data processing may be carried out on premises, using tried and tested systems and platforms that our customers trust.’

‘When it comes to data integration, it is not always feasible to rely on experts from each biological discipline,’ Hoefkens adds. ‘You may be integrating NGS with mass spectrometry data or high content phenotypic screening data with pathology information. Our goal is to shield the user from the complexities of these research platforms, to allow them to ask the biological questions that they are trying to answer.’

Facilitating better integration

And when you look at it from a bottleneck perspective it’s more data integration, rather than data processing, that is the issue, he suggests. This is a topic that arises time and time again in discussions on how informatics solutions can support cross-discipline research in any R&D environment. PerkinElmer’s strategy is not to attempt to provide every piece of software that may be required, but to offer tools that facilitate better integration, and to establish more of an open research collaboration platform onto which customers can bolt their own preferred software. ‘Our own data integration and visualisation tool, TIBCO Spotfire, has been built as a generic platform that allows our customers to integrate their profiling data from whichever instrumentation they are using. However, we appreciate that customers may have their own preferred tools.’

PerkinElmer has developed software tools that can sit on top of TIBCO Spotfire or other visualisation platforms to facilitate seamless data integration and interrogation. ‘Our GeneSifter product for the analysis of microarray and NGS data, for example, is offered as an Analysis Edition tool for data manipulation, and as a Lab Edition, which provides laboratory information management system (LIMS) functionality.’ Both editions integrate with the company’s OmicsOffice suite of products for managing qPCR, microarray, NGS and functional genomics data, and all of this data analysis and exploration functionality then sits within Spotfire.

Offered in parallel is Columbus, an image analysis solution that can manage images imported from any major high content imaging instrument. ‘Columbus can extract features such as cell count, numbers of living cells, shapes of cells, as well as tissue pathology,’ Hoefkens comments. ‘Spotfire’s high content profiler module then allows multivariate analysis so that users can identify features of relevance, based on potentially thousands of parameters. And then, sitting next to all this is PerkinElmer Signals, our cloud-based big data platform, and in particular Signals for Translational, which pulls clinical trial, patient, and adverse event data into the same environment as the NGS, microarray and imaging data.’

Faster drug development timelines

The NGS community has for some time exploited computational clusters and high performance computing (HPC) to handle the size and complexity of data, but the benefits of HPC in the imaging field are only starting to be realised, Hoefkens continues. ‘There are some very sophisticated algorithms now being applied to imaging data, and we are working with customers to migrate image analysis infrastructure onto HPC platforms that could dramatically reduce compute time, and in real terms shave possibly months off drug development timelines.’

Interestingly, he suggests that while the life sciences sector has in the past pioneered and driven innovation in software and informatics, the pendulum has now swung and there is significant innovation outside of life sciences. ‘We would do well to tap into the experiences of other sectors, particularly with respect to integrating, analysing and interrogating vast volumes of diversely structured data. And what we must realise is that scientific knowledge and experimental capabilities are also expanding. Vendors such as PerkinElmer need to develop flexible products that will evolve with the R&D landscape.’ 
