FEATURE

Facing a formidable challenge

Turning data into scientific insight is not a straightforward matter, writes Sophia Ktori

The lack of data standards is a long-standing challenge across many scientific disciplines and industries, and as organisations attempt to turn their data into scientific insight, the proliferation of both current and legacy file formats poses a formidable challenge, comments Trish Meek, senior manager for product marketing at Thermo Fisher Scientific: ‘Without standardisation, system integration has to be effected at the project level, or by vendors on a case-by-case basis.’

‘Historically there have always been a lot of data formats out there, but in the life science R&D space, particularly, there is a broad range of analytical and experimental workflows,’ comments Paul Denny-Gouldson, VP of strategic solutions at IDBS. ‘Analytical testing of synthetic and biological samples, imaging and modelling tools, high throughput and high content screening and next generation sequencing are just some of the techniques that are creating data in formats that can’t be easily interrogated alongside – or integrated directly with – data from other platforms.’

When information is scattered across the company in individual workstations and multiple, disparate formats, data sharing and collaboration – even within the same organisation – becomes difficult, if not impossible, stresses Darren Barrington-Light, software marketing specialist, informatics and chromatography software at Thermo Fisher Scientific. ‘Also, when data systems are retired or replaced by a different vendor’s solution, the existing data can pose something of an issue – what should happen to that data? It needs to be retained and made accessible after the old system is retired, which is often difficult for technical reasons such as unsupported operating systems or hardware. When this happens, sometimes the only options are either a laborious import to the new system, or the maintenance of a cut-down legacy system.’

Collaboration and externalisation also necessitate moving experimental and analytical data across firewalls into the informatics infrastructures of partners, collaborators and service providers, including CROs. ‘Each organisation will have its own informatics infrastructure and software,’ Denny-Gouldson adds. ‘They will quite possibly all use instrumentation from different vendors, and their data formats may not be compatible with those used by collaborators’ hardware and software.’

How much easier it would be, Meek adds, if true standards existed. ‘Integration would become much easier if open standards were universally adopted across the scientific community, enabling free flow of data throughout organisations from start to end, without transformations or customised interfaces.’

And yet, although many end-users do want to see widespread adoption of open, standard data formats, there is still some resistance by a few hardware manufacturers, Denny-Gouldson continues. ‘If you have developed an analytical instrument with software that does something completely new, then you have a unique selling point for that instrument. As soon as you start publishing that instrument’s data in an open standard format, you lose that USP.’

This is a point reiterated by Burkhard Schaefer, president of BSSN Software. The benefits of a common data format may not be viewed quite the same from the instrument vendor side, he concurs. ‘If you are a manufacturer of, say, a commodity HPLC instrument, and you have a base of end users who are all set up to work with your software and data formats, then they are more likely to stay with you when they upgrade or need new machines. But if all the HPLC instruments from all the vendors use the same standard data formats, then you lose that vendor lock-in. If another vendor comes along with a cheaper, equivalent instrument, then it could put your business at risk.’

Even some of the pharma companies can be a bit reluctant to take the concept of standardisation on board, because they are comfortable using their existing, proven processes and software, admits Gene Tetreault, senior director of products and marketing at Dassault Systèmes BIOVIA. ‘We need to get across the idea that adopting standard data formats will improve laboratory efficiency, give end users much more freedom to choose a wider range of instrumentation, and ultimately allow all their instruments and software systems to work together without the need for integration tools. And this will allow us to focus more on building capabilities that provide real scientific value.’

Companies like IDBS that specialise in data management are encouraging the development and adoption of data standards, Denny-Gouldson stresses. ‘We are not an instrument vendor or a provider of software that analyses, reanalyses, or does specific science. From our perspective it is actually better if we get a data standard in an open format, because then we don’t have to use specific and bespoke connectors to get the data out. Wherever possible, we use open formats to store and publish data, even if that format is something as simple as a CSV file. We have supported AnIML for the last six or seven years, and we are also a partner of the Allotrope Foundation. In the big utopian view of the world, all data would be generated in a standard open format. This would also make it easier for niche providers to make their software or instrumentation more saleable. What you don’t want, however, is to have so many standard data formats that you essentially end up with just another collection that throws up a different set of cross-communication issues.’
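The appeal of ‘something as simple as a CSV file’ is that any downstream tool can read it. A minimal sketch of the idea, in Python – the function and field names here are invented for illustration, not drawn from any vendor’s actual schema:

```python
import csv
import io

# Hypothetical example: publishing instrument readings in an open CSV
# format, with run metadata kept alongside the results so nothing is
# lost in transport. All names are illustrative.
def export_readings_csv(readings, metadata):
    """Write readings plus run metadata to a CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    # Record run metadata as comment-style rows any consumer can read
    for key, value in metadata.items():
        writer.writerow([f"# {key}", value])
    writer.writerow(["sample_id", "wavelength_nm", "absorbance"])
    for row in readings:
        writer.writerow(row)
    return buf.getvalue()

csv_text = export_readings_csv(
    [("S1", 260, 0.82), ("S1", 280, 0.41)],
    {"instrument": "UV-Vis (hypothetical)", "operator": "jdoe"},
)
print(csv_text)
```

The point of the sketch is the trade-off Denny-Gouldson describes: the format is trivially open, but only the metadata you choose to carry along survives the export.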

If we can get data standards adopted by instrument and software manufacturers, then adding or changing analytical instruments would feasibly be a fairly simple plug and play exercise that doesn’t require major changes to your software system, adds Andrew Lemon, CEO at The Edge Software Consultancy. ‘It really boils down to interoperability. Data standards won’t just make in house or collaborative data more usable in an analytical or experimental setting. They will also allow researchers to make use of public databases more effectively. In the experimental biology sector, for example, data standards will feasibly make it possible for a researcher to relate their own analytical results to biological pathway data on a public database, and also to databases informing about the genes that might be activating that pathway.’

Customers of The Edge are primarily in the experimental biology sector, working in areas including in vitro/in vivo pharmacology, ADME and DMPK. ‘Although these are well-established disciplines, there are really very few data standards for this space, and it is a struggle to get systems to talk to one another. We expend a lot of effort to build in the ability to capture, structure and store results in a single database that allows users to access multidimensional experimental and analytical data, and interpret it all in context.’

One way of speeding the development and uptake of data standards is to get involved in the development of such standards, Lemon points out. ‘We have worked with initiatives such as the Pistoia Alliance, and others, to bring our understanding and experience into this space. The industry must collectively look at the needs from the perspective of the end users. The pharma companies and laboratories don’t purchase software and then find the instrumentation that can run with that software. End users purchase instrumentation first, and expect it to work with their existing infrastructure.’

The need for data standards may therefore seem obvious but, while efforts to develop and adopt data standards such as AnIML in the analytical chemistry space have been encouraging, equivalent moves in the bioanalytical space are lagging behind, Lemon suggests. ‘In my experience end users working in this sector aren’t really aware of data standards. Yet the ability to output data from all these technologies in one format would significantly help to streamline research.’

A true open data standard will structure data at the point of capture, so that any part or all of the information – including any metadata around that information – can be extracted intact. ‘Right now we still have to rely on file formats that can be exchanged, but they are just a way of exporting data so that users in different departments or different organisations can read it. A lot of data is thrown out in the process of supplying pieces of information in a format that can be easily transported. The only way to retrieve lost information is to go back to the raw data files.’

There’s nothing wrong with sending data in a PDF or a Word document, so long as you can easily go back to the database from which it was generated and extract more complete information, Lemon adds. ‘And if all the information in that database is in the same format, then you have the potential to search and mine analytical data from multiple instrument types.’

There is one potential stumbling block that might also hinder the rapid uptake of data standards by end users, even if instrument and software vendors were quick to adopt open, standard format languages, Lemon suggests. ‘Ideally those formats can be applied to existing instrumentation; otherwise it’s going to take a decade or so to get through the instrument replacement cycle before uptake can be considered anywhere near comprehensive. Our clients aren’t going to be able to change their instruments or software just because a new data standard is only being released with new equipment.’

There have already been instances of some data formats naturally being accepted as standards for particular applications, comments John Stalker, product manager, Platform for Science at Core Informatics. ‘The .ab1 file format, for example, originally developed by Applied Biosystems as part of the capillary sequencing platform, became the de facto standard in that space. These were the files stored in NCBI, and all work around analysis tools at the time centred on that input format. In this case, uptake and acceptance of the format as a standard has been organic rather than contrived. Similarly, there is increasing commonality in areas such as next generation sequencing, where much of the data ends up in .bam and .sam files, irrespective of the sequencing instrumentation or methodology used.’
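The .sam format Stalker mentions illustrates why de facto standards take hold: it is plain, tab-separated text with eleven mandatory fields defined by the SAM specification, so any tool can read an alignment record. A toy parser for those mandatory fields (it ignores header lines and optional tags):

```python
# Minimal sketch of parsing one alignment record from a SAM file
# (the text counterpart of binary BAM). The 11 mandatory fields come
# from the SAM specification; this toy parser handles only those.
SAM_FIELDS = [
    "qname", "flag", "rname", "pos", "mapq",
    "cigar", "rnext", "pnext", "tlen", "seq", "qual",
]

def parse_sam_record(line):
    """Split one SAM alignment line into a dict of its mandatory fields."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    # These fields are integers per the specification
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    return record

rec = parse_sam_record(
    "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tFFFFFFFF"
)
print(rec["rname"], rec["pos"], rec["cigar"])
```

In practice one would reach for an established library rather than hand-rolling a parser, but the sketch shows how little machinery an open, text-based standard demands of its consumers.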

Standardisation of data formats in the R&D sector could also learn a few lessons from the web itself, Stalker continues. Safari, IE, Chrome and Firefox are all competing web browsing platforms but, nevertheless, all work with the same HTTP and HTML standards. ‘It was necessary to get standards in place very early on in the development of the internet – otherwise it would all have collapsed.’

One alternative to the implementation of across-the-board standard data formats is to exploit the burgeoning availability of microservices in the cloud, Stalker continues. ‘There has been a huge movement towards containerisation of microservices in the cloud. There is a burgeoning catalogue of web-based converters and translators that can just be picked up and clicked together to allow the integration of different data formats. As the cloud grows and more tools become available to simplify the connection of disparate software and data formats, the need for a standard may become less urgent. These containerised microservices are becoming increasingly commoditised, so we can just consume them as a service, in real time. We’ve seen this in the sequencing arena over the last couple of years, with purpose-built businesses that string together pipeline analyses – for example, Seven Bridges, DNAnexus and GeneStack. Soon, consumers will be able to wire up plate readers, LC/mass spec, and a myriad of other data producers in a more cohesive way.’
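The ‘clicked together’ converters Stalker describes amount to function composition: each service takes one format in and emits another, and a pipeline chains them. A hypothetical sketch – the converter bodies and the toy vendor format are invented for illustration, with plain functions standing in for containerised services:

```python
import json

# Stand-in for a proprietary-format decoder (illustrative only):
# "S1:0.82;S2:0.41" -> [("S1", "0.82"), ("S2", "0.41")]
def vendor_blob_to_rows(blob):
    return [tuple(item.split(":")) for item in blob.split(";")]

# Stand-in for a second converter service: rows -> open JSON
def rows_to_json(rows):
    return json.dumps([{"sample": s, "value": float(v)} for s, v in rows])

def pipeline(data, *stages):
    """Chain converters, feeding each stage's output to the next."""
    for stage in stages:
        data = stage(data)
    return data

result = pipeline("S1:0.82;S2:0.41", vendor_blob_to_rows, rows_to_json)
print(result)
```

The design point is that no stage needs to know about any other; new formats join the ecosystem by adding one converter, not by renegotiating a shared standard.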

So what are the ideal attributes of a data standard? From a technical perspective, it’s all about accessibility for current technologies, as well as being able to embrace new types of analytical instrumentation and methods as they are developed, suggests BSSN Software’s Burkhard Schaefer. ‘A standard must have a natural longevity, and be flexible enough so that it can easily be adopted by new analytical approaches and techniques.’

Data standards will also help to make sure that data can be retained and easily retrieved, perhaps decades on, he points out. ‘Regulated industries, and particularly the pharma industry, are looking at data retention times of 50 years or more, so having data in a format that can be read without requiring additional software is important. As soon as you put something in binary format, then there’s always a chance the software that can read that format now will become outdated or obsolete.’

Data will often outlive the data system that generated it, Barrington-Light concurs. ‘This leaves customers with a serious issue – how can I search and view that data without the original software? If the data was in a standardised format that could be read by any current data system, then the need to retain outdated systems would be removed, allowing customers to select their next data system based purely on their business needs and not on compatibility with a previous system’s data format.’

Standard formats also spell good news for the software providers, Schaefer notes. ‘The more instruments that you can connect to one vendor’s platform, the more valuable and versatile that platform becomes. And an open data standard format will not only make partnering and outsourcing more seamless, it will help to ensure equivalence with respect to data quality, content and reporting.’

At the end of the day, end-users don’t look for standards; they look for instruments that do a particular job in a particular way, and informatics solutions that will allow them to carry out an application or solve a business problem, Schaefer points out. ‘A data standard only represents a tool that can make something easier to carry out, but it never solves the original problem.’

We need to consider how a standard might function, continues Schaefer, who is championing the XML-based AnIML data standard, and is also involved with eight other standards organisations, including SiLA. The primary requirements include ease of adoption – how well it plays with current software tools and instruments. ‘When you design a standard data format, it should, most obviously, fall into place within any existing informatics environment. People already have an infrastructure in place and they want to be able to use the standard without too much new learning, so that their productivity isn’t compromised.’ Cost is another major consideration. ‘You have to ensure that the total cost of ownership will not be prohibitive – just for the sake of getting two systems to talk to one another without the need for an integration tool. And if your data format isn’t an open format, then that cost could be considerable.’
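An XML-based format like AnIML plays easily with existing tools because every mainstream language can build and parse XML out of the box. The sketch below is only in the spirit of such a standard – the element and attribute names are invented for illustration and are not the actual AnIML schema:

```python
import xml.etree.ElementTree as ET

# Illustrative only: a minimal XML document in the spirit of an
# AnIML-style standard, pairing results with their metadata.
# Element and attribute names are invented, not the real schema.
root = ET.Element("AnalyticalData", version="0.1")
sample = ET.SubElement(root, "Sample", id="S1")
series = ET.SubElement(sample, "Series", name="absorbance", unit="AU")
for wavelength, value in [(260, 0.82), (280, 0.41)]:
    point = ET.SubElement(series, "Point", wavelength_nm=str(wavelength))
    point.text = str(value)

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Because structure, units and acquisition context travel inside the same self-describing document, a consumer needs no vendor software – only a standard XML parser – which is exactly the low-cost-of-adoption property Schaefer describes.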

Schaefer also maintains that development of standard data formats should be overseen by a recognised standards body, such as ISO, IUPAC or ASTM, under the auspices of which AnIML is being developed.  ‘As well as offering the necessary expertise to underpin development of a data format, a standards body provides both credibility and some assurance of longevity, and helps to ensure a level playing field for all stakeholders, including vendors and end users, academia and governmental bodies. Stakeholders need to have an equal voice, and a standards body can ensure that the development process is fair and truly consensual.’

The challenge isn’t just about developing harmonised data formats for recording, storing and reporting data that comes out of mass spec and chromatography devices or next generation sequencers, stresses Andrew Anderson, vice president of innovation at ACD/Labs. ‘They must focus equally, and in parallel, on standardising taxonomies and other data that we put into the system, so that experiments can be reproduced on any vendor’s instrumentation. Users need to describe the critical quality attributes of their method, in that data standard also.’

Taxonomies need to be sufficiently broad to cover a wide range of instrument types, and flexible enough to encompass new instrumentation and experimental procedures, adds Graham McGibbon, manager of scientific solutions and partnerships at ACD/Labs. ‘Taking FTIR spectroscopy as an example, then there will be a set of acquisition settings that are common to every FTIR instrument, whoever the vendor. It is relatively easy to include these settings as a matter of routine in a digital file alongside the recorded results, so that when someone comes to repeat the experiment, they have all the necessary settings and metadata.’

The situation becomes more complicated when there are instrument-specific or vendor-specific settings that fall outside the standard set of instrument data, and which may cause an experiment to yield different results if not included. ‘This would be particularly important if, for example, you want to build a library of FTIR spectra. Unless you have those acquisition settings, instrument-specific metadata and any other common metadata required to configure other vendors’ machines identically, you will need to carry out all your analyses on the same instruments. Importantly for that purpose, this doesn’t mean one needs each and every piece of parameter metadata from every instrument, but some will be needed, and creating a standardised terminology for those is a key facilitator. Since a standard should allow for analytical instrument innovation, one must also plan how such a standard will accommodate changes without creating versioning problems or other incompatibilities.’

ACD/Labs is an Allotrope Foundation partner, and is encouraged by the progress of the initiative and its approach, Anderson suggests. ‘Allotrope is starting with its members’ most mission-critical activities, and trying to build end-to-end digital systems that will minimise reliance on abstraction moments – when you have to rely on human intervention to move data from one system to another, or between different areas of a decision support platform. Rather than try to boil the ocean, Allotrope is building from a position of need. The aim is to include – along with its standardisation of output data format – taxonomies/ontologies, methods and procedures, as well as the flexibility to learn from specific examples and then adjust to more general situations. Achieving the ultimate goal of building an open format standard will require input and collaboration from and between all stakeholders – including the instrument developers and vendors, end users, and intermediaries such as ourselves and other informatics software developers.’

And, with a standardised data format, users are able to retrieve original raw data or converted files easily, and to respond much more promptly and effectively to requests from colleagues, regulatory or legal bodies, Barrington-Light notes. ‘The data sharing and recall capabilities of the system can aid development of new ways of reanalysing samples, and developing predictive models that are impossible when the information is scattered across the company in individual workstations and multiple disparate formats. 

‘The development of truly open standards will take this even further, to enable scientists to make connections between data sets that aren’t possible today.’

Achieve all that and you could potentially foresee a future in which software systems make decisions, as well as provide the data that allows those decisions to be made, McGibbon continues. ‘With standardised formats for inputting and extracting data, it becomes feasible for software to generate, interpret and make critical development or business decisions based on initially heterogeneous data.

‘But this will only be possible if we can utilise data comprehensively, seamlessly and without any loss of context – and that requires ongoing development and implementation of standard data formats sufficiently granular and flexible to handle data from any experimental, analytical or decision-support system.’
