
The data dilemma


Beth Harlen reports on the challenges of laboratory data integration

The laboratory landscape is changing. The combination of new technologies and evolving business strategies that – especially within industries such as pharmaceuticals – shift focus towards externalisation means that data integration and exchange present more of a challenge than ever before. The issues being addressed here are not small. Multiple systems exist, each with its own proprietary and often inconsistent data formats, making it difficult to aggregate that data in order to discern meaningful results. This complexity is compounded further when a lab manager or director has to manage multiple teams or groups of work, especially when those teams are located at separate sites, explained Colin Thurston, director of product strategy for process industries at Thermo Fisher Scientific. Integration and data management solutions solve these information challenges, Thurston added, delivering automated data acquisition and distributing data across the enterprise, regardless of the individual report format delivered by each instrument.

‘Thermo Fisher Scientific’s Integration Manager acts as a translator of all the individual languages for those different data consumers, being able to provide and accept data in the format applicable for each instrument and software application, and subsequently deliver it to each user. This allows all levels of management to access lab-sourced data from within familiar systems,’ Thurston said. As an example, he highlighted how process operations staff will see relevant lab batch data alongside online sensor readings within their familiar manufacturing execution system (MES) control panel. Back-office personnel in logistics and quality assurance will see up-to-the-minute quality information and shipment release information within the corporate enterprise resource planning (ERP) system.

Glyn Williams, VP of product delivery at IDBS, added that in the past simple file transfer sufficed – but, with the increased complexity and functionality of instruments and systems, a deeper bi-directional interface is now required. This has led to the development of richer application programming interfaces (APIs) for software and hardware, allowing the levels of integration required. ‘The volumes and richness of data that instruments now produce make linking systems together the only viable alternative, as moving huge data sets from one system to another is no longer practical or desirable,’ said Williams. ‘This in turn leads to transferring metadata as opposed to the whole data set, so that systems or instruments can be searched from a different application, with the data required transferred on an as-needed basis.’
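Williams' description – ship searchable metadata between systems and pull full data sets only on demand – can be sketched roughly as follows. Every name here (RunMetadata, MetadataIndex, the sdms:// URI) is a hypothetical illustration, not any vendor's actual API:

```python
from dataclasses import dataclass

@dataclass
class RunMetadata:
    """Lightweight descriptor exchanged between systems instead of the raw data."""
    run_id: str
    instrument: str
    technique: str
    acquired: str          # ISO 8601 timestamp
    data_uri: str          # where the full data set actually lives

class MetadataIndex:
    """Searchable catalogue of runs; full data is fetched only on demand."""
    def __init__(self):
        self._runs = {}

    def register(self, meta: RunMetadata):
        self._runs[meta.run_id] = meta

    def search(self, **criteria):
        """Return metadata records matching all given field=value criteria."""
        return [m for m in self._runs.values()
                if all(getattr(m, k) == v for k, v in criteria.items())]

    def fetch(self, run_id: str):
        """Resolve the URI and pull the full data set (stubbed here)."""
        meta = self._runs[run_id]
        return f"downloading {meta.data_uri}"  # real code would stream the file

index = MetadataIndex()
index.register(RunMetadata("R-001", "HPLC-7", "chromatography",
                           "2013-06-01T09:30:00", "sdms://site1/R-001"))
hits = index.search(technique="chromatography")
```

The design point is that the search crosses system boundaries using only the small descriptor; the potentially huge data set stays where it was acquired until someone actually asks for it.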

IDBS’ E-WorkBook regularly captures data from other systems or instruments (registration systems, SDMS, CDS, HPLC, etc.) but also pushes to systems such as warehouses or ERP. Williams explained that, in many cases, the ELN is the application that many scientists spend the majority of their time using, and so it becomes the main interface to other systems and makes integration a key factor in simplifying the user experience. ‘When systems are used by many users (often thousands) with many integrations, there are always challenges. These can vary from integrating with old instruments or systems that have limited integration points, to leading-edge systems that have many possible integration points and rich data. These challenges are not insurmountable, but require thought and planning.’

Another company aiming to meet these challenges is PerkinElmer, whose e-notebook provides extensive tools to capture data from a wide range of sources and preserve them in electronic form. In addition, it offers search and retrieval, protected long-term storage, as well as the possibility to deploy in validated environments. Preserving the data is, of course, just one part of the process, as Rudy Potenzone, VP of product strategy for informatics at PerkinElmer, explained: ‘Formats, whether based on standards, proprietary vendor formats or ad hoc user creation, need to be fully defined and understood.

‘Moving data requires a complete knowledge of the form of the data, its units and, often, a description of its history (conditions, source, treatment, etc.). To assure a quality exchange, the appropriate context must be included.’

When sharing data with external collaborators, such as contract research organisations, context is critical. Dotmatics CEO Steve Gallagher believes that a common language between partners is essential to guarantee the success of such ventures. ‘This paradigm shift requires software that will enable the safe and secure communication between partners, independently of where the data is stored or how it is formatted. In addition, geographical constraints mean that cloud solutions have become key, and web-clients are a must.’

The Dotmatics web-based platform can query, analyse and report on multiple data sources at once, then share these findings in a simple and secure way. It is a secure enterprise platform, with sophisticated yet easy-to-tune business rules, that provides an uncluttered view of all data, independently of format, data source or location. Dotmatics is designed not only to share data but to stimulate collaboration across project teams, resulting in improved quality, creativity, and productivity. 

A matter of standards?

The difficulty that arises when integrating data from multiple systems is a lack of consistency – both in terms of metadata and document formats. In simple terms, people can often describe the same data in different ways, and this can be exacerbated in organisations that span a range of scientific disciplines. The consequence is that people can resort to transcribing data manually, as definitions can also vary between systems – a laborious process that invites errors to creep in.
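A minimal sketch of one common mitigation: normalising each lab's local field names to a shared canonical vocabulary before aggregation, so nothing has to be transcribed by hand. The synonym map and field names below are invented for illustration:

```python
# Hypothetical synonym map: each lab's local term for the same field,
# normalised to one canonical vocabulary before aggregation.
CANONICAL = {
    "temp": "temperature_c",
    "temperature": "temperature_c",
    "Temp (C)": "temperature_c",
    "conc": "concentration_mg_ml",
    "concentration": "concentration_mg_ml",
}

def normalise_record(record: dict) -> dict:
    """Rename fields to the canonical vocabulary; unknown fields are kept
    but flagged so they can be reviewed rather than silently dropped."""
    out = {}
    for key, value in record.items():
        out[CANONICAL.get(key, f"UNMAPPED:{key}")] = value
    return out

lab_a = {"temp": 25.0, "conc": 1.2}
lab_b = {"Temp (C)": 25.0, "viscosity": 0.9}
print(normalise_record(lab_a))  # both labs now share one schema
print(normalise_record(lab_b))  # 'viscosity' surfaces as UNMAPPED for review
```

Flagging rather than dropping unmapped fields matters: silent loss is exactly the kind of error manual transcription introduces.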

One possible solution is the introduction of data integration standards – a goal that bodies such as Allotrope Foundation are working towards. Gerhard Noelken sits on the board of directors and the planning committee for the foundation. He explained that an open ecosystem will provide a document standard, metadata guidelines and class libraries for using the standards: ‘The ecosystem will provide a document standard, which means a format in which primary analytical data can be stored in a non-proprietary way.’ In terms of the development of metadata guidelines, the foundation is analysing the current metadata structure across its member companies in order to determine a standard vocabulary. Noelken acknowledged that there are many different vocabularies in place already; however, none has been broadly adopted. To this end, the foundation will use existing vocabularies and information standards wherever possible.

Founded in 2012, Allotrope Foundation is relatively new but, in the past year, has managed to develop a precise problem definition and come up with a plan of how to resolve it, and is now in the process of finding software development partners. Noelken commented that, as a consortium, the key has been for members to leave their in-house mentality behind in order to come together and communicate something as abstract as a data standard. He hopes that using an open framework like Allotrope will revolutionise the way data is shared within companies, across companies and between companies, CROs and regulators.

While Allotrope focuses on building a common laboratory information framework compliant with the regulatory environment, the Pistoia Alliance seeks to ‘lower barriers to innovation in life science research and development through optimising business processes in the pre-competitive domain’. John Wise, executive director of the Pistoia Alliance, explains that Pistoia enables pharma R&D scientists and informaticians to communicate with the technology community and commercial providers, as well as academic or government organisations such as the European Bioinformatics Institute (EBI).

One of the many projects Pistoia is currently involved in is the HELM (Hierarchical Editing Language for Macromolecules) project. Originally developed at Pfizer, HELM will enable a standard notation and software tools for the examination and exchange of information about macromolecules. This consistency is much needed, Wise explains: ‘The industry is moving away from small chemical compounds into the larger, more complicated biological molecule space, and the way of describing macromolecules and the supporting software to manage macromolecules has yet to be well defined.’ The aim is to develop this once-internal technology into a universal industry standard.
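To give a flavour of the notation: a HELM string lists each simple polymer with its monomer sequence, followed by sections (separated by '$') for connections, groups and annotations. The toy parser below handles only the simplest case – a single unconnected chain – and is an illustration, not a conformant HELM implementation:

```python
# Illustrative HELM-style string for a short peptide; real HELM strings
# carry connection, grouping and annotation sections after the '$' separators.
helm = "PEPTIDE1{A.G.C.F}$$$$"

def monomers(helm_string: str) -> dict:
    """Very simplified parser: extract the monomer sequence of each
    simple polymer from the first (polymer) section of a HELM string."""
    polymer_section = helm_string.split("$")[0]
    result = {}
    for polymer in polymer_section.split("|"):
        name, _, body = polymer.partition("{")
        result[name] = body.rstrip("}").split(".")
    return result

print(monomers(helm))  # {'PEPTIDE1': ['A', 'G', 'C', 'F']}
```

The appeal of the notation is visible even in this toy: a macromolecule becomes a short, machine-parseable string rather than a vendor-specific binary record.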

Tackling standards from yet another perspective is the AnIML project. AnIML – the Analytical Information Markup Language – is a standardised data format for storing and sharing experimental data. Recognising that there are discrete silos of information within an organisation, the project is attempting to build bridges between these heterogeneous data sources.

BSSN Software has been involved with the project since 2003, and president Burkhard Schaefer says he hopes the project will push the core standard through the ASTM balloting process towards the end of 2013: ‘Of course, it’s one thing to have the de facto standard – which has been complete for almost two years now – but actually getting the seal of approval from ASTM is crucial for fostering adoption and ensuring that people aren’t running after a moving target when they actually deploy the standard.’
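The idea behind an XML-based format like AnIML can be illustrated with a schematic document: instrument-neutral elements carrying the sample, the technique and the measured series, readable with any standard XML tooling. The element names and attributes below are simplified for illustration and are not the official AnIML schema:

```python
import xml.etree.ElementTree as ET

# Schematic, simplified document in the spirit of AnIML: a vendor-neutral
# XML container for experimental data. Element names here are illustrative.
doc = """
<AnIML>
  <SampleSet>
    <Sample name="Batch-42" sampleID="S1"/>
  </SampleSet>
  <ExperimentStepSet>
    <ExperimentStep name="UV Absorbance" technique="UV/Vis">
      <Series name="absorbance" unit="AU">0.12 0.34 0.29</Series>
    </ExperimentStep>
  </ExperimentStepSet>
</AnIML>
"""

root = ET.fromstring(doc)
step = root.find("./ExperimentStepSet/ExperimentStep")
series = step.find("Series")
values = [float(v) for v in series.text.split()]
print(step.get("technique"), series.get("unit"), values)
```

Because the container is plain XML, any consuming application – LIMS, ELN or archive – can read it with stock libraries, which is precisely the long-term-retention benefit Schaefer describes below.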

The benefits

Should standards come together, what can we, as an industry, expect? Schaefer believes that one critical benefit will be the lowering of integration costs, as companies would only need to interface against that one standard. ‘If there is a normalised exchange mechanism that is open and standardised, the incremental costs of adding instruments or data systems are greatly reduced,’ he said. Schaefer added that standards will also greatly reduce the total cost of ownership in terms of long-term data retention. ‘Retaining a data archive means that companies must maintain the software and hardware capable of reading that file type. If you multiply that by the versions of software that have ever contributed a record to that archive, the total cost of ownership explodes. The idea is to reduce the number of file formats that need support, and consequently reduce the number of tools, in the long term.’

Beyond that is the problem of software from one vendor being incompatible with hardware from another. ‘But if organisations invest in standard compliance to do the post-processing of the data from all instruments, they can standardise their methods and simply do feature-driven purchasing,’ explained Schaefer. Standardisation may seem like a logical step – but, as he warned, not only are the resources in the community scarce, but the question remains of whether vendors will be able to agree with their competitors that data should be presented in a certain way. ‘A standard is only a piece of paper, and you cannot implement a piece of paper,’ Schaefer said. ‘There is the need for tooling and vendor support.’

Randy Bell, director of operations at LabLite, believes that in order to be effective, standardisation must be driven internally. ‘Externally, I just don’t think it is realistic,’ he said. ‘For example, take a simple environmental lab with a laboratory information management system (LIMS) and a few instrument interfaces, which on occasion uses contract labs and has to provide compliance reports to the state. The LIMS will have its own data format and naming convention, and each instrument could output data differently. The format of the data received from each of the contract labs will most likely be different, and each state has different electronic data deliverable (EDD) reporting formats.

‘Even in this simplistic example both the lab and the LIMS vendor have to handle uniquely named and formatted data. I don’t necessarily look at this as a bad thing. I think it is just something we have to recognise exists in any implementation.’ 

He continued: ‘We would probably all agree that data integration standards would be helpful; however, with the number of different lab and test types, equipment manufacturers, LIMS vendors, and client requirements, the reality is that each job poses its own unique data integration challenges. Based on the diversity of lab types and the various formats in which they receive data, I don’t think we will ever get to the point where there is one standard.’
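Bell's point – that the same result must be re-shaped for every recipient – can be made concrete with a small sketch. The two state EDD column layouts below are invented for illustration and do not reflect any real state's specification:

```python
import csv, io

# Hypothetical internal record and two invented state EDD column layouts,
# illustrating why the same result must be re-shaped per recipient.
RESULT = {"sample_id": "ENV-2013-007", "analyte": "Lead",
          "value": 0.012, "unit": "mg/L"}

EDD_LAYOUTS = {
    "state_a": ["SampleID", "Parameter", "Result", "Units"],
    "state_b": ["LabSampleNo", "AnalyteName", "Concentration", "Unit"],
}
FIELD_ORDER = ["sample_id", "analyte", "value", "unit"]

def to_edd(record: dict, state: str) -> str:
    """Render one result row in the named state's EDD layout."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(EDD_LAYOUTS[state])
    writer.writerow(record[f] for f in FIELD_ORDER)
    return buf.getvalue()

print(to_edd(RESULT, "state_a"))
print(to_edd(RESULT, "state_b"))  # same data, different header vocabulary
```

Even in this toy, each new recipient means another layout entry to write and maintain – Bell's argument for why a single external standard is unlikely to arrive.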

In conclusion, John Wise of the Pistoia Alliance believes there will be a coming-together of standards, but that standards in life science R&D have historically always been a challenge – one that is set to continue.

Ryan Sasaki, director of Global Strategy at Advanced Chemistry Development (ACD/Labs), discusses small data problems

Heterogeneous data distributed across different silos in different labs is a major obstacle to retrieving and re-using information and knowledge extracted from analytical data. The first issue is related to data access. Currently, there are more than 20 historically important analytical instrument manufacturers – many of which offer more than one instrument model – for chromatography, mass spectrometry, optical spectroscopy, nuclear magnetic resonance (NMR) spectroscopy, thermal analysis, and x-ray powder diffraction. Throughout any given organisation there may be a variety of different instrument-selection strategies based on the complexity and business importance of the scientific problem. As a result, laboratory managers may seek a balance between ‘best-of-breed’ and ‘fit-for-purpose’ instrument solutions that inevitably compound the problem described above by creating disparate laboratory environments that consist of a variety of instruments and data systems that don’t work together.

While the long sought-after, ‘holy grail’ solution is to come up with an industry standard file format for all analytical instrument data, this standard does not exist today. ACD/Labs is a software informatics company with long-standing partnerships with analytical instrument vendors and serves as a third-party technology resource to help organisations tackle the challenge of heterogeneous laboratory data formats. With file-handling support for more than 150 formats from multiple analytical techniques, this technology can help laboratory organisations provide their scientists with better access to ‘live’ analytical data.

The second major challenge that sits on top of the heterogeneous data format challenge is the ability to extract and capture knowledge effectively from analytical data. This will often require dynamic interaction between the analytical data acquired, the chemistry being studied, and the scientist doing the analysis. At the end of the day, while the optimisation and generalisation of an analytical measurement is constantly evolving, for most end-users the actual data is a means to an end. Traditional informatics systems like LIMS, ELNs, and archiving systems do a fine job in handling the regulatory aspects of proving that data was generated in accordance with a sample and for documenting scientific conclusions. However, these systems do not capture the key observations and interpretations that lead to a structure confirmation or characterisation.

Through the adoption of a unified laboratory intelligence (ULI)[1] framework, organisations can collect data from different instruments across laboratories, convert heterogeneous data to homogeneous structured data with metadata, and store unified chemical, structural, and analytical information as ‘live’ data. This provides the ability to apply chemical context (the ‘why’) to the vast amounts of analytical content (the ‘what’).
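The 'convert heterogeneous data to homogeneous structured data' step might look roughly like the following sketch, in which two invented vendor export shapes are folded into one unified record that keeps the chemical context next to the analytical content. All field names are hypothetical:

```python
# Sketch of the unification step: two invented instrument export shapes
# folded into one record schema that carries structure (context) alongside
# the measured data (content).
def from_vendor_a(raw: dict) -> dict:
    return {"technique": "NMR", "sample": raw["SampleName"],
            "structure": raw.get("SMILES"), "data": raw["fid"],
            "metadata": {"frequency_mhz": raw["SF"]}}

def from_vendor_b(raw: dict) -> dict:
    return {"technique": "LC/MS", "sample": raw["sample_id"],
            "structure": raw.get("structure_smiles"), "data": raw["spectrum"],
            "metadata": {"ionisation": raw["ion_mode"]}}

records = [
    from_vendor_a({"SampleName": "CPD-17", "SMILES": "c1ccccc1",
                   "fid": [0.1, 0.5], "SF": 400.13}),
    from_vendor_b({"sample_id": "CPD-17", "structure_smiles": "c1ccccc1",
                   "spectrum": [(77, 100)], "ion_mode": "ESI+"}),
]
# One query now spans both techniques for the same compound.
hits = [r for r in records if r["sample"] == "CPD-17"]
```

Once both exports share a schema, a single query pulls together every measurement made on a compound, which is what makes the stored data 'live' rather than an opaque archive.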

[1] Ryan Sasaki and Bruce Pharr, Unified Laboratory Intelligence

 

Simon Tilley, head of pharmaceuticals at SAS UK, on meeting regulation in a big data age

Imagine the vast amounts of data generated from 12,000 or so clinical studies over two decades. Multinational companies often conduct trials across the globe, and also take on old trials conducted by acquired companies. The data is held in different siloed systems, formats and countries, making it difficult to respond to constantly evolving data standards and naming conventions – and to regulatory requests that demand data relating to specific studies within a matter of hours. With a sea of data to contend with, and often no idea where to start, one multinational pharmaceutical giant recognised that it was becoming increasingly difficult to meet these demands and that it needed a data management solution to convert that information into easily accessible and usable data.

To help consolidate and harmonise all of its data, the company worked with a three-company consortium, including SAS, which designed and built a programme spanning multiple locations across three continents. The challenge was to find a way to integrate data that varied in age by 20 years, during which time relevant data standards had changed massively. The processes also had to be transparent and fully auditable.

Through a combination of SAS technology, domain knowledge, training and technical expertise, the pharmaceutical heavyweight was able to drive efficiencies in the way that it was accessing and managing clinical data, through the creation of a central repository. While the repository is physically hosted at SAS’ headquarters in the US, lab workers can access the information needed quickly and easily, and demonstrate a trail of how and when the information is accessed for audit purposes.

By applying an analytics solution to its data management challenge, the pharmaceutical company can respond to regulatory demands in the fastest, most accurate and cost-effective way. In creating efficiencies within its data storage and analysis capabilities it can avoid penalties from regulatory bodies and, from a pharmacovigilance perspective, react to adverse events, drugs misuse and medication errors as they occur. An advanced solution like visual analytics can help pharmaceutical professionals glean rapid insights from clinical trials for faster and smarter decisions.

Having data represented visually in one centralised hub also means that lab workers can report to key stakeholders or regulatory bodies with easily digestible yet in-depth information, justifying their work and securing a stamp of approval in line with stringent regulation. And that’s easier said than done!