Data standards


As the amount of data increases, organisations are looking to data standards to increase the value of data in the laboratory, writes Sophia Ktori

The Pistoia Alliance has been pioneering initiatives that will support organisations as they work to enable and engage Fair (findable, accessible, interoperable, reusable) principles of data management and stewardship in their organisations. The Fair data guidelines, published by Wilkinson et al in 2016, have been adopted across industries as the foundation for all organisations to maximise the value and interoperability of their data.
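The four Fair principles can be made concrete with a toy checklist. The sketch below is purely illustrative, not any official assessment tool; every field name (pid, access_url, format, licence, provenance) is invented for the example.

```python
# Toy illustration of the four Fair principles as a checklist.
# All field names are hypothetical, invented for this example.

OPEN_FORMATS = {"csv", "json", "xml", "netcdf"}

FAIR_CHECKS = {
    "findable":      lambda r: bool(r.get("pid")),         # persistent identifier
    "accessible":    lambda r: bool(r.get("access_url")),  # retrievable via a protocol
    "interoperable": lambda r: r.get("format") in OPEN_FORMATS,
    "reusable":      lambda r: bool(r.get("licence") and r.get("provenance")),
}

def fair_score(record):
    """Return (score out of 4, list of principles the record fails)."""
    failed = [name for name, check in FAIR_CHECKS.items() if not check(record)]
    return len(FAIR_CHECKS) - len(failed), failed

record = {"pid": "doi:10.1234/example", "access_url": "https://example.org/d1",
          "format": "csv", "licence": "CC-BY", "provenance": "study-42"}
print(fair_score(record))  # (4, [])
```

A record with no persistent identifier, licence or open format fails on every principle, which is exactly the kind of gap analysis a Fair audit surfaces.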

Fair principles hinge on the digitisation of data acquisition, access, management and mining, explains Ian Harrow, a Pistoia Alliance consultant. And while the concept may seem to be intuitive, he believes that companies are still finding it hard to get to grips with Fair as an ideology for data management.

With this in mind, the Pistoia Alliance is working to develop a toolkit that will support companies as they work to understand and implement Fair guidelines across their organisations, and not just individual departments.

The toolkit – the first version of which is due for release early in 2020 – will be hosted on a freely accessible website, and will encompass selected tools, best practices, training materials, use cases and a methodology for change management. ‘The aim is to provide easy-to-use, straightforward tools in a one-stop-shop environment that will help companies in their drive to change the corporate mindset and implement a Fair data infrastructure enterprise-wide,’ Harrow stated.

In fact, it’s that concept of change management that is key to success, Harrow pointed out. Many of the software platforms and tools that will help companies establish a Fair data infrastructure are already available, but the how and why of implementation may be less well understood, especially when cost is involved.

‘In effect, it’s not necessarily the technology pieces that companies are struggling with. The stumbling block is sometimes the cultural evolution that will be needed across the organisation. Scientists think of the data that they generate in terms of ownership, but at every level, from scientist up to senior management, people have to move on from the kind of magpie mentality and view pockets of data, or data streams, as part of the larger whole, and integral to the concept of data as a corporate asset.’

The Pistoia Alliance toolkit will combine pointers to digital/software tools and other resources, or technologies that help companies to measure how Fair their data is, and how to increase that. From a practical perspective the Pistoia initiative shows organisations how to use existing software tools as part of a complete package for making data Fair. But the toolkit also encompasses education and training tools that will help users to work out how best to implement change across the organisation.

Fair enough

The overarching aim is to demonstrate the how and why, not necessarily to point to or provide all of the tools that will enable companies to accomplish that. Use cases will demonstrate real-world applications and benefits, Harrow said.

‘Use cases are vital, real-world pieces of the toolkit that can show the benefits of working towards making data Fair. In most cases there will be aspects of making data “Fair enough”, rather than “perfectly Fair”, and the use cases demonstrate what this might mean, in terms of data management and handling on a practical, and ongoing basis. It’s commonly a matter of perspective.

‘The toolkit will include a number of use cases from the different project teams, and they will really sell why making data Fair is so important. Ultimately we want to show how the application of Fair and standardised data can impact on bottom lines – return on investment and development costs/timelines,’ he said.

One of the biggest practical hurdles is what to do with data that is acquired through mergers and acquisitions, when potentially massive amounts of legacy data come into an organisation, commonly in proprietary formats. Someone has to grade the relative importance of all this data, and prioritise its acquisition and husbanding, Harrow said.

‘You have to identify your low hanging fruit to show management – who holds the purse strings – the value of making that legacy data Fair and integrating it with your existing in-house data. There has to be a balance between pragmatism and forward thinking. Where does it make sense to make your data Fair? Not just for your operation now, but potentially to match future projects and objectives,’ added Harrow.

Once the ‘whys’ and ‘whats’ have been established, at what point do companies start to figure out the practical ‘how’ of making their data streams Fair, or of putting in place the infrastructure that will ensure future, as well as historical, data are Fair? This is where it becomes important to think about data standards. ‘Importantly, organisations should use existing open source, community standards and protocols that have already been well tested,’ Harrow noted.

‘What you don’t want to do is make use of a proprietary solution for standardisation. One of the big strengths of the Fair principles, and how they are implemented, is that they embrace open standards. That doesn’t mean that your data is open. The data itself can still be closed, and locked in to your system. But the standards and the protocols that are used to manage and, where appropriate, communicate that data should be open, and relevant. Examples include AnIML and SiLA, and the Allotrope Foundation. EMBL’s EBI and NCBO’s BioPortal are key sources for ontologies,’ stated Harrow.
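The closed-data/open-standards distinction can be sketched in a few lines. This is a hypothetical example, not a real client for EBI or BioPortal: an in-house vocabulary is mapped to open ontology identifiers so the data itself can stay closed while its description uses community standards. The IDs below (EX:…) are placeholders, not real ontology terms.

```python
# Hypothetical mapping of in-house column names to open ontology identifiers.
# The "EX:" IDs are invented placeholders, not real ontology terms.

LOCAL_TO_ONTOLOGY = {
    "glucose_mmol": "EX:0000101",
    "bp_systolic":  "EX:0000102",
}

def annotate(columns):
    """Pair each local column with its ontology ID; None flags a mapping gap."""
    return {c: LOCAL_TO_ONTOLOGY.get(c) for c in columns}

print(annotate(["glucose_mmol", "weight_kg"]))
# {'glucose_mmol': 'EX:0000101', 'weight_kg': None}
```

The `None` entries are the useful output: they show exactly where local terminology still needs to be reconciled with an open vocabulary.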

Companies should consider that Fair and standardised data go hand-in-hand, whether the underlying science is hypothesis-driven, or data-driven, Harrow suggested.

‘Whatever data lake you are fishing in, the chances are if your data follow Fair principles and are accessible in standardised formats, then the depth and breadth of insight will increase, both for your current studies and when you come to reuse the data, possibly years down the line. You want to be able to spend more time analysing that data and deriving utility from it, rather than wrangling the data to make it Fair after the event.’

The Pistoia Alliance Fair toolkit is being funded by some of the global pharma players, including Roche, AstraZeneca, AbbVie, BMS and Novartis.

‘Over the next few months we aim to bring content that we’ve been developing and gathering into the web page templates, with the aim of launching early next year,’ Harrow said.

‘It’s a collective initiative, and a great example of pre-competitive development. There has been input from a good, healthy mix of consumers of data, data content providers, and technology vendors.’

The benefits of standardising data formats extend to regulators, as well as to the companies and academic organisations that are carrying out R&D or manufacturing. From a regulator’s perspective, being able to evaluate drug submissions provided in standardised electronic formats not only saves time and money, it makes it possible to review new submissions in the context of previous applications, suggests Kevin Trimm, head of product management for pharmacometrics software at Certara.

‘Each application is still reviewed on its own merits, but an extra layer of perspective and insight is now possible, because reviewers can contrast and compare study data – both clinical and preclinical – that support an application, by calling up data on prior studies of a similar type, such as those that evaluated the same class of molecules, or were testing drugs for the same disease indication,’ said Trimm.

For any drug regulator, building databases of key submission data and metadata from each study in an application is prohibitively labour intensive. What makes much more sense is to request that all applications are accompanied by data and metadata in standardised electronic formats, which can then automatically populate such databases, and provide equivalence in future review. FDA reviewers can then call up or mine historical data for comparison during their assessment of new applications.
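The idea of standardised submissions automatically populating a reviewer-side database can be sketched as follows. All table and field names here are invented for illustration; the point is that once every submission arrives in the same structured form, loading and cross-querying become trivial.

```python
# Hypothetical sketch: standardised submission metadata populating a
# reviewer-side database that can be mined for comparable prior studies.
# Table and field names are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE studies (
    study_id TEXT, drug_class TEXT, indication TEXT, phase TEXT)""")

# Records as they might arrive in a standardised electronic format
submissions = [
    ("S-001", "statin", "hypercholesterolaemia", "III"),
    ("S-002", "statin", "hypercholesterolaemia", "II"),
    ("S-003", "SGLT2 inhibitor", "type 2 diabetes", "III"),
]
conn.executemany("INSERT INTO studies VALUES (?, ?, ?, ?)", submissions)

# A reviewer assessing a new statin application pulls up comparable studies
prior = conn.execute(
    "SELECT study_id FROM studies WHERE drug_class = ?", ("statin",)
).fetchall()
print([row[0] for row in prior])  # ['S-001', 'S-002']
```

Without the standardised input format, each of those rows would have to be extracted by hand from a bespoke submission, which is the ‘prohibitively labour intensive’ path described above.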

Building common data standards

It was this need to set up standardisation for electronic drug submissions that led to the creation of the Clinical Data Interchange Standards Consortium (CDISC), a standards development organisation that is building common data standards for supporting electronic regulatory submissions.

The organisation has developed multiple standards covering regulatory submission types, disease fields and therapeutic areas. Each standard is designed to facilitate the accessibility, interoperability and reusability of clinical research and research data – think Fair principles – whatever its source. Pioneered by FDA, CDISC is now a global, non-profit community initiative, funded by more than 450 member organisations, as well as through grants, events and educational initiatives.

The organisation itself states that CDISC standards have been adopted and implemented in over 90 countries. CDISC standard formats are now required for electronic submissions to FDA in the US, and to the Pharmaceuticals and Medical Devices Agency (PMDA) in Japan.

CDISC standards were also endorsed by China’s National Medical Products Administration (NMPA), in its Clinical Trial Data Management Technology Guide (July 2016). They are now requested by the European Innovative Medicines Initiative (IMI).

In its industry guidance on electronic submissions (Providing Regulatory Submissions in Electronic Format – Standardised Study Data – Guidance for Industry), released in 2014, the FDA outlined its requirements for statutory electronic submissions, including the use of defined CDISC standards for study data and controlled terminology.

‘CDISC has generated a suite of standards categories, underpinned by a set of core foundational standards,’ Trimm noted. ‘These support clinical and non-clinical research processes from end to end, and take data formats back to first principles, focusing on defining data standards, and including models, domains and specifications for data representation.’

‘In parallel with the foundational standards are CDISC data exchange standards that enable the sharing of structured data across different information platforms and computer systems.

Therapeutic area (TA) standards then offer extensions to the foundational standards, and further refine and lay out standards for defining how data relating to specified disease areas is structured and communicated,’ Trimm continued.

Most recently, Certara announced the availability of PK Submit, a technology solution for automating the creation of pharmacokinetic (PK) CDISC domains during Non-Compartmental Analysis (NCA).

The generation of domains associated with PK analysis represents a unique challenge for electronic drug submissions, Trimm noted. ‘There are challenges associated with documenting record exclusions and comments in these domains, as well as generating the Relrec and Pooldef domains that explain how they were generated. Allowing the pharmacokinetic concentrations (PC) and pharmacokinetic parameters (PP) domains to be generated contemporaneously by a PK scientist, at the time of performing an NCA from a single source, is the best way to solve these problems.

‘PK Submit is integrated with software associated with generating the data for these domains (Phoenix WinNonlin) and supports the automatic generation of a complete electronic PK regulatory submission package, including all necessary CDISC domains, during the normal process of setting up and executing an analysis by a PK scientist who is not a CDISC expert,’ Trimm added.

Making use of standards

Does FDA’s requirement for CDISC mean that drug submissions will become wholly digital with no need for human-readable data?

‘There always has been, and still is a human-readable component to a drug submission, but the requirement for submission in a standardised electronic format not only ensures equivalence in data type and form, but in control terminology, so everything is coded in the same way,’ Trimm stated. ‘These are the two elements that you need if you are going to be able to mine your data.’
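The controlled-terminology idea Trimm describes can be sketched as a lookup that normalises free-text values to a single coded form, so that submissions from different sponsors can be mined together. The terminology entries below are invented for illustration, not real CDISC codelists.

```python
# Illustrative only: normalising raw collected values to controlled terms
# so that everything is coded the same way. Entries are invented, not
# taken from a real CDISC codelist.

CONTROLLED_TERMS = {
    "m": "M", "male": "M",
    "f": "F", "female": "F",
}

def code_value(raw):
    """Map a raw collected value to its controlled term, or raise if unknown."""
    try:
        return CONTROLLED_TERMS[raw.strip().lower()]
    except KeyError:
        raise ValueError(f"no controlled term for {raw!r}")

print([code_value(v) for v in ["Male", " F ", "female"]])  # ['M', 'F', 'F']
```

The deliberate failure on unknown values matters: a value that cannot be coded should be flagged at collection time, not discovered during submission assembly.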

So how do companies start to build CDISC standards into their data collection, collation and formatting for submission? For companies making regulatory submissions in the US or Japan, where CDISC electronic submissions are now mandatory, one of the primary challenges is creating the right domains into which the different types of data will fit.

‘While this can be done manually, it’s hugely time consuming, and there are software packages that can streamline the process. This is one area in which Certara is engaged: creating software solutions that don’t require a lot of knowledge about the domains themselves. Because becoming an expert in electronic submissions is almost a profession in itself. Understanding all the data models, the control terminologies – which may be updated every few months – and the most up-to-date implementation guidelines, is complex,’ notes Trimm.

‘While ideally the requirement for electronic submissions in standardised formats should be set into the experimental mindset, in reality the data collected can be converted into the right format downstream. The tricky part is ensuring that you collect the necessary data that’s going to be required to create the domains,’ Trimm added.

While electronic laboratory notebooks (ELNs) for capturing experimental workflows and processes are now relatively standard in modern labs, organisations must still consider recording the necessary data in a way that will allow it to be converted electronically, Trimm suggested.

There’s no avoiding the fact that generating an electronic submission for FDA is expensive. ‘Companies may commonly employ outside consultants that will take data and start the process of conversion,’ notes Trimm, but this can be a lengthy undertaking if the data hasn’t been captured in an easily convertible format.

‘On top of all the expense, it can add a lot of time to the creation of a regulatory submission, effectively eating away at the length of time you may have left to market your drug with patent protection,’ Trimm continued. Get your data in the right state early on, and the process can be much shorter. ‘What commonly occurs is that companies are outsourcing the data formatting as a service, so every time they need to do a submission, they have to engage a vendor to prepare it for them. The cheaper alternative is to do it themselves, but without the right tools this is a highly manual and inefficient task, so they’ll approach us to ask if we can help them do it more quickly.’

Making sense of drug development data

Certara specialises in offering software and services that span the drug development lifecycle – from discovery to patient access. The company’s expertise, modelling and simulation, and regulatory-focused software are designed to speed the development timeline, and accelerate regulatory approval.

Certara has expertise in the pharmacokinetic field, and in the development of software tools and services for the generation and management of PK data.

Certara is a platinum member of CDISC, has a seat on the CDISC Advisory Council, and has a certified CDISC trainer on staff. It is also involved in the creation of the ADaM NCA standard. ‘We have also been contracted by several big pharma companies to create custom software solutions for Send and SDTM, as well as ADaM datasets,’ notes Trimm.

It was a natural step for Certara to become directly involved in the CDISC community, Trimm said. ‘As a software company involved in the analysis of data that forms a key part of regulatory drug submissions, Certara realised the importance of becoming involved with CDISC.

‘A large component of what we do as a commercial organisation is data manipulation and management, and having standards is actually beneficial for us and ultimately for our customers. We provide off-the-shelf, and tailored software, that enable companies to create the domains that will work with the CDISC standards. As a platinum CDISC member we have input into key decisions that are made during the development of new standards.’

‘We have some products, which you can buy commercially off the shelf, that will create those domains for you, but we are primarily engaged in building custom software solutions for companies, so that we can address their particular needs around their data sources and native formats.’

Part of the problem for companies that need to think about and implement electronic submissions, is that often the people who are producing the data – the scientists – are not the people who understand standard format data submissions and CDISC. ‘What Certara tries to do is to represent, through our software, a kind of data standards expert, so the user can interact with the software to manage the data, rather than having to send the data to another department to be managed.

‘Providing the CDISC knowledge and ability, within and through an application that is easy for the user to interact with, not only removes the need to send data back and forth to a data specialist, but also gives the scientist who is generating the data a basic understanding of why they need to structure their data in particular ways.’

There are issues when considering the standardisation of analytical data, according to Andrew Anderson, VP business development at ACD/Labs. ‘The different analytical techniques used to help assess the identity, quality, and purity of substances, compounded by the plethora of instrument vendors with proprietary data formats, makes standardisation of analytical data a unique challenge.’

Add to this instrument innovations that allow scientists to delve deeper and collect more data in shorter periods of time, Anderson noted, and that adds another dimension to a ‘moving goal-post’. He noted that there are two ‘major challenges’. One is that ‘scientists are relying on data transcription between systems to bring relevant data together for decision making, which introduces the risk of errors – something to be avoided, especially with the growing emphasis on data integrity and the ALCOA principles for data: that it should be attributable, legible, contemporaneous, original and accurate’. The other challenge is that scientists are ‘relying on abstracted data, effectively complex spectral and chromatographic data simplified to images, numbers, and text, without chemical context and meta-data.’

While many attempts have been made to standardise analytical data over the years, no single accepted standard has been universally adopted by instrumentation vendors or their customers, Anderson continued. ‘There have been notable analytical data standardisation efforts; since the 1980s these have included the Galactic *.SPC format; *.CDF from Unidata’s NetCDF; IUPAC’s *.JCAMP-DX; *.mzXML; and *.esp and *.spectrus from ACD/Labs. More recent emerging standards include *.AnIML and *.ADF from the Allotrope Foundation.’ ACD/Labs claims to offer ‘the only commercially available standard that allows for homogenisation of all major analytical techniques from the broadest number of instrument vendor formats,’ states Anderson.

ACD/Labs recognised that many different analytical techniques are necessary to characterise substances and make decisions around identity and composition, and that this information is often required by regulatory authorities before substances can be approved for use, Anderson said.

FDA requirements

FDA requirements now span CDISC controlled terminology, the set of CDISC-developed or CDISC-adopted standard expressions (values) used with data items in CDISC-defined datasets.

FDA-mandated CDISC standards also include SDTM (study data tabulation model), for submission of clinical studies as part of an NDA submission. SDTM provides a standard for organising and formatting data to streamline processes in collection, management, analysis and reporting.

Non-clinical data for IND submissions must be formatted as Send (standard for the exchange of nonclinical data), an implementation of the SDTM standard for nonclinical studies. Send specifies a way to collect and present nonclinical data in a consistent format.

In addition to SDTM, FDA requests that applicants provide data in a format that is ready for analysis; these datasets are called Analysis Data Model (ADaM) datasets, and they define dataset and metadata standards to support how clinical trial statistics are generated, analysed and reviewed. Define-XML, another FDA-required standard, is a data exchange standard for transmitting metadata, and which informs regulators which datasets, variables, controlled terms and metadata were used in studies. Japan’s PMDA requires SDTM, ADaM, Define-XML and analysis results metadata (ARM, for Define-XML).
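What an SDTM-style tabulation looks like in practice can be sketched with a toy pharmacokinetic concentrations (PC) domain, using standard variables such as STUDYID, USUBJID and PCTESTCD. The values are invented; a real submission follows the SDTM implementation guide and controlled terminology exactly.

```python
# Toy sketch of SDTM-style tabulation: rows of a PC (pharmacokinetic
# concentrations) domain. Values are invented for illustration.
import csv, io

pc_rows = [
    {"STUDYID": "STUDY01", "DOMAIN": "PC", "USUBJID": "STUDY01-001",
     "PCTESTCD": "DRUGX", "PCORRES": "12.4", "PCORRESU": "ng/mL"},
    {"STUDYID": "STUDY01", "DOMAIN": "PC", "USUBJID": "STUDY01-002",
     "PCTESTCD": "DRUGX", "PCORRES": "9.8", "PCORRESU": "ng/mL"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(pc_rows[0]))
writer.writeheader()
writer.writerows(pc_rows)
print(buf.getvalue().splitlines()[0])  # header line of the tabulated domain
```

Because every sponsor tabulates the same variables the same way, a reviewer’s tooling can ingest the PC domain from any application without bespoke parsing, which is the point of the mandate.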
