Siân Harris reports on progress towards standard data formats and how they could transform biological research
If something goes wrong with a drug, the consequences can be devastating, both to the patients and to the company that developed it. This is why regulators require pharmaceutical companies to keep such careful watch over all the data generated on any of the compounds that they develop, test and turn into drugs.
Hardly a big challenge compared with battling diseases, one might say, but it’s not as simple as it sounds. What if, in 50 years’ time, the regulatory body knocks on the door and wants to see the mass spectrum of a compound that you have just analysed? The mass spectrometer you used today is not likely to still be in use by then, the vendor of that spectrometer might not even be still in business, and the software and hardware supporting it are likely to have changed beyond recognition. Do you reach for the key for a vast cupboard full of old spectrometers, computers and printers and hope they still work?
And we don’t have to go 50 years into the future to find challenges to the use of data such as mass spectra. In many subject areas there is an increasing trend towards gaining new information from looking at patterns in data and building up a big picture. Obvious examples are mapping the genes of an entire animal and the next stage, proteomics, which is the large-scale study of the animal’s proteins – particularly their structures and functions. These types of large-scale research require international collaboration and data sharing in order to better understand what is going on.
The need for standard formats
Both future-proofing and data sharing depend on the mass spectra, and other data generated, being in a form that other people can access and understand. But this is not as simple as it might seem. Every instrument vendor has file formats that are unique to its equipment – one vendor’s ‘comment’ field might be another one’s ‘memo’. And, even where there are standard formats, subtle differences in the way that different vendors interpret the standards can prevent them being interoperable.
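The ‘comment’ versus ‘memo’ problem can be pictured as a translation table that every third-party tool ends up maintaining. The following is a minimal sketch only: the vendor names, field names and the normalise function are hypothetical, and real vendor formats differ in far more than their labels.

```python
# Hypothetical per-vendor field names mapped onto one shared vocabulary.
# None of these names come from a real vendor file format.
FIELD_ALIASES = {
    "vendor_a": {"comment": "description", "mz": "mz", "inten": "intensity"},
    "vendor_b": {"memo": "description", "mass": "mz", "abundance": "intensity"},
}


def normalise(record, vendor):
    """Rename a vendor-specific record's keys to the shared field names."""
    aliases = FIELD_ALIASES[vendor]
    # Keys without a known alias are passed through unchanged.
    return {aliases.get(key, key): value for key, value in record.items()}


record_a = normalise({"comment": "blank run", "mz": [100.1]}, "vendor_a")
record_b = normalise({"memo": "blank run", "mass": [100.1]}, "vendor_b")
```

Once both records pass through the table they carry the same field names, but every new vendor or format revision means another table to write and maintain.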
The idea of standard data formats for mass spectrometry is not a new one. Almost two decades ago an independent industry organisation, the Analytical Instrument Association (AIA), began developing a set of clearly-defined standards (the ANDI protocols) for this data and these became fairly widely adopted. Unfortunately, mass spectrometry experiments have moved on enormously since then. The large-scale genomics and proteomics experiments were barely on the horizon when these standards were developed, and their data do not fit into the structures that were defined back then.
‘The technology and applications really outgrew the scope of the standards,’ says Don Kuehl, vice president for marketing and product development at Cerno Bioscience. ‘The AIA standard basically isn’t adequate to describe the experiments that go on today.’
One of the main limitations of this standard, he explains, was that unless a particular experimental condition was defined in the original standard then there was no easy way to include it in the AIA standard data format. ‘You need to be able to save data in a way that many years later it can still be accessed so it must be accurate and complete. This means that it has to be represented with all the acquisition parameters, or metadata, such as experimental conditions and sample conditions,’ he explains.
This raises obvious challenges for any standard format: how do you design something that takes into account data types and experimental techniques that have not yet even been invented? One of the front runners in the search for a standard is XML, the extensible markup language widely used on the internet. ‘The nice thing about XML is that it is extensible so you can add to the schema without breaking the old stuff,’ says Kuehl. But XML has a downside: this flexibility means that it is not very efficient on storage. Many of today’s mass spectrometry applications generate huge amounts of data and the XML version can take up much more storage space than the raw data set, which must also be kept. Kuehl pointed out that, although computer storage is becoming very inexpensive, the IT department that must support that extra storage is not so cheap.
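Both halves of this trade-off can be illustrated in a few lines. The sketch below is an assumption-laden toy, not any real mass spectrometry schema: the spectrum, peak and ionMobility element names are invented. It shows a reader written for an old document still working when a new element appears, and compares the size of peaks written as XML text with the same numbers packed as 32-bit binary floats.

```python
import struct
import xml.etree.ElementTree as ET

# Invented element names; a real schema would define its own.
OLD_DOC = "<spectrum><peak mz='100.5' intensity='20.0'/></spectrum>"
NEW_DOC = (
    "<spectrum><ionMobility units='ms'>3.2</ionMobility>"
    "<peak mz='100.5' intensity='20.0'/></spectrum>"
)


def read_peaks(xml_text):
    """An 'old' reader: it looks only for peak elements and ignores the rest."""
    root = ET.fromstring(xml_text)
    return [(float(p.get("mz")), float(p.get("intensity"))) for p in root.iter("peak")]


def text_xml_size(peaks):
    """Size in bytes of the peaks written out as one XML element each."""
    root = ET.Element("spectrum")
    for mz, intensity in peaks:
        ET.SubElement(root, "peak", mz=repr(mz), intensity=repr(intensity))
    return len(ET.tostring(root))


def packed_size(peaks):
    """Size in bytes of the same numbers as little-endian 32-bit floats."""
    flat = [value for peak in peaks for value in peak]
    return len(struct.pack(f"<{len(flat)}f", *flat))


peaks = [(100.0 + 0.5 * i, 10.0 + i) for i in range(1000)]
xml_bytes = text_xml_size(peaks)
binary_bytes = packed_size(peaks)
```

The old reader returns the same peaks from both documents, which is the extensibility Kuehl describes, while the text form of the peak list comes out several times larger than the packed binary.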
A major bottleneck
Cerno Bioscience’s interest in the mass spectrometry standards search arises because of its position as a third-party software developer. The company develops calibration software that irons out inconsistencies in mass spectral data from different vendors and different processes so that the data can be analysed and searched more precisely and easily. ‘We need to access the data from different vendors and much of our development effort goes into getting data formats,’ explains Kuehl.
This situation is shared by many similar companies and university groups and he sees this as an impediment to progress, as well as a drain on resources. ‘At the moment, third-party software developers and universities have to develop tools that work with every vendor’s data format. And another software developer has to repeat all this for their tools. It is not very efficient and not very good for encouraging the use of data,’ he says. ‘Furthermore, users might be required to own a copy of the vendor’s software in order to use the data. It is a real bottleneck.’
And this bottleneck affects the whole laboratory. A laboratory information management system (LIMS), for example, needs the metadata about the experimental conditions and sample details in order to track the data. ‘Right now when a LIMS is installed it is likely that more than half of the cost of the LIMS is doing custom integration with a lab’s equipment,’ says Kuehl.
Despite his conviction that standard formats are important, Kuehl is not optimistic about widespread uptake of them in the near future. ‘It has to be driven by customers but I haven’t really seen any pressure to drive it,’ he says. ‘And vendors aren’t going to support a standard if there is no standard. But, although it’s not easy, it is do-able.’
Proteomics pushes standards
Angel Pizarro, director of the Bioinformatics Facility at the University of Pennsylvania, USA and chair of the Proteomics Informatics Standards Group, agrees that developing standard formats for mass spectral data is challenging. ‘Standards are in as much a state of flux as the instruments themselves,’ he says. ‘Each year the limit of detection is greatly increased because of advances in the instruments.’ And this is particularly true of proteomics. ‘Proteomics is such a moving target that there has to be agile software development. Academic research doesn’t want to be locked into proprietary systems, so labs end up having a hotch-potch of equipment,’ he adds.
Pizarro’s primary interest in standard data formats is for algorithm development. He explains how, in the early days of microarray research, the data was not included in journal papers so it was not reproducible for statisticians and computer scientists. ‘The first such article to include the data, a yeast cell cycle paper by Paul Spellman et al (1998), became one of the most highly-cited papers because, even though it was not in a standard format, the data was open so people could use it to start to develop software,’ he points out.
And such algorithm research will ultimately feed back into the wet lab research toolkit. ‘It will enable researchers to pump their results through data analysis and get more confidence in the validity of their own results,’ Pizarro explains.
Given the level of demand for such standard formats from the proteomics community, it is not surprising that the Proteomics Informatics Standards Group, which is part of the Human Proteome Organisation Proteomics Standards Initiative (HUPO PSI), is leading the way in opening up and future-proofing data formats. And, according to Pizarro, the instrument vendors are very supportive of these efforts.
Debate over formats
However, there is a fly in the ointment: many of the data users favour the mzXML format but vendors prefer mzData. The reasons for this tension are quite simple: mzXML encompasses more but the vendors have less control over its release schedule. ‘mzXML encodes the vocabularies directly in the data while mzData uses annotations to the external schema. This makes mzData much more robust but users still need to account for the annotations,’ explained Pizarro.
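One way to read the distinction Pizarro draws is shown in the sketch below. It is an interpretation rather than a description of either real format: the accession codes, term names and the resolve function are invented for illustration. The first record spells its terms out inline, while the second refers to entries in an external vocabulary that the reader must also have to hand.

```python
# Hypothetical external vocabulary; the accession codes are invented.
VOCABULARY = {
    "EX:0001": "instrument type",
    "EX:0002": "ionisation mode",
}

# Style 1: the terms are written directly into the data.
inline_record = {"instrument type": "ion trap", "ionisation mode": "positive"}

# Style 2: the data carries only references to the external vocabulary.
annotated_record = {
    "params": [
        {"accession": "EX:0001", "value": "ion trap"},
        {"accession": "EX:0002", "value": "positive"},
    ]
}


def resolve(record, vocabulary):
    """Turn accession references into readable terms using the vocabulary."""
    resolved = {}
    for param in record["params"]:
        # A missing accession raises KeyError: the reader cannot interpret
        # the value without the matching vocabulary entry.
        term = vocabulary[param["accession"]]
        resolved[term] = param["value"]
    return resolved


resolved_record = resolve(annotated_record, VOCABULARY)
```

In this reading, new terms can be added to the vocabulary without touching the file structure, at the price of every tool having to look the annotations up.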
The mzData format is being championed by HUPO PSI. According to the organisation, the aim of this data format for capturing peak list information is to unite the large number of current formats. However, it is not a substitute for the raw-file formats of the instrument vendors. The data storage concern has been addressed by storing the m/z and intensity values in a compact base64-encoded binary form. And the vendor support is more than simple lip service to this format. Many companies, including the likes of Agilent, Applied Biosystems, Bruker Daltonics, GeneBio, Insilicos, Matrix Science, Thermo Electron and Waters, have already released products that comply with mzData.
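The base64 approach can be sketched as follows. This is a minimal illustration, not the mzData specification: the choice of little-endian 32-bit floats and the function names are assumptions, and a real file would need to record the byte order and precision it uses.

```python
import base64
import struct


def encode_array(values):
    """Pack numbers as little-endian 32-bit floats, then base64-encode them."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text):
    """Reverse encode_array: base64 text back to a list of floats."""
    raw = base64.b64decode(text)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


# Values chosen to be exactly representable as 32-bit floats.
mz_values = [100.5, 200.25, 300.125]
encoded = encode_array(mz_values)
decoded = decode_array(encoded)
```

The encoded string contains only characters that are safe inside an XML document, so the binary numbers can travel within the file. Base64 does add roughly a third to the size of the raw bytes, which is the cost of keeping everything in text.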
Eventually Pizarro and others at HUPO PSI hope that the format debate will disappear. Earlier this year the organisation’s mass spectrometry standards group announced plans to merge the two data formats, with a roadmap that will see many of the features of the combined format emerging by the end of 2006.
The new format is expected to have aspects of both formats, including an interchange schema which has split-data vectors compatible with other analytical interchange formats and support for both random access indexes and digital signatures via a wrapper schema. Open-source tools to support developers and users of the format are also being developed.
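The idea of a wrapper that adds random access and integrity checking around the data can be sketched as below. The structure and function names are hypothetical and do not reflect the schema the group was planning. A plain SHA-256 checksum also stands in here for a digital signature, which in practice would involve a signing key.

```python
import hashlib


def build_wrapper(spectra):
    """Concatenate spectrum fragments and record where each one starts.

    spectra is a list of (spectrum_id, fragment_bytes) pairs.
    """
    payload = bytearray()
    index = {}
    for spectrum_id, fragment in spectra:
        index[spectrum_id] = (len(payload), len(fragment))  # (offset, length)
        payload += fragment
    payload = bytes(payload)
    return {
        "payload": payload,
        "index": index,
        # Checksum only: a stand-in for a real, key-based digital signature.
        "digest": hashlib.sha256(payload).hexdigest(),
    }


def read_spectrum(wrapper, spectrum_id):
    """Jump straight to one spectrum without scanning the whole payload."""
    offset, length = wrapper["index"][spectrum_id]
    return wrapper["payload"][offset:offset + length]


def verify(wrapper):
    """Return True if the payload still matches the recorded digest."""
    return hashlib.sha256(wrapper["payload"]).hexdigest() == wrapper["digest"]


wrapper = build_wrapper([("s1", b"<spectrum id='s1'/>"), ("s2", b"<spectrum id='s2'/>")])
second = read_spectrum(wrapper, "s2")
```

Keeping the index and digest outside the data itself means the inner interchange format stays simple, while tools that need fast lookup or tamper detection can use the wrapper.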
The road ahead
According to Pizarro, the first priority has been to get data out there in an open, readable format. Once that has been achieved, the next step is to ensure that this format works with the various algorithms that are designed to work with the data. After this, there is the annotation of the experimental context, which is either the first or last thing that gets computerised, depending on the priorities of the researchers.
And this standardisation doesn’t stop with proteomics. Pizarro’s main standards work is on a protocol subclass called Functional Genomics Experiment (FuGE). This is intended to form the basis for other data standards in functional genomics. According to Pizarro, many diverse technology-specific standards have already committed to extensions of FuGE for their standards efforts. These include the microarray (MGED) and toxicogenomics (RSBI) standards and several of the PSI working groups, including experimental context and sample description/separation techniques (spML) and gel experimental annotation (GelML).
‘The end goal is to have an interoperable set of standards for biology,’ says Pizarro. ‘Our own group is developing a format, commonly known as analysisXML, for reporting spectra analysis such as peptide search engine results. We just had a PSI working group meeting and have completed 80 per cent of the work needed to finalise analysisXML and send it through the PSI standardisation process. I fully expect the format to be finalised by the HUPO meeting at the end of October and the standardisation to have been started by then.’
For this grand vision to become a reality, Pizarro believes that it is the users, more than the vendors, who need convincing. ‘The vendors have been convinced ever since users started to request some sort of open format,’ he points out. ‘For a data format to become standard it needs to be adopted by the users. You can’t just define a standard and expect people to use it. There is an under-appreciation of how much marketing is involved.’ But users will come on board soon, he believes. ‘Journals and grant-awarding bodies are starting to require the publication of data so users will start to adopt the standards in order to be able to publish,’ he explains. ‘Once we have tool sets that are used for publishing we will start to see much wider adoption.’
HUPO Proteomics Standards Initiative
HUPO PSI-MS: Mass Spectrometry Standards Working Group
Spellman et al., Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. of the Cell (1998) 9, 3273.