Archive with care

Recording research data electronically sounds like a good idea; but, Peter Rees asks, can you produce that information in court 50 years later?

Pharmaceutical companies are showing growing interest in electronic laboratory notebooks (ELNs), as I discussed in the November/December 2004 issue of Scientific Computing World. But this revived enthusiasm for capturing information from the research lab electronically has an unexpected consequence: it highlights the need for pharmaceutical companies to look hard at how they will archive their electronic data. At the moment, they are still, by and large, storing electronic material on paper for historical and/or legal reasons. But in the US, which tends to set the standards for the sector globally, recent legal decisions about electronic evidence, and remarks by patent regulators, have given greater encouragement to those who want to drop paper altogether.

Pharmaceutical companies have seized on the ELN as an opportunity to improve efficiency in research and development. The danger is that companies will plunge in without properly planning an archiving strategy, according to Simon Coles, chief executive of the ELN company Amphora Research Systems. It may be 10 years or more before they find out about any mistake - and that could be in a costly court action, he warns.

To cut the risk of being unable to produce crucial evidence when needed, Coles champions a strategy based upon open file formats, which is hardly surprising given that this is the approach Amphora uses for its own ELN. But it is one increasingly being pushed by large organisations in other sectors, and by government agencies, that take the long view of record storage. For example, the European Union has been battling with Microsoft over open file formats, including the version of XML used in Office, as part of a programme to increase interoperability, as has the US state of Massachusetts.

Laboratory-generated scientific information needs to be kept for long periods, for a number of obvious reasons including internal scientific use, regulatory requirements, use in patent or intellectual property disputes, and general legal matters (e.g. contract disputes, product liability). And internal uses are changing as companies seek to take advantage of knowledge-management software and new opportunities for sharing information.

But archiving involves a conscious decision to move records to a separate repository, so the organisation has to think about what to archive and how long to hold the information. And that could be a very long time. This was brought home to Simon Coles when he acted as consultant on a project for Kodak to build an enterprise-wide ELN, and was asked where his company would be in 100 years' time.

Archiving for regulatory or legal purposes is necessarily different from archiving for internal use. A regulator or judge will need to establish whether correct procedures were used to make decisions and will want raw data to be available to look at and re-interpret. The timescale could be anything from 20 to 50 years, perhaps longer, as patent, smoking, and asbestos litigation in the US all show. Internal uses need not be as robust - after all, information locked in paper notebooks can hardly be re-used at all, let alone routinely.

Most important for pharmaceutical companies is evidence that could be used to support the patent application for a drug. Under the US 'first to invent' rules, a witnessed record of a discovery may need to be produced; the word of the inventor just isn't enough. In many cases, this record is a laboratory notebook. The court is looking for evidence of the original idea and of its application in some way. Any patent challenge will be an adversarial process, and another company will be looking for any weak points in procedures. Commercial lawyers have consistently advised pharmaceutical companies to stick to paper records, as long as legal precedents on electronic records are lacking. But there are problems with existing records, and many paper notebooks are not witnessed properly, which makes them useless in court, says Coles. And case law on electronic communications is starting to emerge, so companies need to plan their e-archiving strategy.

Selecting the right file formats is a crucial part of any strategy, in the shifting sands of the IT world. If a format is neglected and loses support, old files will need to be converted to a readable alternative, perhaps using specially written software that runs on the operating system of the day. If the wrong format is chosen, archives will periodically need to be migrated to new formats.

'Make sure you're not the only one with a problem,' is the best approach, says Coles. For that reason, he recommends Adobe's Portable Document Format (PDF) as one to standardise on. PDF has many advantages, not least its platform independence; as well as the free Adobe reader, there are a number of open-source readers. But because it is a proprietary file format, it still has some problems. It's a great paper substitute, as is another favourite format, TIFF.

But neither of these was designed for archiving. A new version of PDF - designated PDF/A - is being created by some interested parties (both public and private) to address some outstanding problems. PDF/A is a cut-down version of PDF intended for long-term preservation of documents, which should be adopted by the International Organization for Standardization sometime this year. It gets round font-licensing problems by insisting that all fonts used must be embedded and available for unlimited, universal use. PDF/A also incorporates some XML technology, to help with indexing and searching documents. Coles is less keen on this aspect, believing one shouldn't mix XML and PDF.

Of course, nothing stands still, and Microsoft is promoting a competing format to PDF, in partnership with Global Graphics, code named 'Metro'. The format is based on XML, but requires a licence, and is set to be included in the next 'Longhorn' version of Windows. But Adobe's PDF is an established format with widespread support, including projects such as the one set up by the Australian state of Victoria in 1998. The state public record office's VERS (Victorian Electronic Records Strategy) should see PDFs supported for many, many years.

The ubiquity of PDF on the internet, strengthened by the spread of broadband access, is a useful guide when choosing other file formats. For marked-up text Coles suggests HTML - again because it has been adopted so widely. The increasing use of the HTML browser as the front end for knowledge management and a host of other tasks also strengthens the case.

Microsoft Word can be used, says Coles. Here the problem is Microsoft's constant updating of its file format to introduce new 'features,' which require users to keep buying the latest version of the company's software. Amphora's solution is to pipe Word files through 'Open Office' to 'filter' them before archiving. Open Office is an open source project to develop and support a cross-platform suite of programs that are compatible with Microsoft Office files. It is supported by Sun Microsystems and is based on their Unix software StarOffice. As an added bonus, it lets users export documents as PDF files. So far so good, says Coles. But newer versions of Office have XML capabilities and Microsoft has been busy surrounding its XML format with patent protection. This could make it more difficult for future versions of open-source software to work successfully with files formatted in Office XML. Microsoft and other software vendors have become more aggressive in licensing their intellectual property to prevent competition from open-source projects, and to increase earnings. Coles cites SCO's case against IBM for alleged copyright infringement as an example of IP cases that can threaten a whole community of users - in this case, anyone running the Linux operating system.

As to images, Coles favours PNG (Portable Network Graphics) files or 'pings' over TIFFs, JPEGs, or GIFs. An alternative formulation of the acronym, 'PNG's Not GIF', gives a hint about its history. PNG was designed to supersede the once-popular GIF format, the fate of which is a warning to all organisations to think hard about their archive formats. In late 1994, Unisys and CompuServe suddenly decided to demand royalties from programs that used GIF, because of Unisys' patent on the LZW compression method used in the format. The fallout from this led an informal group of internet users to begin work on a replacement. PNG was an opportunity to update an ageing format by adding some handy features and incorporating better (and non-patented) compression. It became an ISO standard last year.
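For readers who want to see what checking an archive against such a format policy might involve, here is a minimal sketch in Python. It identifies files by their leading signature bytes rather than by file extension, and sorts them into 'preferred' and 'review' lists. The function names, and the choice of PNG and PDF as the 'preferred' set, are illustrative assumptions and not a description of Amphora's software or of Coles's exact policy. HTML has no fixed signature, so it is not detected here.

```python
from pathlib import Path

# Leading "magic" bytes for some formats named in the article.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"%PDF-", "PDF"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"II*\x00", "TIFF"),      # little-endian TIFF
    (b"MM\x00*", "TIFF"),      # big-endian TIFF
    (b"\xff\xd8\xff", "JPEG"),
]

# Assumption for illustration only: treat PNG and PDF as the archive formats.
PREFERRED = {"PNG", "PDF"}


def sniff_format(header: bytes) -> str:
    """Return a format name based on the first bytes of a file."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def audit(folder: str) -> dict[str, list[str]]:
    """Sort every file under folder into 'preferred' or 'review'."""
    report: dict[str, list[str]] = {"preferred": [], "review": []}
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue
        with path.open("rb") as fh:
            fmt = sniff_format(fh.read(8))  # longest signature is 8 bytes
        bucket = "preferred" if fmt in PREFERRED else "review"
        report[bucket].append(f"{path.name} ({fmt})")
    return report
```

Calling `audit("notebook_archive")` on a hypothetical folder would return the two lists, so that GIFs, JPEGs, and unrecognised instrument files can be reviewed before they are committed to long-term storage.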

Coles is less worried about file storage and hardware issues. He doesn't advise storing important archival material on writable CDs (CDRs) while uncertainty about their longevity remains. There may be less of a longevity problem with writable DVDs (though the variety of DVD formats is still a concern). The storage demands for laboratory notebook data are not so great that they can't be handled by keeping them on adequately backed-up hard disks.

The history of GIF and PNG, and of the other file formats favoured by Coles, illustrates a recurrent theme of his approach: how the internet and the open-source movement have repeatedly come to the rescue of those struggling with file-format and conversion issues.

But there remains one critical area where this cannot be the case - proprietary file formats used for analytical instruments and other scientific software. Pharmaceutical companies must persuade instrument vendors to open them up, or refuse to accept their products, says Coles. Although he clearly has an axe to grind, his logic is persuasive. Closed formats put pharmaceutical companies in a weak position when negotiating software and other upgrades, and they threaten archiving projects. If a company were to go out of business, or be taken over and a product dropped, electronic records could be rendered unreadable in a future patent case. Keeping formats in escrow, in case of such an eventuality, would not be sufficient protection, he stresses. File formats are often poorly documented, a user might not have full rights to use them in the way they need to, and in any case would be stuck with writing software to read the files.

The dangers of adopting a poor e-archiving strategy are greater for smaller firms than the bigger pharmaceutical companies, says Coles. The major players have more negotiating power with suppliers and they can command the IT resources to spot and overcome problems. Coles says a substantial failure of archived electronic files in a mid-sized or small company is almost inevitable, but this may not emerge for 10 or 20 years.
