Knowledge: Document management
Scientists have been involved in the development of artificial intelligence (AI) for decades. The modern version of AI, which sought to create an artificial human brain, was launched in 1956 but had been fermenting for many years before that, following the discovery that the brain was an electrical network of neurons that fired in pulses. Over the years, there were apparent breakthroughs that were followed by troughs of despair caused by hardware and software limitations. Among the highlights of this era was a Siri-like program called ELIZA which, in 1966, could be asked natural-language questions and would return appropriate, albeit canned, text answers.
The birth of the discipline of knowledge management (KM) was in the early 1990s. A tidal wave of KM consultants appeared, heralding the birth of this newer AI version and the emergence of supporting hardware and software. A few years later, several scholarly journals appeared as forums for advancing the understanding of the organisational, technical, human, and cognitive issues associated with the creation, capture, transfer, and use of knowledge in organisations.
Today the value of tapping into an easily accessible collection of information, such as a smart laboratory’s assets, is appreciated more than in the past. The amount of data and information generated by instruments and scientists increases exponentially, and staff turnover is rising to the point where someone with seven years of service with the same employer is now considered an old-timer. Undocumented know-how and locations of information resources are now issues. Reinventing the wheel is becoming more commonplace – not a desirable occurrence, because the costs of drug development are ever-increasing and fewer blockbuster products are hitting the market.
Many software solutions offered to assist information management are specialised and fragmented. Often, disparate divisions of an organisation select their own local software solutions, and IT departments often dictate requirements that restrict the scope of possible vendor solutions. Excluding small, single-location labs, it is rare to see a smart laboratory where all associated information resides under one roof.
There are general solutions to support the processing of large data sets in a distributed computing environment. One of the best known is Hadoop, sponsored by the Apache Software Foundation. Hadoop makes it possible to run applications on systems with thousands of nodes, involving petabytes of data. Its distributed file system facilitates rapid data transfer among nodes and allows the system to continue operating, uninterrupted, in case of a node failure. The Hadoop framework is used by major players including Google, Yahoo and IBM, largely for applications involving search engines and advertising.
Organisation is everything
When organising information one needs to decide what is important and what is not. Traditionally, in the paper notebook era, experiments, results, and comments were systematically entered to show diligence in pursuing a potential patent on an invention. Nothing could be removed; an entry could only be annotated or re-explained later. Supporting data from instruments was retained with the notebook entries, and this practice led to the warehousing of innumerable papers as well as electronic records that might or might not be needed to support patent claims or meet regulatory requirements.
The volume of instrumental data today is much larger. Is it prudent to keep everything, or perhaps classify the data into two piles – one that directly supports a conclusion and another that is perhaps more generic? All electronic data suffers from aging, not unlike human aging. We’ll talk about media and file format aging a little later, but we should also consider relevance aging. Should a particular spectral analysis file be kept or should the sample be re-run five years from now using updated equipment?
Information needs to be categorised into a small number of groups, preferably in a central location to facilitate retrieval.
Start with two piles and gradually split them appropriately. It is sensible to imagine how a researcher in the future would look for things, having no knowledge of past notations and conventions.
People like to use familiar visual signs to navigate. It’s natural and usually results in finding what is needed plus additional, associated materials. Search engines may give more precise results but may omit important things that are part of the navigation journey. Scientists appreciate the role of serendipity in drug discovery.
Not everything can be kept for ever – but how long is sensible? There is some consensus that information supporting a patent should be retained for the life of the patent, plus several years before and after to cover eventualities. Most pharmaceutical companies have settled on a 40- to 65-year retention for intellectual property. Records to support regulatory compliance sometimes need to be retained for as long as 25 years. At the end of their retention period, records should be evaluated for their disposition. Should they be destroyed, or perhaps kept for a few more years? Scheduled examinations of records have a bonus of providing information that could be applied to current issues. Looking through the supposed ‘rubbish’ can be a very good thing.
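The retention arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration, not a recommended policy: the category names, the `Record` class, and the helper functions are hypothetical, and the periods simply reuse figures quoted in the text (the lower bound of 40 years for intellectual property, 25 years for regulatory records).

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical retention periods in years. Real values must come from
# legal and regulatory counsel, not from this sketch.
RETENTION_YEARS = {
    "intellectual_property": 40,  # lower bound of the range quoted in the text
    "regulatory": 25,
}


@dataclass(frozen=True)
class Record:
    record_id: str
    category: str
    declared_on: date


def review_date(record: Record) -> date:
    """Return the date on which the record's disposition should be evaluated."""
    target_year = record.declared_on.year + RETENTION_YEARS[record.category]
    try:
        return record.declared_on.replace(year=target_year)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; use 28 February.
        return record.declared_on.replace(year=target_year, day=28)


def due_for_review(records, today: date):
    """List the records whose retention period has ended on or before today."""
    return [r for r in records if review_date(r) <= today]
```

A scheduled job could call `due_for_review` once a year and hand the resulting list to the records manager, who then decides whether each record is destroyed or kept for a further period.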
There are at least two good reasons for retention schedules. First, there is the smoking gun. In the event of legal or regulatory investigations and/or audits, there’s bound to be information that is erroneous, that conflicts with established facts, or serves no particular purpose. Observations and comments that are taken out of context can also be misleading. This is not a licence to cook the books; the aim is to throw out the junk and items that make no real contribution to the organisation. It’s better to identify what needs to be retained before any of these issues occur. The non-records should be destroyed as quickly as possible and the declared records evaluated after a prescribed time (the retention period). Second, keeping non-records and records past their retention dates costs money. The cost of hardware associated with information storage continues to decrease but the amount of labour needed to support large collections has increased sharply.
How can records that are created within an organisation be authenticated? They don’t all need to be notarised, but it would be nice if there were an easy way to come close to this. So here are the concepts to use. Appoint a designated records manager who will act as custodian and have full control of the records. People have been doing this for years with paper records – it works. The custodian authenticates the author and maintains a chain of custody if the record is moved. Copies can be made and distributed, but the ‘original’ is always in the vault. It’s pretty much the same with electronic records: the documents are stored on a server where users can view them or make copies. The official, ‘original’ record stays in its slot. Chain of custody is maintained when the record is migrated to another location or is converted into other formats.
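One way to picture an electronic chain of custody is an append-only log in which every entry carries a hash of the entry before it, so that a later alteration becomes detectable. The sketch below is an assumption for illustration only: the `CustodyLog` class, its field names, and the choice of hash chaining are not taken from any particular records-management product.

```python
import hashlib
import json
from dataclasses import dataclass, field


def _digest(body: dict) -> str:
    """Hash a log entry deterministically (keys sorted before serialising)."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


@dataclass
class CustodyLog:
    """Hypothetical append-only custody log for a single record."""
    record_id: str
    events: list = field(default_factory=list)

    def add_event(self, actor: str, action: str, location: str, timestamp: str) -> None:
        # Link each entry to its predecessor through the predecessor's hash.
        prev_hash = self.events[-1]["entry_hash"] if self.events else ""
        body = {
            "record_id": self.record_id,
            "actor": actor,
            "action": action,
            "location": location,
            "timestamp": timestamp,
            "prev_hash": prev_hash,
        }
        body["entry_hash"] = _digest(body)  # digest is taken before the key is added
        self.events.append(body)

    def is_intact(self) -> bool:
        """Recompute every hash and check that the links are unbroken."""
        prev_hash = ""
        for event in self.events:
            body = {k: v for k, v in event.items() if k != "entry_hash"}
            if event["prev_hash"] != prev_hash or _digest(body) != event["entry_hash"]:
                return False
            prev_hash = event["entry_hash"]
        return True
```

In this sketch the custodian would call `add_event` whenever the record is deposited, migrated, or converted, and an auditor could call `is_intact` to confirm that no earlier entry has been edited.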
Long-term archiving: paper and microfilm records
There is a general perception that records will be easy to find, retrieve, and view in the distant future. Paper and microfilm records that are stored in a clean, temperature- and humidity-controlled environment could be readable for more than 100 years. However, finding and retrieving them requires some strategic planning. At the very least, they should be organised by year. Additional sub-categories or folders can be added to facilitate retrieval. The ideal solution involves the assignment of a unique identifier to each record, with the identifier containing or linking to relevant metadata to aid in searching. For large collections, this information should be stored in a database. A plan must be developed to migrate this information from its existing hardware and software, once they become obsolete, to newer systems.
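The idea of a unique identifier that links to searchable metadata can be sketched with Python's built-in `sqlite3` and `uuid` modules. The table layout, column names, and function names below are hypothetical; a real index would carry whatever metadata the organisation's file plan calls for.

```python
import sqlite3
import uuid


def create_index(conn: sqlite3.Connection) -> None:
    """Create a hypothetical index table for physical records."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS records (
               record_id TEXT PRIMARY KEY,
               year      INTEGER NOT NULL,
               title     TEXT NOT NULL,
               author    TEXT NOT NULL,
               location  TEXT NOT NULL
           )"""
    )


def register(conn: sqlite3.Connection, year: int, title: str, author: str, location: str) -> str:
    """Assign a unique identifier to a record and store its metadata."""
    record_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO records VALUES (?, ?, ?, ?, ?)",
        (record_id, year, title, author, location),
    )
    return record_id


def find_by_year(conn: sqlite3.Connection, year: int):
    """Return (record_id, title, location) for every record filed under a year."""
    cursor = conn.execute(
        "SELECT record_id, title, location FROM records WHERE year = ? ORDER BY title",
        (year,),
    )
    return cursor.fetchall()
```

The identifier returned by `register` would be written on the folder or microfilm label, so the physical item and its database entry can always be matched. Keeping the index in a plain, widely supported format such as SQLite is one design choice that may ease the later migration to newer systems.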
Long-term archiving: electronic records
We are all aware of the extremely short half-life of computer hardware and software. The software authoring tools in use today will blink out of existence and be replaced by tools that have more capabilities or are compatible with current operating systems. One can only speculate regarding the hardware and data storage media we will be using in the future. There will probably be no practical Rosetta Stone to help translate codes used in legacy software. Maintaining authenticity and minimising data corruption both need to be addressed.
There have been attempts to maintain a museum of hardware and software that could help in viewing legacy records. These mostly failed, most notably an effort by the National Aeronautics and Space Administration (NASA). NASA lost many of its electronic records from the early 1960s and then took steps to ensure that it would not happen again.
This resulted in the 2001 launch of the Open Archival Information System (OAIS) reference model, sponsored by a global consortium of space exploration agencies concerned with data preservation.
Other global consortia have come together to develop preservation strategies. The International Research on Permanent Authentic Records in Electronic Systems (InterPARES) aims at ‘developing the knowledge essential to the long-term preservation of authentic records created and/or maintained in digital form and providing the basis for standards, policies, strategies and plans of action capable of ensuring the longevity of such material and the ability of its users to trust its authenticity.’
Finally, Australia’s Victorian Electronic Records Strategy (VERS) provides a framework within which to capture and archive electronic records in a long-term format that is not dependent on particular hardware or software.
The concepts that these global data initiatives use for long-term preservation are the same. First, capture the content and metadata, then protect them with an immutable file format that preserves the text, images, charts and tables and renders them readable in the way the authors intended. The emerging standard for this purpose is PDF/A, an ISO-standardised version of the portable document format (PDF). Finally, the immutable file is further protected from tampering by cryptographic means such as a digital signature.
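The capture-then-protect sequence can be illustrated with a cryptographic checksum. The sketch below is a simplification and an assumption, not what OAIS, InterPARES, or VERS prescribe: it records SHA-256 digests of the content and metadata at capture time and checks them later. A checksum of this kind only reveals changes; a full digital signature would additionally require a key pair and is not shown. The function names `seal` and `verify` are hypothetical.

```python
import hashlib
import json


def _metadata_digest(metadata: dict) -> str:
    """Hash metadata deterministically by sorting its keys before serialising."""
    return hashlib.sha256(json.dumps(metadata, sort_keys=True).encode("utf-8")).hexdigest()


def seal(content: bytes, metadata: dict) -> dict:
    """Capture digests of content and metadata at archiving time."""
    return {
        "content_sha256": hashlib.sha256(content).hexdigest(),
        "metadata_sha256": _metadata_digest(metadata),
    }


def verify(content: bytes, metadata: dict, seal_record: dict) -> bool:
    """Return True only if neither content nor metadata changed since sealing."""
    return (
        hashlib.sha256(content).hexdigest() == seal_record["content_sha256"]
        and _metadata_digest(metadata) == seal_record["metadata_sha256"]
    )
```

For the check to mean anything, the seal record should be stored separately from the file it protects, for example with the records manager, so that a single intrusion cannot alter both.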
Electronic storage media are a moving target. It is quite unlikely that the media employed today will still be in use 20 years from now. Even so, the storage of electronic information on magnetic tape, pioneered by IBM in the 1950s, is not only the storage method of choice today, but its usage is increasing.
Tape is far cheaper and more reliable than any other medium used for archiving data. This does not mean that records on a 20-year-old tape can be retrieved readily: that is possible only if a compatible drive, capable of reading the tape, has been saved in a museum.
To keep electronic records for more than 10 years, a migration strategy needs to be developed and implemented now, before the museum closes.
The best approach to organising information is to decide what is important to keep and what is not. How would a researcher in the future look for things, having no knowledge of past notations and conventions? There are at least two good reasons for applying retention schedules. In the event of legal or regulatory investigations and/or audits, there’s bound to be information that is erroneous, conflicts with established facts, or serves no particular purpose.
If there is a risk that observations and comments can be taken out of context, items that have no real contribution to the organisation’s business should be thrown out. Records that are past their retention dates should also be discarded to avoid maintenance costs.
The cost of hardware associated with information storage continues to decrease, but the amount of labour needed to support large collections has increased sharply. A records manager should be designated and given full control of the records.
The basic guidelines are as follows:
- Understand the legal implications of electronic records;
- Establish a file plan;
- Establish an electronic records preservation file plan;
- Establish an electronic records manager or management team;
- Establish and communicate policies;
- Avoid point solutions; and
- Don’t keep electronic records forever.