Bioinformatics in the disaster zone
It is now thought that 2,749 lives were lost in the terrorist attacks on the World Trade Center in New York on 11 September 2001. Frightening though this total is, it was originally thought to be even worse. The first estimate was that as many as 10,000 people had died. The New York City Office of the Chief Medical Examiner (OCME) faced an unprecedented challenge: to identify each victim, and, if at all possible, to return the remains to their families for dignified burial.
To solve this problem, the OCME turned to bioinformatics: specifically, to a small software company based in Ann Arbor, Michigan, called Gene Codes, and to its founder, Howard Cash. By 2001, Gene Codes had become well known for its innovative program for gene sequence assembly, Sequencher. Less than five per cent of its business was with forensic DNA analysis. When, in October that year, Cash was asked by the OCME if Gene Codes could produce a human identification program that could cope with the scale of the World Trade Center disaster, his immediate thought was that he was being asked the impossible. Working to such an accelerated timescale would have been technically challenging even without the knowledge that any mistake - potentially leading to a body being returned to the wrong family - would be deeply disturbing.
However, Cash accepted the challenge. He was able to deliver the first version of the Mass Fatality Identification System, or M-FISys (pronounced 'emphasis'), within two months, and a revised version was produced almost every week for the next two years. Throughout that time, the design team worked around the clock in gruelling 12-hour shifts. 'It was like dancing with a gorilla: you don't stop when you get tired, you stop when the gorilla gets tired,' he says.
Unusually for a bioinformatics entrepreneur, Cash trained as a classical musician, and spent a year working as an assistant conductor with the Pennsylvania Opera Theater. He then moved to Stanford to study psychoacoustics, and took his first job in programming to clear his debts before seeking work as an audio engineer. Fortuitously, that job was at the bioinformatics company IntelliGenetics, and his first task was helping with the sequencing of the HIV genome. He never went back to audio. In 1988, he returned to his childhood home in Michigan to start the company that became Gene Codes, and launched the Sequencher software for genome assembly. 'That was the very beginning of the Human Genome Project', he remembers. 'Working on sequencing in the late 80s was rather like being a sailor in the time of Magellan'. By 2001, Sequencher had almost 75 per cent of the market in sequence assembly, and the company had been profitable for almost a decade.
The nature, as well as the scale, of the World Trade Center disaster presented unique challenges to Gene Codes' software engineers. Few bodies remained intact, and fewer still could be identified using classical methods such as fingerprints and personal belongings. In total, more than 20,000 human samples were retrieved from the disaster site, with some individuals being recovered in as many as two hundred pieces. Remains were further degraded by the intense jet-fuel fires that burned for months. DNA identification was the only method that could be used to identify most of the bodies. Yet many, if not most, of the DNA samples retrieved were incomplete.
The software used in the first weeks after the disaster, CoDIS (Combined DNA Index System), had been written to identify criminal suspects from DNA registers, which is a so-called 'one to many' problem. The World Trade Center identification problem, in contrast, is self-evidently 'many to many'. 'Despite the availability of a hammer [CoDIS], the problem could not be represented as a nail', comments Cash. One hundred and five identifications had been made using conventional techniques up to 12 December 2001. On 13 December, the day the first copy of M-FISys was installed, it identified an additional 55 victims.
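The 'many to many' idea can be pictured with a small Python sketch. It is an illustration only, not the M-FISys algorithm: the function names and sample identifiers are hypothetical, and the sketch assumes complete, identical STR profiles, which most of the real samples did not have. Instead of checking one query against a register, every victim sample is clustered with every other, and the clusters are then compared with all the reference samples.

```python
from collections import defaultdict


def profile_key(profile):
    """Turn a profile {locus: (allele, allele)} into a hashable, order-independent key."""
    return tuple(sorted((locus, tuple(sorted(alleles)))
                        for locus, alleles in profile.items()))


def group_samples(samples):
    """Cluster sample IDs whose profiles are identical (a simplifying assumption)."""
    groups = defaultdict(list)
    for sample_id, profile in samples.items():
        groups[profile_key(profile)].append(sample_id)
    return groups


def match_groups_to_references(victim_samples, reference_samples):
    """Many-to-many: link each cluster of reference samples to the cluster of
    victim samples that carries the same profile."""
    victim_groups = group_samples(victim_samples)
    reference_groups = group_samples(reference_samples)
    return {
        tuple(ref_ids): victim_groups[key]
        for key, ref_ids in reference_groups.items()
        if key in victim_groups
    }


# Invented data: two fragments from one person, one from another.
victims = {
    "V1": {"locus_A": (15, 16), "locus_B": (7, 7)},
    "V2": {"locus_A": (16, 15), "locus_B": (7, 7)},
    "V3": {"locus_A": (14, 14), "locus_B": (8, 9)},
}
references = {"toothbrush_1": {"locus_A": (15, 16), "locus_B": (7, 7)}}
links = match_groups_to_references(victims, references)
```

Here the two fragments V1 and V2 are linked to each other and to the single reference item in one pass, which a one-query-at-a-time search would not do.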
The M-FISys system has been designed to hold three types of data: the victim samples retrieved from Ground Zero; DNA samples from people presumed missing (including toothbrushes and lipsticks as well as pathology samples); and cheek swabs from relatives. One of the programmers' first tasks was to design the database and enter the data, which was obtained in a wide variety of formats, ranging from Oracle databases and Excel spreadsheets to hand-written records.
Three techniques for identifying individuals from DNA sequences - short tandem repeats, mitochondrial DNA, and single nucleotide polymorphisms - have been incorporated into the package. For clarity, rather than displaying all comparisons for one sample at once, the display can be altered to show samples linked through any one of these techniques in turn, with links indicating where other data either supports or contradicts the presumed identification.
Analysis of short tandem repeats (STR) is the most widely used technique for DNA-based human identification. The human genome contains many short sequences of repetitive DNA, where a sequence of a few bases is repeated up to a few dozen times (e.g. '…AATGAATGAATG…'). The number of these repeats in any such sequence differs between individuals. STR regions occur mostly in non-coding DNA, and so are medically uninteresting, and they are scattered all over the genome, so are not necessarily inherited together (they are 'unlinked'). As STRs are unlinked, the probability that any individual's DNA contains given numbers of repeats at different loci may be obtained simply by multiplying the probability for each repeat number. In current applications, each DNA sample is characterised by the repeat numbers at each of 13 STR loci. As an individual inherits a complete genome from each parent, each DNA sample will contain either one repeat length or two at each position. It has been estimated that the odds of obtaining an incorrect match by chance between two complete 13-locus profiles are less than about one in 10^15.
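The multiplication described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the calculation used in any forensic package: the locus names and allele frequencies are invented, and genotype frequencies are derived from allele frequencies using the standard Hardy-Weinberg proportions (p squared for one repeat length, 2pq for two), an assumption the article itself does not spell out.

```python
from math import prod

# Hypothetical allele frequencies: {locus: {repeat_count: frequency}}.
# These numbers are invented for illustration only.
ALLELE_FREQS = {
    "locus_A": {15: 0.2, 16: 0.3},
    "locus_B": {7: 0.1, 8: 0.4},
}


def genotype_frequency(freqs, genotype):
    """Frequency of one genotype at one locus (Hardy-Weinberg assumption)."""
    a, b = genotype
    if a == b:
        return freqs[a] ** 2          # one repeat length: p^2
    return 2 * freqs[a] * freqs[b]    # two repeat lengths: 2pq


def profile_frequency(allele_freqs, profile):
    """Multiply the per-locus genotype frequencies, valid because loci are unlinked."""
    return prod(genotype_frequency(allele_freqs[locus], genotype)
                for locus, genotype in profile.items())


profile = {"locus_A": (15, 16), "locus_B": (7, 7)}
frequency = profile_frequency(ALLELE_FREQS, profile)
```

With these invented figures the two loci contribute 0.12 and 0.01, so the combined profile frequency is 0.0012. Each further locus multiplies the figure down again, which is how thirteen loci reach such small numbers.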
Therefore, in ideal circumstances, it is almost impossible to imagine that STR analysis alone could yield any incorrect identification. In the World Trade Center disaster, however, fewer than the maximum 13 STR loci could be sequenced from most of the victims' DNA samples because of damage from blast and fire. And, obviously, it is only possible to compare the sequence lengths of STRs that are present in both samples. The probability of finding a match by chance rises very rapidly as the number of loci compared drops. As the company had to avoid false positives, it was necessary to complement this analysis with the other two techniques.
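A short Python sketch makes the partial-profile problem concrete. It is a simplified illustration, not the M-FISys matching logic: the function names are hypothetical, loci are compared for exact agreement only (real degraded samples can also lose single alleles), and the per-locus genotype frequency of 0.07 used in the note below is an assumed round figure.

```python
def compare_on_shared_loci(profile_a, profile_b):
    """Compare two profiles only at loci present in both.

    Returns (number of loci compared, list of loci that disagree).
    """
    shared = sorted(set(profile_a) & set(profile_b))
    mismatches = [locus for locus in shared
                  if sorted(profile_a[locus]) != sorted(profile_b[locus])]
    return len(shared), mismatches


def is_candidate_match(profile_a, profile_b, min_loci):
    """Accept only if enough loci were compared and none disagree.

    Requiring min_loci guards against 'matches' based on too little evidence.
    """
    compared, mismatches = compare_on_shared_loci(profile_a, profile_b)
    return compared >= min_loci and not mismatches


def chance_match_probability(genotype_freq, n_loci):
    """Chance of a coincidental match when every compared locus has the same
    (assumed) genotype frequency and the loci are unlinked."""
    return genotype_freq ** n_loci


# Invented example: a degraded victim sample and a reference sample.
victim = {"locus_A": (15, 16), "locus_B": (7, 7), "locus_C": (9, 10)}
reference = {"locus_A": (16, 15), "locus_C": (9, 10), "locus_D": (12, 12)}
compared, mismatches = compare_on_shared_loci(victim, reference)
```

In this example only two loci can be compared, and both agree. With the assumed figure of 0.07 per locus, thirteen loci give a chance-match probability just under 10^-15, whereas five loci give about 1.7 × 10^-6, roughly a billion times weaker.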
With so much riding on correct identifications, and the OCME needing the software delivered 'yesterday', Cash and his senior colleagues rapidly decided to use a series of techniques known as Extreme Programming (XP). The XP philosophy, developed by Kent Beck, allows programmers to write accurate code without using specifications.
Software engineers work in pairs on one computer, with one constantly reviewing the work of the other, and tests are written before the code they are intended to test. Before any new code is incorporated into the system, it must pass not only its own tests, but also all tests written since the project began. An obvious advantage of this is that it is difficult to unknowingly break the code in one place while fixing a bug in another.
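The test-first habit can be shown with a toy example in Python's standard unittest module. Nothing here comes from the Gene Codes code base: the function longest_tandem_run and its tests are hypothetical, chosen only because counting repeats echoes the STR analysis described earlier. The tests appear first in the file, as they would have been written first, and the helper all_tests_pass stands in for the rule that every test must pass before new code is accepted.

```python
import unittest


class TandemRunTests(unittest.TestCase):
    """Written first: these tests define the behaviour before the code exists."""

    def test_counts_consecutive_copies(self):
        self.assertEqual(longest_tandem_run("AATGAATGAATG", "AATG"), 3)

    def test_ignores_interrupted_copies(self):
        self.assertEqual(longest_tandem_run("AATGCCAATGAATG", "AATG"), 2)

    def test_motif_absent(self):
        self.assertEqual(longest_tandem_run("CCCC", "AATG"), 0)

    def test_empty_motif_rejected(self):
        with self.assertRaises(ValueError):
            longest_tandem_run("AATG", "")


def longest_tandem_run(sequence, motif):
    """Return the largest number of back-to-back copies of motif in sequence."""
    if not motif:
        raise ValueError("motif must not be empty")
    best = 0
    for start in range(len(sequence)):
        run = 0
        position = start
        # Count copies of the motif that follow one another without a gap.
        while sequence.startswith(motif, position):
            run += 1
            position += len(motif)
        best = max(best, run)
    return best


def all_tests_pass():
    """Run the whole suite; new code is accepted only if this returns True."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TandemRunTests)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

Because the full suite is rerun on every change, a fix that quietly breaks an older behaviour shows up at once as a failing test.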
Cash remembers the first time the team got the program running: he looked at 40 samples from the same man and realised that nobody else yet knew of that identification. About 18 months later, he was flying from Detroit to New York to deliver a new version of the code when he read a story in the Wall Street Journal about a British woman who lost her American husband in the disaster. 'In the article, the woman explained that she had held a memorial service for her husband on the Fourth of July 2002, but that almost a year later she still had no body to bury', he says. 'I looked the victim up on the M-FISys database, and found records from a toothbrush and from family members, but these didn't match. The toothbrush could not have come from the victim, but it may have been his wife's. It was possible to make a solid identification from the kinship records alone, and we were able to return the remains by the next July Fourth.'
By late 2004, more than 1,500 World Trade Center victims had been identified, and the Gene Codes team could have been forgiven for drawing a line under the M-FISys project and getting back to work on Sequencher. It was not to be. When East Asia was hit by the Boxing Day tsunami, Cash and his colleagues offered their system to the Thai authorities. That disaster presented different challenges. Although few of the bodies were fragmented, the disaster was on a far larger scale and, with many members of some families being lost, proportionally fewer kinship records were available. 'It was sometimes possible to identify someone from records from a relative who was also a victim', says Cash.
M-FISys is still not used widely in the affected area, possibly because of conflicting political interests. Cash is not discouraged and signed a contract on 31 March to provide a version of M-FISys to the UK government. Mexico has also recently signed a contract for the software. Gene Codes has offered to provide M-FISys at significantly reduced cost, and/or to donate Gene Codes technology for victim identification to several non-profit organisations.
In June 2005, the International Society for Computational Biology (ISCB) invited Cash to present the opening keynote lecture at its flagship conference, Intelligent Systems for Molecular Biology, in Detroit. About 1,700 delegates heard ISCB's president, Michael Gribskov, give Cash a special citation for humanitarian service. In the 'impossible conditions' after the World Trade Center attack, Gribskov said, Cash and his colleagues had worked to produce an 'invaluable piece of bioinformatics code'.
 K. Beck. Extreme Programming Explained. Addison-Wesley, 2000.