DATA ACQUISITION

The DNA deluge

Clare Sansom on how next-generation sequencing methods are creating challenges in the field of bioinformatics

Scientific Computing World: August/September 2007

Speaking to a packed demonstration hall during the Intelligent Systems for Molecular Biology (ISMB) conference held in Vienna in July 2007, Arif Anwar, general manager of the Malaysia-based bioinformatics company Synamatix, said: ‘Previously impossible experiments are now within reach.’ He was talking, specifically, about genomics and the impact of so-called ‘next-generation’ DNA sequencing technologies, which are finally displacing the trusted sequencing methods that have been in wide use since the beginning of the ‘genome era’. The sheer volume of data that can pour off these new sequencers is already producing unprecedented challenges for the companies that develop software to acquire and analyse it.

Anyone working in computing must be familiar with Moore’s Law, first formulated in 1965, which states that the power of computers per unit cost (colloquially, ‘bangs per buck’) will double every 18 months to two years. During the Sanger era, throughout and beyond the Human Genome Project, improvements in the cost and speed of DNA sequencing very roughly kept pace with this. Thanks to the new sequencing technologies, however, the growth of sequencing is outpacing it to such an extent that the hardware and software companies are struggling to catch up. ‘This represents a quantum jump into the next generation [of sequencing],’ says Martin Gollery, of California-based bioinformatics company Active Motif.

Faced with a newly-minted DNA sequence, most analysts will immediately want to run a database search for similar sequences. The computing requirements of these searches can be estimated, roughly, by multiplying the speed of acquisition of new query data by the size of the databases. With the introduction of next-generation sequencing techniques, both of these are predicted to increase at least 10-fold every 18 months to two years, while processor speed continues to obey Moore’s Law and increases only two-fold over the same period. ‘It is clear from simple arithmetic that we have a problem – data quantity will outpace processing power by 50 times every two years,’ says Gollery. ‘So the data acquisition and analysis backlog can only grow. We could simply buy more boxes, if we can afford them, but such large hardware systems come with practical problems in terms of requirements for power and air conditioning, and even – for the first time for decades – with floor space!’
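Gollery’s ‘simple arithmetic’ can be sketched in a few lines. The growth factors below are the article’s round numbers, not vendor measurements:

```python
# Back-of-envelope model of the search backlog Gollery describes.
# Assumed growth per 18-month-to-two-year period (illustrative figures
# from the article, not measurements): query data and database sizes
# each grow 10-fold, while processor speed only doubles.
query_growth = 10.0
database_growth = 10.0
cpu_growth = 2.0

# Search cost scales roughly as (new query data) x (database size).
workload_growth = query_growth * database_growth   # 100x per period
backlog_growth = workload_growth / cpu_growth      # 50x per period

print(f"Search workload grows {workload_growth:.0f}x per period")
print(f"Backlog grows {backlog_growth:.0f}x per period relative to hardware")
```

The 50-fold figure in the quote drops straight out of this multiplication: a 100-fold workload increase divided by a mere doubling of processor speed.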

So, what are the new sequencing technologies that are causing such a rethink within the bioinformatics industry, and what strategies are software companies adopting to cope with the coming DNA sequence deluge?

The ‘gold standard’ method of DNA sequencing is based on the discoveries of Fred Sanger: an achievement that won him his second Nobel Prize in Chemistry, in 1980. It involves dividing a sample of DNA into four portions and adding to each a DNA polymerase and the dideoxy version of one of the four nucleotides found in DNA. In each portion, DNA synthesis proceeds until a dideoxynucleotide is incorporated, at which point the growing chain terminates. This produces a series of fragments of different lengths, which are purified, separated by size using gel electrophoresis, and visualised by autoradiography. Each portion contains only fragments terminating in the base corresponding to its dideoxynucleotide, so the original sequence can be read directly from the fragment lengths. The cost of this process has been dropping steadily, and an ‘average’ sequencing run now produces 2-3 megabases (Mb) of sequence at a cost of about one cent per base.
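The read-out step lends itself to a toy illustration: each dideoxy pool yields fragments whose lengths mark the positions of one base, and sorting all fragments by length recovers the sequence, shortest first, just as on a gel. A minimal sketch with hypothetical fragment lengths:

```python
# Toy Sanger read-out: each pool maps a dideoxy base to the lengths of
# the chain-terminated fragments seen in that lane (hypothetical data
# for a six-base sequence, not real instrument output).
pools = {
    "A": [3, 6],   # fragments ending in A had lengths 3 and 6
    "C": [1, 5],
    "G": [2],
    "T": [4],
}

# Sorting (length, base) pairs by length reads the sequence directly.
calls = sorted((length, base)
               for base, lengths in pools.items()
               for length in lengths)
sequence = "".join(base for _, base in calls)
print(sequence)  # -> CGATCA
```

Real base-calling software does far more work, of course: it must resolve band intensities and quality scores rather than clean integer lengths.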

The new techniques that are replacing Sanger’s method are still experimental, but they already produce more data at a reduced cost. One problem, however, is that there is no consensus on what the new ‘gold standard’ might be. Three of the sequencing companies involved are 454 Life Sciences (recently acquired by Roche), Illumina, and ABI. The first two already have products in use; ABI’s system is due for release late in 2007, but that company is poised to take advantage of its very strong position as the leading provider of Sanger sequencers.

Each of these vendors uses a different chemistry, and all are quite different from Sanger’s. The sequencing platform developed by 454 Life Sciences[1] involves immobilising a DNA fragment on a bead, amplifying it into millions of copies, and fixing the beads onto plates for sequencing by synthesis, adding one nucleotide at a time. 454’s technology has already been used to sequence an individual human genome: appropriately, that individual was James Watson, co-discoverer of the structure of DNA and chancellor of Cold Spring Harbor Laboratory. Illumina can already produce gigabase runs within two to three days using a fairly similar method involving polymerisation of immobilised DNA fragments. ABI’s SOLiD system, however, identifies sequences using a different type of enzyme: a DNA ligase, which joins complementary fragments of DNA together.

Software companies have, of course, been producing software to acquire and analyse data from ABI sequencers for several decades. They now need to rise to the challenge that these new technologies, and the ‘quantum leap’ in the amount of data they will generate, are bringing. One of the most important of these challenges, however, arises simply from the range of products in the pipeline. ‘Will the new “gold standard” come from ABI, with its global reputation, from the current sales leaders – which are probably 454 – or from some other company? If we as a company make a poor choice, we might end up with a “white elephant” like the old Betamax video format,’ says Gollery.

Furthermore, as Anwar says, ‘current database platforms will not be able to scale to manage this ever-increasing volume and complexity of data’. The largest data volumes come from the image files that capture the raw output of the sequencers. Some companies are working on solutions that analyse image data ‘on the fly’, converting these large files to much smaller text files before any data is stored. ‘We are hearing talk of sequencing centres producing tens of terabytes (10¹³ bytes) each year. A consensus is developing that it will soon be cheaper to re-run a sequencing experiment than to store images for re-use,’ says Gollery.

Another, at least temporary, problem concerns the length, and quality, of individual sequence reads. ‘Current ABI sequencers using Sanger’s method produce very high-quality reads about 1kb (1,000 bases) long,’ says Anwar. ‘At the moment, next-generation sequencers produce much shorter reads, maybe only 25-300 base pairs long depending on the technology, and the error rate is often more than 0.1 per cent. Although the technology is improving all the time, we now have to work within these limitations, and there are obviously many more places on a chromosome where a short read will match than there are for a longer one.’

William van Etten, of the BioTeam informatics consultancy, who has been working with Apple on next-generation sequencing products for the Macintosh platform, identifies a very different challenge. ‘With sequencing becoming cheaper as well as faster, there will be tens or hundreds more centres producing the quantity of data currently associated with, for example, the Sanger Centre. Most of the staff of these centres will be biologists, medics or technicians. They will be less “IT-savvy” than today’s users of high-throughput sequencing kit, and they will need smarter, more user-friendly software,’ he says.
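Anwar’s point about short reads can be made concrete with a simple probabilistic model: in a random genome of N bases, an exact read of length k is expected to match about N/4^k positions by chance. Real genomes are highly repetitive, so this understates the ambiguity, but it shows how steeply uniqueness depends on read length:

```python
# Expected chance matches of an exact read in a random genome:
# roughly N / 4**k for a genome of N bases and a read of k bases.
# (Real genomes are repetitive, so true ambiguity is higher.)
def expected_random_hits(genome_size, read_length):
    return genome_size / 4 ** read_length

HUMAN_GENOME = 3_000_000_000  # ~3 Gb

for k in (10, 25, 100):
    hits = expected_random_hits(HUMAN_GENOME, k)
    print(f"{k:>4}-base read: ~{hits:.2g} chance matches")
```

A 10-base read matches thousands of places purely by chance; by 25 bases a read is already unique in a random 3 Gb genome, which is why repeats, not randomness, are the real obstacle for short-read mapping.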

Synamatix’ data acquisition and analysis platform is centred on its proprietary database platform, SynaBASE. The unique feature of this database is that it is based on changes and patterns in data. Rather than storing sequences as flat files, each unique subsequence is only stored once, and the original sequences are referenced only when necessary. Therefore, adding very similar sequences to the database – such as the genomes of multiple mutant strains of the same bacterium – adds very little to the storage requirements. This goes a long way to explaining its success in dealing with large volumes of data, and is particularly useful for these ‘re-sequencing’ problems. The company has developed a wide range of analysis tools that interface with SynaBASE, including SXOligoSearch, a new global DNA alignment tool that has been optimised for large-scale genome resequencing projects. This tool is able to map reads with base mismatches or gaps while fully utilising quality scores. ‘Our products are designed to run on a single server, and are ready today to handle and process data from all the currently available next generation sequencers,’ says Colin Hercus, Synamatix’ chief technology officer.
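The storage principle described here – store each unique subsequence once and represent full sequences as references – can be illustrated with a deliberately naive sketch. This is not the proprietary SynaBASE design, just a minimal demonstration of why near-identical genomes cost little extra space under such a scheme:

```python
# Naive illustration of reference-based storage: fixed-size chunks are
# stored once, and each sequence becomes a list of chunk references.
# (Not the SynaBASE implementation - a toy model of the principle.)
class ChunkStore:
    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = []    # unique chunks, each stored exactly once
        self.index = {}     # chunk text -> position in self.chunks

    def add(self, sequence):
        """Encode a sequence as a list of chunk references."""
        refs = []
        for i in range(0, len(sequence), self.chunk_size):
            chunk = sequence[i:i + self.chunk_size]
            if chunk not in self.index:
                self.index[chunk] = len(self.chunks)
                self.chunks.append(chunk)
            refs.append(self.index[chunk])
        return refs

    def restore(self, refs):
        """Rebuild the original sequence from its references."""
        return "".join(self.chunks[r] for r in refs)

store = ChunkStore()
wild_type = store.add("ACGTACGTTTGACACGT")
mutant    = store.add("ACGTACGTTTGACACGA")  # one-base difference
print(len(store.chunks))  # -> 5: the mutant added only one new chunk
```

Adding the mutant strain costs a single extra chunk, which is the effect the article attributes to SynaBASE for ‘re-sequencing’ workloads.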

Active Motif became a serious player in the game of producing software for next-generation sequencing through its acquisition of TimeLogic. They are developing applications of a type of chip-based programmable logic called field programmable gate arrays (FPGAs). This approach, which might be thought of as midway between ‘traditional’ hardware and software, is quite difficult to develop applications for; but once an algorithm has been converted, it runs in a massively parallel fashion and extremely fast. Active Motif has developed versions of popular bioinformatics programs, including the rigorous Smith-Waterman pairwise alignment algorithm and the ever-popular BLAST, to run using this technology; other companies have produced, for example, software for gel readers and a version of the multiple alignment tool ClustalW. ‘We are working towards developing a program for automatic processing of data coming off sequencers, so there will no longer be a need to store raw data,’ says Gollery.
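The Smith-Waterman algorithm mentioned here fills a dynamic-programming matrix to find the best local alignment between two sequences; the independence of cells along each anti-diagonal is what makes it such a good fit for massively parallel FPGA hardware. A minimal software version, with an illustrative linear gap penalty and scoring scheme (the FPGA implementations use their own parameters), looks like this:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between a and b (linear gap penalty).

    Illustrative scoring: +2 match, -1 mismatch, -1 per gap position.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1]
                                          else mismatch)
            score[i][j] = max(0,                      # local: never negative
                              diag,                   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,  # gap in b
                              score[i][j - 1] + gap)  # gap in a
            best = max(best, score[i][j])
    return best

print(smith_waterman("ACGTT", "ACGCT"))  # -> 7
print(smith_waterman("AAAA", "TTTT"))    # -> 0 (nothing aligns locally)
```

In software this is O(len(a) x len(b)) per pair of sequences, which is exactly why a rigorous algorithm like this benefits so much from hardware acceleration when run against whole databases.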

But specialist bioinformatics companies are not the only players in this game. Some of the most widely known names in hardware, such as Apple and SGI, are also developing hardware and software products for use with new-generation sequencers. Apple developers are hoping that the combination of sheer ‘number crunching’ power with the known appeal of its user-friendly interface will make its eight-core Mac Pro system popular with ‘less IT-savvy’ biologist users in the next generation of sequencing centres.

Barely two years after the publication of the paper describing 454’s technology platform[1], next-generation sequencing can be seen as still in its infancy. Within even a year or two, we can expect the technology and market to have matured, opening ever more ambitious genomics projects to a wider user community. And the eventual winner of the Archon X Prize for Genomics – $10m offered for sequencing 100 human genomes in 10 days at a cost of $10,000 apiece – will undoubtedly be using one of these technologies. But by that time, genomics may well have moved ahead again. For, says Gollery, ‘Any biologist can come up with a problem that will slow down any system a computer guy can come up with.’

References
1. Margulies, M. et al. (2005), Nature 437(7057), 376-80.