Navigating the sea of genes

Over my working life, statistical data analysis has explosively expanded in significance across every area of scientific endeavour. In the last couple of decades, computerised methods have ceased to be an aid and become, instead, simply the way in which statistical data analysis is done. Partly as a result, and partly as a driver, data set sizes in many fields have grown steadily. As with every successful application of technology, a ratchet effect has kicked in and data volumes in many fields have reached the point where manual methods are unthinkable.

Some areas of study have seen their data mushroom more than others. Of those, few can match the expansion found in genetics, a field which has itself burgeoned alongside scientific computing over a similar time period. The drive to map whole genomes, in particular, generates data by the metaphorical ship load; an IT manager at one university quipped that ‘it’s called the selfish gene for a reason and not the one Dawkins gave: it selfishly consumes as much computational capacity as I can allocate and then wants more.’

What is called Next Generation Sequencing (NGS) is speeding up to a dizzying degree what was not so long ago the heroically laborious business of transcribing a genome. Sequencing was once reserved for representative model genotypes; now the time is fast approaching when it will, in principle at least, be available for the study of any and every individual organism. That fact is a fundamental game changer not only for genomic and genetic studies, but for the data analyses which inform them. Both statistical tools and the computing platforms that support them must accept that their future is one of continuing exponential growth in the quantities of data they handle. The tiny sample sizes that made up the bedrock of my training and professional life are still being used to educate the next generation of statisticians, but bear no relation to present analytic reality.

Nor is sheer size the only issue: in any such period of rapid development, heterogeneity is both a blessing and a curse. Different analytic tools and approaches are generated by those involved in specific tasks, then taken up for development, modification and adaptation by those with slightly different needs. Studies are likewise designed to suit the requirements of particular enquiries. The combination of new or altered procedures and variant study designs produces a huge and ever-growing ocean of information content, enriching the primordial soup from which the most productive methods will evolve, cross-fertilise and stabilise, but it also produces a messy landscape in which direct comparison of different results is often difficult. Somewhere in that soup lie the answers to an unimaginable spectrum of questions asked or as yet unframed, but the data analyst must first fish out relevant components and then figure out how to make them work together. As IDBS’ Robin Munro points out (see box: Struggling to keep pace), ‘good quality metadata management is vital.’

Scale and diversity have required the development of new analytic, meta-analytic and platform technologies of various kinds. IDBS provides, in Munro’s words, ways ‘to ensure well managed data and results as well as orchestration of industry standard genetic and genomic tools.’ Thermo Fisher Scientific’s Proteome Discoverer offers the opportunity to automate large parts of the proteomics (that part of genetics which studies the complete protein set encoded by a genome) informational management and analysis loop – not necessarily doing analyses, but choreographing them in time with, among other things, data search, acquisition and management. Companies like SGI provide hardware that can run analyses in a single ‘on chip’ memory space of 16 terabytes, across more than 2,500 computing cores, to accommodate the need for completing ever-larger analyses within viable timescales.

More and more analyses are being approached in the broad bandwidth, high volume manner that this collateral ballooning of data sets and scientific computing technologies makes possible. Hastie et al describe[1] how, in their investigation of respiratory syncytial virus, ‘tandem mass spectra... searches were submitted... using Proteome Discoverer.’

Not every genetic study is drowning in immensity, however, and genome mapping is only one end of an investigation spectrum which also encompasses the operation of individual genes. Much valuable data collection and analysis, though it inevitably becomes part of the larger ocean, is done in much smaller tributary contexts and is analysed in established standard desktop software. This is especially so when it is part of experimental work on closely focused topics, often linking genetic and environmental or other effects. Dipping into a pile of recent papers, for instance, I find a range of data analyses underpinning genetic links to macro concerns from agriculture to oncology conducted in generic software statistics tools.

Statistical analyses in a recent investigation[2] of the periplasmic chaperone role played by HdeA and HdeB in acid tolerance of Shiga toxin-producing E. coli, to take one example, were conducted using Systat’s popular SigmaPlot software. Deletion of these genes was found to reduce acid survival rates by two or three orders of magnitude in various haemorrhagic strains, but not in the O157:H7 serotype where, by contrast, loss of hdeB had no effect and hdeA produced only half an order of magnitude effect. A point mutation which altered the subsequent sequence seemed to be the key to this divergent evolution.

VSNi’s GenStat has specific provision (as an extension of generic mixed models) for quantitative trait loci analysis and other genetics-centred analyses, as well as a strong life sciences history in general and agriculture in particular. It’s not, therefore, surprising to find it well represented in areas as diverse as gene expression under varying phosphorus levels[3] and germination timing[4] in brassicæ, polymorphism and intramuscular fat[5] in pigs, heterosis[6] in maize, blood and lymphoblastoid cell lines[7] in humans, or rain damage resistance[8] in Australian strawberries. ASReml (a separate product also from VSNi) is also represented, often in the same studies although my personal favourite was one[9] that attempts to disentangle heritable from learned antipredator behaviour components.

An important focus, particularly in agricultural genetics, is on mapping of marker loci in the DNA sequence onto trait variations in the phenotype. Association mapping (AM), an application of linkage disequilibrium (LD) mapping techniques, has a solid history in study of disease in humans. In this connection, as one approach to formalised study of genetic disease architectures, probabilistic graphical models (already well established in bioinformatic gene expression and linkage analyses) are appearing in support of AM methods up to genome scale, although there are limitations in that respect. AM has tended to emphasise high-frequency alleles, but development of statistical models is addressing this and some interesting plant studies exploit them.
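The quantity at the heart of LD mapping can be sketched very simply: given haplotype counts at two biallelic loci, the disequilibrium coefficient D measures how far the observed haplotype frequencies depart from independence, and r-squared normalises it to a 0–1 scale. A minimal pure-Python illustration (the allele labels and counts are invented for the example, not taken from any study cited above):

```python
# Linkage disequilibrium between two biallelic loci, from haplotype counts.
# Haplotypes are (allele at locus 1, allele at locus 2) pairs; labels "A"/"a"
# and "B"/"b" are illustrative conventions, not real SNP identifiers.

def ld_r_squared(hap_counts):
    """Return (D, r^2) for a dict mapping haplotype pairs to observed counts."""
    total = sum(hap_counts.values())
    # Marginal allele frequencies at each locus
    p_a = sum(c for (a, _), c in hap_counts.items() if a == "A") / total
    p_b = sum(c for (_, b), c in hap_counts.items() if b == "B") / total
    # Observed frequency of the A-B haplotype
    p_ab = hap_counts.get(("A", "B"), 0) / total
    # D is the departure from independence; r^2 normalises it by the
    # product of allele frequencies so it lies between 0 and 1.
    d = p_ab - p_a * p_b
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2

# Complete LD: only the A-B and a-b haplotypes ever occur together
counts = {("A", "B"): 40, ("a", "b"): 60}
d, r2 = ld_r_squared(counts)
print(round(d, 4), round(r2, 4))  # prints: 0.24 1.0
```

An r-squared of 1 means one locus perfectly predicts the other, which is exactly the property association mapping exploits: a marker in strong LD with a causal variant will show the trait association even if the causal variant itself was never genotyped.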

Brassicæ once again raise their leafy little heads[10, 11] here and not without reason. Though statistically powerful and capable of very high mapping resolution, AM is dependent upon well-established understanding of single nucleotide polymorphisms (SNPs) within the organism being studied. It can, therefore, be most effectively applied to those subjects whose genomes are already known and, conversely, is least useful in those not yet recorded in sufficient detail.

That limitation is, of course, fluid and progressively under revision as new genomes are explored, mapped and published. An increasing number of trees (both orchard and forest) are being sequenced and high volume sequencing methods are advancing; with suitable phenotyping to match, association mapping will follow. One high-value cash crop tree, the cocoa tree (Theobroma cacao), has already[12] received close AM attention: almost 250 samples, from 17 countries in Latin America and the Caribbean, yielding close to 150,000 expressed sequence tags and a high-density genetic map.

Looking to the future, the ways in which researchers analytically interact with data seem set to change in wide ranging ways. Current corporate providers of solutions are, as in every other area of computing, being joined by open systems and imaginative uses of distributed access. The Discovery Environment provided by iPlant Collaborative (a non-profit virtual organisation, funded by the US National Science Foundation, centred at the University of Arizona, and now in its third year), to take just one example, provides a web portal through which botanists and other plant scientists can both provide and access analytic tools. Those tools sit on high-performance computing platforms, will handle terabyte scale data sets and can be used by anyone through a semi-friendly graphical user interface. Data can be stored, analyses run, results shared.

Looking at the ways in which distributed and/or cloud-based computing structures are spreading and establishing themselves, it seems likely that this model or something like it is the pattern to expect. How exactly it will interact and cohabit with present corporate providers is anyone’s guess, but that’s not a question peculiar to genetics. Some companies have already moved experimentally down the ‘free up to a point’ route opened up by small shareware and similar vendors: Wolfram Research, for example, has for some years provided several web-based access points through which anyone with a web browser can make use of Mathematica facilities on a small scale basis, thereby setting out its stall for those who need more and are willing to pay for it. Statistical software publishers, like office suite publishers, get support from customers who do not want to rely on the excellent but unsupported free alternatives. Perhaps genetic analysis will support the same kind of mixed market. Whatever the mechanisms, they will for the foreseeable future be following an ever-upward spiral of size, speed and complexity.

References and Sources

For a full list of references and sources, visit

Struggling to keep pace

Given the advances in genomics sequencing technology, genetic analysis and management of that data have been struggling to keep pace. Recently there has been a plethora of statistical tools for genetic analysis coming to market and a growing understanding that good data management is crucial.

Open-source tools, together with some proprietary algorithms from commercial vendors, are the basis for how genetics data is generated and analysed using Next Generation Sequencing (NGS). Notable mentions include the SOAP set of tools from BGI, GATK from the Broad Institute and CASAVA from Illumina. Each of these tools is configured to achieve the best results for its own context, which can cause variation when trying to analyse data consistently across studies.

This is why most of the NGS technology vendors now offer streamlined ways of analysing primary data with freely available tools, onsite or in cloud environments. For example, Ion Reporter from Life Technologies and BaseSpace from Illumina offer cloud access to their users as part of their services. Sequence service providers like BGI and Complete Genomics also support this model. These offerings aim to control the data, metadata and information, allowing a more standardised approach. However, one still needs to be able to interpret the mass of data.

Genetic data analysis presents problems, not least around the size of the data that can be generated, and good quality metadata management is vital for all studies. Often the data itself is collected, but the experimental design and information about the sequenced individuals may be lacking or incomplete, and this can cause problems.

The key is getting to the value in the data: being able to perform tertiary analysis and meta-analysis that allow interpretation. Also essential is providing an overview of cohort identification for patients with a given set of variants, so that scientists can then look at comparison groups of patients to determine statistical significance, as in a case-control study.
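The case-control comparison mentioned above reduces, in its simplest form, to a 2x2 table: variant carriers versus non-carriers, in patients versus controls. A sketch of the arithmetic in pure Python, with entirely invented counts (for one degree of freedom, a Pearson chi-square statistic above 3.84 is significant at the 5 per cent level):

```python
# Case-control test for a variant: a 2x2 table of carriers vs non-carriers
# in cases and controls. All counts below are invented for illustration.

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    # Shortcut form of sum((obs - exp)^2 / exp) for a 2x2 table
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

def odds_ratio(a, b, c, d):
    """Cross-product odds ratio: odds of carrying the variant, cases vs controls."""
    return (a * d) / (b * c)

# Rows: cases, controls; columns: variant carriers, non-carriers
cases_carrier, cases_non = 30, 70
ctrl_carrier, ctrl_non = 10, 90

chi2 = chi_square_2x2(cases_carrier, cases_non, ctrl_carrier, ctrl_non)
oratio = odds_ratio(cases_carrier, cases_non, ctrl_carrier, ctrl_non)
print(round(chi2, 2), round(oratio, 2))  # prints: 12.5 3.86
# chi2 > 3.84 => significant at the 5% level with one degree of freedom
```

Real cohort analyses layer much more on top of this (covariates, population structure, multiple-testing correction), but the 2x2 comparison is the conceptual core of what the cohort-identification tooling described here has to support.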

Part of the solution to these problems lies in providing a good foundation in data and results management: knowing where data comes from, what a statistician did and which parameters were used for variant calling. These can all vary depending on the type of tools used and the way in which they are used. It is vital that these workflows and results are properly managed.
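One lightweight way to keep track of ‘what a statistician did and which parameters were used’ is to attach a provenance record to every result. The following sketch is purely illustrative: the field names, tool name and thresholds are assumptions for the example, not any vendor’s schema:

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass
class ProvenanceRecord:
    """Minimal provenance for one analysis step: tool, version, parameters, input."""
    tool: str
    version: str
    parameters: dict
    input_fingerprint: str  # hash of the input data, so reruns are comparable

def fingerprint(data: bytes) -> str:
    """Short, stable fingerprint of the input data."""
    return hashlib.sha256(data).hexdigest()[:16]

# Record a hypothetical variant-calling run; names and values are illustrative.
record = ProvenanceRecord(
    tool="example-variant-caller",
    version="1.2.0",
    parameters={"min_base_quality": 20, "min_mapping_quality": 30},
    input_fingerprint=fingerprint(b"...sequencing reads would go here..."),
)

# Serialise the record alongside the results, so any downstream comparison
# can check that two studies really used the same tools and thresholds.
print(json.dumps(asdict(record), sort_keys=True))
```

Even this much makes the cross-study comparison problem tractable: two variant call sets produced with different quality thresholds are visibly different at the metadata level, before anyone wastes time reconciling their results.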

Content by Robin Munro, Translational Medicine Solutions director with IDBS

