The evidence of life's path
In the beginning - that is, no more than a decade or so ago - there was genomics, the study of the genes of each organism as a whole. Ten years on, the new science has come of age, with the genomes of more than 200 single-celled organisms and many important higher organisms already known. And what has become known as the post-genomic era has yielded a multitude of other 'omes', starting with the transcriptome and the proteome: the full complement of RNA transcripts and proteins in a cell, respectively.
A couple of years ago a cynic might even have defined the suffix 'ome' as one that, when added to any field of bioscience, would yield successes in grant applications. However, fashions in science and its terminology change fast, and the hype surrounding the 'omics revolution' may already be beginning to die down. The literature of the second decade of this century is unlikely to contain many references to some of the more fanciful of the 'omics'.
One word that is very likely to find a lasting place in the lexicon of 21st century biology, however, is metabolomics. This 'omics' is derived from 'metabolism': the field of biology that is concerned with the study of the biochemical processes and pathways that are necessary for the life of an organism, and with the chemicals that they work with. The online encyclopaedia Wikipedia defines 'metabolomics' as the 'systematic study of the unique chemical fingerprints that specific cellular processes leave behind'. The condition of a cell or organism, whether it is healthy, under stress or diseased, can be told from the precise nature of the chemicals, or 'metabolites', that it contains; these chemicals are often more accessible to analytical study than proteins and transcribed genes.
Metabolomics, therefore, is the name given to the variety of techniques used to recognise patterns in the chemicals present in biological samples in order to decipher their significance. It involves a combination of 'wet' analytical chemical techniques and 'dry' analysis and modelling. As the analytical techniques (types of chromatography and mass spectrometry) used to identify 'small molecule' metabolites are easier, faster and cheaper than the similar techniques that are applied to identify and measure proteins, it can be relatively easy to follow metabolite concentrations in real time, in what is known as 'metabolic flux analysis'. Another advantage of analysing metabolites is that the metabolome is 'downstream' of the genome and proteome, in that changes in metabolites result from changes in the protein content of a cell - which, in turn, result from changes in gene expression. Small changes in protein content may cause much larger and more easily measured changes in metabolites, amplifying the signal caused by the changes in the proteome.
The computational part of metabolomics - the elucidation of patterns in metabolite type and concentration, and their correlation with biological property and function - can be thought of as part of systems biology, defined trivially as the analysis of molecular data in the context of a system. With metabolomics, of course, the 'system' is the cell type (or possibly single cell) in which the metabolites are measured. It is perhaps significant that the word 'metabolomics' was first coined in a paper [1] that was co-authored by one of the UK's best known systems biologists: Douglas Kell of the University of Manchester. Kell has recently been appointed as the director of one of only three systems biology centres set up by the country's research councils. The Manchester Centre for Integrative Systems Biology, located in the splendid new Manchester Inter-disciplinary Biocentre, draws its principal investigators from across the spectrum of the life, health and physical sciences. They include Steve Oliver, a distinguished yeast geneticist and the first author of the 'first metabolomics paper'. The centre will employ a dozen postdoctoral researchers in developing testable, parametrised models of cellular metabolism using yeast as a model organism.
'In the pre-genomic era, molecular biologists took a reductionist approach, starting with a biological function and trying to discover which gene was responsible,' explains Kell. 'Often, now, we have all the data, but we still know little about which genes are involved in complex traits and how they interact. We are moving into an era of inductive biology, where we start with the data and infer the hypothesis.' He describes this inductive approach to molecular biology as analogous to 'putting Humpty Dumpty together again'. His group in Manchester is measuring the time course of metabolites in cells and using complex computational techniques, particularly those of machine learning, to infer knowledge about cellular function from them.
Problems with the reductionist approach become particularly apparent when looking at traits that are quantitative rather than qualitative: a property such as height or weight, for instance, is influenced by many genes. Much of biology, therefore, is quantitative, and biological systems can be thought of as analogous to other complex systems. In 2002, in a thought-provoking paper in Cancer Cell, Yuri Lazebnik of Cold Spring Harbor Laboratory in the US compared biological systems to electronic ones [2]. He described traditional approaches to molecular biology as being like trying to fix a radio by removing and describing each component in turn. An electronic engineer, in contrast, will use a circuit diagram as a formal model to help him (or her) understand how the radio works and how it can be fixed.
It is one of the roles of systems biology to develop such formal models or 'circuit diagrams' of cell types. The process can be thought of as circling from modelling to experimental biology and back again. Biological data and knowledge, such as the types and concentrations of metabolites present in a cell at different times during a biological process, are used to define a model that simulates the process. The model is then run, its output compared to real biology, and then refined using the observed differences between prediction and experiment. This process is repeated until the model converges, to give, hopefully, a model that is an accurate representation of the process under study. One important feature of this type of model is that it is run without any pre-judged hypothesis.
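The run, compare and refine cycle described above can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the 'model' is a single metabolite decaying at a first-order rate k, the 'experiment' is a synthetic time course, and the refinement rule is a simple step-halving search. Real systems biology models have many parameters and use far more sophisticated fitting.

```python
import math

def simulate(k, times, c0=1.0):
    """Toy model (assumed): one metabolite decaying at first-order rate k."""
    return [c0 * math.exp(-k * t) for t in times]

def mismatch(k, times, observed):
    """Sum of squared differences between model prediction and experiment."""
    return sum((p - o) ** 2 for p, o in zip(simulate(k, times), observed))

def refine(times, observed, k=1.0, step=0.5, tolerance=1e-6):
    """Run the model, compare with the data, adjust k, and repeat."""
    error = mismatch(k, times, observed)
    while step > tolerance:
        for candidate in (k + step, k - step):
            candidate_error = mismatch(candidate, times, observed)
            if candidate_error < error:
                k, error = candidate, candidate_error
                break
        else:
            step /= 2  # neither direction improved the fit, so take smaller steps
    return k, error

# Synthetic "experimental" time course generated with a hidden rate of 0.3.
times = [0, 1, 2, 3, 4]
observed = simulate(0.3, times)
fitted_k, final_error = refine(times, observed)
print(f"fitted rate constant: {fitted_k:.4f}")
```

The loop stops once further adjustments no longer improve the agreement between prediction and data, which is the sense in which the model 'converges'.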
In computational terms, the type of analysis performed in simulating the metabolic properties of a biological process can be described as one of combinatorial optimisation: optimising a large number of parameters to simulate an equally large number of observed variables (the metabolites). It would, theoretically, be possible to do this systematically, trying every possible combination of parameters. This self-evidently stupid approach, however, scales exponentially with the number of variables; applying it to a real biological problem would require longer in computer time than the age of the Universe. The solution is to rely on a group of computational techniques generically known as machine learning.
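A rough calculation shows how quickly the systematic approach becomes hopeless. The sketch below assumes, purely for illustration, that each parameter is tried at ten values and that a computer can evaluate a billion combinations per second; neither figure comes from the work described here.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10   # roughly 13.8 billion years
EVALS_PER_SECOND = 1e9            # assumed speed: a billion model runs per second

def grid_size(levels_per_parameter, n_parameters):
    """Number of combinations an exhaustive search has to try."""
    return levels_per_parameter ** n_parameters

def years_needed(levels_per_parameter, n_parameters):
    """Computer time, in years, to try every combination at the assumed speed."""
    combinations = grid_size(levels_per_parameter, n_parameters)
    return combinations / EVALS_PER_SECOND / SECONDS_PER_YEAR

for n in (10, 20, 50, 100):
    years = years_needed(10, n)
    verdict = "longer" if years > AGE_OF_UNIVERSE_YEARS else "shorter"
    print(f"{n:>3} parameters: {years:.3g} years ({verdict} than the age of the Universe)")
```

Under these assumptions, ten parameters take seconds, but fifty already take vastly longer than the Universe has existed.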
Kell and his group in Manchester have been using one of these techniques, genetic programming, to identify those metabolites that are the most closely correlated with biological traits. This technique takes its name from an analogy with inheritance and evolution. It starts with a population of individual models, each encoding a solution (in terms of metabolite content) and each with a 'fitness value' related to how closely its solution fits the given data. The models are ranked; altered or 'mutated'; and combined to produce another generation of models. This procedure then iterates until it converges onto a solution that is as close as possible to the input data.
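The rank, mutate and combine cycle can be sketched in a few lines. Note that this is a plain genetic algorithm working on fixed-length lists of numbers, not genetic programming proper (which evolves rules or expressions), and it is not the Manchester group's code. The target values, population size and mutation scale are all arbitrary assumptions.

```python
import random

def fitness(individual, observed):
    """Higher is better: negative squared error between solution and data."""
    return -sum((a - b) ** 2 for a, b in zip(individual, observed))

def crossover(mum, dad, rng):
    """Combine two parents, taking each value from one parent or the other."""
    return [rng.choice(pair) for pair in zip(mum, dad)]

def mutate(individual, rng, scale=0.1):
    """Alter an individual by adding a little Gaussian noise to each value."""
    return [value + rng.gauss(0.0, scale) for value in individual]

def evolve(observed, pop_size=40, generations=200, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(0.0, 10.0) for _ in observed]
                  for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        # Rank the models by how closely they fit the data.
        population.sort(key=lambda ind: fitness(ind, observed), reverse=True)
        history.append(fitness(population[0], observed))
        # The fitter half survives unchanged; the rest are replaced by
        # mutated combinations of randomly chosen survivors.
        survivors = population[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(survivors), rng.choice(survivors), rng), rng)
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    best = max(population, key=lambda ind: fitness(ind, observed))
    return best, history

# Hypothetical "observed" metabolite levels that the population must match.
observed = [2.0, 5.5, 0.7]
best, history = evolve(observed)
print("best individual:", [round(v, 2) for v in best])
```

Because the fittest individuals are carried over unchanged, the best fitness can never get worse from one generation to the next.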
Many species of plants respond to bacterial infection by producing salicylic acid, a simple chemical that is closely related to aspirin, as a defence mechanism. Kell's group used genetic programming, with data on the metabolism of more and less resistant transgenic plants, to develop a rule-based model that predicts from metabolite concentrations which plants will be most resistant.
The algorithm generated a simple rule that predicted resistance accurately using only three metabolites. 'We found that the one metabolite that discriminated most clearly between resistant and less resistant plants was not salicylate,' says Kell. 'It was, in fact, a previously unknown metabolite, labelled number 42. This “42” may not be the answer to life, the universe and everything, but it does represent an important step forward in plant defence research. We have discovered an important and previously unknown metabolite from data alone, without first setting out a hypothesis.'
This 'hypothesis-free discovery' is a crucially important development, if only because of the impossibility of taming the 'avalanche' of biological data that is now being produced by relying on hypothesis-driven science alone.
Taking this process of automating scientific discovery further, Ross King of the University of Wales at Aberystwyth and colleagues, including members of the Manchester group, have developed what is called a 'robot scientist', combining metabolomics with robotics and artificial intelligence (AI). PC-driven robots performing biochemical assays - a technology that is ubiquitous throughout the pharmaceutical industry - are linked, via a laboratory information management system (LIMS), to a 'master' PC running AI code for generating hypotheses and selecting experiments to test them.
King's group chose the aromatic amino acid (AAA) synthesis pathway of baker's yeast (Saccharomyces cerevisiae) to test their robot scientist. The master PC was programmed with a complete logical representation of the genes, proteins and metabolites involved in this pathway, taken from the KEGG encyclopaedia of metabolism [3]. This model uses the inference that a mutant with a gene or genes knocked out will only grow if it can synthesise the full set of aromatic amino acids from the compounds in its environment, via the remaining proteins in the pathway, to design a set of experiments to predict gene function. 'Even if there are only 15 possible experiments, there are 15 factorial, or more than 1.3 × 10^12, different orders in which they can be performed. It is obviously important to choose a logical order. The robot scientist generates hypotheses (or predictions of gene function) and then chooses and performs experiments to prove or disprove them,' says Kell. The robot was found to assign gene function accurately and to outperform not only a random selection of experiments but also strategies that may be favoured by funding councils where the cheapest experiments are selected first. Although the experiments chosen were more expensive, the robot's strategy was more cost effective because fewer experiments were needed to assign the functions of all the genes.
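Both the arithmetic and the growth inference can be checked with a short sketch. The three-step pathway below is entirely hypothetical (the gene and compound names are invented and it is not the real yeast AAA pathway), and the function is only a guess at the kind of logic involved, not the robot scientist's actual code.

```python
import math

# The number of orders in which 15 experiments can be performed.
print(math.factorial(15))  # 1307674368000, a little over 1.3 x 10^12

# Hypothetical toy pathway: each gene's enzyme turns one compound into another.
PATHWAY = {
    "geneA": ("precursor", "intermediate1"),
    "geneB": ("intermediate1", "intermediate2"),
    "geneC": ("intermediate2", "amino_acid"),
}
REQUIRED = {"amino_acid"}  # what the cell must be able to make in order to grow

def can_grow(knocked_out, medium):
    """Predict growth: can every required compound be made from the medium
    using only the enzymes whose genes have not been knocked out?"""
    available = set(medium)
    changed = True
    while changed:
        changed = False
        for gene, (substrate, product) in PATHWAY.items():
            if (gene not in knocked_out and substrate in available
                    and product not in available):
                available.add(product)
                changed = True
    return REQUIRED <= available

# Knocking out geneB blocks growth unless the medium supplies what geneB makes.
print(can_grow({"geneB"}, {"precursor"}))
print(can_grow({"geneB"}, {"precursor", "intermediate2"}))
```

Comparing predictions like these with whether a real mutant grows on a given medium is what allows a hypothesis about a gene's function to be supported or ruled out.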
[1] Oliver SG, Winson MK, Kell DB, Baganz F (1998). Systematic functional analysis of the yeast genome. Trends Biotechnol 16: 373-378.