The systems biology dilemma
The biochemical networks that mediate life are driven by the intricate interplay of biological entities such as genes, proteins, and metabolites. A comprehensive understanding of any given biological condition cannot be achieved by studying these entities separately. Systems biology aims to understand both the individual functions and the interactions of genes, proteins, and other biological components together, as a single system. It has the potential to transform our understanding of the mechanisms underlying human diseases.
Systems-level research involves producing heterogeneous, global data that represent different levels of biological information, such as DNA, mRNA, protein, and metabolite. This typically entails measuring differences in gene copy number, gene and protein expression, and other biological events, such as mRNA splicing, between the phenotypes under study. The generation of such global and heterogeneous data has been made possible by several key scientific breakthroughs in the past 20 years, such as the invention of microarray technology and the sequencing of the human genome. The resulting increase in the number of candidates for profiling, together with advances in array printing and scanning technologies, led to higher-density microarrays capable of providing even higher-dimensional global data. Microarrays soon expanded to profiling other biological entities and events, such as single nucleotide polymorphisms (SNPs), genomic copy number, microRNAs, alternative splicing, and transcription factor binding sites. Now, sequencing technologies have emerged that allow scientists to re-sequence entire genomes and measure known and unknown biological entities and events in a time-frame and at a cost that were unfathomable a few years ago.
Advances in high-throughput technologies and the proliferation of companies offering these solutions have driven down the cost of performing genome-wide profiling experiments. Core facilities at various academic research institutions have made the technology more accessible to labs around the world. All of this has led to more profiling experiments being performed and larger experiments being run on higher density arrays.
Generating the data, then, is not an issue. The key bottleneck is the lack of bioinformatics solutions that allow researchers to perform integrative data analysis necessary for the identification of linkages and concordance between the different levels of information. For example, this sort of analysis will allow scientists to determine whether a detected amplification in a genomic region actually results in an increased transcription of genes within the region. Putting these types of data together will give us a better understanding of the mechanism underlying the biology.
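The copy-number example above can be made concrete with a small sketch. It assumes, purely for illustration, that copy-number and expression measurements have already been reduced to per-gene values for a shared set of samples; the gene names, sample names, and numbers are made up, and the Pearson coefficient is only one of several possible measures of concordance.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / sqrt(var_x * var_y)


def concordance(copy_number, expression):
    """Per-gene correlation between copy number and expression.

    Both arguments map gene -> {sample: value}. Only genes and samples
    present in both data sets are compared.
    """
    result = {}
    for gene in sorted(copy_number.keys() & expression.keys()):
        samples = sorted(copy_number[gene].keys() & expression[gene].keys())
        if len(samples) < 2:
            continue  # too few shared samples to correlate
        cn = [copy_number[gene][s] for s in samples]
        ex = [expression[gene][s] for s in samples]
        result[gene] = pearson(cn, ex)
    return result


# Hypothetical toy data: GENE_A expression tracks its copy number,
# GENE_B expression stays roughly flat despite the same amplification.
copy_number = {
    "GENE_A": {"s1": 2, "s2": 4, "s3": 6, "s4": 2},
    "GENE_B": {"s1": 2, "s2": 4, "s3": 6, "s4": 2},
}
expression = {
    "GENE_A": {"s1": 5.0, "s2": 7.0, "s3": 9.0, "s4": 5.0},
    "GENE_B": {"s1": 8.0, "s2": 8.1, "s3": 7.9, "s4": 8.0},
}

result = concordance(copy_number, expression)
for gene, r in result.items():
    print(f"{gene}: r = {r:.2f}")
```

In this toy data set, the amplification is reflected in the expression of GENE_A but not of GENE_B. A real analysis would also need normalised inputs, far more samples, and correction for testing many genes at once.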
One of the biggest hurdles in systems-level research is the lack of standardisation in data analysis methods, even within a single data type. For any given application, such as gene expression profiling, there are many array platforms that scientists can use to generate data, and differences between the platforms require different algorithmic tools to process and normalise the data. These differences can make it difficult to find correlations even between experiments that test the same hypothesis and interrogate the same level of biological information. The need to address this issue is evident in the FDA’s undertaking of the MicroArray Quality Control (MAQC) Project, whose goal is to provide the microarray community with guidelines for data analysis.
A first step towards standardisation is to provide a single environment in which scientists can perform integrative data analysis. The current practice is to analyse the different data types in different software applications, each containing analytical tools that are developed and optimised for a particular data type.
However, such practice can decrease the ability to find concordance between data types. For example, faulty semantic mapping of biological information between different software applications can mask any potential correlation between different data types. One of the challenges that the bioinformatics field is tackling is combining tools for analysing various data types into a single environment. Currently, there are two main approaches to overcoming this challenge. One is to create a framework where independent data analysis software applications and databases can exchange data in a way that enables exploration and analysis of global data in a single analysis environment. Examples of such efforts include Gaggle, ToolBus, Taverna, and caCore. The other approach is to combine the different analytical and visualisation tools in a single data analysis application. Software applications such as Bioconductor, Partek’s Genomics Suite, and Agilent Technologies’ GeneSpring take this approach.
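The semantic-mapping problem mentioned above can be illustrated with a minimal sketch. It assumes a hypothetical lookup table from platform-specific probe identifiers to shared gene identifiers; the function name and all identifiers are invented for illustration and do not correspond to any of the tools named here. The point is only that entries that fail to map, or that map ambiguously, should be reported rather than silently dropped.

```python
def map_identifiers(values, id_map):
    """Re-key {source_id: value} onto shared identifiers.

    Returns (mapped, unmapped, collisions):
      mapped     - {shared_id: value} for the first source id seen per shared id
      unmapped   - source ids with no entry in id_map
      collisions - source ids whose shared id was already taken
    """
    mapped, unmapped, collisions = {}, [], []
    for source_id, value in values.items():
        target = id_map.get(source_id)
        if target is None:
            unmapped.append(source_id)
        elif target in mapped:
            collisions.append(source_id)
        else:
            mapped[target] = value
    return mapped, unmapped, collisions


# Hypothetical probe-to-gene table and expression values.
probe_to_gene = {"probe_1": "GENE_A", "probe_2": "GENE_B", "probe_3": "GENE_B"}
expression_by_probe = {"probe_1": 7.2, "probe_2": 5.1, "probe_3": 5.4, "probe_4": 9.9}

mapped, unmapped, collisions = map_identifiers(expression_by_probe, probe_to_gene)
print("mapped:", mapped)
print("unmapped:", unmapped)
print("collisions:", collisions)
```

Keeping the first value on a collision is an arbitrary choice made for brevity. Averaging the probes or selecting one by a quality measure are alternatives, and the right choice depends on the platform.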
The emergence of modern measurement technologies has contributed to a marked decline in papers that follow the traditional scientific method of developing a hypothesis and then using targeted experiments to either falsify or verify it. In many cases such studies are replaced by ‘fishing expeditions’, in which data are produced but are rarely preceded by a clear hypothesis.
For many of today’s students, studies tend to go ‘a mile deep and an inch wide’. The complex nature of many biological processes is forcing students and researchers alike to focus on a small area, rather than taking into account the many complex facets of the living organism (or even of individual cells).
There are now large data repositories, such as the Gene Expression Omnibus (GEO) and the Human Protein Atlas. These may provide the basis for a more communal approach to systems biology, along with innovations such as WikiPathways and caBIG (cancer Biomedical Informatics Grid).
The promise of systems biology is a complete understanding of the underlying mechanisms of life. This bold goal requires a holistic approach that includes focused curricula in universities, collaborative approaches on the web, and interoperability of software tools and data.