In silico optimisation of gene expression

Markus Fischer, of German biotech company Entelechon, overcomes the complications associated with gene synthesis

Pathfinding by pheromone deposition. Artificial ants wander through a graph depositing pheromone traces (orange) on their way.

Proteins are commercially important in the production of new drugs and biotechnological products, but their synthesis is often a costly bottleneck in the pharmaceutical industry. Proteins are usually produced by integrating a gene into a suitable host organism, such as a bacterium, and using the biosynthesis machinery of the host to translate – or express – the gene into a protein.

However, the biosynthesis of proteins is very complex. Just as the chemical industry has improved inorganic catalysts, so the biotech industry is trying to optimise the expression of commercially relevant proteins.

One particularly interesting approach is the in silico optimisation of the underlying gene sequence. Because it requires no laboratory work, it is comparatively cheap. The universal genetic code maps a set of 64 codons (written out in the 'language' of DNA) onto just 20 amino acids (the basic building blocks from which the long, complicated protein molecules are built up). This mapping is degenerate: most amino acids are encoded by more than one codon. This opens up the possibility of strategically replacing codons in a given DNA sequence to enhance protein production, without changing the encoded protein. This is what is meant by 'optimising a gene'.
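To make the degeneracy concrete, here is a minimal Python sketch using a small fragment of the standard genetic code (the full table has 64 entries); the helper name `synonymous_codons` is our own illustration, not part of any of the tools discussed:

```python
# A few entries of the standard genetic code, mapping RNA codons to
# one-letter amino-acid symbols.
CODON_TABLE = {
    "UUU": "F", "UUC": "F",                          # phenylalanine: 2 codons
    "CUU": "L", "CUC": "L", "CUA": "L", "CUG": "L",
    "UUA": "L", "UUG": "L",                          # leucine: 6 codons
    "AUG": "M",                                      # methionine: 1 codon
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",  # glycine: 4 codons
}

def synonymous_codons(codon):
    """All codons encoding the same amino acid as `codon` - any of these
    can be substituted without changing the encoded protein."""
    aa = CODON_TABLE[codon]
    return sorted(c for c, a in CODON_TABLE.items() if a == aa)

print(synonymous_codons("CUU"))  # the six leucine codons
```

Swapping CUU for any of its five synonyms leaves the protein identical; gene optimisation is the problem of making such swaps at every position at once.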

Preparing a DNA sequence for optimisation. In order to be able to optimise a DNA sequence, the user must define the segmentation boundaries of the codons (highlighted in blue). The bars beneath the codons indicate the relative codon frequencies.

Some of the parameters that affect the yields of a protein are well understood. Among them are: the so-called codon usage; the mRNA secondary structure; regulatory DNA motifs; GC content; and repetitive motifs. All these parameters are independent and potentially conflicting. This, and the number of possible codon permutations, makes gene optimisation a formidable task. At an average of three codons per amino acid, the number of permutations for a sequence of 300 amino acids is in the range of 1.3 × 10^143. This is one case of a class of problems called multi-objective optimisation (MOO)[1].
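The size of that search space can be checked directly: an average of three synonymous codons per position gives 3^300 combinations for a 300-residue protein.

```python
# Back-of-the-envelope check of the figure quoted above: three choices
# per position, 300 positions.
n_permutations = 3 ** 300
print(format(n_permutations, ".2e"))  # about 1.37e+143
```

Even at a billion evaluations per second, exhaustive search is hopeless, which is why heuristics are needed.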

To reduce this overwhelming complexity, one can partition the sequence into fragments, then optimise each fragment independently of the others. Geneart's GeneOptimizer software does this by stepping through the sequence, optimising a few codons at a time. Once a codon or a string of codons has been optimised, it is frozen and the next codons are optimised, including all previously optimised codons in the scoring function. The drawback of this approach is that it introduces an artificial asymmetry into the sequence (the optimisation moves directionally through the sequence), and that it therefore evaluates only a tiny portion of the solution space.

To overcome this problem, we at Entelechon created a program called Leto[2], which considers the complete sequence at all times. It uses a heuristic to search the solution space. A natural solution is a multi-objective genetic algorithm (MGA), since the underlying data – the DNA sequences – are already in a form usable by the algorithm.

At the core of each genetic algorithm is the selection function, which chooses the fittest individuals from a population of potential solutions. The multi-objective GA needs to be modified to use not one but several such functions – one for each optimisation parameter. To compare the fitness of two individuals, a so-called Pareto criterion is applied: A solution is Pareto-optimal when it cannot be improved in one parameter without worsening another parameter. In the terminology of the Pareto optimisation, such a solution is 'non-dominated', as no other known solution is superior in all parameters.[3]

A multi-objective GA can easily be derived from a single-objective genetic optimisation algorithm, using the Pareto criterion. Simply replace the fitness function with a custom sorting procedure that sorts the individuals according to their Pareto optimality.
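The Pareto criterion itself is a short piece of code. Here is a sketch (assuming every objective is to be maximised; the function names are ours):

```python
def dominates(a, b):
    """True if fitness vector a Pareto-dominates b: a is at least as good
    in every objective and strictly better in at least one."""
    return (all(x >= y for x, y in zip(a, b)) and
            any(x > y for x, y in zip(a, b)))

def non_dominated(population):
    """The Pareto front: individuals not dominated by any other member."""
    return [p for p in population
            if not any(dominates(q, p) for q in population if q is not p)]

# Four candidates scored on two conflicting objectives:
pop = [(3, 1), (1, 3), (2, 2), (1, 1)]
print(non_dominated(pop))  # (1, 1) is dominated by (2, 2); the rest survive
```

In a multi-objective GA, a sorting procedure built on `dominates` replaces the single fitness function: individuals on the front rank ahead of those they dominate.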

Once the set of all Pareto-optimal individuals has been found, someone has to select from it the most preferred solution. All Pareto solutions are optimal in an abstract sense, but some solutions may be more suitable to the user than others. The selection can be done automatically if additional criteria are available; it can be random; or the user can be presented with a set of solutions from which she can choose the one that best suits her.

Pareto optimisations come at a significant cost: computation time increases drastically, because each fitness evaluation requires computing all single-objective fitness functions for all individuals. We have investigated whether an alternative algorithm can improve the performance of the optimisation: the so-called multi-objective ant colony optimisation, or MACO. The first ant colony optimisation (ACO) algorithm was introduced by Marco Dorigo in 1992[4]. To estimate its performance on DNA optimisation, we have compared a MACO against the MGA used in Leto.

ACOs are inspired by the foraging behaviour of ants. Ants on their way from the colony to a source of food leave a track of pheromones. This track is detected by other ants, which can follow it to find the food. As they move along this path, they themselves leave the same pheromone; thus, gradually, the faint pheromone trace is amplified by more and more ants attracted to the slowly-building pathway. At the same time, the pheromone evaporates at a constant rate, so pathways that are rarely used, or that are too long, eventually disappear.

This collaborative behaviour requires no global information on the part of the ants. Each ant decides on the basis of locally available information where to go. Nevertheless, the emerging pathways are often optimal, and are fault-tolerant. Disrupting a pheromone path – for example, by placing an obstacle across it – leads to local random walks which soon find a bypass.

The fault tolerance and the emergence of an optimal solution from local behaviour led to the adaptation of this ant behaviour into an optimisation algorithm. ACOs usually employ a set of ants that move through a graph, with each edge representing a possible local solution. Once they have found the target, their pheromone track is laid, its intensity proportional to the quality of the global solution. This implies that the artificial ants work slightly differently to their real-life colleagues: The pheromone track is not created ad hoc, but only after the fact, i.e. when the goal has been reached. And its intensity is not constant, but dependent on how good the result is. These two modifications are not strictly necessary, but they improve the performance of the ACO significantly.
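A minimal ACO iteration along these lines might look as follows; the graph, the parameter names (`rho` for the evaporation rate, `q` for the deposit strength) and the toy example are our own illustration, not Leto's implementation:

```python
import random

def aco_step(graph, pheromone, start, goal, n_ants=10, rho=0.1, q=1.0):
    """One iteration of a basic ACO. graph maps node -> {neighbour: length};
    pheromone maps each (node, neighbour) edge to its current level."""
    paths = []
    for _ in range(n_ants):
        node, path = start, [start]
        # Each ant walks towards the goal, guided only by the local
        # pheromone levels on the edges leaving its current node.
        while node != goal:
            options = [n for n in graph[node] if n not in path]
            if not options:
                break  # dead end: this ant fails
            weights = [pheromone[(node, n)] for n in options]
            node = random.choices(options, weights=weights)[0]
            path.append(node)
        if node == goal:
            paths.append(path)
    # Evaporation: every edge loses a constant fraction of its pheromone.
    for edge in pheromone:
        pheromone[edge] *= 1.0 - rho
    # Deposition happens only after the fact, with intensity proportional
    # to the quality (inverse length) of the complete path.
    for path in paths:
        length = sum(graph[a][b] for a, b in zip(path, path[1:]))
        for edge in zip(path, path[1:]):
            pheromone[edge] += q / length

# Two routes from A to D; the shorter one (A-B-D) accumulates pheromone.
graph = {"A": {"B": 1, "C": 5}, "B": {"D": 1}, "C": {"D": 1}, "D": {}}
pheromone = {(a, b): 1.0 for a in graph for b in graph[a]}
random.seed(0)
for _ in range(20):
    aco_step(graph, pheromone, "A", "D")
```

Both modifications from the text are visible here: pheromone is laid only after an ant reaches the goal, and its amount depends on how good the route was.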

An ACO has to be modified, in order for it to be applicable to the multi-objective optimisation of DNA sequences. The original algorithm requires a single objective. The Pareto approach cannot be easily fitted onto an ACO, since it would require very many ant runs. Instead, we integrate the parameters into a single-objective function, using different weights for each parameter. (This approach can also be used to optimise investment portfolios.)[5]

As there is no way to decide which weights are the correct ones, we let the algorithm figure it out: We assign to each ant a different set of weights. In this way, we avoid choosing one weight set over another. The idea is to let the ants compete not only for the best solution, but also for the best weight distribution.
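In code, this per-ant weighting could be sketched as follows (the names and the two-objective example are illustrative):

```python
import random

def random_weights(n_objectives):
    """A random weight vector: non-negative entries summing to one."""
    raw = [random.random() for _ in range(n_objectives)]
    total = sum(raw)
    return [r / total for r in raw]

def scalarise(fitness_vector, weights):
    """Collapse several objective scores into one weighted score."""
    return sum(w * f for w, f in zip(weights, fitness_vector))

# Every ant carries its own trade-off between the objectives, so no
# single weighting is imposed a priori; ants whose weights lead to good
# solutions reinforce their paths more strongly.
random.seed(1)
ants = [{"weights": random_weights(2)} for _ in range(10)]
fitness = (0.8, 0.3)  # e.g. (codon-usage score, GC-content score)
scores = [scalarise(fitness, ant["weights"]) for ant in ants]
```

Each weight vector picks out a different point on the trade-off surface, so the colony as a whole explores a spread of compromises rather than a single one.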

The comparison of the MGA and the MACO is very interesting: in the case of DNA optimisation with a set of two parameters, their results are very close, which indicates that both algorithms succeed in getting very close to the theoretical optimum. In terms of the computation time required, the MGA performs more favourably, but we believe this to be at least partially implementation-specific, and that there is still room for improvement. Since the MACO has at its disposal both global and local information, whereas the genetic algorithm uses global scoring functions only, the MACO should be able to converge more quickly onto the optimal solution. This is supported by the fact that far fewer iterations are needed (although each iteration takes longer). Once we have fine-tuned the performance of the MACO, we will use it to replace the MGA currently used in Leto.

The problem of DNA optimisation is, of course, only one example of a large set of problems relevant to many fields. Other examples include the operational variables of large-scale chemical production processes, such as the control of polymerisation; integrated circuit design; optimising hydrodynamic systems, such as jet engine and turbine design; and electric motor design.

The solution space, defined by all possible permutations of codons for a given amino acid sequence, is vast – and the optimisation must consider up to 10 independent parameters, which all have the same priority. Other fields with similar requirements can also benefit from MACOs. Since these algorithms do not depend on a meaningful definition of weights for the individual fitness functions, they can be used wherever such a definition would be difficult, i.e. where the relative importance of parameters is difficult to estimate a priori, or where the parameter ranges are variable.

Fortunately, the implementation of a MACO is not very difficult. Each ant is a very simple agent which decides its movement in a graph on the basis of local criteria. All it needs to do, then, is to memorise its way through the graph, in order to update the pheromone track once the end of the graph is reached. Thus, it is easy to implement MACOs and MGAs in parallel, to see which one provides the better performance for any specific problem. For our gene optimisation problem, the MACO turned out to be a very useful approach, since its performance can be easily fine-tuned, and the robustness of the algorithm – it almost never failed to converge towards a very good solution, as compared with the results from the genetic algorithm – makes it easy to incorporate into commercial software.

Discovering DNA

In the 20 years since the start of the human genome project, the biotechnology market has been driven by genomic and proteomic research, according to a new report from market research firm Kalorama Information. The study, titled The Future of Biotechnology Instrumentation, predicts that the market will continue to grow, driven by demands for faster in silico analysis of genetic information and reliable means of data storage, mining and post hoc analysis.

'The field of biotechnology instrumentation promises to be an area of exciting technological innovation that will be a key part of the array of genomic and proteomic analysis technologies extending over the next 10 years,' says Dr Kenneth Krul, the author of the report. 'While the nature of the industry is not expected to change, we can expect to see major advances, particularly in the area of micro chip technology.'

The full report can be found at www.kaloramainformation.com/pub/1186403.html

Advances in technology are driving scientific discovery. For example, Compugen recently announced a new method for predicting novel proteins by analysing random genetic mutations in so-called 'junk DNA'. The Compugen system analyses the long passages of junk DNA found in publicly available databases. The first results of this new analysis, recording previously unknown proteins, have been published in the Proceedings of the National Academy of Sciences (USA).

The DNA sequences utilised by the new method are essentially ancient, mutated copies of current genes, termed processed pseudogenes. By analysing thousands of such human pseudogenes, Compugen's scientists were able to predict the existence of hundreds of novel transcript variants, some of which have been validated experimentally in the company's laboratories. Several proteins predicted by these novel transcripts have been selected by Compugen as possible therapeutic candidates, and are now undergoing further evaluation.


[1] K.C. Tan, T.H. Lee and E.F. Khor, 'Evolutionary Algorithms for Multi-Objective Optimization: Performance Assessments and Comparisons,' Artificial Intelligence Review, Vol. 17, No. 4, pp. 253-290, 2002

[2] www.entelechon.com/leto

[3] Vilfredo Pareto, Manual of Political Economy (1906); 1971 translation of the 1927 edition, New York: Augustus M. Kelley

[4] Marco Dorigo and Thomas Stützle, Ant Colony Optimization, MIT Press, Cambridge MA, 2004

[5] Karl Doerner et al, 'Pareto Ant Colony Optimization: A Metaheuristic Approach to Multiobjective Portfolio Selection,' Annals of Operations Research, Vol. 131, No. 1-4, pp. 79-99, 2004

