BIOINFORMATICS

In pursuit of proteins

Clare Sansom on the evolution of structural bioinformatics

Scientific Computing World: October/November 2007

Next year the scientific community will celebrate the 50th anniversary of an advance that must have had almost as much influence on the drug discovery and allied industries as the discovery of the structure of DNA. In 1958 – a mere five years after Watson and Crick’s insight, and in the same city, Cambridge, UK – John Kendrew and his colleagues published the first structure of a protein, the oxygen binder myoglobin. By all accounts Kendrew was initially disappointed that this first glimpse into the three-dimensional protein world revealed an irregular structure with nothing of the elegance of DNA’s double helix. Understanding what we now know as the globin fold, however, has led to enormous insights into the evolution and mechanism of action of these proteins. Sickle-cell anaemia, caused by a mutation in the closely related protein haemoglobin, was the first inherited disease to be understood at a molecular level.

There is, however, nothing unique about myoglobin that made it ‘the first protein structure’. Kendrew and his co-workers chose to study it because it is small, easy to work with, and of significant medical importance. Other scientists were working on similar proteins at the same time, and over the next two decades a trickle of other structures entered the scientific literature. The Protein Data Bank – a publicly accessible repository of protein structures – was established in 1971 with just seven structures, and by 1976 it still held only 13. At about the same time, pharmaceutical companies started using protein structure as an aid to drug design. Millions of people owe their lives to dihydrofolate reductase inhibitors, developed ‘rationally’ as treatments for cancer and infectious diseases using that protein’s structure as a template.

Briefly, in the 1990s, the use of structure in drug design seemed to go out of fashion. Once it became possible to design and build millions of related compounds quickly and cheaply using combinatorial techniques, many companies simply threw those millions of compounds at an assay in order to pick out the ‘best’ ones. Oddly, however, this brief phase coincided with the structure-based development of the important class of HIV protease inhibitors. And, although very few now believe – as some in the 1980s claimed – that structure-based techniques on their own can revolutionise drug discovery, they are now an integral and important part of 21st-century drug discovery.

And, in structural biology as in every other sub-discipline of molecular biology, that trickle of information has become a flood. The Protein Data Bank now contains more than 45,000 structure files, representing well over a thousand distinctly different families of proteins. A very high proportion of drug targets – although not some of the most important, and most experimentally intractable, embedded membrane proteins – are represented there. Yet even this is only a small fraction of the protein sequences known. If a company, for example, intends to design a potent and specific inhibitor for a particular protein kinase as an anti-cancer drug, it will need as accurate a model as possible of the structure of that particular kinase; and of the 500 or so kinases in the human genome, only a small fraction have an experimentally known structure. Since the 1980s the technique of homology, or comparative, modelling has been used to bridge this ‘sequence-structure gap’. And, during the last quarter-century, software companies specialising in this and other molecular modelling, or structural bioinformatics, techniques have emerged – and often merged and de-merged, and made and occasionally lost their reputations. Almost all these companies either emerged from academia, or take and commercialise algorithms developed there. And superficially, at least, each company now offers a similar range of products for both predicting protein structures and ‘docking’ smaller molecules, such as candidate drugs, into those proteins’ binding sites.

Chemical Computing Group (CCG) is one of the newer entrants to the field. It built its increasingly popular ‘Molecular Operating Environment’, MOE, around a proprietary, interpreted programming language, Scientific Vector Language (SVL). In fact, the company was not originally set up with a focus on molecular biology. ‘When we were spun out of McGill University in Canada, in the mid-1990s, the original idea was to market a toolkit based on SVL,’ says Steve Maginn, director of scientific services at CCG in the UK. ‘Market research, however, suggested that people prefer specialist applications, admittedly ones that are easily customisable, and we identified structural molecular biology and drug design as our target area.’ Most of the code in MOE is written in this high-level language, which has a good deal of basic chemistry built into it. The compiler is proprietary, but all the code is in the public domain, and 95 per cent of it is written in SVL and can be modified easily by users. This adds flexibility in its use, as Maginn’s German counterpart, Wolfram Altenhofen, explains. ‘Maybe 10 per cent of our customers use SVL to develop their own code to add into MOE. Others, mainly commercial users, prefer to ask us to code algorithms for them, which we readily do, and yet others – such as bench molecular biologists – are perfectly happy with the basic functionality that we provide.’

MOE includes a database of some 15,000 unique structures of protein chains reduced from the full Protein Data Bank and clustered with sequences from the SwissProt database to give a set of protein families with both complete structural information and evolutionarily related sequences. This allows users to pull information about more distantly related sequences into a modelling exercise and select the most appropriate template or templates on which the new protein structure will be built. And important new features are planned for the next release, at the end of 2007. ‘We will be extending our existing homology modelling capabilities to deal with multimers and multi-domain structures, such as antibodies, in one step,’ says Maginn.
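As a rough illustration of the template-selection step described above, the sketch below ranks candidate templates by their percentage sequence identity to the target. It is a minimal Python sketch and not MOE’s or any other vendor’s method: the function names and toy sequences are hypothetical, and it assumes the sequences arrive already aligned, with ‘-’ marking gaps. Real tools also weigh structural quality, coverage and family information.

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percentage of identical residues over aligned (non-gap) columns."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def rank_templates(alignments: dict[str, tuple[str, str]]) -> list[tuple[str, float]]:
    """Sort template names by descending identity to the target."""
    scores = [
        (name, percent_identity(target_row, template_row))
        for name, (target_row, template_row) in alignments.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical, pre-aligned toy data: template name -> (target row, template row)
toy_alignments = {
    "template_A": ("MKV-LAGT", "MKVALSGT"),
    "template_B": ("MKVLAGT", "MRILSGA"),
}
ranking = rank_templates(toy_alignments)
```

Here `ranking` lists the closer relative, `template_A`, first. Identity alone is a crude guide, which is why clustering structures with related sequences, as described above, gives a modeller more to go on.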

Schrödinger, based in Portland, Oregon, is rather older than CCG – it was founded in 1990 – but, like CCG, it has only recently begun to compete with and take market share from its larger rivals. Senior vice president Ramy Farid attributes its success to the novelty of its mathematical models, developed in conjunction with key academic partners. ‘CCG owes its success to the phenomenal flexibility and ease of use of the MOE interface,’ he says. ‘In contrast, we rely on using the most reliable and up to date algorithms and the newest force fields, through collaborations with academics such as Bill Jorgensen at Yale.’ The company was co-founded by Richard Friesner, Professor of Chemistry at Columbia University, who is still the chairman of its Scientific Advisory Board, and benefited hugely from investment from a ‘business angel’, wealthy hedge fund manager David E Shaw, a former Columbia professor and founder of the specialised investment management and technology development group, D E Shaw & Co. ‘David may be a personal friend of Friesner’s, but he is a scientist first and foremost; he chose to invest in us because he believes in the science that we are doing,’ says Farid. The company is still private, but is now independent of angel funding. Its Glide algorithm for ‘docking’ small molecules into protein binding sites is greatly respected in the industry, and very widely used. Schrödinger provides a routine for homology modelling and protein structure prediction, Prime, to be used ‘upstream’ of Glide in the drug discovery process. Unlike most other companies, however, it also offers routines for the refinement of protein structures from X-ray data. ‘The only other commercial company that provides a similar package is Accelrys, and they are no longer developing theirs,’ says Farid.
‘Our main competitors are academic groups such as the CCP4 consortium, and their products, although less costly and scientifically reliable, are harder to use.’ Another important feature of Schrödinger’s software developed recently is an algorithm for predicting induced fit effects, where a protein’s structure changes when a ligand is bound. This incorporates code from both Glide and Prime. ‘We now know that Accelrys is working on a similar product, whereas it is not long since it was we who were playing “catch-up” with them,’ says Farid.

Models created using Schrödinger's Prime software

And what of the market leaders of the 1990s? One of these, Tripos – founded in 1979 – recently survived a period of financial turmoil after investing heavily in a loss-making Discovery Research chemistry division, based in Cornwall, UK. ‘[This division] diverted resources and management focus away from our core strengths – our Discovery Informatics business,’ says Simon Cross, product manager for Tripos’ core modelling program, Sybyl. ‘The assets from Discovery Informatics, and the informatics technologies that underpinned Discovery Research, were bought by Vector Capital. The new Tripos is essentially a new private company with the same name, same leading scientists and improved versions of the same core software products, but without the distraction of the chemistry business.’

Within Sybyl, Tripos offers a complete suite of programs for Advanced Protein Modelling (APM) based on algorithms developed in collaboration with one of the most respected names in the field, Professor Sir Tom Blundell, of Cambridge, UK. Someone in that lab must be a serious classical music fan, as many of the algorithms developed there and incorporated into Tripos’ package have musical names. These include Fugue, for recognising homologous sequences and building alignments using both sequence and structural information, and Orchestrar, for building comparative models from the homologues selected. The tools all share the common theme of reusing knowledge from an ever-increasing pool of known protein structures, with environment-specific mutation tables used to pick out the most likely homologues, loop fragments, and side-chain conformations. Some of these tools, and Homstrad, the curated database of protein families used by Fugue, are available to academic users for free. However, these versions lack the user-friendly interface that is, like MOE’s, almost universally praised. One unnamed industrial client recently commented that ‘Orchestrar is a pleasure to work with’; another has praised ‘the sequence viewer, and the ability to add homologs during the workflow’.
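The idea behind the environment-specific tables mentioned above can be sketched in a few lines of Python. In this hypothetical example, each position in a template structure is labelled with a structural environment, and a table gives a score for finding each amino acid in that environment. The environment labels, the scores and the function name are invented for illustration and are not taken from Fugue or Homstrad.

```python
# Toy scores: how favourable each residue is in each structural environment.
# All values are invented for illustration only.
TOY_TABLE = {
    "buried_helix": {"L": 2.0, "A": 1.0, "K": -2.0, "G": -1.0},
    "exposed_loop": {"L": -1.0, "A": 0.0, "K": 1.5, "G": 2.0},
}
DEFAULT_SCORE = -0.5  # assumed penalty for residues missing from the toy table


def environment_score(sequence: str, environments: list[str]) -> float:
    """Sum per-position scores for a gapless match of sequence to environments."""
    if len(sequence) != len(environments):
        raise ValueError("sequence and environment list must have equal length")
    return sum(
        TOY_TABLE[env].get(residue, DEFAULT_SCORE)
        for residue, env in zip(sequence, environments)
    )


# Hypothetical template: two buried helix positions, then two exposed loop positions.
template_envs = ["buried_helix", "buried_helix", "exposed_loop", "exposed_loop"]
good_fit = environment_score("LAKG", template_envs)  # hydrophobic core, polar loop
poor_fit = environment_score("KGLA", template_envs)  # the same residues, misplaced
```

The same four residues score well in one order and badly in the other, which is the point of such tables: a sequence is judged by how well each residue suits the structural position it would occupy, not by sequence similarity alone.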

All protein modellers aspire to do well in the ‘competition’ for structure prediction known as CASP (or ‘Critical Assessment of Techniques for Protein Structure Prediction’), which has been held every second year since 1994. Structural biologists release the sequences of proteins whose structures they are about to solve so that modellers can predict them ‘blind’; when each structure is solved, work on that sequence stops, and the precision and accuracy of all predictions are assessed together at a meeting. The quality of homology models has increased steadily, if slowly, in recent years and there is now little difference between a number of high-quality predictors. In 2000, both CCG and Tripos entered CASP4, and both algorithms were in the top four assessed. ‘We haven’t submitted our own models since then, and we don’t know if we’d do as well now,’ admits CCG’s Maginn. ‘We do know that some of our customers enter, but they don’t always tell us how they do.’ It may not be surprising that software houses are fairly relaxed about proving themselves to be ‘the best’. This is now a mature field and drug developers have a healthy range of reliable tools to choose from.