The cheminformatics challenge

Commentators often quote, out of context, the same lines from the poet T S Eliot to illustrate the dilemma of a pharmaceutical industry that is still finding it difficult to pick useful molecules out of the deluge of genetic and other information available: ‘Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?’

We might also add: ‘Where is the information we have lost in data.’ It is estimated that there are, potentially, about 1,080 different chemical entities with properties that might make them suitable as drugs. The fraction of these that has been synthesised and studied is still tiny, but it is growing all the time. Making sense of this deluge is, in large part, a problem in data management.

Data management in cheminformatics can seem to be rather a crowded field, with many companies offering rather similar products. However, whereas there is a lot of overlap, there is still a lot of variety, and almost every lab will need to turn to more than one provider for their chemical information needs. In 2007, what makes one product stand out from another is not necessarily only the range of functions it provides, but what can be seen as ‘fuzzier’ concepts. These include flexibility – for example, whether a program will run on different operating systems or interface easily with products using different data formats – ease of use for chemists who are not informatics specialists, and even the business models under which the software is provided. Open source software, beloved of academics for decades, is now beginning to penetrate the pharmaceutical industry.

Methods of managing data electronically may still be relatively new, but the concepts of data management in chemistry go back at least to the mid-20th century. One key player in the field, however, can trace its informatics roots back much further than that. What is now the Informatics Division at Bio-Rad, a company best known for research and development in electrophoresis, started life in 1874 as a company founded by chemist Samuel P Sadtler – who, as a PhD student at the University of Göttingen, worked with that Dr Bunsen who invented the eponymous burner. That Sadtler’s grandson began collecting reference chemical data, for Richard Perkin and Charles Elmer of PerkinElmer in the 1940s; this was published first in print and then as a computer-based database. Sadtler’s company was acquired by Bio-Rad as the Sadtler Division in 1978, and this officially became the Informatics Division in 2001. That year also saw the first release of BioRad’s premier chemical data management program, KnowItAll.

The scientific focus of KnowItAll is spectroscopy. The program provides tools for storing and searching chemical ‘meta-data’ from infrared and NMR spectroscopy and mass spectrometry, as well as structures and other chemical information. Last year the company forged a link with a chemometrics company, Infometrix, and began to incorporate features from Infometrix’ package Pirouette, named from an innovative feature in which a 3D data plot is made to spin on its axis ‘like a ballerina’.

‘Our products are useful in many chemical applications, but we have a particular interest in biological applications, particularly where metabolite analysis can be used for diagnosis,’ says Gregory Banik from Bio-Rad’s Informatics Division. ‘We are working closely with scientists from Imperial College, working at Hammersmith Hospital in London, to analyse metabolite differences in disease.’ Last April, Bio-Rad released a version of KnowItAll designed and priced for academia. KnowItAll U has identical functionality to the commercial version of the software, but is priced competitively at three levels, depending on the highest degree – bachelor’s, master’s, or PhD – awarded by the establishment concerned. In contrast, clients from industry are generally charged on a ‘per lab’ or ‘per seat’ basis.

In comparison to the venerable Bio-Rad, Hungarian cheminformatics company ChemAxon, founded in 1998, is quite a new entrant to the field. Its first product, a chemical structure and reaction editor, Marvin, was released two years later, and a database management tool, JChem Base, developed to go with it. ‘Our current products can handle 60-70 per cent of the cheminformatics capability required for drug discovery,’ says ChemAxon’s Alex Allardyce. The company’s latest product, Instant JChem, combines many of the features from these two popular product lines bundled together in a desktop platform for bench chemists. This single, easy-to-use product, which comes with a demo database of public domain chemical data, was put together for ChemAxon by Tim Dudgeon of software consultancy Informatics Matters. ‘Instant JChem will be useful for almost any chemist with at least a couple of hundred structures to search and analyse,’ says Dudgeon.

Most data is essentially based on either text or numbers. Molecular data, being structure-based, is an exception. All ChemAxon’s products combine a facility for handling chemical structures directly with more mundane data management tools, including a useful facility to convert between data formats. ‘We support almost all features of the common chemistry file types being used in cheminformatics,’ says Allardyce.

Although Instant JChem is often used as a desktop tool, it can also handle much larger datasets. This contrasts with a package that many bench chemists still trust for much of their data analysis: Microsoft’s ubiquitous spreadsheet, Excel. ‘Almost 100 per cent of the chemists on the planet are comfortable with using Excel,’ says Dudgeon. ‘But Excel is not an ideal tool for drug discovery, partially because of its limited scalability, but also because it can’t handle chemical structures properly.’ No Excel spreadsheet can have more than 65536 rows; this limits its use in analysis of even moderately-sized chemical databases.

A custom form in Instant JChem allows a user to design a report that contains exactly what they want to see.

It is certainly true that most, if not all, of the features of ChemAxon’s software are provided by other vendors. Possibly, therefore, the most innovative and distinct feature of ChemAxon is the ‘please do use it’ business model that it operates under. The desktop version of Instant JChem is provided free for all users, and all the company’s products are free for academics. ‘We estimate that about 70 per cent of the world’s chemists will be satisfied with the functionality that we provide free of charge. Others, at least if they work in the commercial arena, will have to pay,’ says Allardyce. Examples of operations that commercial users have to pay separate site licences to use include performing structure-based calculations on the fly during a query, and this can work out to be very expensive as there are 15 separate calculation types available.

Tobias Kind, a researcher in metabolomics at the University of California Davis, has been using Instant JChem for a year. His group’s research involves the identification of the complete set of ‘small molecules’ found in blood, plasma and other biofluids. He finds ChemAxon’s products invaluable for what he describes as ‘straightforward cheminformatics’: ‘Using Instant JChem, I can calculate 170 different molecular descriptors of compounds, and use these to identify the compounds we find,’ he says. ‘I have colleagues in my lab who work in both Windows and Linux environments. It has been good to find a product that works equally well on each platform, and that will also run simultaneously on all the processors of our eight-CPU machines. ChemAxon also provide excellent support; they are open to suggestions, and don’t even mind if we ask them stupid questions.’

The cheminformatics software industry is a relatively small one, and, increasingly, many vendors look to collaborate on particular projects rather than compete. Andrew Lemon, managing director of the discovery informatics company The Edge Software Consultancy, based in Aldershot, UK, has worked with ChemAxon on a number of projects, while masterminding the production of its own biological data and project management product, BioRails. This combines some of the straightforward chemical calculation and data management tasks performed so efficiently by Instant JChem with features more reminiscent of an electronic lab notebook (ELN). BioRails makes full use of Web 2.0 design principles; in fact, it gets its name from the Web 2.0 development environment, Ruby on Rails, that was used to build it. ‘Ruby on Rails is a hot topic in e-commerce web development, and BioRails is the first scientific product based on this environment to reach the market,’ says Lemon.

‘The idea behind it is to enable scientists to design and run their own complete drug discovery projects in a seamless electronic environment, without having to run backwards and forwards to the IT department to get different pieces of software set up.’ It can be used to manage projects involving a wide range of experiments, including complex animal behavioural studies used to test drugs targeting the central nervous system and preclinical safety testing.

The Edge also offers BioRails through an innovative business model based on an open source approach. ‘We don’t charge anyone, academic or commercial, for BioRails; we charge only for support and installation,’ says Lemon. Through applying this model, he finds, his relationship with his clients has become one based on service and partnership. ‘It pays us to keep our clients’ interest over the long term rather than continually moving on to the next client,’ he explains. ‘Even given the cost of our support agreements, the overall cost of ownership is much lower over the entire product cycle, which makes good business sense for our clients.’

Screenshot from BioRails.

Most researchers involved with data manipulation will, at some time, have used Wolfram Research’ Mathematica. In the two decades since this company was founded by Boston- based academic Stephen Wolfram – a prodigy with an Eton and Oxford background, who published his first paper at 16 and was awarded his PhD before his 21st birthday – this has grown from a simple calculation tool to a full-scale mathematical and computational environment including (at least, according to the company website) ‘the world’s largest web of mathematical capabilities and algorithms’. The latest version, Mathematica 6, includes the capacity to load data on demand from curated databases. Its chemical database provides aggregated structural, physical, identification and safety properties of more than 18,000 chemicals, integrated into Mathematica’s powerful general computational and visualisation environment.

At the moment, the data management services offered are relatively simple, and no computational chemistry options are provided: all molecular geometry, for example, is read from internal tables. ‘The database you see now is only the beginning of a larger project,’ says Wolfram Research’s Jon McLoone. ‘Our eventual aim is to put Mathematica at the centre of all available databases for a wide range of scientific and technical applications’. One measure of Wolfram’s ambition for its software may be that it can already process a total of 140 different file formats, without having to pass data through external filters.

Although the products offered by the four companies profiled here have much functionality in common, the differences – in style as well as substance – are equally important. At this stage, I would like to offer only two predictions: that no single product will be able to provide all the data management tools required for cheminformatics in drug discovery, and that the future, even when viewed from the pharmaceutical industry, will increasingly be open source.

The cheminformatics challenge

Editor's picks

The 2026 storage survey: strategies for AI and data-intensive research

NEW On-Demand | Ontologies - the missing foundation for AI in drug discovery

On-Demand | One workflow, every tool: how AI-native ELN is changing drug discovery

On Demand: Free Online Panel Discussion | LIMS innovation boosts precision and security

The path to AI federated learning for drug discovery

Workstations vs Clusters for Ansys Applications

Avoid Duplication, Reduce Fragmentation | Integrated Informatics for Scientific Research