Pharmaceutical companies generate huge amounts of data. Their staff are intensive users of computing power. Yet the companies do not enjoy the full fruits of the research and knowledge that they produce. Datasets sit in silos so that users in one department of a company are unable to access relevant information held elsewhere.
The problems of dealing with data that have been generated by different instruments and are therefore in different formats are well known. However, Yike Guo, CEO of the London-based software house InforSense, believes that there are deeper issues – that past approaches to data management, particularly in the life sciences, have been too inflexible.
According to Professor Guo, corporate IT departments have become a bottleneck constricting the adoption of scientific computing in the pharmaceutical industry. The task facing the developers of scientific software is to put the scientist in control of the computing. InforSense's name for the way to achieve this is 'Integrative Analytics'. The goal is software that allows users to access data from different sources and to analyse it with different programs, often involving distributed computing, while all the time maintaining an 'integrated' overview of what is going on.
Matt Hahn and his colleague David Rogers had the same insight, prompting them to leave their jobs with Molecular Simulations Inc (MSI) and set up SciTegic in 1999. According to Dr Hahn: ‘We came to the realisation that the issues people were struggling with were more mundane than the problems for which MSI was offering solutions – data was all over the place and people were not able to tie the data together.’
They felt that a new type of software, unencumbered by the compromises inherent in legacy systems, was needed. SciTegic’s answer was ‘data pipelining’, a software tool for integrating, manipulating, and analysing huge quantities of data in real time.
Data pipelining complements modern relational databases and is not in itself a data management tool. But it offers the flexibility so greatly desired by those working in drug discovery informatics because, by processing data in real time, it is not constrained by what has been pre-calculated and stored in a database.
The key is that by guiding the flow of data through a network of modular computational components, very fine control over analysis is possible. The components are configured to act on the data in different ways, so that when they are linked together into 'pipelines', they form a protocol that integrates many computational steps.
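A minimal sketch in Python may help to make the pipelining idea concrete. The component names and data fields below are invented purely for illustration and do not reflect Pipeline Pilot's actual interface; the point is simply that small, configurable components can be chained so that each record flows through every step in turn.

```python
def read_records(rows):
    """Source component: stream raw screening records one at a time."""
    for row in rows:
        yield row

def filter_by_activity(records, threshold):
    """Filter component: pass through only records above an activity threshold."""
    for rec in records:
        if rec["activity"] >= threshold:
            yield rec

def annotate(records, label):
    """Transform component: tag each record as it flows past."""
    for rec in records:
        yield dict(rec, source=label)

# Linking the components forms a 'pipeline': data is processed as it flows,
# so nothing needs to be pre-calculated and stored in a database first.
raw = [{"compound": "A-1", "activity": 0.9},
       {"compound": "A-2", "activity": 0.2}]
pipeline = annotate(filter_by_activity(read_records(raw), threshold=0.5), "HTS run 42")

for record in pipeline:
    print(record)
```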
The problem besetting drug discovery is two-fold: the ever-larger volumes of data that are being produced; and the fact that the data is in very different formats. Professor Guo cites the example of some data management systems for biological data, developed a decade or so ago, that are now difficult to maintain because of problems in scaling. ‘When people are doing high-throughput screening, there is a lot of diversity that can be embedded into the database, but it may have been designed in such a way as to be difficult to change.’ Data may have been warehoused on the basis of fixed questions ‘but the problem is that science is not done that way. Scientists ask new questions.’
The recent development of proteomics exemplifies the point, he believes. How can proteomic data be integrated into, say, clinical informatics, if the clinical data has been warehoused according to a fixed schema? The alternative approach, and the one taken by InforSense, is to define data management as a workflow, as a process, rather than as a fixed schema. By following how the data is used rather than fitting it into a predetermined structure, workflow technology can be flexible enough to accommodate the new questions that scientists might ask in the future but that are – by definition – not known today.
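The contrast between a fixed schema and a workflow-as-process can be illustrated with a small, hypothetical sketch; the clinical and proteomic records below are invented, and this is not InforSense's implementation. The new question is answered by defining a join as a processing step rather than by redesigning the warehouse.

```python
# Records warehoused years ago, holding only the fields anyone thought to ask about.
clinical = [
    {"patient": "P01", "dose_mg": 50, "response": "partial"},
    {"patient": "P02", "dose_mg": 75, "response": "complete"},
]

# Newly arrived proteomic results, in a shape the original schema never anticipated.
proteomic = {"P01": {"MMP9": 2.3}, "P02": {"MMP9": 0.4}}

def join_on_patient(clinical_records, protein_levels):
    """Workflow step: ask a new question (dose versus protein level) without
    changing the warehouse -- the join is defined as a process, not a table."""
    for rec in clinical_records:
        levels = protein_levels.get(rec["patient"], {})
        yield {**rec, **levels}

for row in join_on_patient(clinical, proteomic):
    print(row)
```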
In 2004, SciTegic was acquired by Accelrys, where Dr Hahn became chief science and technology officer of the parent company. Dr Hahn sees the next step for SciTegic as broadening the applicability of Pipeline Pilot by integrating horizontal technology such as reporting, text analytics, and image analysis. But as well as such enhancements in the technicalities of integrating different computational formats, he wants to extend the scientific reach of the software. SciTegic and Accelrys are both well established in the preclinical stages of drug discovery. He would like to see data from the clinical phases being integrated, through Pipeline Pilot, and feeding back up the line to the discovery laboratory. 'Big pharmaceutical companies need a platform to provide the feedback. But at present there are just silos of data – in different locations, used by people from different scientific cultures – and today it's difficult to link them up. The flexibility of Pipeline Pilot can help break down these barriers.'
However, he is clear that 'no one vendor can provide all the tools that the industry needs'. A few years ago, some companies did see themselves as the 'Microsoft' of bioinformatics, but the emphasis in life-science computing has now turned very much to integrating different databases and different analytical tools from different vendors. The evidence can be seen in SciTegic's Independent Software Vendor (ISV) programme. Launched in March 2005, this is intended to allow other companies to integrate their software with Pipeline Pilot, so that their tools can be easily used among the 'modular computational components' put together in the pipeline. There are now 22 companies in the ISV programme, ranging from IBM to Tripos and Spotfire.
For some software vendors in this sector, their biggest competitors are the IT departments of the large pharmaceutical companies themselves. But the nature of Pipeline Pilot allows such organisations to integrate their own technologies into the software, according to Dr Hahn. ‘It’s a very flexible environment for our customers who can pick and choose the technology they want.’ But software that is very flexible may turn out also to be very complex to use in practice. Dr Hahn is alive to this possibility: ‘I worked with visual programming environments 20 years ago and so I had seen what made them appealing and also what made them complex – so we try to keep things simple.’
Two recent announcements from InforSense in quick succession also highlight the issue of integration and compatibility with other companies' software. At the end of March, the company announced a new cheminformatics package, ChemSense. This is a vendor-neutral, enterprise-wide platform offering access to, and integration of, both tools and data sources from leading cheminformatics vendors including ChemAxon, Chemical Computing Group, ChemNavigator, Daylight, DeltaSoft, Elsevier MDL, and Tripos. A fortnight later, it announced a collaboration with Matrix Science, maker of the Mascot search tool and data management product for protein identification and characterisation, to integrate the two companies' technologies.
But flexibility in respect of data and analysis is not the only advantage of workflow technology, according to Professor Guo. He sees a need for flexibility in terms of the different users. Those 'scientists' who are being put in control of computing, as mentioned at the beginning of this article, need to be disaggregated. According to Professor Guo, at one level there are 'power users'. These are not the IT staff but the data owners: the scientists who know the data, know what it means, and can map the science to the data. There is also an important middle tier of user, the scientific manager or knowledge manager, who is interested in how different types of science can be put together. Ontologies are an important tool for this user. A more numerous class of user is the information consumer, who cares more about the answers than about the provenance of the data and for whom usability is a big issue: 'they just want to press a button'.
InforSense has thus been spending some time adapting the flexibility of the workflow approach so as to make the same system look different to different users. The key to this feature is role-based access: depending on their role in the workflow, users can access different data. The vice-president of cheminformatics might well have access to all the data in the system, while someone working on toxicology would only access a subset, depending on their role in the research programme as a whole. 'We address how, at the enterprise level, you can use all the data and knowledge available,' says Guo.
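A minimal sketch of role-based access of this kind might look as follows; the role names and record fields are hypothetical and do not describe InforSense's system.

```python
# Roles that see the full record versus roles restricted to a named subset of fields.
FULL_VIEW = {"cheminformatics_vp"}
ROLE_FIELDS = {"toxicologist": {"compound", "tox_flag"}}

def view_for(role, record):
    """Return the slice of a record that a given role is allowed to see."""
    if role in FULL_VIEW:
        return dict(record)
    allowed = ROLE_FIELDS.get(role, set())
    return {k: v for k, v in record.items() if k in allowed}

record = {"compound": "A-1", "activity": 0.9, "tox_flag": False, "cost": 120.0}
print(view_for("cheminformatics_vp", record))  # the full record
print(view_for("toxicologist", record))        # only the toxicology subset
```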
The system is built on an Oracle database, again with enterprise-level applications in mind, and uses a Grid architecture. 'We always present this as programming the Grid. Workflow is the way to extract uniform services from distributed programming. And who is the programmer? The scientist,' Professor Guo said.
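The idea of 'programming the Grid' can be sketched in a few lines of standard Python: the scientist composes a workflow step once, and the same step is farmed out across workers without being rewritten. The step and the data below are placeholders, and the standard-library process pool stands in for a real Grid back end purely for illustration.

```python
from concurrent.futures import ProcessPoolExecutor

def score_compound(compound_id):
    """A workflow step: some expensive, independent calculation per compound.
    The scoring here is a placeholder, not a real chemistry calculation."""
    return compound_id, sum(ord(c) for c in compound_id) % 100

if __name__ == "__main__":
    compounds = ["A-1", "A-2", "B-7", "C-3"]
    # Locally this could be a plain loop; on a cluster the executor is simply
    # swapped for a distributed one, and the workflow definition stays the same.
    with ProcessPoolExecutor() as pool:
        for cid, score in pool.map(score_compound, compounds):
            print(cid, score)
```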
Pipeline Pilot runs on Linux and on Windows. ‘It is a robust and highly scalable environment that can run on large-scale Linux clusters,’ Dr Hahn said. ‘Our customers have rolled it out to hundreds of users across multiple sites.’
Integration of all data from the research lab to the results of clinical trials has been an aspiration since the earliest days of bioinformatics. Some companies attempted it too early, and attempted too much, designing bespoke systems intended to exclude competitors but not flexible enough to accommodate the diversity of user requirements. Now there is a corresponding diversity of companies offering different types of software for managing and analysing all that data, while yet other companies provide the tools to make integration a reality at last.