
Better computational descriptions of science

Scientific computing could soon be conducted in terms nearer to the problems that scientists would like to solve, thanks to the adoption of 'ontology-driven software development'. In philosophy, ontology is the study of the nature of Being and the essence of things. In the early 1990s, computer scientists, particularly those in artificial intelligence, hijacked the term and gave it a new, but related, meaning. Modern computational ontologies are pragmatic data structures born of the need for computers to co-operate in sharing information and in solving problems. With the emergence of the 'semantic web', they will play an increasingly important role in information systems.

Scientific ontologies are being developed and used in disciplines ranging from biology and medicine to geoscience and astronomy. For example, an ontology will facilitate the sharing of astronomical information as part of the Astrophysical Virtual Observatory (AVO) (www.euro-vo.org); scientists at NASA/JPL are developing a semantic framework called SWEET (sweet.jpl.nasa.gov/sweet) for the exchange of earth science information; and several biological ontologies are listed at the Open Biological Ontologies site (obo.sourceforge.net). One of the challenges for scientific ontologies is that disciplines often overlap and can have different views of the overlapping areas. In response to this, the medical ontology GALEN (www.opengalen.org) can transform between different points of view on the same or related concepts - e.g. between 'viral hepatitis' and 'hepatitis virus'.

When building an information system, it is desirable to separate the descriptions of things that exist in the real world from the 'baggage' that is necessary to make the system work. An ontology is a set of descriptions of real-world things, typically of classes of things rather than individual items. The ontology is a declarative specification of the representations that will be embedded in the system, with the advantage that it can be inspected and refined independently of the system itself. This makes it far easier for computers, or humans for that matter, to share a common understanding of domain terms and to reuse the same set of terms in different projects.

As an example, consider a very simple ontology of part of the animal kingdom. In this ontology, the domain concepts are plants and animals, herbivores and carnivores, antelopes and lions. Carnivores and herbivores are both kinds of animal, so we establish 'is-a' relationships such as 'a carnivore is-a animal' and 'a lion is-a carnivore'. The 'is-a' relation is special because it is transitive, allowing us to derive the fact that a lion is an animal even though this is never stated directly. The 'eats' relation is not transitive, though following it does allow us to analyse the food chain.
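To make this concrete, here is a minimal sketch in plain Python (the article does not prescribe any implementation; a real ontology would use a dedicated representation language). It records the 'is-a' links as a dictionary and derives, by following them transitively, the unstated fact that a lion is an animal; the 'eats' links illustrate a relation that is useful to follow but not transitive.

# Toy animal ontology: 'is-a' links are transitive, 'eats' links are not.
IS_A = {
    "antelope": "herbivore",
    "lion": "carnivore",
    "herbivore": "animal",
    "carnivore": "animal",
}

EATS = {
    "herbivore": "plant",      # herbivores eat plants
    "carnivore": "herbivore",  # simplification for illustration only
}

def ancestors(concept):
    """Follow 'is-a' links upwards; because 'is-a' is transitive, every concept reached is a valid answer."""
    found = []
    while concept in IS_A:
        concept = IS_A[concept]
        found.append(concept)
    return found

print(ancestors("lion"))   # ['carnivore', 'animal'] - 'a lion is an animal' was never stated directly

def eats(concept):
    """'eats' is not transitive: we look it up on the concept itself or inherit it from an ancestor."""
    for c in [concept] + ancestors(concept):
        if c in EATS:
            return EATS[c]
    return None

print(eats("lion"), eats("antelope"))   # herbivore plant - one step along the food chain each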

In practical terms, an ontology is both a controlled vocabulary of things in the real world and a description of the relations between them. That combination makes it more than an electronic thesaurus: it is a model of the so-called 'domain of discourse'. The real-world things are more usually called concepts or classes, and may refer to concrete things, such as 'pencil', or to more abstract things, like 'project'. As there is generally more than one way of modelling domain concepts and their relationships, we usually speak of an ontology, meaning one particular model of that domain. Information that is expressed in accordance with the concepts and relationships given by an ontology is said to commit to that ontology.
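As a hypothetical sketch of what commitment means in practice (the individuals 'leo' and 'annie' are invented for the example), instance data that commits to the toy animal ontology above uses only its class names and relations:

# Hypothetical instance data committing to the toy ontology above: every fact
# is expressed using only the classes and relations that the ontology defines.
facts = [
    ("leo",   "instance-of", "lion"),      # individuals, not classes
    ("annie", "instance-of", "antelope"),
    ("leo",   "eats",        "annie"),
]

# Any other system that commits to the same ontology can interpret these facts
# without further negotiation, because 'lion', 'antelope' and 'eats' already
# carry an agreed meaning.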

Note that although an ontology is a model, not every model is an ontology. For an ontology, 'yes' would be the answer to all the following questions:

  • Is it a declarative, explicit representation of a domain? In other words, can the representation be inspected independently of the system that will use it?
  • Is it consensual? Does it contain the combined knowledge of more than one domain expert?
  • Can it be used to solve more than one problem in the domain?
  • Will it be used in multiple applications?
  • Is it stable (i.e. changes little over time) and long-lived?

The promise of ontologies

The promise of ontologies is that they offer a common language for sharing knowledge in any given domain. Users simply state that, within that domain, they commit to a particular ontology, and they can subsequently share knowledge with anybody else who commits to the same ontology. This ability to share runs far deeper than the ability to share information between, say, two organisations that both use the Extensible Markup Language, XML. XML alone is not enough, because it operates at the level of syntax rather than semantics. The problem is that there are many different ways of describing the same information in XML. For example, the statement 'herbivores eat plants' could be written in XML as:
<herbivore eats='plants'/>
or as
<eats>
  <animal class='herbivore'/>
  <food class='plants'/>
</eats>

An information system that expects the first XML document as input but receives the second would not be able to process it. The structural choices made at this level matter because they affect the volume of code that must be written to process the representation. It follows that every XML element carries a significant cost, so it is often better to think in general terms, providing a more uniform structure and minimising the number of elements. Just as important, however, is the need for precision in making statements such as 'herbivores eat plants'. Do all herbivores eat plants, or only some of them? Do herbivores eat all plants, or are some plants poisonous even to herbivores? And do herbivores eat only plants and nothing else? The answers to such questions might be clear to human readers, who already possess the necessary background knowledge, but they need to be made explicit to computers.
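A minimal sketch of the processing problem, using Python's standard xml.etree.ElementTree library (an illustration only; the element names follow the example above): code written against the first layout extracts nothing useful from the second, even though both documents state the same fact.

# Two XML layouts for the same statement, and why they are not interchangeable.
import xml.etree.ElementTree as ET

doc_a = "<herbivore eats='plants'/>"
doc_b = "<eats><animal class='herbivore'/><food class='plants'/></eats>"

def who_eats_what_a(xml_text):
    """Written against the first layout: the subject is the tag, the food an attribute."""
    root = ET.fromstring(xml_text)
    return root.tag, root.get("eats")

print(who_eats_what_a(doc_a))   # ('herbivore', 'plants')
print(who_eats_what_a(doc_b))   # ('eats', None) - same fact, but this code cannot see it

def who_eats_what_b(xml_text):
    """A second parser is needed for the second layout."""
    root = ET.fromstring(xml_text)
    return root.find("animal").get("class"), root.find("food").get("class")

print(who_eats_what_b(doc_b))   # ('herbivore', 'plants')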

Ontologies usually contain a taxonomy of concepts, where each concept is annotated with its properties (which may be definitions or assertions). Scientists have used similar classification schemes since the 18th century, when Carl Linnaeus classified living things in his Systema Naturae. There are now many different ways of classifying and annotating real-world objects, and a confusing array of technologies to support the different approaches. Some examples are the Resource Description Framework, RDF; the Web Ontology Language, OWL; and F-Logic. Each representation technology has its own history and proponents, and each takes a slightly different approach to representation and reasoning.
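As a sketch of what such a representation looks like in practice, here is the toy taxonomy expressed as RDF using the open-source rdflib toolkit for Python (one library among many; the article does not prescribe any particular tool, and the namespace is invented for the example).

# The toy taxonomy as RDF triples, built with rdflib (one of several possible toolkits).
from rdflib import Graph, Namespace
from rdflib.namespace import RDFS

ZOO = Namespace("http://example.org/zoo#")   # invented namespace for illustration
g = Graph()
g.add((ZOO.Carnivore, RDFS.subClassOf, ZOO.Animal))
g.add((ZOO.Herbivore, RDFS.subClassOf, ZOO.Animal))
g.add((ZOO.Lion,      RDFS.subClassOf, ZOO.Carnivore))
g.add((ZOO.Antelope,  RDFS.subClassOf, ZOO.Herbivore))

# transitive_objects follows rdfs:subClassOf transitively, yielding the class
# itself and then each superclass: Lion, Carnivore, Animal.
for ancestor in g.transitive_objects(ZOO.Lion, RDFS.subClassOf):
    print(ancestor)

# g.serialize(format="turtle") would render the same triples as Turtle text,
# one of the standard RDF serialisations.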

Ontologies and the semantic web

Tim Berners-Lee described some of these technologies in his vision of the semantic web, in which resources on the internet would provide descriptions of objects that could be processed by machines (as well as humans), rather than 'just' human-readable text. The figure below shows the layers in the architecture of his vision and how they fit together (adapted from www.w3.org).

Starting from the bottom, the encoding layer contains a mapping from numbers to visible character glyphs; the mark-up layer contains text organised into structured elements through the addition of mark-up tags; the ontology layer constrains the meaning of those elements by specifying the relationships among them; and the rules layer provides the ability to derive property values automatically, prove properties of elements, or assess the trustworthiness of a description.
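As an illustration of the rules layer (a hypothetical sketch; no particular rules engine is implied, and real engines differ widely, as discussed below), a rule can derive a statement that was never asserted directly from facts expressed in ontology terms:

# A hypothetical rule applied to facts expressed in ontology terms:
# "if X is-a carnivore and X eats Y, then Y is prey-of X".
facts = {
    ("lion",     "is-a", "carnivore"),
    ("antelope", "is-a", "herbivore"),
    ("lion",     "eats", "antelope"),
}

def apply_rule(facts):
    derived = set()
    for (x, p, y) in facts:
        if p == "eats" and (x, "is-a", "carnivore") in facts:
            derived.add((y, "prey-of", x))   # a derived fact, never stated directly
    return derived

print(apply_rule(facts))   # {('antelope', 'prey-of', 'lion')}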

XML has been with us since the late 1990s, and is now an accepted and widely used technology. Technologies such as RDF and OWL are newer, but are now stable and reasonably mature. In contrast, there is no W3C (World Wide Web Consortium) standard technology for the rules layer; nor is there likely to be. This is because inference capabilities differ enormously, and what is appropriate in one situation might be wholly inappropriate in another. It is therefore reasonable to expect a number of different technologies to emerge as rules engines for the semantic web. Perhaps one of them will be used more than the others, but this will not happen as the result of a W3C recommendation.

Semantic warfare

We can expect to see a similar pattern at the level of ontologies, not with the technology used for representing an ontology, but with the vocabulary and structure of the ontology itself. A number of different ontologies are likely to emerge in a given market sector, but eventually a small number (often just one) will dominate. Peter Murray-Rust, one of the originators of the Chemical Markup Language, CML (www.xml-cml.org), has labelled competition of this kind 'semantic warfare' because the stakes are so high: control of the leading knowledge representation format in a market sector represents a significant lever of control over that sector itself. So there is now a keen contest in which many small companies are reinventing themselves as ontology vendors. And not without reason - there are genuine opportunities to map out and lay claim to specialist areas of knowledge in ontology space.

Ontologies in practice

Although there are now some accepted technologies for representing ontologies and sharing their vocabularies, there are surprisingly few accounts of the successful re-use of existing ontologies. The most common accounts of re-use are of concepts defined in a so-called 'upper ontology'; that is, highly abstract concepts that can be applied to a number of different domains. Ontologies that are specific to a particular discipline are still largely being built from scratch. Some barriers to re-use are that:

  • it is difficult to find existing ontologies (ontology libraries and ontology brokers are still in their infancy);
  • it takes time to assess whether an existing ontology is suitable; and
  • it is a risk to commit to an ontology whose stability is not assured.

As ontologies become more established, these barriers will diminish and re-use will become more common. As re-use increases, scientists and software developers will be able to focus more on how to solve the problem at hand than on how to represent the relevant domain knowledge. This view is consistent with the history of computing, in which software has been shaped increasingly by the way people think and less by the features of the computer.

We are on the verge of a new era of computing, in which the productivity benefits from ontology-driven software development will be considerable.

Simon White is Managing Director of Catalysoft Ltd (www.catalysoft.com) in Cambridge, UK, which specialises in the use of ontologies to accelerate software development.


