Quantifying the ideal

Plato posited a cosmology in which the variability of what we see around us results from imperfect manifestation of ideal forms. So, for example, the large black horse and the small piebald horse you pass in a field are variant images of an ideal pattern of perfect horsehood. It’s not difficult to square this, at least intuitively, with modern ideas of micromutation and genetic blueprints. It is also a view of the universe mirrored in the idea of a theoretical model – the model being an idealised image of the varying reality (whatever reality may be, once you start looking at it closely). If all of this seems whimsical, consider the work of Dor Abrahamson[1] and others on how students learn statistics and probability concepts.

Statisticians have a very close, if ambivalent, relation to this business of models and ideals. Like sculptors working a block of marble in search of the form within, they seek to reveal the model which sits as a shining ideal behind the grubby uncertainties of real data – in fact, statistics could be defined as the quantification of deviance from the ideal. This Platonic vision is clearest in classical frequentist statistics with the attempt to place key population descriptors on one of a predefined list of approved mathematical distributions but it’s just as real, if less obvious, for Bayesian approaches. Even where models are not themselves statistical, do not themselves use statistical methods, and eschew any statistical connection, they are usually derived from origins in probabilistic investigation and their relation to reality is therefore a statistical concern.

IBM, for instance, is interested in building cerebral cortex biomimetics using confabulation theory. UCSD’s Robert Hecht-Nielsen, in a talk to IBM’s Almaden Institute on Cognitive Computing last year, emphasised that confabulation architecture contains ‘no algorithms, rules, Bayesian networks, etc’[2]. Confabulation, however, depends on maximisation of cogency, and cogency is defined by a probability statement of relation between assumed facts, so evaluation of outcomes as mimesis is a statistical exercise in biocomparison (Bayesian and otherwise). I hope to return to this in a future issue.

Less ambitious than modelling the functions of the cerebral cortex itself, though not necessarily less interesting, is statistical study of the braincase which contains it. This is the aim of a research study, still in the unfunded preapproval pilot phase, by a young African academic. Since the work is contentious for sociopolitical and religious reasons, and nascent careers fragile, I won’t identify the researcher or university more closely than that; my primary interest here is, in any case, not the study itself but the informational framework within which it is to be conducted. The study asks whether there is a ‘trajectory’ for evolutionary development of the cerebral cortex, indirectly inferable from analogous trajectories of three hundred descriptors for the physical forms of the cortex components and the container within which they sit. A range of standard software is in use, Genstat and mathStatica dominating.

The original impetus for the idea arose from physiological elements in the work of Steven J Mithen[3] at Reading, which seeks to provide an evolutionary hypothesis for the origin of music, though direct connection ceases there. The human form has always been intensely measured in all its aspects, and measurement of the skull has always been one of the few hard factual bases for paleoanthropologists. Nor is statistical analysis of such measurements new, with examples back into the 19th century (or earlier, depending on your definition of ‘statistical analysis’), Rao[4] being an early case of modern work 60 years ago, while collaborative work by van Vark and Schaafsma[5] exemplifies current multivariate examination.

One of the most common comparisons drawn between hominin species, and for that matter other primates too, is cranial capacity – 1.1 to 1.7 litres for you and me, much the same for Mithen’s neanderthals, something like 0.9 for Homo ergaster, dropping to 0.6 or thereabouts for habilines, and so on down to Australopithecus afarensis in the region of 0.4 litres. There is no evidence that crude cranial capacity has much correlation with capacity for mentation (though the ratio of brain mass to body mass is an approximate indicator). Evolutionary changes don’t occur arbitrarily, however, so the assumption must be that either changing cranial dimensions or perhaps associated alterations in configuration are linked, in some way, with a survival advantage. The development trajectory study puts aside all consideration of what such advantages might be, and concentrates on seeking the patterns themselves – not just in capacity but in finer and more localised linear or area measurements. Construction of a metamodel from those patterns might enable comparison with environmental data which might, in turn, throw up material for future hypotheses.

This is a prime example of statistical model as ideal. The fuzzy clouds of metric data will feed generation of initial parameters in traditional frequentist style from multiple measurements with associated temporal and geospatial coordinates. Each parameter set comprises a local submodel, and each parameter is then analysed across the range of submodels to derive its tree of diverging 3D trajectories through time and space as fitted curves. Those curves then become components of a larger metamodel, probably visualised as a set of solid confidence volumes. Putative linkages between volumes would then, it is suggested, be explored by Bayesian means. The complete resulting structure would in effect be an ‘ideal’ – a reference template, an ideal statistical idea of a skull, against which actual skull form at a particular time and place could be compared.
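
The first two steps of that pipeline (local submodel parameters, then a fitted trajectory with a confidence band) can be sketched in a few lines of Python. This is a deliberately reduced illustration, not the study’s method: it assumes one measurement, one dimension of time only, a straight line in place of the diverging 3D curves, and standard errors treated as known. The function names and the measurement values are hypothetical.

```python
import math
from statistics import mean, stdev


def summarise(samples):
    """Local submodel for one place and time: (mean, standard error)."""
    return mean(samples), stdev(samples) / math.sqrt(len(samples))


def _weighted_moments(times, ses):
    """Weights 1/se^2, their sum, the weighted mean time and weighted Sxx."""
    w = [1.0 / se ** 2 for se in ses]
    sw = sum(w)
    tbar = sum(wi * t for wi, t in zip(w, times)) / sw
    sxx = sum(wi * (t - tbar) ** 2 for wi, t in zip(w, times))
    return w, sw, tbar, sxx


def weighted_line(times, means, ses):
    """Weighted least-squares straight line; returns (intercept, slope)."""
    w, sw, tbar, sxx = _weighted_moments(times, ses)
    ybar = sum(wi * y for wi, y in zip(w, means)) / sw
    sxy = sum(wi * (t - tbar) * (y - ybar)
              for wi, t, y in zip(w, times, means))
    slope = sxy / sxx
    return ybar - slope * tbar, slope


def band_halfwidth(t0, times, ses, z=1.96):
    """Approximate 95% half-width of the fitted line at time t0,
    assuming the standard errors are known rather than estimated."""
    _, sw, tbar, sxx = _weighted_moments(times, ses)
    return z * math.sqrt(1.0 / sw + (t0 - tbar) ** 2 / sxx)


# Hypothetical values (mm) for one skull dimension at three dates
# (years; negative means BCE). Invented purely for illustration.
submodels = {
    -4000: [131.0, 129.5, 132.0, 130.5],
    -2000: [132.5, 133.0, 131.5, 134.0],
    0: [134.5, 135.0, 133.5, 136.0],
}

times = sorted(submodels)
params = [summarise(submodels[t]) for t in times]
means = [m for m, _ in params]
ses = [se for _, se in params]

intercept, slope = weighted_line(times, means, ses)
for t in times:
    fitted = intercept + slope * t
    print(t, round(fitted, 2), "+/-", round(band_halfwidth(t, times, ses), 2))
```

Repeating the fit for every descriptor, and widening each band into a volume over space as well as time, would give something closer to the ‘solid confidence volumes’ described above; the Bayesian linkage step is not attempted here.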

The main stumbling block in this scenario is that those ‘fuzzy clouds’ of data to which I just referred are in most cases only small clusters, or even single examples. The fossil record for hominin development is very small, and the skull is often not complete, so only incomplete sets of measurements are available for most specimens. For this reason, exploratory work has to be done on a species for which numerous perfect examples are available, to generate a control body of methods and results – Homo sapiens sapiens being the obvious choice. Previous studies within the same frame serve as reference points: a 1905 data set[6], for example, records four skull dimensions (maximal breadth, basibregmatic height, basialveolar length and nasal height) in samples of 30 Egyptian skulls from each of five time points over the first four millennia BCE.

For many in the life sciences Genstat is their statistical first language, with a strong history and literature that adapt well to anthropology. Raw data on skull metrics, as they become available, are imported from disparate source formats to a Genstat book (GWB) file, which becomes the base archive from which everything else is calculated. From these data Genstat derives a range of descriptors, from basic summary measures to a set of different diversity indices (designed for other purposes entirely, but adapted for internal use with variant type frequencies), as complete as the data permit. Exploratory work, including general linear modelling, is also carried out in Genstat. Early immersive examination through the GUI is followed by development of automated routines, several of which are now in use after eight months of out-of-hours work. The results are exported as CSV or other generic files for easy accessibility by collaborators, and become the basis for subsequent work.
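
For readers unfamiliar with diversity indices, the sketch below shows how two common ones behave when fed variant type frequencies. It is written in Python rather than Genstat, and the choice of the Shannon and Simpson indices is an assumption made for illustration: the study’s actual set of indices is not specified here. The counts are hypothetical.

```python
import math


def proportions(counts):
    """Convert raw variant-type counts to proportions, dropping empty types."""
    total = sum(counts)
    return [c / total for c in counts if c > 0]


def shannon(counts):
    """Shannon index H = -sum(p * ln p); zero when only one type occurs."""
    return -sum(p * math.log(p) for p in proportions(counts))


def simpson(counts):
    """Simpson (Gini-Simpson) index 1 - sum(p^2): the chance that two
    items drawn at random with replacement belong to different types."""
    return 1.0 - sum(p * p for p in proportions(counts))


# Hypothetical frequencies of four variant types of one skull feature
# in two samples of 30. Invented for illustration only.
even_sample = [8, 7, 8, 7]
skewed_sample = [25, 3, 1, 1]

for name, counts in [("even", even_sample), ("skewed", skewed_sample)]:
    print(name, round(shannon(counts), 3), round(simpson(counts), 3))
```

Both indices are higher for the evenly spread sample than for the skewed one, which is the property that makes them usable as summary descriptors of variation within a sample.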

The metadata then shift to mathStatica, which places symbolic manipulation at centre stage with the resources of Mathematica behind it. Symbolic exploration of probability density function derivatives and expansion of infinite series offer ways into understanding the behaviour of summary parameters, among other things. Flexible investigation of theoretical structures also enables model-building to work backwards as a check on the work built up from raw data. Symbolic approaches also enable some modest speculative extrapolation beyond data already entered, providing predictions against which the models can be compared as new groups of data are added. It’s here that the longitudinal skeleton of the eventual model volume will be built and progressively refined; the platform for building of the volume itself has yet to be considered.

Collaborators and those interested in either the model or its methods include various workers whose interests extend beyond the human ancestry line to wider hominoid species. A small team concerned with detailed comparison of gorilla populations, for example, is maintaining updated copies of the current metadata files and beginning to develop a parallel metadata set of its own for comparison. Initial attention has been focused on the possibilities of multivariate comparison between particular gorilla micropopulations and specific points on the human ancestry trajectories. Every statistical analysis product has its own particular toolset and, since this US-oriented group uses Statistica as its platform rather than Genstat, there is not a perfect match between all elements of the two databases. The diversity indices in particular, not designed for this sort of use, are built into Genstat but have to be constructed by hand in Statistica. Nevertheless, the fit between the two is extensive enough to be valuable.
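
One familiar form of multivariate comparison between a sample and a single reference point is the squared Mahalanobis distance, which scales the difference by the sample’s own covariance. The Python sketch below is offered only as an illustration of that idea; the article does not say which multivariate method the gorilla team uses, and both the function names and the figures are hypothetical. It needs at least as many well-spread specimens as dimensions, otherwise the covariance matrix cannot be inverted.

```python
def mean_vector(rows):
    """Column means of a list of equal-length measurement tuples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance(rows):
    """Sample covariance matrix (n - 1 denominator)."""
    n = len(rows)
    m = mean_vector(rows)
    p = len(m)
    return [[sum((r[i] - m[i]) * (r[j] - m[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def mahalanobis_sq(point, rows):
    """Squared Mahalanobis distance of a reference point from a sample."""
    m = mean_vector(rows)
    d = [p - mi for p, mi in zip(point, m)]
    z = solve(covariance(rows), d)
    return sum(di * zi for di, zi in zip(d, z))


# Hypothetical two-dimension measurements (mm) for one micropopulation,
# and a hypothetical point read off a fitted trajectory.
micropopulation = [(150.0, 98.0), (154.0, 101.0), (149.0, 97.5),
                   (156.0, 103.0), (152.0, 99.0)]
trajectory_point = (140.0, 120.0)

print(round(mahalanobis_sq(trajectory_point, micropopulation), 2))
```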

Modern gorilla data, like H sapiens data, are obviously more plentiful than those from early humans, and the data for each item more complete. This team has used its US university and commercial links to have experimental design investigations run using S-Matrix’s FusionPro as guidance for efficiency in future sampling strategy, not just for its own work, but as an aid to focus in building up the hominin set. Application of experimental design principles to observational data collection is an area for caution, but improves statistical efficiency if used with care. FusionPro, which combines these principles with those of data mining, is designed with industrial science and engineering in mind, but the statistical bases for its tools are as applicable to any other expensively data-intensive area.

Individual contacts, mostly in the US again, but some in Australia and New Zealand, also keep an interested eye on the work and lend occasional expertise. A forensic scientist involved in war crimes and genocide investigation maintains a watching brief. Most of the mathStatica expertise is provided by an Australian supporter, which is appropriate since mathStatica hails from Sydney. Some of the Genstat routines are being adapted by another friend of the project from his existing R stock. While the similarities between human and gorilla skulls are clear, exploratory adaptation of S-Plus versions of the R routines to a much reduced set of whole body metrics for mustelids is intriguing. Parallel checking of subsets from the mathStatica work, or assumptions upon which that work depends, is being done in programs as diverse as Maple and Cytel’s StatXact in locations as widely separated as Adelaide, Christchurch and San Diego – which goes to show how much of a global village science has become, and the breadth of appeal which a potential statistical ‘ideal’ can command.

I have always been delighted by the ability of statistics to provide new views of the world around me. When I discovered the normal, binomial and Poisson distributions, in my teens, I felt as if I had been given an equivalent of Superman’s X-ray vision: a set of new spectacles through which to see order in rampant variety. To see, as here, how an as yet unmaterialised idea can trigger subliminal ripples in the fabric of scientific imagination to touch distant disciplines brings back that visceral excitement. An ‘ideal’ that can embrace both hominins and mustelids needn’t stop at anything. In principle, the hominin skull trajectory could be just a fragment of a vast set of volumes writing back through all living things to the first cell. And even if it never gets off the ground in its own terms it is, like all ideas and all ideals, already important just by having been conceived. In an age of corporate research with colossal budgets, a lot of vital things still happen in the cracks and interconnecting spaces of small science – and it’s statistical computing power, these days, that gives them reach.

[1] Abrahamson, D., Bottom-up stats: Toward an agent-based “unified” probability and statistics, in Small steps for agents… giant steps for students?: Learning with agent-based models. 2006, San Francisco, American Educational Research Association.

[2] Hecht-Nielsen, R., The Mechanism of Thought. 2006, San Diego, California, USA, IBM Almaden Institute on Cognitive Computing.

[3] Mithen, S.J., The singing Neanderthals: the origins of music, language, mind and body. 2005, London, Weidenfeld & Nicolson. 0297643177 (hard).

[4] Rao, C.R., The Utilization of Multiple Measurements in Problems of Biological Classification. Journal of the Royal Statistical Society, 1948. 10(2): 45pp.

[5] van Vark, G.N., Some applications of multivariate statistics to physical anthropology. Statistica Neerlandica, 2005. 59(3): 10pp.

[6] Thomson, A. and Randall-MacIver, R., Ancient Races of the Thebaid. 1905, Oxford, Oxford University Press.
