DATA ANALYSIS: INVENTION

Analysis is the mother of invention

Composite design process views for caDNAno, including (top left) sequences record in a spreadsheet.

Felix Grant explores the role of statistical analysis in the field of invention

Scientific Computing World: August/September 2009

Invention isn't what it used to be. When I made the decision (at age 11, or thereabouts) to be the freelance scientist that I now am, I had a lot of role models in mind, but prominent among them was the anarchic spirit of absent-minded inventor Professor Branestawm[1]. It was still easy to find real inventors like him, then; I knew one, and was taught by a couple more as an undergraduate. Nowadays, while part of him still lurks in many outwardly staid scientists, they would never acknowledge him. His demise was inevitable, but was finally assured by the arrival of computerised data analysis.

Invention comes in two basic types: ‘let’s try this and see what we get’, and ‘this is what we want, so how are we going to get it?’ Branestawm, surprisingly for a scatterbrained professor archetype, was of the second kind: he started from an idea for an invention, and pursued likely paths to its realisation. Modern equivalents are easy to find, from James Dyson who envisaged a better way of picking up dust and made it happen, to Altair moon lander architect John Connelly. Shifting from physics to pharma, the search for an H1N1 vaccine (proceeding apace as I write this) is in the same category, but there are also many examples of the first type: batteries of agents tried in combination and in various situations until a therapeutic effect is identified. For the most part, of course, invention programmes are mixtures of the two approaches, especially when seeking a nonpatented way to compete with an existing solution. Regardless of approach, invention is (like all post-Branestawm science) highly dependent upon data analysis.

The Dyson example is a good illustrative case. It started from an idea (a centrifuge instead of a bag as a more efficient dust-trapping method) for improvement on an existing technology (the vacuum cleaner), and progressed through several thousand prototype cycles. Those cycles began with broad strategies, which were then narrowed on the basis of progressively refined data analyses.

Given that those cycles of experiment, analysis and improvement took five years between concept and first patent, invention also tends to be a secretive process. Science thrives on shared discourse, but depends for survival on financial return. Inventors talk in elliptical terms, reluctant to be too specific about their methods, their degree of progress, and their results. In a hospital somewhere in the Maghreb, medical, surgical and engineering staff spend a large chunk of their off-duty hours working on an air centrifuge analogous to Dyson’s, using lash-up equipment reminiscent of Heath Robinson’s Branestawm illustrations. Primarily concerned with solving a contamination problem of their own, they also hope that the result will financially underwrite much-needed expansion of facilities. With collaboration too commercially risky, and physical trials expensive in work hours, good planning and maximally efficient use of data are vital.

While the ability to call on sophisticated data analytic methods is probably of greatest proportional significance in small and sparsely resourced contexts, it applies strongly throughout the size and funding scale. Existing giants in the vacuum cleaner industry, having spurned Dyson’s patent initially, have since scrambled to catch up with equivalent designs of their own. Where my small hospital team are squeezing maximum mileage out of one notebook computer and a copy of XLstat[2-3], these corporations have thrown whole computing departments at the problem along with every statistical and mathematical software tool known to industry or academia.

The biggest invention-enabling budgets of all, of course, belong to military and space research, both of which (in so far as they can be separated) have their own rotational concerns.

Most projectile weapons impart spin to their payload using barrel rifling, simply for the resulting gain in stability and accuracy. The latest micro-grenade munitions developments (though based on work by Kurschner et al[4] more than a decade ago) utilise the alternating current generated by the shell’s spin, via a transducer inside, to provide dead reckoning calculation of its current position. Branestawm’s constantly bewildered friend and companion, Colonel Dedshott, would have loved this one, and it appears, at first sight, to be worthy of Branestawm illustrator William Heath Robinson. The obvious way to calculate the position of a projectile would be through basic Newtonian mechanics from elapsed time and muzzle velocity. The approach actually used is, instead, to count the revolutions which the spinning shell makes about its axis. An acquaintance, who worked on early prototypes of what is now the XM25 individual air burst munition before moving into more life-affirming employment, claims intensive field trial data analysis showed spin count delivering better accuracy than ballistic prediction.
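The contrast between the two estimates can be sketched in a few lines of Python. This is a toy illustration only, not the XM25's actual method: it assumes an idealised shell that advances a fixed distance per revolution and ignores drag and spin decay entirely. The function names and all figures (nominal and actual muzzle velocity, travel per revolution) are hypothetical.

```python
def range_from_spin_count(revolutions: float, travel_per_rev_m: float) -> float:
    """Distance estimated by counting turns (assumes fixed travel per turn)."""
    return revolutions * travel_per_rev_m


def range_from_ballistics(muzzle_velocity_ms: float, elapsed_s: float) -> float:
    """Distance from elapsed time and muzzle velocity, ignoring drag."""
    return muzzle_velocity_ms * elapsed_s


# Hypothetical figures: the launcher assumes 210 m/s, this round leaves at 205 m/s.
NOMINAL_VELOCITY_MS = 210.0
ACTUAL_VELOCITY_MS = 205.0
TRAVEL_PER_REV_M = 0.5   # assumed distance travelled per revolution
ELAPSED_S = 1.0

# Under the idealisation, the number of turns follows the round's actual speed.
revolutions = ACTUAL_VELOCITY_MS * ELAPSED_S / TRAVEL_PER_REV_M

spin_estimate = range_from_spin_count(revolutions, TRAVEL_PER_REV_M)
ballistic_estimate = range_from_ballistics(NOMINAL_VELOCITY_MS, ELAPSED_S)
true_range = ACTUAL_VELOCITY_MS * ELAPSED_S
```

In this simplified model the turn count tracks the round's real speed, whereas the time-based figure inherits any error in the assumed muzzle velocity. Whether that is the reason behind the reported field trial results is not something the sketch can establish.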

Putting aside the rationale (and other questions both operational and moral), the quality, quantity and rapidity of data analysis involved in the weapon itself are sobering. Microchips and software in both shell and launcher, using laser range finding data and feedback from the shell’s spin count, are used to provide real-time dead reckoning ‘on the fly’ to centimetre accuracy within the fraction of a second the projectile is in transit. The 40th anniversary of Apollo 11 provides a prompt to consider that we can now place better computing power at the service of a cheap battlefield grenade than was available to the lunar lander and its human cargo.

Spin is often imparted to spacecraft, too – both deliberately for similar reasons to projectile munitions and as a result of events. Data analytic approaches to dealing with this, rather than utilising it, are exemplified by a Boeing patent[5], which outlines methods for determining the vehicle’s attitude and actions dependent upon it from moment to moment using star tracker data. Star tracking is a routine part of astronautic instrumentation these days, and computerised data analysis has come as far from 1969 levels here as in any other field of endeavour.

As plans are made for missions further and further from earth, or into environments more and more hostile to human life, this area of development will be more and more vital. Only robots can go to the stars, or into the gas giants, and as distance increases so will the necessity for sophisticated and dependable machine data analysis to replace human guidance in both predictable and exceptional time critical circumstances. One result of this, as described recently[6] by CalTech’s JPL, is a push to invent fault-tolerant, high-performance computing hardware and software structures locally analogous to the internet and cellular telephony systems, constructed from numerous low power multiple core components forming a miniature computing cloud. Around the world a wide variety of academic, commercial and government invention effort is going into evolving systems that can not only analyse data, but metanalyse their own analytic processes and ‘think the unforeseen’ in support of autonomous decision making.

Framework and operational flowchart for SPRINT parallelisation of data analysis in R.

Meanwhile, back on earth, CalTech is also one of the centres for DNA-focused, objective driven invention streams in the territory of ‘bottom up’ molecular manufacture – not only an intensive consumer of data analysis, but a field whose raw material is information. The use of DNA folding, as mentioned a couple of issues ago[7] in relation to surface coatings, is a growth area with a great deal of potential. The DNA tiles used for those surface coatings are a well established payoff, but a major stumbling block on the road to the wider realisation of such techniques is the scarcity of reliable seed molecules to encapsulate the initial information. Barish and others[8] at CalTech have recently demonstrated the generation of one such programmable seed with up to 32 binding sites, along with three exemplar algorithmic crystal nucleations. By setting initial counters, tiles are concatenated into ribbons, ribbons knitted into layers, and layers assembled into solids, to prespecified dimensions.

Turning demonstrated ability into realised products is one data intensive aspect of such work; while the root controlling factors are easily stated and deceptively few, the combination of issues needing optimisation as theory is manifested in practice becomes extremely complex. Accelrys Pipeline Pilot, as described in the surface coating context, is a well respected solution for handling the data flow involved.

At the other end of the development chain, in the prolonged initial sequence design process, caDNAno is an open source software tool – software tools being inventions too, in their own right, not just data analytic aids to invention. Three dimensional nanotech components are blueprinted by caDNAno for construction from pleated layers on a scaffolding template. From a user’s viewpoint, a three-panel interface controls a four-stage process in which the intended form is first approximated, paths assigned and segmented, and the final form populated. Though there is considerably more to it than that in practice, the download site offers detailed video tutorials. In a related vein is Uniquimer 3D, a Windows program for generating, visualising and testing large, complex nanostructures with deformation checking and internal energy minimisation.

Following the software tools thread in a different direction, SNOW is a web-based facility for analysing the topology of intracellular interaction networks, one of many collaborative systems for detailed study. From gene or protein identifiers, components and articulation points are calculated along with network parameters and their statistical significance. With increasing understanding of the ways in which cell differentiation occurs, invention of new biological entities is set to be an important aspect of life sciences research over the years ahead and thorough mapping of corollary intracellular relations will be a significant part of the ways in which that develops.
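For readers unfamiliar with the terms, the components and articulation points of a network can be computed with a standard depth-first search. The Python sketch below is illustrative only and makes no claim about how SNOW itself is implemented; the gene identifiers and interactions are invented, and the statistical significance testing mentioned above is not attempted.

```python
from collections import defaultdict


def build_graph(edges):
    """Undirected adjacency map from a list of (node, node) interactions."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def connected_components(graph):
    """Groups of nodes that are linked to each other, directly or indirectly."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        components.append(component)
    return components


def articulation_points(graph):
    """Nodes whose removal would split their component (recursive DFS low-link method)."""
    discovered, low, points = {}, {}, set()
    counter = 0

    def dfs(node, parent):
        nonlocal counter
        discovered[node] = low[node] = counter
        counter += 1
        children = 0
        for neighbour in graph[node]:
            if neighbour == parent:
                continue
            if neighbour in discovered:
                # Back edge: this node can reach an earlier one without its parent.
                low[node] = min(low[node], discovered[neighbour])
            else:
                children += 1
                dfs(neighbour, node)
                low[node] = min(low[node], low[neighbour])
                if parent is not None and low[neighbour] >= discovered[node]:
                    points.add(node)
        if parent is None and children > 1:
            points.add(node)

    for node in graph:
        if node not in discovered:
            dfs(node, None)
    return points


# Invented interaction network: a triangle, a tail hanging off it, and a separate pair.
edges = [("geneA", "geneB"), ("geneB", "geneC"), ("geneC", "geneA"),
         ("geneC", "geneD"), ("geneD", "geneE"), ("geneF", "geneG")]
network = build_graph(edges)
components = connected_components(network)
cut_nodes = articulation_points(network)
```

Here geneC and geneD are the articulation points: removing either one cuts geneE off from the triangle. The recursive search is adequate for a small example; a network of realistic size would call for an iterative version or an established graph library.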

Invention of new gene-based health applications is highly dependent on research utilising microarray data analysis from samples that can be huge with data points running up into the billions or beyond. The rise and rise of high-performance computing offers a clear path for rapid expansion in this direction, but workers in the field need a relatively painless way to access its power. The University of Edinburgh’s Parallel Computing Centre (EPCC) and Division of Pathway Medicine have Wellcome Trust funding to develop a prototype framework that provides ways to parallelise existing R code with minimum modification and through novice friendly mechanisms. Dubbed SPRINT (Simple Parallel R INTerface), it shows reduction by approximately 70 per cent in execution times on test data, and on larger data sets (which also benefit from relief of memory constraints) the benefit margin can be expected to grow with size.
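The idea of parallelising existing analysis code with minimal modification can be shown in miniature. The sketch below is written in Python rather than R and does not use or imitate the SPRINT interface; the per-gene step, the data and every function name are hypothetical. The analysis function stays the same, and only the mapping function handed to it decides whether rows are processed serially or by a pool of workers. A small helper also converts a fractional reduction in run time into the equivalent speed-up factor.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def centre_row(row):
    """Toy per-gene step: subtract the row mean from each expression value."""
    row_mean = mean(row)
    return [value - row_mean for value in row]


def analyse(rows, map_fn=map):
    """Apply the per-gene step to every row; map_fn selects serial or pooled execution."""
    return list(map_fn(centre_row, rows))


def speedup_from_reduction(reduction):
    """Speed-up factor implied by a fractional cut in execution time."""
    return 1.0 / (1.0 - reduction)


# Invented expression values for two genes across three samples.
rows = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]

serial_result = analyse(rows)
with ThreadPoolExecutor(max_workers=2) as pool:
    pooled_result = analyse(rows, pool.map)

factor = speedup_from_reduction(0.70)  # roughly 3.3 times faster
```

A thread pool is used here only to keep the example short and portable. For processor-bound work in Python, a process pool's `map` method can be passed in the same way. On this reading, the 70 per cent reduction quoted above corresponds to code running a little over three times faster.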

Analytic visualisation (in Statistica) of impact on medical contamination levels of a hospital-based invention programme.

In many areas, directed invention at some point boils down to seeking the optimum mixture of ingredients to maximise an identified beneficial effect. Those ingredients may be as disparate as metals in an alloy, capital instruments affecting economic performance, or chemical agents in a pain relief tablet, but all of them rely on multivariate analysis of response to inputs. I choose the metal alloy example deliberately, because A primer on mixture design[9], by Mark Anderson of Stat-Ease, uses it as an illustrative example, which made me think about the nature of a live project on which I am currently consulting.

Anderson’s example invites the reader to consider the invention of gold solder by ancient goldsmiths through adulteration of pure gold with small quantities of copper, to lower the melting point and assist fine filigree work. The interesting point here is that since the melting point of copper is higher than that of gold, there can have been no prior theoretical basis for directed invention. It must presumably have originated in a chance discovery, probably then proceeding by Branestawmesque trial and error towards an optimum mixture. Were the same discovery to be made nowadays it might still be initially serendipitous or it might be the result of deliberate scattershot experimentation, but its progress thereafter would be planned for efficiency and economy.

My own case may not seem related, since it arises from work for a development charity in a conflict zone where agriculture is afflicted by numerous chemical and biological drawbacks. The similarity lies in the completely surprising combination of ingredients that has enabled local farmers to simultaneously eliminate or at least mitigate their problems. It also lies in their unwillingness to ‘waste’ time, energy and material (just as the goldsmiths of antiquity might well have been reluctant to waste valuable gold) on experiments to improve the mixture from something that is already known to work. The invention has been made, presumably by fluke accident or inspired experiment, but refinement is a slow process using (as it happens) Design-Expert 7 to analyse experiments in which proportions are varied before use on small plots of land not otherwise considered productively useful. Since 17 different bizarre components make up the mixture, the final optimum dressing is going to be quite a slow-burning invention process. I feel quite close to Branestawm and the crackpot inventor stereotype.
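Some sense of why a 17-component mixture is slow to refine comes from counting candidate blends. The Python sketch below generates a simplex-lattice design, one standard textbook layout for mixture experiments, in which every proportion is a multiple of 1/degree and the proportions sum to one. It is an assumption-laden illustration: nothing here is taken from the project described above or from Design-Expert, and the function names are invented.

```python
from fractions import Fraction
from itertools import combinations_with_replacement
from math import comb


def simplex_lattice(n_components, degree):
    """All blends whose proportions are multiples of 1/degree and sum to 1."""
    points = []
    for combo in combinations_with_replacement(range(n_components), degree):
        counts = [0] * n_components
        for index in combo:
            counts[index] += 1
        points.append(tuple(Fraction(count, degree) for count in counts))
    return points


def lattice_size(n_components, degree):
    """Number of blends in the lattice, without generating them."""
    return comb(n_components + degree - 1, degree)


three_way = simplex_lattice(3, 2)        # 6 blends: 3 pure, 3 half-and-half
runs_for_17_coarse = lattice_size(17, 2)  # 153 blends
runs_for_17_finer = lattice_size(17, 3)   # 969 blends
```

Even the coarsest lattice, allowing only pure ingredients and equal pairs, needs 153 distinct blends for 17 components, and a modest refinement pushes that to 969. Each blend is a plot of land and a growing season, which is why trimmed and optimal designs matter so much in practice.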

Ultimately, however slowly and inefficiently, invention has always from the first flint tool included a post initiation process which, in essence, was data analytic. Basu et al argue[10] that invention is a phenomenon that can only really catch light and progress in a record keeping culture. The mathematisation of that process was another essential, and relatively recent, milestone. But the real leap forward came with the application of computing methods that not only give us speed and efficiency in the invention process but, in many cases, make it possible at all – and can be applied back onto themselves for an ever accelerating improvement cycle.

References

The references cited in this article are available online at: www.scientific-computing.com/features/referencesaug09.php
