Recognising the future
Nine years ago, I wrote in Scientific Computing World about using weightless neural networks from ITS to speed the sorting of decorated potsherds from an archaeological dig. It seemed pretty impressive at the time; nowadays, such is the march of progress, it seems laughable. Things move fast in information technology these days, even if implementation can’t always keep up with research and development. Within that statement of the obvious, computerised pattern recognition is an explosive growth area, with implications of which it is impossible to keep track. Within my adulthood, useful applications have graduated from science fiction fantasy to the stuff of day-to-day politics; and if events continue at the current rate, extrapolation suggests that I shall see Isaac Asimov’s I, Robot walking (or rolling) the earth before I die.
Sounds fantastic? But all the key building blocks are already there, driven by political and commercial considerations. All of them are contentious: they collectively move George Orwell’s Big Brother from the pages of futurological fantasy into the weave of everyday life. On the other hand, they also represent new tools for science whose eventual use can hardly yet be guessed at, even by science fiction. While I deplore the first, I can’t put genies back into bottles and, that being so, their cooption to constructive ends becomes a matter of acute interest.
Honda, for example, has for a decade been developing humanoid robots leading up to the present (aptly-named) Asimo. Asimo is the size of a human 10-year-old, and its body mass index (mass divided by square of height) is only slightly above that of a healthy human. There are, of course, other robot development programmes (humanoid and otherwise), but they are for the most part military or similar in emphasis; for scientific work, Asimo’s interest is in its civilian approach and also in the direction of its perceptual development – it will, among other things, respond intelligently to human gesture, posture and movement. I don’t seriously suggest that Asimo is yet the manifestation of Asimov’s dream, nor does it really matter whether this is the way forward: what matters is that it demonstrates the feasibility of onboard perceptual intelligence based on sophisticated algorithmic pattern recognition. In limited ways, Asimo responds to the external world as a human of two to five years might; in some other ways, it demonstrates how such responses might be managed in ways not analogous to the human.
It’s no longer inconceivable that useful subsets (or even, one day, all) of adult response and analysis might be built into Asimo’s small frame; but for the most part they don’t have to be. With the spread of dispersed computing, the internet, wireless networked communications, and so on, a robot need only perform onboard perception and cognition where speed of response requires it – roughly the equivalent of instinctive functions in a biological organism. Everything else can be farmed out to external storage and processing power – after all, the delinking of knowledge and skill bases from perishable individual memory is one of the key factors in our own explosive success as a species, and the principle is already alive and well in the internet. Although I have taken Asimov’s self-contained robots as a metaphor, it is likely that mobile shared avatars, permanently or temporarily extending the perceptual reach of larger ‘minds’ (I use the word loosely) stored elsewhere, will be the reality. Asimo will need onboard mechanisms to maintain balance, but speech and voice recognition (for example) could perfectly well reside in the building to which it is assigned, at the other end of a local short-range encrypted radio link along the lines of Bluetooth, while infrequently used memory storage (for recognition of visitors to the house, for instance) could just as well be accessed from anywhere on the planet.
This differential importance of delay means that our ‘robot’ will not always be well defined in any spatial sense. Some aspects of its existence may be global (or even extend to lunar and satellite orbits), while others are contained within an Asimo-style humanoid shell. Military and hazardous environment applications are likely to focus on localised (but not necessarily humanoid) concentrations of capacity with a minimum of outside communication, but civilian interest in general and scientific concerns in particular will tend towards open data pools and distributed analysis with multiple access points. This lack of a tie between perception and location may in the long run be the crucial difference between human and robot, and the implications are immense. The system will increasingly exist within the tool rather than vice versa, as has always been the case in the past.
Pattern recognition technologies are of central importance to this. Just as a biological entity acts not on the raw data received from eyes, but on the analyses and hypotheses derived from it, the data from telescopes, microscopes, endoscopes, and every other sort of artificial eye, will be mediated through pattern recognition algorithms designed to make it machine usable.
The essence of pattern recognition is a simple one. Take a collection of examples from a known object; identify the features that appear to be constant in different instances of that object; use those features to identify, at a pragmatically acceptable level of statistical error, other instances of the same object. In some cases, the detection of the pattern itself is the object of the exercise (for example, when monitoring modulations of the hydrogen line in the search for extraterrestrial intelligence) but, even then, if achieved, it will inevitably be applied to subsequent identification of other cases. In this case I’m concerned with those technological applications of pattern recognition that seek to mimic the human senses, with a particular emphasis on visual identification.
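The three steps just described (collect known examples, find what stays constant across them, use that to label new cases) can be sketched in a few lines of Python. This is only a minimal illustration using a nearest-centroid rule, one of the simplest possible classifiers; the feature names, labels and measurements are all invented for the example and do not come from any system mentioned in this article.

```python
from math import dist
from statistics import fmean


def train_centroids(examples):
    """Map each label to the average feature vector of its known examples."""
    centroids = {}
    for label, vectors in examples.items():
        # Average each feature (column) across the examples of this object.
        centroids[label] = tuple(fmean(column) for column in zip(*vectors))
    return centroids


def classify(centroids, vector):
    """Return the label whose centroid lies closest to the new vector."""
    return min(centroids, key=lambda label: dist(centroids[label], vector))


# Hypothetical two-feature measurements (say, stroke width and curvature)
# for two invented potsherd decoration styles.
examples = {
    "incised": [(1.0, 0.2), (1.2, 0.1), (0.9, 0.3)],
    "stamped": [(3.0, 1.1), (3.2, 0.9), (2.8, 1.0)],
}

centroids = train_centroids(examples)
label = classify(centroids, (1.1, 0.25))  # an unseen sherd
```

Real systems differ mainly in how the features are chosen and how the ‘acceptable level of statistical error’ is set; the averaging step here stands in for whatever learning method is actually used.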
There are other issues to consider, when defining the topic. For instance, how general must the recognition be before it can be said to approach human standard? Three decades ago, computerised speech recognition in the hands of government agencies with big budgets was limited to scanning specific wire taps for occurrence of a small set of key words spoken by known voices. Ten years ago it was reasonably reliable as a way to control menu-driven software. It can now be used for efficient direct-to-software dictation, is built into many mobile phones in the form of voice dialling, and plentiful cheap processing power allows governments to lock onto conversations not specifically tapped when words or names are spoken by strangers. This is impressive progress, but doesn’t begin to compare with human ability to overhear a conversation fragment amongst many others on a crowded train and thence deduce connections from context in the absence of any preprogrammed triggers. Nothing suggests that bridging the gap will be impossible, but it illustrates the high standard of comparison set by natural selection: at what stage would it be reasonable to start claiming that machines are beginning to mimic human speech recognition? To some extent, that depends on how you make the comparison. One researcher I consulted compared progress of the technology with stages through which a human individual develops – with the current state of play mapping roughly to a child aged about 12 months, though without the multisensory interlinking involved in, for example, tying the arbitrary signifier ‘Mummy’ to a visual image, scent, taste, tactility and voiceprint of the specifically signified individual.
In the human child, there is a long vocabulary acquisition (or training) phase, followed by experimental application and progressively refined cognition development. A child who sits when told, and looks for a ball when the word is spoken, is operating on a command level analogous to mobile phone voice dialling, rather than understanding language as a synthesised system of codes carrying complex meaning. Computerised methods have successfully negotiated the first two steps, and are now engaged on the cognition bit. Hidden Markov models (see panel: Hide and Seek), widely used in commercial systems, seem to offer the best bet for scalable transfer of spoken language into machine code over expanding vocabularies; neural networks enable greater complexity of recognition; neither yet appears likely to offer a human comparable solution alone. The future probably lies with some sort of hybrid system within which neural network learning feeds hidden Markov language modelling; the bases for conceptual leaps from conversational fragment to original insight will have to wait on as yet undeveloped methods, probably drawing on multisensory triangulation or triggering.
Visual recognition has analogous developmental stages, but starts from a simpler base level. Unlike auditory communication, the visual channel has an existing code used for mass data storage and transfer: the printed word. OCR (optical character recognition) offered a relatively easy first-plateau target for algorithm development, promising both wide applicability in response to an existing demand and early financial return on investment. From that basis it was (in retrospect, if not at the coal face) a relatively easy climb to applications such as ANPR (automatic number plate recognition), which locates, reads and stores vehicle identification signifiers from video images. At the time of writing, there are police systems in use which can extract from existing (not very high quality) closed circuit video cameras identifiers for vehicles travelling at up to 160 kilometres per hour, at a rate of roughly one identifier per second. The databases in which these identifiers are stored (with, of course, the time and location of capture) can then be analysed in various ways – British police, for example, have mentioned the value of knowing which vehicles are travelling in association with others of known interest.
Just as the physical location of a pattern-recognising robot may be notional rather than literal, the interpretation of such human-based concepts as ‘visual perception’ needs to be elastic. Fingerprint and grip recognition are cases in point: we look at visual renditions of fingerprints and grip patterns, so they are best classified as visual pattern recognition for intuitive purposes, though the machine perception of them may in some cases be closer to olfactory or tactile metaphors. Both of these, like other biometric identifiers such as iris recognition, are already being applied – even if opinion is divided on their readiness for mission-critical use. Fingerprint recognition is so developed that it has moved out of government security circles into low level applications such as taking the register in schools, childproofing a washing machine, and controlling under-age entry to pubs or night clubs in the small, rural English town of Yeovil. Recognition of pressure patterns has been built into the butt grips and trigger mechanisms of handguns to render them inoperable by strangers (as have other pattern-recognising biometric mechanisms – see, for example, the New Jersey Institute of Technology’s ‘personalised weapons technology’ project reports). Iris, retina and DNA recognition are similarly well established.
All of those applications so far, however, are dimensionally constrained in various ways. The position of a hand on a pistol butt, the movement of a car on a highway, the position of a finger or iris for scanning, can be predicted to a close degree. They are also broadly two dimensional, and their form largely static. The next move up the visual pattern recognition ladder, to human faces, is a whole new order of problem – not only are faces far more complex, but they deform continually in changes of expression while tilting on two axes and swivelling on the third. Achievement of this step is also, from a scientific computing point of view, an accession to whole new classes of complex machine perception.
Surveillance, with its socio-political implications, is the most publicly discussed application of facial recognition, but not the only one. Identification may be actively sought by a compliant subject as authorisation for a desired purpose, such as access to a place of work or entertainment. This is much more straightforward, since many variables can be eliminated. The simplest and most reliable form of facial recognition relies on this, by replacing the face itself with a superimposed matrix such as a wireframe or an array of laser points.
One such system is marketed commercially by A4Vision, a California-based company servicing the security industry. The user presents a full face view to a camera and projector, a structured light pattern is projected onto the face and the result is recorded by the camera. What is being recognised here is not directly features of the face itself, but the set of distortions that its geography causes in the structure of the light pattern – which is, relatively speaking, a simpler process. The distortions are converted by software triangulation into sets of (x,y,z) coordinates representing three-dimensional positions of points regularly distributed on a two-dimensional image of the face. Since any head movement can be represented as a single transform of the whole data set, this system allows tolerance in subsequent recognition placement and/or orientation of a face once it has been ‘captured’.
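The point about head movement being ‘a single transform of the whole data set’ can be illustrated with a short Python sketch. It is not A4Vision’s algorithm, whose details are not given here; it simply applies one rotation and one translation to a handful of hypothetical (x, y, z) landmarks and confirms that the distances between them, the sort of quantity a matcher might compare, are unchanged.

```python
from itertools import combinations
from math import cos, dist, radians, sin


def rigid_transform(points, yaw_degrees, shift):
    """Rotate every (x, y, z) point about the vertical (y) axis, then translate."""
    a = radians(yaw_degrees)
    moved = []
    for x, y, z in points:
        xr = cos(a) * x + sin(a) * z
        zr = -sin(a) * x + cos(a) * z
        moved.append((xr + shift[0], y + shift[1], zr + shift[2]))
    return moved


def pairwise_distances(points):
    """Distances between every pair of points, in a fixed order."""
    return [dist(p, q) for p, q in combinations(points, 2)]


# Hypothetical landmark coordinates in millimetres (nose tip, two eye
# corners, chin); the values are invented for illustration only.
face = [(0.0, 0.0, 30.0), (-32.0, 35.0, 5.0), (32.0, 35.0, 5.0), (0.0, -40.0, 10.0)]

# The same face after the head turns 20 degrees and shifts position.
turned = rigid_transform(face, yaw_degrees=20.0, shift=(5.0, -3.0, 12.0))

before = pairwise_distances(face)
after = pairwise_distances(turned)
max_change = max(abs(b - a) for b, a in zip(before, after))
```

Here `max_change` is zero apart from floating-point rounding, which is why a captured point set tolerates changes in placement and orientation. Changes of expression are a different matter, since they deform the surface rather than simply moving it.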
I had a surreal conversation, while researching this article, with a gate guard at a facility that utilises a grid deformation system. He described how one researcher within the facility has made a plastic head, onto which he has grafted a careful latex reconstruction of his own face. The researcher carries this head with him and, to the amusement of the security staff, uses it instead of his own when entering his laboratory. After months of continual tinkering, the deception now apparently works half of the time. I feel that this story illustrates something important, but I’m really not sure what!
Recognition of subjects who have not consented to such capture (for example, scanning video imagery of the crowded, jostling entry point to an underground rail station, faces partly and variably obscuring each other, for known or suspected terrorists) is a different kettle of fish. There is some experimental investigation going on into covert capture, using infrared or ultraviolet projections of the grid onto individuals passing a video camera in a constrained space such as a tunnel or an airport check-in gate, but even if successful it would not cover all cases. To become a generic method rather than a context-specific tool, recognition must work in a way analogous to human matching of one analogue image (from any angle) with another at a different time and under different conditions.
It’s worth thinking about how we, humans, recognise each other; it’s quite a complex mix of processes. When we meet a new person, we undertake the most detailed audit of their appearance – starting with the most obvious features and working down the scale into subtlety. If we meet that person regularly, we gradually reduce the amount of data used and reference an ever smaller set of key indicators. The set of a pair of eyes, a way of walking, the aspect ratio of cheekbones against distance from chin to bridge of nose, are all in the catalogue of ways I recognise individual friends and colleagues; this is analogous to the ‘points of comparison’ method used in identification of fingerprints. When I scan the crowds in a street or department store for my partner, it’s a particular combination of hair colour and arrangement with memory of what she is wearing that I am subconsciously seeking to pattern-match.
Over the years, I’ve become aware of patterns in my recognition of students who, arriving and departing en masse on an annual basis, present a particular set of problems not dissimilar from that facing biometric surveillance systems. A subset of individuals must be rapidly assigned reliable identification tags and consistently recognised in a much larger crowd. My software performs well enough on the whole, and is rarely found wanting within restricted contexts like seminar rooms, but becomes less reliable in outside settings and displays instructive artefacts in particular, identifiable cases. A student who changes the cut, arrangement or colour of his or her hair produces a temporary scramble of the system. Students with similar appearances who are frequently absent become confused with one another. And stereotypes begin to appear: every year throws up a young woman with an expressionless face, long straight blonde hair, and a weak chin, who finds herself called ‘Annie’ whatever her real name – because that was the name of her archetype, some years ago. I greet a face on the street, momentarily believing that I know it, only to realise too late that this must be a stranger since the person I think I have recognised is currently doing his thesis in one of Her Majesty’s most secure prisons.
The ‘expression invariant’ model
Those same problems (inconstancy in a selected identifier, insufficient differentiation, false positives) are among the ones most encountered by artificial systems. The context is also similar: unlike symbolic recognition systems such as ANPR, whose operational role is to capture new entities (vehicle registration plates) all the time and which can easily cope with an unexpected form, facial recognition systems usually seek only to recognise entities that they have previously ‘learned’. That need not forever be the case. There are human beings who have a talent for storing huge numbers of faces ‘on the fly’, for instance at the door of a casino, and there is already work to suggest that machines will one day do the same thing for a crowd in an open space; but not yet.
In many (though not all) ways, the distinction between visual and auditory recognition is a human one not applicable to machine perceptions for which both are simply pattern analysis problems. It’s not surprising, for instance, to discover that facial recognition, posture interpretation and gesture analysis (the latter two already part of Asimo’s brief) are based not on single techniques, but on a repertoire of which hidden Markov models are again an important component.
One method that conceptually draws on the grid deformation approach in looking towards a more generic recognition is described by Michael Bronstein at the Israel Technion. Generalised multidimensional scaling (GMDS) of spherical target definition spaces is used to produce wireframe representations that isolate underlying constants to produce an ‘expression invariant’ model – varying expression being one of the most significant influences on facial topography.
An obvious extrapolation of present surveillance technologies would be the combination of more than one recognition type as observational triangulation for enriched results or selective handling for focused use of resources. Current trends in the development of ANPR systems, for instance, include the capture of driver and front seat passenger facial images as a video sequence. The system marketed by Video Fit of Cheltenham, to take one example, tracks and records the number plate, then goes on to capture 20 subsequent frames of the front seat occupants. At the moment, such facial images are for human reference once the vehicle has been identified as being of interest from its number plate; but their potential suitability for automatic recognition as technology develops is obvious. A car driver, and to a lesser extent front seat passenger, is highly constrained – facing forward, unable to move very much from a static position. Twenty frames recording 20 different time slices of facial movement offer a longitudinal dimension to software recognition methods. The combination of methods would also allow selective recognition – for instance, a known vehicle registration could trigger a check on the driver and passenger.
From a science viewpoint, such combination could be used not only for confirmatory triangulation or multisensory enrichment but, more simply, to seek indicators of a crude type that automatically trigger more sophisticated analysis. For example, a particular set of distinctive animal markings could awaken a system designed to analyse movement and posture patterns in that particular animal. I’ve done some Blue Peter-style home experiments, with some success, using neural network software that takes size and direction of travel to identify a particular ant not following the general pattern of its column, then analyses its deviant pattern in terms of known movement components. Nor does the trigger have to be visual or auditory; it could, for example, be a particular chemical signature (such as a pheromone, an industrial pollutant or medical isotope tracker) detected by an ‘artificial nose’. Olfactory (and other chemical) analysis and recognition is less developed than visual, auditory and tactile, but a lot of work is being done – see IBM’s Zurich Research Laboratory website, for instance.
I’ve not begun to scratch the surface here; my rough notes, for what started out as a simple article on recognition and gradually metamorphosed, would fill several copies of Scientific Computing World. What is clear, though, is that we are heading for a world that not only contains machines analogous to Asimov’s Robot (and others with similar sensory capacity but non-humanoid structures better suited to their roles), but in its turn exists within one. Having been born in a simpler age when the word ‘cyberculture’ hadn’t been coined, I’m tempted to be nervous. But clocks can’t be turned back, so I’ll just settle for being excited instead.
Hide and Seek
A Markov chain is a stochastic sequence (taking its name from Russian mathematician Andrei Andreyevich Markov, 1856-1922) in which a probabilistic change (or lack of change) from a current system state decides the subsequent state, uninfluenced by any previous states or events within that system.
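As a minimal sketch of that definition, the Python fragment below simulates a two-state chain in which the next state is drawn using only the current one. The ‘weather’ states and their transition probabilities are invented purely for illustration.

```python
import random

# Hypothetical transition table: each row gives the probabilities of the
# next state given only the current state (each row sums to 1).
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def next_state(current, rng):
    """Draw the next state; nothing before `current` is consulted."""
    options = TRANSITIONS[current]
    return rng.choices(list(options), weights=list(options.values()))[0]


def simulate(start, steps, seed=0):
    """Return a path of `steps` transitions beginning at `start`."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    path = [start]
    for _ in range(steps):
        path.append(next_state(path[-1], rng))
    return path


path = simulate("sunny", 10)
```

Because `next_state` receives only the current state, the ‘uninfluenced by any previous states’ property is built into the structure of the code.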
A hidden Markov model (HMM) describes such a system in which the state and perhaps other parameters are not known, so must be deduced from those parts of the system which are observable. Generally speaking, the observable parts will comprise variables which are assumed to be in some way affected by the system states.
Classically, analysis of HMMs is a tripartite process in which the probability of output sequences, the probability of hidden state combinations which could have produced a given sequence given known or assumed model parameters, and the probability of state transition and output combinations, are computed by well established algorithms. Current practice tends towards iterative computation of marginals through clique trees.
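Two of those three computations can be sketched compactly in Python: the forward algorithm, which gives the probability of an observed sequence, and the Viterbi algorithm, which gives the single most probable hidden state sequence. The third, estimating the transition and output probabilities themselves (commonly done with the Baum-Welch algorithm), is omitted for brevity. The states, observations and probabilities are a hypothetical toy model, not drawn from any speech or vision system.

```python
# Hypothetical toy model: hidden weather states, observed activities.
STATES = ("rainy", "sunny")
START = {"rainy": 0.6, "sunny": 0.4}
TRANS = {
    "rainy": {"rainy": 0.7, "sunny": 0.3},
    "sunny": {"rainy": 0.4, "sunny": 0.6},
}
EMIT = {
    "rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}


def forward(observations):
    """Probability of the observation sequence, summed over all hidden paths."""
    alpha = {s: START[s] * EMIT[s][observations[0]] for s in STATES}
    for obs in observations[1:]:
        alpha = {
            s: sum(alpha[p] * TRANS[p][s] for p in STATES) * EMIT[s][obs]
            for s in STATES
        }
    return sum(alpha.values())


def viterbi(observations):
    """Return (probability, path) for the most probable hidden state sequence."""
    best = {s: (START[s] * EMIT[s][observations[0]], [s]) for s in STATES}
    for obs in observations[1:]:
        step = {}
        for s in STATES:
            # Pick the best predecessor for state s at this time step.
            prob, path = max(
                ((best[p][0] * TRANS[p][s], best[p][1]) for p in STATES),
                key=lambda item: item[0],
            )
            step[s] = (prob * EMIT[s][obs], path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])


observed = ("walk", "shop", "clean")
likelihood = forward(observed)
probability, hidden_path = viterbi(observed)
```

In a recogniser, the hidden states would be units such as phonemes or gesture phases and the observations would be acoustic or visual features; the arithmetic has the same shape.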
|A4 Vision||Biometric facial recognition|| |
|FRVT||Face Recognition Vendor Test||http://frvt.org|
|Honda Motor Company||Asimo humanoid robot||http://world.honda.com/ASIMO|
|IBM Zurich||Artificial nose and other nanoscale sensory work||http://www.zurich.ibm.com/st/nanoscience/artificial.html|
|Michael Bronstein||Generalised Multidimensional Scaling||http://www.cs.technion.ac.il/~mbron/research_facerec.html|
|PITO (Police Information Technology Organisation)||ANPR systems in law enforcement practice||http://www.pito.org.uk; Tel. +44 (0)20 8358 5555|
|Bluetooth SIG, Inc||Bluetooth communications||http://bluetooth.com|
|Videofit Ltd||ANPR with video face capture|| |
1. Asimov, I., I, Robot. 1st ed. 1950, New York: Gnome Press. 253 p.
2. Liska, M.R., Personalized Weapons Progress Report. New Jersey Institute of Technology. http://www.njit.edu/v2/pwt/