The joy of text
It’s a mundane truism, not normally worth mentioning, that words and phrases as signification units in natural language have only the fuzziest of relations to that which they signify. It is, however, a live issue for the many researchers trying to computerise data-analytic activity using text as raw material. It’s also a truism of which I have been reminded afresh as I discussed this issue’s topic with practitioners of textual analysis – no two of whom used the term in exactly the same way.
Strictly speaking, textual analysis describes a social sciences methodology for examining and categorising communication content. In practice, though, it is widely used to cover a range of activities in which unstructured or partially structured textual material is submitted to rigorous analytic treatment. What all these activities have in common is a desire to wrestle the petabytes of potentially valuable information locked up in an ever-inflating text reservoir (blogs, books, chat rooms, clinical notes, departmental minutes, emails, field journals, lab notebooks, patents, reports, specification sheets, web sites and a million other sources) into a form that is susceptible to useful, objective data-analytic treatment.
Temis, of which more below, has on its website a headline which sums it up neatly: ‘Big data issue #1: a lot of content and no insights’. Text mining, the consequent knowledge bases, and analysis of the results have become a major component of biomedical and pharmaceutical research.
For our purposes here, I have taken it to mean analysis whose purpose is to extract scientific value from texts, to examine those texts scientifically, or some combination of the two.
A case history which meets both of those criteria is the application of SAS Text Analytics to patient records at Lillebælt hospital in Denmark[1,2] with a payoff in dramatically improved error trapping. Quite apart from the value inherent in better validated information, records can be compared here (literally or statistically) on the basis of their content, and statistical data on medical issues can be derived from them to inform practice. As The Guardian’s Jane Dudman comments: ‘All healthcare policy decisions are based on the statistics that each clinic contributes by registering data. If data is wrong, the basis for decision-making is also faulty.’ She might also have added that, if the data is not accessible for analysis, it is missing from those statistics which again, therefore, become flawed. This issue of accessibility, for data buried in unstructured text, is a crucial one – and one which text analytic methods seek to address.
Another Danish example of the first type (extracting scientific data from texts) is the use of SPSS by the not-for-profit information arm of cooperative retail conglomerate, FDB. By using text analytic approaches to mine supermarket data in combination with interview and survey records, they have generated dietary healthcare outcome indicators and provided a public interactive exploratory interface for immense data reserves that would normally be inaccessible through sheer contextual volume.
For tidiness, let’s stay in Denmark for an instance of the second application type: scientific comparison of texts which are not necessarily scientific in themselves. StatSoft’s Statistica text mining tools are being used by researchers at three Danish universities to analyse similarities and differences among north European mythologies and storytelling traditions. While this work has attracted interest from sociologists, human geographers, ethnologists and others, the primary motivation is scientific classification of what one of the group describes as ‘literary DNA’ – the fundamentals of story as a form. From ancient oral folk tales to modern magical realist novels such as Peter Høeg’s Forestilling om det Tyvende århundrede and the influence of Muslim immigration, threads of connection and degrees of separation are statistically defined in objective ways. This is not dissimilar, in essence, to the more familiar quest to decide authorial attribution of Shakespeare’s plays – a favourite playground of textual statisticians for as long as I can remember.
All of those examples, as it happens, use specialised tools within well-known general data analytic products, but it needn’t be so. An increasing number of products are designed from the outset either to specialise in, or to be weighted towards, text analytics. There are also plenty of people using scientific computing methods to analyse smaller, focused sets of textual material through wholly generic means and a little ingenuity.
Two representatives of the rapidly growing market sector that focuses specifically on this area are Linguamatics and Temis, both of which apply cutting-edge natural language processing methods to the task of acquiring, organising and analysing data locked up in textbases of various kinds. While they have different customer profiles, a significant area of overlap is the life sciences where text mining is, as noted earlier, a major resource.
Although Europe doesn’t yet have a consistent (or, sometimes, any) electronic medical records (EMR) system, some of its jurisdictions have started down this route with some success. In the US, there is widespread adoption, if not universal satisfaction. Such systems are certain to arrive everywhere, sooner or later, and they bring with them wells of data of immense value to both healthcare and research communities. Toldo and Scheer assess the use of Luxid (core product of Temis) to access information within the free text sections of these records, not just to track adverse events, but for other strands such as ‘clinical trial optimisation and pharmacovigilance purposes’. Temis’ list of clients includes a dozen household names in the medical and life sciences area, from agriculture to industry, plus a string of others from the American Association for the Advancement of Science to aerospace technology leader, Thales.
Linguamatics has an equally impressive spread of application. Again the importance of textual data in medicine and the life sciences is reflected here, with the company’s key I2E product well established in genetics and molecular biology among other scientific growth fields. Particularly intriguing are a number of projects to extract and curate useful research data from conference Twitter feeds and other microblogging sources. ‘We’ve got more than a hundred possible leads to new ideas from I2E processing of hashtagged backchat at just one recent international jolly,’ said a researcher at a European pharma company, cheerfully. ‘If only one of those turns into an actual research proposal, it was still a high-profit exercise.’
Both providers offer online options as well as client-side processing and links to other software with, for example, Accelrys Pipeline Pilot connectivity for I2E and several powerful specialist expansion modules for Luxid.
‘Adopters of these commercial tools can realise savings because of the scale of their operations, despite significant investment to purchase the tools,’ as Hirschman et al point out. But what of small scale research which is ‘typically funded by grants with limited resources to invest in infrastructure?’ There’s a lot of hand-curated work going on in this area, right down to the level of individuals using desktop tools in innovative ways to shorten the loop between source and result. A popular approach is to use a statistics package and a bibliographic database manager in concert, sometimes with a home-coded utility or two to automate transactions between them.
Whilst schlepping around from one lab doorstep to another, gathering background for this article, I also sought reactions to developments in the latest releases of EndNote and OriginPro, both of which I happened to have on review. This serendipitously led me to discover a young materials science researcher who has built up a series of EndNote databases to which automated search feeds contribute raw material. A search and filter utility written in BASIC (remember BASIC?) extracts specific material using equivalent terms lists, summarising it in CSV files for exploration using OriginPro.
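The filtering stage of such a utility is simple enough to sketch in a few lines – here in Python rather than BASIC, with hypothetical term lists and record fields standing in for the researcher’s actual setup:

```python
import csv
import io

# Hypothetical equivalent-terms list: each canonical term maps to the
# variant spellings and abbreviations under which it may appear.
EQUIVALENTS = {
    "tensile strength": ["tensile strength", "UTS", "ultimate tensile"],
    "annealing": ["annealing", "annealed", "heat treatment"],
}

def filter_records(records, equivalents):
    """Keep records whose abstract mentions any variant of a key term,
    tagging each match with the canonical term."""
    rows = []
    for rec in records:
        text = rec["abstract"].lower()
        for term, variants in equivalents.items():
            if any(v.lower() in text for v in variants):
                rows.append({"title": rec["title"], "term": term})
    return rows

def to_csv(rows):
    """Summarise the matches as CSV text, ready for a plotting package."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["title", "term"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

records = [
    {"title": "Alloy fatigue study",
     "abstract": "UTS measured after annealing."},
    {"title": "Polymer blends",
     "abstract": "Rheology of PP/PE mixtures."},
]
print(to_csv(filter_records(records, EQUIVALENTS)))
```

The first record matches both canonical terms, the second neither – exactly the kind of coarse but useful sieve that turns a bibliographic database into CSV fodder for a statistics package.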
I also met two post-grad researchers in a university technology spinoff programme, creatively using the flexible combination of text acquisition and organisation tools (Archiva, Ibidem, Ibidem Plus, Orbis) available through NotaBene. Once again, equivalent term lists (provided within the NotaBene cluster) were used to aggregate material into summaries, which were exported to spreadsheets for further processing in data analytic applications. By running these processes on the fly, or in otherwise idle time, they extracted, processed and fed into their workflow surprising quantities of statistical data without onerous increases in overhead workload.
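The aggregation step they describe – rolling matched terms up into a summary for spreadsheet export – can likewise be sketched in a few lines of Python; the source names and terms here are invented for illustration:

```python
from collections import Counter

def summarise(matches):
    """Roll (source, term) match pairs up into a per-term frequency
    table, tab-separated for pasting straight into a spreadsheet."""
    counts = Counter(term for _, term in matches)
    lines = ["term\tcount"]
    for term, n in counts.most_common():
        lines.append(f"{term}\t{n}")
    return "\n".join(lines)

matches = [
    ("notes-01", "ritual"),
    ("notes-02", "ritual"),
    ("notes-02", "migration"),
]
print(summarise(matches))
```

Run in otherwise idle time over a growing textbase, even a summary this crude accumulates into usable statistical raw material.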
This is all very well, but for text to be analysed it must first be readable – and that usually means readable by a computer. These days most text is electronically originated, but legacy material often is not and a lot of ad hoc notes may not be. Optical Character Recognition (OCR) is the workhorse of text transcription and, while we all grumble about its shortcomings, it does a good job of rendering graphic images of printed fonts into digitised text for analysis.
Even at the lowliest manual level, OCR is a useful tool. A colleague and I recently had to add a 650-page 18th century text to a digitised textbase for analysis. The rare and valuable paper original which we located was in a library and could not be removed, and filing a request for the digitisation to be carried out would take weeks. With the consent of the library we used a smartphone, a 10-year-old copy of Abbyy FineReader 5 (now in release 11, and correspondingly more developed, as part of a software range for different text tasks) and a netbook. Even allowing for manual error correction, we had our validated data within three hours and the library added a copy to its own digital records. A similarly-sized text already available as graphic-only PDF was transferred more quickly still.
Nevertheless, OCR does have its limitations; one of them is in dealing with handwritten material. Where the material to be transcribed is written by known authors, training akin to that used for voice recognition can bring error rates down to vanishingly low levels; one medical research centre with which I work absorbs large volumes of handwritten notes and (often parallel) vocal commentary into its textbase. Handwriting by authors not known to the transcription system, however, is a different kettle of fish. Submitting to the same medical system an A4 page containing 238 words in my own handwriting (which it had not seen before) produced not one correctly transcribed letter.
Acquaintances in intelligence-related occupations tell me of systems which are used to transcribe unknown handwriting with ‘useful’ levels of recognition. These, however, are based on training for key word recognition as a way of identifying matter for manual reading – an example given by a customs officer involved training a system using numerous handwritten examples of words like ‘bhang’, ‘coke’, ‘MDMA’, ‘methamphetamine’, ‘weed’, and so on. Where such words accumulate, increasing levels of priority for closer examination are applied. For such tasks, a class of approaches collectively known as word spotting is more productive than OCR.
The neat thing about word spotting is that it can be used to search directly for text entities within graphic images, without any requirement to first convert those images into text. A researcher need only offer an image of the required word and the system will seek statistically similar visual segments within a set of JPEG files.
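A minimal illustration of the underlying idea – not any vendor’s actual method – is template matching by normalised cross-correlation: slide an image of the query word over the page image and flag windows that correlate strongly. The synthetic arrays below stand in for real scanned pages:

```python
import numpy as np

def word_spot(page, template, threshold=0.95):
    """Slide a word-image template over a page image; return the
    top-left coordinates of windows whose normalised cross-correlation
    with the template exceeds the threshold. No OCR step involved."""
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    hits = []
    for y in range(page.shape[0] - th + 1):
        for x in range(page.shape[1] - tw + 1):
            window = page[y:y + th, x:x + tw]
            w = window - window.mean()
            denom = np.sqrt((w ** 2).sum()) * t_norm
            if denom == 0:
                continue  # blank window: no correlation to compute
            if (w * t).sum() / denom >= threshold:
                hits.append((y, x))
    return hits

# Toy "page": a blank image with one word-shaped patch pasted in.
rng = np.random.default_rng(0)
word = rng.random((8, 20))
page = np.zeros((40, 60))
page[12:20, 25:45] = word

print(word_spot(page, word))  # the patch shows up at (12, 25)
```

In production systems the matching is done on extracted shape features rather than raw pixels, which buys tolerance to ink weight, slant and scale – but the principle of ranking image regions by similarity to a query image, rather than transcribing them, is the same.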
References and Sources
For a full list of references and sources, visit www.scientific-computing.com/features/referencesfeb13.php