Mining for answers from mountains of text

Out of a mass of documentation, Felix Grant needed to identify some avenues for research. Text mining software made the job easier

What, really, is text mining? Interest is high, with the explosive expansion of electronic information in quantities beyond human comprehension. Search engines claim it. Commercial intelligence activities include it. Many procedural activities are now grist to the automated text-handling mill. And labour-intensive phases of a scientific research study (particularly the literature search) are now susceptible to degrees of automation.

The question recently became acute for me in connection with a hypothesised effect in bioengineering. Evidence for and against was anecdotal; we needed to identify, from a mass of diversely structured documents, a restricted set of specific avenues on which to concentrate investigation. First examination revealed three main categories of document: administrative material dominated by email; numerous individual working bibliographies; and web documents.

Truth to tell, 'text mining' is still in that stage where it means whatever the speaker intends. It has to be pinned down by its ends: what is meant when it is used, and what is understood when it is heard or read. Broadly speaking, it is the computerised extraction of useful answers from a mass of textual information - by machine methods, computer-assisted human ones, or a combination of the two. Subsequent analysis may or may not fall within the definition. Explicit definitions on offer most frequently refer to the high-volume, machine-intensive methods; but there are other, quieter, equally important aspects.

As a general rule, when my clients say 'text mining', they mean a philosopher's stone that will extract order from informational chaos, without any need for effort or change on their part. In many cases, a good file indexer will do the trick. More sophisticated, and more interesting, but requiring some mental discipline to get the best results, is a structured information storage and retrieval product. There are also tools that concentrate on mapping graphically the conceptual topography of a set of information.

Those in the field itself will emphasise the opposite reach of the scale. Here the aim is extraction not just of key words but of quantitative data. The gloriously fuzzy complexity of natural language makes this a minefield of conflicting opinions. What is or isn't possible, and may or may not be possible in the future? How far can machine intelligence take on natural language, and how far must natural language move to meet it? Which methods are appropriate or permissible? The questions raised, never mind the different answers, would fill a small library.

My concern, luckily, was more modest: to explore the application of a few ready-made solutions in an unashamedly practical context. Matching the three source types, the team agreed on three phases of work. The first phase would be machine-assisted sorting, categorisation, indexing, and retrieval, in support of human intuitive pattern-seeking. Next came further focusing of attention through visual mapping of content domain relations. Finally, full-blown machine-driven analysis of data from the material thus selected.

After looking at a wide range of possible tools, yet another threesome emerged: a trio of complementary products. At the apex of this triangle was the giant of the show: Statsoft's Text Miner, a superset member of the Statistica family. At one of the base vertices came askSam, a freeform text database manager from askSam Systems. Finally, in the third corner, RefViz from ISI ResearchSoft. Each of these also represents one or more broad areas of software development activity. Pursuing the 'mining' metaphor, RefViz as a stand-alone application resembles seam mining, while askSam invites an opencast image; in the larger context, however, both are also exploratory prospecting tools, locating the sites at which heavy-duty mining can commence.

RefViz is a specialised implementation of components from the more general discovery informatics provided by OmniViz. Large clients may find OmniViz itself the way to go, but if your requirement is restricted to strategic handling of your reference bases, a large, general program is a sledgehammer to crack a nut.

  • RefViz views onto a single bibliography. Galaxy views show selection (top left and bottom right) of groups containing given terms picked from the title and term lists, selection (top right) of a single record which is also (lower centre) expanded to full text, and entries and groups (centre) found by a Boolean search. The matrix view at bottom left shows a detail (outlined in the thumbnail) from a subset of 200 records, illustrating cross relation of terms.

The first step with RefViz, as with all these packages, is import of data. A variety of content-provider formats is supported plus, of course, ISI ResearchSoft's own bibliographic managers (in the case of EndNote 7 and Reference Manager 10's network version, a single, push-button step). Capacity is capped at 32,000 records. You don't work on the raw, live text. In practice, we found that the most useful approach was to make initial screening imports of separate file copies (partitioned, where necessary) as filtering runs. File copies were then subsetted on the basis of RefViz analyses, the subsets then being aggregated into new, more usefully focused entities.

Once in, relationships within the reference base are presented. RefViz concentrates on two visualisation methods: 'matrix' and 'galaxy'. The matrix, a two-dimensional array, gives quick oversight of internal associations between concepts and groups within the reference base. For a reference base of any worthwhile size you have, at any one time, only a small 'keyhole' view onto this matrix; but its position is shown on a navigation thumbnail of the matrix as a whole. You can move around incrementally, in the usual ways, or grab the window with the mouse and move it around the thumbnail - rather like using a microfiche reader. The matrix itself can be set up (and changed on the fly) to show relations between key words and key words, or key words and groups, row-sorted alphabetically or by similarity, and colour coded (in the current Windows colours) either by relevance or frequency. Depending on your current choice of view, clicking on a matrix element (crosshairs align with the relevant row and column as you do so) will select the group or theme; clicking on a column header selects all records containing the given term.
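
RefViz's internals are not published, but the kind of term-by-term association its matrix view displays can be sketched in a few lines. The `cooccurrence_matrix` function and the sample documents below are illustrative inventions, not RefViz code: they simply count, for each pair of key terms, how many documents mention both.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_matrix(documents, terms):
    """For each pair of key terms, count how many documents
    mention both - the raw material for a matrix-style view."""
    terms = sorted(terms)
    counts = Counter()
    for doc in documents:
        words = set(doc.lower().split())
        present = [t for t in terms if t in words]
        for a, b in combinations(present, 2):
            counts[(a, b)] += 1
    return terms, counts

# Toy reference base, invented for illustration
docs = [
    "gene transfer in alfalfa populations",
    "lateral gene transfer and bacterial evolution",
    "alfalfa crop yield statistics",
]
terms, counts = cooccurrence_matrix(docs, {"gene", "transfer", "alfalfa"})
# 'gene' and 'transfer' co-occur in two of the three documents
```

A real implementation would weight and normalise these counts rather than display them raw, but the underlying cross-relation of terms is the same idea.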

The 'galaxy' is a scattergraphic map in which references and their groups are clustered within a visualised concept space. Selections made in the matrix view are maintained here, allowing easy flipping back and forth between strategic overview and tactical detail - between wood and trees, so to speak - of the content. Below either view is a listing of individual reference entries with group, content, and author information; clicking one of these highlights the spots in the galaxy, while double-clicking opens the entry record itself. This multiple perspective viewing makes for a very productive discovery environment.

We used askSam to great effect in organising material, extracting the first and most accessible stocks of meaning, and identifying the places where more intensive work should be concentrated. The emphasis here is not visual or numerical summary (though search counts are provided): your overviews remain primarily in textual form, resembling multiple indices. This is the first-level endeavour, mapping out the terrain.

  • askSam imports and searches information from email and spreadsheets.

Using askSam for the first time produces simultaneous feelings of familiarity and disorientation. This isn't a criticism, but a symptom of its effectiveness. You are using a cross between several familiar application programs, word processor and database being dominant. Pull more and more existing text documents into your text base, and you feel an increasing resemblance to other environments as well. Here, recognisably, is your email base (an email client is a one-trick database system); here, in a freshly starched shirt, is your spreadsheet data; and they are, if you wish, all in the same database file. There are limitations: PDF files come in with the text intact, but the headings may be scrambled because non-standard methods are used to format them; text in non-Roman fonts may suffer similarly; word-processors other than MS Word or Corel WordPerfect may have to save first in RTF file format; but the vast bulk of material imports like a dream, and there's very little that can't be usefully incorporated with a little thought. Original documents can be attached to the import; this increases file size (and is not relevant to my context here) but allows opening in native form from within the text-base.

Text-bases formed in this way have tools familiar from other database managers. Forms and record structure follow the imported data - intelligently in most cases, with some 'wizard' style help in others. Report creation is immediately familiar to anyone who has carried out the same process in another DBMS program. Other facilities more resemble a word processor though; in particular, you can quickly and easily format text in the same intuitive ways, from a familiar set of tool buttons. Hyperlinks are used for internal structuring and control.

Once the data is in, it can be searched in various ways. Words and/or phrases can be search targets globally or within restricted fields, simply or in Boolean combination, and by proximity. Dates and numeric values have specialised search mechanisms with ranges, comparison types, and Boolean constructs. The search can be either for content detail or (very useful for initial overflights of the territory) for a frequency-of-occurrence summary. A tick box, 'fuzzy', in the search dialogues optionally relaxes or tightens the literality of the search.

While different formats (word processor, spreadsheet, email, and so on) can all be contained within the same text-base file, the novelty of this soon wore off. Unlike most conventional database managers, askSam will apply a single search to multiple database files, so there is no need to aggregate stores unless the purpose to hand requires it. More effective is to keep material in logically discrete files. To search for [['genetic cross' AND 'lateral transfer'] NOT 'alfalfa'] across an email base, a set of web sites, a discussion group and a collection of spreadsheets becomes a trivial task no more taxing than using 'Find' within MS Word. The results of such a search come up as a set (one per text-base) of dual pane windows - listing in the lower pane, preview of record content in the upper.
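
askSam's own query syntax aside, the Boolean logic underlying a query like the one above is simple to sketch. The `search` function and toy text-bases below are hypothetical, standing in for askSam's multi-file search rather than reproducing it:

```python
def search(textbases, must_have, must_not):
    """Apply one Boolean query across several 'text-base' files at once:
    keep records containing every must_have phrase and no must_not phrase."""
    results = {}
    for name, records in textbases.items():
        results[name] = [
            r for r in records
            if all(p in r.lower() for p in must_have)
            and not any(p in r.lower() for p in must_not)
        ]
    return results

# Invented sample text-bases, one dict entry per file
textbases = {
    "email": ["Notes on genetic cross and lateral transfer experiments",
              "Re: alfalfa genetic cross with lateral transfer"],
    "web":   ["Lateral transfer review", "Genetic cross protocols"],
}
hits = search(textbases,
              must_have=["genetic cross", "lateral transfer"],
              must_not=["alfalfa"])
# only the first email record satisfies the full query
```

The per-file grouping of results mirrors the one-window-per-text-base presentation described above.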

In a real world where even the power of a computer cluster must be focused for maximum effect, askSam worked wonders in establishing exactly where to stake the claim, before heavier mining operations commenced. Bear in mind that askSam (like many packages) comes in two variants. Some of the bells and whistles that made our 'professional' model so powerfully effective are missing in the standard version; but they may be unnecessary for the application you have in mind. With that caveat in mind, check carefully which one is for you. You can download a free evaluation version to assist in making the decision.

And so to the dusty, sweaty, heavy-duty business of actually extracting high-grade ore from the identified locations. Selection of one particular tool or another is a matter of weighing different product strengths, but also of individual perception. As I've mentioned, the battle is still raging about exactly what software can and cannot do; and your choice will depend on your own assessment. The gamut of product approaches on offer ranges from claims bordering on machine-sentience to profound scepticism about anything beyond mechanised bead-counting - and, of course, many nuances in between. If you gravitate toward one particular point on this continuum, you will tend towards a product that offers a similar outlook. Personally, I see no point in fighting over this; each method has its different niche in the research ecology. Proprietary black boxes shift an awful lot of material to give a useful, rough-hewn result in a short time; application of general tools shapes a finer understanding; you decide what you are after, do your homework, then 'pays yer money and takes yer choice'. The real tension is between speed and volume on one hand, fine-grained confidence on the other.

In this, as in many other areas, Statsoft inspires confidence by not overselling what its products can do; and that confidence carries over into the implementation. Statistica Text Miner (in full, Text Mining and Document Retrieval, but hereafter Text Miner) is conservative, though not falsely modest, in its claims, all of which are based on well-known and verifiable methods.

  • Statistica Text Miner's arrays of textual statistics, together with displays of singular values and a file-subset classification tree.

However it may be dressed up or presented, the aim here is to numericise unstructured textual source material. (There are facilities in place for other activities, including search-engine style index queries - we used these for side tasks, and they are very efficient, but they are not Statistica's raison d'être.) The first step in this numericisation is an exercise that most of us remember from some stage of our pre-degree or undergraduate statistics education: enumerating significant word-occurrence frequencies within documents and storing them in a matrix. Apart from the basic matrix of crude counts, you are also offered various transformed arrays including weights, binary and log frequencies, inverses and singular values. Defining 'significant' words involves a stop list of trivial ones, and stemming or other syntactic reduction to aggregate counts across all forms of the same basic meaning (polymerise, polymerisation, polymerization, polymerised, and so on). Statsoft offers language-specific facilities for such indexing management issues, although our own investigation was restricted to variations on English. You can also restrict indexing in various ways, systematic or ad hoc, based on inclusion or exclusion of terms by structure, size, relative frequency, semantic weight, or role, and so on. These tailored rules are an investment of expensive, skilled time and are very dependent on linguistic register, so we welcomed the discovery that they can be saved for subsequent reuse.
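
As a rough sketch of that numericisation step: the stop list, the crude suffix-stripper standing in for real stemming, and the example documents below are all invented for illustration, not Text Miner's actual algorithms. The result is the familiar term-document matrix of raw counts, alongside the log-frequency transform log(1 + f).

```python
import math
from collections import Counter

STOP = {"the", "a", "of", "and", "in", "is"}  # toy stop list

def stem(word):
    # Crude suffix stripping stands in for real stemming, so that
    # polymerise / polymerisation / polymerised share one count
    for suffix in ("isation", "ization", "ised", "ized", "ise", "ize", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def term_document_matrix(docs):
    """Raw significant-word counts per document, plus log(1 + f)."""
    counted = [Counter(stem(w) for w in d.lower().split() if w not in STOP)
               for d in docs]
    vocab = sorted(set().union(*counted))
    raw = [[c[t] for t in vocab] for c in counted]
    logf = [[math.log(1 + f) for f in row] for row in raw]
    return vocab, raw, logf

docs = ["the polymer polymerised in solution",
        "polymerisation of the polymer"]
vocab, raw, logf = term_document_matrix(docs)
# both forms of 'polymerise' collapse into the single stem 'polymer'
```

Real stemmers (and Text Miner's language-specific facilities) are far subtler, but the shape of the output - documents as rows, stemmed terms as columns - is the same.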

The software has to be able to visit the documents in question, and Text Miner can take directories full of specified source files, locally or remotely located, crawling along links as required, chewing up the contents and spitting out a frequency matrix with commendable speed. Source formats include HTML and XML, MS Word, PostScript, PDF, and RTF, as well as pure-text files. The text can also be in a Statistica input sheet variable. I can imagine contexts, particularly where the same text will be repeatedly revisited, where this would be extremely useful, although they didn't include this particular project. Despite the sprightliness of the software, the resulting index database represents a significant investment and is (like the rule set) stored, with facilities available for its subsequent maintenance and updating. This database can also be refined and used as a 'gate-keeper' reference base against which subsequent incoming documents can be measured on chosen content criteria.

In a dedicated proprietary package, this stage would often segue unnoticed into the analytic phase. Statistica, as always, prefers an open-hand philosophy of providing the tools and the good advice, then leaving the usage to the user. For Statsoft, and for the Statistica user, this is just one more module in the armoury with analysis being carried out by the others already available. Great emphasis is placed on selection of known and tested techniques, frequent references being made to the literature. At the root of all this is an assumption: that concepts lie in the perception of the reader, not in the text. Proprietary algorithms generally make the opposite assumption, which inevitably tends to pass definition of those concepts to the designer of the package. Given words, phrases or other clusters, selection of tools places explicit responsibility for the nature of the quest on the one who seeks.

  • Statistica visualises selected detail aspects of data from the Text Miner term arrays.

The tools, therefore, are those already available to Statistica: from basic descriptives, through classification and pattern, to neural networks and the Data Miner environment. Clustering and factoring approaches are likely to feature prominently, as they did with us - verbal and textual discourse follows distinct chreodic pathways between defined loci, and mapping these is the nearest we can (as yet) get to a modelling of their conceptual content. Identification and optimisation of appropriate linear models through SVD yields important pointers to those pathways. At the other end of the scale, a surprising amount of gold dust can be extracted using basic exploratory methods, including graphical visualisation - in fact, I'd say the meaning and significance of more sophisticated analyses can only properly be teased out using an internalised understanding of the big picture, developed through more straightforward hands-on means.

We had several thousand source documents, ranging in length from 2,500 to 500,000 words. The number of documents was an advantage; the size of the longer documents less so. The usual laws of statistical work apply here: the more cases you have, and the fewer variables, the tighter will be your grip. In fact, though, despite my initial suspicions, which diverted us into extensive, early control testing, Text Miner copes well with this. Even a trial analysis of three overlong theses, only slightly overlapping in content (not a scenario on which I'd like to pin much hope in real use), yielded a useful set of corroborative nuggets. Correspondence analysis and SVD reduce to a statistically manageable level the complexity of the semantic volume occupied by the input texts. In this way, if all goes well, aspects of the latent structure within that volume can be teased out.
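
The SVD reduction itself rests on standard linear algebra. As a toy sketch (nothing here is Text Miner's implementation), a simple power iteration on A^T A recovers the leading singular value and right singular vector of a term-document matrix - the first of the dimensions onto which such analyses project the texts:

```python
import math

def leading_singular(A, iterations=200):
    """Leading singular value and right singular vector of matrix A
    (rows = documents, columns = terms) by power iteration on A^T A."""
    rows, cols = len(A), len(A[0])
    v = [1.0] * cols
    for _ in range(iterations):
        # w = A v, then u = A^T w, i.e. one application of (A^T A)
        w = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        u = [sum(A[i][j] * w[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in u))
        v = [x / norm for x in u]
    # singular value is the length of A v for the converged unit vector v
    w = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in w))
    return sigma, v

# Trivial 2x2 example: singular values are 3 and 1
sigma, v = leading_singular([[3, 0], [0, 1]])
```

Production code would of course use a full, numerically robust SVD rather than iterating by hand; the point is only that the 'latent structure' emerges from well-known, verifiable mathematics rather than a proprietary black box.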

So, returning to the question at the beginning: what is text mining? At the risk of irritating those who are intimately involved in its development, the answer from the coalface is: whatever enables the extraction of informational product from textual ore. We were advised from all sides that, having a big hitter like Text Miner available, we should harness it to all aspects of our search. So we could have done - Text Miner contains all the means, and in some circumstances it would be a sensible choice. However, I am a strong believer in fitting the tools to the job; askSam and RefViz provided ways to call upon fifty or so experts, who thus provided Text Miner with greatly enhanced quality of input material. Asked to undertake a similar task in the same circumstances, with the same luxury to choose, I would make exactly the same decision.