REVIEW: S-PLUS 6.1 AND INSIGHTFUL MINER
Northern odyssey reveals the very best of both worlds
Felix Grant took data-mining software, S-Plus 6.1 and Insightful Miner, into the frozen wilderness and found that he could not choose between them
This review started as a very routine exercise, and turned into something of an adventure in several ways. This time, just for a change, I have not been attempting single-handedly to save the world using two matchsticks and a washing-up liquid bottle rescued from the Blue Peter studio. I have, instead, been tied up in recent months with industrial contracts that actually pay the bills ... or, as Scientific Computing World's editor mischievously put it, 'selling out to corporate capitalism'. So, with S-Plus 6.1 and Insightful Miner 3 in hand, I prepared for work on a waste elimination and quality improvement contract in a large bakery. This sort of operation provides a very good platform for reviewing analytic and data mining software. I also find it personally satisfying to hunt down and eliminate unsuspected factors within a process; but that, my stepson assures me, is because I am 'a sad git'. Even I must concede that this is not the stuff of scintillating party anecdotes or gripping journalism. So, when Rudi contacted me about another and very different industrial project, I jumped at it.
When I first met Rudi, there was still an Iron Curtain. I was on a climbing holiday, and he was working out his national military service as a border guard on one of the curtain's more remote, porous, and picturesque extremities. A lot of water has flowed beneath the bridge since then. Rudi qualified as an engineer, and built a career. Then the Iron Curtain rusted away, and the world changed. Now Rudi finds himself in another remote location: a town in the freezing wilderness, around a manufacturing plant deliberately constructed far from threat of invasion. In its time, this plant has produced most things that can be fabricated in metal - from spring washers to tanks, cameras to cars, safety pins to components for the space programme. Around it are subsidiary facilities, for the ancillary processes serving such manufacture: optical grinding, surface coatings, anodisation, packaging, receipt, and despatch. Now left behind by the ebb of the politico-economic system which it once served, the town finds itself rich in capital and skills, but without markets in which to exploit them and with no modern high technology.
Also standing near the main plant is a records complex, containing an immense store of incredibly detailed records - mostly paper, though some on magnetic tape and disk - of costs and benefits, inputs and outputs, successes and failures, right down to itemisation of quality control measurements by individual lathe operators. And this is where I come in. Rudi is one of a group planning to resurrect the manufacturing centre, save his town from decline, and give the community a future. A motley collection of computing equipment has been begged, borrowed or otherwise obtained, and volunteers have made a good start on converting the millions of paper records into accessible digital form. The renaissance is to be achieved by careful specialisation on the basis of marginal advantage, and this specialisation will be driven by mining operations in all that old data. Not much hope for me of getting rich on this one, but it did attract by dangling an intriguing challenge. And here to hand was ideal software not yet committed to the bakery. Had I known what was in store, I might have hesitated; but I plunged in.
S-Plus is well known to anyone involved in any way with scientific computing, whether they have used it or not. It is one of the high-end data-analysis products built around a purpose-written language, a large subset of which can be accessed from a graphical user interface (GUI), working well in both exploratory and batch modes. Its application in science is widespread, and there is an extensive base of ready-made user-generated programs, routines, and applets, for most purposes. There is also a strong constituency out there who will argue that S-Plus is the best tool around for building scientific data-mining operations.
Insightful Miner ('Miner' or 'IM', from now on) is a newer arrival. Although this release is marked at 3, it has been preceded by whole-number version increments rather than S-Plus's fractional ones. It represents Insightful Corporation's entry into the market for dedicated data-mining products, as opposed to the application of existing tools to data mining. Its philosophy falls somewhere between the existing products in this market. It follows the now familiar drag-and-drop visual programming approach employed by both its competitors - the user clicks on icons representing process components to insert them into a workspace, to pull them about into a useful configuration, and to stretch links between them to assemble a functional whole. It allows highly flexible exploration of data in a very approachable way, with packaged export of code to either S-Plus or ANSI C. One small, but productive, point where it scores over every alternative I have tried, is the ease with which the assembly space can be tidied up after one stage before moving on to the next - a great time and energy saver.
Miner uses an expression language similar to S-Plus, but distinct from it. Where S-Plus 6.1 is also available, however, there is an S-Plus library, which extends certain native IM functions to their full S-Plus versions. There are other synergistic links too - one particularly useful one being the presence of S-Plus script and graph nodes, of which more later.
I installed both applications on one machine, complete with S-Plus library, for an experimental crack at Rudi's data. At this stage an oversight, on my part, led to problems later. I assumed the same system requirements for Miner as for S-Plus, but it ain't so - beware! Miner is available under Windows or Solaris at the time of writing, but it was the Windows version that concerned me.
Although S-Plus will run under any version of Windows from 98 (or NT4.0) upwards, IM3 insists on XP (or NT4.0SP6) or above (strictly speaking, the quoted IM3 spec calls for XP Professional, but it went onto an XP Home box without complaining and then ran just fine). I didn't notice this; I checked S-Plus, and made the unjustified assumption that IM would match. Only when my pampered, temperate-zone machine died in the sub-Arctic cold, and we tried to continue on another one which had been suitably winterised, did my error come to light: no other copy of XP was available, and Miner refused to install onto a Windows ME replacement. Efforts to persuade Microsoft that the same copy should be reactivated on the new machine fell on deaf ears. Luckily we had, by that time, completed enough work for me to have thoroughly reviewed Miner itself; we had also recorded in other formats the insights thus far gained; but any further work had to be in S-Plus alone.
As a general philosophical point, I wonder why this situation should arise. One of IM's leading market peers has a similar discontinuity, running only under NT where its stablemates will happily accept less strict versions. Another, however, runs cheerfully and powerfully on anything, so there is clearly no reason why dedicated data-mining products have to be restricted in backward or sideways compatibility.
Rudi's volunteers and their assortment of tools had left me with a wide range of data formats to deal with, so IM's omnivorous acceptance of data input (despite the fussiness about OS version) was very welcome. If an existing data set is in, or can be imported into, almost any mainstream spreadsheet, database or analytical package, or any other program for which ODBC drivers are installed, then it is accessible to Miner. Furthermore, the import filters have both intelligent defaults and extensive tuning controls, and allow easy navigation to specific worksheets or tables within the source. Native drivers handle the more likely sources, which speeds things up considerably. The only data which had to be manually prepared for import was that stored on early magnetic media in long forgotten formats; the rest came in without a hitch, regardless of how it had been transcribed or stored.
Dedicated programming nodes handle reading or writing of data in S-Plus chapters or transport files. Both packages also allow data to be written out again in widely useful formats. S-Plus's more traditional, and more easily automated, methods were also being applied, and it too is accepting of different source formats. S-Plus has also recently started a move towards fuller localisation - or, in its own description, globalisation through locales. Essentially, this is provision of switchable interface information sets to suit local conventions, as is familiar in operating systems these days (such as Windows' 'regional settings', or PalmOS' 'format presets'). This localisation hasn't extended to the non-Latin alphabets and local conventions which would have suited my hosts, but does, crucially, allow their use of the more usual comma as decimal separator in place of a US/UK dot. North-western European requirements are already catered for, apart from interface messages, which are still all in English; the underpinning structure is there, and the range will extend with time. A factor to be aware of in use is the exclusion of time/date functions from this localisation development, with the result that formatting of these is independent of locale settings.
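To make the decimal-separator point concrete, here is a minimal sketch in Python (not S-Plus, and not a description of how S-Plus implements its locales). The helper name parse_decimal and its strict treatment of stray dots are assumptions made purely for illustration.

```python
def parse_decimal(text, decimal_sep=","):
    """Convert a numeric string written with the given decimal separator to a float.

    Hypothetical helper for illustration only. It deliberately refuses
    thousands separators rather than guessing what they mean.
    """
    cleaned = text.strip()
    if decimal_sep != ".":
        # Under a comma convention a dot is ambiguous (thousands separator or
        # transcription slip), so reject it instead of silently accepting it.
        if "." in cleaned:
            raise ValueError(f"unexpected '.' in {text!r}")
        cleaned = cleaned.replace(decimal_sep, ".")
    return float(cleaned)


# Readings transcribed with a comma as the decimal separator.
readings = [parse_decimal(value) for value in ["3,14", " -0,5 ", "12"]]
```

Rejecting ambiguous input, rather than guessing, is a design choice that suits hand-transcribed records, where a misplaced separator is more likely to be an error than a convention.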
Not part of the localisation measures, but to be bracketed with them, is improved facility for tailoring S-Plus help files. These are now HTML, can be compiled or converted from the old format, and can also therefore be in any language or script. I have to confess that we haven't done much in this direction, so far, but we did compile a glossary to assist in interpretation of dialogues and suchlike.
Once the data are in, Miner provides a good set of tools for preparatory manipulation, cleaning and evaluation. These include dual input comparison nodes, which did an excellent job of comparing datasets and thereby trapping transcription or other errors. This is not, of course, the intended purpose of such nodes (they would more often be used to compare outputs, such as predicted and actual results) but they did the job very effectively anyway. Datasets can be transposed, a facility familiar to spreadsheet users but less common in database managers or analytic software; a useful trick to have up your sleeve in the right circumstances, but to be used with care and forethought. Since the provenance and amateur transcription of the data led to a high incidence of both redundancy and error, and there was in addition a natural language barrier between me and the data sources, this was a very effective way to use one problem in dealing with the other.
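The comparison trick is easy to picture outside Miner. The following Python sketch is an illustration only, not Miner's comparison node: it assumes two independent transcriptions of the same records held as lists of rows, and the function names compare_tables and transpose are hypothetical.

```python
def compare_tables(left, right):
    """Return (row, column, left_value, right_value) for every cell that differs.

    Assumes both tables have the same shape; rows or cells beyond the
    shorter table are ignored by zip and would need a separate check.
    """
    mismatches = []
    for row_index, (row_a, row_b) in enumerate(zip(left, right)):
        for col_index, (a, b) in enumerate(zip(row_a, row_b)):
            if a != b:
                mismatches.append((row_index, col_index, a, b))
    return mismatches


def transpose(table):
    """Swap rows and columns of a rectangular table."""
    return [list(column) for column in zip(*table)]


# Two volunteers transcribe the same (made-up) quality-control sheet.
first_pass = [[10.2, 10.4], [9.9, 10.1]]
second_pass = [[10.2, 10.4], [9.9, 10.7]]
suspect_cells = compare_tables(first_pass, second_pass)
```

Each entry in suspect_cells points at a cell where the two transcriptions disagree, which is exactly where someone needs to go back to the paper original.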
I said above that the programming interface for Miner uses drag and drop node icons; in functional terms, these come in several classes and one distinction is the link types which they support. In addition to the standard links, which pass Cartesian data sets from one node to another, there is a new type (represented on the icon as a circle), which passes models instead, denoted by dotted lines rather than solid. Prediction, C-generation, and mark-up language export (hypertext or the XML predictive model dialect) all sport these new circles on their right hand (output) side; principal components, regressions (Cox, linear and logistic), K-means, naive Bayes, neural nets and trees (for classification or regression) all have model ports on the left hand (input) side as well.
This is a useful and flexible facility, and I found it fairly easy to work directly with local industrial experts without an interpreter, using the purely visual metaphors that the package provides. There is an instantly usable explorer model, with each page holding a library of components (the S-Plus library, if present, being one of these). Additional libraries can be created and managed by the user and pages internally administered too, while their contents can be moved around and reorganised at will.
Miner's worksheet has some interesting features, which complement this intuitiveness. There are user controls on data block memory usage, for example. Components can be swept up together and represented as a single 'collection node' - effectively a black box conceptual package, saving space and simplifying layout to improve visual comprehension. Options specific to the application context (maximum categorical levels, for instance) are storable within the sheet, which makes for manageability; this is supported by the provision of annotation boxes containing pure documentary rich text, which increases process readability. There is a clutch of options, too, for fiddling about with the presentation of the program - node names and icons can be changed, and other properties altered. While talking about communication, it's worth mentioning the capabilities of IM for producing packaged solutions for external use. One of these is the S-Plus script node mentioned above; alongside it is a C code generation node.
My having experimented with the S-Plus script component proved to be fortunate when the hardware failure put Miner out of reach; we already had a number of useful scripts that could be run as they were in S-Plus on the replacement platform, or modified for generality. There is an 'add a parameter' table in the S-Plus script properties dialogue, allowing passage of arguments to the script, and editing is done through the standard Windows notepad (or a preferred alternative, if specified in global properties) reached from a button. There is validity checking, a radio button to specify where and how names, types, etc, are to be provided, and an option to control showing and/or storing of results, which enables tuning of speed against monitoring.
There's more, but detail gets tedious - suffice to say that the script node is a well implemented facility that extends the reach of IM models with very little user effort. The same can be said of the C generation component, which produces fully portable routines in standard ANSI C; models are described in a file, allowing substitution without recompilation of code. I'm no C programmer, but I found the results easily usable and understandable.
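The benefit of keeping the model description in a file, separate from the scoring code, can be shown with a small sketch. This is Python rather than generated C, and the JSON layout, the linear model and every function name here are assumptions for illustration; none of it reflects the file format Miner actually writes.

```python
import json
import os
import tempfile


def save_model(path, model):
    """Write a model description (intercept plus named coefficients) to disk."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(model, handle)


def load_model(path):
    """Read a model description back from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def predict(model, features):
    """Score one record with a linear model; the code never changes, only the file."""
    return model["intercept"] + sum(
        weight * features[name] for name, weight in model["coefficients"].items()
    )


def demo():
    """Score the same record before and after swapping the model file."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "model.json")
        save_model(path, {"intercept": 1.0, "coefficients": {"x": 2.0}})
        first = predict(load_model(path), {"x": 3.0})
        # Substitute a different model: no change to the scoring code is needed.
        save_model(path, {"intercept": 0.0, "coefficients": {"x": 0.5}})
        second = predict(load_model(path), {"x": 3.0})
    return first, second
```

Calling demo() scores the same record twice and returns two different predictions, the only change between them being the contents of the model file.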
S-Plus, of course, can itself call C routines (and Fortran, for that matter) as well as its own scripts. It can also call, and be called by, Java. A new tool, Connect/Java, is now provided in 6.1 as a replacement for the old winspj library, which remains only for backward compatibility.
Miner's S-Plus library also provides S-Plus graph nodes, offering the sort of facilities familiar to anyone who has, for example, used S-Plus in conjunction with Excel - though with a greater level of control. Depth and flexibility of control is a common feature throughout the package, in fact, from the import of data through order of operations specification, handling of unknown levels in prediction, to tuning of output.
Ultimately, data mining is all about prediction - whether as an end in itself, or as an input to some other process. You won't be data mining in the first place unless you want a prediction to emerge from it. In my case, I wanted first of all to see whether, after training from data-subsets hauled out of the records building, the programmed structures we built could correctly predict successes, failures, and surprises elsewhere in the archive. After that hurdle had been passed, I could perhaps chance my arm on predictions of future viabilities for Rudi.
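The first hurdle described above, training on one part of the archive and checking predictions against another, is ordinary holdout validation. A minimal Python sketch follows, assuming a deliberately simple one-variable least-squares line as the model; the function names and the data are hypothetical and stand in for the far richer models Miner and S-Plus provide.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def holdout_error(records, every=4):
    """Train on most records, then report mean absolute error on the rest.

    Every `every`-th record is held back for testing; the split is
    deterministic so that the result is repeatable.
    """
    train = [r for i, r in enumerate(records) if i % every != 0]
    test = [r for i, r in enumerate(records) if i % every == 0]
    slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
    errors = [abs(y - (slope * x + intercept)) for x, y in test]
    return sum(errors) / len(errors)


# Made-up (input, outcome) pairs lying exactly on y = 2x + 1.
archive = [(x, 2 * x + 1) for x in range(12)]
error = holdout_error(archive)
```

With data that follow the line exactly, the held-back records are predicted almost perfectly; on real records, the size of this error is what tells you whether the model deserves to be trusted on the future.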
IM's prediction node is therefore the de facto centrepiece of the whole show. The node has twin input ports - a standard port for the data, and a model port taking input from the model node, which is to provide the basis on which predictions will be made. Models can be copied to storage inside the prediction node, or left dynamic - the port changes colour to show which of the two states is in play. Results for the first phase, testing predictions against known past outcomes, were impressive. The future, of course, is still to be seen. An interim stage was also added, in which predictions for new small-scale operations were to be generated and tested. This, unfortunately, has to wait until another XP platform can be mounted.
Survival analysis is an important part of this sort of work, too, and for that I shifted to S-Plus, even before IM's hardware host froze to death. There is a very good library in S-Plus, updated in this release, of user supplied functions and a number of other additions as well - including enhanced censorship handling and three new functions for distribution cumulative probabilities, densities, and quantiles. Other contributed library arrivals from prominent workers in their fields include updated versions of Harrell's hmisc and design libraries, the MASS4 (mass, class, spatial, and nnet) libraries¹, and updated mixed effects models. S-Plus also has numerous incremental changes with which I won't take up space here, including tweaks to functions scattered throughout the package and several improvements to the presentation of output.
Which to have, S-Plus or Insightful Miner? Like Winnie the Pooh, if given the choice, I have to say 'both'. The degree of synergy between them is not something to lose from choice. Failing that, Miner will work very effectively with most existing software setups, and is effortless to use - so if you have an existing analytical regime (or don't need one) but want to add data mining without tears, then IM is for you. If, on the other hand, you are a confident analytical software user, are not afraid of building your own solutions, and don't need a point and shoot interface, then go for S-Plus: it offers more generality, and can be made to do anything Miner can do, though with considerably less ease and convenience. Working with Rudi, S-Plus was the tool preferred by those engineers who already knew what they were doing on a command line; but Miner brought in a much larger group of individuals, from a wide range of backgrounds, all with expertise to offer, with no prior experience of such work but able to quickly get to grips with the ideas involved. Between them, the two packages opened up a number of promising possibilities to which development groups have already been assigned; IM did it quickly, S-Plus in depth. When I asked which they would rather have, most of Rudi's people said Miner; those most crucially involved in taking the project forward said S-Plus. Almost all of them, though, said that if given the choice they would join me in solidarity with Pooh.
Reference: ¹ Venables, W N and Ripley, B D. Modern Applied Statistics with S. Fourth Edition. Springer, 2002. ISBN 0-387-95457-0.
System requirements on Wintel platforms
Insightful Miner 3
*Pssst... don't tell Insightful, but both programs run perfectly well under Windows XP Home Edition as well