Human beings have a great and tenacious emotional attachment to the word 'encyclopaedic'. Since long before Pliny the Elder first had a stab at it, we have nursed the dream of completeness: to encompass within a finite space all that is and could ever be. Rationally, we know this to be a chimera; but at a more fundamental level we pursue it anyway. There is often a similar psychology behind investment in one of the 'big beasts' of the data analysis market, rather than a smaller package with closer horizons and a shallower learning curve. It offers the comforting feeling of investment in future completeness. At the same time, however, the publishers of such products know that nothing is ever complete, so they design for the unexpected. When the ready-made tools are not enough, there are the means to build new ones.
In a sense, the mechanics of extension do not vary; all are code additions reached by application hooks of one kind or another. At one end of the range is the ad hoc extension of a language through the ability to define new terms; at the other, the prefabricated black box 'add-in'.
Illustrative of the ability to extend a language word by word are the likes of GenStat and S-Plus. Though they have learnt to wear GUI clothes to suit the times, they are, if you scratch the surface, languages pure and simple. This is, of course, not news to long-term devotees of such packages; but it comes as quite a shock to many new users who start out by happily driving the GUI and then, one day, open the cover to look inside. Like any high-level programming language, these allow you to define a new word to suit your purpose, using the ones already there. This is a fiddly business, though, and where possible it makes sense to adopt an existing vocabulary.
Everybody has their own favourites among the extensions of existing languages; I tend to rely on those which have become trusted friends, but it's always interesting to see what other people are finding valuable. Which is how this survey came about: peeking over other people's shoulders to see what they were using. I only have time for a quick whiz across the surface here - methods can almost always be applied much more widely than their marketing target suggests (and frequently were, in the contexts where I found them).
Some were free; others commercial. The free ones often tend to be those which, in a better world, a more dynamic version of myself might conceivably have written. Deciding whether to eliminate outlier cases is an example: a small routine running a Grubbs test in Statistica Visual Basic, available from StatSoft's website, saves a lot of time and tedium. Alongside this GRUBBS_TEST.SVP, and in the same place (see sources table), there is a whole trove of other straightforward but useful utilities covering a range of requirements, sizes and complexities.
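For readers who have not met the Grubbs test, its core is a single statistic: the largest absolute deviation from the sample mean, divided by the sample standard deviation. What follows is a minimal Python sketch of that idea, not the contents of GRUBBS_TEST.SVP. The function names are hypothetical, and the critical value is assumed to be supplied by the caller from a published Grubbs table for the sample size and significance level in use.

```python
from statistics import mean, stdev


def grubbs_statistic(values):
    """Return (G, index) for the value furthest from the sample mean.

    G is the largest absolute deviation from the mean, expressed in
    units of the sample standard deviation.
    """
    if len(values) < 3:
        raise ValueError("the Grubbs test needs at least three values")
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation (n - 1 divisor)
    if spread == 0:
        raise ValueError("all values are identical; no outlier to test")
    # Locate the most extreme observation.
    index = max(range(len(values)), key=lambda i: abs(values[i] - centre))
    return abs(values[index] - centre) / spread, index


def flag_outlier(values, critical_value):
    """Return the index of the suspect value, or None if G does not
    exceed the caller-supplied critical value (assumed to come from a
    table; it is not computed here)."""
    g, index = grubbs_statistic(values)
    return index if g > critical_value else None
```

The test examines only one candidate at a time, so a routine of this kind is normally rerun on the reduced data set if a value is removed.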
Others, though, are extensive libraries that even the supercharged alter ego of my dreams would never contemplate producing at home as a cottage industry - and the Biometris GenStat Procedure Library, while still a free download, is one such. Available for several generations of GenStat (from release 4.2 through to the current seventh edition), it contains what a colleague in California calls 'fraternisation tweaks' (procedures which in some way enhance user/software interaction) and methodological routines. There is enough in here to make a whole article, but a couple of examples will make the point. Going to the first entry in the alphabetic table of contents, BICORRELATE is an intelligent development of GenStat's Correlate command: where Correlate discards any case with missing variable values, BICORRELATE works pairwise, using every case in which both of the variates being correlated have values. Jumping to the end of the alphabet, the eye alights on WEAVEVECTORS (which, to my imagination at least, sounds delightfully like something from Tolkien or Terry Pratchett... but perhaps I'm just a hopeless romantic). This does exactly what its name suggests: weaves two sets of vectors into a new third set, according to a target vector derived from the leading vector of each source. Moving back up the list, the equally evocative TSQUEEZE reduces table size according to preset decisions on classifying factors - like WEAVEVECTORS, a good example of high-value automation in a labour-intensive process. A greater spread of useful extensions (timing, checking, stacking and unstacking, text manipulation and generalised model-fitting) is covered in the 47 listed functions than I can even touch on here. If you use GenStat, or are contemplating it, and haven't yet explored Biometris, go and take a look.
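The difference between discarding whole cases and correlating pairwise is easy to see in a small sketch. The Python below is an illustration of the general technique only, not GenStat code and not the BICORRELATE procedure itself; the function names are hypothetical, and missing values are assumed to be represented by None.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length, fully observed sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variate")
    return sxy / math.sqrt(sxx * syy)


def pairwise_correlation(a, b):
    """Correlate a and b using every row where BOTH have a value."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two complete pairs")
    xs, ys = zip(*pairs)
    return pearson(xs, ys)


def complete_rows(columns):
    """Rows that survive listwise deletion: no missing value in ANY column."""
    return [row for row in zip(*columns) if all(v is not None for v in row)]


# Three variates; the second is mostly missing.
col1 = [1.0, 2.0, 3.0, 4.0, 5.0]
col2 = [None, 1.0, None, None, 2.0]
col3 = [2.0, 4.1, 5.9, None, 10.0]

kept_listwise = len(complete_rows([col1, col2, col3]))  # rows with no gaps at all
r_pairwise = pairwise_correlation(col1, col3)           # uses four of the five rows
```

With these made-up data, listwise deletion leaves only two usable rows, while the pairwise correlation of the first and third variates draws on four. The trade-off is that each coefficient in a pairwise matrix may rest on a different subset of cases.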
Most analytic environments these days have spawned both official and unofficial extensions - often user generated and contributed. Insightful's S-Plus is no exception, there being a whole ocean of contributed code out there and a good collection of it on the Insightful user-contributed download area. But rather than look at another example, I will divert into commercially packaged extensions. Insightful offers several of these and, though I have dubbed them commercially packaged, some are still free to download: they are the research libraries, under development for later inclusion with the main product, but available now 'as is', so I classify them as add-ins for the time being. Then there are the fully commercial (you pay for them separately) extensions, which are explicitly presented as specialist add-ins by Insightful. In between come the developers' tools. (Similar approaches are apparent from other publishers - StatSoft, for instance, starts from 'Statistica Base' and offers extensions for particular purposes.)
- Composite of output from Insightful's S+FinMetrics add-in (top left and top right), S+NuOpt (bottom) and S+SeqTrial (centre background).
- Correlation in GenStat. Despite the missing values in the worksheet (empty cells with asterisks) BICORRELATE takes every row containing a value in the two correlated variables (the largely complete columns 1 and 3).
- Composite of output from Insightful's EnvironmentalStats for S-Plus add-in.
There are five research libraries at the time of writing. Most specify compatibility with a particular S-Plus release, and the current state of platform support varies too. Bayes supports a number of (surprise!) Bayesian approaches. BEST provides for non-parametric adaptive estimation (PDFs, hazard regressions, and parabolic survival amongst other GLMs) through B-splines. CorrelatedData collects together specific methods for a range of correlation types, including GEE quasi-likelihood approaches. FDA deals with smoothed transformations of data for analysis in functional form, including differential or integral methods; clustering, canonical correlation, GLM and principal components are embraced, amongst others. Finally, Resample brings extensive support for parametric or nonparametric bootstrap (with tilting), jackknife, and so on.
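To give a flavour of what resampling involves, here is a minimal Python sketch of the two simplest techniques named above: a nonparametric bootstrap and a jackknife, each used to estimate the standard error of a statistic. It is not the Resample library's interface and makes no attempt at tilting; the function names, the default of 2000 resamples and the fixed seed are all assumptions made for illustration.

```python
import math
import random
from statistics import mean


def bootstrap_se(data, statistic=mean, n_resamples=2000, seed=0):
    """Nonparametric bootstrap estimate of the standard error of `statistic`.

    Each resample draws len(data) values from data with replacement.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(data)
    estimates = [
        statistic([rng.choice(data) for _ in range(n)])
        for _ in range(n_resamples)
    ]
    centre = sum(estimates) / n_resamples
    variance = sum((e - centre) ** 2 for e in estimates) / (n_resamples - 1)
    return math.sqrt(variance)


def jackknife_se(data, statistic=mean):
    """Jackknife estimate of the standard error of `statistic`.

    The statistic is recomputed n times, leaving out one value each time.
    `data` is assumed to be a list.
    """
    n = len(data)
    leave_one_out = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    centre = sum(leave_one_out) / n
    return math.sqrt((n - 1) / n * sum((t - centre) ** 2 for t in leave_one_out))
```

For the sample mean, the jackknife figure coincides with the textbook standard error (the sample standard deviation divided by the square root of n), which makes a convenient sanity check; the value of both methods lies in statistics for which no such formula is to hand.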
The developers' tools variously add the ability to use web 'graphlets' and develop S related or Java and C++ clients. Since my competence in C++ and Java is on a par with my schoolboy Latin and Greek, I paid closest attention to the graphlets, which in the example I saw were being used to automate user enquiries on a continually developing research base. Graphlets are Java applets that present data within a browser, with full control and allowing user interaction (investigating individual data points within a display, for instance, or zooming in and out), in any graphical form supported by S-Plus. Although the live intranet implementation I played with was only running reasonably straightforward views, its more advanced successor - in development - offered remarkably sophisticated data-mining queries through the ability to interlink graphlets with each other, with analyses, and with hyperlinked pages.
The extensions that are specifically marketed as specialist add-ins, regardless of software stable, take things a step further, and are priced accordingly. In the case of Insightful, there were eight on offer, of which I managed to track down users of six, and spent extensive amounts of exploratory time with three. My professional involvement being primarily with spatially affected data, often with environmental implications, I concentrated on 'S+SpatialStats', 'S-Plus for ArcView GIS', and 'EnvironmentalStats for S-Plus'. 'S+ArrayAnalyzer' was in use by one of the same hosts who let me play with their SpatialStats installation, so I got to explore it in some depth on their data. 'S+FinMetrics' and 'S+NuOpt' deserved more time than I had available; they offer strong examples of techniques much more widely applicable than their targeting suggests, and will certainly have been revisited by the time this article is published. The remaining duo, which I didn't get to see in the time, are 'S+SeqTrial', which brings group sequential methods to clinical trials; and 'S+Wavelets', which addresses image, signal and time-series data.
ArrayAnalyzer comes in several different weights, from an individual desktop prototyping or investigation setup for microarray analysis to a heavy-weight, web-based collaborative setup. My knowledge and understanding of genetic work is not sufficient to make knowledgeable pronouncements on its detailed professional use in, for example, identification of differentially expressed genes; but even a journeyman statistician can track and appreciate its efficient implementation and use of generic tools for a specific purpose.
The ArcView add-in was running in conjunction with SpatialStats on one project, a powerful addition but easily described: it harnesses the facilities of ArcView and S-Plus. SpatialStats augments S-Plus with an additional, spatially-oriented modelling, exploration and analysis armoury, so the two are a natural combination. As well as presenting a task-efficient environment of techniques and associations, SpatialStats links out to key web resources, creating a coherent working bubble for the handling of spatially correlated data. I would have loved to have seen SpatialStats in the same place as EnvironmentalStats, but had to make do with a separate installation on another project. Both illustrate the difference between a library and an add-in, according to Insightful's taxonomy: EnvironmentalStats augments the interface with a 'one stop shop' of pull-down menus accessing graphic and analytic functions, covering major published methods for this area of work. It also places on tap a substantial body of relevant reference data sets, and embodies a help system which threads its hypertext way through background, usage and source referencing. There is a heavy emphasis in the support material on the needs of civic or corporate professionals working directly to legislative or regulatory requirements, but the core capabilities would be invaluable to a much wider user base - and not just those working with environmental data, either, since many of the extensions to S-Plus statistics are generically useful.
Generic utility accounts for my time on FinMetrics and NuOpt. Both are aimed prominently at financial users; but, just as the financial markets are these days headhunting physics graduates, so other fields can learn a great deal from the methods and approaches used in financial contexts. FinMetrics is advertised for econometric work, while NuOpt hints at a more catholic market 'including portfolio optimisation, nonlinear and robust statistical modeling, and circuit optimisation', but both offer a degree of wider fetch not exactly replicable elsewhere. The person who gave up his time to walk me through FinMetrics was using it (from choice, not perforce) as a major contributor to his psychosomology PhD research project. In the particular setting where I watched NuOpt at work, its focus was a problem in strong materials.
Looking back over the past couple of months' investigation of add-ins, it's clear that you can, theoretically, do in any of the big analytic products everything that these add-ins offer (import GIS data and literature, build and debug your own tests and comparisons, and so on) but not with the ease, efficiency, certainty, and immediacy that they bring to their respective specialisms. By the same token, you could sit down without a computer and accomplish everything with a pen, paper, and perhaps an abacus, but the arguments against doing so would be the same. As I said recently in reviewing free software, there are many ways to measure cost. Some of these add-ins (taking the widest definition of the word) are available at no financial cost whatsoever. Depending on the frequency, depth, extent, and nature of your requirements, the financial cost of a dedicated add-in can very quickly amortise against savings in overhead. I invested a day, and a 150km round trip, in borrowing study time on a copy of EnvironmentalStats; I estimate conservatively that I saved a week's worth of long days by doing so. In my position, that's a good deal; it wouldn't take many such occasions to make the purchase price, and the time-for-needs analysis, an even better one.
STATSOFT (STATISTICA): www.statsoft.com
Statistica Visual Basic downloads (including GRUBBS_TEST.SVP)