FEATURE

An education in add-ins

Felix Grant tried out StatTools on some new users, who found the whole experience painlessly educational

'Add-ins': those symbiotes (usually third-party) that colonise a generic program and provide specialised functions or services, usually relating to the data it contains. For some time, now, we have been told that there is a market trend away from them. Even some of their publishers and promoters feel obliged to preface the pitch to reviewers with 'I know the trend is away from add-ins, but ...' Yet, despite this, add-ins seem to be alive and well and doing very nicely thank you - and still reproducing, too. The reasons for their enduring popularity is easy to see. The big four office applications have become part of life now, an information-age equivalent of the modern family car, shielding us from the need to mess with oily innards and complicated details. Excel, in particular, is everywhere, is familiar, and provides a very convenient container for data.

 

For most day-to-day purposes, it also provides a very serviceable and easily-learned analytical tool as well. But Excel's statistical routines, in particular, are not the most robust, accurate, or extensively proven and, although adequate in simple applications, are soon found wanting if pushed too far. Cue for an add-in.

 

Add-ins come in several distinct forms. Some of them, like Excel's own Analysis ToolPak (see 'Only as strong as the weakest Link' below), concentrate on making more accessible the power that is already within the host application. Others, like the excellent UniStat (see 'Standing alone' below) are actually separate programs, which run in their own right but take their data from the host and return results to it for display. Between these two extremes come those which, like the NAG Statistical Add-Ins for Excel collection, represent packaged external code called from within the host itself as if they were native Excel functions. There are other analytical applications that offer to perform certain functions at other points on the same spectrum. S-Plus and Axum, for example, offer add-in toolbars that enable them to take Excel data ranges and, from them, generate charts, graphs, and plots, which then appear within Excel. Then there are the special-purpose add-ins, which address particular needs rather than general analysis: XLfit from IDBS, for example, reviewed a while back in version 3 and recently upgraded to 4, specialises in fitting models.

 

StatTools, which I have been trying out recently, comes from Palisade Corporation, and falls somewhere between the pure NAG and Unistat models. It replaces some functions with its own, then adds both a new menu bar option, and a toolbar from which extensions are run.

  • Composite of steps in using StatTools to carry out a chi-squared test on a contingency table (bottom left) and create a true histogram (remainder of image) - neither task being possible in native Excel.

 

Each type, and each product, addresses a different market; in that respect, none is 'better' or 'worse' than another. Assessing the success of any one of them comes down to two issues: how well does it merge into its host environment; and how effectively does it extend the usefulness of the host application?

 

It's very difficult for anyone already familiar and comfortable with a range of analytical tools to look at them with the eyes for which these add-ins are designed, so I didn't try; I asked for help. Four users reasonably comfortable with routine Excel use, but not with analytical work, used my copy of StatTools to tackle live tasks related to their needs: a social worker doing action research; a theatre nurse working towards promotion; an administrator with a report to publish; and a 15-year-old student tackling an exam coursework task.

 

On the first count, protective colouring, StatTools does as good a job as it is possible to imagine. A statistical analysis regime is unavoidably different from a general spreadsheet but, given that unalterable fact, the various menus and dialogs keep very close to the spirit of the host interface. All of my guinea pigs were apprehensive about dealing with new tasks using unfamiliar tools, but none felt that the handling of StatTools contributed to this apprehension. When later asked to repeat the same operations in a conventional (and to my mind friendly) statistics package, by contrast, they unanimously described it as 'hostile', and all were under the impression that they had used previously unknown aspects of Excel itself on the first time round. All of them were given help sheets from Frances Clegg's Simple Statistics (one of the most user-friendly statistics primers around, with help to guide the user through detailed execution of each task), but all felt more confident when using StatTools than when following the step-by-step instructions in the operational schedules. So, in this initial acceptance test, the software passed with flying colours.

  • Composite illustration showing the StatTools menu (top), from which the data manager has been selected in order to create a dataset from the underlying Excel sheet. The user is offered (lower left) a default choice, which can be accepted or, if more sophisticated datasetting is required, bypassed. The dataset manager window (bottom right) shows the created datasets - only one in this case - with tick list of contained variables.

 

A nice touch, worth mentioning and not universal amongst add-ins, is that you can choose if and when StatTools loads. You can, if you wish, set it always to load as a permanent add-in, but you can also use a separate launch icon. This second option means that the power is there when you want it, but doesn't slow the loading or clutter the toolbar when you (or a colleague) simply want to do some basic data management or balance the budget.

 

Dataset and variable management is a particular highlight of StatTools. The starting point is always the definition of a dataset - nothing else works until at least one such definition has been made. StatTools assumes, by default, that the dataset is coextensive with all data in the sheet, but waits for confirmation; you can define any number of datasets, either in blocks with a sheet or across sheets and workbooks, in rows as well as columns. (This seems a good moment to mention that StatTools comes in two flavours, with differences that you should investigate before making a choice. The copy I have been playing with is the 'Pro' version and can, for example, handle 13 million univariate cases, which circumvents Excel's irksome row limitation. The 'Standard' version is limited to 10,000). There is a high degree of intelligent assistance available, variable names and data locations being suggested as you work through from the data blocks found. Thereafter, instead of repeatedly highlighting ranges, you simply pick variables from a tick list each time you run an analysis. As far as import of data in the first place goes, you have access to whatever Excel can do - which means ODBC, formatted text, most spreadsheet interchange formats, and so on.

  • Composite showing StatTools creation of a univariate summary measures table (top left, results at right-hand side) and regression analysis (bottom left to centre).

 

Before moving on to statistical performance, it's worth mentioning human performance. A pitfall opens up whenever a powerful instrument is put into the hands of an inexperienced user; the instrument is, in general, no better than the user. This is not a StatTools issue; it applies across the spectrum, from pure Excel use up to top-of-the-range dedicated analytical software. One day, some sort of AI validity checking will probably become standard - checking data for 'believability' and checking with the user - but not yet. The result is that a user who taps each value individually into a pocket calculator, is more likely to spot anomalies than one who sucks data in from a spreadsheet or database, and feeds them directly to an analytical engine. Amongst my guinea pigs, this problem receded with experience and familiarity with context, but never really disappeared.

 

The 15-year-old, ploughing her way through an exam board spreadsheet containing details of students in a secondary school, was most susceptible to this problem. Decimal point errors resulting in IQs of 10,000, and boys 162 metres tall, went cheerfully into the mill, which dutifully processed them and delivered hilarious answers that defied common sense but looked good in the nicely formatted output. The social worker, by contrast, working with data that he had gathered himself, usually spotted anomalous results; but even then, not when a single error in a large input moved the mean only slightly in favour of his theories, for instance.

 

That, however, as I say, is not the fault of the particular product in use. Putting it aside, the statistical performance of the product is the other issue to be tested. There is no point blending perfectly into the host if nothing is added to that host's native capability.

 

Unlike some other add-ins, StatTools doesn't provide additional functions alongside, or on top of, Excel's own built-in statistics. It replaces them with its own, though (a little confusingly) this is sometimes under the same name and sometimes under another. This applies even to built-in functions such as Rand() and Stdev(), whose replacements are called from the external DLLs. This sounds slow, but since those DLLs are written in C++ rather than using macros (or whatever), there actually isn't any performance degradation in practice. I don't claim to have done industrial-strength tests to see how far StatTools' robustness goes, but I did run against known answers on some old problems known to break Excel's inboard functions, and StatTools showed itself still reliable at that level.

 

The function replacement goes further than just reliability; in some cases usage is modified too. This is in all cases a strength, as far as statistical usefulness is concerned, though it may be a marginal weakness for anyone used to using the native version. Most commonly used statistical procedures are covered within StatTools' repertoire. A spread of statistical procedures is clustered under command headings (three headings for normality testing; four serving time-series and forecasting; another four for quality-control use; five for summary graphs and measures; six for inference) backed up by a clutch of data-handling and management utilities. There is also the architecture and, with the Pro version, the tools as well, to allow addition of custom procedures written in-house or provided by other users. These are in Excel's VBA language, but can call on StatTools own C++ procedures and management tools, through the object-oriented interface bundled with the Pro version. Because functions are called through cell formulae, they are 'live' in the same way as other Excel content - update data within a range and the associated results recalculate to reflect the change, as do graphics. In fact, there is an extension in some cases of Excel's native abilities - StatTools replacements providing live update where Excel alone cannot. And in case you are thinking nervously of time overhead in large projects, don't worry; live updating can be turned on and off at will.

 

StatTools seems to use Excel's own reporting mechanisms. Output is in Excel format graphics (even where Excel doesn't support the plot type - true histograms, for instance) and tables. This doesn't matter from the point of view of accuracy (since the underlying calculations are being run by StatTools routines). It has the big advantage that resulting spreadsheets can be distributed for information dissemination purposes, though not for recalculation, to any Excel user, regardless of whether or not that user has StatTools in place. As long as the spreadsheet is still under StatTools control, updating remains live at the same level of reliability. Once the sheet is moved out into another copy of Excel, the output becomes static in computational terms but can still be manipulated in other ways - rescaled, edited, dragged and dropped elsewhere, linked into a word processor, and so on.

 

Handling of data and outputs is generally well-designed for the environment. Missing values are, in most cases, ignored (if you want to eliminate them, Excel's own data tools are up to the task), but in a few relevant settings (quality-control commands, for instance) are not allowed. Excel's cell comments are used extensively to provide discrete process documentation that is just a click away.

 

From an original position of purist distrust when I first encountered them, some years ago, I have come round to enthusiasm for statistical add-ins - so long as they are kept within their intended usage range. I have seen all the products mentioned here provide valuable service in settings where neither a spreadsheet nor a dedicated package would fit the bill. StatTools offers another avenue of value: its data management regime introduces the spreadsheet user, in a relatively painless manner, to structured thinking about data analytic and management processes. As a consultant, I work often with highly skilled people for whom data analysis is unfamiliar or built on shaky foundations; as an educator, with students for whom incidental data analysis is the terrifying barrier between them and their main goals. For both of these groups, add-ins are often a way forward. I have been deeply impressed, in this respect, by StatTools' apparent ability to act as a 'flight simulator' on which to develop data skills and thought processes in an unthreatening environment, while learning the main subject: preparing the user for future graduation (should it be necessary) to purely analytical software environments. I shall, over the next few months, be taking my copy into tutorials and seminars to see how that idea works out.

 

 

Only as strong as the weakest link

 

Clients to whom I recommend an add-in usually ask me 'why do I have to buy new statistical tools at all?' Excel, they point out, provides an add-in of its own, the Analysis ToolPak, which offers data analysis options not on the basic spreadsheet menus. They're right on the surface, of course; the ToolPak is convenient and useful, within its defined scope, but it doesn't actually offer anything new. Those extra commands that you now find at your disposal, are simply superstructures built from the underlying Excel kit, which you could access for yourself if you felt so inclined. You save time and effort, which might otherwise have been wasted on reinventing the wheel, but what you end up with is still the same wheel with the same inherent weaknesses. To switch metaphors, a chain is only as strong as its weakest link - which, in this case, is the limited nature and scope of Excel's statistical routines.

 

 

Standing alone

 

Unistat, another route to Excel-friendly analysis, is actually a very capable, freestanding, statistical analysis and visualisation program. You can use it on its own, and never go near Excel at all. It has its own engine, its own input and output capabilities and, in particular, its own worksheet and interface. It doesn't need Excel at all. The last two aspects, however, can be subordinated. Unistat can, using a code bridge, be made to hide itself away and relate to the user through a host application. The host can be almost any program at all. Unistat can latch onto anything for which the necessary connections have been written, from a Windows NotePad to a complete factory process-control system. Out of the box, however, it stands immediately ready to put on a Microsoft Office face; it can take its data from the Excel worksheet, process it, and return the numerical or graphical results to Excel, to Word, or to an web browser. The user sees only the host application, behaving as usual, apart from a toolset expansion added to the menu and tool bars. But all the work is being done behind the scenes by Unistat, with the host simply acting as a reassuring front-end for document management.

 

Xlfit, at the other end of the scale, concentrates on doing just one thing very well, very transparently, and exclusively within Excel. It takes ranges as variables and fits relational models to them - from a supplied library of types, or from new ones written (in very Excel-like formulae) by the user.

 

Like StatTools, and other add-ins worth their salt, each of these takes great care to avoid frightening the horses by showing its own face more than is absolutely necessary. Data is taken from existing spreadsheets and output placed in Excel tables and graphics, without troubling the user.

 

 

An extension example - StatTools versus native Excel quartiles

FORMULA Quartile(A1:A100,1)   INTERPRETATION Excel's own function.

FORMULA StatQuartile(A1:A100,1,-1)   INTERPRETATION Calculation based on input data.

FORMULA StatQuartile(A1:A100,1,0)   INTERPRETATION Continuous; equivalent to Excel's percentile function.

FORMULA StatQuartile(A1:A100,1,1)   INTERPRETATION Continuous; asymmetric endpoints interpolation.

FORMULA StatQuartile(A1:A100,1,4)   INTERPRETATION Continuous; symmetric endpoints interpolation.

FORMULA StatQuartile(A1:A100,1,2)   INTERPRETATION Discrete; closest observation.

FORMULA StatQuartile(A1:A100,1,3)   INTERPRETATION Discrete; empirical distribution function.

FORMULA StatQuartile(A1:A100,1,5)   INTERPRETATION Discrete; empirical distribution function with averaging.

Feature

Robert Roe reports on developments in AI that are helping to shape the future of high performance computing technology at the International Supercomputing Conference

Feature

James Reinders is a parallel programming and HPC expert with more than 27 years’ experience working for Intel until his retirement in 2017. In this article Reinders gives his take on the use of roofline estimation as a tool for code optimisation in HPC

Feature

Sophia Ktori concludes her two-part series exploring the use of laboratory informatics software in regulated industries.

Feature

As storage technology adapts to changing HPC workloads, Robert Roe looks at the technologies that could help to enhance performance and accessibility of
storage in HPC

Feature

By using simulation software, road bike manufacturers can deliver higher performance products in less time and at a lower cost than previously achievable, as Keely Portway discovers