# Testing times

Statistical testing need no longer be difficult or time-consuming, according to Felix Grant

Peter Elbow, charismatic professor emeritus of English at the University of Massachusetts Amherst, uses a metaphor for the writing process that can equally be applied to many an intellectual endeavour – including the work of a statistician. He likens it to a journey of discovery, in two phases. The first requires abandonment of preconception in order to perceive the new: ‘getting lost is the best source of new material’, he urges. The second calls for disciplined, objective interrogation of that new material, ‘a process of convergence, pruning’.

For the statistician, the romantic phase, the voyage out, consists of doing all sorts of cool stuff. Wading into a brand new datascape, compass in one hand and notebook in the other, our intrepid explorer discovers fresh wonders at every step. A rich seam of correlations here, a riotous tangled undergrowth of isopleths there. And is that really an astonishing range of complex curves marching away over the horizon into the misty distances at the far limit of vision?

For the statistician, as in most other areas of work, this glamour stage is by far the smaller part of the whole enterprise. In the second stage, each new discovery must be validated, subjected to unemotional examination. Many of those correlations will be nothing more than shallow chance markings: isopleths will, when untangled, turn out to be unremarkable; and many of those curves, when closely approached, will not be visionary mountains but clouds already drifting away on the wind. Not all, however; for all its broken dreams, this is still a beautiful data-set that will yield enough valuable nuggets to justify the explorer’s efforts and the backers’ funding. This second stage, the separation of valuable from illusory, Elbow’s process of pruning, is where the bulk of the statistician’s day-to-day bread and butter work is done: and this is statistical testing.

But the good news is that, these days, this dichotomous image of practice is becoming out of date. In statistics, at least, scientific computing tools mean that the second stage has collapsed into the first. The explorer now carries the powerful, lightweight statistical testing instruments along on the voyage of discovery, applying them to each new observation in real time, immediately discarding the clearly spurious and carrying away only the promising. Even a notebook computer is no longer necessary; valuable initial field testing of hypotheses is increasingly being done through a smart phone.

The best guardians of the holy testing grail are statistically sophisticated researchers in their own fields. Most advances in statistics, and most useful statistical tests, have been developed not by pure statisticians but by researchers who needed new tools for their own professional purposes. Agricultural research is the cradle of modern statistics, a fact reflected in the deeply flexible power of Genstat which originated as a spin off from the seminal work of Ronald Fisher at Rothamsted research centre. I’ll be quoting, a little later, a current example of methodology evolved in phylogenetic systematics.

Nevertheless, the limiting factor in actually deploying all the new portable statistical computing power lies where it has always lain: not in the available tools, but in the levels of confidence and perceptiveness which researchers bring to the statistical tests they use. Anyone who regularly teaches statistics to those in other disciplines (not, in other words, to statisticians but to those who will use statistics in their own adventures) soon becomes familiar with the range of student facial expressions from joyous wonder, through baffled determination not to be beaten, to revulsion or outright fear and loathing.

Some years ago, a highly talented undergraduate student (who has since achieved great things in her chosen specialism) scrawled across the bottom of a statistical testing assignment ‘Help – I feel like a jellyfish grappling with a cloud!’ At the other end of the scale, mathematicians often, and equally understandably, dislike statistics for its frequent pragmatic use of distribution models which are ‘good enough’ rather than true.

The challenge for publishers of statistical software, and one which they do a sterling job of meeting, is to provide for both constituencies. Those who regard statistics as significantly less enjoyable than a visit to the dentist nevertheless need robust ways to do valuable work. Those who have found a true feel for the nature of the testing process itself need ways to wade in and plough their own furrow, guided by their own professional compass. These two extremes are, in turn, reflected in two poles on the continuum of usage, though there are of course many shades between. How software publishers deal with this varies, often in line with the dominant psychology of their primary user-base.

In engineering production, ‘go, no go’ gauges are a common tool: pieces of metal designed to test whether a component falls within defined dimensional tolerances. An operator turning out valve-seatings with a nominal internal diameter of 50mm and a tolerance of ±1 per cent, for example, will have a gauge of which one end will pass through a 49.5mm or larger hole (‘go’) while the other will not pass through one smaller than 50.5mm (‘no go’). If either test fails, production is halted and the machine recalibrated. No consideration is given to how close the component is to failing or passing the test; only whether it passes or not. There is, of course, no fundamental law of the universe which says that a diameter of 49.49mm or 50.51mm suddenly and catastrophically becomes unusable; the tolerances have simply been selected by somebody, somewhere, as the best real world compromise between competing risk and penalty gradients.

This ‘go, no go’ gauge is an exact analogue of the role which statistical tests usually play in hypothesis evaluation: the hypothesis being the part to be tested; tolerance limits representing the significance level selected for the particular test and purpose.
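The analogy can be made concrete with a minimal Python sketch. The function names, the 5 per cent default and the sample p-values are illustrative assumptions rather than anything prescribed by a particular package; the gauge limits are the 50mm ±1 per cent figures from the valve-seating example.

```python
def gauge_passes(diameter_mm, nominal_mm=50.0, tolerance=0.01):
    """Go/no-go gauge: True if the bore lies within nominal +/- tolerance."""
    half_width_mm = nominal_mm * tolerance  # 0.5mm for the valve-seat example
    return abs(diameter_mm - nominal_mm) <= half_width_mm


def significant(p_value, alpha=0.05):
    """Significance test used as a gauge: True if p falls below alpha."""
    return p_value < alpha


# Hypothetical measurements and p-values, chosen to sit just either side of the limits.
for d in (49.49, 49.51, 50.51):
    print(f"{d:.2f}mm -> {'go' if gauge_passes(d) else 'no go'}")
for p in (0.049, 0.051):
    verdict = "reject" if significant(p) else "do not reject"
    print(f"p = {p:.3f} -> {verdict} the null hypothesis")
```

As with the physical gauge, the code records nothing about how near each value came to the limit; only the verdict survives.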

The cautious user corresponds to the shop-floor lathe operator, using standards set by others, using only a few tests (possibly only one) which will usually be classical frequentist (see box: Beyond the green Bayes door). More often than not, the test will be defined by the user’s own background or the norms of her/his team or wider professional environment: ‘nobody here ever got sacked for doing a t-test’, one company laboratory employee told me, grimly, when I suggested that an alternative might be more appropriate. I know of many researchers who have taped to the wall above their desks a test selection flow chart published by SPSS, even though they actually work in a rival package. With the test decided, the level of significance is also fixed: usually at 5 per cent, simply because that has become the widely and consensually accepted default norm.

The more proactive statistical hypothesis tester, by contrast, is analogous to the ‘somebody, somewhere’ who set the tolerances for that valve-seat in the first place. Assessing the interplay of parameters and implications, this user evolves a decision on the best methods and assumptions to meet the requirements of the task at hand.

There are, of course, infinite shades of user between, using standard approaches for much of their time but willing to stretch boundaries to a greater or lesser extent on occasion.

Broadly speaking, statistical software publishers address this range of user adventurousness in the same way. They provide a range of tests which usually expands, upgrade by upgrade, over time. For each test they pre-set defaults which reflect majority practice, but provide controls (by a variety of means) for temporarily or permanently altering settings. Deciding the emphasis placed on the two sides of that balance, and the prominence given to one side or the other, is not an easy task and typically reflects the publisher’s assessment of the product’s dominant market profile.

Unistat is a perfect example of a product aimed at providing the statistical explorer with a balanced, easy to use, broad-spectrum toolkit, encumbered by as few preconceptions as possible about how it will be used. Its latest release at the time of writing offers just over 40 headings spread under eight broad categories – one category and one heading being carefully selected new additions since the previous release. It can run alone as a dedicated, freestanding data analysis application for the greatest user autonomy, hide itself within Excel for maximum convenience, or be tailored on demand to colonise any other data-handling environment. Results of tests are reported as probabilities, not as decisions, leaving the user to decide the criteria against which they should be judged. Running a paired t test on a chemical composition database here, for example, first reports one and two tail probabilities in the region of 3.9×10⁻⁵ and 7.8×10⁻⁵ respectively, then follows up with the commonly used lower and upper 95 per cent values as additional information.
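For readers who want to see what lies behind such a report, here is a minimal sketch of a paired t test that returns probabilities rather than a decision. It is not Unistat's implementation: the function names and the paired measurements are hypothetical, and the tail area is obtained by simple numerical integration of the Student's t density using only the Python standard library.

```python
import math
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_upper_tail(t, df, steps=2000):
    """P(T > |t|), using Simpson's rule from 0 to |t| (steps must be even)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def paired_t_test(before, after):
    """Return (t statistic, one-tail probability, two-tail probability)."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    # Assumes the differences are not all identical (stdev would be zero).
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    one_tail = t_upper_tail(t_stat, n - 1)  # tail beyond the observed |t|
    return t_stat, one_tail, 2 * one_tail


# Hypothetical paired measurements, for illustration only.
before = [12.1, 11.8, 12.6, 12.0, 11.9, 12.4, 12.2, 12.3]
after = [12.9, 12.5, 13.1, 12.6, 12.8, 13.0, 12.7, 13.2]
t_stat, p_one, p_two = paired_t_test(before, after)
print(f"t = {t_stat:.3f}, one-tail p = {p_one:.3g}, two-tail p = {p_two:.3g}")
```

The two-tail probability is simply double the one-tail figure, which is the same relationship visible in the pair of values quoted above. What threshold to hold those probabilities against is left to the user.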

In practice, this approach is remarkably successful across a wide range of contexts. Over the years, I’ve trialled Unistat with an increasing range of client groups. Scientifically literate, but statistically and mathematically inexperienced, lay volunteers on field projects took to it like a duck to water. Statistically sophisticated professionals, used to heavyweight command-line languages, quickly learned to appreciate its quick and incisive yield of essential information on specific tasks with minimal effort. Mathphobic humanities undergraduates, who would have preferred a graphics package, nevertheless came for the first time to understand through its clear presentation of results how statistical decisions are made. A charity director, who had steadfastly refused to reconsider lifelong practices that were ruining the effectiveness of his organisation, underwent a Damascene conversion to a new approach after being stepped through its clear comparative analysis of outcomes.

There are, nevertheless, other market segments. Many other Excel symbionts offer a much more restricted selection of tests, catering only to the half dozen or so which are most commonly remembered from undergraduate stats survival courses within other study programmes. They do so, generally, not out of laziness or economy, but because they know that a greater choice would intimidate their target audience. Similar reasoning leads them not to offer any option for operating outside Excel.

I recently spent a week working with a disaster relief paramedic who was comfortable with exactly such an add-in package. He could, in theory, have greatly extended his reach with a product which gave him space to spread his analytic wings but, realistically, he would probably not have done so: in fact, he would probably have achieved less. He is doing valuable, life-saving work with his basic tool set, undistracted by choices he would not have felt competent to make after I left.

Then there are technical graphics packages, primarily for visually presenting data rather than analysing it, which offer a couple of the most used tests. The product more widely used in scientific research than any other, in fact, falls into this category. I’m currently coaching a philologist in how to make the most of just two tests included in his chosen working environment of this type, and he is opening up genuinely new areas of insight as a result.

In both those cases, the publisher has judged its users’ needs well. The tests are quicker, easier, more intuitive, less error-prone and of higher analytic quality than would be the case in a generic spreadsheet, without causing unnecessary learning-curve stress.

Moving in the other direction, most of the industry big names offer far more choice, and far more control, than any individual user is ever likely to utilise. Their target audience is assumed to have at least the knowledge and confidence for plucking out of the cornucopia just those specific routines (whether they be well known or obscure) most appropriate for their particular task. They are also assumed to be capable of judgements on either making, or refraining from, decisions about the fine-tuning of every aspect of the tests selected for use. At the same time and place as my paramedic with his enhanced spreadsheet, an epidemiologist, a nurse, and an administrator were using completely different aspects of Statistica (and completely different levels of control) to jointly work out a way to extend the effective coverage of medicines in short supply.

Statistica, of course, like many of these heavyweights, is actually an analytic language with a graphical user interface overlaid on top of it. Scratch the surface of these, and you are down to command-level control. Nor is the choice restricted to dedicated statistics languages. A lot of heavy duty statistical hypothesis testing is done in Matlab. Highly elegant solutions are developed in MathStatica, the specialist third party mathematical statistics extension for Mathematica. The language chosen is not irrelevant to the work that can be done, but nor are tests ultimately dependent upon it. My favourite recent example of extension to statistical testing literature is a paper[1] on a refinement in topological hypothesis testing of phylogenetic trees – a discussion that requires no particular software or language approach, but can be implemented in any which permits the necessary subtlety of control.

The most widespread standard choice for such development work is R, and the proprietary language product most prominently associated with practitioner development is VSNi’s Genstat. Much of Genstat’s powerful repertoire is the product of a virtuous circle in which routines generated from its base language by enthusiastic users are taken up by others in the same community, then incorporated back into the language. It’s no coincidence that Genstat (and its stablemate ASREML) have evolved to call R tools as well. A similar, if less ebullient, process can be found elsewhere. If you want to extend your statistical testing range, rummage around under the bonnet of every language-based product you can find.

Whatever your final selection of scientific computing environment for statistical testing, the important thing is (as always) to make the best choice for your own particular needs. You want something that will speed and enhance your exploration, not something that will either weigh you down or prove unequal to the task. Choose it wisely, and it will make your adventure everything that you could possibly wish for.

*References*

*1. Bergsten, J., A.N. Nilsson, and F. Ronquist, Bayesian tests of topology hypotheses with an example from diving beetles. Syst Biol, 2013. 62(5): p. 660-73.*

## Beyond the green Bayes door

Approaches to statistical testing of hypotheses fall into various categories, most prominently frequentist or Bayesian (although those are not the only options), according to the assumptions made about probabilities. Bayesian tests are appropriate, especially in ongoing studies, more often than they are actually used.

Classical testing is the most literal expression of my ‘go, no go’ industrial engineering analogue. Based on a so-called frequentist view of events, it treats a test result either as a binary pass/fail decision or as a statement that the true value of a variable lies within a confidence interval derived from study of a sample. Frequentist tests usually (though not invariably) also assume that unknowns have fixed values.

Bayesian testing, on the other hand, is more likely to see test results in terms of probability distribution and to credit unknowns with probabilistic variation. It starts with a ‘prior probability’ and a ‘likelihood function’ and derives a consequent ‘posterior probability’, updating the probability estimate for a hypothesis as evidence accumulates.
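The updating step can be sketched in a few lines of Python. This is a minimal illustration of Bayes' rule for a single hypothesis H against its negation, not a recipe from any particular package; the function names, the starting prior of 0.5 and the likelihood values are all assumptions chosen for the example.

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Return P(H | evidence) from P(H) and the two likelihoods of the evidence."""
    numerator = prior * likelihood_if_true
    evidence = numerator + (1.0 - prior) * likelihood_if_false
    return numerator / evidence


def sequential_posteriors(prior, observations):
    """Update after each (likelihood_if_true, likelihood_if_false) pair in turn."""
    history = []
    belief = prior
    for lik_true, lik_false in observations:
        belief = bayes_update(belief, lik_true, lik_false)
        history.append(belief)  # each posterior becomes the next prior
    return history


# Hypothetical study: every observation is twice as likely if H is true (0.8 vs 0.4).
posteriors = sequential_posteriors(0.5, [(0.8, 0.4)] * 3)
for step, p in enumerate(posteriors, start=1):
    print(f"after observation {step}: P(H) = {p:.3f}")
```

Each posterior feeds back in as the prior for the next observation, which is why the approach suits ongoing studies: the estimate is revised as evidence accumulates rather than settled by a single pass/fail verdict.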

In practice, these are not as separate as I make them sound here; the best testing practice will often use mixed methods to get the most appropriate solution for the purpose to hand – but this does require a level of confidence in one’s own statistical grasp and application which not everyone may feel.

Nor are these two the only candidates: decision theory is important, and there is current research which resurrects Fisher’s fiducial approach. Frequentist and Bayesian outlooks nevertheless dominate the applied statistical testing landscape.