Stats for the million
A significant proportion of my consultancy work boils down to maximising the validity and quality of statistical work by lay staff or volunteers, while minimising the impact of inevitable errors. Off duty, my inbox is rarely without a request for informal guidance or reassurance on a piece of data analysis in which the sender lacks confidence.
Statistics is a vital, all-pervasive tool; using it is comparable to driving a car, and is often done with as little detailed knowledge of what goes on under the bonnet. In the real world, whether we statisticians like it or not, an overwhelming majority of statistically based decisions and statistically oriented actions are taken by non-statisticians.
Having made such a sweeping statement, I should take a moment to define terms. What, exactly, is a non-statistician?
There is no single, clear-cut answer to that, which is why I used the alternative description ‘lay staff’ in my first paragraph, above. Individuals with the same level of expertise will place themselves on opposite sides of the line. In one market segment with which I am very familiar, professionals who are hazy about the difference between mean and mode are counted as statisticians and make million-euro decisions. In another, science graduates whose transcripts include all the usual statistics courses claim to be completely baffled by the subject.
In the 24 hours before writing this, for example, I have advised a mathematical physicist (that taking the arithmetic mean of percentage data is not a good plan), a dietician investigating health impacts of recession, a literary theorist wrestling with contingency factors in a comparative semiotic analysis, and a group of conservation volunteers looking for patterns in a sprawling set of ad hoc arboreal data. None of these people, advanced in their own fields, are statisticians.
Throughout the professional literature, particularly in the medical and life sciences, there is frequent reference to this reality. Discussing the widespread use of polytomous logistic regression models in cancer case control studies earlier this year, for example, Xue and others point out that ‘the validity, accuracy, and efficiency of this approach for prospective cohort studies have not been formally evaluated’, and later in the same paper refer to ‘SAS and S-plus programming codes... provided to facilitate use by non-statisticians’.
None of this should be seen only as a problem. Any healthy organisation seeks to draw upon all the strengths of all its human components, not just those defined by particular labels. One of the most inspiring examples I have personally encountered, at the opposite end of the spectrum from my examples so far, is a medical research worker whom I will refer to here as Carla.
Having failed to graduate from high school, Carla started work as a beautician’s apprentice in a mortuary. Reasonable typing speed shifted her into a temporary clerical job within a medical school, and hard work made the job permanent. But her progressive promotion, despite lack of any formal training or qualification, to research analyst was driven by a talent for intuitively interpreting experimental data passing across her desk. Carla doesn’t make any final research decisions, but from her instinctive feel for data (and with the support of good software) she provides an extraordinarily high proportion of successful initial leads for those who do.
Carla is a deliberately extreme example, of course, but far from unique – and illustrates a universal principle.
What, then, to do? How should an organisation or individual best approach this reality – in which an essential tool of all scientific work is, for the most part, wielded by non-specialists? The answer, unsurprisingly, varies from case to case.
Publishers of data analytic and visualisation software are, of course, aware of all this. While they generally don’t say it in so many words they do, without exception, make great efforts to bridge the gap in practice. In various ways, they provide responsible and (so far as is possible) robust support for inexpert users and then back it up with the means for those users to gradually increase their expertise through experience. That is probably the biggest source of statistical education, and an important part of any serious attempt to place statistical work on a solidly productive foundation within an organisational environment.
Provision of facilities doesn’t mean that they will necessarily be used, however. I recently took a straw poll of non-statisticians using Minitab, and none of them had ever looked at, nor even been aware of, the power and sample size submenu. Sample design is a crucial aspect of good statistical work, but is notoriously under-considered by non-statisticians. Some years ago, an interesting lesson emerged when Statistical Solutions donated a copy of nQuery Adviser to a research project run by a group of mature research students; the statistical design quality of their work immediately improved. This seemed to be a psychological result: an unglamorous facet had been promoted in status by the arrival of a separate, dedicated program to service it. Subsequent evolution has seen nQuery Adviser, already an excellent tool, enhanced by combination with nTerim, a merger reported to be fully integrated in the forthcoming upgrade due for market around the time you read this.
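To show the kind of arithmetic such a submenu hides, here is a minimal sketch of a sample size calculation for a two-sample t-test, using the standard normal approximation. The effect size, significance level and power figures are purely illustrative, and this is a textbook formula rather than what any particular package implements.

```python
# Normal-approximation sample size for a two-sample, two-sided t-test.
# Effect size (Cohen's d), alpha and power values are illustrative.
from statistics import NormalDist
from math import ceil

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate number of subjects needed in each group."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

print(n_per_group(0.5))  # about 63 per group for a 'medium' effect
```

Seeing how quickly the required numbers grow as the effect size shrinks is often the moment the importance of sample design sinks in.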
For people like Carla, my intuitive medical data jockey, there are two good approaches that happen to complement each other nicely. One (as Carl-Johan Ivarsson of Qlucore points out, see box: Five steps to Eden) is the use of visualisation to explore data for interesting features. All data analytic software offers this, of course, but for the non-statistician a dedicated plotting or visualisation package may be more accessible. The other is a set of clear guidelines, or even black-box accept/maybe/reject decision zones on statistical values – the best-known example, and a good model for adaptation, being the process control chart. The careful provision of sensible defaults in software – non-statisticians being the least likely to set these parameters for themselves – is a variation on this.
Exploring different visualisation approaches with non-specialists, from mathphobic pre-degree students to highly skilled professionals, I’ve found that the best tool for the same task often varies not with the level of competence but with individual temperament. Preference for OriginPro versus SigmaPlot, for example, seems to correlate with different general mindsets, which suggests that it is important for users to experiment with a variety of available options, making full use of the trial copies which are usually available.
An interesting (and relatively recent) entrant to this market segment is MagicPlot, which blends commercial and free software to good effect. This is especially effective in encouraging productive approaches in groups with both lateral and vertical structures: small teams of technical, secretarial and administrative staff, for example, or academic staff, technicians and students, working together in exploratory conversation. That ties in with the importance of support networks, to which I’ll return later.
Carla, as it happens, uses both OriginPro and SigmaPlot for initial visual exploration, then tests her discoveries progressively in the analytic facilities of both packages before progressing, for their extended facilities, to Unistat (chosen for what she describes as ‘its straightforward, no-nonsense structure’ and for its particularly close integration with Excel, in which her data is supplied) and Minitab. Her recommendations are based on cross comparison of her results in all four environments. Her working methods, observed objectively, show interesting informal parallels with the more structured approach outlined by Qlucore’s Ivarsson.
Putting aside specific software considerations, what general approaches can be recommended?
A researcher commented last year in Advances in Nutrition that: ‘An important issue with epidemiological studies is the inaccessibility of the data to reader analysis. Data is so heavily processed through multiple layers of mathematical filters that results are intractable to non-statisticians; conclusions must be accepted on faith.’ This is widely true, with two aspects which need to be considered. From a consumer viewpoint, there is the snare which caught my mathematical physicist mentioned above: the lesson is to always work from raw, unmanipulated data (which percentages, for example, are not). At the producer end of the process, always supply the raw data from which your analysis was built, so that others can check your process and help you correct any misunderstandings.
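The percentage snare mentioned above is easy to demonstrate. In this illustrative sketch, two successive changes of +50% and −50% have an arithmetic mean of zero, which suggests no net change, yet the raw values show a 25% overall loss:

```python
# Why the arithmetic mean of percentage data misleads: the mean of the
# changes says "no net effect", while the raw values disagree.
# Figures are illustrative.
changes = [0.50, -0.50]            # +50% then -50%

arithmetic_mean = sum(changes) / len(changes)
print(arithmetic_mean)             # 0.0 -- suggests no net change

value = 100.0                      # the raw data tells a different story
for c in changes:
    value *= 1 + c
print(value)                       # 75.0 -- a 25% overall loss
```

This is exactly why working from raw, unmanipulated data matters: percentages have already been through one layer of processing, and summary statistics computed on them can quietly lie.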
Following from that is the old advice often abbreviated to KISS: ‘keep it simple, stupid’. Don’t apply more esoteric methods than you need, and stick with those you are confident you fully understand. One professional proudly brought me a good sixth-order polynomial fit to his experimental data; there are cases where such a fit is appropriate, but they need to be examined carefully for suitability. In this case, a high enough order curve was bound to fit eventually, but the points in his small sample were actually random, with no association whatsoever.
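The trap is a matter of counting: a sixth-order polynomial has seven coefficients, so it will pass essentially exactly through any seven points, pattern or no pattern. A hypothetical re-run of that incident, with deliberately random synthetic data:

```python
# A degree-6 polynomial is guaranteed a near-perfect fit to seven
# points -- even when the points are pure noise.
import numpy as np

rng = np.random.default_rng(seed=1)
x = np.linspace(0.0, 1.0, 7)
y = rng.random(7)                      # no real relationship at all

coeffs = np.polyfit(x, y, deg=6)       # seven coefficients, seven points
fitted = np.polyval(coeffs, x)

ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(r_squared)                       # effectively 1.0
```

An apparently perfect R² here tells you nothing about the data and everything about the number of free parameters, which is why the flexibility of a method needs to be weighed against the size of the sample.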
While raw data should be your ideal starting point, a simple transform may help you to see what is going on within it. Peter Vijn of BioInq consultancy suggests one likely candidate (see box: Go lognormal) and most data analysis software offers a selection which can be applied at the click of a mouse. Remembering my KISS advice, don’t apply transforms just for the sake of it – but do explore carefully whether one of them might be useful in your particular setting.
Try to find people who will be your personal support network. Not consultants like me who will charge you for advice, but colleagues, friends, acquaintances who are happy to discuss aspects of your work. Carla has a team leader who knows how to provide her with clear parameters for assessing the usefulness of exploratory discoveries to her department. She also has an informal support line to a statistics professor in another university who is willing to help her develop data analysis ideas through their keen shared amateur interest in archaeology.
Putting that all together brings me to development paths. Start with those techniques which you understand best, but then seek both to deepen and extend your understanding. Your statistical software has a wealth of guidance material hidden within its help system; some of it (perhaps even most of it) may seem like gibberish, but look at the sections which build upon what you already know, and gradually you will find increasing areas of fog becoming clear. Minitab, which originated as a teaching system, has served Carla particularly well in this respect; Statsoft (publishers of Statistica) offer an online statistics textbook as well. Try applying what you learn to old, already solved problems in your own area, to see how they respond without any pressure to get it right. Discuss what you are learning with your support network, and you’ll progress all the faster.
Keep it simple, start from what you know, look for ways to gradually build a carefully expanding repertoire of techniques. Welcome the willingness of others to help you. Seek reliable guidelines against which your results can be tested. Make the most of both your analytic software and the wealth of helpful support which it supplies. Follow all of those guidelines, and you should be able to do great statistical things without being a statistician.
References and Sources
For a full list of references and sources, visit www.scientific-computing.com/features/referencesjun13.php
Carl-Johan Ivarsson, president of Qlucore
Sheer size of data sets is a common problem in science. The human brain is very good at detecting structures or patterns, and active use of visualisation techniques can enable even the non-statistician to identify them very quickly with instant feedback as results are generated. At Qlucore, we recommend a five-step approach to ensure repeatable and significant results.
High dimension data should first be reduced to lower dimensions for 3D plotting, usually using principal component analysis. Data colouring, filters, and tools to select and deselect parts of the data set also help to bring out information.
Step one: detect and remove the strongest signal in the dataset. This allows other obscured signals to be seen, and also usually reduces the number of active samples and/or variables.
Step two: measure strength of visually detected signals or patterns by examining variance in a 3D PCA-plot compared to what would be expected with random variables, giving a clear indication of the identified pattern’s reliability.
Step three: if there is significant signal-to-noise ratio, remove variables most likely contributing to the noise.
Step four: apply statistical tests to any or all of the other stages of the five-step process.
Step five: use graphs to refine the search for subgroups or clusters. Connecting samples in networks, for example, can move you into more than three dimensions, providing more insight into data structures.
Repeat all steps until no more structures are found. Used this way, visualisation can be a powerful tool for researchers, without having to rely on statistics or informatics specialists.
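The initial dimension-reduction step can be sketched in a few lines of numpy: centre the data, take a singular value decomposition, and project onto the first three principal components, which is what a package does before drawing a 3D PCA plot. The synthetic data and planted group structure here are illustrative, not Qlucore's implementation.

```python
# Minimal PCA sketch: project high-dimensional data onto its first
# three principal components for 3D plotting.
import numpy as np

rng = np.random.default_rng(seed=0)
data = rng.normal(size=(100, 50))        # 100 samples, 50 variables
data[:50] += 3.0                         # plant a two-group structure

centred = data - data.mean(axis=0)       # PCA requires centred data
U, s, Vt = np.linalg.svd(centred, full_matrices=False)
scores = centred @ Vt[:3].T              # 3D coordinates for plotting

explained = (s[:3] ** 2).sum() / (s ** 2).sum()
print(scores.shape)                      # (100, 3)
print(explained)                         # share of variance captured
```

Plotting `scores` would show the two planted groups as separated clouds, the kind of ‘strongest signal’ that step one says to detect and remove first.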
Peter Vijn, data consultant at BioInq
A rewarding generic step in data analysis is to use an appropriate data transformation directly after data acquisition – a subject not usually covered in basic statistical courses or textbooks, frequently causing a communication gap between statisticians and non-statisticians.
Of all possible transformations, the logarithmic is the true workhorse, effectively switching to the lognormal probability density function. While this only works for positive-valued and non-zero data, most real data complies with it. If there are zero-valued data points, replace them with the smallest non-zero value that your instruments can detect.
The lognormal distribution often gives a superior fit, especially if the coefficient of variation is large. Create a dataset from the number of words in the email messages in your current mailbox, calculate the mean and standard deviation, and plot the resulting (mis)fit with the normal curve. Now plug in the logarithmic transformation, do the same, and look at the great fit.
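Vijn's mailbox experiment can be simulated in a few lines. With data drawn from a lognormal distribution (the parameters here are illustrative), the raw values are strongly right-skewed, while taking logs brings the skewness close to zero, the value expected of a symmetric, normal-like distribution:

```python
# Simulated version of the email word-count experiment: lognormal data
# is badly skewed raw, and close to symmetric after a log transform.
import numpy as np

rng = np.random.default_rng(seed=42)
word_counts = rng.lognormal(mean=4.0, sigma=1.0, size=5000)

def skewness(x):
    """Sample skewness: zero for a perfectly symmetric distribution."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

print(skewness(word_counts))          # strongly right-skewed
print(skewness(np.log(word_counts)))  # roughly zero after the log
```

The same before-and-after comparison on your own mailbox data makes the point far more memorably than any formula.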
This simple recipe also addresses another persistent and often overlooked problem: heteroscedasticity (unequal variances across groups), which causes interpretation problems in most statistical tests.
The best thing about the log transformation is that it prepares large dynamic-range datasets to meet the smaller-range assumptions of many analyses. The only thing you have to get used to is that confidence regions transformed back to the original measurement scale will be asymmetric, but even that is a natural consequence if you realise that data cannot cross the zero line.
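That asymmetry is easy to see numerically. In this illustrative sketch, a symmetric 95% interval built on the log scale becomes lopsided, but always positive, when exponentiated back to the original scale:

```python
# A symmetric interval on the log scale back-transforms to an
# asymmetric one on the original scale. Summary figures illustrative.
import math

log_mean, log_sd = 2.0, 0.5                # summary stats on the log scale
low = math.exp(log_mean - 1.96 * log_sd)   # about 2.8
high = math.exp(log_mean + 1.96 * log_sd)  # about 19.7
centre = math.exp(log_mean)                # about 7.4

print(centre - low, high - centre)         # unequal arms of the interval
```

The lower arm is short and the upper arm long, exactly because the back-transformed interval is being squeezed against the zero line from below.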