14 August 2012
Reviewed by Felix Grant
Missing data values are the bane of a statistician’s life. At best, when the distribution of missing values is truly random, they reduce sample size, which has a knock-on effect on conclusions and confidence. At worst, when there is an extensive pattern of systematic omissions, they can render a whole data set completely useless.
At the same time, attempting to fill in (or, in the language of the trade, impute) the missing values is a fraught business, full of pitfalls. There is an ever-present danger of building assumptions into the data which then resurface in the analyses.
Very often, however, the benefits of imputation in retrieving at least part of a study for useful non-content-robust analysis outweigh purist qualms. The question is then how to go about the process in the best way, and this is where Solas comes in. Irish company Statistical Solutions has a good track record of providing focused tools for this type of important but under-addressed problem, and Solas is no exception.
A variety of methods are provided for both single (group means, hot deck, last value carried forward and predicted mean) and multiple (predictive model, propensity score, predictive mean matching, Mahalanobis and an option to combine the last three) imputations. Graphics, tabulation, and so on, back up all of this.
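To make two of the single methods named above concrete, here is a minimal Python sketch of group-mean imputation and last value carried forward. It illustrates the general techniques only, not how Solas implements them; the function names and the toy data are invented for the example.

```python
from statistics import mean


def impute_group_means(values, groups):
    """Replace each None with the mean of the observed values in the same group."""
    observed = {}
    for value, group in zip(values, groups):
        if value is not None:
            observed.setdefault(group, []).append(value)
    group_mean = {group: mean(vals) for group, vals in observed.items()}
    # A group with no observed values has no mean, so its gaps stay as None.
    return [
        group_mean.get(group) if value is None else value
        for value, group in zip(values, groups)
    ]


def impute_locf(series):
    """Replace each None with the most recent observed value before it."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)  # stays None until a first value has been observed
    return filled


# Toy data, made up for illustration only.
scores = [4.0, None, 6.0, 10.0, None]
arms = ["a", "a", "a", "b", "b"]
print(impute_group_means(scores, arms))
print(impute_locf([1.2, None, None, 1.9, None]))
```

Both methods fill every gap with a single fixed value, which is why they tend to understate the uncertainty in whatever is calculated afterwards.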
A full gamut of controls is available, and rightly so, though making good use of them is a matter not only of knowledge, understanding and experience, but of a degree of intuition as well. That being so, it’s good to see that sensible settings are in place and ready to run without touching anything.
By default, the multiple methods generate five imputed data sets, but the number can be set as high as 50. This is important because, while efficiency levels off rapidly above five, statistical power continues to gain considerably as the number of sets grows. In an ideal world, it would be nice to see the upper limit raised to 100, but that would considerably increase the computational load, and 50 is, in practice, a sensible compromise.
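The levelling-off of efficiency can be illustrated with Rubin's standard approximation for the relative efficiency of an estimate based on m imputed sets, 1 / (1 + gamma / m), where gamma is the fraction of missing information. The sketch below assumes that formula and an arbitrary gamma of 0.5; neither the function name nor the value comes from Solas.

```python
def relative_efficiency(m, missing_info):
    """Rubin's approximate efficiency of m imputations relative to infinitely many.

    missing_info is the fraction of missing information (gamma), between 0 and 1.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    return 1.0 / (1.0 + missing_info / m)


# gamma = 0.5 is an assumed value chosen purely for illustration.
for m in (3, 5, 10, 20, 50):
    print(m, round(relative_efficiency(m, 0.5), 3))
```

With half the information missing, five sets already recover about 91 per cent of the attainable efficiency and 50 sets about 99 per cent, so the case for more sets rests on power and stability rather than on efficiency.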
In the majority of cases, multiple imputation is far and away the best bet, but it requires each analysis to be run on every one of the imputed data sets and the results then combined, which is a far from trivial task. Fortunately, Solas automatically does this for you in any analysis you choose from its menu.
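For readers curious about what that combination step involves, the usual approach is Rubin's rules: average the per-set estimates, then add the average within-set variance to an inflated between-set variance. The following is a minimal sketch of those rules for a single scalar estimate; it is not taken from Solas, and the function name and example numbers are invented.

```python
from math import inf, sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one scalar estimate across m imputed data sets using Rubin's rules.

    estimates: the estimate from each imputed set
    variances: the squared standard error from each imputed set
    Returns (pooled estimate, total variance, degrees of freedom).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations, one variance per estimate")
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 divisor)
    total = within + (1 + 1 / m) * between
    if between == 0:
        df = inf                   # no disagreement between the imputed sets
    else:
        df = (m - 1) * (1 + within / ((1 + 1 / m) * between)) ** 2
    return q_bar, total, df


# Made-up regression slopes and squared standard errors from five imputed sets.
slopes = [0.52, 0.48, 0.55, 0.50, 0.47]
squared_errors = [0.010, 0.012, 0.011, 0.009, 0.010]
estimate, total_var, df = pool_rubin(slopes, squared_errors)
print(estimate, sqrt(total_var), df)
```

Doing this by hand for every coefficient of every analysis is exactly the drudgery that an automatic combination step removes.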
As a test of how useful Solas actually is in practice, I took an existing 14-variable data set of 65 cases, with no missing values, and calculated (in another data analysis package) a selection of the standard descriptors, tests and regressions available within Solas. Progressively eliminating data values, first randomly and then systematically in a variety of ways, gave a selection of simulated sets with between 35 and 85 per cent remaining. After each elimination, the same descriptors, tests and regressions were calculated again. Solas was then asked to impute the missing values in various ways, up to and including multiple imputation by combination methods over the maximum allowed 50 generated sets, and to calculate the same descriptors, tests and regressions for the resulting imputed sets.
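Anyone wanting to try a similar exercise could generate the depleted sets along the following lines. This is only a sketch of the idea, not the procedure actually used for this review: the function names, the random stand-in data and the particular systematic rule (blanking one variable wherever it exceeds a threshold) are all assumptions, and for simplicity each depleted set is cut from the complete data rather than progressively from the previous one.

```python
import random


def drop_at_random(rows, keep_fraction, seed=0):
    """Blank out cells completely at random, keeping about keep_fraction of them."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    n_drop = round(len(cells) * (1 - keep_fraction))
    masked = [list(row) for row in rows]  # copy, so the complete set is untouched
    for i, j in rng.sample(cells, n_drop):
        masked[i][j] = None
    return masked


def drop_systematically(rows, column, threshold):
    """Blank out one column wherever its own value exceeds threshold."""
    return [
        [None if (j == column and value > threshold) else value
         for j, value in enumerate(row)]
        for row in rows
    ]


# Random stand-in for a complete set of 65 cases on 14 variables.
rng = random.Random(1)
complete = [[rng.gauss(0, 1) for _ in range(14)] for _ in range(65)]

for keep in (0.85, 0.60, 0.35):
    masked = drop_at_random(complete, keep, seed=42)
    missing = sum(value is None for row in masked for value in row)
    print(keep, missing, "of", 65 * 14, "values missing")
```

Because the complete set is kept intact, every imputed result can be compared against a known true answer, which is what makes this kind of test worthwhile.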
I did some quick spot checking of intermediate steps and of final combination; not surprisingly, everything was right on the button except where I had myself made a mistake. The acid test, though, was how much the final analytic results for partially missing data sets were improved (that is, converged with the known true results from the complete set) and the answer was ‘significantly’ or, in the case of high power multiple imputations, ‘dramatically’.
Nothing can ever completely replace lost data, but the improvements made by Solas shifted even some hopeless cases into territory where analysis became worth doing – and without onerous workloads. This, in real world statistics, is often worth its weight in funding documents.