The use of statistics is broadening, says Laurie Miles, head of analytics at SAS
There’s a buzz in the marketplace surrounding Big Data, but we focus on big analytics because it’s not the data that’s important, it’s what you do with it. The world has changed for statistics with the, if you’ll forgive the term, democratisation of analytics. It used to be that a few specialists would perform analyses in the back office, but point-and-click interfaces, wizard-driven tools and algorithms that self-learn have all meant that an increasing percentage of non-statisticians are now using analytics.
Increasingly, elements are becoming more scalable and growing data volumes mean that analyses need to run faster. Software is currently meeting this challenge, however. In the past, if someone wanted to run a complex mathematical model, such as a neural network, on a large volume of data it might have begun on Monday morning and finished on Wednesday. We recently ran some tests with our new high-performance analytics software and an enormous volume of data was run through the neural network in 38 seconds. Something that used to take days has come down to less than a minute. What this means is that analysts can try different approaches that previously wouldn’t have been possible due to time constraints. By having the flexibility to explore a variety of options, users get a greater level of accuracy as well as taking a more innovative approach.
Analytics have been around for a long time, but it’s only in the past five years that I’ve seen non-statisticians getting excited by the things you can do with data. It’s still the tip of the iceberg but in the next five years I believe we’ll see the use of statistics opening up even further.
Stephen Langdale, senior technical consultant at NAG, shares his view
On a basic level, there is a consideration of accuracy for every algorithm that’s written. The numerical stability has to be checked to ensure, for example, that there are no divide-by-zero errors or an accumulation of cancellation effects, either of which can render the output meaningless. Furthermore, a lot of work needs to be done to get a computer to consistently calculate reliable results, a necessity given that if the routines you’re relying on to make decisions are returning garbage, your decisions will be garbage too.
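The cancellation effect described above can be illustrated with a small sketch. It is not taken from any NAG routine; the function names and the sample data are assumptions chosen for illustration. It compares the textbook one-pass variance formula with Welford's running-update method on values that carry a large common offset.

```python
def naive_variance(xs):
    """Textbook one-pass formula: (sum of squares - (sum)^2 / n) / (n - 1).

    Subtracts two huge, nearly equal numbers, so most significant
    digits cancel when the data sit far from zero.
    """
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total * total / n) / (n - 1)


def welford_variance(xs):
    """Welford's method: update the mean and the sum of squared
    deviations incrementally, which avoids the large subtraction."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n < 2:
        raise ValueError("need at least two observations")
    return m2 / (n - 1)


# Illustrative data: small spread (4, 7, 13, 16) on top of a large offset.
# The exact sample variance is 30.
data = [1e9 + d for d in (4.0, 7.0, 13.0, 16.0)]

print("naive  :", naive_variance(data))    # unreliable in double precision
print("welford:", welford_variance(data))  # 30.0
```

Both functions implement the same mathematical quantity, yet only the second stays accurate on this input, which is the kind of check that has to be made before a routine can be trusted.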
In terms of statistics, one of the things you need to look for is a wide variety of choices of methods for modelling data, allowing problems to be approached from many angles. A library of fast, reliable methods is needed because even though computers have advanced so much over the years, the amount of data people are collecting has increased many more times over. It wasn’t that long ago that a data set of a few thousand would be huge, whereas today many data sets are orders of magnitude larger.
As the number of computing cores continues to increase, it becomes a question of how to split up data and tune statistical methods, some of which were developed decades ago for serial computing. For example, random number generators, the building blocks of Bayesian estimation, are one key area that benefits greatly from computers using threaded models, meaning that problems can be split up and concurrent calculations can be run to save time.
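As a rough sketch of that idea, the example below splits a simple Monte Carlo estimate of pi across several workers, each with its own random number generator. It uses only the Python standard library rather than any NAG routine; the function names, seeds and sample counts are assumptions for illustration. Giving each worker a distinct seed is a simplification: it keeps the run reproducible, but it does not formally guarantee independent streams, which is why numerical libraries typically offer dedicated skip-ahead or multiple-stream generators.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def count_hits(seed, n_samples):
    """Count random points in the unit square that land inside the
    quarter circle, using a generator private to this worker."""
    rng = random.Random(seed)  # per-worker generator, no shared state
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(n_workers=4, samples_per_worker=50_000, base_seed=12345):
    """Split the sampling across workers and combine their counts."""
    # Assumption: distinct seeds stand in for properly independent streams.
    seeds = [base_seed + i for i in range(n_workers)]
    counts = [samples_per_worker] * n_workers
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        hits = list(pool.map(count_hits, seeds, counts))
    return 4.0 * sum(hits) / (n_workers * samples_per_worker)


print(estimate_pi())
```

Because each worker owns its generator, the result does not depend on how the threads are scheduled. In CPython the global interpreter lock limits the speed-up for pure-Python loops like this one, so swapping in `ProcessPoolExecutor` or a compiled library would be the natural next step for real workloads.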
At NAG we have noticed that industry is moving away from simpler models to more complex computer-intensive ones. The NAG Library itself contains over 1,700 methods, including 11 chapters dedicated to areas of statistics. Key areas include probability distributions, random number generators, clustering and regression ranging from multiple regression to generalised linear models to time-series. Feedback from NAG’s users and academia drives the contents of these chapters, but NAG only adds new algorithms if they’ve been stringently tested and are deemed likely to have longevity. But it’s not all that easy to predict which ones will be adopted by the wider community – it’s not always the ones you might expect.