FEATURE

Yours, free to keep...

From a dune in the depths of the Sahel desert, Felix Grant accessed free statistics software using a handheld. But, he discovered, there are costs other than monetary ones.

Unsolicited samples, books, and internet connection CDs often drop through my letterbox calling themselves 'free gifts' but they are almost invariably bait, designed to haul me in for something very expensive. The world of free statistics software, however, is gratifyingly different.

 

Much depends, of course, on what you mean by 'free'. For the purposes of this survey, I adopted the phrase used to describe Britain's National Health Service medical provision: 'free at the point of delivery'. In other words, you get something real and useful, of lasting value, with no catch or punitive small print, and you pay nothing in monetary terms for its procurement. That doesn't mean there is no cost. National Health Service beneficiaries meet the cost of their treatment through taxes and national insurance contributions; free software users, in general, pay through investment of their time.

 

That is the crux of the question: why pay? There is a market in which price is a significant factor; two subsets with whom I have a lot of contact are students and charitable organisations. Few readers of Scientific Computing World, though, I suspect, would choose free software over a purchased package because their company or institution couldn't afford the purchase price.

 

Researching this piece, I talked to a lot of users of both free and commercial software (very often the same people) and, overwhelmingly, the reason behind their choice was the fact that statistical software free of a price tag is often free in other senses as well. On the plus side, the user is free to tinker with and customise it; on the debit side, free software may have no guaranteed support.

 

The second most common reason for 'going free' was the perception that a tool would be used only once or twice. If my work is usually in field A, but during the set-up of an unusual project I need, probably never again, a facility from field Z that is not offered in my usual toolset, I may well go for a free tool rather than buy one in. Related to this is the third most common reason: free tools allow me to become familiar with what is and isn't possible, and to learn what I do and don't want, before committing myself when moving into unfamiliar territory.

 

Between those limits, there is a staggering amount of statistical software within my 'free at the point of delivery' definition. I looked at more than 300 offerings, most of them very good, and had still not seen the horizon.

 

There is also a political dimension to free software. Most visible is the operating system issue, with Microsoft Windows on one side and Linux on the other. The United Nations' International Open Source Network (IOSN) actively promotes open-source software as a component of ICT infrastructure development in the developing world. At the end of August, I was working with an east African university when IOSN's 'Software Freedom Day' erupted around me; there were suddenly more open-source CDs around than machines on which to run them, and I found myself in great demand for ad hoc talks on the subject. Central to that political dimension are the Free Software Foundation and the GNU General Public Licence, so I will kick off with a free software package issued under the GNU GPL.

 

In the world of commercial analytic software, a few well-known names dominate; so it is with free software. For scientific users, one of the big commercial names is S-Plus - the analytical and visualisation environment now marketed by Insightful Corporation, a value-added implementation of the S language originally born at Bell Labs nearly 30 years ago. In free software, the equivalent big name is R - also an implementation of the S language, though with significant influence from the LISP variant, Scheme.

 

An interpreted language, R originated in New Zealand, at the University of Auckland's Department of Statistics. Originally growing through user support and feedback to its two authors - Ross Ihaka and Robert Gentleman - it has for the past seven years been under the oversight of an 'R Core Team', currently 17 strong and including, besides Ihaka and Gentleman, John Chambers of S fame. This core team has created a not-for-profit R Foundation, which provides support for the R project in particular and for innovation in statistical computing in general, and manages the software and documentation copyrights. Most functions are written in R itself, although provision is made for Fortran, C, and C++ procedures. The software is available for Windows, MacOS, and several Unix flavours.

  • Example 1.1 from Bowman & Azzalini is transferred to R, using the relevant version of the sm library. Left, upper: the R console. Left, lower: information window for the loaded aircraft data. Right: graphics window with R's own histogram and the smoothed curve generated from the sm library. At this level, there is no difference in working between S-Plus and R. As the examples become more complex, differences begin to emerge.

 

All my exploration was carried out in releases up to 1.9.1, but version 2.0.0 (a coming of age more than a radical change, but with a number of new features) was announced just as this issue went to press.

 

I was familiar (as most scientists are) with S, and have always had an affection for LISP, so R held no terrors - or so, at least, I thought. On the other hand, I'd never actually done anything with R. I have a lot of code knocking around in dark corners of my hard disk, and R documentation claimed that much of it would run with minimal modification - which was true, but deceptive. Wandering arrogantly into the R environment, I suddenly felt like a city dweller cheerfully trotting off for a week's stiff winter hill walking in light summer shoes. This was a good lesson in the different ways of paying for software.

 

I am not, I freely admit, a natural programmer. It is true, I found, that R will run most of my S code - and very efficiently too. Most of the instances where it didn't were traceable to the different way in which R determines the values of free variables: my S fragments assume that this determination is by global reference ('static scoping'), whereas R looks to the environment that existed at the time of function creation ('lexical scoping'). Once I had got my head around this idea, and allowed for it, almost everything worked flawlessly. But although my code now ran, it didn't run as efficiently as it should. The functions I had created were overlarge; they carried unnecessary baggage over from one environment to the other. In most cases, my code could be considerably stripped down, as a direct result of the different scoping.
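The difference is easiest to see in a toy example of my own (not code from the article's exercises). Under lexical scoping, a function created inside another function can see, and update, variables in the environment where it was created:

```r
# Lexical scoping in R: free variables are resolved in the environment
# where a function was created, not where it is called
make_counter <- function() {
  n <- 0                 # n lives in the enclosing environment
  function() {
    n <<- n + 1          # '<<-' updates n where it was defined
    n
  }
}

counter <- make_counter()
counter()   # returns 1
counter()   # returns 2 - the closure remembers its own n
```

Under the global-reference ('static') scoping my S fragments assumed, the inner function's `n` would be looked up globally, so this idiom would not behave the same way.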

 

There is also an increase in speed, though one only noticeable in long runs and loops: R stores in RAM information that S stores on disk. The downside is a greater risk of loss in the case of a crash. Frequent saving becomes more important, which somewhat neutralises the speed advantage.

 

Chastened, I went back to basics and ran through some exercises. Bowman and Azzalini's 'sm' data-smoothing library for S-Plus is one I know well. There is a version of sm for R, so I worked my way through the illustrations in the associated book, Applied Smoothing Techniques.
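The flavour of those exercises - a histogram with a smoothed density curve laid over it, as in the figure - can be sketched in a few lines. This uses simulated data and base R's density() as a stand-in for sm's own smoother, so it is an illustration of the idea rather than the book's example:

```r
# Histogram plus smoothed curve, in the style of the sm exercises
set.seed(1)
x <- rnorm(200)            # stand-in for a variable from the aircraft data
hist(x, freq = FALSE,      # freq = FALSE puts the histogram on a density scale
     main = "Histogram with smoothed density")
d <- density(x)            # kernel density estimate (base R smoother)
lines(d)                   # overlay the smoothed curve on the histogram
```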

 

All in all, R was a good lesson in the price that may have to be paid for free software: I spent many hours relearning some quite basic things taken for granted in the commercial package. Importing a data file, for instance, is no longer something done without thinking by clicking a File menu option. In fact, the whole File menu in R looks either elegantly uncluttered or frighteningly austere, depending on your point of view. You still have all the facilities you need for import; but you have to roll up your sleeves, open the help files, and set aside initial learning time before you can come to love them.
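Once learnt, the command-line route is short enough. A minimal sketch (with a made-up three-row file) of the read.csv() call that replaces a File menu import:

```r
# Write a small CSV file, then read it back - the command-line route
# that replaces a File > Import menu option
tmp <- tempfile(fileext = ".csv")
writeLines(c("speed,span",
             "320,34",
             "410,38",
             "290,31"), tmp)

aircraft <- read.csv(tmp)   # header = TRUE is the default
str(aircraft)               # 'data.frame': 3 obs. of 2 variables
mean(aircraft$speed)        # returns 340
```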

 

I've recorded all of this because it is an important aspect of 'running free' - but it would be wrong to see it as the whole or even the dominant truth. It is the price paid, just as the dollars or euros for a commercial package would be. For that price, I've learnt a great deal - and not only about R. And I shall remember it when I next have to find a heavyweight solution for a big problem presented by a small charitable client with an invisible budget. It's a huge, awe-inspiring package - easier to perceive as such because the power is not hidden beneath a cosmetic veneer. A bank of libraries is included with the basic install, and many more are available thanks to other users more creative and experienced than I am. There's quite a bit of good, free instructional material for those on the learning curve as well - including a very useful PDF text for learning both R and S-Plus basics from scratch, side by side. There is continual development, including some in specialised areas of interest to particular users - and all of this is, of course, also free.

 

There are other types of free software besides the big all-singing, all-dancing environment. When I started this, I began a listing from my own past favourites; but I soon realised I couldn't hope to better the job already being done by others. So, treat what I have to say here as simply a taster before going on to one of the sites listed in the 'Signposts' box, below. It includes three portal pages to get you out there in the right territory.

 

The logical first stop has to be Rweb - an interactive site which lets you run R through a browser without having to install it on your machine. This is a good way to get an initial feel for the language, or to make occasional use of it; though it doesn't have the flexibility of a local installation, and you can't, for example, call up specialist libraries at will. It's also a good continuity link to the far end of the free statistical software spectrum: the many online resources available for occasional or trial use of techniques and products - sometimes with public intent, more often as a by-product of institutional facilities.

 

These are usually simple, single-purpose pages, performing one task in an easy-to-use way. An example is the Large Markov Chains computation page at the University of Baltimore's Statistical Thinking for Decision Making site. Like many such pages and sites, this one (administered by UB's Professor Hossein Arsham) is part of the resources provided for some of the institution's courses; availability to the outside world is a by-product. Powered by JavaScript, the Markov calculator is one of a collection of useful tools running down to smaller utilities such as a one-row, 14-item chi-squared goodness-of-fit calculator and an exponentiality test. If you are thinking that you have those facilities already, in a desktop maths or analytics package, then you are probably right - but it is surprising how often a bookmark to one of these pages is quicker and more convenient for a one-off question. Another advantage is that these tools are accessible from platforms that don't have, or cannot run, heavyweight specialised software. I can vouch for the fact that the Risky Decisions calculator works faultlessly (though a little scrolling is required) when accessed via satellite from a handheld machine on top of a dune in the depths of the Sahel desert, so it should work anywhere. Obviously, these are not tools for frequent, high-volume work, but for occasional or exploratory use, in the right situations, they can be ideal.
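The kind of calculation behind the Markov chains page takes only a few lines in R, as it happens. A sketch, with a made-up three-state transition matrix, of finding a chain's steady-state distribution:

```r
# Steady-state distribution of a Markov chain: the left eigenvector of
# the transition matrix P for eigenvalue 1, rescaled to sum to 1
P <- matrix(c(0.5, 0.3, 0.2,
              0.1, 0.6, 0.3,
              0.2, 0.2, 0.6), nrow = 3, byrow = TRUE)  # each row sums to 1

e <- eigen(t(P))            # eigen() of the transpose gives left eigenvectors
v <- Re(e$vectors[, 1])     # eigenvalue 1 is the largest, so column 1
steady <- v / sum(v)        # normalise to a probability distribution
steady %*% P                # equals steady: the distribution is invariant
```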

 

The 'big end' of the market looks to generality while the 'web end' concentrates on specific tasks; the intermediate band generally falls between the two extremes in scope as well as in size.

 

In the generality camp are packages such as OpenStat (open and friendly, designed for educational use) and Mielke and Berry's Statistical Software (a collection of DOS routines, intended to accompany their book). I include MicroOsiris even though, with registration dependent on a minimum 'donation' of $10, it is not strictly 'free'; it is one of the most comprehensive, and includes a guide to selection of suitable techniques. Then there is StatCrunch, which is also a university-based Java collection (South Carolina, this time), but more general in intent than the Baltimore examples and arranged to emulate a simple 'conventional' statistics package. There is provision for import of data from a file, rather than manual input, and the facilities on offer range from summary measures and a range of graphics to control charts, taking in basic significance tests, multiple linear regression, one-way ANOVA and some common non-parametrics along the way.

  • Aspects of MicroOsiris. Overlaid on the audit trail screen (background) are an Excel-mediated input utility (top left), command screen (top right), and the decision tree advisor for choice of techniques to be used.

 

There are then specialised subset or single-purpose packages dealing with a particular area or task - usually in greater depth or with greater ease than the generalist packages. Examples include everything from First Bayes (exactly what it sounds like) and DEMETRA (time series) through Biomapper (a GIS and stats kit for ecological niche-factor analysis), WinSAAM (modelling, aimed at pharma use) and highly specific Excel add-ins or routines, to EpiData which offers no statistical analysis at all, but does help you to build a documented data-entry regime. This seems an appropriate place to mention that many free products (especially, but not only, this specialised class) stipulate that you are only free to use them for non-profit purposes: check this, for moral as well as legal reasons, before wading in.

 

I haven't done more than a gadfly flit over the possibilities here. Quite apart from the fact that I have completely ignored literally thousands of purely statistical options, there are all those more general mathematical avenues which Ray Girvan has previously suggested in SCW March/April 2004, page 43 - all of them of interest to the statistician as well. There is, in principle, no need ever to pay for statistical software in monetary terms, because there is more choice free at the point of delivery than anyone can realistically expect even to visit. The real questions are: what are your exact needs, for what particular purposes, and how much are you prepared to pay in time, security, and currency? I am in the fortunate position (partly as a result of reviewing, partly through academic or other channels) of having most of the big commercial packages available to me; the web delivers the rest to my door. Spoilt for choice, I still find myself splitting my time roughly half and half between the two worlds. 'There are more things in Heaven and Earth than may be found between London and Staines' - and the same could be said for any set of well known names you care to pick, commercial or free, however superb they may be, from the analytical software arena.

 

SIGNPOSTS

 

 

Altervista's Free Stats on the Web (a portal) http://freestatistics.altervista.org/stat.php

Biomapper http://www2.unil.ch/biomapper/

EpiData http://www.epidata.dk/

First Bayes http://www.firstbayes.co.uk/

Java tools menu at University of Baltimore Statistical Thinking for Decision Making site http://home.ubalt.edu/ntsbarsh/Business-stat/Matrix/Mat10.htm#rmenu

John C Pezullo's Free Statistical Software and Web pages that perform statistical calculations (a portal) http://members.aol.com/johnp71/javasta2.html http://members.aol.com/johnp71/javastat.html

LispStat http://www.stat.uiowa.edu/~luke/xls/xlsinfo/xlsinfo.html

Micah Altman's The impoverished Social Scientist's Guide to Free Statistical Software on the Web (a portal) http://www.hmdc.harvard.edu/micah_altman/socsci.shtml

Mielke and Berry - Statistical Software http://www.stat.colostate.edu/~mielke/permute.html

OpenStat version 4 http://www.statpages.org/miller/openstat/

Ox http://www.nuff.ox.ac.uk/Users/Doornik/index.html

R project - the starting point for investigation of R http://www.R-project.org/

Rweb - interactive site for basic use or exploration of R http://www.math.montana.edu/Rweb/Rweb.general.html

StatCrunch http://www.statcrunch.com/

WinSAAM http://www.winsaam.com/
