
'S' is for seaside, sea slug and statistics

Everybody needs a hobby, or so my great aunt used to tell me. I'm personally suspicious of the word 'hobby', but perhaps one of mine is the investigation, over a long period and a wide spectrum, of half a dozen little bays around Britain's coast. These bays, whose specifics I have no intention of revealing for selfish fear of rupturing their solitude, are remarkably similar in many ways: tiny, infrequently visited (though half of them are only a short scramble from busy highways), ringed by rocky terrain with varying types and degrees of short tough ground cover. In other ways, and more so as my investigation deepens over the years, they are magically different. For the most part, the function of these bays and their investigation is simply to refresh the sense of wonder that first brought me to statistics. A copy of Insightful's S-Plus 7.0 reached me at a time when there was no upcoming professional project on which it could be put through its paces. I was, on the other hand, in serendipitous possession of a large and as yet unanalysed database of raw readings garnered from the 'sea slugs' that had, for a time, inhabited one of my bays.

My friend, colleague, and partner in oddity, Zeph, derives his enjoyment from building gadgets. Radio triggers, data loggers, miniature microphone arrays, and other transduction devices - if it is small, Zeph enjoys designing it and making it, though he loses interest once it has been demonstrated to work. Since he doesn't share my delight in clambering up cliffs and sitting out rainstorms under stunted trees, we have a nearly perfect symbiotic relationship: he makes the toys; I play with them. If he could be persuaded to make these gadgets last longer than the sealed-in batteries that run them, they would be better still; but any suggestion that his creations might be marketed plunges him into weeks of somnolence. You can't have everything, as my great aunt also used to say.

Software packages frequently come in different weights, these days, and so it is with S-Plus for Windows. The more heavyweight of the pair is 'Enterprise Developer' (ED), with 'Professional Developer' (PD) as its lighter sibling, and the version I was given to play with was ED. There are two ED-only features, identified as I reach them; everything else is applicable to both.

 


The voyage of the sea slug. Despite its computational power, S-Plus has always been very strong on quick and easy exploratory graphics. Here, the movements of one sea slug under tide (y-axis) and currents (x-axis) over 11 weeks are plotted from highlighted dataset columns with a single mouse click from the Plots2D palette at frame left -- an instant snapshot map of the movements within the beach shingle.

 

Zeph's 'sea slugs' were small rocks, rounded by the tides like a pebble, but about 250mm across, ground out to house electronics in sealed plastic containers and then rebuilt to original shape. The electronics show signs of descent from vivisected cell phones, consumer GPS devices and pocket computers, linked by pieces of tatty Veroboard; the rest of the space is taken up by batteries. I lugged a score or so of these rocks back with me in late 2004, to Zeph's specification; after a number of mistakes and breakages, five of them went home again in the spring of 2005. For a few weeks they sat in the shingle between low and high tide lines, collecting data internally on flash cards while static sensor clusters ashore simultaneously recorded parallel environmental data. Just as S-Plus arrived, their batteries were weakening so they were brought in and dismantled. When under water, the GPS in the slugs didn't work, so was shut down to conserve battery life. On request via SMS (short message service), each reported its position and some key data in densely coded strings sent back the same way. It's not everyone who can claim to have held text message conversations with a family of rocks.

The first feature noticed by anyone who has used the S-Plus GUI in a previous release is a cosmetic one: that its responsiveness has, again, been fine tuned. In particular, you can continue using it (though particular parts that call on the S engine may not be available) while long computations are in progress, rather than having to wait until the process is complete. This may not be the most earth-shattering aspect, but it's a very real contribution to usability, and therefore productivity, well worth noting. Staying on this theme for a moment, there are other detail changes - graph colour schemes, for instance, have an increased coherence and graphlet print quality takes advantage of newer Java Virtual Machines. The gain in responsiveness is presumably achieved by code refinement, but it seems convenient to move onto other issues which affect handling.

 


S-Plus installation no longer seems to colonise the Excel toolbar, but the same functionality is available within S-Plus itself. Here an Excel file has been imported, the highlighted data plotted and summarised using S-Plus 7 tools as if it were in a native worksheet. All Excel conventions, operations and tools are available within the worksheet window. Excel data can also be linked, choosing specific cell ranges, but the result is not live - modifications made in one view do not reflect in updating of the other.

 

First of these is Microsoft 4GT RAM tuning support, which potentially jacks up the virtual user address space from 2GB to 3GB. This isn't for everyone, as its full benefit depends on the full whack of RAM being both present and not required by the system, plus an appropriate Windows Server version. But if you've got the context, then S-Plus 7 can use it. I've not had an opportunity to try out this aspect with S-Plus, but experience with other applications software brought home both its benefits and its sensitive spots. On the hardware side, as your installed RAM total falls, so does the benefit - but some advantage remains even below 2GB. There is some messing about to be done with the works of your system, but it's a one-off thing and no big deal. More serious is checking that your system can still function in the reduced space that you are now allowing it - this may mean some hard choices about which device drivers you really need and which can be traded off as a sacrifice to the greater good.

If, like me in this case, you are running S-Plus under a version of XP with installed RAM at around half a gigabyte or one gigabyte and your video adaptor requires a lot of address space, the joys of 4GT tuning are academic. The new Big Data facility, on the other hand, is very applicable to you - though it is one of the two features only available in the ED version. Data sets too big for available RAM are handled as the new bdFrame type via a binary cache on the hard disk which serves as virtual RAM - not part of the usual OS virtual memory, but a separate and dedicated file. In the normal way of things I would have subsetted the sea slug data, but Big Data seemed just made for the task. A library has to be loaded first (this happens by default with the GUI, but has to be called explicitly in many circumstances) before BD can be used, but all that is required thereafter is a tick in an option box when importing a data set. There is a difference in presentation, with bdFrames adding a tabbed interface to the basic Frame window, but broadly speaking data in either can be analysed in the same ways (with some limitations: no timeSeries in a bdFrame as yet, for example). There were a few rough edges when I first tried it in 7.0, but most of them disappeared with application of the 7.0.2 update patch - as always, make sure you have the newest version (good advice that I too often forget to follow myself).

The exact size of set that can be handled as Big Data, though the data itself is not held in RAM, still depends on the RAM available (since column metadata is maintained there) and on the operations involved. The documentation refers to tens of thousands of columns on a 'typical machine'. The sea slugs and their shoreline kin had, between them, generated about 12,000 parallel variables, which would not all have been analysed together in the normal way of things, but were loaded together as a trial of Big Data. By combining some into compound measures such as lateral movement per hour, longitudinal movement per hour, rates of change, and so on, I artificially boosted them to 18,278 (the largest that a Quattro Pro spreadsheet will store). These imported without trouble on a very standard laptop, and I didn't push further in search of ultimate limits.
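The figure of 18,278 is what one would expect from a spreadsheet whose columns are labelled with one to three letters, A through ZZZ. That labelling scheme is an assumption made here for illustration, not something stated in the text, and the function names are hypothetical. A minimal sketch of the arithmetic:

```python
# Count spreadsheet columns, ASSUMING letter labels A..Z, AA..ZZ, AAA..ZZZ.
# The labelling scheme and the helper names are illustrative assumptions.
LETTERS = 26


def columns_up_to(max_label_length: int) -> int:
    """Number of distinct labels made of 1..max_label_length letters."""
    return sum(LETTERS ** n for n in range(1, max_label_length + 1))


def label_to_index(label: str) -> int:
    """1-based column index of a label such as 'A', 'AB' or 'ZZZ'."""
    index = 0
    for ch in label.upper():
        # Treat the label as a base-26 number with digits A=1 .. Z=26.
        index = index * LETTERS + (ord(ch) - ord("A") + 1)
    return index


total = columns_up_to(3)          # 26 + 676 + 17576
last_column = label_to_index("ZZZ")
```

Both `total` and `last_column` come to 18,278 under this assumption, which matches the column limit quoted above.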

 


S-Plus 7 prepares to import a data set from a Quattro workbook into a Big Data frame (selected by the ticked 'Import as Big Data' option box centre at left edge of the 'Import from file' dialogue).

 

As long as the data set will fit in RAM, you have the choice, at time of import, whether to import conventionally or as a Big Data frame. This may seem like a no-brainer, but it deserves careful thought in certain circumstances - for instance, if other things are happening on your machine that you don't want to terminate. Obviously, a disk-based data set is going to be slower than the RAM equivalent (especially if other processes are concurrent), but what S-Plus does with block reads and data streams is impressive. The library itself takes some time to load (though this is only a once per session wait at start up), and if a dataset is big enough to leave you with no option but the Big Data treatment, you can go and have a cup of coffee while it imports. Once in, the data is not analysed as fast as in RAM, but nevertheless with very respectable speed - and, of course, couldn't have been analysed at all if Big Data hadn't been there. It's always worth checking that you really need to use this facility before abandoning RAM, but it's a very useful new facility to have on tap.
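The general idea behind block reads can be illustrated outside S-Plus. The sketch below is not Insightful's implementation and uses no S-Plus API; it is a plain Python illustration, with hypothetical function names and a generated demo file, of how a column summary can be computed from disk while only a fixed-size block of values is ever held in memory.

```python
import csv
import os
import tempfile


def write_demo_file(path, n_rows):
    """Write a small CSV with one numeric column (a stand-in for a large data set)."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["reading"])
        for i in range(n_rows):
            writer.writerow([i % 10])


def blockwise_mean(path, column, block_size=1000):
    """Mean of one column, holding at most block_size values in memory at once."""
    total, count, block = 0.0, 0, []
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            block.append(float(row[column]))
            if len(block) == block_size:
                # Fold the full block into the running totals, then discard it.
                total += sum(block)
                count += len(block)
                block = []
    # Fold in the final, possibly partial, block.
    total += sum(block)
    count += len(block)
    return total / count if count else float("nan")


# Demo: 25,000 rows cycling through 0..9.
fd, demo_path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
write_demo_file(demo_path, 25_000)
mean_reading = blockwise_mean(demo_path, "reading")
os.remove(demo_path)
```

The trade-off is the one described above: every pass over the data costs disk reads, so the result arrives more slowly than it would from RAM, but the memory footprint stays bounded by the block size rather than by the size of the data set.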

Before moving on to the actual purpose of the software, where there are both changes and new aspects, perhaps mention should be made of FlexNet licensing management. Insightful flags this shift as a significant development, and a quick phone-around confirms that customers with large, multiple-user networks to keep in compliance seem to agree. Like most licensing systems that work well, it is so nearly transparent that I would have nothing to say but for two experiences during my trial up to the time of writing. On those two occasions the program refused to load, informing me that the stored ID didn't match that of my machine; but the problem was fleeting, disappearing on the first occasion in the time that it took me to send a plaintive support-request email.

Most such products are making moves to attract greater business attention rather than just the pure scientific, which often leads to software bloat, so I'm pleased to see that all the financial functions that were core to S-Plus 6.2, and are usually not of interest in science work, have been farmed out tidily into a separate 'financedb' library. The correlatedData library, by contrast, which was fairly central to some of the sea slug investigations, usefully extends reach in the areas of generalised linear models, and maximum likelihood. Given the nature of their task the sea slugs particularly appreciated strengthened time series handling with new functions, extensions to old ones, and applicability to objects or functions which previously lacked it. There is a new random effects Cox model type for censored data, and a clutch of other additions or improvements to existing capability.

The language sees similar incremental change - left bin endpoint inclusion and an 'align.by' argument added to aggregateSeries, for instance. The test for missing values (despite their best efforts, the sea slugs generated a lot of these) has moved from its library into the core. A couple of functions not previously in the Windows flavour have been brought in from the cold, LME has been strengthened and tidied up, there is some added sophistication in time series handling, and so on. All of this is valuable, but perhaps overshadowed by the arrival of the second feature to be found only in the Enterprise Developer version: an implementation of the Eclipse IDE, customised for the S language and incorporated as the standalone 'S-Plus Workbench'.
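What left bin endpoint inclusion means in practice is easiest to see with readings that fall exactly on a bin boundary. The following sketch is a generic Python illustration, not the aggregateSeries function or its argument list; the `aggregate` function and its `include_left` parameter are hypothetical names chosen for the example.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def aggregate(times, values, breaks, include_left=True):
    """Sum values into bins defined by sorted break points.

    include_left=True  -> bins are [b_i, b_{i+1})  (left endpoint included)
    include_left=False -> bins are (b_i, b_{i+1}]  (right endpoint included)
    Observations that fall outside every bin are dropped.
    """
    sums = defaultdict(float)
    for t, v in zip(times, values):
        if include_left:
            i = bisect_right(breaks, t) - 1
        else:
            i = bisect_left(breaks, t) - 1
        if 0 <= i < len(breaks) - 1:
            sums[i] += v
    return [sums[i] for i in range(len(breaks) - 1)]


# Hypothetical readings taken at hours 0, 3, 6, 9 and 12, binned into 0-6 and 6-12.
hours = [0, 3, 6, 9, 12]
readings = [1, 2, 4, 8, 16]
breaks = [0, 6, 12]

left_closed = aggregate(hours, readings, breaks, include_left=True)
right_closed = aggregate(hours, readings, breaks, include_left=False)
```

With the left endpoint included, the reading at hour 6 belongs to the second bin and the reading at hour 12 falls outside; with the right endpoint included, hour 6 closes the first bin and hour 0 falls outside. For regularly timed data such as tide-cycle readings, the choice changes every bin total.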

Eclipse grew from IBM, which still dominates the consortium that licenses it, but is open source and so presented in different ways by different product suppliers. So far, Workbench offers development of tasks and projects, interaction with language engine and source control, monitoring of syntax, and editing of code. Insightful invites user feedback into a development cycle intended to grow the workbench beyond this into, eventually, a total S code environment. This will be interesting to watch. Will it echo, step by step, the release cycle of S-Plus? Will it be faster, with interim releases? Will it take on a life of its own, like my sea slugs following their tidal cycle, with separate update calendar? Will it, in time, become available as an addition to the Professional Developer version of S-Plus as well? While not applicable to this single study in the private life of seaside rocks, the long term importance of the S-Plus Workbench is obvious. Eclipse is taken seriously across the industry, and has caused potentially seismic rumbles by its de facto confrontation with Sun. Insightful will not have adopted it lightly, and clearly have strategic aims in mind when doing so.

The sea slugs, alas, have retired; S-Plus, on the other hand, has made a significant career move. While the Professional Developer variant is in line with normal upgrade expectations, the Enterprise Developer goes beyond that in ways which require careful consideration of potential futures.

Sources

Insightful: info.uk@insightful.com
Eclipse IDE: info@eclipse.org
FlexNet licensing: corp.info@macrovision.com



With version 7 comes access to latest versions of all the S+ modules, separate products that provide significant enhancement of S-Plus in particular areas of work. I mentioned some of these last year ('And furthermore: add-ins for stats packages', December 2004), but a brief survey seems in order here. Those marked 'new version' generally take advantage of the new large data set capacity in S-Plus 7:

  • S+ArrayAnalyzer (currently v2.0.4, available for Windows) - new version. Providing assay methods primarily of interest to the pharmaceutical industry, it has several improvements over the previous release. Easy to use, it offers a fairly short acclimatisation without sacrificing power.
  • S+FinMetrics (currently v2.0.1, available for Windows, Solaris, Linux 32-bit) - new version. Aimed at the financial services sector, this is an advanced modelling and estimation package designed for market prediction and trading design.
  • S+NuOPT (currently v1.6.2.0b, available for Windows, Solaris, Linux 32-bit) - new version with performance and memory management improvements. Optimisation package for very large data sets; well suited to accompany S+FinMetrics, and positioned as such by Insightful, but the core methods are applicable to many science situations as well.
  • S+Wavelets (currently v2.0.2, available for Windows, Solaris, Linux 32-bit, HP, AIX). Image, signal and time series analysis using wavelet methods.
  • S-PLUS for ArcView GIS Link (currently v1.1.1, available for Windows). Makes S-Plus tools available within ArcView. Pairs naturally with S+SpatialStats, which it will access.
  • S+EnvironmentalStats (currently v2.0.3, available for Windows). Adds graphic and analytic functions covering major published methods, plus a substantial body of relevant reference data sets and a help system on background, usage and source referencing. Aimed at civic or corporate professionals working directly to legislative or regulatory requirements, but core capabilities are of much wider utility.
  • S+SeqTrial (currently v2.0.3, available for Windows). Group sequential methods for clinical trials.
  • S+SpatialStats (currently v1.57, available for Windows, Solaris, Linux 32-bit, HP, AIX). Analysis of spatially distributed data of all kinds.
