A model miscellany

Just over a year ago [1] I took a busman's holiday, deliberately abandoning the familiar round and looking for analytical software tools that had never been reviewed before in Scientific Computing World. Some of them I was able to try on my own machines; for others I had to hit the road and find hosts who would let me play in corners of their environments. I learned a great deal from the experience and so, 14 months on, I repeated it. This time, rather than seeking out my own prey, I tried out a range of proposals made through the SCW website.

The products in the sample on which I settled all seemed particularly intriguing in some non-standard way (either personally or marketwise), and could all be related to a common modelling theme. Last time I kicked off with a well-known name from a big corporation; this time, everything comes from the leaner places where innovative thinking and imaginative development are so often born. Such places don't have to be physically remote, nor do they have to be away from centres associated with software development (though in our rapidly globalising world they may well be); their removal is conceptual, measured from corporate market domination patterns.

One effect wrought by the internet in general, and the web in particular, is the irrelevance of geography in many contexts; there are companies with whom I have been dealing for several years but couldn't tell you where they are physically based.

The first stop in this particular journey is Slovenia, a country in transition from one economic model to another with a consequent turnover of ideas. Dew Research, about six years old, is engaged in several areas of work including consultancy and application development. Most interestingly in the present context, though, it develops packaged reusable code libraries for use by others who are building specific scientific and engineering software applications. From Slovenia in the east of Europe, to Atebion (a hybrid tree-based modelling environment combining elements of the spreadsheet on one hand and computational command line execution approaches on the other), which takes us to the western edge. Atebion is a Welsh word with a spread of meanings covering the English equivalents: reply, solution, answer, panacea, and so on; and Atebit (the publisher) hails from an address near Bangor in north-west Wales. The next member of the group is DTREG, a package comprehensively dedicated to predictive data modelling, from the other side of the Atlantic. Finally, northward from Atebion, to Sweden for Linolab: a matrix algebra environment in Java (the program, not the island, in case my travel analogies are beginning to confuse you!) which is, like DTREG, an individual project.

MtxVec, from Dew Research, is a comprehensive base set of object-oriented numerical tools.

Three of the four will run as standalone applications on any standard Wintel PC; for the fourth, this was one of those times when I couldn't rely entirely on my own resources. My interest in Dew Research was, given my own bias, first caught by its StatsMaster; but that is a statistics-specific add-on for its larger core library, MtxVec, so I started with that and worked through. In terms of function, MtxVec occupies territory with which I'm very familiar - the application of vector and matrix methods to large data sets. Its habitat, on the other hand, is not one where I have any real background of expertise. I might once have known my way hesitantly around Borland Delphi, but that was long ago and much has changed; C++Builder and Visual Studio are alien tongues. So it was out on the highways and byways to find existing installations where I could get access to both the software and a base of expertise in using it. Maintaining impartiality in such circumstances can sometimes be an issue, since an established user base has obviously already committed to the software, but I was lucky on this occasion. I found MtxVec in use at a centre where I was already known well enough to be given play space and then left alone unless I called for help.

MtxVec is a comprehensive base set of object-oriented numerical tools using LAPACK to speed up and ease the development of computational work on large matrices of data values (including complex quantities). Vectorised code, multiprocessing and access to various classes of primitives and low level functions (including streamed SIMD) provide speed and efficiency, packaged in ways that make power available and readable to someone who thinks like a human algebraist rather than a computational machine coder. It also integrates with Steema Software's TeeChart to provide quick and easy visualisation.

As a naïve babe in the woods of unfamiliar environments, I can't give an authoritative pronouncement on MtxVec's broader place in the ontogeny of its chosen world. I can, though, say that I found it a tremendous aid to quickly and efficiently getting done the sort of work that I wanted to do. Side explorations showed that without it I would either have had to do a lot of wheel reinventing (with all the error and inefficiency debugging which that implies) or spend a lot of time locating less than obvious native methods. By way of comparison, after a single day my beginner's efforts with MtxVec more than held their own when run against equivalent structures built without it by more experienced practitioners.

For all the advantages that it provides on its own, though, MtxVec was for me not a destination but a means to an end. Its inherent statistical functions are undeniably valuable compared to what is available raw, but for most purposes there is an equally powerful logic to moving on into StatsMaster. This takes MtxVec's capabilities and presents them as a coherently focused set of methods for statistics and probability work. Here a stranger like myself can at last begin to feel at home, insulated from the raw generic strangeness of a Builder, Delphi, or Studio environment by structures of instantly recognisable familiarity - a metaphorical Rosetta Stone.

StatsMaster's repertoire starts from basic descriptive measures before building upwards to include a range of 200 or so distribution and regression methods, all applicable across the large data set sizes that MtxVec can access, and calling on its matrix manipulation methods. This makes it a natural data mining tool. Probability calculators and random number generators are built in for a solidly useful range of distributions; regression model types are comprehensive as are parametric hypothesis tests (a small number of common nonparametric tests are present as well). Logical key groups of these facilities are encapsulated as components for ANOVA, nonlinear and multiple linear regression, principal components and hypothesis testing. Charting is dependent on installation of TeeChart but, given that proviso, reflects TeeChart's capabilities.

Linolab is designed explicitly as a Java development extension.

Taking a step back from StatsMaster, there is a logical continuity link from MtxVec to Linolab. Technically, Linolab also requires an operating environment - but since that environment is Java, which can be taken as read on most Wintel platforms these days and is freely available to those where it is not, it can be seen as freestanding in real terms. More important, Linolab is also oriented to matrix methods - it can be seen as operating in the same context as products such as SciLab or Matlab. Like those two products, it has a command line and can be used for interactive exploratory development of code; unlike them, however, it is designed explicitly as a Java development extension. Its language (Lino) is not the familiar *lab clone, but a superset of Java itself with use of the expanded vocabulary a user option.

As someone who picks up and uses programming languages as an incidental side-issue to the process of doing something else, I found this my first really comfortable experience of Java. As with the likes of Matlab, here is an interactive graphical user environment where it is possible to think first and foremost about algebraic structures and only deal with language issues as they arise within that view of whatever task is to hand. Running through a series of 'five finger exercises', the language gradually reveals itself around purpose in a conversational manner. Being a habitually contrary learner I started with exercises of my own, but would have gotten on even more quickly with the well-designed exercises and examples provided.

The similarities and differences with respect to Matlab and its clones are interesting, the latter emerging as temporary pauses in the workflow during learning. Matrix performance is pretty closely comparable, no difference at all being apparent in my use so far, but some differentials are flagged in both directions. LAPACK is available as a Linolab library (DLL, SO, or whatever) but sparse matrix operations are 'not yet' optimised for it. It would seem that in the most demanding circumstances Linolab is likely to provide a better turn of speed on dense matrices, Matlab on sparse, Linolab on heavily looped structures, and so on.

This seems a good point at which to mention that I used the freely distributable open licence version of Linolab. This is not restricted to personal or academic purposes; commercial use is also permitted. There is also a purchased licence version which runs faster (by a factor of four) and allows parallel execution of commands where the free licence executes serially. The speed comparison examples and results provided with Linolab assume the purchased licence.

Enough about speed; anyone to whom it matters will make their own investigations anyway. The purpose of an article like this is to signpost possibilities, not to make detailed cases. One advantage which Linolab offers is the output of final programs that will run alone (assuming Java, and the same platform) without the presence of Linolab itself; the same result from Matlab requires separate compilation. Linolab appears, on my admittedly short acquaintance so far, to have better scalability and object orientation, plus support for threading which I haven't tried. On the other hand there is a lot of preexisting *lab stuff out there, a lot of *lab expertise on which to call, and Matlab installs with access to a lot more extension material than Linolab. Linolab code can be transferred to other Java development environments, though; and, of course, Matlab code can on the whole be transferred between *lab clone environments.

Plotting, I was surprised to discover, was more intuitive than with *lab clones - surprising both because the plotting module, as with MtxVec StatsMaster, is externally based (on JFreeChart in this case) and because Java is less familiar territory for me than *lab. Once I had taken on board the advice to ignore the plethora of classes provided, relying on the default unless it wouldn't do what I wanted, it all went very smoothly in less time and with less attention than I am used to - and the presence of SVG was a pleasant bonus. Calculation took more thought because small issues and convention differences need to be absorbed, but was equivalent in essence. Most of the methods are drawn from the free Jsci API set of packages, which is a good thing in itself since it benefits from the larger expertise base of open source, but some are from CERN's jet.math larder or, in the case of quaternions, apparently developed just for Linolab.

Atebion, a hybrid tree-based modelling environment combining elements of spreadsheet and computational command execution approaches.

The facility for 'playful' exploratory modelling is my onward link to Atebion - I value it in environments like Matlab and Linolab, and Atebion offers a different take on it. In particular, it offers a perspective that is fruitful when jointly exploring ideas with colleagues from other disciplines for whom mathematics is a regrettable necessity and a *lab style CLI-based approach is alien. The interface is an intercrossed combination of command line algebra, spreadsheet organisation and tree navigation. It is not, in essence, unique; nor does it (yet, at any rate) compete in terms of power with the mathematical products familiar to professional scientists; what it does offer is an intuitively accessible view of modelling relations.

Last year [2] I wrote about the ability of developing software to help mathematics 'break out of the ghetto' and become more widely accepted as a tool beyond its traditional sciences hinterland. It was in that setting that Atebion appealed to me at first sight, and that appeal increased as I got to know it. I took it around to a lot of the people I had talked to at that time, and found them unanimously positive towards it. A few months ago [3] I took a similar look at ways to embed explicitly mathematical modelling methods in a public policy debate. Taking Atebion with me to a lot of meetings where I knew that models would be important, but mathematics problematic, was an extension of that second exploration.

Atebion's screen splits into three areas. In the top left quadrant is what looks like a conventional spreadsheet. Below it, across the bottom half, is a command area where algebraic constructs are assembled. At top right is a tree representing parts and aspects of the model. Variable names (not cell references) are used as placeholders in those constructs. Equations in the command area take variable values from spreadsheet cells and return computed values to others, and multiple spreadsheets can be linked to any level of complexity that I had the patience to investigate. That's it. Conceptually, it's no different from a spreadsheet, or an algebra package, or that other spreadsheet mould breaker, DADiSP [4]; psychoperceptually, though, there is a world of difference. You can see the bottom command area as simply an expansion of the Excel formula bar, and the tree as just the notebook management pane of many programs (Statistica or MiniTab, for example); but if maintaining a mental audit of a spreadsheet's multitudinous formulæ grid is a learned behaviour (as it is for me) rather than a natural way of thinking, then this is a breath of fresh air. My only criticism of Atebion is that I would prefer to see the tree moved to the left hand screen edge where convention expects it to be.
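The difference between naming variables and pointing at cells is easier to see than to describe. What follows is a minimal sketch in Python, not Atebion's own syntax: the variable names, the figures and the evaluate helper are all hypothetical, chosen only to show formulas that read their inputs by name and return computed values under new names.

```python
# Hypothetical illustration only: this is not Atebion syntax.
# Inputs play the role of spreadsheet cells that hold entered values.
inputs = {"births": 120.0, "deaths": 80.0, "population": 1000.0}

# Formulas refer to other variables by name, never by cell position.
formulas = {
    "net_change": lambda v: v["births"] - v["deaths"],
    "next_population": lambda v: v["population"] + v["net_change"],
    "growth_rate": lambda v: v["net_change"] / v["population"],
}


def evaluate(inputs, formulas):
    """Evaluate the formulas in the order given and return every named value."""
    values = dict(inputs)
    for name, rule in formulas.items():
        values[name] = rule(values)
    return values


model = evaluate(inputs, formulas)
print(model["next_population"], model["growth_rate"])
```

Because each formula names what it depends on, the model can be read aloud in a meeting without anyone having to trace a grid reference back to its meaning.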

One of the points that counted against Atebion in some discussions beforehand turned out to be a strength in practice: its existing acceptance in some accounting circles, and the financial examples given by publishers Atebit in promoting it. There's a subconscious assumption that finance and science have little in common. On the other hand, that same assumption had a reassuring effect on many people who would normally bridle at an openly mathematical approach: money is something that everyone feels they understand, even if (like me) they don't manage very well with it in reality. Even the most cursory look at Atebion's facilities shows that it is not primarily designed with money in mind; it may not be rich in higher mathematical functions, but it is equipped on the same level as the pocket scientific calculators many of us rely upon for essential day to day scientific work. Trigonometric functions, for example, have little use in the average accountant's day. (There is, of course, a lot of overlap between advanced science and advanced finance; but I checked with my own accountant that I'm not maligning the bulk of the profession out of ignorance.) A number of functions are available (detection of insoluble or unstable situations among them) to logically test the validity of constructed models. Atebion is obviously a young product, and its growth will depend on the requirements of the markets that open up for it; but it is well placed for one of those markets to be science.

One example will do for illustration. I had been asked to attend a mixed group of scientists and administrators running a protected species area on a voluntary basis, to inform a meeting at which they would discuss a contentious question over culling. This happens to be an emotive area for me personally, so I needed to be careful that my private feelings were kept scrupulously separate from my professional input. That problem was compounded by the realisation, during initial discussions, that individual members of the group fell into two basic types: one group had already made up their minds, and my maths wasn't going to change them, while the others would simply glaze over and expect my maths to tell them what to do. Curiously, the scientists were equally split between these two camps: a famous biologist, used to applying advanced mathematical methods in a genetic context, was in the 'glazed expression and obedience' faction when it came to fairly simple ecological arithmetic. Neither psychology was very useful to them, or provided an obvious justification for my being there. The solution turned out to be to provide no information of any kind; just a laptop, a projector, and a copy of Atebion.

In the meeting, I first showed one of Atebit's simple financial models. Everyone (especially the administrators, but the scientists as well) was quite happy with the way a balance sheet was being shown. I then led them through replacing monetary values in the model with population and depredation levels, showing that a balance sheet could show increase and decrease in a species just as well as in a bank account. Then we replaced the financial formulæ with those for a simplistic predator/prey model, and I insisted that the meeting supply agreed ranges for every variable rather than relying on me. Once that was accepted, we moved on to the basic Lotka-Volterra version. This was a step change, since Atebion does not provide native methods for solution of differential equations, but using its reversible input/output goal seeking methods and explicit definitions of differentials in algebraic terms there was nothing difficult involved. We discussed the implication of stochasticising the model using Swift's methods [5], but stopped short of implementing them (with Atebion's existing tools it was harder work than the situation justified, though I intend to pursue the experiment out of interest). By the close of the meeting, they had reached a unanimous decision without any input from me; not a result that I could have easily reached with a conventional spreadsheet, mathematics or analysis and visualisation product.
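The step from differential equations to plain algebra can be sketched as follows. This is a minimal illustration in Python, not Atebion syntax and not the model built in the meeting: the function names, parameter values and step size are all assumptions chosen for the example. The basic Lotka-Volterra equations are dx/dt = ax - bxy for the prey and dy/dt = dxy - cy for the predators; here each differential is replaced by a finite difference over a small time step (the forward Euler method), so that every line is ordinary arithmetic of the kind a spreadsheet-style tool can evaluate.

```python
def lotka_volterra_step(prey, pred, a, b, c, d, dt):
    """Advance both populations by one time step using finite differences.

    a: prey growth rate           b: predation rate
    c: predator death rate        d: predator gain per prey eaten
    """
    change_in_prey = (a * prey - b * prey * pred) * dt
    change_in_pred = (d * prey * pred - c * pred) * dt
    return prey + change_in_prey, pred + change_in_pred


def simulate(prey, pred, a, b, c, d, dt, steps):
    """Return the list of (prey, predator) pairs, starting values included."""
    history = [(prey, pred)]
    for _ in range(steps):
        prey, pred = lotka_volterra_step(prey, pred, a, b, c, d, dt)
        history.append((prey, pred))
    return history


# Assumed, purely illustrative values; a real meeting would agree its own.
history = simulate(prey=10.0, pred=5.0, a=1.0, b=0.1, c=1.5, d=0.075,
                   dt=0.01, steps=1000)
for prey, pred in history[::200]:
    print(round(prey, 2), round(pred, 2))
```

Forward differences are the crudest possible choice and drift away from the true cycles unless the step is kept small, so a sketch like this is a talking point rather than a forecasting tool.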

DTREG takes modelling to a much higher level of sophistication and capability.

While Atebion is the friendliest entry point I've seen, modelling is taken to a much higher level of sophistication and capability by DTREG. Playfulness here gives way to efficiency and comprehensiveness. Phil Sherrod, author of DTREG, has a background that inspires confidence: his previous credits range from a nonlinear regression program and a mathematical plotter to a complete OS marketed by the company (S&H, Nashville) in which he is a partner. This is not a product to pick up and wave around without knowing what you are doing; it brings together under one roof, so to speak, a range of methods that make it a powerful aid for the serious and informed modeller but, equally, a maze for the unprepared or unwary. If you think you are going to spend significant amounts of time in search of critical relationships buried deep within your data, though, this may very well be what you are looking for.

The methods implemented by DTREG cover a broad spread of pros and cons in operational terms, which is where the need for understanding starts: rather than plugging into a black box and trusting it, you will select your method, define the properties of your data, and so on. At one end of the scale are single decision trees of the classical kind, displayed as a graphic; at the other, support vector machines (SVM). Between the two come logistic regression and two developments of the tree idea - forests and boosts. If you know what all of these are, you won't thank me for teaching you to suck eggs; if you don't know what they are then you may not want to. One of the many good indicators to a software product's quality is the standard of its explanations and DTREG gets an A+ grade here - the manual could be used as the basis of a course or text book - so I'll restrict myself to a quick gloss.

A simple decision tree is found in most statistical analysis and visualisation products, and is in essence just a dendritic picture of the way relationships branch to or from a root point. Boosted trees take the output of the initial tree as a point of departure for other trees, while forests grow several trees at once and only then examine their interactions. Despite what I have said about DTREG not being a black box, once you have made your decisions and started these more complex arboricultures, you are flying blind compared to traditional methods; and much the same is true of SVMs. An SVM explores a classification boundary between conceptual data subsets, and the definition of that boundary is the expressed relationship. Decisions made about the degree of definition decide the compromise between theoretical perfection and practical utility. Logistic regression is the outsider, and uses more traditional approaches to examine relationships in terms of either a binary success/failure outcome or a probability of specified outcome.
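For anyone who would like the simplest of these ideas made concrete, here is a minimal sketch in Python of a decision tree with a single branch point, sometimes called a stump. It is not DTREG's algorithm, which is far more elaborate; the function names and the data are invented for illustration. The sketch searches one predictor for the split value that misclassifies the fewest cases, which is the basic move that a full tree repeats at every branch and that forests and boosts repeat across many trees.

```python
def fit_stump(xs, labels):
    """Find the split on one predictor that misclassifies the fewest rows.

    The rule predicts 1 when x is above the threshold, otherwise 0.
    Returns (threshold, number_of_errors).
    """
    values = sorted(set(xs))
    # Candidate thresholds lie midway between neighbouring distinct values.
    candidates = [(low + high) / 2 for low, high in zip(values, values[1:])]
    best_threshold, best_errors = None, len(xs) + 1
    for threshold in candidates:
        errors = sum((x > threshold) != bool(y) for x, y in zip(xs, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors


def predict(threshold, x):
    """Apply the fitted rule to a single new value."""
    return 1 if x > threshold else 0


# Invented data: a dose level against a success (1) or failure (0) outcome.
doses = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
outcomes = [0, 0, 0, 1, 1, 1]
threshold, errors = fit_stump(doses, outcomes)
print(threshold, errors, predict(threshold, 2.0), predict(threshold, 5.5))
```

Real data rarely separates this cleanly, which is exactly why the choice between a single readable tree and the less transparent ensemble methods matters.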

There is a lot more to the package than just the models and their controls; a manipulation language for one thing. More, as with every one of these products, than can be even touched upon. The best start is to download the manual and read it thoroughly; then, if it sounds like a possible solution for your needs, try it out. However beautiful it may be (and DTREG is), nobody buys a product like this without a pragmatic purpose in mind, so the important question is how well these component methods work in practice. After some initial joyrides to get a feel for the interface and choice mechanisms, I pulled out some past archived cases and ran them through to see how they played out - one concerned with mixture of a protective paint film, the other with combination of veterinary pharmaceuticals. The results were pretty much spot on with previous findings; where they weren't, there were lessons to be learnt of one sort or another. In the paints case, I had been careless with some of the input settings; for the pharmaceuticals, DTREG's output matched slight changes from my original recommendations, which have been made in the light of experience.

Pretty good, then. But there's more to it than that; my own methods drew on several different tools, while DTREG offers them all in one packet. It's in the nature of modelling that cross checking is important and one way to cross check is to follow different routes before comparing results; having those different routes in the same place encourages this at the very least. It's also worth noting that within the single structure each use refines understanding of the controls - so that I now have a much better understanding of how I would tackle a problem such as the paint mix than I had before I started. It's a product where each use is an investment, not a one-off adventure.

One of the biggest benefits to me of this odyssey (as it was last time, in retrospect) is the renewed realisation of how much variety is out there and at the same time how much synergy. In many ways, the four products I set out to investigate couldn't be more different, and that is their strength - different horses for different courses. At the same time, it's remarkable how much they reflect, in their different ways, the commonality of core purpose and function. Any application of mathematics is, of course, modelling; but it's fun to rediscover that fact in action, in so many different and productive materials.

References

[1] Grant, F., Not the usual suspects. Scientific Computing World, 2004 (October).
[2] Grant, F., Count me in: maths for the wider public. Scientific Computing World, 2004 (October).
[3] Grant, F., Exploring energy: Easing the way to fuels paradise. Scientific Computing World, 2005 (June).
[4] Grant, F., DADiSP 2002: Escape from the cell block. Scientific Computing World, 2003 (July/August).
[5] Swift, R.J., A Stochastic Predator-Prey Model. Irish Mathematical Society Bulletin, 2002 (48): pp.57-63.
