Across many industries, a common theme is emerging: the complexity of systems we wish to understand is growing at an alarming rate. Sometimes this increase in complexity is intentional; for example, the addition of new control parameters on cars to increase fuel economy or engine power. In other cases, such as patient drug trials, the complexity is inherent, and the challenge is to take more of this complexity into account to come to a better understanding of the experimental data collected.

A number of techniques have been developed to deal with the challenge of increasing system complexity: design of experiments, non-parametric modelling, and hierarchical statistical modelling. Such techniques are familiar to a growing number of expert mathematicians, statisticians, and engineers in the field, many of whom have developed their own routines to solve these problems. Many industries (including automotive, finance, and pharmaceutical) could benefit from applying these techniques to their problems. The new challenge is to make these powerful tools accessible to a wider group of users so that commerce and industry can reap the benefits.

An increase in the number of variables or parameters in a system or experiment can have a large impact on the number of data points that must be collected. The traditional technique of data collection changes one parameter at a time and tests a full matrix of possible settings. If 10 settings of each variable are considered, then with just six variables a million (10 to the power of six) measurements are already required using this traditional method. The data collection stage of product development is often one of the most costly, and such a large number of tests is often infeasible. For example, when calibrating a new car engine, the commercial pressure to reduce time-to-market means that the company can afford to make an expensive prototype engine available for testing for only a short time. In patient drug trials, ethical and regulatory considerations further intensify the need to reduce the amount of testing required. It is therefore important to collect the data so that the maximum amount of information is obtained from as few test points as possible. This is the aim of optimal and space-filling design-of-experiment techniques.
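To make the contrast concrete, the sketch below counts the runs in a full matrix of settings and then builds a small space-filling design. It uses a Latin hypercube, which is only one common kind of space-filling design and is chosen here as an illustration; the function names, the 50-point budget, and the scaling of every variable to the range 0 to 1 are assumptions made for the example, not details of any particular tool.

```python
import random


def full_factorial_size(levels, n_vars):
    """Number of runs in a full matrix with `levels` settings per variable."""
    return levels ** n_vars


def latin_hypercube(n_points, n_vars, seed=0):
    """Return n_points test points in the unit cube [0, 1)^n_vars.

    Each variable's range is cut into n_points equal strata, and every
    stratum is used exactly once per variable, so the points spread over
    the whole range of every variable.
    """
    rng = random.Random(seed)
    columns = []
    for _ in range(n_vars):
        strata = list(range(n_points))
        rng.shuffle(strata)  # random pairing of strata across variables
        columns.append([(s + rng.random()) / n_points for s in strata])
    # Transpose the per-variable columns into one row per test point.
    return [list(point) for point in zip(*columns)]


print(full_factorial_size(10, 6))    # 1000000 runs for the full matrix
design = latin_hypercube(50, 6)      # hypothetical budget of 50 test points
print(len(design), len(design[0]))   # 50 points, 6 variables each
```

In practice each column would be rescaled from the unit interval to the physical range of its variable, and points that violate operating-region constraints would have to be handled separately.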

The increase in system complexity also demands more advanced modelling techniques. When there are just a couple of input variables, and no non-linear behaviour to contend with, the standard approach of fitting multivariate polynomials to the data suffices. However, in many challenging industrial situations, polynomials are not flexible enough to describe fully the trends in the data across the full range of the input parameters. More flexible statistical models are required to capture the non-linear responses.

A number of model types can offer this increased flexibility. If the data takes a non-simple shape in the direction of just one of the input variables, then a hybrid spline can be used to fit the data. A hybrid spline is polynomial in all variables except the one showing the interesting response, where a spline (a number of pieces of polynomial joined together smoothly) is used. On the other hand, if the complex behaviour occurs across all the input variables, then radial basis functions (a type of neural network) are a suitable tool. A radial basis function model can be pictured as a linear combination of hills or bump functions, the Gaussian function being one of the most common choices of hill. The position, shape, width and height of the hills can be varied to get the best fit to the data. This flexibility can be used to fit scattered data with highly complex trends.
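The "linear combination of hills" picture can be sketched in a few lines. This is a minimal illustration under simplifying assumptions: one Gaussian hill is centred on each data point, all hills share one fixed width, and only the heights are solved for, so that the model passes through every point. A practical fit would also tune the positions and widths and would usually use fewer hills than data points. The data values and the function names are made up for the example.

```python
import math


def gaussian(r, width):
    """A Gaussian hill of the given width, evaluated at distance r from its centre."""
    return math.exp(-(r / width) ** 2)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_rbf(centres, values, width):
    """Heights of the hills such that their sum reproduces every data value."""
    phi = [[gaussian(math.dist(p, c), width) for c in centres] for p in centres]
    return solve(phi, values)


def predict_rbf(x, centres, weights, width):
    """Evaluate the sum of weighted hills at the point x."""
    return sum(w * gaussian(math.dist(x, c), width) for w, c in zip(weights, centres))


# Made-up scattered data in two input variables with a non-linear response.
centres = [(0.0, 0.0), (0.5, 0.1), (1.0, 0.0), (0.1, 0.5), (0.6, 0.6),
           (0.9, 0.5), (0.0, 1.0), (0.4, 0.9), (1.0, 1.0)]
values = [math.sin(3 * x) * math.cos(2 * y) for x, y in centres]
width = 0.5  # assumed hill width, fixed rather than optimised

weights = fit_rbf(centres, values, width)
print(predict_rbf((0.5, 0.5), centres, weights, width))  # prediction between data points
```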

In some applications, such as pharmaceutical drug trials performed on a number of patients, there is inherent structure in the variability (error) in the data, because the data is collected for one individual at a time. The error distribution between patients differs from the error distribution within an individual patient. A technique called two-stage modelling, or more generally, hierarchical statistical modelling, can be used to gain a better understanding of the errors in the data, which can lead to more accurate models. In the automotive industry, a similar hierarchical data structure occurs when collecting engine torque data in a series of sweeps of torque against spark ignition timing, each corresponding to a different setting of engine speed and valve timings. For each sweep, all the parameters except ignition timing are held fixed, and ignition timing is swept across its permissible range (called a local sweep). The clinical physician has a pre-conceived image of the shape of the curve that relates a patient response to a changing drug concentration, just as the automotive engineer has a good idea of the expected shape of torque against ignition timing curves. Two-stage modelling allows the engineer or scientist to choose an appropriate model that can capture this shape, for example, a quadratic polynomial or logistic growth curve.

In the first stage of two-stage modelling, a local curve is fitted to the data collected from each patient or sweep. The coefficients of each of the local curves will depend on the values of the other (global) parameters. For example, the constant term will tend to be larger for the sweeps corresponding to higher speeds, or the rate at which a patient responds to a changing drug concentration may depend on the body mass of the patient. At the second stage, the dependency of each of the coefficients of the local curves on the global parameters is explicitly modelled. This allows interpolation between the local curves, or between the patients, and builds up a model of the response (e.g. torque) as a function of both the local parameters (e.g. ignition timing or drug concentration) and the global parameters (e.g. engine speed and valve timings, or body mass and age of patient). Using a two-stage modelling approach can also make it easier to remove rogue data measurements, because each data point can be viewed in the context of the sweep or individual in which it was collected.
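The two stages can be sketched on synthetic data as follows. This is a deliberately simplified illustration, not the algorithm of any particular toolbox: both stages use ordinary least squares, the local curve is assumed to be a quadratic in ignition timing, each local coefficient is assumed to vary linearly with a single global variable (engine speed), and the "true" engine formula, its numbers, and all function names are invented for the example. A full hierarchical treatment would also model the error structure within and between sweeps.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest power first) via the normal equations."""
    n = degree + 1
    ata = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    aty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    return solve(ata, aty)


def polyval(coeffs, x):
    """Evaluate a polynomial whose coefficients are given lowest power first."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


def true_torque(spark, speed):
    """Hypothetical engine used only to generate illustrative, noise-free data."""
    return (50 + 10 * speed) + (4.0 + 0.5 * speed) * spark - 0.1 * spark ** 2


speeds = [1.0, 2.0, 3.0, 4.0]                    # global variable (assumed units: 1000 rpm)
sparks = [10.0, 15.0, 20.0, 25.0, 30.0, 35.0]    # local variable (assumed units: degrees)

# Stage one: fit a quadratic local curve to each sweep separately.
local_coeffs = [polyfit(sparks, [true_torque(s, n) for s in sparks], 2) for n in speeds]

# Stage two: model each local coefficient as a straight line in engine speed.
global_models = [polyfit(speeds, [c[k] for c in local_coeffs], 1) for k in range(3)]


def predict_torque(spark, speed):
    """Rebuild the local curve at an arbitrary speed, then evaluate it at `spark`."""
    coeffs = [polyval(g, speed) for g in global_models]
    return polyval(coeffs, spark)


# Interpolate between sweeps: 2.5 is a speed at which no data was collected.
print(predict_torque(20.0, 2.5))
```

The generating formula gives a torque of 140 at an ignition timing of 20 and a speed of 2.5, so the printed prediction should land very close to that value. With real, noisy measurements the stage-one coefficients for each sweep also provide a natural place to spot rogue data points before the second stage is fitted.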

Until recently, there has been a lack of commercially available software that implements design of experiments, advanced statistical modelling methods and calibration tools in a user-friendly package. Although there are a number of standard statistical modelling packages available, these are usually limited to polynomials and do not offer the more advanced response surface modelling techniques such as splines, neural networks or two-stage models. Most software tools that implement design of experiments are restricted to classical designs and to optimal design tools for polynomials, and cannot take operating region constraints into account. There is little or no ability to create space-filling designs that are suitable for use with non-linear models such as radial basis functions. However, in December 2001, this situation changed when The MathWorks launched the Model-Based Calibration Toolbox (MBCT), the first commercially available software to implement all of the techniques discussed. Initially packaged for the automotive industry (it was developed in conjunction with Ford Motor Company), it has received a very positive response, with international automotive companies adopting the toolbox and making it an integral part of their processes.

Automotive innovators are reaping the benefits of these powerful new techniques in several ways. The MBCT is being successfully applied to calibrate advanced engine types, where there is an urgent need for new techniques. This use is most notable in regions where customer demand for improved fuel economy is strong and government regulations on emission levels are strict, for example in Japan and Europe. Worldwide, the MBCT is being used to obtain cost savings by reducing calibration time on a wide range of engine types.

Although these new statistical techniques offer many benefits, they face a major barrier, which seems to be applicable world- and industry-wide: effective knowledge transfer from the innovative scientists and engineers to the production engineer or technician. To assist in crossing this barrier some functionality already exists in the MBCT.

Re-usable templates that capture experimental designs and model settings can be created by an expert and passed to less expert users; the ability to visualise and interact with the experimental designs and models is crucial for building understanding; and the export of models into other tools such as Simulink allows models to be shared across departments. Companies such as The MathWorks also offer customised training and consultancy to help customers integrate these new techniques into their processes.

These steps have gone a long way towards breaking down the educational barrier, but more work is still needed. In its future development of the MBCT, The MathWorks will continue to focus on ways to make it easier to transfer knowledge effectively into a mass production environment. The hope is that this is the start of a wider adoption of advanced statistical techniques.

*Dr Tanya Morton is a senior engineer with The MathWorks.*