Too big for its boots
In a delightful way, from a writing perspective, the data analysis topics over the last few issues have first synergised with one another and then led naturally on to the consideration of big data. What exactly is ‘big data’? The answer, it should come as no particular surprise to hear, is ‘it depends’. As a broad, rough and ready definition, it means data in sufficient volume, complexity and velocity to present practical problems in storage, management, curation and analysis within a reasonable time scale. In other words, it is data that becomes, or at least threatens to become in a specific context, too dense, too rapidly acquired and too various to handle. Clearly, specific contexts will vary from case to case and over time (technology continuously upgrades our ability to manage data as well as generate it in greater volume) but, broadly speaking, the gap remains – and seems likely to remain in the immediate future. The poster boys and girls of big data, in this respect, are the likes of genomics, social research, astronomy and the Large Hadron Collider (LHC); the unmanaged gross sensor output of the LHC alone would be around 50 zettabytes per day.
There are other thorny issues besides the technicalities of computing. Some of them concern research ethics: to what extent, for example, is it justifiable to use big data gathered for other purposes (for example, from health, telecommunications, credit card usage or social networking) in ways to which the subjects did not give consent? Janet Currie (to mention only one recent example amongst many) suggests a stark tightrope with her ‘big data vs. big brother’ consideration of large-scale paediatric studies. Others are more of a concern to statisticians like me: there is a tendency for the sheer density of data available to obscure the idea of a representative sample – and a billion unbalanced data points can actually give much less reliable results than 30 well selected ones.
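The point about unbalanced data can be illustrated with a small simulation. What follows is a minimal sketch under stated assumptions, not a formal result: the population, its mean of 50 and the selection rule (only values above 50 get recorded) are all invented for illustration, and the ‘big’ sample is scaled down from a billion points to roughly a hundred thousand so that it runs in a moment.

```python
import random
import statistics


def compare_samples(seed=0, population_size=200_000):
    """Compare a large unbalanced sample with 30 randomly selected cases.

    Everything here is hypothetical: a simulated population with a true
    mean near 50, a selection rule that records only values above 50,
    and a simple random sample of 30.
    """
    rng = random.Random(seed)
    population = [rng.gauss(50, 10) for _ in range(population_size)]
    true_mean = statistics.fmean(population)

    # Large but unbalanced: about half the population, all from one side.
    biased = [x for x in population if x > 50]

    # Small but well selected: 30 cases drawn at random.
    small = rng.sample(population, 30)

    return true_mean, statistics.fmean(biased), statistics.fmean(small)


true_mean, biased_mean, small_mean = compare_samples()
print(f"true mean          : {true_mean:.2f}")
print(f"large biased sample: {biased_mean:.2f}")
print(f"random sample of 30: {small_mean:.2f}")
```

However many points the unbalanced sample contains, its estimate stays displaced by the selection effect, whereas the error of the random sample shrinks as cases are added.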
Conversely, however, big data can also be defined in terms not of problems but of opportunity. Big data approaches open up the opportunity to explore very small but crucial effects. They can be used to validate (or otherwise) smaller and more focused data collection, as for instance in Ansolabehere and Hersh’s study of survey misreporting. As technology gives us expanding data capture capabilities at ever-finer levels of resolution, all areas of scientific endeavour are becoming increasingly data intensive. That means (in principle, at least) knowing the nature of our studies in greater detail than statisticians of my generation could ever have dreamed. A couple of issues back, to look at the smaller end of the scale, I mentioned the example of an automated entomological field study regime simultaneously sampling 2,000 variables at a resolution of several hundred cases per second. That’s not, by any stretch of the imagination, in LHC territory but it is big enough data to make a significant call on a one terabyte portable hard drive. It’s also a goldmine opportunity for small-team, or even individual, study of phenomena that not long ago would have been beyond the reach of even the largest government-funded programme: big data has revolutionised small science.
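To put a rough figure on that field study, the arithmetic can be sketched as follows. Only the 2,000 variables and the one terabyte drive come from the example above; the 300 cases per second (standing in for ‘several hundred’), the 8 bytes per stored value and the decimal terabyte are assumptions made purely for illustration.

```python
def fill_time(capacity_bytes, variables, cases_per_second, bytes_per_value):
    """Return (bytes written per second, seconds until the drive is full)."""
    rate = variables * cases_per_second * bytes_per_value
    return rate, capacity_bytes / rate


ONE_TERABYTE = 10**12  # decimal terabyte, as drive makers count it (assumed)

rate, seconds = fill_time(
    capacity_bytes=ONE_TERABYTE,
    variables=2_000,       # from the field study example
    cases_per_second=300,  # assumed value for "several hundred"
    bytes_per_value=8,     # assumed: one double-precision number per value
)
print(f"{rate / 1e6:.1f} MB per second, drive full after {seconds / 3600:.1f} hours")
```

Under those assumptions the rig writes about 4.8 megabytes per second and fills the drive in roughly 58 hours of continuous recording, before any compression or inline reduction.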
There is, in any case, no going back; big data is here to stay – and it will grow ever bigger, because it can. Like all progress, it’s a double-edged sword and the trick as always is to manage the obstacles in ways that deliver the prize. Most of the LHC’s raw output is not used or stored; only the crucial data points for a truly useful sample are analysed.
From a scientific computing point of view, the first problem to be so managed is described succinctly by Informatica’s Greg Hanson (see box: Only Connect): integration of often wildly different data sets to allow treatment as a single analytic entity. Traditional relational database management systems (RDBMS) run into difficulties here. Where big data results from the aggregation of similarly structured systems (suggested exploitation of Britain’s National Health Service records as a clinical research base is an example to which I shall also return), old approaches might still work, but most cases don’t fit that model. RDBMS rely upon consistency of field and record structure that is, by definition, missing from multiple data sources compiled for different purposes by different research programmes in different places and times, scattered higgledy-piggledy across the reaches of the internet cloud. Indeed, there may not even be enough coherence between the data streams emerging for immediate examination from different captures within a single organisation.
Matt Asay, at scalable data specialist MongoDB (see box: Space science in real time), describes RDBMS as ‘one of the world’s most successful inventions’, having for four decades ‘played an integral part in a wide array of industries, and any significant scientific discovery that required a data set’. Now, however, the flood of big data has burst the chreodic banks that made that approach viable.
The solution is emerging in the form of NoSQL (Not only SQL) database management systems, particularly the document-oriented approach that, to oversimplify considerably, retrieves documents based on their content and key reference, using their internal structure (rather than one which is externally imposed) to assemble the data within them in a useful form. Once that data is integrated, by whatever means, there remain issues to be resolved. There is the human difficulty of mentally getting a handle on huge data volumes. There is, in some cases, the physical impossibility of storing that volume or processing it meaningfully in a finite time. And there are issues around the potential compromise of research quality by the siren call of supply quantity.
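The document-oriented idea can be sketched in a few lines. This is a toy illustration built on plain Python dictionaries, not the API of any real NoSQL product; the function names `get_path` and `find`, the record keys and the records themselves are all hypothetical. Each document carries its own structure, and retrieval works by key or by matching on whatever content a document happens to contain.

```python
def get_path(doc, dotted):
    """Follow a dotted path such as 'reading.kind' through nested dicts.

    Returns None when any step of the path is absent from this document.
    """
    current = doc
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(collection, query):
    """Return (key, document) pairs whose content matches every query item."""
    return [
        (key, doc)
        for key, doc in collection.items()
        if all(get_path(doc, path) == wanted for path, wanted in query.items())
    ]


# Three invented records with no shared schema, as might arrive from
# different instruments or research programmes.
collection = {
    "obs-001": {"source": "sensor", "reading": {"kind": "temperature", "value": 21.4}},
    "obs-002": {"source": "camera", "file": "frame_0001.png", "tags": ["night"]},
    "obs-003": {"source": "sensor", "reading": {"kind": "humidity", "value": 0.63}},
}

by_key = collection["obs-002"]  # retrieval by key reference
by_content = find(collection, {"source": "sensor", "reading.kind": "humidity"})
print([key for key, _ in by_content])
```

No table definition has to be agreed in advance: a record lacking a queried field simply fails to match, which is the flexibility that a fixed relational schema cannot offer.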
Despite all our current analytic computing power, an important factor in good analysis is a live, grounded, intuitive human overview of the data under examination. The most effective means so far developed of maintaining that overview is, as Golden’s Sabrina Pearson eloquently points out (see box: A picture is worth a million data points), mapping of variables and relationships onto sensory metaphors – predominantly by graphic visualisation though also, to a more restricted extent, by sonification. This was my theme in the August/September issue of Scientific Computing World, so I’ll stick to a big data case study here.
France’s CEA (Alternative Energies and Atomic Energy Commission) is a government-funded research organisation with numerous ramifications. One of those is a division of immuno-virology at Fontenay-aux-Roses, which explores and develops vaccine strategies for treating chronic and emerging viral infections. A major bottleneck in this division’s work lay in the various reporting systems and processes handling large bodies (up to 50,000 fluorescence-tagged cells per second) of flow cytometry data. A disproportionate amount of expert time was tied up in managing the data flow rather than analysing and understanding it, then using it to feed experimental programmes.
Antonio Cosima, responsible for this aspect, happened across Tableau, a visual business data reporting product, and spotted the potential for his own situation. He installed a trial copy and tried it out on his data; the trial was a success, and he went on to integrate a full installation as a core component of the division’s data handling structure. The flood of research data, instead of progressing through a serial chain of reporting and review steps that eventually feed back into process adjustment, now passes through a single visual analysis stage that informs immediate decision making on the fly. The system facilitates close, rapid control of instrument, material or participant selection, among other aspects. It places exploratory visualisation in the hands of each team member through quickly learned interactive dashboards, and supplies publishable report illustrations as part of its operation. Cosima estimates that switching to a visual data reporting system saves the division ‘days of work’, which can now be switched to other, more productive purposes.
The challenge of large, complex, high-velocity data products that threaten to exceed available storage capacity is often met by applying reduction strategies inline. These are often based on quite traditional methods. Qlucore’s Omics Explorer, for example, uses principal component analysis and hierarchical clustering to fish out the most relevant informational strands from the data flood. Omics Explorer takes its name from its roots at Lund University in the big data sets of proteomics, genomics, etc. – and the reduction of massive data sets to understandable results in a short time is central to its purpose. Study of gene expression in meningiomas, or in circulating blood cells in pursuit of the holy grail of individualised diabetes treatment, is an example. NextGen RNA sequencing is cutting-edge, but the statistical tests involved may once again be even more traditional than PCA. F and t tests would have been familiar to statisticians of my grandparents’ time, never mind my own, but their application in a few seconds to sample profiling from immense microarray data sets would have defied belief not so very long ago.
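As a flavour of how a traditional test serves as a reduction step, the sketch below scores each variable with a t statistic and keeps only the strongest. It is an illustration under stated assumptions, not a description of how Omics Explorer or any other product works: the Welch (unequal variance) form of the t statistic is my choice, the function names are hypothetical, and the three ‘genes’ and their values are invented.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples.

    Assumes each sample has at least two values and that the two
    samples are not both constant (otherwise the divisor is zero).
    """
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.fmean(a) - statistics.fmean(b)) / standard_error


def rank_variables(group_a, group_b, top=2):
    """Return the names of the `top` variables with the largest |t|."""
    scores = {
        name: abs(welch_t(group_a[name], group_b[name])) for name in group_a
    }
    return sorted(scores, key=scores.get, reverse=True)[:top]


# Invented expression values for three variables in two sample groups.
group_a = {
    "gene_1": [5.1, 4.9, 5.0, 5.2],
    "gene_2": [8.0, 8.3, 7.9, 8.1],
    "gene_3": [2.0, 2.5, 1.8, 2.2],
}
group_b = {
    "gene_1": [5.0, 5.1, 4.8, 5.1],
    "gene_2": [5.9, 6.2, 6.0, 6.1],
    "gene_3": [2.6, 2.9, 2.4, 2.7],
}

print(rank_variables(group_a, group_b, top=2))
```

Scaled up from three variables to tens of thousands, the same loop is what turns an unmanageable table into a shortlist that a human being can inspect; a real analysis would also correct for the number of tests performed.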
Looking ahead, the inevitable rise and rise of big data promises to drive increasingly imaginative scientific computing approaches. There will, of course, be continuations of the high-performance route, testing the limits of processor and architecture development, but artificial neural networks (ANNs), to take just one example, are leading in other interesting directions. Researchers monitoring ecosystems off the Florida coast, for example, analyse high density data streams from autonomous robot environmental sensor systems for significant patterns using ANNs that explicitly mimic natural processes.
It’s not a huge conceptual leap from developing ANNs to the simulation or even storage of a whole organism. This (usually focused on the capture of a functioning human intellect) has been a favourite science fiction theme at least since the invention of the electronic computer, and would involve immense big data issues – unfeasible as yet, but no longer laughably so. The idea is beginning to get serious consideration and funding for small beginnings in that direction. The University of Waterloo’s Computational Neuroscience Research Group have built the largest functional model yet of a human brain and made it control an arm using simulation software MapleSim; only a relatively paltry two and a half million neurons as yet, but the principle is established. In the last few months, researchers at MIT and the Max Planck Institute have reconstructed a neural wiring for a respectable chunk of a mouse retina. Never mind the storing of a human mind; if a realistic central nervous system analogue of any kind could one day be constructed on a human complexity scale, big data would have created its own best analytic engine.
References and Sources
For a full list of references and sources, visit www.scientific-computing.com/features/referencesoct13.php
Space science in real time
One great example of scientists taking a document-orientated database approach to big data is the space weather forecasting tool at the Met Office. The team has responsibility for space weather events like coronal mass ejections and solar flares, which impact performance of the electricity grid, satellites, GPS systems, aviation and mobile communications. They used our scalable document-orientated NoSQL database to analyse a large volume and wide variety of data types including solar flare imagery from NASA and live feeds from satellites tracking radiation flux, magnetic field strength and solar wind. The system not only tracks security-critical events as they unfold, but also stores and monitors complex data for pattern analysis – just the type of challenge for which document-orientated databases are so well suited. This is just one example among many of how scientists are using different tools to interact with big data, delivering research that changes how we understand the universe.
Matt Asay, VP corporate strategy at MongoDB
Only Connect
Big data offers an unprecedented opportunity to draw greater levels of insight about the world around us than ever before. However, this data can be hard to manage. There are too many disparate data sets for many analytics to be really accurate and useful.
If we truly want to take advantage of big data, these enormous amounts of disparate data need to be brought together quickly and easily. Without this first step of effective data integration, analytics cannot take place, and insights cannot inform useful action. Only in this way can data unleash their potential.
Greg Hanson, chief technology officer for Europe, the Middle East and Africa, Informatica
A picture is worth a million data points
Data pours in from multiple automated locations. Yesterday’s information is replaced or supplemented. Complexity grows, new variables await analysis, and processes are reworked. Long-term trends are easy to overlook when data accumulate by the microsecond; minor nuances can be mistaken for significant events.
Visualisation is critical in these large data situations. It highlights possible lines of enquiry, brings fuzzy mountains of data into clear focus, and displays multiple variables in a single simple image. Colours and symbols highlight data shifts. Multiple images combined as video show changes in variables over time. Visualisation communicates big data to audiences: quickly updating a director, funding provider or non-technical team member. New collaborators are teased to produce their own ideas to explore. That simple, well designed graph or map puts everyone on the same page, regardless of whether they understand the underlying data.
Sabrina Pearson, Grapher product manager, Golden Software
Big data analytics give insights into things and relationships we never knew existed. But R&D is based on good records, defending results by enabling others to repeat, validate and utilise findings in their own work. Unless analytic outputs are re-usable and consumable, captured alongside the context of how you got there, their usefulness is seriously diminished.
It’s like whisky production: start with a huge vat of ingredients, cook them up, and after various process steps distil out and capture the very small final amount of valuable product. Advanced electronic laboratory notebooks, like our E-WorkBook, do just that with big data.
Aggregation and analysis must not disrupt day-to-day tasks. Technology must be the bridge connecting big data analytics to experimental process management. Scientists can then capture the results they need at the bench.
Paul Denny-Gouldson, VP for translational medicine at IDBS