The many shades of data intensive
‘HPC is no longer an accurate description of the field. We’re now talking about Intensive Computing, where HPC fits as one element in a wider picture,’ says Cliff Brereton, Director of the Hartree Centre for UK Supercomputing & Big Data Analytics.
Science has a big problem with data. Different areas within science are generating bigger and bigger data sets. In areas of research, from astronomy and particle physics to the life sciences and translational medicine, the size of these data sets is threatening to overwhelm even the largest HPC installations.
It’s not only sheer volume that is the problem. In many areas of science, large data sets are also becoming increasingly complex. Data sets can be made up of tens or hundreds of thousands of inter-related object data states, each transforming and updating every second. Other recent scientific modelling exercises have involved hundreds of thousands of micro-models, each one interconnected to create a real time simulation of a complex human organ.
Data can be complex as a result of massive fragmentation as well. New, cheaper, more portable analytical devices make it easier and faster to carry out complex genetic or chemical analyses. But often the result is short-length analysis. Large-scale computation of data sets produced by such analytical devices often demands very large numbers of separate disk reads, file sorting and concatenation, and very high memory loads.
The Internet of Things is another potential source of very large data sets. Laboratory devices and scientific instrumentation can already produce large amounts of data. The trend towards the Internet of Things is likely to result in far more meta-data and annotation data, much of it being produced on the fly. Increasingly there will be demands to monitor this data, merge it with other data sets, and interact with it in real time.
Then there is the challenge and opportunity around Big Data – which can be defined as very large volumes of often unstructured data collected using cheap commodity systems that prioritise data capture over data transformation or annotation. Such data is typically generated from web sources, social media, or mobile phones.
There are a number of reasons why scientific computing is increasingly encountering this variety of Big Data. In disease modelling, economic forecasting, and psychology, data drawn from the web and social media can be invaluable. The range of scientific disciplines that are finding uses for Big Data is constantly expanding.
Highly dimensional data
Dr Jianqing Fan of Princeton has pointed out that Big Data is distinguished not only by size but also by high dimensionality. Typically, the variables around each item are highly multivariate, and data sets can be irregular, with gaps as well as additional information.
This high dimensionality in Big Data results in heavy noise accumulation and in spurious, incidental correlations. Big Data samples are also frequently aggregated from uncontrolled sources, giving rise to issues of heterogeneity, experimental variations and statistical biases. Computationally, Big Data incurs not only an obvious heavy computational cost but also, less obviously, an insidious algorithmic instability. The result is that using Big Data in experiments with standard methods can lead to faulty statistical inference.
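The effect of dimensionality on spurious correlation can be illustrated with a short simulation. The sketch below is not taken from Fan's work; it assumes a fixed sample of 50 observations, generates a target and a growing number of features that are all independent Gaussian noise, and reports the strongest correlation found. The function names and parameters are illustrative only.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def max_abs_correlation(n_samples, n_features, seed=0):
    """Largest |correlation| between a random target and pure-noise features.

    Every value is independent Gaussian noise, so any correlation found
    here is spurious by construction.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, target)))
    return best


if __name__ == "__main__":
    # Same sample size each time; only the number of noise features grows.
    for n_features in (5, 50, 500):
        print(n_features, round(max_abs_correlation(50, n_features), 3))
```

Because the same seed produces the same features in the same order, the strongest spurious correlation can only stay level or rise as more features are added, which is the pattern a standard method risks mistaking for signal.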
What this sheer volume of data – huge and growing scientific data sets, increasingly complex data, and also commodity hardware derived, web sourced Big Data – entails is a major challenge to established scientific computing models.
Workloads are becoming increasingly I/O bound throughout the job, and memory and network capabilities are facing new strains.
To exacerbate the situation further, the advent of such large and frequently complex scientific data sets, as well as Big Data, is promoting a change in expectations and demands among scientific computing end users. There are calls to allow more data intensive and unstructured data handling languages, including Python, R and Perl. Catering to these requirements creates its own set of challenges for HPC and scientific computing installations. There are also mounting calls for greater online interactivity in data intensive HPC workloads.
Addressing all these challenges calls for changes not only in system setup and workload design, but also in skill sets, methodologies, templates and even – in working with some data sets – analytical cultures. And attempting to address these challenges reveals that there is a looming skills shortfall in high performance data analysis professionals, not only in many installations but also across industry. The solution may be lateral: not to look even harder for skilled professionals who simply do not exist, but instead to choose potential candidates from appropriate research and university backgrounds.
Research objectives need to frame HPC data analysis
According to Joe Duran, head of enterprise computer servers at Fujitsu, HPC efforts with large data sets should avoid data crunching just for the sake of data crunching: ‘There’s an awful lot of effort put in to trying to discover unknowns from very large data sets. People trying to discover trends that they don’t know are there. This requires a tremendous amount of processing and effort, with multiple MapReduces. And it boils down to trying to discover if there’s a needle there in the first place. What people need to do is ask themselves what their core objectives are, frame the study objectives rather than just look at every piece of data coming back.’
For Thomas Hill, Director of Analytics at Dell, attempting to analyse highly dimensional, complex data sets is pointless and potentially error prone. It makes far more sense, he believes, to create multiple micro models with the data: ‘It makes little sense to build a model against very large and very complex data sets. Computing an average against a hundred well drawn examples can give you just as accurate a figure as against vast numbers. And if the large data sets are highly dimensional, then you’re looking at the potential for error. The real opportunity with very large data sets is in micro-segmentation. That is, building and testing a large number of models. And when you do that, the data volume for every model is then not very large.’
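A minimal sketch of the micro-segmentation idea follows. It is an illustration only, not Dell's method: it assumes records arrive as hypothetical (segment, x, y) tuples and fits a separate least-squares line to each segment, so that every individual model sees only a small slice of the data.

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept to (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_per_segment(records):
    """Group (segment, x, y) records by segment and fit one small model to each."""
    by_segment = defaultdict(list)
    for segment, x, y in records:
        by_segment[segment].append((x, y))
    return {segment: fit_line(points) for segment, points in by_segment.items()}


# Hypothetical data: two segments that follow different linear trends.
records = [("a", x, 2.0 * x + 1.0) for x in range(10)]
records += [("b", x, -0.5 * x + 4.0) for x in range(10)]
models = fit_per_segment(records)
print(models)
```

A single line fitted to the pooled records would blur the two trends together, whereas the per-segment models recover each one from only ten points.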
But then you are faced with the challenge of how to build very large numbers of models: ‘You then face the problem that you have to manage very large numbers of models,’ says Hill. ‘The case where you have data scientists building models no longer scales. The process of building models needs to become fully automated. That is a problem that falls into the domain of what is often called “deep learning”.’
More complex, highly dimensional data sets also introduce the need to be able to set depths of processing, according to Hewlett-Packard’s John Gromala. With highly dimensional and with object data, users will need to decide just how much processing resource they want to commit to a particular workload, and how much they can afford. This is a shift away from the processing of flatter data that has characterised conventional scientific HPC. ‘With many of the new data sets that we’re looking at coming into HPC, there’s a real trade-off between speed and granularity,’ says Gromala. ‘The annotation process can be multidimensional, and have varying depth. In this scenario, the tuning has to be in the hands of the users. Some people will want to do it as fast as possible, and they’ll get a certain level of detail. Others will want to expend greater resources in order to gain much greater granularity of detail. APIs such as Swift for object storage are emerging that make this level of granularity and control possible, and fast and secure as well.’
Object stores and metadata make the difference
Efficient object storage that can serve HPC installations, both with object data and metadata, is seen by many as a key system component to dealing with very complex large datasets. ‘Object storage is becoming a cornerstone of new HPC applications,’ says Hewlett-Packard’s John Gromala. ‘Because object storage is becoming so central, the entire compute philosophy in HPC will increasingly be wrapped around a common object model, all the metadata, and all the other pieces will be consistent, and the systems will increasingly be optimised for this model. And the role of memory in this optimisation is key.’
Efficient capture and serving of metadata from object stores will increasingly affect HPC computation of very large object data sets, argues Molly Rector from Data Direct Networks: ‘It’s relatively trivial, adding very large amounts of complex data to large object stores. But the key becomes how do you annotate it, and store that metadata. The really bigger challenge in large object file systems is capturing and managing metadata efficiently.’
Cloud becomes an essential HPC component
For many vendors, the Cloud is becoming an essential part of managing very large and complex data sets. ‘Complex data sets create a real issue,’ says Jason Stowe, CEO of Cycle Computing. ‘Their combinatorial nature can produce very bursty performance and I/O spikes. Utilisation of equipment becomes an issue. You have long periods of time when hardware is simply not being utilised. There’s been an element of this in HPC already. This trend is only going to increase as data gets more complex. It’s a significant factor in why people are looking to Cloud as a resource in HPC.’
‘Highly heterogeneous workloads, where the workload profile shifts dynamically over the course of the job, are far more cost-efficient in the Cloud than they can be on premise,’ says Stowe.
Smarter workload automation and introducing more artificial intelligence into HPC is seen by many as pivotal in enabling HPC to deal with increasingly complex large scale data loads: ‘Combining intelligence, storage, memory and workload in an overall whole is key to handling very large, complex data sets in HPC,’ says Chris Thomas, Analytics Solution Architect at IBM. ‘Computation and caching have to match in an access and processing hierarchy that performs multi levels of analyses on very large data sets, with each level of processing stripping away an order of data volume.’
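The hierarchy Thomas describes, in which each level of processing strips away roughly an order of magnitude of data, can be sketched as repeated block summarisation. This is an assumed, simplified stand-in for whatever analyses a real system would run at each level: here every level just replaces each block of ten values with its mean. The function names and the block size are illustrative.

```python
from itertools import islice


def summarise_blocks(values, block_size=10):
    """Yield the mean of each consecutive block of values."""
    iterator = iter(values)
    while True:
        block = list(islice(iterator, block_size))
        if not block:
            return
        yield sum(block) / len(block)


def reduction_hierarchy(values, levels=3, block_size=10):
    """Apply the block summary repeatedly.

    Returns the item count seen at each level and the final values.
    Levels are materialised as lists only so they can be counted;
    a production pipeline would stream between levels instead.
    """
    current = list(values)
    counts = [len(current)]
    for _ in range(levels):
        current = list(summarise_blocks(current, block_size))
        counts.append(len(current))
    return counts, current


counts, top_level = reduction_hierarchy(range(10_000))
print(counts)
```

Each pass hands the next level a tenth of the data, so the most expensive analysis at the top of the hierarchy only ever touches a small summary.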
For John Gromala, Senior Director, Hyperscale Product Management at Hewlett-Packard, large scale object and complex data processing can only be handled by significant amounts of workload automation: ‘When you’re doing tens of thousands of objects at very low latency, you can fine-tune the fabric, the compute, and the storage media. But when you’re dealing with billions of objects, at that scale, you’ve got to have extensive workload automation to make it possible.’
However, whereas industry has adopted standard automation solutions in its data centres, many HPC sites have created their own custom-built automation solutions. ‘There’s very little commonality in workflow optimisation in HPC,’ comments Molly Rector, Chief Marketing Officer at Data Direct Networks.
Make HPC simpler for end users
Accessibility of HPC resources, and a gap between the owners of data and the keepers of HPC resources, are seen by many as major issues that have to be addressed as HPC moves towards more data intensive computing.
According to Paul Anderson, Director of Professional Services at Adaptive Computing, simplified, accessible interfaces to HPC resources are increasingly needed. ‘Classically there’s a problem with HPC where you have a physicist, or a biochemist, or petroleum geologist, and they get told that you need to learn how to code, and you need to learn Linux, and only then can you use HPC,’ he says. ‘But organisations that have HPC are trying to get their users to the answers as fast as possible. So we see a real need now for simplified web interfaces that will accelerate the use of HPC in scientific contexts.’ ‘For most users, the key to making HPC more accessible is creating standardised APIs such as SOAP and RESTful APIs, and tying these together through Java or Python. We are seeing a real and growing demand for Python in particular for users coming to HPC,’ says Anderson.
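To give a flavour of what tying a RESTful API together through Python might look like, the sketch below builds a job submission request using only the standard library. The service address, the '/jobs' path and the JSON field names are all hypothetical; they do not describe Adaptive Computing's products or any real scheduler's interface.

```python
import json
import urllib.request


def build_job_request(base_url, script, cores):
    """Build (but do not send) a POST request describing a job.

    The '/jobs' path and the JSON field names are hypothetical,
    not any real scheduler's API.
    """
    payload = json.dumps({"script": script, "cores": cores}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/jobs",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_job(request, timeout=30):
    """Send the request and decode a JSON reply; needs a real service to run against."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder address; nothing is sent until submit_job() is called.
request = build_job_request("https://hpc.example.org/api", "run_model.sh", cores=64)
print(request.get_method(), request.full_url)
```

Separating request construction from sending means a researcher's script can be checked without touching the cluster, and the user never has to log in to a Linux shell to queue work.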
Chris Brown, of UK HPC analytics specialists OCF, also sees differences between data owners and HPC specialists: ‘HPC has often been set apart from end users. But large data sets are owned and sourced by departments outside of HPC. And their objectives for their data are distinct. So you can see a clash of mindsets and of priorities.’
Growing need for data scientists
The need to process larger, more complex, and more highly dimensional data is introducing the need for new disciplines in scientific HPC. There is a need for more data scientists, domain experts and even artificial intelligence experts. And the type of programming expertise demanded is changing too.
Aridhia works with the UK’s NHS to produce greater insight into major population health issues such as diabetes. With data coming from myriad organisations, and many different types of data collection, understanding just what each dataset represents is key. Unfortunately, there’s a real shortfall in data expertise in the organisations that submit their datasets to national authorities tasked with producing disease analytics. ‘From the perspective of high performance computation of healthcare data, we’ve found there really are a lot of barriers in practice,’ says Aridhia’s Andy Judson. ‘One is that there is so little data science expertise in the field among users. We have found that we have had to supply that out to organisations who want to apply HPC compute to their healthcare data. What we find is that there are plenty of clinicians and policy makers who want answers. But what are also needed are domain specialists who really understand how the data has been captured, how it’s been stored, how it’s been modelled, and how it’s been annotated, and then separately you need someone who can do the programming.’
Institutions such as the Hartree Centre and OCF are busy attempting to recruit more data scientists to fill the gap. But for OCF’s Chris Brown, finding data scientists is problematic. ‘There are very few trained data scientists with experience in this field,’ he says. ‘It’s such a new field that there is a real skills shortage. What we’ve found is that smart people with backgrounds in HPC and programming can pick up data science, and really excel at it.’
For Fujitsu’s Joe Duran, adding data scientists is not enough to handle very complex data sets in HPC. Artificial intelligence skills are also vital to create the algorithms that can handle complex, multiple source and highly dimensional data. ‘To handle these very large and very complex data sets, we need an additional level of expertise in HPC that specialises in AI,’ says Duran. ‘Financial HPC is ahead of the game here. They’ve been doing this for longer than anyone, and they have a whole new layer of AI experts who can design and manage algorithms that produce accurate, directed results, and don’t produce noise.’
The rise and rise of open source
One effect of being faced with increasingly complex and highly dimensional data in HPC is that it is increasing the predominance of open source solutions.
‘Open source is a real boon,’ says Aridhia’s Andy Judson. ‘It has produced really transparent tools, which are very powerful, and which there is a tremendous community around. This is really important especially in data science in the HPC arena, where you need really powerful tools, that people can be trained in rapidly, which are robust, and where there’s a lot of help and education available. It’s about getting people up to speed with the best tools as fast as possible.’
Ironically, with the rise of open source, conventional standards setting is becoming less and less relevant in HPC. ‘Standards in HPC are becoming a lot less relevant now,’ argues Data Direct Networks’ Molly Rector, ‘because open source is becoming the predominant standard setter. So formal vendor-set standards are really set to decline in importance.’