The rise of Hadoop

Hadoop, an open source software project for distributed computing, is becoming synonymous with big data. Beth Harlen speaks to industry experts to find out why

Dr Sumit Gupta, GM of the Tesla Accelerated Computing Group at Nvidia

The use of faster, larger computers for everything from simulating natural phenomena to simulating every product we make has become mainstream in the past decade. Today, supercomputers touch every aspect of our lives, from the weather forecast we check in the morning, to the development of the drugs we take, to the design of the cars we drive. This has dramatically increased the use of data-intensive computing, and affordability of these systems is key to their widespread deployment. GPU accelerators enable supercomputing for the masses by reducing the cost of supercomputing by 10 times compared to CPU-based systems.

GPU accelerators are being widely adopted by both the HPC industry and the big data community because they offer an affordable way to tackle the deluge of data. The same software methods, and the same hardware systems that were the purview of the HPC industry until a few years ago, are now being used for these everyday simulations and mobile cloud-based computing. This is being driven in part by the growing availability of affordable, high-performance GPU accelerators that can be installed in smaller computing systems. Hadoop is a software method for solving a compute problem by distributing it across hundreds or thousands of servers. If there is a database, it is also distributed across all the servers. GPU accelerators can be used in the servers running Hadoop to speed the processing of this work across distributed servers. The only way to pull useful information from all this data is to have affordable and scalable computing infrastructures. 

Chris Brown, data analytics consultant at OCF

Just so we all understand the size (according to Intel) of the big data phenomenon: from the dawn of time until 2003 mankind generated five exabytes of data; in 2012 it generated 2.7 zettabytes (roughly 500 times more data); and by 2015 it is estimated that figure will have grown to eight zettabytes (one zettabyte is 1,000 exabytes). How does this mass of information affect the scientific community?
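As a quick check on those figures, the arithmetic can be sketched in a few lines of Python. The only assumption is the decimal conversion stated above (one zettabyte is 1,000 exabytes); the variable names are illustrative.

```python
# Figures quoted in the text (attributed to Intel), expressed in exabytes.
# Assumes decimal units: 1 zettabyte = 1,000 exabytes.
EXABYTES_PER_ZETTABYTE = 1000

data_until_2003_eb = 5        # five exabytes, dawn of time to 2003
data_2012_eb = 2700           # 2.7 zettabytes generated in 2012
data_2015_eb = 8 * EXABYTES_PER_ZETTABYTE  # eight zettabytes (estimate)

# How many times larger is each figure than the pre-2003 total?
ratio_2012 = data_2012_eb / data_until_2003_eb
ratio_2015 = data_2015_eb / data_until_2003_eb

print(f"2012 vs pre-2003: {ratio_2012:.0f}x")
print(f"2015 vs pre-2003: {ratio_2015:.0f}x")
```

The 2012 ratio works out at 540, which is where the rounded 'roughly 500 times' comes from.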

Well, you can thank the scientific community for ‘big data’, emanating as it does from ‘big science’. It made the link between having an enormous amount of data to work with and the mathematically huge probabilities of actually finding anything useful, spawning projects in areas such as astronomical imaging (planet detection), physics research (supercollider data analytics), medical research (drug interaction), weather prediction and others. It is also the scientific community that is at the forefront of the new technologies making big data analytics possible and cost-effective. Major projects are under way to evaluate core technologies and tools that take advantage of collections of large data sets.

One technology addressing this emerging market is the Hadoop framework, which redefines the way data is managed and analysed by leveraging the power of a distributed grid of computing resources using a simple programming model to enable distributed processing of large data sets on clusters of computers. Its technology stack includes common utilities, a distributed file system, analytics, data storage platforms, an application layer that manages distributed processing, parallel computation, workflow, and configuration management. 
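To give a feel for that ‘simple programming model’, here is a minimal single-machine sketch of the map, shuffle and reduce steps in plain Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical, and on a real cluster the framework would run the map and reduce steps on many servers and handle the grouping itself.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, value) pairs for one input record; here, (word, 1)."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    lines = ["big data big science", "big clusters"]
    print(run_job(lines))
```

The programmer writes only the per-record and per-key logic; distributing the records and collecting the results is left to the framework.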

Ted Dunning, chief application architect, MapR

What Map Reduce, the Hadoop framework that assigns work to the nodes in a cluster, does well – and this is not the entire Hadoop experience – is certain types of computations on very large out-of-core datasets. Usually these datasets tend to be in the form of sparse observations of some kind – this may mean log lines, measurements from a physical system, or text derived from messages. The critical points are that we have an online computation the size of the data, and that the cost of that computation is linear. The scientific computations that fit that constraint are numerous, especially anything that’s embarrassingly parallel. With image processing in astronomy, for example, most parts of the sky can be processed fairly independently and so you can do data reduction on whole sky imagery or large scale synthetic aperture radio astronomy of the kind that the Square Kilometre Array (SKA) will be producing. That can be done pretty efficiently with Hadoop.
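The ‘embarrassingly parallel’ pattern described above can be sketched as follows. This is an illustrative toy, not an astronomy pipeline: the tiles are tiny lists of made-up brightness values, the function names are hypothetical, and a local thread pool stands in for the nodes of a cluster. The point is only that each tile is summarised without reference to any other, and that the work grows linearly with the number of pixels.

```python
from concurrent.futures import ThreadPoolExecutor


def reduce_tile(tile):
    """Summarise one sky tile on its own: (pixel count, brightness sum, peak)."""
    pixels = [value for row in tile for value in row]
    return len(pixels), sum(pixels), max(pixels)


def combine(summaries):
    """Merge the per-tile summaries into one whole-sky result."""
    count = sum(summary[0] for summary in summaries)
    total = sum(summary[1] for summary in summaries)
    peak = max(summary[2] for summary in summaries)
    return {"mean": total / count, "peak": peak}


def process_sky(tiles, workers=4):
    """Process every tile independently, then combine the small summaries."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        summaries = list(pool.map(reduce_tile, tiles))
    return combine(summaries)


if __name__ == "__main__":
    # Two hypothetical 2x2 tiles of brightness values.
    tiles = [
        [[1, 2], [3, 4]],
        [[10, 0], [0, 2]],
    ]
    print(process_sky(tiles))
```

Because no tile needs data from its neighbours, adding more workers (or more servers) requires no change to the per-tile logic.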

The drawback is that many of the numerical algorithms we’ve used in the past don’t work well in this new environment. A good example is the singular value decomposition (SVD) of a large matrix, for which an algorithm called Lanczos has been known to be wonderfully good for many years. Involving sequential matrix-by-vector multiplications, it works very well on supercomputers, but on Hadoop the cost of iterating a process like that is much higher, so Lanczos becomes less appropriate. The rise of Hadoop has meant that we’ve had to find alternative algorithms that fit well in this new paradigm. Traditional clustering is also iterative, but there are new algorithms that allow you to go through the data very quickly and develop a sketch of it so that you can then apply a classic algorithm. Many of these problems require substantial restatements and that’s a severe downside of Hadoop.
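The cost of iteration can be illustrated with a much simpler relative of Lanczos. The sketch below uses plain power iteration, not Lanczos itself, on a tiny made-up matrix; what the two share is a chain of matrix-by-vector multiplications in which each step needs the result of the previous one. The `passes` counter is an illustrative device: in memory each pass is cheap, whereas under a batch framework each full pass over the matrix would typically be a separate job that re-reads the data.

```python
import math


def matvec(rows, vector):
    """Multiply the matrix by a vector: one full pass over the matrix data."""
    return [sum(a * x for a, x in zip(row, vector)) for row in rows]


def power_iteration(rows, iterations=50):
    """Estimate the dominant eigenvalue and count the passes over the data."""
    vector = [1.0] * len(rows)
    passes = 0
    for _ in range(iterations):
        # Each step depends on the previous vector, so the passes
        # cannot be merged into a single sweep over the data.
        product = matvec(rows, vector)
        passes += 1
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    # One more pass to form the Rayleigh quotient for the unit vector.
    product = matvec(rows, vector)
    passes += 1
    eigenvalue = sum(v * p for v, p in zip(vector, product))
    return eigenvalue, passes


if __name__ == "__main__":
    matrix = [[2.0, 0.0], [0.0, 1.0]]  # hypothetical example matrix
    eigenvalue, passes = power_iteration(matrix)
    print(f"dominant eigenvalue ~ {eigenvalue:.6f} after {passes} passes")
```

A single-pass ‘sketch’ approach of the kind mentioned above aims to replace those dozens of sequential passes with one sweep that produces a small summary, on which a classic algorithm can then run in memory.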

But, to cause a revolution in computing, you have to restate the problem. 

Jeff Denworth, VP of marketing at DataDirect Networks

The reality is that we’re seeing two shifts occur in the marketplace. Firstly, big data is being distilled down into analytics of very large data sets. Click down one more level and you’ll find that almost everyone is talking about Hadoop. The two have become so linked I predict that, in the future, the term ‘big data’ will be used only to talk about Hadoop or NoSQL-style parallel data processing.

The general industry trend is to take scale-out servers and deploy the technology on those servers. But there are a growing number of customers around the world who are realising there may be a smarter way of building more efficient and scalable infrastructure for these technologies. The HPC market is looking for centralised storage that enables people to, first and foremost, decouple the capacity they need to deploy from the compute, because there’s not always a 1:1 ratio. Secondly, high-throughput centralised storage allows people to deliver to a compute node with faster I/O performance than they can get by simply deploying 12 drives within that node. Drives aren’t getting any faster, but computers are and they’re getting hungrier – a lot hungrier if you add an accelerator or GPGPU to the system. As a result, people are really challenging the notion of data nodes – some call them super data nodes – because no matter how many disks they put into the system it seems like they can’t get the greatest level of compute efficiency. An alternate approach is to put a number of InfiniBand cards into that node and really drive the LAN speed as opposed to the speed of commodity disks.

The Hadoop ecosystem has become an extraordinary movement, where developers are involved and a lot of work is being done at scale as people recognise that it is the default platform for batch and real-time computing. Internally, we refer to it as a data refinery. In the past, the general trend has been towards a ‘pizza box’ scale-out mentality, but now the industry is really engaging in discussion of smarter approaches to a Hadoop infrastructure. Parallel processing at massive scale is the next stage of data analytics. Performance and compute efficiency are increasingly key system attributes, and balanced, efficient systems will usher in broad adoption of parallel Map Reduce processing methods. 

Barbara Murphy, chief marketing officer at Panasas

The message that we have tried to get across is that Hadoop is an emerging application for non-SQL workloads – which is the vast majority of what is being created as unstructured data. There needs to be a way of taking petabytes of unstructured data and putting it into a format that is semi-structured and can be used by the traditional business intelligence tools out there. Organisations have such huge silos of data that they don’t even know what they have any more, and so many are beginning to look at Hadoop as a way of categorising that content. Rather than making specific, deterministic queries, Hadoop is about clustering common elements together and bringing out patterns. In that first case, people are attempting to prove an answer, while in the second case it’s about looking for commonality in order to find an answer.

What’s interesting about Hadoop is the fact that the technology is still in its infancy, so the industry hasn’t quite figured out how it fits in. People are experimenting with it, and many more are using it across numerous fields, but it’s a limited set of people who have actually figured out how it fully fits into their workloads. The approach everyone has been taking is to make Hadoop a standalone workflow, whereas it should be viewed as a piece of the puzzle.

Somehow, the Hadoop workloads became meshed with the Hadoop hardware platform in people’s minds and there’s the idea that it’s purely local storage and compute together. But that’s not a requisite – that’s a coincidence born out of the way Google built out its original system. The difficulty for users is that they end up with a dedicated system for these workloads, but then have to move that data somewhere else in order to make it more usable. The compelling message is that the Hadoop workload itself is not tied to the hardware that’s become so prevalent.

Companies don’t need to invest in new or alternative technologies, as Hadoop can be run on existing infrastructure as long as it’s a solid scale-out system that can deliver the performance required. So, it’s important that people aren’t put off by the idea that another piece of hardware is necessary. The focus too often is on what piece of storage is needed to use Hadoop, whereas the real question is how to get value out of using it. The point we want to get across is that the physical storage shouldn’t be connected to the decision of how to use Hadoop. The technology will be taking off within 24 months, but the shortage will be of people who know how to program to the languages. That’s the next hurdle. 

Bill Mannel, vice president of server product management at SGI

Big data is truly in the eyes of the beholder, but can essentially be defined as data that is significantly larger in volume than people are accustomed to, that needs to be accessed faster than was previously necessary, or that is comprised of multiple different types. In terms of coping with the big data shift, Hadoop is one of the first and more successful instantiations. Born out of the internet space, Hadoop spanned out to other industries rather quickly as people witnessed its effectiveness. That being said, it is important to note that it’s not the only big data method out there – but it is arguably the most popular.

The main advantage of Hadoop is that it’s open source, which has allowed an active community to be built around it. This community has lowered the financial and technical barriers, and ensured that innovations are occurring at a pace. Typical software companies have a major release once a year, and a minor release once a year. Now we are seeing a lot of companies within the Hadoop space having one major release per quarter. Hadoop really is right on the forefront of the big data explosion. It effectively flips the paradigm because it brings compute to the data, whereas data has generally been stored in an array before being brought across to the server when needed.

With Hadoop, you have data nodes and assign the processing to the data nodes directly. It’s a unique proposition when it comes to big data, and it’s also very scalable and easy to add nodes to as the data grows. This combination is very attractive. However, as the number of nodes increases, there are limitations. Latency becomes a big issue but this can sometimes be addressed with higher-bandwidth interconnects. There are also software challenges that come from increasing node sizes. I do expect to see more broadly based solutions as the 1,000- and 10,000-node Hadoop clusters become more popular.
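The idea of bringing compute to the data can be sketched with a toy task assigner. This is not Hadoop's scheduler: the block and node names are hypothetical, and the rule used here (send each task to the least-loaded node that already holds a copy of the block) is a simplifying assumption chosen only to show the principle.

```python
def assign_tasks(block_locations):
    """Assign each block's task to the least-loaded node holding a replica."""
    load = {}
    assignment = {}
    for block, nodes in sorted(block_locations.items()):
        for node in nodes:
            load.setdefault(node, 0)
        # Prefer the node with the fewest tasks so far; break ties by name.
        chosen = min(nodes, key=lambda node: (load[node], node))
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


if __name__ == "__main__":
    # Hypothetical layout: each block is stored on two data nodes.
    blocks = {
        "block-1": ["node-a", "node-b"],
        "block-2": ["node-a", "node-c"],
        "block-3": ["node-b", "node-c"],
    }
    for block, node in assign_tasks(blocks).items():
        print(f"{block} -> {node}")
```

In this layout every task runs where its block already sits, so no block has to cross the network before processing starts.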

The key point is that users have the opportunity to begin Hadoop projects and discover the big data in their midst. SGI, for example, has been in existence for more than 20 years and has lots of data lying around in a multitude of different places. We don’t know where much of it is or, indeed, how relevant it is. This is not unique to SGI and many companies are seeking ways to explore and take advantage of that historical data.

Our back office consists of a standard type of database and, if I want to look at different data, I have to write a request to our IT department. It could be some time before I get a new form written, enabling me to access the data differently. But if you look at what can be done with some of the tools that sit on top of Hadoop, you realise that you can start to build your own applications and view the data how you want to. There are challenges in terms of using the environment and ensuring that everyone is up to speed with the technology, but based upon what was possible previously, we can now access more data, faster. There are many ways of becoming familiar with Hadoop, such as meetings worldwide and online training – we’ve partnered with Cloudera on training, for example.