Managing a modern HPC cluster

High-performance computing (HPC) has seen the introduction of many new technologies in recent years, from heterogeneous computing to the cloud, big data, and even deep learning. All of these changes can be challenging, particularly for systems administrators accustomed to traditional HPC clusters.

The task of managing a data centre has become much more important – and much more complex at the same time. Fortunately, if you are responsible for such a data centre, good management tools can help. Let's take a look at what's changing and explore some ways to make managing it all easier.

Reining in the challenges of cluster management

A major piece of the cutting-edge data centre technology puzzle is the cluster. The term is used to mean different things, so for the sake of this discussion, let's define a cluster as a collection of computers, linked through a network, that acts as a single, much more powerful machine. Clusters are modular: individual servers (nodes) are independent, fully functioning units that can be added or removed, upgraded, shared, assigned, rearranged, etc., as needed. Clusters are also redundant by nature; generally speaking, if one node in a cluster fails, the remaining nodes can share its load and continue operating.
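To make the redundancy point concrete, here is a minimal sketch, assuming a toy model in which a cluster is simply a mapping from node names to the job IDs running on each node. The `Cluster` class, its methods, and the node and job names are hypothetical illustrations, not any real cluster manager's API. New jobs go to the least-loaded node, and when a node fails its jobs are resubmitted to the survivors.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    # Hypothetical toy model for illustration: node name -> job IDs on that node.
    nodes: dict[str, list[str]] = field(default_factory=dict)

    def add_node(self, name: str) -> None:
        """Add an independent node; it starts with no jobs."""
        self.nodes.setdefault(name, [])

    def submit(self, job: str) -> str:
        """Place a job on the least-loaded node and return that node's name."""
        if not self.nodes:
            raise RuntimeError("cluster has no nodes left to run jobs")
        # Ties go to the node that was added first.
        target = min(self.nodes, key=lambda n: len(self.nodes[n]))
        self.nodes[target].append(job)
        return target

    def fail_node(self, name: str) -> None:
        """Remove a failed node and share its jobs among the remaining nodes."""
        orphaned = self.nodes.pop(name)
        for job in orphaned:
            self.submit(job)


cluster = Cluster()
for n in ("node1", "node2", "node3"):
    cluster.add_node(n)
for j in range(6):
    cluster.submit(f"job{j}")

cluster.fail_node("node2")
print(cluster.nodes)
# {'node1': ['job0', 'job3', 'job1'], 'node3': ['job2', 'job5', 'job4']}
```

Real schedulers also have to checkpoint or restart the interrupted work itself; this sketch only shows how the load is redistributed.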

Why are clusters moving to the data centre? For many users today, the reason is that a clustered infrastructure can increase efficiency, enabling new types of workloads to be run effectively at scale.

Examples of the types of workloads running on clustered infrastructure in modern data centres include:

1. High-Performance Computing (HPC) – once the exclusive domain of research labs, HPC is finding its way into corporate data centres, where it can be applied to all kinds of business problems, not just scientific simulations and calculations.

2. Cloud Computing – perhaps one of the most powerful resources of modern data and information systems, cloud computing creates an environment where resources can be allocated to different applications dynamically. These resources may be corporate-owned in a private cloud, or they may be in a public cloud, like Amazon Web Services (AWS).

3. Big Data Analytics – the uses for big data analytics are diverse: growth algorithms, demand forecasting, clickstream analysers, recommendation engines, fraud detection and more all benefit from the underlying technologies, which include Hadoop, Spark, and scalable NoSQL tools like Cassandra.

4. Deep Learning – a practical incarnation of artificial intelligence that is increasingly being used to go a step beyond big data analysis. Deep learning is improving video processing, natural language processing, translation, and a host of other practical applications.

Because these are mission-critical workloads, effective cluster design and management are crucial and have to scale to enterprise IT levels. Production applications demand reliable, scalable infrastructure to run on. As more scale-out architectures move to the data centre, the importance of cluster management will continue to grow.

Bigger, better hardware is becoming the norm

As the needs of consumers change with time, so does HPC. We continue to see an increasing need for large-scale (in some cases even exascale) computing and data-centric processing with high I/O performance. At the same time, tight budgets are driving requirements for better energy efficiency and encouraging multi-use cluster hardware.

One way to make the most of the compute footprint is to adopt accelerators. According to some predictions, more than half of the systems installed this year will include accelerators, such as GPUs. Additionally, servers with big data capabilities are becoming more affordable, making them the standard choice for general-purpose servers and storage clusters.

New parameters, new atmospheres: HPC in the cloud

The term HPC resists a one-size-fits-all, cut-and-dried definition. Broadly, HPC aggregates computing resources to solve problems that can't be handled by a single machine.

Of course, the aggregation of computing resources can take many forms, and each scenario will be a little different from the next. Some only involve traditional HPC workload managers. Others will include Hadoop or Spark for big data. Still others will require cloud-based servers or a mix of on-premise and cloud-based servers. The question then becomes: how does your organisation aggregate these resources in a manner that is both flexible and efficient? At a minimum, you will need a solution that can allocate and release various resources independently as needed.

Invest in good software to manage your clustered infrastructure

If you are running a modern data centre that incorporates clustered infrastructure, an effective management solution can be a very worthwhile investment. Let's take a look at how it can help.

Consider the effort involved in deploying, configuring, monitoring, and managing all of the servers in your clusters. Think of the work involved in setting up something like a Hadoop cluster (with tens or even hundreds of servers), and you'll begin to see the value that automated provisioning and installation bring. With the right tools, you can deploy a cluster from bare metal, reliably and quickly.

A centralised management console that enables you to monitor and manage your clustered infrastructure is a tremendous time saver. So is automated cluster setup. Once you've got the server cluster going, you will want to make sure it stays healthy, both the cluster as a whole and the individual nodes that make it up. Keeping on top of problems as they arise brings peace of mind and substantially improves uptime.

Corral those containers

Beyond the basics of clustered infrastructure, your deployment may require more advanced capabilities, such as containers. A container provides a complete runtime environment (application, dependencies, libraries, and configuration files) bundled into one package. Containers are similar in many ways to virtual machines (VMs), the major difference being that containers share the host's operating system kernel rather than each carrying a full operating system of their own.

This provides some efficiencies over VMs, while still solving the problem of how to get software to run reliably when moved from one computing environment to another.

You can run multiple containers per server, reducing the number of servers you need and lowering overall cost. This cost saving is leading an increasing number of organisations to deploy containerised architectures in their data centres.

Advanced cluster managers may support containers within clusters by integrating Docker technology. This allows users to isolate processes in the cluster from one another and ensure applications run in a consistent environment. Isolating processes in the cluster makes it easier to write secure applications to run in a clustered environment. Containers can also ensure that applications have access to whatever file systems, third-party applications, or other resources they need, no matter which cluster the application is sent to run on. If your cluster manager also supports Kubernetes orchestration, containers can share resources across the cluster for tasks such as load balancing.

Solving your hybrid cloud conundrum

An excellent cluster manager is essential when it comes to your organisation's hybrid cloud setup, because configuring and managing a hybrid cloud can be very challenging. Balancing resources between on-premise, private cloud, and public cloud requires data centre managers to make some hard decisions about reliability and security. It also requires sophisticated configuration and management expertise.

What can you do if your firm doesn't have the requisite experts on staff? Make sure you have an excellent cluster manager at your disposal. Make sure your organisation can stand up server clusters in a hurry. Make sure cloud bursting is ready in case you need to quickly call up more capacity from the public cloud. Make sure you can organise all of your computing resources in a modular fashion so that changes can be made quickly and hassle-free.
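At its core, cloud bursting is a capacity decision. The sketch below assumes a deliberately simple, hypothetical policy: queued demand that fits on free on-premise cores stays local, and only the overflow is sent to the public cloud, rounded up to whole cloud nodes and capped by a budget limit. The function name and its parameters are illustrative assumptions; real cluster managers expose their own, more sophisticated bursting policies.

```python
def cloud_nodes_to_request(
    queued_cores: int,
    free_onprem_cores: int,
    cores_per_cloud_node: int,
    max_cloud_nodes: int,
) -> int:
    """Return how many cloud nodes to burst to under a hypothetical policy.

    Demand that fits on free on-premise cores stays local; only the overflow
    triggers a burst, rounded up to whole nodes and capped by max_cloud_nodes.
    """
    if cores_per_cloud_node <= 0:
        raise ValueError("cores_per_cloud_node must be positive")
    overflow = max(0, queued_cores - free_onprem_cores)
    # Integer ceiling division: a partly used node still has to be requested.
    needed = (overflow + cores_per_cloud_node - 1) // cores_per_cloud_node
    return min(needed, max_cloud_nodes)


# 500 queued cores, 256 free locally: 244 cores overflow -> 8 nodes of 32 cores.
print(cloud_nodes_to_request(queued_cores=500, free_onprem_cores=256,
                             cores_per_cloud_node=32, max_cloud_nodes=10))
# 8
```

The cap matters in practice: without it, a sudden flood of queued work could translate directly into an unbounded public cloud bill.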

I'm biased, of course, but I believe you should give Bright Computing serious consideration when it comes to selecting management software for your hybrid cloud. As Julia Shih puts it in a blog post titled 2016: Year of the Monkey (and the Hybrid Cloud): ‘Bright OpenStack works by discovering bare-metal hardware and installing a Linux distribution over the top of it. This Linux platform then serves as the foundation for a fully operational OpenStack cluster, which in turn can provision and manage virtual machines compatible with any major public cloud (AWS, Azure, etc.).

The OpenStack cluster utilises on-premise hardware resources and serves as a company's private cloud platform, and is capable of bursting workloads to the public cloud as necessary, making it a very effective hybrid cloud.’

The latest version of Bright Cluster Manager includes enhanced support for AWS integration and containers, among other new features. Together with Bright OpenStack, which makes spinning up clusters for HPC and other uses inside an OpenStack private cloud quicker and easier than ever, it provides a solid hybrid cloud implementation that may be just what your organisation needs.

Do you need cluster management?

In my opinion, any organisation that uses HPC in combination with other cluster-based systems can benefit from using advanced cluster management software. The ability to quickly repurpose nodes for the task at hand, constantly monitor the entire system for problems, and manage hardware, software, and network components in an integrated fashion is indispensable to a thriving organisation. There are many benefits.

Automated configuration and deployment lend agility to organisations, as new systems can be brought online quickly to respond to new business requirements.

Constant monitoring of systems ensures that every component within a cluster is tracked for performance and availability. Such vigilance goes a long way towards spotting problems early so that you can intervene, correct the issue, and keep the workloads running.
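Here is a minimal sketch of that kind of monitoring, assuming each node periodically reports a heartbeat timestamp plus a couple of health metrics. The `NodeStatus` fields, the alert thresholds, and the `find_problems` helper are hypothetical; production monitoring tools collect far richer data, but the principle of checking every node against expected bounds is the same.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    # Hypothetical per-node report; field names are illustrative only.
    name: str
    last_heartbeat: float   # time of the node's last report, in seconds
    cpu_temp_c: float       # CPU temperature in degrees Celsius
    disk_used_frac: float   # fraction of local disk in use, 0.0 to 1.0


# Hypothetical alert thresholds.
HEARTBEAT_TIMEOUT_S = 60.0
MAX_CPU_TEMP_C = 85.0
MAX_DISK_USED = 0.90


def find_problems(nodes: list[NodeStatus], now: float) -> dict[str, list[str]]:
    """Return a mapping of node name -> problems found; healthy nodes are omitted."""
    problems: dict[str, list[str]] = {}
    for node in nodes:
        issues = []
        if now - node.last_heartbeat > HEARTBEAT_TIMEOUT_S:
            issues.append("no heartbeat")
        if node.cpu_temp_c > MAX_CPU_TEMP_C:
            issues.append("CPU too hot")
        if node.disk_used_frac > MAX_DISK_USED:
            issues.append("disk nearly full")
        if issues:
            problems[node.name] = issues
    return problems


now = 1000.0
report = find_problems(
    [
        NodeStatus("node1", last_heartbeat=990.0, cpu_temp_c=55.0, disk_used_frac=0.40),
        NodeStatus("node2", last_heartbeat=900.0, cpu_temp_c=60.0, disk_used_frac=0.95),
        NodeStatus("node3", last_heartbeat=995.0, cpu_temp_c=92.0, disk_used_frac=0.50),
    ],
    now,
)
print(report)
# {'node2': ['no heartbeat', 'disk nearly full'], 'node3': ['CPU too hot']}
```

In practice a check like this would run on a schedule and feed an alerting system, so that an overheating node or a filling disk is flagged before it interrupts running workloads.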

On-demand cluster extension into the cloud enables a dynamic, responsive data centre that caters to the needs of a modern enterprise, with far fewer capacity bottlenecks.

A capable cluster manager can take the grunt work out of building and maintaining a clustered infrastructure of any kind. As a result, your researchers are free to pursue their main objective instead of acting as ad-hoc systems administrators. And your systems administrators can get more work done in less time, making your entire organisation more efficient.
