Tech focus: HPC management software


As HPC systems continue to grow in complexity, both in the sheer number of hardware components and in the increasing parallelism of those components, the effective management of these systems becomes ever more critical to maximising the return on investment in scientific computing. 

Ensuring that a cluster is utilised efficiently is a full-time job. However, employing the latest software can reduce the burden of supporting an HPC cluster, reduce the number of people needed to manage the resource, or allow more science and engineering to be completed by utilising the resource more efficiently. 

The options available for cluster management software are as varied as the types of computing systems on which they can be deployed. Whether you are an academic institution leveraging open source software due to budget restrictions, or a commercial company paying for software with additional support and maintenance, choosing the right software package can save key resources. 

OCF has developed its own software stack through the implementation and deployment of HPC systems over many years. ‘When you scale over a number of machines, you need to have a way of managing those machines, installing an operating system, setting up and managing that environment,’ said Andy Dean, sales director, OCF. 

‘That’s essentially what a piece of cluster management software, or a suite of management software, gives you: the ability to manage and provision capabilities that keep the environment consistent across all those machines. An HPC cluster isn’t an HPC cluster until you install a management stack on top of it – it’s just a collection of nodes.’

OCF has developed its own HPC management software stack based on several open-source technologies that the company works with regularly and recommends to its users. This stack consists of several components, including Slurm, xCAT, Proxmox, CrateDB, Prometheus, SaltStack and Grafana. However, Dean notes that the company is flexible and has extensive experience working with different types of HPC software. 

‘We have a modular approach that allows us to add any other major component... and the skills to allow us to implement any of those components along with our software stack, if that’s what our customers would prefer,’ said Dean.

Laurence Horrocks-Barlow, technical director at OCF, commented: ‘We’ve developed the stack, we can use any technologies as modules, but our preferred, well tested and integrated workload manager is Slurm. Over the last few years we’ve been putting an official development product life cycle around the development of our in-house software stack to the point that there’s a dedicated research and development team with dedicated releases quarterly. We put a lot of time and focus on what our customers require. 

‘The OCF STEEL stack is this continual development and integration with what the HPC community wants, and what customers require to provision their HPC services to their customers,’ said Horrocks-Barlow. 


The importance of the management stack when building a cluster shouldn’t be underestimated. OCF Steel is a suite of cluster management software that allows organisations to run HPC applications on their clusters, together with the tools that help to manage, maintain and monitor the HPC environment. With OCF’s modular approach, customers can choose OCF Steel’s standard software stack installation, built on tried and tested technologies, combined with the flexibility of adding bespoke components if required. 

With 20 years of experience in HPC cluster management, OCF offers a flexible pathway to its customers utilising open-source technologies, making it easier to upgrade and support the cluster management software stack without any software licensing fees. OCF Steel is deployed using a resilient, open-source virtualised management platform, making it simpler to facilitate any necessary upgrades or adaptations to the cluster. Customers using the OCF Steel software stack are supported by OCF’s dedicated HPC support team.

Other products 

Aspen Systems Cluster Management software comes standard with all of its HPC clusters, along with its standard service package at no additional cost. Aspen Cluster HPC Management software is compatible with most Linux distributions and is supported for the life of the cluster. 

Bright Cluster Manager software automates the process of building and managing modern, high-performance Linux clusters, eliminating complexity and enabling flexibility. It allows users to deploy complete Linux clusters over bare metal and manage them reliably from edge, to core, to cloud. Providing cluster management solutions for the new era of HPC, Bright Computing combines provisioning, monitoring and management capabilities in a single tool that spans the entire lifecycle of a Linux cluster. With Bright Cluster Manager, administrators can provide better support to end users and the business. 

Advanced Clustering Technologies (ACT) has designed ClusterVisor to enable you to easily deploy your HPC cluster and manage everything from the hardware and operating system to software and networking using a single GUI. The full-featured ClusterVisor tool gives you everything you need to manage and make changes to your cluster over time. ClusterVisor is highly customisable to ensure you can manage your cluster and organise your data in a way that makes the most sense. 

eQUEUE from ACT is a software solution that allows system administrators to create easy-to-use, web-based job submission forms. It is designed to increase cluster utilisation by bringing more users to the cluster who would ordinarily stay away due to the complexity of submitting jobs to a cluster. There is no need to learn Linux or scripting. The end user simply inputs his or her data into predefined fields and the job is now in the cluster’s queue to run. 

The Scalable Cube from HPC Scalable is an enterprise-ready, supported distribution of an open-source workload scheduler that supports a wide variety of HPC and analytic applications. Whether deployed on site, on virtual infrastructure, or in the cloud, customers can take advantage of top-quality support services from HPC Scalable, helping ensure the success of managing their HPC workloads. 

Microsoft Azure high performance computing (HPC) is a complete set of computing, networking and storage resources integrated with workload orchestration services for HPC applications. With purpose-built HPC infrastructure, solutions and optimised application services, Azure offers competitive price/performance compared to on-premises options, with additional high-performance computing benefits. Additionally, Azure includes next-generation machine-learning tools to drive smarter simulations and empower intelligent decision making.

Adaptive Computing’s Moab Cluster Suite is a professional cluster workload management solution that integrates the scheduling, managing, monitoring and reporting of cluster workloads. Moab Cluster Suite simplifies and unifies management across one or multiple hardware, operating system, storage, network, license and resource manager environments. Its task-oriented management and flexible policy engine ensure service levels are delivered and workload is processed faster, enabling organisations to accomplish more work in less time and improve cluster ROI. 

Omnia is a deployment tool to configure Dell EMC PowerEdge servers running standard RPM-based Linux OS images into clusters capable of supporting HPC, AI and data analytics workloads. It uses Slurm, Kubernetes and other packages to manage jobs and run diverse workloads on the same converged solution. It is a collection of Ansible playbooks, is open source, and is constantly being extended to enable comprehensive workloads. 
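Because Omnia is distributed as a collection of Ansible playbooks, deploying it follows the standard Ansible workflow. The sketch below is illustrative only: the repository URL is the public Dell project, but the playbook and inventory file names are assumptions rather than verbatim Omnia usage, and the real documentation should be consulted for the exact invocation.

```shell
# Fetch the open-source Omnia playbooks (public Dell repository)
git clone https://github.com/dell/omnia.git
cd omnia

# Run the main playbook against an inventory of PowerEdge nodes.
# "omnia.yml" and "inventory" are illustrative names, not confirmed here.
ansible-playbook omnia.yml -i inventory
```

The appeal of this approach is that the cluster configuration lives in version-controlled playbooks, so the same converged Slurm/Kubernetes setup can be re-applied or extended as nodes are added.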

PBS Professional from Altair is a workload manager designed to improve productivity, optimise utilisation and efficiency, and simplify administration for clusters, clouds and supercomputers – from the biggest HPC workloads to millions of small, high-throughput jobs. PBS Professional automates job scheduling, management, monitoring, and reporting, and it's the trusted solution for complex Top500 systems as well as smaller clusters. PBS Professional delivers a workload simulator that makes it easy to understand job behaviour and the effects of policy changes, plus allocation and budget management capabilities that let you manage budgets across your enterprise. 

Slurm is an open source, fault-tolerant and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. 

As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
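These three functions are visible in even the simplest batch job. The script below is a minimal sketch: the job name, partition name and program are illustrative assumptions, and real values are site-specific. Submitting it with `sbatch` asks Slurm to allocate the requested nodes (function one), launch and monitor the tasks via `srun` (function two), and hold the job in the pending queue if resources are busy (function three).

```shell
#!/bin/bash
# Minimal illustrative Slurm batch script; all values are placeholders.
#SBATCH --job-name=example        # name shown in the queue
#SBATCH --nodes=2                 # request two compute nodes
#SBATCH --ntasks-per-node=4       # four tasks on each node
#SBATCH --time=00:10:00           # wall-clock limit for the allocation
#SBATCH --partition=compute       # partition (queue) name is site-specific

# Launch the parallel job across the allocated nodes
srun ./my_mpi_program
```

Submitted as `sbatch job.sh`, the job receives an ID and runs when the requested resources become free.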

The IBM Spectrum LSF software is designed to distribute work across existing heterogeneous IT resources to create a shared, scalable and fault-tolerant infrastructure that delivers faster, more reliable workload performance and reduces cost. LSF provides a resource management framework that takes your job requirements, finds the best resources to run the job, and monitors its progress. Jobs always run according to host load and site policies. 

Univa Grid Engine is a distributed resource management system for optimising workloads and resources in thousands of data centres, improving performance and boosting productivity and efficiency. Grid Engine helps organisations improve ROI and deliver better results faster by optimising throughput and performance of applications, containers and services while maximising shared compute resources across on-premises, hybrid and cloud infrastructures.