
SC25: How Lablup's cluster workload management tool skips the need for specialists

The Lablup booth at SC25

Lablup used SC25 to set out its goal to provide the AI infrastructure operating platform of choice for the hyperscale deep learning and scientific computing sectors. Its flagship product, Backend.AI, aims to achieve three key objectives: to maximise GPU utilisation; to keep pace with the rapidly advancing GPU-based deep learning software stack; and to offer the ultimate level of convenience to users.
 

Jeongkyu Shin, founder and CEO, says: ‘While doing my PhD at university, I found it very challenging to manage the clusters, so I set about developing the best platform I could to solve that problem.
 

‘This led to the founding of Lablup in 2015, with the aim of accelerating computation-based research by having the best tools possible for the management of clusters. However, that initial product had a very low user base; there are a limited number of clusters that require this type of tool.
 

‘We recognised very quickly that AI was a key beneficiary of accelerated computing, but that there were no easy or convenient tools on the market. Everyone was using C++ or Caffe.
 

‘At that time [2015], there were no users; it was very difficult to convince anyone of the importance of accelerated computing power. In 2017, we renamed our flagship product Backend.AI to make it obvious that the platform could be used for any AI computation and made it open source. This opened up new enterprise markets for us. Commercial research organisations have similar infrastructure to universities, they’re just using it in a different way - but they still need AI and HPC management.’


Having become extremely well known in its home country of South Korea, Lablup saw its international expansion plans initially stalled by Covid-19. Those plans are now very much back up to speed, with Lablup present at major trade shows around the world.


In a market that has plenty of cluster workload management tools available, Lablup believes that its software layer can increase GPU utilisation through GPU virtualisation. Backend.AI’s scheduling capabilities are complemented by Lablup’s proprietary container orchestrator, Sokovan, which has been specifically designed to run resource-intensive batch workloads in containerised environments. As an alternative to traditional tools like Slurm, Sokovan offers acceleration-aware, multi-tenant, batch-oriented job scheduling alongside integrated hardware acceleration technologies.
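The article does not describe Sokovan's internals, but accelerator-aware batch scheduling over fractional GPU slots, of the kind made possible by GPU virtualisation, can be sketched roughly as follows. All names, data structures and the first-fit policy here are illustrative assumptions, not Backend.AI's or Sokovan's actual design:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_gpu: float  # fractional GPU slots, enabled by GPU virtualisation
    free_cpu: int

@dataclass
class Job:
    name: str
    gpu: float  # a job may request only a fraction of a GPU
    cpu: int

def schedule(jobs, nodes):
    """First-fit batch scheduler: place each queued job on the first
    node with enough free fractional-GPU and CPU capacity."""
    placements = {}
    for job in jobs:
        for node in nodes:
            if node.free_gpu >= job.gpu and node.free_cpu >= job.cpu:
                node.free_gpu -= job.gpu
                node.free_cpu -= job.cpu
                placements[job.name] = node.name
                break
        else:
            placements[job.name] = None  # no capacity: job stays queued
    return placements

nodes = [Node("gpu01", free_gpu=1.0, free_cpu=16)]
jobs = [Job("train-a", gpu=0.5, cpu=4),
        Job("train-b", gpu=0.5, cpu=4),
        Job("train-c", gpu=0.5, cpu=4)]
print(schedule(jobs, nodes))
# train-a and train-b share the single physical GPU; train-c stays queued
```

The point of the sketch is the fractional `gpu` field: without virtualisation the first job would occupy the whole device, so co-locating two half-GPU jobs on one card is what lifts utilisation.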
 

‘Cluster optimisation and GPU management used to be so complex,’ continues Shin. ‘It wasn’t accessible or manageable by the average scientist or engineer; you needed to be a specialist. With this product, we have made something that automates and optimises everything without the need for coding-level knowledge. Researchers can just get on with their research. As our company slogan says, we “make AI accessible”; we’re also now making AI scalable.’
 

For scientists and engineers, the research benefits can be found in maximising the utilisation of hardware they already own. ‘HPC hardware, GPUs, storage - they’re all really expensive,’ says Shin. ‘So, if we can help cut costs by removing the need for further hardware investment just by making what they already have more efficient, that is going to help make research budgets go much further.’
 

The global expansion now extends to five continents, with flagship customers including the University of Southern California Center for Advanced Research Computing (USC CARC), Samsung Electronics, LG Electronics, Hyundai Mobis, and Korea Telecom, all of which train their AI models and run their services in the Backend.AI environment.

 

How Backend.AI is furthering scientific research

In addition to the key customers listed above, Backend.AI also supports academic computing research across a wide range of institutions and universities, helping them discover new possibilities.

At Sungkyunkwan University (SKKU), whose research output is focused on computer science and engineering, the challenge was to find a platform capable of managing a centralised infrastructure that included more than 40 Nvidia A100 GPUs and high-density compute nodes with numerous CPU cores per node, backed by a storage capacity of approximately 2,000 terabytes.

Moreover, with multiple research groups competing for access to the resource, it was essential for the university to monitor usage, cost allocation and procurement.


Backend.AI was chosen to streamline GPU cluster management; it also helped to automate resource allocation, track usage and process billing based on each researcher’s consumption. The built-in session-based architecture ensured researchers could self-configure their preferred machine-learning frameworks on demand.
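The article does not detail how Backend.AI meters and bills usage, but per-researcher cost allocation of the kind described can be sketched as follows. The record format, resource names and hourly rates are invented purely for illustration:

```python
from collections import defaultdict

# Hypothetical hourly rates per resource unit (illustrative only)
RATES = {"gpu_hours": 2.50, "cpu_hours": 0.05}

def bill(usage_records):
    """Aggregate each researcher's recorded resource consumption
    into a single charge, priced per resource type."""
    totals = defaultdict(float)
    for rec in usage_records:
        for resource, amount in rec["usage"].items():
            totals[rec["user"]] += RATES[resource] * amount
    return dict(totals)

records = [
    {"user": "alice", "usage": {"gpu_hours": 10, "cpu_hours": 40}},
    {"user": "bob",   "usage": {"gpu_hours": 2}},
]
print(bill(records))  # alice is charged 27.0, bob 5.0
```

Charging against metered session records like this, rather than against static quotas, is what lets a shared university cluster attribute costs to the groups actually consuming it.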


At the Korea Institute of Science and Technology (KIST), where research areas include materials, robotics and artificial intelligence, the team in its CJLab was involved in Cas9 research. Cas9, an endonuclease derived from bacterial immune systems, can recognise and cleave specific DNA sequences, making it a key tool in gene editing.


DNA and genetics research is notoriously data intensive, so the computing environment needed to provide rapid, accessible large-scale molecular simulations. Again, with computational requirements varying across research projects, the team at KIST opted for Backend.AI cloud services, powered by Lablup’s GPU resources.


The platform’s container-level GPU virtualisation enables flexible and efficient resource allocation, allowing researchers to deploy application software, such as GROMACS, and terminate instances as needed. This ensures that scientific research outcomes are achieved in a robust and scalable environment. 
