Supporting UK HPC

The initial 4 Cabinet Archer2 installation

UK scientists and researchers will soon receive a new supercomputing upgrade, Archer2, which will be hosted by the EPCC, formerly known as the Edinburgh Parallel Computing Centre, as one of the primary HPC resources available to UK researchers.

Archer and its replacement, the aptly named Archer2, are provided by UK Research and Innovation (UKRI) to support HPC research in the UK. The EPCC hosts three major national facilities: Archer, the UK national supercomputing facility provided by UKRI; DiRAC, a national supercomputing facility provided by STFC; and the UK Research Data Facility (UK-RDF), a national data facility provided by UKRI. This sits alongside research into HPC and exascale, education and training, and services for industrial HPC users.

Dr Michèle Weiland, senior research fellow at the EPCC, notes that while the EPCC runs these national computing facilities and provides the associated services, this is just a small part of the work undertaken at the centre. ‘We also do many other things such as research – HPC research, for example – we have a five-year project with Rolls Royce (RR), a strategic partnership prosperity project which is funded partly by RR and partially by the EPCC.

‘We also work with industry, where we provide access to HPC resources and support for industry customers. We help industry users get access to machines but we also help them make the best use of the resources that we have available for them,’ added Weiland.

These industrial users can, as in the case of Rolls Royce, be well versed in the use of HPC, but many of the EPCC’s industrial partners are small to medium-sized enterprises that may be accessing HPC services for the first time.

‘Some of them may have been working on workstations in their office and they know they need more [computing resources] but they do not know where to start,’ stated Weiland. ‘If they were to go into the cloud they might not be ready to do that, in terms of skills, so they want access to computing but they realise they cannot do it without speaking to somebody.

‘We sell them the time on the machine but generally, we give them a first taste of what it is like to use such a machine, we help them get onto the system. Other companies will come to us wanting resources that they do not necessarily get in the cloud or they can potentially get it cheaper with us, and they know what to do, they just want the time,’ added Weiland.

Developing industrial partnerships

One of the largest industrial partnerships that the EPCC is working on is the five-year ‘Strategic Partnership in Computational Science for Advanced Simulation and Modelling of Virtual Systems’ (ASiMoV).

In 2018 the consortium led by Rolls-Royce and EPCC was awarded an Engineering and Physical Sciences Research Council (EPSRC) Prosperity Partnership worth £14.7m to develop the next generation of engineering simulation and modelling techniques, with the aim of developing the world’s first high-fidelity simulation of a complete gas-turbine engine during operation.

ASiMoV will require breakthroughs across many simulation domains, combining mathematics, algorithms, software, security and computer architectures with fundamental engineering and computational science research. This is needed to address a challenge that is beyond the capabilities of today’s state-of-the-art computing resources. The project is led by EPCC and Rolls-Royce, in collaboration with the Universities of Bristol, Cambridge, Oxford and Warwick.

ASiMoV ‘is jointly funded by EPCC and Rolls Royce and the whole point of the project is to simulate the world’s first whole engine simulation. What happens at the moment is that you might model one part of an engine, structure, electromagnetic simulation or the CFD simulation but what we want to do here is to couple all of these things together for a full engine model that couples all of the different aspects of these simulations together.’

‘At the moment when you are trying to certify an engine you build an engine and you want to test a blade-off event then you actually have to generate a blade-off event or a bird-strike for example. Ideally, in the long-term future, you would want to do this computationally so you do not have to build the thing and then destroy it,’ said Weiland.

‘There are some tests that can already be done virtually but the idea is that really you want to be able to do all of this virtually. Virtual certification is the long-term goal that they would be working towards,’ Weiland added. ‘It is computationally expensive and it is also technically challenging getting all these components to work together. The project is basing this research on existing simulation tools that are used by Rolls Royce with further software development to help integrate the different components.’

Managing the UK’s national HPC resources

In March 2020 the EPCC announced that it had been awarded contracts to run the Service Provision and Computational Science and Engineering services for Archer2. In February, the EPCC announced a major upgrade to Cirrus – a tier-2 system available on demand on a pay-per-use basis to industry users. Cirrus is primarily used to solve CFD and FEA simulation and modelling problems in sectors such as automotive, aerospace, energy, oil and gas, general engineering, life sciences and financial services.

The Cirrus system is set up to allow users to run their own codes as well as to access a range of commercial software tools. It is based on an SGI ICE XA system with 280 compute nodes and an Infiniband interconnect. There are 36 cores per node (18 cores per Intel Xeon ‘Broadwell’ processor), providing 10,080 cores in total, and each node has 256GB of RAM.

The EPCC has received £3.5m funding over four years to continue the Cirrus service until early 2024. The EPCC will add 144 NVIDIA V100 GPUs to the system and a 256TB high-performance storage layer to help meet the needs of more demanding data streaming applications. The new capability from Cirrus is being used to prepare users for heterogeneity at exascale and also to support growth in artificial intelligence (AI) and machine learning (ML) workloads.

Archer2 is based on a Cray Shasta system with an estimated peak performance of 28 PFLOP/s. The machine will have 5,848 compute nodes, each with dual AMD EPYC Zen2 (Rome) 64 core CPUs at 2.2GHz, giving 748,544 cores in total and 1.57 PBytes of total system memory.
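
The core counts quoted for both systems follow directly from the node and processor figures, as a quick check shows. The sketch below uses only numbers given in this article; the function name is purely illustrative.

```python
def total_cores(nodes: int, sockets_per_node: int, cores_per_socket: int) -> int:
    """Total core count for a system built from identical nodes."""
    return nodes * sockets_per_node * cores_per_socket

# Archer2: 5,848 nodes, each with two 64-core AMD EPYC CPUs
archer2_cores = total_cores(5848, 2, 64)

# Cirrus: 280 nodes, each with two 18-core Intel Xeon CPUs
cirrus_cores = total_cores(280, 2, 18)

print(f"Archer2: {archer2_cores:,} cores")  # Archer2: 748,544 cores
print(f"Cirrus:  {cirrus_cores:,} cores")   # Cirrus:  10,080 cores
```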

In July 2020 the EPCC and UKRI reported that the first four cabinets had been delivered for Archer2. The system will replace the current Archer system, a Cray XC30, as the UK’s National HPC system. Once fully configured, it is estimated that Archer2 should be capable on average of more than eleven times the science throughput of Archer, based on benchmarks which use five of the most heavily used codes on the current service.

The full Archer2 system will consist of 23 Shasta Mountain cabinets, giving a total of 748,544 cores with 1.57 PBytes of total system memory and an estimated peak performance of 28 PFLOP/s.

Archer2 will replace Archer in two stages. In the first stage, four cabinets will be built and brought into operation, at which point Archer will be decommissioned and the remaining 19 cabinets delivered and built.

‘Some of the work that goes on behind the scenes is porting some of the standard software that people want to log in and use immediately. We provide a lot of software pre-installed and we want to provide it pre-installed and working well,’ said Weiland.

This means that the EPCC has begun testing software and writing documentation to prepare the system for use by researchers once it is fully operational. Weiland noted that EPCC are trying to ‘write the documentation in such a way that when they first log in and look at the documentation they are able to submit their jobs and configure their jobs so that they get good performance.’

‘Old Archer is 24 cores per node but new Archer will be 128 cores per node. That is much more parallelism than people are used to so this will pose some challenges. The CPU architecture is very different as well so people will experience much more in terms of non-uniform memory access (NUMA) effects.’

Weiland also added that ‘each 64 core CPU is really eight chiplets of eight cores. This means the teams need to re-optimise for data locality.’

While running applications should be no more difficult than on any other system, some optimisation will be required to get the best performance from them. ‘I think there will be a trend towards more mixed-mode operation,’ stated Weiland.
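
One way to picture a mixed-mode layout on a 128-core node is to place one MPI rank on each eight-core chiplet and fill that chiplet with OpenMP threads. The sketch below is illustrative only: it assumes cores are numbered contiguously within each chiplet, which is an assumption rather than something stated by the EPCC, and the function name is hypothetical.

```python
def chiplet_layout(cores_per_node=128, cores_per_chiplet=8):
    """Map one MPI rank to each chiplet; return {rank: [core ids]}.

    Assumes contiguous core numbering within a chiplet. This is an
    assumption made for illustration; real numbering depends on the system.
    """
    if cores_per_node % cores_per_chiplet:
        raise ValueError("cores_per_node must be a multiple of cores_per_chiplet")
    ranks = cores_per_node // cores_per_chiplet
    return {
        rank: list(range(rank * cores_per_chiplet, (rank + 1) * cores_per_chiplet))
        for rank in range(ranks)
    }

layout = chiplet_layout()
print(len(layout), "MPI ranks per node,", len(layout[0]), "OpenMP threads each")
print("rank 1 cores:", layout[1])
```

With the defaults this gives 16 ranks of eight threads per node. Keeping each rank's threads on one chiplet is one plausible way to address the data-locality concern raised above; the best split for a given code would still need to be measured.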

‘If your code has OpenMP in addition to MPI then people will want to use it more and I think there may also be a trend towards under-populating the nodes – not using all of the cores that are available and that is because some codes may require more memory per core than is available.

‘If you under populate and use 32 or 64 cores in a node then you will get more memory that way and you will still have a lot of parallelism so you are not going to be as reluctant to under-populate to say every second core because you have so many of them available,’ Weiland concluded.
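
The memory gain from under-populating can be estimated from the figures quoted earlier. The sketch below assumes memory is spread evenly across nodes, so the per-node value is an average derived from the article's totals (1.57 PBytes over 5,848 nodes) and not a published node specification. Decimal units are used throughout.

```python
TOTAL_MEMORY_BYTES = 1.57e15   # 1.57 PBytes, as quoted for the full system
NODES = 5848
CORES_PER_NODE = 128

# Average memory per node in GB, assuming an even split (an assumption).
mem_per_node_gb = TOTAL_MEMORY_BYTES / NODES / 1e9

def memory_per_core_gb(cores_used: int) -> float:
    """Memory available to each active core when only cores_used are populated."""
    if not 1 <= cores_used <= CORES_PER_NODE:
        raise ValueError("cores_used must be between 1 and CORES_PER_NODE")
    return mem_per_node_gb / cores_used

for cores in (128, 64, 32):
    print(f"{cores:3d} cores/node -> {memory_per_core_gb(cores):.1f} GB per core")
```

Under these assumptions a fully populated node offers roughly 2 GB per core, while using 32 cores per node raises that to roughly 8 GB per core and still leaves considerable parallelism on each node.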