The Leibniz Supercomputing Centre (LRZ) is driving the development of new energy-efficient practices for HPC, as Robert Roe discovers
In the second profile of the three HPC centres under the banner of the Gauss Centre for Supercomputing (GCS), the focus is now on the Leibniz Supercomputing Centre (LRZ), located in Garching, Munich. LRZ serves HPC users with a particular application focus on life sciences, environmental sciences, geophysics and astrophysics, and engineering.
However, as Professor Arnt Bode, chairman of the board of directors at LRZ, explains, its role is more than just providing HPC: it also provides IT services to the local academic and research communities.
He said: ‘We are not a single computing centre for one university or research institution. We serve universities, colleges, the Bavarian library, internet visualisation services, we provide backup and storage, and of course we are also running many HPC servers.’
Each GCS centre hosts its own multi-petaflop supercomputer, placing all three institutions among the most powerful computing centres worldwide. With a total of more than 20 petaflops, GCS offers by far the largest and most powerful supercomputing infrastructure in Europe to serve a broad range of science and industrial research activities.
The LRZ provides general IT services for more than 100,000 university customers in Munich and the Bavarian Academy of Sciences and Humanities (BAdW). It also provides a communications infrastructure called the Munich Scientific Network (Münchner Wissenschaftsnetz, MWN); a competence centre for data communication networks; archiving and backup on extensive disk and automated magnetic tape storage; and a technical and scientific high-performance supercomputing centre for all German universities.
The flagship supercomputer at LRZ is SuperMUC, although it is effectively two separate installations with individual entries in the Top500, the twice-yearly list of the most powerful supercomputers based on the Linpack benchmark. As of June 2016, Phase 1 is ranked number 27 on the Top500, while Phase 2 sits at number 28.
SuperMUC consists of 18 ‘thin node’ islands based on Intel Sandy Bridge processor technology (Phase 1), six ‘thin node’ islands based on Intel Haswell processors (Phase 2) and one ‘fat node’ island based on Intel Westmere processors, with each island consisting of at least 8,192 cores. All compute nodes within an individual island are connected via a fully non-blocking Infiniband network. This network consists of FDR10 for the ‘thin nodes’ of Phase 1, FDR14 for the Haswell nodes of Phase 2 and QDR for the ‘fat nodes’.
Bode stressed that the different processing technologies within each phase mean that they are effectively used as separate computing resources. He explained that the ‘fat nodes’ are a consequence of the previous supercomputer at LRZ, based on an SGI Altix shared memory system. ‘We had many applications that had quite large memory requirements, so we started with a system based on Westmere with a rather large main memory.’
Phase 1 also includes a smaller cluster, SuperMIC, consisting of 32 Intel Ivy Bridge nodes, each with two Intel Xeon Phi accelerator cards. However, the many-core Xeon Phi nodes are primarily used for application development and optimisation, which, Bode commented, is why there are relatively few of them.
Combining the two phases of SuperMUC creates a supercomputer totalling more than 241,000 cores with a combined peak performance of more than 6.8 petaflop/s, roughly equivalent to number 11 on the current Top500.
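As a quick sanity check on that total, the island figures quoted earlier give a lower bound on the core count. The sketch below uses only the island counts and the stated minimum of 8,192 cores per island; the variable and function names are illustrative, and the actual per-island core counts are not given in this article.

```python
# Island counts as quoted in the article.
ISLANDS = {
    "sandy_bridge_thin": 18,
    "haswell_thin": 6,
    "westmere_fat": 1,
}

# The article states each island has "at least" this many cores.
MIN_CORES_PER_ISLAND = 8_192


def min_total_cores(islands, min_cores_per_island):
    """Lower bound on total cores: every island at its stated minimum size."""
    return sum(islands.values()) * min_cores_per_island


lower_bound = min_total_cores(ISLANDS, MIN_CORES_PER_ISLAND)
print(lower_bound)  # 204800
```

The bound of 204,800 sits below the quoted figure of more than 241,000 cores, which is consistent with some islands holding more than the minimum.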
Pursuing energy efficiency
In addition to the application focus which characterises the LRZ user community, the centre also focuses heavily on optimising its HPC resources for energy efficiency.
Bode stated: ‘We have a specialised computer building that uses different types of cooling, including direct warm water cooling, without any need for compressors or other things which help us to moderate our electricity bill.’
SuperMUC uses a form of warm water cooling developed by IBM. Active components like processors and memory are directly cooled with water that can have an inlet temperature of up to 40 degrees Celsius. This ‘High-Temperature Liquid Cooling’, together with innovative system software, cuts the energy consumption of the system by up to 40 per cent. These energy savings are increased further as the LRZ heats its buildings using this waste heat energy.
By reducing the number of cooling components and using free cooling, LRZ expects to save several million euros in cooling costs over the five-year lifetime of the SuperMUC system.
Bode also commented that the LRZ is focused on continuing research and development to optimise energy usage in the data centre further. He explained that the procurement process for LRZ is conducted under what he refers to as a ‘competitive dialogue’ which allows LRZ to be part of a co-design process to ensure that compute and cooling infrastructure can be designed and optimised simultaneously.
Perhaps the most crucial point that separates the Gauss centres from other HPC organisations is the focus on a select group of applications. This allows users access to a more specialised support network and allows LRZ to optimise hardware, software and support services for a specific set of users.
Of the main application areas mentioned, Bode commented that life sciences was experiencing the most growth. This is due to a rise in bioinformatics, medical and personalised-medicine applications. ‘We see this field growing rapidly in the number of users and the type of applications,’ said Bode.
This focus on users and energy efficiency drives the procurement of new hardware at LRZ. An example of this is the addition of nodes with Intel Xeon Phi accelerators, which allows users to begin evaluating the technology and to prepare applications for porting to many-core architectures in the future.
Bode said: ‘It is quite clear that in the future some percentage of the system will be devoted to accelerators.’
However, Bode went one step further, explaining that to fully understand the requirements of LRZ users the centre needs to analyse how many applications will be suited to this new technology: ‘To rush out and buy a new supercomputer full of the latest technologies would certainly provide a theoretical performance boost. But, for a centre with hundreds of active users, this would mean months, if not years, where the HPC hardware was not being used efficiently as applications are optimised for this new HPC architecture.
‘We have to admit that we have a history of about 20 years of supercomputing, meaning that we have about 700-1000 application groups of which about 50 per cent are still active users. Let’s say 500 active applications with a huge amount of software which is nearly impossible to transfer to today’s accelerators. We need to find out what the real percentage of applications that can make good use of accelerators is.’
One of the advantages of running SuperMUC as separate phases is that it allows the LRZ to keep a large cluster up and running for the active user community. Bode explained that procurement is based on testing different products with benchmark kernels taken from representative application areas of LRZ users.
This is important because it reduces the amount of time spent porting and fine-tuning applications for new hardware, keeping the user community working and ensuring that HPC resources are used at maximum efficiency.
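The benchmark-driven selection Bode describes can be pictured as scoring each candidate system across a set of representative kernels. The sketch below is only an illustration, not LRZ's actual procurement method: it assumes a weighted geometric mean of per-kernel speedups, and the kernel names, weights, candidate systems and speedup figures are all hypothetical.

```python
import math

# Hypothetical kernel names, loosely mirroring the application areas named in the article.
KERNELS = ["life_sciences", "environmental", "geo_astrophysics", "engineering"]

# Hypothetical weights: here every application area counts equally.
WEIGHTS = {name: 1.0 for name in KERNELS}

# Hypothetical speedups relative to the current system (invented, not measured).
CANDIDATES = {
    "balanced_cpu_system": {name: 1.8 for name in KERNELS},
    "accelerator_heavy_system": {
        "life_sciences": 8.0,      # one kernel ports well to accelerators
        "environmental": 0.8,      # the others run slightly slower
        "geo_astrophysics": 0.8,
        "engineering": 0.8,
    },
}


def weighted_geomean(speedups, weights):
    """Weighted geometric mean, so one large speedup cannot mask many slowdowns."""
    total_weight = sum(weights[k] for k in speedups)
    log_sum = sum(weights[k] * math.log(speedups[k]) for k in speedups)
    return math.exp(log_sum / total_weight)


def rank_candidates(candidates, weights):
    """Return (name, score) pairs, best score first."""
    scores = {name: weighted_geomean(s, weights) for name, s in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_candidates(CANDIDATES, WEIGHTS)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

With these invented numbers the balanced system ranks first, which mirrors the concern in the text: a large gain on a few accelerator-friendly codes does not help a centre whose remaining applications cannot yet use the new hardware.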
Another important aspect of the procurement of new systems is that SuperMUC is split into different phases. If the LRZ were to shut down the entire cluster, then it would be left with huge numbers of users with no HPC resources to continue their research. Splitting the HPC resource into separate phases allows the LRZ to keep a large cluster up and running while new hardware is installed and optimised.
This will also be true of the next phase of upgrades at the centre, which is planned for 2018/2019. Bode stated that ‘this will be a very large upgrade that will be overlapped with phase 2 of SuperMUC but phase 1, the Sandy Bridge-based system, will be decommissioned and replaced.’ The procurement process will begin in 2018, with installation sometime in 2019.
‘We always want to make sure that we have at least one very large system running which is why we have this overlapping procedure of installation and running of systems,’ explained Bode.
This new procurement will also allow the LRZ to further strengthen its commitment to energy-efficient HPC, as Bode explains: ‘Today some of our supercomputing hardware cannot be cooled with direct liquid cooling.’
The next phase of procurement will remove the oldest Sandy Bridge processors, but it will also replace some of the infrastructure. Bode said the installation would also include liquid-cooled electrical components that will allow the centre to reduce waste heat even further, so that up to 90 per cent of waste heat can be removed through liquid cooling.
The focus on providing energy-efficient HPC drives not only the procurement of new hardware at LRZ but also the development of software and tools that help optimise applications to make the most efficient use of resources.
Bode stressed that in addition to hardware upgrades, the LRZ would also investigate ‘middleware and tools that will help to run applications in an energy efficient way using voltage changes, reduced clock frequencies.’ These tools, combined with research into how best to optimise the hardware for a given set of applications, allow the LRZ to develop highly power-efficient HPC systems, an impressive feat when considering the overall computational power of SuperMUC.
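To illustrate why reduced clock frequencies can lower the energy an application consumes, the following toy model estimates energy-to-solution at several frequencies. It is a sketch under stated assumptions, not a description of LRZ's tools: dynamic power is assumed to scale with the cube of frequency (voltage lowered roughly in step with the clock), static power is assumed constant, and only the CPU-bound share of the runtime is assumed to stretch as the clock drops. All parameter values are hypothetical.

```python
def energy_to_solution(freq_ghz, *, f_ref=2.7, t_ref=100.0, cpu_bound=0.6,
                       p_static=60.0, p_dynamic_ref=120.0):
    """Estimated energy in joules for one run at the given clock frequency.

    All defaults are hypothetical: f_ref is the reference clock in GHz, t_ref the
    runtime in seconds at f_ref, cpu_bound the share of runtime that scales with
    the clock, and p_static / p_dynamic_ref the node power in watts at f_ref.
    """
    # Runtime: the CPU-bound share stretches as the clock drops; the rest does not.
    runtime = t_ref * (cpu_bound * f_ref / freq_ghz + (1.0 - cpu_bound))
    # Power: constant static part plus a dynamic part assumed to scale with f^3.
    power = p_static + p_dynamic_ref * (freq_ghz / f_ref) ** 3
    return runtime * power


def best_frequency(freqs, **model_params):
    """Pick the frequency with the lowest estimated energy-to-solution."""
    return min(freqs, key=lambda f: energy_to_solution(f, **model_params))


# Hypothetical set of available clock steps in GHz.
FREQS = [1.2, 1.5, 1.8, 2.1, 2.4, 2.7]

for f in FREQS:
    print(f"{f:.1f} GHz: {energy_to_solution(f):.0f} J")
print("lowest energy at", best_frequency(FREQS), "GHz")
```

Under these made-up parameters the minimum falls at an intermediate frequency rather than the lowest one, because static power keeps accumulating while the run takes longer. In this simplified picture, that is why the best setting depends on how CPU-bound each application is.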