Weathering HPC challenges



Tied together within the same community, the fields of weather forecasting and climate modelling may differ in both focus and approach, but the fundamental goal of developing more accurate predictions through the use of higher fidelity modelling and simulation tools remains the same. And critical to this goal is high-performance computing (HPC).

Dr Paul Selwood, manager of HPC optimisation at The Met Office, one of the world’s leading weather forecasting and climate research organisations, emphasised the impact of HPC: ‘Over the past 30 years, we have been able to improve the accuracy of our forecasts by roughly one day per decade. This means that every 10 years, a forecast will have the same level of reliability at three days out that it previously did at two days out,’ he said. ‘HPC provides the means to actually do the complex calculations that our science has enabled, and so every forecast and prediction is reliant on those resources. It’s absolutely central to what we do.’

HPC in weather forecasting is not without its own set of challenges, however, as accurate modelling depends on handling significant amounts of data and fluctuating variables. According to Mike Hawkins, head of the High Performance Computing and Storage Section at the European Centre for Medium-Range Weather Forecasts, the main computing issue in this field has been the scalability of the codes. In order to increase the accuracy of forecasts, higher resolutions and more accurate simulations of physical processes are required. This needs more computing power, and the only practical way to achieve it is through the use of more processors, as Hawkins explained: ‘There hasn’t been a dramatic increase in the speed of processors and so we have to make use of an increasing number of them. There has also been a definite trend on the hardware side of HPC to use different types of processor in a system, so GPGPUs and Intel Xeon Phi processors are both becoming more popular.

‘This presents us with a real challenge because we then have to put a forecast system that’s been developed over a long time and that we know produces very good results, into a new architecture. Not only that, but we have to ensure it will work for the next decade at least.’ Hawkins added that with each new architecture the Centre has to look at the scalability and portability of the codes, and how they can be maintained for long periods of time. This represents a major and long-term development programme for the centre.

Homing in on the hardware

Consistency and reliability are more than buzzwords within weather forecasting, where unique demands are placed on the HPC systems. At The Met Office, based in the UK, two separate systems operate in unison in order to provide operational resilience as data from more than 10 million weather observations per day are used alongside an advanced atmospheric model to create 3,000 tailored forecasts on a daily basis.

‘Most high-performance computers exist in academia where the set-up essentially consists of one large machine being used to run the most expansive leading-edge science possible. And if that one machine goes offline for a week, the impact is often minimal,’ commented Dr Paul Selwood. ‘At The Met Office, however, we need to get multiple forecasts out on time every single day and if any one of those forecasts is late, it’s instantly of no use.’ Selwood continued by saying that forecasting remains an unusual field within HPC in that the machine’s primary function is as an operational system rather than a research system.

In the latter half of 2014, The Met Office announced that it had awarded a contract to Cray for the provision of multiple Cray XC supercomputers and Cray Sonexion storage systems. These systems will replace the pair of IBM Power7 supercomputers currently coming to the end of their operational lives at the centre. The £97 million ($128 million) contract will consist of three phases over several years and will expand Cray’s presence in the global weather and climate community.

The European Centre for Medium-Range Weather Forecasts (ECMWF) has also recently deployed two new Cray systems. Following a competitive tendering exercise, ECMWF selected two Cray XC30 systems, which arrived on site at the end of 2013. The systems came into production last year and have been running the Centre’s operational forecast since September 2014. ‘Changing from one supercomputing system to another is an incredibly large job – each machine needs 1.2 megawatts of power, enough to run 6,000 UK homes, and a water-cooling system delivering 40 litres per second, the equivalent of 150 domestic showers. So even getting to the point of turning it on requires months of work,’ ECMWF’s Mike Hawkins explained.

‘We have to produce an operational weather forecast, to a tight timescale, several times a day, so before switching systems we need to ensure that the forecasts run reliably on the new machine and offer the same standard of results as those run on the previous system. This is quite a time-consuming process that contributes to fairly long periods of parallel running between the old and new systems,’ he added.

The two Cray XC30 systems at ECMWF offer about 3,500 nodes of parallel processing based on two 12-core Intel processors, with a theoretical peak performance of about 1.8 petaflops per machine. The Centre has a benchmark for measuring the performance of the systems, based on a stripped-down version of its code that is easier for both the Centre and vendors to run and test. On that measure, the two systems have a combined sustained performance of 200 teraflops – a threefold increase in overall processing power compared to the previous machines deployed at the Centre.
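The quoted figures can be sanity-checked with a back-of-the-envelope calculation. The node and core counts and the ~1.8 petaflop peak are taken from the article; the 2.7 GHz clock and eight double-precision flops per core per cycle (4-wide AVX, one add plus one multiply per cycle) are assumptions about the Ivy Bridge-era Intel parts, not figures stated by ECMWF.

```python
# Sanity check of the per-machine peak: nodes x cores x clock x flops/cycle.
nodes = 3500
cores_per_node = 2 * 12             # two 12-core processors per node (from the article)
clock_hz = 2.7e9                    # assumed clock speed
flops_per_core_per_cycle = 8        # assumed: 4-wide AVX, one add + one multiply

peak_flops = nodes * cores_per_node * clock_hz * flops_per_core_per_cycle
print(f"theoretical peak per machine: {peak_flops / 1e15:.2f} petaflops")

# The article's sustained benchmark figure is 200 teraflops across both
# machines, i.e. a single-digit percentage of the combined peak -- a
# typical ratio for memory-bound weather codes.
sustained_fraction = 200e12 / (2 * peak_flops)
print(f"sustained/peak: {sustained_fraction:.1%}")
```

Under these assumptions the arithmetic lands on roughly 1.8 petaflops per machine, matching the stated figure.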

Ensuring code fidelity

Hardware is only part of the puzzle. As mentioned earlier in this article, the reliability of the code is of paramount importance: operational runs at the centre must produce results to very tight timeframes, so ensuring that all bugs are removed from the system is critical. With this in mind, both The Met Office and ECMWF chose to deploy a powerful debugger, Allinea DDT. ‘The best recommendation we received for Allinea DDT from the developers using it is that it does what a good debugger is supposed to do, it does it fast and it does it in a very intuitive way,’ said Mike Hawkins at ECMWF. ‘We needed a solution that could not only handle our very complicated code, but that could be used at a wide range of scales. These scales range from people who are developing pieces of code on their desktop machines – or perhaps even on one or two nodes on the supercomputer – right up to a full configuration, which takes up to 100 nodes of our machine. Having the same tool work across that entire spectrum is of definite benefit to us.’

Likewise, the demands at The Met Office meant that any chosen debugger needed to be able to handle complex code development and porting work. ‘Allinea DDT was one of the very few parallel debuggers that could fit in well with the types of workflows we have,’ said Dr Paul Selwood. ‘The right debugger can save everyone quite a bit of time, and that’s what we’re really trying to do with Allinea DDT – improve the time to solution for a development or investigation of a scientific problem.’ He added that within the new system, Allinea DDT has found a number of problems relatively quickly, which in turn has enabled those bugs to be fixed just as quickly.

Allinea DDT ensures that even the most complex multi-threaded or multi-process software problems can be solved quickly and easily. Supercomputers can be debugged from a laptop with native Mac, Windows and Linux clients, and OpenGL array visualisation displays are now performed on the user’s local graphics card, removing the need for laggy X-forwarding or poor-quality VNC displays. In addition to facilitating collaborative debugging, Allinea DDT maintains an automatic, always-on log of debugging activity so that users can easily go back and review the evidence for things they might have missed at the time. Furthermore, users can see process and thread anomalies instantly with the parallel stack display – the scalable view that narrows down problems for any level of concurrency: from a handful of processes and threads to hundreds of thousands.

Allinea DDT also plays a part in code development and revision, as a single command in Allinea DDT will automatically log the values of variables across all processes at each changed section of code. This enables users to track down exactly how and why a particular change introduced problems to the code. This is particularly useful for The Met Office, which is taking the rather unusual step of completely rewriting its codes with new algorithms. ‘It’s a different approach to programming compared to what we have traditionally used,’ Selwood explained. ‘It’s a long-term project and it’s going to take a while before we really see the benefit, but given the direction HPC architectures are heading in with ever increasing demands on parallelism and scalability, it’s something we need to do. Our current codes are 30 years old and showing their age.’

Climate of change

Much like weather forecasting, climate modelling is fundamentally tied to high-performance computing. According to Thomas Ludwig, director of the German Climate Computing Center (DKRZ), climate modelling is not a single-batch-job problem, and the workflows are becoming increasingly complex. Jobs are chained together with complex interdependencies, and programs are composed of several binaries (e.g. atmosphere, ocean and coupler) that run at the same time. The demands are high, as relevant computations need millions of core hours and produce hundreds of terabytes of data.

At DKRZ, the mission is to provide advanced compute power, storage capacity, and service to users, and to that end the Center assists users with their parallel programs (e.g. debugging, performance analysis, etc.) and then later supports them with data handling (e.g. quality management, long-term archival). The ultimate goal is to maximise the amount of scientific insight that can be gained by using DKRZ’s services.

‘The main challenges are the scalability of codes for an increasing number of cores and a rising amount of data that needs to be stored,’ Ludwig explained. ‘These are challenges we are yet to overcome. Scalability is a severe problem that perhaps has no solution – we simply have to continually work on it by analysing and improving code sections, and improving the load balance. As for the amount of data, we are looking into compression and re-computation.’
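One common approach behind the kind of compression work Ludwig mentions is to quantise a field to the precision the science actually needs and then compress it losslessly. The sketch below illustrates the idea on a smooth synthetic field; the sine-wave data, the 3-decimal rounding and the use of zlib are illustrative assumptions, not DKRZ's actual scheme.

```python
# Quantise-then-compress sketch: rounding limits the number of distinct
# bit patterns, which makes lossless compression far more effective.
import math
import struct
import zlib

field = [math.sin(i / 100.0) for i in range(100_000)]   # synthetic smooth field
quantised = [round(x, 3) for x in field]                # keep ~3 decimal places

raw = struct.pack(f"{len(field)}d", *field)             # 64-bit doubles
packed_raw = zlib.compress(raw, level=9)                # lossless only
packed_q = zlib.compress(struct.pack(f"{len(quantised)}d", *quantised), level=9)

print(f"lossless only:   {len(raw) / len(packed_raw):.1f}x")
print(f"quantise + zlib: {len(raw) / len(packed_q):.1f}x")
```

Real climate output is far noisier than a sine wave, so production ratios are more modest, and the acceptable rounding precision is a scientific judgement rather than a technical one.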

DKRZ currently operates an IBM Power6 system with 150 Tflops peak performance, 6 PB of disk and 130 PB of tape, of which more than 40 PB is in use. Installation of a new machine will begin in the first quarter of 2015; it will eventually provide in the region of 3 Pflops peak, 50 PB of disk and 0.5 EB of tape. The Center will go from adding 8 PB per year to tape to approximately 75 PB per year. In terms of software tools, DKRZ has also deployed Allinea DDT as the debugger of choice. Through Allinea DDT, DKRZ staff can save time finding bugs and problems in applications, particularly for complicated user cases. And as HPC architectures evolve, Allinea DDT will continue to be the world’s most scalable and sought-after debugger.
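To put those archive figures in perspective, the projected growth can be expressed as a sustained data rate. The 8 PB/year and ~75 PB/year figures are from the article; everything else is simple unit conversion.

```python
# Convert an annual tape-archive volume (PB/year) into a sustained rate (GB/s).
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # ~3.156e7 seconds

def pb_per_year_to_gb_per_s(pb_per_year: float) -> float:
    """1 PB = 1e6 GB (decimal units, as storage vendors quote them)."""
    return pb_per_year * 1e6 / SECONDS_PER_YEAR

print(f"current rate:   {pb_per_year_to_gb_per_s(8):.2f} GB/s")
print(f"projected rate: {pb_per_year_to_gb_per_s(75):.2f} GB/s")
```

Roughly a quarter of a gigabyte per second today, rising to around 2.4 GB/s, sustained around the clock for the archive alone, which illustrates why Ludwig lists data volume alongside code scalability as a central challenge.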

For more information, please visit www.allinea.com