Realising the potential of AI and HPC
The convergence of artificial intelligence (AI) and high-performance computing (HPC) promises to transform the scientific computing landscape, with its potential to enable research groups to tackle challenges that would otherwise have been beyond their capabilities.
Over the past decade, we have begun to see AI penetrate nearly all industries and scientific disciplines, from the headline-grabbing integration in autonomous vehicles and the protein-folding predictions of AlphaFold, to the more quietly heralded work managing traffic flows, creating more efficient jet engines, and removing the noise from astronomical images. This is, undoubtedly, only the beginning, especially as AI is increasingly combined with the processing power of HPC, which is needed to handle both the large data sets that are being made available (or that may need to be simulated) and the complexity of deep-learning models.
There are, however, many challenges to be overcome. These include challenges inherent to AI, challenges in integrating AI with HPC, and challenges in transferring knowledge to the people who need it.
The black box of AI
That AI is a black box technology, with the scientist unable to know what is going on inside, is one of the challenges that still faces the application of AI. As Michèle Weiland, senior research fellow at EPCC, the supercomputer centre at the University of Edinburgh, put it: 'If you are dealing with a black box, and you are getting the right answer, you don’t necessarily know the reason why you are getting the right answer, you may just be lucky.'
The problem of discerning why you are getting the answer you are getting can be exacerbated as the barriers to technologies are lowered and they start to be used by those who do not necessarily recognise the inherent limitations of AI.
Andreas Lintermann of the Jülich Supercomputing Centre and Coordinator of CoE RAISE (The European Center of Excellence in Exascale Computing - Research on AI- and Simulation-Based Engineering at Exascale) explained: 'It’s a new field for many domain scientists. They have to find out what models can be used, what architectures for neural networks make sense, and what kind and amount of data is suited. There’s a danger that you simply plug-in a model that you believe works for your problem. It might generate some output that looks very promising, but in the end it might not be as accurate as you think. These models hide very complex processes from the user and there are always the questions – do the models always do what the user, or the scientist, expects them to do, and do they generate an output with a sufficient accuracy to reliably reason from?'
It is important that scientists don’t blindly trust these models; rather, there is a need for 'explainable AI', with a human who can explain and verify the AI outputs. Vikram Saletore, principal engineer, Super Computer Group at Intel Corporation, provides the example of interpreting the inference of a CT scan that might identify a tumour. Explainable AI incorporates a radiologist who uses their expert interpretation to ensure that the identified region is indeed a tumour.
He said: 'Scientists are deservedly sceptical of AI replacing modelling and simulation. To be useful, this technology must deliver demonstrably correct results to prevent the introduction of non-physical artifacts for a user’s simulation, and provide benefits such as faster inference performance.'
Not only is AI a black box; the pace of development and the workflows are also very different from what many scientists are used to. Weiland explained: 'A lot of these machine learning tools are quite a long way removed from what computational scientists use in their day-to-day life. Scientists often use monolithic applications that they know inside out, and they are often quite old – 10, 20 or even 30 years old – and they have grown over time. Whereas these new applications have been very dynamic over recent years, so getting to grips with the different technology and the fast-changing landscape is quite a challenge.'
In the case of HPC, there are also significant differences in the way people are expected to interact with the systems, with command prompts to a computer in a data centre rather than graphical user interfaces on their own PC. There is also a need to rethink how the code is processed; it’s not just the same code run faster. As the scientist moves from small data to big data it becomes necessary to parallelise the code, so data can be analysed in parallel rather than sequentially, which could otherwise take years.
The challenge for the scientist is to break down the task into smaller parts without suffering from the overhead of the communication between the different parts, which inevitably can impact some use cases more than others, as Lintermann explained: 'The problem is split into sub-problems, where each of the processes that live in a high-performance computer works on its own sub-problem.
'For example, if you want to simulate the fluid mechanics in a complete room, the problem is usually too big to be computed on a single processor and the room is split in such a way that each processor takes care of the computation in only a small fraction of the complete volume of the room. If you now, at one point in the room, initiate a pressure wave that travels through the room, it also needs to travel across these different sub-volumes with the consequence that the information from one processor needs to be transported to another processor. This communication is always a bottleneck when scaling from a small to a large number of processors.
'For a fixed problem size, you want to make sure that if you use more resources that you achieve a result faster. Usually if you have the same size problem and use more processors, you would continually decrease the time to solution. However, as the volume sizes and the corresponding number of elements per sub-volume decrease, the communication share increases, and at some point, using more processors does not make the computation faster anymore. This is usually when the scaling ends.'
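The scaling behaviour Lintermann describes can be illustrated with a toy cost model. The sketch below is not taken from any CoE RAISE code: the function names and the cost constants (`t_cell`, `t_face`, `t_latency`) are assumptions chosen purely for illustration. It splits a fixed number of cells into cubic sub-volumes, charges compute time per cell and communication time per boundary cell, and reports how the communication share grows as more processes are used.

```python
def step_time(cells, procs, t_cell=1e-8, t_face=1e-7, t_latency=1e-5):
    """Return (compute, communication) time of one step in a toy model.

    All cost constants are illustrative assumptions, not measurements.
    """
    cells_per_proc = cells / procs
    compute = t_cell * cells_per_proc
    if procs == 1:
        return compute, 0.0  # a single process exchanges nothing
    # A cubic sub-volume of V cells has 6 faces of roughly V**(2/3) cells.
    halo_cells = 6 * cells_per_proc ** (2 / 3)
    return compute, t_latency + t_face * halo_cells


def scaling_table(cells, proc_counts):
    """Speedup, parallel efficiency and communication share per process count."""
    baseline = sum(step_time(cells, 1))
    rows = []
    for procs in proc_counts:
        compute, comm = step_time(cells, procs)
        total = compute + comm
        speedup = baseline / total
        rows.append({
            "procs": procs,
            "speedup": speedup,
            "efficiency": speedup / procs,
            "comm_share": comm / total,
        })
    return rows


if __name__ == "__main__":
    for row in scaling_table(256 ** 3, [1, 8, 64, 512, 4096]):
        print(f"{row['procs']:5d} procs  speedup {row['speedup']:8.1f}  "
              f"efficiency {row['efficiency']:.2f}  "
              f"communication share {row['comm_share']:.2f}")
```

In this simplified model the sub-volumes shrink faster than their surfaces, so each additional process buys less speedup. Real codes add further costs, such as global reductions and load imbalance, that this sketch ignores.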
This task of optimisation is not one that the domain scientist is likely to have spent much time thinking about previously, but as Lintermann pointed out, it doesn’t just relate to simulation, but must also be explored for the optimisation of AI, where ideally models can be trained faster by using more processors.
As Lintermann went on to explain, simulations and AI can be combined in full-loop implementations. A simulation can produce a lot of data, and these data can be used, possibly already at simulation run time, to train artificial neural networks, which can then be plugged directly into the original simulation as surrogates. The same loop can be run iteratively to continuously optimise not only the simulation but also the surrogate model. CoE RAISE is working on this, for example, in the context of hydrogen combustion and its integration in aircraft engines.
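The shape of such a loop can be sketched in a few lines. What follows is a deliberately simplified illustration, not the CoE RAISE workflow. The `expensive_simulation` function is a hypothetical stand-in for a real solver, and a piecewise-linear interpolant takes the place of the neural network surrogate so that the example needs no external libraries. Each round produces new simulation data, retrains the surrogate on everything seen so far, and records how closely the surrogate reproduces the simulation.

```python
import bisect
import math


def expensive_simulation(x):
    """Hypothetical stand-in for a costly simulation run."""
    return math.sin(x) + 0.5 * x


class Surrogate:
    """Piecewise-linear interpolant (a stand-in for a neural network)."""

    def __init__(self):
        self.samples = {}
        self.xs = []

    def train(self, xs, ys):
        # Merge new samples with those from earlier rounds.
        self.samples.update(zip(xs, ys))
        self.xs = sorted(self.samples)

    def predict(self, x):
        # Locate the pair of neighbouring samples that brackets x.
        i = bisect.bisect_left(self.xs, x)
        i = min(max(i, 1), len(self.xs) - 1)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.samples[x0], self.samples[x1]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def run_loop(rounds=3, lo=0.0, hi=math.pi):
    """Alternate between producing simulation data and retraining."""
    surrogate = Surrogate()
    test_points = [lo + (hi - lo) * k / 49 for k in range(50)]
    errors = []
    for r in range(rounds):
        intervals = 4 * 2 ** r  # sample more densely every round
        xs = [lo + (hi - lo) * k / intervals for k in range(intervals + 1)]
        ys = [expensive_simulation(x) for x in xs]  # simulation produces data
        surrogate.train(xs, ys)                     # data trains the surrogate
        errors.append(max(abs(surrogate.predict(x) - expensive_simulation(x))
                          for x in test_points))
    return errors


if __name__ == "__main__":
    for round_index, error in enumerate(run_loop()):
        print(f"round {round_index}: max surrogate error {error:.4f}")
```

In a production setting the surrogate would replace part of the simulation itself rather than merely being compared with it; the sketch only shows the data flow between the two.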
At the same time, the HPC hardware is also changing, and this is partly because of the influence of AI. As Weiland put it: 'The AI and HPC convergence is driving changes in hardware designs for HPC systems. Traditionally, HPC systems are designed for monolithic scientific computing applications, numerical codes that don’t do a vast amount of high-throughput IO - they predominantly read and write large files. HPC systems don't necessarily have the required flexibility with the workloads. AI has slightly different requirements on the hardware, and system designs will more and more reflect these changes, such as providing different types of file systems for different types of applications. Systems designs will have to vary a little bit, change and adapt to accommodate all the workloads equally. That's beginning to happen, but it'll actually happen more as AI and HPC converge more.'
Changes in hardware are increasingly possible because HPC systems are becoming more and more modular, consisting of different components. Some components are more suited for some tasks than others, and there are always new components coming along. Saletore spoke of how some of Intel’s latest technologies are helping with both AI and HPC.
Saletore explained: 'To meet the needs of our customers, the latest Intel technology is raising the bar in AI and HPC computing. XPUs provide an AI-enabled general-purpose hardware platform that incorporates high bandwidth memory, Xe-HPC accelerators provide a large number of cores and massive parallelism coupled with high bandwidth memory and interconnect – all of which are important to fast time to solution in a distributed HPC environment.
'General-purpose, next-generation Intel Xeon Scalable processors (code name Sapphire Rapids) are now architected with new instructions for AI including Advanced Matrix Extensions (AMX) and Tile matrix MULtiply (TMUL) ISA extensions. Some of these Intel processors incorporate High Bandwidth Memory (HBM). Intel Optane persistent memory with Distributed Asynchronous Object Storage (DAOS) have revolutionised storage performance to address the data handling issues.'
As the acceleration of hardware capabilities continues, it becomes increasingly important that software doesn’t become locked in to the hardware. Intel’s oneAPI ecosystem is an important part of that, enabling scientists and organisations to quickly pivot to the fastest and most cost-effective hardware platforms. As Saletore put it: 'OneAPI is an important initiative for HPC as it is open standards-based and is delivering performant portability.'
Overcoming the challenges
The key to overcoming some of the challenges faced is bringing experts in the different domains together. Only then can science and industry get access to the facilities and technologies that they need, and use them in a manner that ensures that scientific progress is achieved as quickly as possible.
There is an increasing focus at a European level on easing the process by which researchers and scientists can make use of HPC and AI. For example, EuroCC, a network of 33 international partners around Europe, is designed to identify knowledge gaps and HPC competencies to help researchers gain access to HPC. Similarly, the CoE RAISE project, funded by the European Commission under the Horizon 2020 Framework Programme, is tasked with developing scalable AI technologies towards Exascale with use cases from engineering and natural sciences. Examples of use cases include the optimisation of the surface of aerofoils to reduce drag and increase lift, identifying potential porosities or weaknesses in metal additive manufacturing, and windfarm layout optimisation.
Lintermann explained the importance of getting this collaboration right, ensuring sufficient understanding between the different experts, and some of the steps taken in CoE RAISE: 'There has to be a general understanding - what is the problem for a specific domain and what do the domain scientists want to solve? We found out that the creation of so-called Factsheets is very promising. The computer scientists, the AI experts, the HPC experts, and the domain experts gather to jointly draft the problem set up in an understandable way and to carve out what would be necessary to solve the problem. This starts a discussion on how people can work together, how the AI people can contribute to solving the problem, and what HPC can do in reducing the time to solution. This all happens in an interaction room, which is an online live meeting room, where the people meet and have mural boards at their disposal to draw and add any information, images, and explanations. This has proven to be very helpful in equalizing the language among the disciplines and helps people to communicate.'
The potential of combining AI with HPC is phenomenal, although of course that doesn’t mean it is suited to every task in scientific computing, and part of the challenge is to identify where it can be best put to work.
As Weiland explained with reference to numerical computing: 'In the most promising approaches, people are looking at those parts of the computation that can be replaced by machine learning, that don't necessarily influence the outcome, but can accelerate the getting to the solution by providing a better initial guess. Techniques where you effectively replace an entire numerical model or scientific computing model with AI, that's not ever going to be a solution, because AI is largely a black box. You can replace sections of scientific computing with machine learning, bits that matter for the performance, but don't matter for the solution.'
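The idea of a better initial guess can be made concrete with a small iterative solver. The sketch below is an illustration under stated assumptions, not an example from EPCC. It solves a small linear system by Jacobi iteration twice, once from a zero start and once from a guess close to the answer. The `predicted_guess` values are hand-picked and merely stand in for the output of a trained model, which is not shown. Both runs converge to the same solution, but the better start needs fewer iterations.

```python
def jacobi(a, b, x0, tol=1e-10, max_iter=10_000):
    """Solve a x = b by Jacobi iteration; return (solution, iterations)."""
    n = len(b)
    x = list(x0)
    for iteration in range(1, max_iter + 1):
        x_new = [
            (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)
        ]
        # Stop once successive iterates agree to within the tolerance.
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new, iteration
        x = x_new
    raise RuntimeError("Jacobi iteration did not converge")


# A small diagonally dominant system, for which Jacobi converges.
A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
B = [2.0, 4.0, 10.0]

# Cold start: no prior knowledge of the solution.
cold_solution, cold_iters = jacobi(A, B, [0.0, 0.0, 0.0])

# Warm start: hypothetical values standing in for a model's prediction.
predicted_guess = [0.9, 2.1, 2.9]
warm_solution, warm_iters = jacobi(A, B, predicted_guess)

if __name__ == "__main__":
    print(f"cold start: {cold_iters} iterations -> {cold_solution}")
    print(f"warm start: {warm_iters} iterations -> {warm_solution}")
```

The numerical method still determines the answer, so the guess affects only how long the solver takes, which matches the distinction Weiland draws between performance and solution.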
Finding those places where AI and HPC can work together, and ensuring that the combination works as well as possible, requires successful collaboration between AI experts, HPC experts and domain experts, and there are now many success stories in the field. Saletore said: 'Success stories and ready accessibility to advanced technology have raised awareness among scientists about how AI augmentation and even replacement can expand what computer models can do. This promotes new thinking and the freedom to think big as scientists pursue the creation of more accurate models and explainable AI models. Scientists are realising they can do research that was not possible before.'
Projects such as EuroCC and CoE RAISE are playing an important part in this across Europe.