Challenges and solutions in debugging code on the Intel Xeon Phi coprocessor
With the launch of the Intel Xeon Phi coprocessor, developers have been presented with many exciting opportunities to take advantage of many-core processor technology. Because the Intel Xeon Phi coprocessor shares many architectural features and much of the development tool chain with multi-core Intel Xeon processors, it is fairly simple to port a program to the new coprocessor. However, taking full advantage of the new power offered by the Intel Xeon Phi coprocessor requires expressing a level of parallelism that demands re-thinking of algorithms. This is the exact challenge that the National Institute for Computational Sciences (NICS) at the University of Tennessee, USA, is working towards overcoming with its Beacon Project.
The project is a research initiative funded by the US National Science Foundation and the University of Tennessee to explore the impact of emerging computer architectures on computational science and engineering. Currently, nine teams associated with the Beacon Project are exploring the impact of the Intel Xeon Phi coprocessor on scientific codes and libraries, with approximately two dozen more open-call applicants about to begin work. Some of the programs that are being optimised as part of the project include magneto-hydrodynamics, plasma physics, cosmology, chemistry, quantum chromodynamics, and bioinformatics applications.
The Beacon system, which received the number-one ranking on the November 2012 Green500 list, offers access to 48 compute nodes and six I/O nodes joined by an FDR InfiniBand interconnect, providing 56 Gb/s of bi-directional bandwidth. Each compute node is equipped with two Intel Xeon E5-2670 processors, four Intel Xeon Phi coprocessors 5110P, 256 GB of RAM, and 960 GB of SSD storage. Beacon provides 768 conventional cores and 11,520 accelerator cores – meaning the system offers 210 Tflops of combined computational performance, 12 TB of system memory, 1.5 TB of coprocessor memory, and more than 73 TB of SSD storage in aggregate.
The typical strategy for developers participating in the Beacon Project is to first port and then optimise code for the Intel Xeon Phi coprocessor. An example of this is the Boltzmann-BGK Solver, which uses a kinetic model for computational fluid dynamics. With hundreds of thousands of state variables that need to be solved at each grid point, the BGK-model Boltzmann equation can directly benefit from vectorisation and acceleration on the Intel Xeon Phi coprocessor. As part of its optimisation process for this solver, the team used the early-access version of TotalView to debug its native Intel Xeon Phi code, drilling down to the thread level to investigate issues that came up during porting. The team discovered that the OpenMP version was producing wrong answers, a subtle problem that it tracked down by using TotalView to analyse the operations occurring on each OpenMP thread. Being able to compare the data from each thread with the ultimate result clarified what was happening with the code, and allowed the team to work with the vendor to get the problem resolved. After porting the code, the team was able to quickly identify and correct initial performance problems, achieving a speed-up on the Intel Xeon Phi coprocessor relative to the Intel Xeon processor.
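The idea of comparing each thread's data with the final result can be illustrated with a minimal C sketch. This is not the Boltzmann-BGK Solver's code: the function names sum_with_partials and current_thread, and the simple summation itself, are assumptions made purely for illustration. Each OpenMP thread stores its contribution in its own slot, so the per-thread values can be inspected alongside the total.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Thread index inside a parallel region; 0 when built without OpenMP. */
static int current_thread(void) {
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

/*
 * Hypothetical helper: sum x[0..n) and record each thread's contribution
 * in partial[0..cap). The num_threads clause limits the team to at most
 * cap threads, so every thread has its own slot in partial.
 */
double sum_with_partials(const double *x, long n, double *partial, int cap) {
    assert(x != NULL && partial != NULL && n >= 0 && cap > 0);

    /* Slots of threads that never run stay at zero. */
    for (int t = 0; t < cap; ++t) {
        partial[t] = 0.0;
    }

    #pragma omp parallel num_threads(cap)
    {
        int tid = current_thread();
        double local = 0.0;

        /* The iterations are divided among the threads of the team. */
        #pragma omp for
        for (long i = 0; i < n; ++i) {
            local += x[i];
        }

        /* Each thread writes only its own slot, so no locking is needed. */
        partial[tid] = local;
    }

    /* Combine the per-thread values in a fixed order. */
    double total = 0.0;
    for (int t = 0; t < cap; ++t) {
        total += partial[t];
    }
    return total;
}
```

Calling sum_with_partials(x, n, partial, cap) leaves the per-thread contributions in partial, where they can be printed or examined in a debugger next to the returned total. Built without OpenMP support, the pragmas are ignored and the whole sum lands in partial[0].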
Another example of a successful port to the Intel Xeon Phi coprocessor is the Gyro tokamak plasma simulation code from General Atomics. Gyro numerically simulates tokamak plasma microturbulence and computes the turbulent radial transport of particles and energy in tokamak plasmas, solving 5-D coupled time-dependent nonlinear gyrokinetic Maxwell equations with gyrokinetic ions and electrons. The team porting Gyro faced a problem whose source turned out to be different from what was initially suspected. The code, which uses MPI to express multi-node parallelism and OpenMP to express parallelism across threads within a multi-core compute node, was first ported by adding the ‘-mmic’ compiler flag.
Using TotalView, the team tracked down the issue that was causing some runs to complete and others to fail in a strange way. In the many-core environment of the Intel Xeon Phi coprocessor, the number of threads created per MPI process was increased from single digits to 50 or 100. The work distribution scheme relied on an assumption that was no longer valid at these thread counts, and as a result work was not being distributed to most of the threads. Had this not been fixed, performance would have been limited because many cores would have been underutilised. Moreover, in this case, the mistake also had a cascading effect that ultimately caused the MPI processes to run out of memory. Fixing the issue also made the program more balanced, which resulted in better performance.
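The article does not say which assumption in Gyro's work distribution scheme broke, so the following C sketch is only a generic illustration and not Gyro's code. It shows one way to split a range of work items into contiguous blocks that makes no assumption about how the thread count relates to the amount of work; the names range_t and block_range are hypothetical.

```c
#include <assert.h>

/* Half-open range [begin, end) of work items owned by one thread. */
typedef struct {
    long begin;
    long end;
} range_t;

/*
 * Hypothetical helper: balanced block partition of n_items among
 * n_threads. The first (n_items % n_threads) threads get one extra
 * item. When there are more threads than items, the surplus threads
 * receive an empty range instead of an invalid one.
 */
range_t block_range(long n_items, int n_threads, int tid) {
    assert(n_items >= 0 && n_threads > 0);
    assert(tid >= 0 && tid < n_threads);

    long base = n_items / n_threads;   /* items every thread receives */
    long extra = n_items % n_threads;  /* leftover items to spread out */
    long t = tid;

    range_t r;
    /* Threads before this one hold t * base items plus min(t, extra). */
    r.begin = t * base + (t < extra ? t : extra);
    r.end = r.begin + base + (t < extra ? 1 : 0);
    return r;
}
```

With 10 items and 4 threads, the threads receive 3, 3, 2 and 2 items. With 3 items and 100 threads, the first three threads receive one item each and the rest receive none, so a run with 50 or 100 threads per MPI process is handled the same way as a run with a handful.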
The Beacon Project has experienced initial success with porting and optimising code for the Intel Xeon Phi coprocessor. The optimisation process exposed the need for advanced tools that help scientists debug and optimise parallel applications so that those applications can support hundreds of threads per node. Applications need to have large numbers of threads, or they will be unable to use more than a fraction of the Intel Xeon Phi coprocessor’s power. The biggest challenge is that there are still vast numbers of MPI-based applications that will need to be ported to MPI/OpenMP hybrid parallelism.
When this is undertaken, the structure of the code is changed in fundamental ways, and these changes often break the code. TotalView has proved critical in alleviating these growing pains by making it easier and quicker to analyse and resolve defects uncovered or created during the porting process.