James Reinders is a parallel programming and HPC expert with more than 27 years’ experience working for Intel until his retirement in 2017. In this article, Reinders gives his take on the use of roofline estimation as a tool for code optimisation in HPC.
Roofline analysis is a technique that brings realism to optimisation targets. It tells us when we have tuned our current code as far as it can go (evolution), which may uncover the unsettling fact that we need a new algorithm (revolution).
As a long-time teacher of optimisation techniques, I can confidently say that Roofline analysis is a must-have for anyone optimising for performance. This has not always been the case but, as I will explain, it is now an important technique to draw upon when doing performance optimisation.
When I mention Roofline analysis, I am often asked ‘Hasn’t that been around a while?’, usually followed by ‘What’s new?’
Excellent questions. The answers revolve around two factors:
(1) complexities (latency hiding through parallelism and memory hierarchies) in optimising for today’s processing architectures – including CPUs, GPUs, and accelerators of all kinds,
(2) new tools, based on new research, to help us deal with these complexities.
In the face of increasingly complicated systems, Roofline Analysis provides us with a step-by-step method to ascertain whether an algorithm has reached the end of its ability to provide more performance through continued optimisation work.
Complexities in optimising for today’s systems
Today we are faced with a great diversity of compute devices, ranging from Intel Xeon Scalable processors and GPUs to more application-specific accelerators enabled by FPGA and ASIC technologies.
It is not the diversity that demands Roofline analysis; it is the complexity of the architectures of the individual devices. Specifically, it is their complex abilities to hide latencies, and the sophisticated parallel compute capabilities and multilevel memory subsystems that play critical roles in such latency hiding. Years ago, performance optimisation was successful if we could reduce the number of instructions being executed. Such optimisations were nearly always rewarded by performance improvements. That is not the case today. Fortunately, Roofline analysis addresses these complications in optimisation work.
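To make the point about instruction counts concrete, here is a minimal sketch of a best-case bound on kernel run time, assuming computation and memory traffic overlap perfectly. The machine figures, the kernel figures, and the function name are all hypothetical, chosen only for illustration; they do not describe any real processor.

```python
def lower_bound_seconds(flops, bytes_moved, peak_flops_per_s, bandwidth_bytes_per_s):
    """Best-case run time when compute and memory traffic overlap perfectly."""
    compute_time = flops / peak_flops_per_s
    memory_time = bytes_moved / bandwidth_bytes_per_s
    # The slower of the two resources sets the floor on run time.
    return max(compute_time, memory_time)

# Hypothetical machine (assumed numbers, not measurements).
PEAK = 1000e9       # floating-point operations per second
BANDWIDTH = 100e9   # bytes per second from main memory

# Hypothetical kernel: 2e9 operations over 24e9 bytes of memory traffic.
before = lower_bound_seconds(2e9, 24e9, PEAK, BANDWIDTH)
# "Optimised" version: half the operations, same memory traffic.
after = lower_bound_seconds(1e9, 24e9, PEAK, BANDWIDTH)

print(before, after)
```

Under these assumptions both versions have the same bound, because memory traffic, not the operation count, sets the floor. Halving the work done by the processor buys nothing until the memory side is addressed.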
New tools, new research, how to cope
The technique of Roofline analysis has recently seen a surge in study, resulting in some interesting papers and tutorials. Throughput optimisation techniques tend to be effective everywhere. Tuning work guided by Roofline analysis on an Intel Xeon Scalable processor-based server, where the development environments are rich and mature, will therefore lead to optimisations that help on other compute devices too. We can choose whichever environment we are most comfortable with, and wherever a tool happens to run best, to get the most important throughput tuning done.
When roofline confirms our fears (but reduces futile optimisation attempts)
Roofline analysis can hint that we should find a new algorithm in two ways:
(1) It reveals that the arithmetic intensity (AI), the number of operations performed per byte of data moved, is low, so the machine’s peak capabilities are not well utilised. If optimising the current approach proves impossible in critical parts of our application, we may need to find an algorithm that can get closer to peak performance.
(2) It reveals that AI is high, but performance falls short of what we need, want, or believe should be possible. If we are already close to a machine’s peak performance, only an algorithmic change can give us better performance on that machine.
If this seems a bit circular, you are right. When we have low AI, we seek to raise it, through algorithmic change if optimisation is not possible. However we reach high AI, we then face the need for an algorithm change to go any further.
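The two cases can be sketched with the basic roofline bound: attainable performance is the smaller of the machine’s peak compute rate and AI multiplied by memory bandwidth. This is a minimal sketch under stated assumptions, not how any particular tool implements the analysis. The machine figures, the 80 per cent threshold for ‘close to the roof’, and the function names are hypothetical.

```python
def roofline_gflops(ai, peak_gflops, bandwidth_gbs):
    """Attainable GFLOP/s for a kernel with arithmetic intensity `ai` (FLOP/byte)."""
    return min(peak_gflops, ai * bandwidth_gbs)

def diagnose(ai, measured_gflops, peak_gflops, bandwidth_gbs, near=0.8):
    """Classify a kernel; `near` is an assumed cut-off for 'close to the roof'."""
    roof = roofline_gflops(ai, peak_gflops, bandwidth_gbs)
    ridge = peak_gflops / bandwidth_gbs  # AI at which the two limits meet
    if measured_gflops < near * roof:
        return "below roof"      # room left to optimise the current code
    if ai < ridge:
        return "memory roof"     # case (1): raise AI, by a new algorithm if needed
    return "compute roof"        # case (2): only a new algorithm can go faster

# Hypothetical machine: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
PEAK, BW = 1000.0, 100.0
print(diagnose(0.5, 20.0, PEAK, BW))    # low AI, well under its 50 GFLOP/s roof
print(diagnose(0.5, 45.0, PEAK, BW))    # low AI, close to its 50 GFLOP/s roof
print(diagnose(20.0, 900.0, PEAK, BW))  # high AI, close to machine peak
```

A kernel reported as ‘below roof’ still has headroom in its current form; the other two results correspond to the two situations in which further tuning of the existing approach cannot pay off.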
Being told we need to rewrite using a new algorithm is not necessarily welcome news. The good news about the Roofline analysis technique is that it clarifies for us whether these needs are truly present. Knowing that can prevent a lot of time vainly spent seeking optimisations that simply do not exist. An example of this is ‘reducing cache misses’. Specific ‘stall’ event monitoring counters (emon counters) added to Intel processors (with Intel Xeon Scalable processors offering the greatest support in quantity and diversity) allow tools to find the cache misses that are actually causing delays (stalls), and therefore lower AI.
Roofline analysis can incorporate stall information into its technique, helping us avoid chasing optimisations that do not improve performance. I cannot overstate how valuable this is!
Intel automated much of the tedious work in doing a Roofline analysis
Intel has implemented Roofline analysis as a feature of its Intel Advisor tool (free versions available) so we can explore our own applications and get concrete feedback on application-specific bottlenecks.
This sophisticated and easy-to-use instrumentation relies on the strong support for stall accounting present in Intel processors, with the broadest capabilities being in the Intel Xeon Scalable processors found in servers and supercomputers.
I highly recommend the variety of reading material available from Berkeley Lab and for the Intel Advisor tool, including some excellent tutorials on its usage.
James Reinders is a parallel programming and HPC expert with more than 27 years’ experience working for Intel until his retirement in 2017. Reinders is the author of eight books in the HPC field in addition to numerous papers and blogs.