Arthur 'Buddy' Bland, project director for the Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory, describes the rise from terascale to petascale computing and the road to exascale
The high-performance computing world moves very quickly. In early 2004, the world’s most powerful computer was the Japanese Earth Simulator, clocking in with a peak performance of an amazing 40 teraflops (one teraflop is 10¹² floating point operations, or calculations, per second). At the same time, the most powerful computer in the United States was the ASCI-Q machine at Los Alamos National Laboratory, with a peak speed of 20 teraflops. Oak Ridge National Laboratory (ORNL) was far behind, with an IBM Power3 system of just over three teraflops. The capability of the Earth Simulator woke the world’s scientific computing community up to what was possible. In February 2004, the US Department of Energy’s (DOE) Office of Advanced Scientific Computing Research (ASCR) issued a call for proposals to build a leadership-class scientific computing capability in support of the broad Office of Science research programmes, as well as other capability-limited, federally funded computational science activities. ORNL laid out a roadmap to petascale computational science and was chosen as the first Leadership Computing Facility in May 2004.
Our roadmap had a series of staged increases in power and scale, starting at three teraflops and reaching a petaflop (10¹⁵ calculations per second) in 2008 – a thousand-fold increase in computational capability in just four years. The staged increases allowed the scientific teams to learn and adapt to the scale of these new science capabilities. In 2008, the Oak Ridge Leadership Computing Facility’s Cray XT5 system, called Jaguar, surpassed a petaflop and became the first system in the world to run a full science application at a sustained petaflop. And the story continues. In 2008, buoyed by the success of our approach to reaching petascale, ORNL laid out a roadmap to DOE to reach exascale computational science by 2018 – another thousand-fold increase, but this time over 10 years. Today, just a year later, ORNL houses two petascale computer systems. Jaguar has continued to grow and, at more than two petaflops, is now the world’s most powerful computer system. In 2009, the National Science Foundation’s Cray XT5, called Kraken, became the first academic petaflop computer system. Kraken is run by the University of Tennessee’s National Institute for Computational Sciences, located at ORNL. These systems are in great demand by researchers the world over to develop renewable energy, understand climate change, and address some of the world’s most challenging problems.
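As a rough sketch of what these roadmap milestones imply, the thousand-fold jumps can be converted into average annual growth rates. The milestone figures come from the text above; the derived per-year rates are illustrative, assuming smooth exponential growth between milestones:

```python
def annual_growth_factor(total_factor: float, years: float) -> float:
    """Average per-year multiplier needed to reach total_factor over years."""
    return total_factor ** (1.0 / years)

# ~3 TF in 2004 to 1 PF (1,000 TF) in 2008: a gain of roughly 333x in 4 years
terascale_to_petascale = annual_growth_factor(1000 / 3, 4)   # ~4.27x per year

# 1 PF in 2008 to a planned 1 EF in 2018: a thousand-fold gain over 10 years
petascale_to_exascale = annual_growth_factor(1000, 10)       # ~2.00x per year

print(f"2004-2008: {terascale_to_petascale:.2f}x per year")
print(f"2008-2018: {petascale_to_exascale:.2f}x per year")
```

The calculation makes the pacing concrete: the terascale-to-petascale sprint demanded better than a four-fold improvement every year, while the ten-year exascale plan requires a steadier, but still ambitious, doubling each year.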
We have focused on delivering the most powerful and balanced computer for the world’s most important science problems. The Cray XT computer was designed from the ground up for scientific computing at the largest scales. Balance is critical: Jaguar has 300 terabytes of memory (16 gigabytes per node), more than 10 petabytes of disk capacity, and more than 240 gigabytes per second of disk bandwidth. In every one of these categories, Jaguar is a world-leading resource. But hardware alone does not solve problems. With almost 225,000 AMD Opteron compute cores, scaling applications to run on a significant fraction of Jaguar requires careful planning and close partnership with application developers, library developers, and programming-tool creators. We have partnered with all of these groups to deliver a series of increasingly powerful computer systems, allowing time for each to scale the systems, applications, libraries, and tools from a few hundred processors in 2004 to almost 225,000 processors today.
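A minimal sketch of the balance ratios implied by the figures above. The constants are the specifications quoted in the text; the two-petaflop peak is an approximation ("more than two petaflops"), and the derived metrics, along with the assumed 12-core node size, are illustrative rather than official numbers:

```python
TERA = 1e12
GIGA = 1e9

peak_flops     = 2e15          # "more than two petaflops" peak (approximate)
memory_bytes   = 300 * TERA    # 300 terabytes of memory
disk_bandwidth = 240 * GIGA    # more than 240 GB/s of disk bandwidth
cores          = 225_000       # almost 225,000 Opteron compute cores

# Memory balance: bytes of memory available per peak flop
bytes_per_flop = memory_bytes / peak_flops           # 0.15 bytes/flop

# Memory per core; matches the quoted 16 GB per node
# if a node holds 12 cores (an assumption, not stated in the text)
gb_memory_per_core = memory_bytes / cores / GIGA     # ~1.33 GB/core

# Lower bound on the time to write all of memory to disk,
# a rough checkpoint-style measure of I/O balance
checkpoint_seconds = memory_bytes / disk_bandwidth   # 1,250 s, about 21 minutes

print(f"{bytes_per_flop:.2f} bytes of memory per peak flop")
print(f"{gb_memory_per_core:.2f} GB of memory per core")
print(f"{checkpoint_seconds:.0f} s to write all memory to disk")
```

Ratios like these, rather than any single headline number, are what "balance" means in practice: memory, disk capacity, and disk bandwidth all sized so that none becomes the bottleneck for large science runs.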
There have been several keys to the science successes we have seen. The most direct has been our Computational Liaison programme, which embeds computational scientists from ORNL’s Leadership Computing Facility in the science teams that are awarded time on Jaguar. These liaisons have been critical in working with the science teams to scale their codes effectively to run in capability mode on the systems. A second key has been the ASCR ‘Joule metric’ programme, through which ORNL’s Leadership Computing Facility has worked closely with application teams to double the performance of many codes through the application of libraries and new algorithms. This has resulted in dozens of codes that can now scale to tens and hundreds of thousands of processors on today’s leadership systems. Finally, the DOE SciDAC programme has been the most successful computational science programme of the last decade, providing the resources to develop an entire generation of new science codes, libraries, and tools that run effectively on modern computer architectures. All of these investments were coordinated by ASCR and were critical to the ORNL Leadership Computing Facility’s success.
The computational needs of the scientific community show no signs of slowing down. Modelling and simulation needs will only increase as we continue to tackle the most difficult problems in energy, health, the environment, and national security. For example, today’s best climate models can only predict changes at very large scales. To plan for national infrastructure (water, electricity), we need to be able to predict future climates at regional and local scales. By some estimates, this will take more than one million times the computing capability available on today’s largest systems. Reducing the carbon in our atmosphere will require supercomputers to develop new technologies in nearly every part of the world’s energy infrastructure, from superconducting transmission wires, to new batteries and ultracapacitors for energy storage, to new types of materials that make automobiles both lighter and stronger. These are just a few of the nation’s most important problems that depend on high-performance computing.
As ORNL’s Leadership Computing Facility drives towards exascale computing over the next decade, we will maintain our commitment to the scientific user community, continuing to provide balanced, scalable, usable computer systems of increasing power. From a technological standpoint, individual processors are not getting any faster. Our roadmap therefore includes architectures that combine conventional processors with accelerators to achieve higher application performance with less power consumption. Today’s accelerators are providing exciting results on many applications. In the future, we expect to see accelerators that are even more powerful, fault-tolerant, and energy efficient. While technology specifics are impossible to predict a decade in advance, the trends in power consumption, application parallelism, and cost lead us to begin moving our application base to heterogeneous parallel computing. And much as we saw in the 1980s with vector processors, we expect that the transformations needed to support these new systems will also benefit applications on conventional systems.
We’ve been successful over the past five years in our contributions to scientific discovery as a world-class leadership facility. Building on this success we look to an exciting future of science breakthroughs enabled by the resources we provide.