Driving NASA missions

Rupak Biswas, chief of NASA's Advanced Supercomputing Division, explains the challenges he faces running one of the most high-profile establishments in the world

As chief of the NASA Advanced Supercomputing (NAS) Division at Ames Research Center, and project manager for NASA’s High-End Computing Capability Project, I am privileged to lead a distinctive supercomputing facility that supports the computational requirements of NASA’s key mission areas: aeronautics research, exploration systems, space operations, and science.

In my job as division chief, I oversee a full range of integrated high-end computing (HEC) resources and services, and manage a team of R&D scientists, engineers, and support staff. My HEC project management role involves setting high-level objectives for the division, maintaining a high-impact, production computing facility, and coordinating closely with NASA management and partner organisations.

NAS’ vision is to develop and deliver a world-class high-performance computing capability to NASA’s missions, enabling the agency to extend technology, expand knowledge, protect our home planet, and explore the universe. We provide critical computing resources required to support Space Shuttle missions; accelerate simulation and modelling for the design of future crew launch and exploration vehicles; and provide round-the-clock integrated support services to NASA’s scientific and engineering users.

Challenges of a diverse user base

Our diverse user base creates unique challenges and opportunities. NASA’s exploration systems and space operations mission users require HEC resources that can quickly handle large numbers of jobs. These jobs include parameter studies and ensemble calculations necessary to make engineering decisions such as design work for the Space Shuttle external fuel tank, rocket launch platform flame trench, vehicle assembly building safety, and next-generation space transportation vehicles.

Conversely, NASA aeronautics research and science mission users generally require resources that can handle high-fidelity, multi-disciplinary, and relatively large processor-count jobs with minimal time-to-solution. Examples include high-resolution global climate and weather models; accurate simulation of truly unsteady phenomena, such as rotary-wing wakes and vortices; and simulation of massive black hole mergers to predict their gravitational wave signatures.

One of our biggest challenges – spanning both capacity and capability requirements – is to maintain readiness at all times to handle NASA’s mission- and time-critical applications on demand. This scenario covers events such as aerothermal analysis of any damaged tiles found during in-orbit shuttle missions, debris threats during re-entry due to torn insulation on the shuttle fuselage, and real-time hurricane prediction. Our control room and systems staff are prepared and available at all times to ensure users have access to the resources needed and to troubleshoot any issues that may arise during these events.

Integrated environment

One highly rewarding aspect of my work is the promotion of NAS’ unique, integrated services environment. Customisable support – encompassing HEC platforms, high-speed networking, mass data storage, code performance optimisation, scientific visualisation, user services, and modelling and simulation – dramatically enhances our users’ understanding and insight, accelerates science and engineering, improves accuracy, and increases mission safety.

Our systems team manages all aspects of the supercomputers to ensure users get the secure, reliable resources they expect. Together with the application optimisation team, these experts regularly evaluate the latest computing, storage, and software/tools technologies. The optimisation team also specialises in enhancing performance of complex codes so researchers can use the HEC systems more effectively.

NAS visualisation experts develop and apply tools and techniques customised for our users’ problems, helping them view and interact with their results and quickly pinpoint important details in complex datasets. Storage specialists create custom file systems to temporarily store large amounts of data for special projects, and provide training to help users efficiently manage and transfer their data. Our network engineers implement innovative transfer strategies and protocols to vastly reduce application turnaround time for users. Our user services team also monitors all systems, networks, job scheduling, and resource allocation 24x7, and provides support throughout the entire life cycle of users’ projects.

Hardware refresh

By June 2006, the NAS facility’s 10,240-processor SGI Altix supercluster, Columbia (62 Tflop/s peak), which had increased NASA’s HEC capability ten-fold, had reached 80 per cent system utilisation after just 18 months of operation. In 2007, our short-term strategy was to upgrade Columbia to 13,824 processors and to add a 640-processor IBM Power5+ system and a 4,096-core SGI ICE platform, both to augment Columbia and to test, evaluate, and mitigate risk before procuring the next supercomputer.

After a rigorous, formal evaluation and selection process to refresh NAS’ hardware, we installed a 51,200-core SGI Altix ICE system, Pleiades (609 Tflop/s peak), one of the most powerful general-purpose supercomputers built to date. Along with the recent Columbia upgrade and installation of other systems, total HEC capacity at NAS is now 700 Tflop/s – an increase of more than 10x over that of Columbia in 2004.
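As a rough check on these figures, the short Python sketch below compares the quoted 700 Tflop/s total with Columbia's peak. The variable and function names are illustrative only, and the comparison assumes that 'Columbia in 2004' refers to the 62 Tflop/s peak quoted for the original 10,240-processor system.

```python
# Figures as quoted in the article; all names here are illustrative.
COLUMBIA_PEAK_TFLOPS = 62.0    # assumed baseline: Columbia as installed in 2004
PLEIADES_PEAK_TFLOPS = 609.0   # Pleiades peak
TOTAL_NAS_TFLOPS = 700.0       # total HEC capacity now quoted for NAS


def growth_factor(new_tflops: float, old_tflops: float) -> float:
    """Return how many times larger the new capacity is than the old."""
    return new_tflops / old_tflops


total_growth = growth_factor(TOTAL_NAS_TFLOPS, COLUMBIA_PEAK_TFLOPS)
pleiades_share = PLEIADES_PEAK_TFLOPS / TOTAL_NAS_TFLOPS

print(f"Total capacity vs Columbia (2004): {total_growth:.1f}x")
print(f"Pleiades share of total peak: {pleiades_share:.0%}")
```

Under that assumption the ratio comes out at roughly 11x, consistent with the 'more than 10x' claim, and Pleiades accounts for most of the total.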

To round out our capability, a new 128-screen graphics wall enables users to handle the increasing size of their simulation results and complexity of visualisation needs. Called Hyperwall-2, this tiled LCD panel display, with more than 245 million pixels, is one of the largest in the world. The Hyperwall-2 is driven by 128 GPUs and a 1,024-core host system, and provides a peak processing power of 74 Tflop/s when used as a compute engine.

Future developments

This year marked the 25th anniversary of the NAS Division. Originally established as NASA’s bold initiative in simulation-based aerospace vehicle design and stewardship, we have earned an international reputation as a pioneer in the development and application of HEC technologies. While reflecting on our legacy is inspiring, significant challenges and opportunities lie ahead.

Installing and hardening leading-edge production supercomputers is extremely complex. With the advent of multi- and many-core architectures, coupled with several accelerator technology options, obtaining good sustained performance from petascale machines is a considerable challenge. This will require major innovations in programming languages, execution environments, advanced algorithms, and applications development.

The task of tightly integrating visualisation into the traditional computing, storage, and networks environment to provide a more powerful tool for users will also be critical. Network transfer speeds and storage solutions commensurate with the rapidly increasing computational resources and explosive growth in data volume will be just as important in helping users meet their critical mission requirements.

Currently, I am involved in aggressively developing partnerships with other NASA centres, government laboratories, the computer industry, and academia to further our reputation as the ‘go-to’ place for supercomputing and advanced simulations. Concepts such as 10-petaflop systems and one-exabyte data archives – connected to the science and engineering community via one terabit-per-second links – are not too far in the future. Near-real-time aerospace design and a deeper understanding of our planet and the universe are right behind.

About the NAS HEC Facility

The NASA Advanced Supercomputing (NAS) Division at NASA Ames Research Center, located in Silicon Valley at Moffett Field, California, is the primary production supercomputing facility for the US space agency. Just this autumn, NAS completed installation of an SGI ICE system named Pleiades, significantly expanding NASA’s computational infrastructure. Merged with an existing 4,096-core SGI ICE platform, Pleiades consists of 12,800 Intel Xeon quad-core processors (51,200 cores, 100 racks) and has a peak performance of 609 Tflop/s. It achieved 487 Tflop/s on the Linpack benchmark, garnering the third spot on the November 2008 Top500 list. All systems utilise the PBS Pro job scheduler. To support the new SGI platform, the NAS facility underwent a major infrastructure upgrade of electrical power capacity and cooling capability.
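The Pleiades figures above can be cross-checked with a few lines of arithmetic. The Python sketch below is illustrative only: the names are hypothetical, and it assumes the cores are spread evenly across the racks, which the article does not state.

```python
# Pleiades figures as quoted in the article; all names are illustrative.
PROCESSORS = 12_800          # Intel Xeon quad-core processors
CORES_PER_PROCESSOR = 4
RACKS = 100
PEAK_TFLOPS = 609.0          # theoretical peak
LINPACK_TFLOPS = 487.0       # measured on the Linpack benchmark

total_cores = PROCESSORS * CORES_PER_PROCESSOR

# Assumption: cores are distributed evenly across racks.
cores_per_rack = total_cores // RACKS

# Fraction of theoretical peak sustained on Linpack.
linpack_efficiency = LINPACK_TFLOPS / PEAK_TFLOPS

# Peak per core, converted from Tflop/s to Gflop/s.
peak_per_core_gflops = PEAK_TFLOPS * 1000 / total_cores

print(f"Total cores: {total_cores:,}")
print(f"Cores per rack: {cores_per_rack}")
print(f"Linpack efficiency: {linpack_efficiency:.1%}")
print(f"Peak per core: {peak_per_core_gflops:.1f} Gflop/s")
```

The core count matches the 51,200 quoted, and the Linpack result works out to about 80 per cent of theoretical peak.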