Scientists at Pittsburgh Supercomputing Center (PSC) have patented Zest, software that takes a rapid 'snapshot' of a supercomputer’s calculations as it works.
Zest is designed to speed up the storing of complex calculations as a hedge against system failure, saving precious supercomputing time and slowing calculations down far less than current methods do.
PSC co-inventors of Zest included Paul Nowoczynski, Jason Sommerfield, Nathan Stone, and Jared Yanovich.
In the same way regular computer users save their work as they go, scientists carrying out vast computations such as those required for detailed weather predictions or earthquake science need to periodically store — or 'checkpoint' — the machine’s computational state. In the case of a system malfunction, this allows them to avoid having to start from the beginning after hours or days of work.
The problem, according to J. Ray Scott, director of systems and operations at PSC, is that retrieving and storing these data takes time away from calculation, which is carefully rationed to researchers using highly in-demand supercomputers. In fact, he added, over the last seven years the memory available in the largest machines has increased about 25-fold, while the capacity for retrieving that memory has increased only about four-fold.
'If you have a large job, checkpointing the run often means writing out tens of terabytes of data,' Scott explained. 'This takes a nontrivial amount of time. The whole time you’re doing the checkpoint, you’re not using the computer.'
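The gap Scott describes can be put in rough numbers. The short Python sketch below uses the two figures quoted above (memory up about 25-fold, retrieval capacity up about four-fold) to estimate how much longer it now takes to write out a machine's full memory. The 20-terabyte checkpoint size and the 10 GB/s write rate in the second calculation are illustrative assumptions only, not PSC measurements.

```python
def relative_checkpoint_time(memory_growth: float, bandwidth_growth: float) -> float:
    """Factor by which the time to write out all of memory has changed."""
    return memory_growth / bandwidth_growth


def checkpoint_seconds(size_terabytes: float, write_gb_per_s: float) -> float:
    """Seconds needed to write a checkpoint, in decimal units (1 TB = 1000 GB)."""
    return size_terabytes * 1000.0 / write_gb_per_s


# Figures from the article: 25-fold memory growth, four-fold retrieval growth.
slowdown = relative_checkpoint_time(25.0, 4.0)

# Hypothetical job: 20 TB checkpoint, 10 GB/s aggregate write rate (assumed values).
seconds = checkpoint_seconds(20.0, 10.0)

print(f"A full-memory checkpoint takes about {slowdown:.2f}x as long as before")
print(f"Hypothetical 20 TB checkpoint at 10 GB/s: {seconds / 60:.1f} minutes")
```

Under those assumed values, every checkpoint would idle the machine for roughly half an hour, which is the kind of cost Zest is meant to reduce.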
The Zest software works by tightly managing the supercomputer’s disk drives, continuously routing checkpoint storage to disks that aren’t being used for computation.
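The idea of steering checkpoint data toward disks that are not in use can be illustrated with a toy scheduler. The Python sketch below is not Zest's actual implementation, which the article does not detail. The `Disk` class, the `busy` flag, and the `route_chunks` function are hypothetical names introduced only for illustration, and the "fewest queued chunks" tie-breaker is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Disk:
    name: str
    busy: bool = False  # True while the disk is occupied with other work
    chunks: list = field(default_factory=list)  # checkpoint chunks queued on this disk


def route_chunks(chunks, disks):
    """Send each checkpoint chunk to an idle disk, preferring the least-loaded one."""
    placement = {}
    for chunk in chunks:
        idle = [d for d in disks if not d.busy]
        if not idle:
            raise RuntimeError("no idle disk available")
        # Assumed policy: pick the idle disk with the fewest chunks queued so far.
        target = min(idle, key=lambda d: len(d.chunks))
        target.chunks.append(chunk)
        placement[chunk] = target.name
    return placement


# Three hypothetical disks; disk1 is busy and should receive nothing.
disks = [Disk("disk0"), Disk("disk1", busy=True), Disk("disk2")]
placement = route_chunks(["c0", "c1", "c2", "c3"], disks)
print(placement)
```

In this toy run the four chunks alternate between the two idle disks and the busy disk is skipped entirely. A real system would also have to track disk state as it changes and record where each chunk landed so the checkpoint can be reassembled later.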