A grant from the US National Science Foundation (NSF), which includes $6 million for deployment plus additional funding for operations, has been awarded to support the development of a transformational data-intensive system for the open science community. The Texas Advanced Computing Center (TACC) at The University of Texas at Austin and its partners have announced that they will design, build and deploy the system, dubbed ‘Wrangler’, and that it is scheduled to enter production in January 2015.
‘Wrangler is designed from the ground up for emerging and existing applications in data-intensive science,’ said Dan Stanzione, the lead principal investigator (PI) for the project and deputy director at TACC. ‘Wrangler will be one of the highest performance data analysis systems ever deployed, and will be the most replicated, secure storage for the national open science community.
‘This combination of unmatched transaction performance, massive bandwidth and capacity, and full data replication far exceeds what is currently available to the open science community.’
Wrangler features a novel primary storage tier based on NAND Flash memory, which will enable reading and writing data at up to one terabyte per second and executing up to 275 million IOPS (input/output operations per second). The technology at the core of Wrangler will be provided by Dell and DSSD, and the system will support the Hadoop software framework and a full ecosystem of analytics methods and technologies for big data. In addition, the 10 petabyte disk storage system of Wrangler will be fully replicated to Indiana University, a partner in the project, providing data access reliability and security.
In addition to hosting part of the system, Indiana University will participate in operations and training, and will help users optimise network performance between their home institutions and Wrangler. The Computation Institute (CI), a joint initiative of the University of Chicago and Argonne National Laboratory, will integrate its Globus Online service within the Wrangler project to make transferring data to and from Wrangler simple and fast.
Wrangler’s performance and storage capabilities for big data applications will be enhanced through tight integration with TACC’s Stampede supercomputer and with NSF Extreme Science and Engineering Discovery Environment (XSEDE) resources around the USA. Immediately upon deployment, Wrangler will be part of the broader XSEDE ecosystem. Integration with Globus Online, the official data transfer mechanism for XSEDE, will provide for rapid, reliable and secure data exchange with other elements of the national cyberinfrastructure.
‘Wrangler will meet critical needs for managing, moving and analysing massive and diverse data sets in disciplines including energy, weather and the global climate, basic biology, health, and medicine, and will also support citizen science from astronomy to marine biology to zoology,’ said Craig Stewart, co-PI and executive director of the Pervasive Technology Institute at Indiana University. ‘We anticipate Wrangler will support more than 1,000 researchers and students every year, and will serve as a model for smaller-scale data systems on campuses that will improve US research capabilities.’