Earlham Institute (EI) has created a dedicated HPC cluster for the international data portal ‘CyVerse’ that will provide free, open-source genome analysis for big data research.
As an international collaboration between hardware and middleware engineers at EI and support staff in the Norwich Research Park Computing Infrastructure for Science (NRP CiS) team, University of Arizona, Texas Advanced Computing Centre and Cold Spring Harbor Labs, CyVerse UK provides free, large-scale computing facilities and data storage designed for life scientists.
Lead engineer of the CyVerse UK team, Erik van den Bergh, said: ‘Establishing the first CyVerse node outside of the US represents a vital hub in the UK for data analysis and management. CyVerse UK can provide free HPC facilities for all UK scientists as well as allowing integration of UK apps and pipelines into the wider international CyVerse ecosystem.’
‘CyVerse provides an intuitive web interface, Discovery Environment (DE), where scientists can upload data and run analyses,’ explained van den Bergh. ‘While this resource is hosted in the US, the DE can automatically run tools hosted in the CyVerse UK platform, giving geographical advantages to data access speed, analysis time, and data placement policy.’
CyVerse UK currently hosts two open-source apps and a new virtual machine environment. Gwasser (Ben Ward, Clark Group) is a statistics pipeline that performs genome-wide association studies for single phenotypes. Mikado (Luca Venturini, Swarbreck Group) is a lightweight Python pipeline that identifies the optimal set of transcripts from multiple transcript assemblies. Both apps were used in the analysis and recent publication of the allohexaploid wheat genome, a crop genome that is paramount in tackling the societal challenge of global food security.
The Polymarker pipeline, which creates efficient SNP assays for wheat, will also soon be available to scientists, together with a modified ‘Tuxedo suite’ app developed by the University of Liverpool, which executes a series of pipelines for RNA-seq analysis. CyVerse UK’s robust virtualisation platform will also provide back-end data services and web hosting for the COPO and Grassroots Genomics projects.
Genomics is fast becoming a ‘big data’ science as increasingly commonplace high-throughput technologies support faster, cheaper data analysis. This gives scientists more complex options when analysing data, leading to new breakthroughs as researchers unearth previously hidden patterns in biological data.
However, the scientific community struggles to take full advantage of the data generated because of a lack of computing resources, appropriate support, and technical skills.
To keep up with the latest developments, scientific researchers need to be able to store and access datasets, models, and analysis tools, which may be hosted in different global locations to facilitate international projects – this is where CyVerse can help accelerate scientific discovery.
The CyVerse UK node hardware and software environment has been set up and deployed by the core CyVerse UK team (Erik van den Bergh and Alice Minotto) in the Davey Group, Tim Stitt (Scientific Computing), and NBI Scientific Computing. The CyVerse UK project is a BBSRC-funded collaboration between EI, the University of Warwick, the University of Nottingham, and the University of Liverpool.