Evolving expectations at the Earlham Institute

Neil Hall, director of the Earlham Institute, and Dr Tim Stitt, head of scientific computing, explain the Earlham Institute's expanding role, providing not only a national capability for UK genomics but also capabilities for biotechnological and agricultural research.


Could you give a brief history of the institute – how and why was it founded?

The institute was established by the Biotechnology and Biological Sciences Research Council (BBSRC) in partnership with the East of England Development Agency (EEDA), Norfolk County Council, Norwich City Council, South Norfolk Council and the Greater Norwich Development Partnership. It cost £13.5 million and was built by Morgan Sindall. It was officially opened on 3 July 2009 by John Sulston, winner of the 2002 Nobel Prize in Physiology or Medicine. We now have more than 130 staff including genome scientists, technologists, and bioinformaticians, with strategic funding from the BBSRC for our core research programmes and to provide a National Capability in Genomics. Through our life science research, we apply genomic technologies to address the grand challenges of maintaining food security, lifelong health and well-being, energy security and living with environmental change.

What were the original goals of the institute, and what are they today?

The primary goal of the institute when it was founded was to establish a capability in genomics for researchers in the biosciences. Before TGAC (The Genome Analysis Centre, as the institute was then known), the UK’s only large genome sequencing centre was the Sanger Institute, but its research portfolio was very much focussed on biomedical research, whereas we focus more on biotechnological and agricultural research. We now have our own internally driven research programme centred around evolution, domestication and improvement in plant and animal species, as well as host-pathogen interactions.

Why was the name changed recently?

In part, it is because the title ‘The Genome Analysis Centre’ was no longer descriptive of what we do. Although we do boast one of the most advanced genome labs in Europe, this is only a subset of the type of work we carry out. As well as genome data, we use many other data types, from the chemical components of cells to plant architecture or even field-level measurements of crops taken using drones.

Our flagship project is our work on one of the most complex genomes, wheat: we are working to produce a more complete and accurate assembly of the bread wheat genome. As wheat is one of the three major crop plants of global importance, the predicted impact of a high-quality wheat genome resource on crop improvement will be profound, as genomics provides a framework for new breeding methods that are substantially faster and more effective. As well as preserving and improving plant species, we also work on conservation genomics in fish, focusing on threatened native tilapia. These will be identified and preserved for the benefit and enhancement of the aquaculture industry globally, leading to enhanced research and monitoring of fish farming. The genome sequence information will be publicly available to future researchers, benefiting the wider academic community interested in research themes as diverse as fish health and evolutionary biology.

Could you tell us a little about your science strategy?

At the Earlham Institute, we believe that ‘data driven’ approaches to understanding biological systems will lead to rapid advances in agricultural science and give the UK a leading edge. Our aim is to integrate diverse datasets to identify which parts of the genome confer different traits. This approach, known as ‘multi-scale integrative biology’, is where the Earlham Institute is heading, and at the heart of this is advanced computational science to analyse Big Data.

In future, we will continue to develop expertise and methodologies for integrating large, complex and diverse datasets to increase our understanding of biology. While biological methodologies will change, our reliance on computational methods will remain a central pillar of our approach. As with the physical sciences, as our understanding improves, we are better able to describe nature using basic rules, and DNA sequence is a great example of that.

We hope that as we gain a better understanding of biological systems, it will enable us to work across biological systems from humans to plants and from microbes to animals.

HPC at EI – Dr Tim Stitt, head of scientific computing

The Earlham Institute (EI) deploys some of the largest compute and storage e-Infrastructure for Life Sciences in Europe. EI researchers have access to multiple compute platforms supporting both large-memory tasks and high-throughput downstream analysis workloads.

Large genome assemblies can be accommodated on an SGI UV 2000 system boasting 2,560 Intel Xeon cores, 32 Xeon Phi processors and 20TB of shared RAM (the largest system of its type worldwide for life science applications), as well as two next-generation SGI UV 300 systems, each with 12TB of shared RAM, 256 cores and 32TB of Intel NVMe SSD storage.

High-throughput analysis tasks can be targeted to a 4,000-core HPC cluster comprising 90 compute nodes, each with 128GB or 256GB of RAM. Furthermore, EI deploys the first DRAGEN Bio-IT (FPGA) processor in the UK for high-speed mapping and variant-calling.

EI data storage is underpinned by over eight petabytes of high-performance scale-out storage utilising industry-standard data mirroring and file snapshotting practices, allowing EI researchers to routinely create and manage large datasets (on the order of terabytes) in a robust and resilient manner. Collaborative data-sharing and analysis platforms developed at EI, such as CyVerse-UK and UK-SeeD, leverage EI’s compute and storage infrastructure to allow national and international researchers to share large genomic datasets and perform data analysis without requiring access to their own high-performance compute (HPC) and storage resources. Furthermore, the Scientific Computing Group at EI provides dedicated support to researchers, enabling them to utilise the HPC resources effectively and efficiently. The physical compute and storage infrastructure is managed and maintained by skilled engineers within the centralised Computing infrastructure for Science (CiS) team, which serves all the Norwich Bioscience Institutes.

As well as supporting an operational HPC service, the EI Scientific Computing Group also undertakes research into novel computing technologies for bioscience. Recent projects include the development of an optical processor for sequence alignment that will be over 90 per cent more energy-efficient than traditional processor technology, and the application of Kx real-time data analytics technology (developed within the global capital markets domain) in combination with machine learning to accelerate the modelling of crop growth. We are also working on the application of novel quantum computing techniques to extremely difficult computing problems in bioinformatics.

Earlham Institute
