Managing change in genomics

How is genomics research changing?

Dr James McCafferty: You’ll hear people talk about the move from wet lab to dry lab science. Wet lab is the folk with the test tubes and the white coats, and dry lab generally relates to IT, informatics and research software.

The BBSRC, which is the main funder in this space for the UK, has talked about a move from 80% wet lab to 80% dry lab. That's the kind of transformation we see in the sciences, particularly the biosciences. This can massively accelerate research.

There are all manner of very sophisticated lab instruments that are generating terabytes of data on the kinds of things we study. We look at genetics itself, but we also look at the proteins, the chromosomes and the operation of cells, and [we've also taken] quite a significant move towards imaging data as well. The data generated is massive.

How is imaging data being used in genomics?

To give you an example, for the spatial data: if you consider a cell, that cell sits within the context of a tissue, so it has lots of other cells around it [which provides information], and the way the cell behaves depends on that context. So if you're looking at, let's say, a cancerous tumour, you want to understand where the cell is, and where it is in relation to other cells.

In addition to that, by looking at what's happening inside the cell – looking at its genomics, looking at the transcriptome, the RNA transcripts that tell you which proteins the cell is generating – you can see what the cell is actually doing. If you capture that information, you can work out not only what type of cell it is, but what the cell is actually doing at any one time. For example, it could be growing; it could be dying; it could be splitting in two.

By combining the image, which shows the cell in its context, with the genomics and transcriptomics data, you get a massive data source that allows scientists to study things like cancerous tumours. But when you are dealing with image data, these are not small files.
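To make that concrete, here is a minimal sketch of the idea described above – not Sanger's actual pipeline, and running entirely on synthetic data: cluster cells on their expression profiles to assign a putative state, then use each cell's spatial coordinates to ask what its neighbours are doing.

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_cells, n_genes = 1000, 500
expr = rng.poisson(lam=2.0, size=(n_cells, n_genes)).astype(float)  # per-cell transcript counts (synthetic)
coords = rng.uniform(0, 1000, size=(n_cells, 2))                    # x/y position within the tissue image

# Cluster cells on their expression profiles to assign putative states...
embedding = PCA(n_components=20).fit_transform(
    StandardScaler().fit_transform(np.log1p(expr)))
state = KMeans(n_clusters=8, n_init=10).fit_predict(embedding)

# ...then, for each cell, look at what its spatial neighbours are doing.
_, neighbours = cKDTree(coords).query(coords, k=6)  # each cell plus its 5 nearest neighbours
for cell in range(3):
    print(f"cell {cell}: state {state[cell]}, neighbour states {state[neighbours[cell][1:]]}")
```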

How does this impact your ability to support scientists at Sanger?

This is changing the demands of our IT systems. We have a massive data explosion here. We need to support huge amounts of storage and be able to interact with very large datasets. We need to pull that data out, manipulate it and put it back into storage.

For our workloads, it's about shifting data. In my previous role at UCL, the high performance computing (HPC) environment was geared up for astrophysics and materials science. It's interesting to compare that kind of environment with Sanger, because although we are nowhere near as big as UCL, we store more than double the amount of data they do for their research.

This is because our systems are geared up for data-intensive research. So rather than HPC, it’s more like high throughput computing (HTC).
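As an illustration of that high-throughput pattern – lots of independent per-sample tasks rather than one tightly coupled computation – here is a minimal Python sketch; the sample file names and the processing step are hypothetical placeholders.

```python
from concurrent.futures import ProcessPoolExecutor

def process_sample(sample_path: str) -> str:
    # Stand-in for a real per-sample step (alignment, variant calling, QC...).
    return f"{sample_path}: done"

# Hypothetical list of per-sample input files.
samples = [f"sample_{i:04d}.cram" for i in range(100)]

if __name__ == "__main__":
    # Every task is independent, so throughput scales with worker count rather
    # than with inter-node communication speed (the classic HPC bottleneck).
    with ProcessPoolExecutor(max_workers=8) as pool:
        for result in pool.map(process_sample, samples):
            print(result)
```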

Just to give you a sense of scale, we have about 90 petabytes of genomics data here at the Sanger, and our sister organisation on the same campus stores about 10 times that amount. Admittedly, that is not just genomics data – they store lots of other things – but it gives you a sense of scale. Part of my job is to make sure we're properly efficient with what we generate and what we store. But in addition to that, part of my job is about making sure we get the best value, the greatest insights and the highest scientific output from that data.

There's an increasing dependence on GPUs as well. That's largely driven by the adoption of machine learning and deep learning techniques in the biosciences. So here at the Sanger, we have recently invested quite heavily in Nvidia GPUs.

Do you see this demand continuing to grow? Do you think Sanger will invest further in GPU technology?

I think that's the way it's going. To be honest, there have always been a lot of machine learning-type applications in informatics, just by the very nature of what it does. So there are quite a lot of random forest or support vector machine applications. Neural networks, on the other hand, are much newer in the genomics world but are gaining traction steadily. I think we will see a lot more of that in the future.
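For instance, a minimal sketch of that classical machine learning pattern – a random forest classifying samples from a genotype matrix – where both the data and the phenotype rule are synthetic inventions for illustration only:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(500, 200))      # 500 samples x 200 variants (0/1/2 allele counts)
y = (X[:, :5].sum(axis=1) > 5).astype(int)   # toy phenotype driven by the first five variants

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```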

Part of my role is not just about getting the kit there ready for people to use; it's helping to make sure they're getting the best out of it – so training and support, advice and things like that.

And going back to the point I made at the start [about almost] reshaping what we're doing for IT within the Sanger, we are now required to be more proactive with the science community. This means bringing new technologies and new tools to the table to help them with what they are trying to achieve. Having the best tools and the best IT means our researchers can be weeks or months ahead of other researchers, particularly on some of the big scientific challenges.

Sanger operates at scale. We've got the second-biggest genomic sequencing fleet in the world. It's a massive operation, and Sanger scientists can draw upon that when doing massive studies. Other research institutes still study one or two genes at a time, whereas we do entire genomes in one go.

Sanger is unique, but when you think about it, our work goes all the way from biological samples through to data and insights. Having a digital infrastructure that carries that all the way through is really important. One recent example – one of the things we're working on at the moment – is equipping our researchers with the digital tools they need.

The hope is that our researchers will use Jupyter Notebooks for developing and trialling techniques. This is absolutely an IT artefact, and it's exciting to see how it stitches into tools like electronic lab notebooks (ELNs) and laboratory information management systems (LIMS) to impact biological research. That's absolutely the way of the future, because every single one of our scientists will be an informatician. They'll know how to write scripts, they'll know how to write software, and they'll know how to do analytics. That is the direction of travel.

Dr James McCafferty is Chief Information Officer at the Wellcome Sanger Institute. His role covers IT strategy, delivery and operations to support the goals of the institute, encompassing research IT, research data, research software/informatics, enterprise IT and information security. Dr McCafferty previously worked as Chief Information Officer and Director of Research IT at University College London.