Managing change in genomics
How is genomics research changing?
Dr James McCafferty: You’ll hear people talk about the move from wet lab to dry lab science. Wet lab is the folk with the test tubes and the white coats, and dry lab generally relates to IT, informatics and research software.
The BBSRC, which is the main funder in this space for the UK, has talked about a move from 80% wet lab to 80% dry lab. That’s the kind of transformation we see in the sciences, particularly the biosciences. This can massively accelerate research.
There are all manner of very sophisticated lab instruments that are mining terabytes of data on the kind of things we study. We look at genetics itself, but we also look at the proteins, we look at the chromosomes, the operation of cells, and [we’ve also taken] quite a significant move towards imaging data as well. The data generated is massive.
How is imaging data being used in genomics?
To give you an example, for the spatial data, if you consider a cell, that cell is within the context of a tissue, so it’s got lots of other cells around it [which provides information], along with the way the cell behaves in that context. So if you’re looking at, let’s say, a cancerous tumour, you want to understand where the cell is, and where it is in relation to other cells.
In addition to that, by looking at what’s happening inside the cell, so looking at its genomics, looking at the transcriptome – the proteins that the cell is generating – you can see what the cell is actually doing. If you capture that information, you can not only work out what type of cell it is, but what the cell is actually doing at any one time. For example, it could be growing; it could be dying; it could be splitting in two.
By combining the image, which is the cell in its context, with the genomics and transcriptomics data, you get a massive data source, allowing scientists to study things like cancerous tumours. But when you are dealing with image data, these are not small files.
We use a lot of sequencing machines. They're divided into two categories. We have what's known as short read, which tends to be about 200 bases, maybe 300, at a single time, but these are really accurate. The other type is long read, and this will be thousands of DNA bases, many thousands of DNA bases, but not so accurate. Again, just to put this in context, the human genome is about 3.2 billion bases. The mistletoe one would be more like 90 billion. So the sequencing machines do their best, and they give you fragments, and those fragments could be from anywhere in the genome. I don't know if you're aware of how genomes are structured, but you have this idea of chromosomes.
Homologous chromosomes will have what we call different alleles across them. So in the human genome, you've got two of each; chromosome 23 is either an X or a Y, but generally it's two of everything, whereas mistletoe has six copies of every chromosome.
Actually working out how the genome is structured is a big challenge. And there are a lot of very clever techniques, including machine learning techniques, to try and work that out. But then actually stitching the fragments together, all the As, Gs, Ts and Cs, is quite a problem. Because let's say you've got a million fragments and you're trying to work out the sequence of them. Each fragment may overlap another fragment by about 50 bases. So for any one fragment, you have to search through millions of others, trying to get the best fit for it.
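To make the scale of that search concrete, here is a minimal Python sketch of the brute-force idea: for one fragment, compare it against every other fragment and keep the one whose start best matches its end. This is an illustration only, not the method used in the work described above. The function names, the toy fragments and the minimum overlap of 3 bases are all assumptions made for the example, and real assemblers use indexed data structures and tolerate sequencing errors rather than requiring exact matches.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that exactly matches a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    # Try the longest possible overlap first and shrink until one matches.
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def best_successor(index: int, fragments: list[str], min_overlap: int = 3):
    """Search every other fragment for the one that best follows fragments[index].

    Returns (index_of_best_fragment, overlap_length), or (None, 0) if nothing fits.
    """
    best_index, best_len = None, 0
    for other, candidate in enumerate(fragments):
        if other == index:
            continue  # a fragment is not compared with itself
        length = overlap_length(fragments[index], candidate, min_overlap)
        if length > best_len:
            best_index, best_len = other, length
    return best_index, best_len


def naive_comparisons(n_fragments: int) -> int:
    """Ordered fragment pairs an all-against-all search has to examine."""
    return n_fragments * (n_fragments - 1)


# Toy fragments (hypothetical), cut from ATGGCGTACGTTAGC and shuffled.
fragments = ["CGTACGTT", "CGTTAGC", "ATGGCGTA"]

for i, fragment in enumerate(fragments):
    print(fragment, "->", best_successor(i, fragments))

# With a million fragments, the brute-force search is about 10**12 comparisons.
print(naive_comparisons(1_000_000))
```

In this toy case the fragment `ATGGCGTA` is followed by `CGTACGTT` with a 4-base overlap, which in turn is followed by `CGTTAGC`. The last line shows why the brute-force approach does not scale: a million fragments means roughly a trillion pairwise checks, before sequencing errors or multiple chromosome copies are even considered.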