Predicting infection

Sepsis is a life-threatening condition caused by an uncontrolled response to infection. While advances in life support have reduced the mortality of this condition, researchers’ understanding of sepsis is incomplete, and there is an urgent need for new treatment alternatives. Dr Cano-Gamez’s research focuses on using high-throughput data to build reliable methods for precision medicine in sepsis, particularly gene expression – how genes are turned on or off under different conditions in the immune system.
Why is it so important to develop a better understanding of sepsis?
Eddie Cano-Gamez: The problem has several intertwined aspects. One is that sepsis is what you could call a syndromic disease. It’s not defined based on anything you can specifically measure; it is a diagnosis made by doctors in hospital using a combination of several measurements. [One of] the criteria is that you have to have evidence of an infection. That could be where doctors have detected bacteria in your blood or a viral infection, or something else tells them you are possibly infected. Then your organs start to fail. You can have respiratory failure, kidney failure, or a very sharp drop in your blood pressure. The symptoms are very non-specific.
That means the disease is not very homogeneous. There are a lot of different types of patients and types of infections that come under the same umbrella term of ‘sepsis’. That is one of the reasons people believe so many clinical trials have failed to identify new drugs: it’s possible that what we call sepsis is actually a combination of different – sometimes very different – groups of patients. Some might respond to treatment, some might not. In fact, some might actually be harmed by the treatment. Having them all together in a single group can make it very difficult to determine whether a treatment is working.
How does your research help to solve this problem?
Researchers have known since 2016 that patients can be split into subgroups. We look at the activity of a set of genes – in this case, between seven and 19 genes – and we know we can use that information to classify patients. But until now, there wasn’t a reproducible method to do that. This was a big issue, because lots of other groups and people studying sepsis wanted to classify patients and know which group they fall into, or how much they are at risk. There wasn’t an easy way to do that, because the original study hadn’t really been done with a predictive angle in mind. It was more of a discovery study, so the only way to find out was to re-analyse the patients and the data.
My study aimed to make it as straightforward as possible for people studying sepsis, or infections in general, to get a set of patients and immediately classify them into subgroups and get a sense of their risk.
I developed two machine learning methods. One predicts which group a patient belongs to: a healthy-looking person, which is what we call group three; a sepsis patient at low risk, which is group two; or a sepsis patient at high risk, which is group one. That’s the first model, which predicts three outcomes, depending on which group of patients we are talking about.
The second method came from realising that these two ends of the spectrum – healthy versus severe sepsis – are not actually completely separate. They tend to form a progression. We used that to our advantage to derive a risk score. This score goes from zero to one. If it’s zero, it means you’re basically healthy, or your immune system looks like a healthy immune system. The closer the score gets to one, the more severely ill you look in terms of your immune function. I trained models to predict this score.
In terms of the specific machine learning models we use, everything we present is a ‘random forest’ model. We took a set of patients and built what looks like a flowchart or decision tree. We ask, for this patient, is the first gene on or off, and how active is it? Then we subdivide the patients. Then we go to the next gene, and so forth, through a series of decisions. We use that type of decision tree to predict the groups or to predict the score.
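To make the idea concrete, here is a minimal sketch in Python with scikit-learn of the two kinds of model described: a random-forest classifier for the three patient groups and a random-forest regressor for a zero-to-one risk score. The data, gene panel, and settings are placeholders for illustration only, not the published models.

```python
# Minimal sketch (not the published models): a random-forest classifier for the
# three patient groups and a random-forest regressor for a 0-to-1 risk score,
# both trained on a small gene-expression matrix. All data here are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)

# Rows = patients, columns = expression levels of a small gene panel
# (7 hypothetical genes here; the real panels use roughly 7-19 genes).
X = rng.normal(size=(200, 7))
groups = rng.integers(1, 4, size=200)   # 1 = high risk, 2 = low risk, 3 = healthy-like
risk = rng.uniform(0.0, 1.0, size=200)  # continuous score between 0 and 1

# Model 1: predict which of the three groups a patient belongs to.
clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X, groups)

# Model 2: predict a quantitative risk score on the 0-1 scale.
reg = RandomForestRegressor(n_estimators=500, random_state=0)
reg.fit(X, risk)

new_patient = rng.normal(size=(1, 7))
print(clf.predict(new_patient))                     # e.g. array([2])
print(np.clip(reg.predict(new_patient), 0.0, 1.0))  # score kept inside [0, 1]
```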
Why did you choose a random forest model for this research?
In the past, the research group used simpler models – linear regressions. That worked well for a while, but the problem is that the genes we are measuring do not always act in isolation. Sometimes, two or three of them might work together or correlate with each other. That is not picked up as easily by something as simple as a linear model, where each variable is assessed independently.
The advantage of these random forest models is that they can look at interactions. If we have two genes that tend to act together – that tend to be active or inactive at the same time – then the different decision branches can capture these nonlinear interactions. That was my motivation for trying this type of model. They have definitely improved our results compared to the linear models the lab used before, and the accuracy was good enough that we decided to go with them. But there’s no reason why we couldn’t have used other, more complicated types of models – for example, neural networks or support vector machines. Random forests seem to work well on our end, and I think they were suitable for the type of data we had. But they are certainly not the only possibility.
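The point about interactions can be illustrated with a toy example: a synthetic outcome that depends only on two genes acting together is essentially invisible to a plain linear model, but a random forest picks it up. The data below are entirely made up for illustration and have nothing to do with the real gene panel.

```python
# Toy illustration: an outcome driven purely by an interaction between two
# on/off genes. A linear (logistic) model stays near chance level; a random
# forest, which can branch on one gene and then the other, recovers it.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
gene_a = rng.integers(0, 2, size=1000)   # gene A on/off
gene_b = rng.integers(0, 2, size=1000)   # gene B on/off
label = gene_a ^ gene_b                  # "risk" only when exactly one gene is on
X = np.column_stack([gene_a, gene_b]).astype(float)

print(cross_val_score(LogisticRegression(), X, label).mean())
# roughly 0.5: chance level, the linear model cannot use the interaction
print(cross_val_score(RandomForestClassifier(random_state=1), X, label).mean())
# close to 1.0: the tree branches capture the joint behaviour of the two genes
```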
What were the main challenges in developing this new model?
The real challenge here was that the models would work really well in our dataset, and then we would apply them to another dataset, and they wouldn’t necessarily work as well.
That was the motivation for first integrating so many different datasets into the training set. When users want to classify patients, the algorithm will first do that integration: it will take the new samples, align them with – or integrate them into – our reference set, and then make the prediction. I think this preparation step, where the samples are aligned, is really the crucial one, because it removes all of these differences in scale.
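As a rough illustration of this alignment idea – not the published pipeline – one simple way to put a new cohort on the same scale as a reference cohort is quantile alignment, gene by gene. The function and cohort names below are hypothetical.

```python
# Simplified sketch of reference alignment: map each gene in a new cohort onto
# the scale of the reference (training) cohort using quantile alignment, so
# cohort-to-cohort differences in measurement scale are removed before prediction.
import numpy as np

def align_to_reference(new_data: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Quantile-align each gene (column) of new_data to the reference cohort."""
    aligned = np.empty_like(new_data, dtype=float)
    for g in range(new_data.shape[1]):
        # Rank each new sample within its own cohort for this gene...
        ranks = np.argsort(np.argsort(new_data[:, g]))
        quantiles = (ranks + 0.5) / new_data.shape[0]
        # ...then read off the value at the same quantile of the reference distribution.
        aligned[:, g] = np.quantile(reference[:, g], quantiles)
    return aligned

rng = np.random.default_rng(2)
reference_cohort = rng.normal(loc=5.0, scale=1.0, size=(300, 7))   # training scale
new_cohort = rng.normal(loc=50.0, scale=10.0, size=(40, 7))        # different platform/scale
aligned_cohort = align_to_reference(new_cohort, reference_cohort)
# aligned_cohort now lives on the reference scale and can be passed to the model.
```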
For that, I took a lot of inspiration from the field of single-cell biology. Single-cell biology is a field that has exploded over the last decade, where you can get detailed information separately for each cell. That means you generate really huge datasets. Nowadays, you can have on the order of a million data points, and for each data point, you have thousands of measurements. That real increase in the size of biological data has led to a wave of new data science methods created to tackle these problems.
One big problem in single-cell biology is: if I have one experiment and then another experiment, can I integrate them? So basically, I treated this as the same problem and took inspiration from those methods. But rather than integrating single cells, here I’m integrating patients from one hospital versus another.
Dr Eddie Cano-Gamez is a postdoctoral researcher studying the host immune response during sepsis at the Wellcome Centre for Human Genetics. Dr Cano-Gamez has a background in immunogenomics, with a particular emphasis on the use of single-cell technologies to study cellular functions. He completed his PhD in Cambridge (2020), funded by a Gates Cambridge Scholarship, and trained under Dr Gosia Trynka at the Wellcome Sanger Institute.