Ellison Institute creates tissue ‘fingerprints’ for cancer diagnostics

The Lawrence J Ellison Institute for Transformative Medicine of USC (Ellison Institute) has revealed a promising two-step technique to train a predictive algorithm for cancer diagnostics. 

The study pairs novel tissue ‘fingerprints’, microscopic hematoxylin and eosin (H&E) histologic features of tumours, with correct diagnoses to train a deep learning model to classify breast cancer ER/PR/HER2 status.

These initial findings suggest that AI algorithms can be used to make correlations between a tumour’s architectural pattern and a correct diagnosis. The research paper, published in Scientific Reports, found the area under the receiver operating characteristic (ROC) curve (AUC) to be 0.88 when averaged across different regions within the slide.
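
To make the metric concrete, the following is a minimal Python sketch of how region-level predictions could be averaged into one score per slide and then summarised as an AUC. The function names, scores and labels are invented for illustration; this is not the study's code or data.

```python
from statistics import mean


def slide_score(region_scores):
    """Average the per-region predictions into one slide-level score."""
    return mean(region_scores)


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win (the Mann-Whitney U formulation of AUC).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical data: each slide has several region-level scores and one label
# (1 = receptor-positive, 0 = receptor-negative).
slides = [
    ([0.9, 0.8, 0.7], 1),
    ([0.6, 0.4, 0.5], 1),
    ([0.7, 0.5, 0.6], 0),
    ([0.2, 0.1, 0.3], 0),
]
labels = [label for _, label in slides]
scores = [slide_score(regions) for regions, _ in slides]
auc = roc_auc(labels, scores)
print(f"slide-level AUC: {auc:.2f}")  # slide-level AUC: 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of positive and negative cases, so in this toy example three of the four positive-negative pairs are ranked correctly.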

Lead author of the research paper, Dr Rishi Rawat, states: ‘If you train a computer to reproduce what a person knows how to do, it’s never going to get far beyond human performance. But if you train it on a task 10 times harder than anything a person could do you give it a chance to go beyond human capability. With tissue fingerprinting, we can train a computer to look through thousands of tumour images and recognise the visual features to identify an individual tumour. Through training, we have essentially evolved a computer eye that’s optimised to look at cancer patterns.’

This research was carried out using fewer than a thousand annotated breast cancer pathology slides, suggesting that more correctly labelled data and algorithmic refinements could further increase accuracy.

This study expands on earlier work reported in a previous Scientific Computing World article, in which Dan Ruderman discusses the use of AI and ML techniques in characterising cancer subtypes using readily available hematoxylin and eosin (H&E) stains.

Overcoming data shortages

One of the primary challenges of developing artificial intelligence (AI) tools to diagnose cancer is that machine learning algorithms require clinically annotated data from tens of thousands of patients before they can consistently recognise meaningful relationships in the data. A dataset of that size is nearly impossible to gather in cancer pathology. Researchers training computers to diagnose cancer typically only have access to hundreds or, in some cases, thousands of pathology slides annotated with correct diagnoses.

To overcome this limitation, the Ellison Institute scientists introduced a two-step process of priming the algorithm to identify unique patterns in cancerous tissue before teaching it the correct diagnoses.

The first step in the process introduces the concept of tissue ‘fingerprints’, or distinguishing architectural patterns in tumour tissue, that an algorithm can use to discriminate between samples, because no two patients’ tumours are identical. These fingerprints are the result of biological variations such as the presence of signalling molecules and receptors that influence the 3D organisation of a tumour. The study shows that AI spotted these fine structural differences on pathology slides with greater accuracy and reliability than the human eye, and was able to recognise these variations without human guidance.

In the current Ellison Institute study, the research team took digital pathology images, split them in half and prompted a machine-learning algorithm to pair the images based on their molecular fingerprints.  This practice showcased the algorithm’s ability to group ‘same’ and ‘different’ pathology slides without paired diagnoses, which allowed the team to train the algorithm on large, unannotated datasets (a technique known as self-supervised learning).
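
The pairing task can be illustrated with a small Python sketch. It assumes each half-image has already been turned into a numeric fingerprint vector and shows only the matching and scoring step, not the training of the deep learning model that would produce those vectors. The vectors and the function names (`match_halves`, `pairing_accuracy`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_halves(left, right):
    """For each left-half fingerprint, return the index of the most similar right half."""
    return [
        max(range(len(right)), key=lambda j: cosine(l_vec, right[j]))
        for l_vec in left
    ]


def pairing_accuracy(left, right):
    """Fraction of left halves matched to the right half of the same slide.

    Assumes left[i] and right[i] come from the same slide, so the slide's
    identity is the only 'label' needed and no diagnosis is involved.
    """
    matches = match_halves(left, right)
    return sum(i == j for i, j in enumerate(matches)) / len(left)


# Hypothetical 3-dimensional fingerprints for the two halves of three slides.
# In practice these would come from a neural network, not be typed by hand.
left_halves = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
right_halves = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]

print(match_halves(left_halves, right_halves))      # [0, 1, 2]
print(pairing_accuracy(left_halves, right_halves))  # 1.0
```

Because the supervision signal is simply "which halves came from the same slide", a score like this can be computed on any collection of slides, annotated or not.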

Dr Dan Ruderman, director of analytics and machine learning at the Ellison Institute, commented: ‘With clinically annotated pathology data in short supply, we must use it wisely when building classifiers. Our work leveraged abundant unannotated data to find a reduced set of tumour features that can represent unique biology. Building classifiers upon the biology that these features represent enables us to efficiently focus the precious annotated data on clinical aspects.’

Once the model was trained to identify breast cancer tissue structure that distinguishes patients, the second step called upon its established grouping ability to learn which of those known patterns correlated to a particular diagnosis.  
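
As a rough sketch of this second step, the Python example below fits a simple classifier on top of fixed fingerprint vectors using only a handful of annotated cases. A nearest-centroid classifier is used here purely as a stand-in for the study's deep learning classifier, and the fingerprints, the labels and the function names are all hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def fit_centroids(fingerprints, labels):
    """Compute one mean fingerprint per diagnostic label from the annotated cases."""
    grouped = {}
    for vec, label in zip(fingerprints, labels):
        grouped.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in grouped.items()}


def predict(centroids, fingerprint):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(fingerprint, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Hypothetical fingerprints from an already-trained (frozen) step-one model,
# paired with a small number of annotated diagnoses.
train_fingerprints = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.1, 0.7]]
train_labels = ["ER+", "ER+", "ER-", "ER-"]

model = fit_centroids(train_fingerprints, train_labels)
print(predict(model, [0.7, 0.2]))  # ER+
print(predict(model, [0.3, 0.8]))  # ER-
```

The design point is that the annotated data is spent only on mapping existing features to diagnoses, while the features themselves were learned without labels.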

The discovery training set of 939 cases obtained from The Cancer Genome Atlas enabled the algorithm to accurately assign diagnostic categories of ER, PR, and HER2 status to whole-slide H&E images with 0.89 AUC (ER), 0.81 AUC (PR), and 0.79 AUC (HER2) on a large independent test set of 2531 breast cancer cases from the Australian Breast Cancer Tissue Bank.

Running on Oracle Cloud technology, the research study aims to create a new paradigm in medical machine learning, which may allow machine learning techniques to process unannotated or unlabelled tissue specimens, as well as variably processed tissue samples. This could greatly increase the number of samples that can be processed and used to train the machine learning algorithm, leading to more accurate diagnoses.

In addition to Rawat and Ruderman, other study authors include Itzel Ortega, Preeyam Roy and David Agus of the Ellison Institute; along with Ellison Institute affiliate Fei Sha of USC Michelson Center for Convergent Bioscience; and USC collaborator Darryl Shibata of the Norris Comprehensive Cancer Center at Keck School of Medicine.

The study’s computing resources were facilitated by Oracle Cloud Infrastructure through Oracle for Research, Oracle’s global program providing free cloud credits and technical support to researchers, and the study was supported in part by the Breast Cancer Research Foundation grant BCRF-18-002. The research appears in Scientific Reports.

This article is based on an announcement written by Alexandra Demetriou for the Lawrence J Ellison Institute for Transformative Medicine of USC.