Artificial intelligence deciphers genetic instructions
A German-American team of scientists have deciphered some of the more elusive instructions encoded in DNA with the help of artificial intelligence (AI). Their neural network, trained on high-resolution maps of protein-DNA interactions, uncovers subtle DNA sequence patterns throughout the genome, thus providing a deeper understanding of how these sequences are organised to regulate genes.
Artificial intelligence algorithms are extremely powerful at fitting massive and complex datasets. But interpreting them, that is, rationalising how the machine arrives at a specific prediction when presented with a given input, is notoriously hard. This black-box behaviour hampers wide acceptance of AI in medical diagnostics, where justifications matter, and restrains its utility in the natural sciences, where understanding mechanisms is the goal.
‘Neural networks are black boxes, but they can be interrogated digitally. So, with a large number of virtual experiments, we figured out the rules the neural net learned,’ says first author Dr Žiga Avsec, member of the group of Julien Gagneur, professor of computational molecular medicine at the Technical University of Munich. Together with Anshul Kundaje, professor at Stanford University, he created the first version of the model when he visited Stanford as a guest scientist. An interdisciplinary team of biologists and computational researchers continued this research and have now shown that neural networks can be used to decipher complex instructions encoded in DNA.
Researchers working on this project from the Technical University of Munich, the Stowers Institute for Medical Research and Stanford University are employing neural networks, such as those used for facial recognition, together with newly developed model interpretation techniques to decipher complex instructions encoded in DNA.
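One way to picture the ‘virtual experiments’ mentioned above is to change one DNA base at a time and record how much a model's output drops. The following Python sketch illustrates that general idea only; it is not the team's interpretation technique. The function `toy_model`, the helper `mutation_effects` and the pattern "TGAC" are hypothetical stand-ins invented for this example, with `toy_model` taking the place of a trained network.

```python
BASES = "ACGT"


def toy_model(sequence):
    """Hypothetical stand-in for a trained network: counts a made-up pattern."""
    return float(sequence.count("TGAC"))


def mutation_effects(sequence, model):
    """For each position, return the largest drop in model output
    caused by substituting any other base at that position."""
    baseline = model(sequence)
    effects = []
    for position, original in enumerate(sequence):
        drops = []
        for base in BASES:
            if base == original:
                continue
            mutated = sequence[:position] + base + sequence[position + 1:]
            drops.append(baseline - model(mutated))
        effects.append(max(drops))
    return effects


# Positions whose mutation lowers the output are the ones the model relies on.
effects = mutation_effects("AATGACGG", toy_model)
for position, effect in enumerate(effects):
    print(position, "AATGACGG"[position], effect)
```

In this toy case only the four bases of the made-up pattern show an effect, which is the kind of signal such an experiment is meant to surface.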
One of the big unsolved problems in biology is the genome’s second code, its regulatory code. The DNA bases encode not only the instructions for how to build proteins, but also when and where to make these proteins in an organism.
The regulatory code is read by proteins called transcription factors that bind to short stretches of DNA called motifs. However, how particular combinations and arrangements of motifs specify regulatory activity is an extremely complex problem that has been hard to pin down.
DNA binding experiments and computational modelling going hand in hand
The key was to perform transcription factor-DNA binding experiments and computational modelling at the highest possible resolution, down to the level of individual DNA bases. The increased resolution allowed the team not only to train highly accurate neural network models but also to extract the key elements and patterns from the models, including transcription factor binding motifs and the combinatorial rules by which they function together as code.
Applied to master regulators of stem cell differentiation and confirmed experimentally by CRISPR, the approach revealed complex rules involving a precise positioning along the DNA double helix and specific ordering of events.
‘This was extremely satisfying,’ commented project leader Julia Zeitlinger, investigator at the Stowers Institute and professor at the University of Kansas Medical Center, ‘as the results fit beautifully with existing experimental results, and also revealed novel insights that surprised us.’
A pattern becomes visible: how Nanog binds to DNA
For example, the researchers found that a well-studied transcription factor called Nanog binds cooperatively to DNA when multiples of its motif are present in a periodic fashion such that they appear on the same side of the spiralling DNA helix.
‘There has been a long trail of experimental evidence that such motif periodicity sometimes exists in the regulatory code,’ Zeitlinger stated. ‘However, the exact circumstances were elusive, and Nanog had not been a suspect. Discovering that Nanog has such a pattern, and seeing additional details of its interactions, was surprising because we did not specifically search for this pattern.’
‘This is the key advantage of using neural networks for this task. A classic computational model is built on hand-crafted, rigid rules to ensure that it can be interpreted,’ says Avsec. ‘However, biology is extremely rich and complicated. By abandoning the need to interpret individual parameters, we can train much more flexible and nuanced models that capture any biological phenomena, including those yet unknown.’
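The helical periodicity described above for Nanog can be made concrete with a little arithmetic. The sketch below assumes that B-form DNA completes one turn roughly every 10.5 base pairs, so two motifs whose spacing is close to a multiple of 10.5 face the same side of the helix. The pattern "TGAC", the 40-degree tolerance and all function names are hypothetical choices made for illustration; none of them comes from the study.

```python
# Assumption: B-form DNA completes one helical turn roughly every 10.5 base pairs.
HELICAL_PERIOD_BP = 10.5


def find_motif_starts(sequence, motif):
    """Return the start index of every occurrence of motif in sequence."""
    starts = []
    index = sequence.find(motif)
    while index != -1:
        starts.append(index)
        index = sequence.find(motif, index + 1)
    return starts


def helical_offset_degrees(spacing_bp, period_bp=HELICAL_PERIOD_BP):
    """Angular offset (0 to 180 degrees) between two sites spacing_bp apart."""
    phase = (spacing_bp % period_bp) / period_bp * 360.0
    return min(phase, 360.0 - phase)


def same_face(spacing_bp, tolerance_degrees=40.0):
    """True if two sites spacing_bp apart face roughly the same side of the helix.
    The tolerance is an arbitrary illustrative threshold."""
    return helical_offset_degrees(spacing_bp) <= tolerance_degrees


TOY_MOTIF = "TGAC"  # hypothetical pattern, not the real Nanog motif
sequence = TOY_MOTIF + "A" * 17 + TOY_MOTIF
starts = find_motif_starts(sequence, TOY_MOTIF)
spacing = starts[1] - starts[0]
print(spacing, same_face(spacing))
```

Here the two copies sit 21 base pairs apart, which is two full turns under the 10.5 bp assumption, whereas a spacing of 5 or 16 base pairs would place them on nearly opposite faces.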
A powerful bottom-up approach
This neural net model – named BPNet for Base Pair Network – is a powerful bottom-up approach similar to facial recognition in images, where a neural network first detects edges in the pixels, then learns how edges form facial elements like the eye, nose or mouth, and finally how facial elements together form a face.
Instead of learning from pixels, BPNet learns from the raw DNA sequence and learns to detect sequence motifs and eventually the higher-order rules by which the elements predict the base-resolution binding data.
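To illustrate what learning from the raw DNA sequence can look like in practice, the sketch below turns a sequence into a one-hot matrix, the DNA analogue of pixel values, and slides a small filter along it, much as the first layer of an image network slides edge detectors over pixels. This is not BPNet's code: the filter is set by hand to match the made-up pattern "TGAC", whereas a real network would learn its filter weights from data. All names here are hypothetical.

```python
BASES = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as rows of four numbers (A, C, G, T).
    Any other character, such as N, becomes a row of zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence.upper()]


def scan(encoded, weights):
    """Slide a filter (width x 4 weights) along the encoded sequence
    and return one score per start position."""
    width = len(weights)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = 0.0
        for offset in range(width):
            for channel in range(4):
                score += encoded[start + offset][channel] * weights[offset][channel]
        scores.append(score)
    return scores


# Hypothetical hand-set filter; a trained network would learn these weights.
TOY_FILTER = one_hot("TGAC")

scores = scan(one_hot("AATGACGG"), TOY_FILTER)
best_position = max(range(len(scores)), key=scores.__getitem__)
print(scores, best_position)
```

Stacking further layers on top of such position-wise scores is what would let a network combine individual motif matches into higher-order rules such as spacing and ordering.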
Both the Zeitlinger Lab and the Kundaje Lab are already using BPNet to reliably identify binding motifs for other cell types, relate motifs to biophysical parameters and learn other structural features in the genome such as those associated with DNA packaging. To enable other scientists to use BPNet and adapt it for their own needs, the researchers have made the entire software framework available with documentation and tutorials.
This work was supported in part by the Stowers Institute for Medical Research and by the National Human Genome Research Institute and the National Institute of General Medical Sciences of the National Institutes of Health (NIH). Additional support came from the German Federal Ministry of Education and Research, a Stanford BioX Fellowship and a Howard Hughes Medical Institute International Student Research Fellowship.
Gene sequencing was performed at the Stowers Institute for Medical Research and the University of Kansas Medical Center Genomics Core, supported by NIH awards from the National Institute of Child Health and Human Development and the National Institute of General Medical Sciences.