Machine learning begins to unlock non-coding variation

Summary: New methods that use machine learning techniques to directly predict regulatory properties of non-coding sequence will likely play a key role in interpreting non-coding genetic variation. QTL analysis could quickly become obsolete.

Understanding the impact of non-coding genetic variation is key to understanding heritable complex traits in humans. We’ve all heard the story: the vast majority of loci (~95%) identified by genome-wide association studies (GWAS) implicate non-coding variation. This suggests that variation in regulatory function, rather than in protein coding sequences, is largely responsible for common diseases like diabetes and Crohn’s Disease. One common hypothesis is that many of these loci share an underlying model: a genetic variant affects binding of some factor to DNA, which affects transcription of a nearby gene(s), which affects some downstream cellular process leading to disease (even though we’re learning more and more that transcriptional changes might not necessarily lead to actual changes in protein levels). It sounds simple, but interpreting non-coding variation turns out to be extremely challenging.

Learning the regulatory code

So how can we predict which non-coding variants have “causal” effects leading to disease? Here are some options:

Is the site conserved across evolution? Then it might be important. But this doesn’t tell us a whole lot about what that variant might be doing and what cells it’s doing something in. Plus it will miss things that are only important in humans, and anyway conservation hasn’t been found to be hugely correlated with things like eQTLs and GWAS.
Does the variant overlap some annotation predictive of regulatory activity? e.g., does it overlap DNAseI hypersensitive sites, certain histone modifications, or certain transcription factor binding sites? With projects like ENCODE and the Roadmap Epigenomics Project, we now have tons of cell-type specific regulatory annotations. But just because a variant overlaps (or doesn’t) one of these annotations doesn’t mean it has any causal relationship with that annotation.
Is the variant a QTL? i.e., is the genotype of that variant associated with varying levels of your annotation of interest? Detecting QTLs requires measuring your phenotype of interest in dozens to thousands of samples. And at the end of the day, you still are left with a bunch of associations, which say nothing about causality.
Is the variant directly predicted to modulate a phenotype of interest in a given cell type? Until recently, we didn’t have a good way to answer this question besides performing months or years of experimental work. But new machine learning methods are giving us ways to start producing meaningful predictions of non-coding function.
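To make the QTL option concrete: at its core it is a regression of a molecular phenotype on genotype dosage across samples. Here is a minimal sketch, assuming invented toy data and a hypothetical helper name (`qtl_slope_and_r`); real QTL pipelines add covariates, normalization, and multiple-testing correction.

```python
from statistics import mean

def qtl_slope_and_r(genotypes, phenotypes):
    """Least-squares slope and Pearson r of phenotype on allele dosage (0, 1, 2)."""
    g_mean, p_mean = mean(genotypes), mean(phenotypes)
    sxy = sum((g - g_mean) * (p - p_mean) for g, p in zip(genotypes, phenotypes))
    sxx = sum((g - g_mean) ** 2 for g in genotypes)
    syy = sum((p - p_mean) ** 2 for p in phenotypes)
    slope = sxy / sxx               # change in phenotype per alternate allele
    r = sxy / (sxx * syy) ** 0.5    # strength of the association
    return slope, r

# Toy data, invented for illustration: one value per sample.
genotypes = [0, 0, 1, 1, 2, 2]                 # alternate-allele dosage
phenotypes = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]    # e.g. normalized accessibility

slope, r = qtl_slope_and_r(genotypes, phenotypes)
print(f"slope={slope:.2f}, r={r:.3f}")
```

Note that even a strong slope here is still only an association, and every data point is a separate sample that had to be measured.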

New methods provide direct sequence-based predictions of regulatory activity

In the last several months, a handful of new methods have come out for predicting regulatory activity of non-coding regions. All of these take similar forms: train a machine learning model to predict an annotation of interest based on local sequence features. These models seem to mostly capture features related to sequence specificities of transcription factors, but they can also take into account things like broader sequence context, co-binding of different factors, etc. There are several things I find beautiful about these methods. First, we can now use these models to directly predict the impact of a mutation by feeding a model several versions of a sequence containing different alleles and quantifying the change. Second, since quite accurate models can be built using a single dataset from a cell type of interest, these methods remove the need to measure these molecular phenotypes across hundreds of samples as is required for QTL analysis. This is a huge advantage. Below I take a brief look at some recently published methods (sorry if I am missing some), divided into two general classes.
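The "feed the model both alleles and quantify the change" idea can be sketched in a few lines. This is only an illustration: `toy_model` is a hypothetical stand-in for a trained sequence model (it just counts a motif), and `variant_effect` is a name I made up, not an API from any of the tools discussed here.

```python
def variant_effect(model, ref_seq, pos, alt_base):
    """Predicted score change (alt minus ref) for a single-base substitution.

    `model` is any callable mapping a DNA string to a number.
    `pos` is a 0-based index into `ref_seq`.
    """
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return model(alt_seq) - model(ref_seq)

def toy_model(seq, motif="GATA"):
    """Hypothetical stand-in for a trained model: counts occurrences of a motif."""
    n = len(motif)
    return float(sum(seq[i:i + n] == motif for i in range(len(seq) - n + 1)))

ref = "CCGATACC"
# The A>C substitution at index 3 destroys the only GATA site in this sequence.
effect = variant_effect(toy_model, ref, pos=3, alt_base="C")
print(effect)
```

Swapping `toy_model` for a real trained predictor is the whole trick: the scoring logic does not care what is inside the model.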

Kmer-based models

One class of methods is “kmer” based, meaning they train an underlying model to learn the effect of short kmers (e.g. <=10bp) on local sequence annotations. Super simplified version: every time I see the kmer ATCG I see tons of my transcription factor binding but every time I see AAAT I see nothing.
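In code, the super simplified version looks something like the sketch below: count the kmers in a sequence and add up a weight for each one. The weights here are invented for illustration; in the real methods they are learned from data.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping k-mer in a DNA sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_score(seq, weights, k):
    """Sum per-kmer weights over a sequence; kmers without a weight contribute 0."""
    return sum(weights.get(kmer, 0.0) * n for kmer, n in kmer_counts(seq, k).items())

# Invented weights: ATCG is associated with binding, AAAT slightly against it.
weights = {"ATCG": 2.0, "AAAT": -0.5}

score = kmer_score("AAATCGATCG", weights, k=4)
print(score)
```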

deltaSVM (Lee et al. Nature Genetics 2015) The authors build an SVM using “gapped kmers” (here k=10bp) as features to distinguish putative regulatory sequences from negative control sequences. Lee et al. used DNAseI hypersensitivity to identify regulatory sequences, but other annotations could feasibly be used. Once built, the SVM scores for each kmer can be used to predict the signal along a novel sequence. By comparing the delta in scores of each kmer overlapping a given variant for the reference and alternate allele, one can score the impact of individual variants. Variant effects captured by deltaSVM are correlated with “dsQTL” measurements (r=0.72), targeted mutagenesis experiments (r=0.53-0.78), and massively parallel reporter assays (r=~0.63). Three examples of validated causal SNPs are shown where deltaSVM correctly picks out the right variant, but only when trained using the relevant cell type, highlighting the importance of cell type specificity when interpreting non-coding variation. One concern I had is that the deltaSVM score itself cannot be readily applied to insertions or deletions, but it shouldn’t be too hard to devise a score that extends to other types of variants.
GERV (Zeng et al. Bioinformatics 2015). This is another kmer-based approach, this time with a focus on predicting transcription factor binding (vs. DNAseI HS above). GERV builds a generative model (rather than a classifier) of transcription factor binding by learning effects of all kmers up to a certain length (k=1 to 8 in the paper). It uses the log-linear combination of kmer effects plus local DNAseI HS information as a covariate to predict ChIP-seq read count. A key innovation is that they also account for “spatial” effects of kmers on larger local sequence context (+/- 200bp from the kmer itself), allowing it to learn things besides canonical motifs, like cofactors that bind nearby. The method predicts ChIP-seq signal from novel sequences fairly accurately (r=0.76) and accurately classifies bound and non-bound regions (AUC=0.97 on NF-κB). The paper shows examples of using GERV to prioritize allele-specific TF binding and SNPs implicated by GWAS known to alter TF binding. Overall I think it’s great to be able to predict these signals in ways that would have otherwise required many experiments across many samples. The paper focuses on TF binding, which requires one to already have an idea of what factor and what cell type are important, but there’s no reason it can’t be applied to other annotations.
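The deltaSVM-style scoring described above (sum the score difference over every kmer that overlaps the variant) can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it uses plain (ungapped) kmers, a tiny k, and invented weights, and the function names are hypothetical.

```python
def overlapping_kmers(seq, pos, k):
    """All k-mers of `seq` that include the 0-based position `pos`."""
    start = max(0, pos - k + 1)
    end = min(pos, len(seq) - k)
    return [seq[i:i + k] for i in range(start, end + 1)]

def delta_score(ref_seq, pos, alt_base, weights, k):
    """Sum of (alt weight - ref weight) over the k-mers overlapping a substitution."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    ref_kmers = overlapping_kmers(ref_seq, pos, k)
    alt_kmers = overlapping_kmers(alt_seq, pos, k)
    # The two lists are aligned by start position, so zip pairs matching windows.
    return sum(weights.get(a, 0.0) - weights.get(r, 0.0)
               for r, a in zip(ref_kmers, alt_kmers))

# Invented per-kmer weights standing in for learned SVM scores.
weights = {"GAT": 1.5, "ATA": 1.0, "GCT": -0.2}

delta = delta_score("CCGATACC", pos=3, alt_base="C", weights=weights, k=3)
print(round(delta, 3))
```

Because only windows of equal length are compared, this form handles substitutions but not insertions or deletions, which is the limitation noted above.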

“Deep learning” using convolutional neural nets

A second class of methods relies on “deep learning” (every time I hear that word I can’t help but think about this tweet by Daniel MacArthur). Specifically, these methods rely on a technique called deep convolutional neural networks (CNNs), which I had admittedly never heard of until a couple months ago, and I should probably learn more about. CNNs can capture complicated nonlinear sequence features that might not be well captured by kmer based approaches, and it seems we can even pull out relevant features in ways that we can learn some biology from. I really think these will be a valuable approach for many aspects of genomics going forward.

deepSEA (Zhou and Troyanskaya, Nature Methods 2015). This method uses chromatin profiling data to train CNNs to predict effects of noncoding variation on TF binding, DNA accessibility, and histone modifications. Like GERV, it also integrates sequence information from wider contexts, allowing it to learn features at multiple spatial scales, unlike many previous efforts to learn TF binding. Importantly, they “multitask” to jointly learn models for different factors that might share predictive features (as do the other methods described below). They trained the model on close to a thousand TF, DNAseI, and histone modification datasets and showed they can distinguish positive vs. negative sequences (TF AUC=0.96, DNAseI AUC=0.92, histone AUC=0.86) more accurately than previous methods like deltaSVM. They use deepSEA to evaluate effects of SNPs and indels and prioritize eQTL and GWAS variants. Overall it looks like deepSEA performs quite well, but I was hoping to see more about how to interpret these models to learn about important biological features, which the two methods below did more of.
DeepBind (Alipanahi et al. Nature Biotechnology 2015) DeepBind also uses deep CNNs to learn models of sequence specificity. This was actually the first of these papers I came across, and it took me a couple readings of the introduction to understand the goal: given regions experimentally found to be bound by some protein, what is the model describing bound sequences? This group is clearly excited about the “big data” movement: they emphasize they trained on 12TB (!) of data from hundreds of ENCODE experiments and can perform massively parallel computation on sophisticated GPUs. Some of the authors are even involved in a new company, “Deep Genomics”. Similar to the other methods, the input consists of sets of sequences and binding scores (choose your favorite experimental method) which are used to train the CNN to predict binding scores for novel sequences. They also produce some very helpful visualizations. The CNN outputs learned “weighted ensembles of PWMs” that can be compared to known PWMs. They also generate “mutation maps” that indicate how each possible variant affects binding. Two main applications are presented: identifying RNA binding sites that could be involved in splicing regulation and identifying disease associated variants affecting TF binding (I think their Figure 4 is quite beautiful, in which they analyze a handful of non-coding variants known to alter TF binding). I think it’s exciting that they show that this method extends beyond just TF binding, emphasizing how CNNs can be applied to a diverse set of problems.
Basset (Kelley, Snoek, and Rinn, bioRxiv 2015) Basset once again uses deep CNNs, but this time to learn “functional activity” (e.g. DNAseI hypersensitivity) rather than binding of a specific protein. They point out a big focus of this work is making these tools available to the wider genetics community: “To fully exploit the value of these models, it is essential that they are technically and conceptually accessible to the researchers who can take advantage of their potential.” To that end, they have a nice open source project on Github with great documentation and some awesome IPython notebook tutorials (see below). They also spend a little more time discussing what’s going on in these CNNs (although I’m still pretty confused). Importantly, the first “convolution” layer corresponds to optimizing a set of “filters” that can be thought of as PWMs. The resulting filters from the model can be recovered and compared to known binding motifs. Like the other approaches, they use lots of ENCODE and Roadmap Epigenomics data to train and look at applying their model to assigning GWAS variants cell-specific scores. For example, they show that rs4409785, which is associated with several autoimmune diseases and assigned high probability of causality by a statistical fine-mapping method called PICS (Farh et al), is strongly implicated to alter CTCF binding across a range of cell types. This again highlights the exciting potential to start digging into GWAS in a way that previously required months or years of experimental work. It also highlights the huge wealth of information that already lies in existing public datasets.
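To demystify the "first convolution layer as a set of PWM-like filters" idea, and the "mutation maps" mentioned for DeepBind, here is a toy sketch in plain Python. It is my own simplification, not code from any of these papers: a single hand-written filter (real models learn hundreds of filters and stack further layers on top), with all names and weights invented for illustration.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of 4 numbers in the order A, C, G, T."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq]

def conv_scan(seq, filt):
    """Slide a PWM-like filter (width x 4 weights) along the sequence."""
    x = one_hot(seq)
    width = len(filt)
    return [
        sum(x[i + j][c] * filt[j][c] for j in range(width) for c in range(4))
        for i in range(len(seq) - width + 1)
    ]

def max_activation(seq, filt):
    """ReLU then max over positions: one number summarizing the filter's response."""
    return max(max(a, 0.0) for a in conv_scan(seq, filt))

def mutation_map(seq, filt):
    """Change in max activation for every possible single-base substitution."""
    base_score = max_activation(seq, filt)
    return {
        (pos, alt): max_activation(seq[:pos] + alt + seq[pos + 1:], filt) - base_score
        for pos in range(len(seq))
        for alt in BASES
        if alt != seq[pos]
    }

# Invented filter preferring the motif TGA (rows = motif positions; columns = A, C, G, T).
toy_filter = [
    [-1.0, -1.0, -1.0, 1.0],   # position 1 prefers T
    [-1.0, -1.0, 1.0, -1.0],   # position 2 prefers G
    [1.0, -1.0, -1.0, -1.0],   # position 3 prefers A
]

seq = "CCTGACC"
acts = conv_scan(seq, toy_filter)       # peaks where TGA occurs
mmap = mutation_map(seq, toy_filter)    # (position, alt base) -> score change
print(acts)
print(mmap[(3, "C")])
```

Substitutions inside the TGA site lower the score while substitutions in the flanks leave it unchanged, which is the pattern a mutation map is meant to expose around a real binding site.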

Are these methods accessible to non-machine learning gurus?

I was happy to see that all of these studies made some or all of their methods available by providing the source code, interactive web applications, or both. deltaSVM provides source code and precomputed models. GERV comes packaged in a Docker container, and has extensive instructions on how to use it on the Amazon cloud. deepSEA provides a webapp to perform in silico mutagenesis or predict the epigenetic state of a sequence as well as the source code to do so. DeepBind has a webapp to visualize motif patterns discovered by their models plus packaged binaries to perform predictions, although I didn’t see a way to train new models. Finally, Basset lives up to its claim of making all of this accessible and I think is the one I’ll likely invest in learning. Although it took a little effort to install all the dependencies, there is good documentation and wonderful IPython tutorials walking through many of the steps to train and apply the models, plus precomputed models from public datasets. Kudos to all of these studies for making these things accessible.

Conclusion

I think these methods have huge potential applications in human genetics. One criticism I’ve heard is that these just provide one more annotation to add to our list. But they are much more than that: we can now perform a single experiment which then allows us to directly predict the impact of any variant in a given cell type. Sure, there are still limitations: TF binding will not be the answer every time, and there are likely variants/regions/annotations these methods simply won’t work well for because something else is at play. But I think these will prove to be valuable tools going forward as we begin to sift through all of these non-coding variants in much deeper ways than we’ve been able to do with only genetic associations.