Deep learning in GATK4

Posted by samwell on 21 Dec 2017 (9)

By Sam Friedman, deep learning developer in GATK4

Over the past couple of weeks, there's been a lot of chatter online --and in the press!-- about the applicability of deep learning to variant calling. This is exciting to me because I've been working on developing a deep learning approach to variant filtering based on Convolutional Neural Networks (CNNs), with the goal of replacing the VQSR filtering step in the GATK germline short variants pipeline. In fact, multiple groups have been picking up on the promise of deep learning and applying it to genomics.

As far as I'm aware, the first group to publish a deep learning-based approach for variant calling was the Campagne lab at Cornell, who released their variationanalysis software in December 2016. There's also a group at Illumina that has been doing some interesting work with deep learning for predicting functional effects of variants, and some of my colleagues are currently working with researchers at Microsoft to see if CNNs can be used to discover complex Structural Variations (SVs) from short sequencing reads.

When the Google team made the source for DeepVariant public last week, I ran a few tests to compare it to the tool that I've been working on for the last six months (GATK4 CNN). The results are summarized in the table below.

Both my tool and DeepVariant outperform VQSR (our "traditional" variant filtering algorithm), especially when VQSR is run on a single sample rather than on a cohort according to our current best practices. The delta isn't all that large on SNPs, but that's expected because germline SNPs are largely a solved problem, so all tools tend to do great there. The harder problem is indel calling, and that's where we see more separation between the tools. The good news is we're all doing better than VQSR on calling indels, which means progress for the research community, whatever else happens. It doesn't hurt my mood that these preliminary findings suggest GATK4 CNN is doing even better than DeepVariant :)

But keep in mind it's early days yet for deep learning in genomics, so a lot could still change as we all figure out how best to take advantage of these methods. Read on if you want to know more about how these results were generated and how my tool works under the hood.

The table above was created from a held out test set of variants enriched for difficult calls. We compute the ROC curves from the VQSLOD scores from VQSR, QUAL scores from deep variant, and the scores from our CNN. SynDip, if you're not familiar with it, is the new truth set we're using as recently described by my colleague Yossi Farjoun here, and gnomAD is the Genome Aggregation Database produced by Daniel MacArthur and colleagues.

I'll be giving a talk about this work at the AGBT conference in Orlando in February 2018, and I'm aiming to have the tools ready for general release at that time. If you'd like to know more in the meantime, read on below, and feel free to check out the code on the GATK repo on Github.

How it works under the hood

Deep learning is a pretty rich field of research that comprises many different approaches, and there's still a lot to be figured out in terms of what's going to transfer well to any particular problem. For example, the Campagne lab's approach starts from reads and, in their own words, "maps structured data (alignment) to a small number of informative features, which can be used for efficient DL model training". In my work I've chosen to treat the problem as a variant filtering question (or classification in machine learning terms), where I take the output of the caller and use a deep learning process to filter that callset. It seems the Google team's approach is similar in that their software involves a first step that looks a lot like the HaplotypeCaller (which makes sense since Mark and Ryan wrote most of the HC's original code), then applies the deep learning classifier to the variant candidates.

My approach uses variants and annotations encoded as tensors, which carry the precise read and reference sequences, read flags, as well as base and mapping qualities. The modeling architecture I use is specifically designed to process these read tensors, and so avoids the extensive use of maxpooling found in many image classification CNNs (maxpooling is motivated by the smoothness prior of natural images, where neighboring pixels tend to be the same, but DNA sequences are not smooth, so we reserve maxpooling for the deeper layers of the net). Because the tensor dimensions contain different types of data, I convolve each separately so as to allow deep layers in the net to see this meaningful difference. This architecture also integrates annotations seamlessly into the classification, which accelerates learning times dramatically and opens the door to unsupervised training down the road.

The animations below shows two read tensors one containing a heterozygous SNP the other with a homozygous deletion of a short tandem repeat. You can see the reads with the individual base qualities, the reference sequence, the read flags and the mapping quality each in their own channel.

If you'd like to know more or if you're a developer yourself and want to discuss the pros and cons of various approaches, please feel free to comment in the discussion thread below.

Return to top

Thu 21 Dec 2017

akundaje on 21 Dec 2017

What's your class imbalance for your classification tasks. I expect there is significant imbalance since variants are rare events as compared to the size of the genome. If that is the case, auROC can be very misleading. You'd want to report area under the precision recall curve and recall at specific precisions eg. 70, 80, 90 %. Had posted this on Twitter. You guys should get on there. -Anshul Kundaje.

samwell on 21 Dec 2017

Hi Anshul, Thanks for the comment. The AUC table was made from a test set of variants enriched for negative examples so that the classes are nearly balanced. -Sam

akundaje on 21 Dec 2017

Ok great. But does your test set reflect the actual imbalance you expect to see when you use this to call variants across the whole genome? If not, then the test set would not be representative.

samwell on 21 Dec 2017

Good point! We look only at sites emitted by HaplotypeCaller, not over the whole genome. Our variant _filtering_ architecture classifies these calls as either real sites of variation or incorrect calls. We think AUC is still a valuable metric in this context, because HaplotypeCaller's mandate is sensitivity and artifactual calls are not so rare. Of course, we also look at Precision and Recall Curves. Below you can see a table of area under the precision recall curve on the same test set as auROC above. ![]( Class balancing is often tricky and it is especially difficult in genomics. In variant filtration, we have many fewer negative examples than positive ones and typically far fewer INDELs than SNPs. To avoid learning this background distribution, during training we uniformly sample from each of the classes in each minibatch. For variant calling, where the class imbalance is even more extreme, we have resorted to using a weighted categorical crossentropy as our loss function, which penalizes errors at sites of variation far more than errors at homozygous reference sites. We are always looking for better ways to cope with the class balance problem, if you (or any reader) has ideas, please share!

akundaje on 21 Dec 2017

Looks great. We use similar strategies with forcibly adding minority class samples to each minibatch and weighted loss. But still trying to figure out a loss function that explicitly optimizes precision/recall rather than auROC or cross entropy. Not there yet. Surprisingly few elegant direct solutions to the label imbalance problem.

anuar12 on 21 Dec 2017

If one uses binary cross-entropy loss and optimizes for F1-score, you can try modifying binary cross-entropy to -log^2(x) or log^3(x), this way the loss would penalize wrong predictions after 0.5 more since F1 operates on discrete classes. And penalty for points that are closer to correct probability will be lower wrt original binary cross-entropy.

ambrex on 21 Dec 2017

Nice animations of the tensors. Would you mind sharing what tool you used to create them?

shlee on 21 Dec 2017

HI @ambrex, I'll let the developer know you have a question. It may be a while before they respond. We are gearing up for the GATK4 release next week.

samwell on 21 Dec 2017

Hi @ambrex Glad you liked the tensor animations! We wrote our own viewer. You can check out the [source code here]( "source code here"). It is built in Java on top of the [processing]( "processing") library.

- Recent posts

- Upcoming events

See Events calendar for full list and dates

- Recent events

See Events calendar for full list and dates

- Follow us on Twitter

GATK Dev Team


RT @BroadFireCloud: We've updated the preprocessing #GATK4, somatic CNV & SNV featured workspaces w/ time & cost benchmarks! Grab free cred…
18 Jan 18
This shows our mothership, the Data Sciences Platform at Broad. It’s amazing... and it’s expanding! Check out the v…
17 Jan 18
@ksuhre @desertGenomics Btw, for those who didn't get it the reference is; not the *most* b…
17 Jan 18
@BioinfoMcDermot Can’t claim credit for this happy coincidence but delighted it worked out that way!
16 Jan 18
@FabienCampagne If you post details in the forum we’d be happy to look into it. Was this with the 4.0 release?
16 Jan 18

- Our favorite tweets from others

@gatk_dev Thanks for giving GATK a BSD license. Great scientific software available to all
13 Jan 18
Thanks @broadinstitute @gatk_dev for the awesome Amazon gift card!!! I am happy to answer all of your surveys!
11 Jan 18
The @broadinstitute this week released #GATK4, the much-anticipated version 4 update to the Genome Analysis Toolkit…
10 Jan 18
Ditto here, thanks @gatk_dev @BroadFireCloud! New opportunities for advancement in #genomics #cancer #ngs research…
10 Jan 18
Thanks for free credits @gatk_dev @BroadFireCloud & @googlecloud!
10 Jan 18

See more of our favorite tweets...