Short Variant Callers Roundup: Two Sisters

Authors of the blog: Laura Gauthier and David Benjamin

I like to think of Mutect2 and HaplotypeCaller as two sisters. One is a little older, but they live in the same house and share a fair amount of genetic material. One is gregarious and surrounds herself with friends. The other prefers the intimate company of a few special people. But they both have a formidable intellect, a determined work ethic, and a selfless instinct to help. Mutect2 and HaplotypeCaller both follow the same basic recipe: determine the active region, assemble the haplotypes, evaluate per-read allele likelihoods, and calculate variant likelihoods. Because their code is fundamentally alike for the first three steps, many changes in one tool are likely to impact the other. Recently Mutect2 development has been much more active than HaplotypeCaller development, but both tools are able to benefit. In his blog post (see https://software.broadinstitute.org/gatk/blog?id=23400), David Benjamin outlined the new Mutect2 features in more detail. Below you’ll see how many of them have consequences for HaplotypeCaller as well.


Improved active region logic: more accurate filtering

Previous algorithms did not count bases that were hardclipped during local assembly because they fell outside the active region. Overfiltering variants at the ends of reads made Mutect2 less sensitive, and penalized a number of germline variants as well. We’ve improved the active region logic by clarifying how the position of a variant within a read is measured, an update that benefits Mutect2 via the ReadPosition annotation, and HaplotypeCaller via ReadPosRankSum. Mutect2 has already demonstrated improved sensitivity and the current implementation should improve filtering accuracy in germline cases like these as well.

Improved graph assembly: better accuracy at high depth

Inspired by Mutect2 work on mitochondrial data, David improved the tendency to assemble low quality haplotypes by introducing “adaptive pruning” into the assembly of candidate haplotypes (see https://github.com/broadinstitute/gatk/docs/local_assembly.pdf for more information about the local assembly used). Mitochondrial data typically have very high depth (thousands of reads per base), and many positions will contain at least three matching sequencing errors by chance. Because the standard threshold for removing errors from the graph is two reads, the result is many more “variants” in the graph. The new solution uses local coverage to determine the evidence for “pruning” likely erroneous paths from the graph. This improved the sensitivity for low allele fraction variants in mitochondrial genome data and has proven critical for mitochondrial exome analysis. This feature is off by default for HaplotypeCaller, but can be turned on in high depth regions with dense variants. Samples with ploidy higher than two or pooled experiments should benefit greatly as well.

Improved per-read likelihoods: better local assembly visualization

Read visualization - based on per-read likelihood calculations - was improved in both tools by David’s change in how “uninformative” reads are handled. The likelihoods calculated by the PairHMM algorithm favor deletions because of a slight bias towards shorter haplotypes. If reads could be explained by either a deletion or a SNP, the shorter allele will have a slightly higher likelihood, introducing a deletion into a read in a repetitive reference context when the more parsimonious representation would be a SNP. The result: read realignments from the bamout mode showing spurious deletions in some cases. GATK 4.1’s version prefers the alignment with the simpler CIGAR, which improves the bamout visualization for both tools, most noticeably in regions with STRs.

Caption: top view is GATKv4.0, bottom is GATKv4.1: the read with the 5bp deletion in the top becomes the read with the low quality G on the left in the bottom.

Improved genotyping: GVCF mode in beta for somatic and germline mitochondria research

The GATK 4.1.0.0 release introduces a beta version of GVCF mode for Mutect2 designed for mitochondrial variant calling -- a use case at the interface between germline and somatic research. This feature leverages the tried and true reference confidence model used by HaplotypeCaller. Future updates and fixes to that code will benefit both callers now that Mutect2 and HaplotypeCaller are sharing code for non-reference likelihood calculation and GVCF generation.

Sum it up: common code and many common benefits

While probably only a minority of GATK users run both HaplotypeCaller and Mutect2 in the course of their work, because of the two tools’ shared “ancestry”, improvements in either tool are able to benefit even more users. In the coming months we’re returning to development on the newest addition to the short variant calling family -- the Spark version of HaplotypeCaller. Initial optimization efforts have already yielded improvements by removing bottlenecks in shared code. We expect all users to see improvements in HaplotypeCaller (and likely Mutect2) performance, whether running the Spark version or not.


Return to top

Wed 20 Feb 2019
Comment on this article


- Recent posts


- Upcoming events

See Events calendar for full list and dates


- Recent events

See Events calendar for full list and dates



- Follow us on Twitter

GATK Dev Team

@gatk_dev

RT @konradjk: Our slides from today's @broadinstitute MPG session are now up! Slides by @dgmacarthur myself @cureffi @nickywhiffin and spec…
12 Apr 19
RT @NICR_NCL: NICR bring the @broadinstitute to Newcastle - the workshop focuses on the core steps involved in calling variants with the Br…
11 Apr 19
Newest #GATK workshop announced: Newcastle, UK -- June 18-21 -- register now at https://t.co/VP2ngzlfuh https://t.co/miGkjjZx82
10 Apr 19
Workshop season is right around the corner -- don't miss out, sign up now https://t.co/cwfEejyCkL
10 Apr 19
Read more about the @BroadGenomics liquid biopsy analysis pipeline in this blog post by method developer Mark Fleha… https://t.co/B5YNCRXoxq
3 Apr 19

- Our favorite tweets from others

@lukwam @broadinstitute @gatk_dev Nice to see Cromwell and GATK as the tools of choice
11 Apr 19
Demo: Checking output from GATK best practices. @broadinstitute @gatk_dev #gatk #genomics #cromwell #bestpractices… https://t.co/iAwmy10zDJ
11 Apr 19
NICR bring the @broadinstitute to Newcastle - the workshop focuses on the core steps involved in calling variants w… https://t.co/FRhlj2iws1
10 Apr 19
Have questions about genomic data generation? @AJH_Genomics and @JaneW_Genomics are waiting for you at Broad Booth… https://t.co/IfKfC63Ese
1 Apr 19
The second #AACR19 poster session has begun! Find a guide to all Broad posters here: https://t.co/Lnaj54XbfT https://t.co/bnKU2yZUOj
1 Apr 19

See more of our favorite tweets...