Here it is at last… as in, last release for 2016, and possibly the last point release of GATK 3 ever!

Aside from the usual pile of bug fixes, the new features in this version are actually (almost) all features or improvements that were developed for GATK 4. We backported them to the GATK 3 framework to make them widely available sooner rather than later, since we still have some work to do to make GATK 4 complete enough to become the new standard. Of course there are a lot of other new things in the GATK 4 alpha version that we can't backport (especially those related to speed/performance improvements) because they depend on the new framework. But what we could backport, we did.

The hottest change here is a new model for calculating the QUAL score, but be aware it's there on an opt-in basis, not enabled by default. This also comes with a lower default value for the -stand_call_conf threshold, and deprecation of the confusing and ultimately rather pointless threshold -stand_emit_conf. We're also introducing some logic for prioritizing alleles to improve performance in messy regions. And we've got some improvements for MuTect2, although that tool does remain in beta status for now.

As usual, see the release notes for a full list of changes, and read on below for details on what we think you'll care most about.


A certain je ne sais QUAL ...

There's been a recent infusion of new blood in the development team -- meaning new members, to be clear (GATK development does not actually involve any demonic rituals, despite any rumors you might have heard). With that came a renewal of ideas on how to calculate key variant metrics, including QUAL.

As we explain in loving detail in this document, the QUAL score is the Phred-scaled posterior probability that all samples in your callset are homozygous reference. In other words, it represents the probability that all the variant evidence you saw in your data is wrong. It's not a very reliable way of ranking variant quality as such, because it's vulnerable to artifacts like inflation at high read depth -- but it does allow us to rule out the majority of glaringly false calls at the low end of the scale.

The current model for calculating QUAL has some flaws that manifest, among other things, as a tendency to excessively penalize singletons and doubletons (variants observed only in one or two samples in a cohort), especially at large cohort sizes. It also uses different and needlessly complicated logic for dealing with haploid, diploid and polyploid cases, leading to "amusing" inconsistencies. No one likes that.

So then some magic happened and now we have a new model that is simpler and behaves better, according to our tests. We're using it in our own work already, but we're not switching it on by default in the public release because it is a pretty big change. Instead, it's available as an opt-in feature that you can enable by setting -newQual in any command line invoking a germline caller (HaplotypeCaller, GenotypeGVCFs or UnifiedGenotyper if you really must use it).

Assuming it continues to behave well in our hands and yours (those of you who switch it on), it will be the default model in GATK 4. And then it will get documented in loving detail too, of course.


One threshold for calling

One of the last steps in the germline short variant calling process is the calculation of the QUAL score for each candidate variant. Once that's done, a threshold is applied on the QUAL score and we discard any variants that scored lower than the given threshold value. When you're running HaplotypeCaller in "GVCF mode", i.e. with -ERC GVCF or -ERC BP_RESOLUTION, that threshold is set to zero and every record is written to the output file. In every other case, the threshold is set by the -stand_call_conf argument, which stands for "standard calling confidence".

That sounds perfectly reasonable, doesn't it? Well, in practice that's not exactly how it works -- or has worked so far, anyway. No, we looove to provide sooo many additional options, we just had to use two arguments to do the QUAL thresholding. One called -stand_call_conf, and a second one called -stand_emit_conf. The first one works as advertised; the second was meant to make it possible to include candidate variants with a lower QUAL score to be written to the output, BUT with a "LowQual" tag of shame in the FILTER field. It was supposed to provide a sort of filtering preview.

Frankly, for the past few years we've used the same value for both, which effectively cancels out the -stand_emit_conf functionality, and generally speaking the presence of that argument has been sowing confusion since the dawn of time. In my view it's at odds with our philosophy of "call everything that moves then filter things out properly with other annotations". So we're killing it. It's gone.

So to summarize, the -stand_call_conf is the last threshold left standing for variant calling. Oh, and while we were at it we lowered the default value to 10 instead of 30. This is more generous than it needs to be, but you can always filter whatever's in the output that you don't want. Whereas you can't easily go back and relax the threshold if it was higher than you wanted -- or if you hadn't even realized it was a thing you could do to increase sensitivity.

Meaningful nod to people who do variant caller comparisons...


Culling the genotype herd to avoid a stampede

In some regions where the sequence context is very repetitive, we tend to find many candidate alleles for the same position, even within a single sample. When that happens, and especially if the requested ploidy is high (e.g. in pooled experiments), the number of possible genotypes that we have to evaluate (i.e. calculate likelihoods for) becomes downright astronomical. Depending on conditions, the consequences can range from unacceptably long runtimes to complete crash.

We previously tried to solve this by providing an argument called -maxAltAlleles to limit the number of alleles the caller has to consider, but the way it was wired up only limited the alleles that were output, not those considered internally. So it only solved superficial problems, and it didn't account for ploidy directly.

Now we're trying a new approach that involves setting a limit on the number of genotypes that we're willing to consider, instead of a number of alleles. Under the hood, this ties into some logic that drops alt alleles that are really unlikely until we get to a number of possible genotypes that we deem acceptable. The default value is set to be comfortable enough that this only kicks in at complex sites when ploidy is high, but it can be modified with the argument -maxGenotypeCnt to be more or less generous.

Note that -maxAltAlleles is still applicable, but the current implementation is set to give precedence to -maxGenotypeCnt. So at sites where sample ploidy and -maxAltAlleles combine to give a genotype count higher than the value in -maxGenotypeCnt, the -maxAltAlleles limit will be ignored, and alternate alleles will be removed based on ploidy and -maxGenotypeCnt.

If you want to tweak these settings, keep in mind their interactions and the rule of precedence, otherwise you might run into surprises (and not necessarily of the surprise birthday cake at your team meeting kind) that we will NOT consider to be bugs. For example, let's say you provide -maxAltAlleles with a high value, leave the -maxGenotypeCnt as default, and works with a high ploidy sample. Due to the newly imposed maximum genotype count, alt alleles actually used in genotyping will be limited to far less than the maximum you requested. For example, with ploidy 18 and maximum genotype count set to 1024 (the current, arbitrary default value, but definitely reasonable in most cases), the maximum allele count is 3 (alt allele count 2), potentially much lower than the -maxAltAlleles you requested.


MuTect2 starting to see the light at the end of the beta tunnel

MuTect2 is a next-generation somatic SNP and indel caller that combines the DREAM challenge-winning somatic genotyping engine of the original MuTect with the assembly-based machinery of HaplotypeCaller. It was first made publicly available in GATK version 3.5 as a beta tool earmarked for experimental work only (no production or commercial work). In evaluations performed with our colleagues in the Broad's Cancer Genome Analysis group (CGA), we found that MuTect2 was doing a great job on indels, but its sensitivity on SNVs was slightly inferior to the original MuTect on which it was based.

Due to a shift in priorities we then had to put MuTect2 development on hiatus, which is why it stayed virtually unchanged in GATK 3.6. But I'm happy to report that MuTect2 is now back in active development! In this release, we have a small but appreciable crop of improvements to MuTect2, which will be the last ever made in a 3.x version.

The improvements made in this version mainly have to do with cleaning up the code and simplifying parts where hybridization with the HaplotypeCaller machinery got a bit Frankenstein-y. As part of that, we're now exposing a couple of downsampling-related arguments that were previously hardcoded, so -maxReadsInRegionPerSample and -minReadsPerAlignment can now be set from command line. We've also added back a few components that were in the original MuTect but weren't ported in the move to GATK, including the clustered read position and strand bias filters. Work is still ongoing to determine exactly what is the best way to leverage these components for best results.

To be clear, despite these improvements we're still keeping MuTect2 in beta status pending full satisfaction from our CGA friends -- so that does mean that the eventual fully-supported version of MuTect2 will be released in GATK 4 only. We'll post a roadmap/expected timeline of GATK4 and MuTect2 development in the coming weeks.


Return to top

wenbinmei


Hi, GATK development team. If I have already called individual sample using HaplotypeCaller and will do joint calling using GenotypeGVCFs. Do you think it make sense for me to use new -newQual in the GenotypeGVCFs or not use it? Thank you for help.

Mon 12 Dec 2016

Geraldine_VdAuwera


Hi @wenbinmei, Yes, you can use the new QUAL model on GVCFs generated with an earlier (recent) version.

Mon 12 Dec 2016

wenbinmei


It is ok to use new QUAL model on GVCFs even I used the old QUAL model on HaplotypeCaller ? Does it will improve SNP calling ? Thanks!

Mon 12 Dec 2016

Geraldine_VdAuwera


Yes, because GenotypeGVCFs will ignore the QUAL originally calculated by HaplotypeCaller and will recalculate it based on the PLs of the genotypes recorded in the GVCFs. This may improve sensitivity to rare variants.

Mon 12 Dec 2016

MattB


Hi Dev team / Geraldine, I've been experimenting with `-newQual` using the HC in classic none GVCF mode, and you should note that by using `-newQual` on it's own, I'm seeing `standard_min_confidence_threshold_for_calling=10.0 standard_min_confidence_threshold_for_emitting=30.0` in the VCF header under `##GATKCommandLine.HaplotypeCaller=<>`. So it at least appears that the emitting threshold is still persists at the old value according to the VCF header (I' guessing it's actually ignored). Of course if I attempt to explicitly set `-stand_emit_conf 10` I get the deprecation warning. I've not yet tested this behaviour in other tools but expect it arises from something common.

Mon 12 Dec 2016

Geraldine_VdAuwera


Oh, I think I know why that happens, it's an oversight from the refactoring process. It shouldn't have any functional consequence beyond the residual line in the VCF header. Thanks for pointing this out! We'll make sure to remove it.

Mon 12 Dec 2016

wisonlee


Hi Dev team / Geraldine, I'd like to know how to use MuTect2's `-maxReadsInRegionPerSample`. I tried several combinations with `-dt` and `-dcov` but to no avail. Could you please give me some hints on it? Thanks, Richard

Mon 12 Dec 2016

Sheila


@wisonlee Hi Richard, I am not sure I understand your question. Can you post the exact command you ran and the issue you are having? Perhaps the [documentation](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_cancer_m2_MuTect2.php#--maxReadsInRegionPerSample) will help? Thanks, Sheila

Mon 12 Dec 2016

wisonlee


Hi @Sheila I run into the following error if I combine `-dt BY_SAMPLE`, `-dcov 1000` and `--maxReadsInRegionPerSample 1000 `: ``` INFO 14:22:28,492 HelpFormatter - --------------------------------------------------------------------------------- INFO 14:22:28,528 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18 INFO 14:22:28,529 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute INFO 14:22:28,529 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk INFO 14:22:28,529 HelpFormatter - [Mon Mar 13 14:22:28 CST 2017] Executing on Linux 2.6.32-642.3.1.el6.x86_64 amd64 INFO 14:22:28,530 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_91-b14 INFO 14:22:28,537 HelpFormatter - Program Args: -T MuTect2 -R hs37d5.fa -I:tumor in_case.bam -I:normal in_ctrl.bam -L my_roi.bed --dbsnp dbsnp_138.b37.vcf.gz --cosmic b37_cosmic.vcf.gz -contamination 0 --max_alt_alleles_in_normal_count 3 --max_alt_alleles_in_normal_qscore_sum 40 --max_alt_allele_in_normal_fraction 0.02 -dt BY_SAMPLE -dcov 1000 --maxReadsInRegionPerSample 1000 -o out.vcf ##### ##### ERROR A USER ERROR has occurred (version 3.7-0-gcfedb67): ##### ERROR ##### ERROR This means that one or more arguments or inputs in your command are incorrect. ##### ERROR The error message below tells you what is the problem. ##### ERROR ##### ERROR If the problem is an invalid argument, please check the online documentation guide ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool. ##### ERROR ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions https://software.broadinstitute.org/gatk ##### ERROR ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself. ##### ERROR ##### ERROR MESSAGE: Invalid command line: Cannot use -dcov or --downsample_to_coverage for ActiveRegionWalkers, use another downsampling argument ##### ERROR ------------------------------------------------------------------------------------------ ``` If I remove the `-dt` and `-dcov` parameters, it runs okay but with the `dt` set to default (`Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000` ) in the log. I don't know whether the `--maxReadsInRegionPerSample 2000` parameter is working as intended. ``` INFO 08:43:59,994 HelpFormatter - --------------------------------------------------------------------------------- INFO 08:43:59,998 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18 INFO 08:43:59,998 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute INFO 08:43:59,998 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk INFO 08:43:59,998 HelpFormatter - [Thu Mar 16 08:43:59 CST 2017] Executing on Linux 2.6.32-642.3.1.el6.x86_64 amd64 INFO 08:43:59,999 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_91-b14 INFO 08:44:00,003 HelpFormatter - Program Args: -T MuTect2 -R hs37d5.fa -I:tumor in_case.bam -I:normal in_ctrl.bam -L my_roi.bed --dbsnp dbsnp_138.b37.vcf.gz --cosmic b37_cosmic.vcf.gz -contamination 0 --max_alt_alleles_in_normal_count 3 --max_alt_alleles_in_normal_qscore_sum 40 --max_alt_allele_in_normal_fraction 0.02 -dt BY_SAMPLE -dcov 1000 --maxReadsInRegionPerSample 2000 -o out.vcf #### INFO 08:44:00,086 GenomeAnalysisEngine - Strictness is SILENT INFO 08:44:00,265 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 08:44:00,275 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 08:44:00,349 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.07 INFO 08:44:01,021 IntervalUtils - Processing 153371 bp from intervals WARN 08:44:01,028 IndexDictionaryUtils - Track cosmic doesn't have a sequence dictionary built in, skipping dictionary validation WARN 08:44:01,028 IndexDictionaryUtils - Track dbsnp doesn't have a sequence dictionary built in, skipping dictionary validation INFO 08:44:01,132 GenomeAnalysisEngine - Preparing for traversal over 2 BAM files INFO 08:44:01,890 GenomeAnalysisEngine - Done preparing for traversal INFO 08:44:01,891 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 08:44:01,892 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 08:44:01,892 ProgressMeter - Location | active regions | elapsed | active regions | completed | runtime | runtime INFO 08:44:02,021 MuTect2 - Using global mismapping rate of 45 => -4.5 in log10 likelihood units Using AVX accelerated implementation of PairHMM INFO 08:44:05,667 VectorLoglessPairHMM - libVectorLoglessPairHMM unpacked successfully from GATK jar file INFO 08:44:05,668 VectorLoglessPairHMM - Using vectorized implementation of PairHMM ``` Thanks, Richard

Mon 12 Dec 2016

Sheila


@wisonlee Hi Richard, Ah, I see. Yes, we do not recommend using -dt or -dcov in MuTect2 or HaplotypeCaller. Have a look at [this thread](http://gatkforums.broadinstitute.org/gatk/discussion/3989/downsampling-to-coverage-and-the-3-x-haplotypecaller) and [this thread](http://gatkforums.broadinstitute.org/gatk/discussion/4614/haplotypecaller-and-downsampling). -Sheila

Mon 12 Dec 2016





- Recent posts


- Upcoming events

See Events calendar for full list and dates


- Recent events

See Events calendar for full list and dates



- Follow us on Twitter

GATK Dev Team

@gatk_dev

Some really great points here. We teach multiple workshops every year; while the short format is not the best possi… https://t.co/r2EoyKNBFK
14 Dec 17
@nickschurch @pathogenomenick @nanopore Heeey... well you may have a point. We don't really do dirty ;) Never tried… https://t.co/zTH5QI89XZ
14 Dec 17
What's new in GATK4? In this short video, Laura Gauthier explains how the speed and scalability of joint calling is… https://t.co/EAPdrZcqJI
14 Dec 17
What is truth? Or, how an accident of nature can illuminate our path https://t.co/qperntHUVz
8 Dec 17
Why is HaplotypeCaller slower in the most recent GATK4 beta versions? Because it's saving its strength for the 4.0… https://t.co/p3ZBPojxkR
6 Dec 17

- Our favorite tweets from others

My first @gatk_dev blog-post https://t.co/JwVzAtqDvt about the CHM cell-lines experiment we now use as benchmark da… https://t.co/8VdqG6H4Zk
9 Dec 17
Wanna be a baller, HaplotypeCaller 20K genotypes in the VCF file Caller, gettin' phased tonight https://t.co/bOGSI4UL23
17 Nov 17
This amazing genomics toolkit helps researchers find insights that save lives - I know! GATK users - please provide… https://t.co/gdY4FDPX8K
2 Nov 17
using GATK to identify SNPs while handing out candy... Happy Halloween! @broadinstitute @gatk_dev #bioinformatics #researchisfun #Halloween
31 Oct 17
Although it made me cry sometimes, I owe them a lot and love them much more. https://t.co/vUj0cBllgn
16 Oct 17
See more of our favorite tweets...