Version History
The current version is 3.7

These articles highlight the key improvements in major and minor version releases (for example, 3.4) and explain their significance. To view a complete list of changes per release (including minor changes and bug fixes), please see the release notes (next tab).

Created 2016-12-12 16:01:35 | Updated |

Here it is at last… as in, last release for 2016, and possibly the last point release of GATK 3 ever!

Aside from the usual pile of bug fixes, the new features in this version are actually (almost) all features or improvements that were developed for GATK 4. We backported them to the GATK 3 framework to make them widely available sooner rather than later, since we still have some work to do to make GATK 4 complete enough to become the new standard. Of course there are a lot of other new things in the GATK 4 alpha version that we can't backport (especially those related to speed/performance improvements) because they depend on the new framework. But what we could backport, we did.

The hottest change here is a new model for calculating the QUAL score, but be aware it's there on an opt-in basis, not enabled by default. This also comes with a lower default value for the -stand_call_conf threshold, and deprecation of the confusing and ultimately rather pointless threshold -stand_emit_conf. We're also introducing some logic for prioritizing alleles to improve performance in messy regions. And we've got some improvements for MuTect2, although that tool does remain in beta status for now.

As usual, see the release notes for a full list of changes, and read on below for details on what we think you'll care most about.

A certain je ne sais QUAL ...

There's been a recent infusion of new blood in the development team -- meaning new members, to be clear (GATK development does not actually involve any demonic rituals, despite any rumors you might have heard). With that came a renewal of ideas on how to calculate key variant metrics, including QUAL.

As we explain in loving detail in this document, the QUAL score is the Phred-scaled posterior probability that all samples in your callset are homozygous reference. In other words, it represents the probability that all the variant evidence you saw in your data is wrong. It's not a very reliable way of ranking variant quality as such, because it's vulnerable to artifacts like inflation at high read depth -- but it does allow us to rule out the majority of glaringly false calls at the low end of the scale.
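To make the Phred scaling concrete, here is a minimal sketch of the conversion between a QUAL value and the underlying probability. The function names are ours, chosen for illustration; they are not part of GATK.

```python
import math


def qual_to_prob(qual: float) -> float:
    """Convert a Phred-scaled QUAL into the probability that all samples are hom-ref."""
    return 10 ** (-qual / 10)


def prob_to_qual(prob: float) -> float:
    """Convert a probability back into a Phred-scaled score."""
    return -10 * math.log10(prob)


# Each additional 10 points of QUAL divides the probability by 10.
for qual in (10, 30, 50):
    print(f"QUAL {qual} -> P(all samples hom-ref) = {qual_to_prob(qual):.5f}")
```

So a QUAL of 30 corresponds to a 1 in 1,000 chance that the site is actually not variant in any sample, and a QUAL of 10 to a 1 in 10 chance.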

The current model for calculating QUAL has some flaws that manifest, among other things, as a tendency to excessively penalize singletons and doubletons (variants observed only in one or two samples in a cohort), especially at large cohort sizes. It also uses different and needlessly complicated logic for dealing with haploid, diploid and polyploid cases, leading to "amusing" inconsistencies. No one likes that.

So then some magic happened and now we have a new model that is simpler and behaves better, according to our tests. We're using it in our own work already, but we're not switching it on by default in the public release because it is a pretty big change. Instead, it's available as an opt-in feature that you can enable by setting -newQual in any command line invoking a germline caller (HaplotypeCaller, GenotypeGVCFs or UnifiedGenotyper if you really must use it).

Assuming it continues to behave well in our hands and yours (those of you who switch it on), it will be the default model in GATK 4. And then it will get documented in loving detail too, of course.

One threshold for calling

One of the last steps in the germline short variant calling process is the calculation of the QUAL score for each candidate variant. Once that's done, a threshold is applied on the QUAL score and we discard any variants that scored lower than the given threshold value. When you're running HaplotypeCaller in "GVCF mode", i.e. with -ERC GVCF or -ERC BP_RESOLUTION, that threshold is set to zero and every record is written to the output file. In every other case, the threshold is set by the -stand_call_conf argument, which stands for "standard calling confidence".

That sounds perfectly reasonable, doesn't it? Well, in practice that's not exactly how it works -- or has worked so far, anyway. No, we looove to provide sooo many additional options, we just had to use two arguments to do the QUAL thresholding. One called -stand_call_conf, and a second one called -stand_emit_conf. The first one works as advertised; the second was meant to make it possible to include candidate variants with a lower QUAL score to be written to the output, BUT with a "LowQual" tag of shame in the FILTER field. It was supposed to provide a sort of filtering preview.

Frankly, for the past few years we've used the same value for both, which effectively cancels out the -stand_emit_conf functionality, and generally speaking the presence of that argument has been sowing confusion since the dawn of time. In my view it's at odds with our philosophy of "call everything that moves then filter things out properly with other annotations". So we're killing it. It's gone.

So to summarize, the -stand_call_conf is the last threshold left standing for variant calling. Oh, and while we were at it we lowered the default value to 10 instead of 30. This is more generous than it needs to be, but you can always filter whatever's in the output that you don't want. Whereas you can't easily go back and relax the threshold if it was higher than you wanted -- or if you hadn't even realized it was a thing you could do to increase sensitivity.
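The asymmetry described above can be sketched in a few lines. This is only an illustration of the thresholding logic, not GATK code; the candidate sites and QUAL values are made up, and `passes_call_threshold` is a hypothetical helper.

```python
def passes_call_threshold(qual: float, stand_call_conf: float = 10.0) -> bool:
    """Keep a candidate variant unless its QUAL is lower than the threshold."""
    return qual >= stand_call_conf


# Hypothetical candidates as (site, QUAL) pairs.
candidates = [("chr1:1000", 8.2), ("chr1:2000", 14.7), ("chr1:3000", 55.0)]

# Calling with the new default of 10 emits two of the three candidates.
emitted = [c for c in candidates if passes_call_threshold(c[1])]

# A stricter cutoff can always be applied afterwards to what was emitted...
strict = [c for c in emitted if passes_call_threshold(c[1], stand_call_conf=30.0)]

# ...but the candidate at QUAL 8.2 was never written out, so it cannot be
# recovered without re-running the caller with a lower threshold.
print(emitted)
print(strict)
```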

Meaningful nod to people who do variant caller comparisons...

Culling the genotype herd to avoid a stampede

In some regions where the sequence context is very repetitive, we tend to find many candidate alleles for the same position, even within a single sample. When that happens, and especially if the requested ploidy is high (e.g. in pooled experiments), the number of possible genotypes that we have to evaluate (i.e. calculate likelihoods for) becomes downright astronomical. Depending on conditions, the consequences can range from unacceptably long runtimes to a complete crash.

We previously tried to solve this by providing an argument called -maxAltAlleles to limit the number of alleles the caller has to consider, but the way it was wired up only limited the alleles that were output, not those considered internally. So it only solved superficial problems, and it didn't account for ploidy directly.

Now we're trying a new approach that involves setting a limit on the number of genotypes that we're willing to consider, instead of a number of alleles. Under the hood, this ties into some logic that drops alt alleles that are really unlikely until we get to a number of possible genotypes that we deem acceptable. The default value is set to be comfortable enough that this only kicks in at complex sites when ploidy is high, but it can be modified with the argument -maxGenotypeCnt to be more or less generous.

Note that -maxAltAlleles is still applicable, but the current implementation is set to give precedence to -maxGenotypeCnt. So at sites where sample ploidy and -maxAltAlleles combine to give a genotype count higher than the value in -maxGenotypeCnt, the -maxAltAlleles limit will be ignored, and alternate alleles will be removed based on ploidy and -maxGenotypeCnt.

If you want to tweak these settings, keep in mind their interactions and the rule of precedence, otherwise you might run into surprises (and not necessarily of the surprise birthday cake at your team meeting kind) that we will NOT consider to be bugs. For example, let's say you provide -maxAltAlleles with a high value, leave the -maxGenotypeCnt as default, and work with a high ploidy sample. Due to the newly imposed maximum genotype count, alt alleles actually used in genotyping will be limited to far fewer than the maximum you requested. For example, with ploidy 18 and maximum genotype count set to 1024 (the current, arbitrary default value, but definitely reasonable in most cases), the maximum allele count is 3 (alt allele count 2), potentially much lower than the -maxAltAlleles you requested.
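The arithmetic behind that example can be checked with the standard combinatorial count of unordered genotypes: for ploidy p and a alleles (reference included), there are C(p + a - 1, a - 1) possible genotypes. The sketch below assumes this is the count that gets compared against the limit; the function names are hypothetical and this is not the GATK implementation.

```python
from math import comb


def genotype_count(ploidy: int, n_alleles: int) -> int:
    """Number of unordered genotypes for a given ploidy and total allele count."""
    return comb(ploidy + n_alleles - 1, n_alleles - 1)


def max_alleles(ploidy: int, max_genotype_count: int = 1024) -> int:
    """Largest total allele count (ref included) whose genotype count fits the limit."""
    n = 1
    while genotype_count(ploidy, n + 1) <= max_genotype_count:
        n += 1
    return n


# Ploidy 18: 3 alleles give 190 genotypes, 4 alleles give 1330 (over the limit).
print(genotype_count(18, 3), genotype_count(18, 4))
print(max_alleles(18))  # total alleles allowed, i.e. 2 alt alleles
print(max_alleles(2))   # a diploid sample has far more headroom
```

Under that assumption, a diploid sample can carry dozens of alleles before hitting 1024 genotypes, which is why the limit only kicks in at complex sites when ploidy is high.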

MuTect2 starting to see the light at the end of the beta tunnel

MuTect2 is a next-generation somatic SNP and indel caller that combines the DREAM challenge-winning somatic genotyping engine of the original MuTect with the assembly-based machinery of HaplotypeCaller. It was first made publicly available in GATK version 3.5 as a beta tool earmarked for experimental work only (no production or commercial work). In evaluations performed with our colleagues in the Broad's Cancer Genome Analysis group (CGA), we found that MuTect2 was doing a great job on indels, but its sensitivity on SNVs was slightly inferior to the original MuTect on which it was based.

Due to a shift in priorities we then had to put MuTect2 development on hiatus, which is why it stayed virtually unchanged in GATK 3.6. But I'm happy to report that MuTect2 is now back in active development! In this release, we have a small but appreciable crop of improvements to MuTect2, which will be the last ever made in a 3.x version.

The improvements made in this version mainly have to do with cleaning up the code and simplifying parts where hybridization with the HaplotypeCaller machinery got a bit Frankenstein-y. As part of that, we're now exposing a couple of downsampling-related arguments that were previously hardcoded, so -maxReadsInRegionPerSample and -minReadsPerAlignment can now be set from command line. We've also added back a few components that were in the original MuTect but weren't ported in the move to GATK, including the clustered read position and strand bias filters. Work is still ongoing to determine how best to leverage these components.

To be clear, despite these improvements we're still keeping MuTect2 in beta status pending full satisfaction from our CGA friends -- so that does mean that the eventual fully-supported version of MuTect2 will be released in GATK 4 only. We'll post a roadmap/expected timeline of GATK4 and MuTect2 development in the coming weeks.

Created 2016-06-01 22:34:38 | Updated 2016-06-02 04:41:03 |

What better way to start the summer than with a new GATK release?

Umm no don't answer that, there's loads of good options. You could have a barbecue, eat some ice cream, go on a hike if that's the sort of thing that floats your kayak... Or you might live somewhere where winter is just starting and everything I just said there was terribly insensitive. Sorry.

Ahem. As I mentioned in my recent sneak preview blog post, the bulk of our development effort (speed! copy number! unicorns!) is now going into the GATK4 project. Accordingly, development in the GATK3 framework is winding down, so this release consists mainly of bug fixes, added convenience functions, and relatively minor behavior tweaks.

That being said we do have a few new experimental features in the VQSR tools (which haven't yet been fully ported to GATK4, hence the ongoing development in GATK3) that are pushing the envelope of allele-specific filtering. So that's interesting, if not yet fully documented (someone should really get on that). And you'll probably care about some of those tweaks I casually mentioned above -- in fact I guarantee that at least one of these things will matter to you in some way. If you read through the whole thing and don't find anything relevant to you, tell me in the comments that I was wrong. That's what the internet is for.

As usual, here I go over the changes that matter the most / to the most; consider going through the release notes as well for a full list of changes.

One version of Java to rule them all

Possibly the most sweeping change in this version is that it introduces support for Java 8. As noted recently, when we switched our test framework to Java 8 we encountered multiple failures in GATK 3.5 tests, which I discussed here. We fixed the underlying issues, so from version 3.6 onward GATK now runs reliably on Java 8. As a nice bonus, this puts us back into sync with HTSJDK and the Picard toolkit, which had been running on Java 8 for a few months already. If you were doing it right, you had both versions of Java installed and ran each toolkit, GATK and Picard, on the appropriate version. How much hassle? Too much hassle! — Now you can run everything on Java 8.

ET finally gets home; discovers phone bill, flees to Canada

Here's another change that will affect everyone regardless of use case, in a good way: we removed the Phone Home usage reporting system. It served its purpose for many years, but we've outgrown it. So we ripped it all out. If you previously used a key to deactivate it with the -et NO_ET and --key <your_key> arguments, you can stop and take those out. Or if you're just too busy, leave them in -- the Phone Home code is all gone but we left in the argument definitions, so the parser shouldn't fail if you leave them in your commands. This shouldn't break any scripts.

Indel realignment tools drop out, go open-source

In the next few days we'll be making some updates to our Best Practices documentation to reflect updates to our production pipelines, and one thing you'll notice is that local realignment around indels is no longer included in the pre-processing part of the pipelines. More on that in a follow-up post; in a nutshell, indel realignment is just not useful enough anymore when you're calling variants with haplotype-based tools like HaplotypeCaller and MuTect2. This does mean we will no longer support the indel realignment tools as actively; but since others may care more about preserving and possibly even expanding this functionality, we've decided to move the relevant tools, IndelRealigner and RealignerTargetCreator, and related classes to the part of the GATK that is open-sourced under the MIT license.

Confidence matters

In response to popular demand, we made two frequently requested enhancements to the output of the variant discovery tools.

First, HaplotypeCaller and GenotypeGVCFs will now emit a no-call (./.) for any sample where GQ is zero, in both normal and GVCF modes, instead of emitting a specific genotype in which we have zero confidence. Note that these do not necessarily indicate that the variant call itself is bad, but that we are unable to choose between two (or more) genotypes for that particular sample.
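To see why GQ = 0 means "we can't choose", recall the usual convention that GQ is the gap between the two smallest Phred-scaled genotype likelihoods (PLs), capped at 99. The sketch below assumes that convention; the helper names and the PL values are made up for illustration and are not GATK code.

```python
def genotype_quality(pls: list[int], cap: int = 99) -> int:
    """GQ as the difference between the two lowest PLs, capped at `cap`."""
    ordered = sorted(pls)
    return min(ordered[1] - ordered[0], cap)


def emitted_genotype(called_gt: str, pls: list[int]) -> str:
    """Return a no-call when two genotypes are tied for most likely (GQ == 0)."""
    return "./." if genotype_quality(pls) == 0 else called_gt


# PLs are ordered hom-ref, het, hom-var for a diploid biallelic site.
print(emitted_genotype("0/1", [35, 0, 200]))  # het is clearly the best genotype
print(emitted_genotype("0/1", [0, 0, 120]))   # hom-ref and het are tied
```

Note that in the tied case the PLs still say hom-var is unlikely, so the record carries information even though no genotype is emitted for that sample.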

Second, GenotypeGVCFs will now emit a QUAL value for invariant sites (i.e. where all samples are genotyped homozygous-reference) when run in -allSites mode. By the way, we're aware that many of you would like to see a GVCF-like output option for the final VCF, in which you'd get essentially the same informational content as an all-sites VCF but with some form of interval-block compression. We still do not provide that functionality, and we do not have any immediate plans to work on implementing it in GATK3; however we are talking about how to make something like that happen in GATK4. Why the feet-dragging? Well, while it's not hard to do something like that badly, doing it correctly is quite a bit harder, so it will take some non-trivial development effort -- and since it's not something we currently need at Broad for our production needs, it's difficult to prioritize. But hey, the more you tell us you want this, the more we can justify putting it in the roadmap, so leave a comment below if you care about this.

Bug in the spotlight

Today's bug of infamy concerns the handling of allele depths (AD) in the joint variant discovery workflow, when the AD of the NON_REF allele is non-zero. This was an interesting little edge case. First we have HaplotypeCaller emitting a GVCF where, at a given position, the AD for the NON_REF allele is non-zero. This is fine, even expected. The problem arises during GVCF merging or joint genotyping, when CombineGVCFs or GenotypeGVCFs takes this NON_REF AD count and -- this is the bug -- copies it over to one of the alleles seen in another sample. This results in observing sum(AD) > DP, which is obviously not good. The new behavior, post-fix, is that we still record the NON_REF AD in the single-sample GVCF, but we don't copy it over to concrete alleles in CombineGVCFs or GenotypeGVCFs. Because, duh.
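If you want to check whether a callset produced with an older version shows this symptom, the test is simply whether a sample's allele depths add up to more than its depth. Here is a minimal sketch, assuming you have already extracted the AD and DP values per sample (e.g. with VariantsToTable); the function name and the numbers are hypothetical.

```python
def ad_exceeds_dp(allele_depths: list[int], depth: int) -> bool:
    """Flag a sample-level record where the allele depths sum to more than DP."""
    return sum(allele_depths) > depth


# Hypothetical (AD, DP) pairs for one sample at three sites.
records = [([30, 12], 42), ([25, 0], 27), ([20, 15], 33)]

for ad, dp in records:
    status = "suspicious" if ad_exceeds_dp(ad, dp) else "ok"
    print(ad, dp, status)
```

A sum below DP is not a problem in itself, since DP can include reads that were not informative for any allele.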

Dude, where's my read?

Some days it seems like the number one preoccupation of GATK forum denizens is tracking down what happened to every last read in a pileup. That's probably because HaplotypeCaller and MuTect2, the two stars of our little show, can be awfully picky about what reads they include at key stages of the calling process. Their (totally justified) tendency to discount reads that they don't consider useful often causes confusion and/or concerns that reads are somehow being dropped for no reason. That's a concern we'd like to help alleviate. HaplotypeCaller and MuTect2 both already have an option to output a bam file showing the post-assembly read alignments and candidate haplotypes, which we call a bamout and which by default shows the state of the data used by HC and M2 to make their calls. Now we've added an option to include in that output any reads that were not considered informative or were dropped for any reason internal to the caller (separate from any read filtering done by the engine). The option is called --emitDroppedReads and should be used in conjunction with -bamout.

Created 2015-11-25 07:37:00 | Updated 2015-11-25 14:21:18 |

The last GATK 3.x release of the year 2015 has arrived!

The major feature in GATK 3.5 is the eagerly awaited MuTect2 (beta version), which brings somatic SNP and Indel calling to GATK. This is just the beginning of GATK’s scope expansion into the somatic variant domain, so expect some exciting news about copy number variation in the next few weeks! Meanwhile, more on MuTect2 awesomeness below.

In addition, we’ve got all sorts of variant context annotation-related treats for you in the 3.5 goodie bag -- both new annotations and new capabilities for existing annotations, listed below.

In the variant manipulation space, we enhanced or fixed functionality in several tools including LeftAlignAndTrimVariants, FastaAlternateReferenceMaker and VariantEval modules. And in the variant calling/genotyping space, we’ve made some performance improvements across the board to HaplotypeCaller and GenotypeGVCFs (mostly by cutting out crud and making the code more efficient) including a few improvements specifically for haploids. Read the detailed release notes for more on these changes. Note that GenotypeGVCFs will now emit no-calls at sites where RGQ=0 in acknowledgment of the fact that those sites are essentially uncallable.

We’ve got good news for you if you’re the type who worries about disk space (whether by temperament or by necessity): we finally have CRAM support -- and some recommendations for keeping the output of BQSR down to reasonable file sizes, detailed below.

Finally, be sure to check out the detailed release notes for the usual variety show of minor features (including a new Queue job runner that enables local parallelism), bug fixes and deprecation notices (a few tools have been removed from the codebase, in the spirit of slimming down ahead of the holiday season).

Introducing MuTect2 (beta): calling somatic SNPs and Indels natively in GATK

MuTect2 is the next-generation somatic SNP and indel caller that combines the DREAM challenge-winning somatic genotyping engine of the original MuTect with the assembly-based machinery of HaplotypeCaller.

The original MuTect (Cibulskis et al., 2013) was built on top of the GATK engine by the Cancer Genome Analysis group at the Broad Institute, and was distributed as a separate package. By all accounts it did a great job calling somatic SNPs, and was part of the winning entries for multiple DREAM challenges (including some submitted by groups outside the Broad). However it was not able to call indels; and the less said about the indel caller that accompanied it (first named SomaticIndelDetector then Indelocator) the better.

This new incarnation of MuTect leverages much of the HaplotypeCaller’s internal machinery (including the all-important graph assembly bit) to call both SNPs and indels together. Yet it retains key parts of the original MuTect’s internal genotyping engine that allow it to model somatic variation appropriately. This is a major differentiation point compared to HaplotypeCaller, which has expectations about ploidy and allele frequencies that make it unsuitable for calling somatic variants.

As a convenience add-on to MuTect2, we also integrated the cross-sample contamination estimation tool ContEst into GATK 3.5. Note that while the previous public version of this tool relied on genotyping chip data for its operation, this version of the tool has been upgraded to enable on-the-fly genotyping for the case where genotyping data is not available. Documentation of this feature will be provided in the near future. Both MuTect2 and ContEst are now featured in the Tool Documentation section of the Guide. Stay tuned for pipeline-level documentation on performing somatic variant discovery, to be added to the Best Practices docs in the near future.

Please note that this release of MuTect2 is a beta version intended for research purposes only and should not be applied in production/clinical work. MuTect2 has not yet undergone the same degree of scrutiny and validation as the original MuTect since it is so new. Early validation results suggest that MuTect2 has a tendency to generate more false positives as compared to the original MuTect; for example, it seems to overcall somatic mutations at low allele frequencies, so for now we recommend applying post-processing filters, e.g. by hard-filtering calls with low minor allele frequencies. Rest assured that data is being generated and the tools are being improved as we speak. We’re also looking forward to feedback from you, the user community, to help us make it better faster.

Finally, note also that MuTect2 is distributed under the same restricted license as the original MuTect; for-profit users are required to seek a license to use it (please email To be clear, while MuTect2 is released as part of GATK, the commercial licensing has not been consolidated under a single license. Therefore, current holders of a GATK license will still need to contact our licensing office if they wish to use MuTect2.

Annotate this: new and improved variant context annotations

Whew that was a long wall of text on MuTect2, wasn’t it. Let’s talk about something else now. Annotations! Not functional annotations, mind you -- we’re not talking about e.g. predicting synonymous vs. non-synonymous mutations here. I mean variant context annotations, i.e. all those statistics calculated during the variant calling process which we mostly use to estimate how confident we are that the variants are real vs. artifacts (for filtering and related purposes).

So we have two new annotations, BaseCountsBySample (what it says on the can) and ExcessHet (for excess heterozygosity, i.e. the number of heterozygote calls made in excess of the Hardy-Weinberg expectations), as well as a set of new annotations that are allele-specific versions of existing annotations (with AS_ prefix standing for Allele-Specific) which you can browse here. Right now we’re simply experimenting with these allele-specific annotations to determine what would be the best way to make use of them to improve variant filtering. In the meantime, feel free to play around with them (via e.g. VariantsToTable) and let us know if you come up with any interesting observations. Crowdsourcing is all the rage, let’s see if it gets us anywhere on this one!

We also made some improvements to the StrandAlleleCountsBySample annotation, to how VQSR handles MQ, and to how VariantAnnotator makes use of external resources -- and we fixed that annoying bug where default annotations were getting dropped. All of which you can read about in the detailed release notes.

These Three Awesome File Hacks Will Restore Your Faith In Humanity’s Ability To Free Up Some Disk Space

CRAM support! Long-awaited by many, lovingly implemented by Vadim Zalunin at EBI and colleagues at the Sanger Institute. We haven’t done extensive testing, and there are a few tickets for improvements that are planned at the htsjdk level -- but it works well enough that we’re comfortable releasing it under a beta designation. Meaning have fun with it, but do your own thorough testing before putting it into production or throwing out your old BAMs!

Static binning of base quality scores. In a nutshell, binning (or quantizing) the base qualities in a BAM file means that instead of recording all possible quality values separately, we group them into bins represented by a single value (by default, 10, 20, 30 or 40). By doing this we end up having to record fewer separate numbers, which through the magic of BAM compression yields substantially smaller files. The idea is that we don’t actually need to be able to differentiate between quality scores at a very high resolution -- if the binning scheme is set up appropriately, it doesn’t make any difference to the variant discovery process downstream. This is not a new concept, but now the GATK engine has an argument to enable binning quality scores during the base recalibration (BQSR) process using a static binning scheme that we have determined produces optimal results in our hands. The level of compression is of course adjustable if you’d like to set your own tradeoff between compression and base quality resolution. We have validated that this type of binning (with our chosen default parameters) does not have any noticeable adverse effect on germline variant discovery. However we are still looking into some possible effects on somatic variant discovery, so we can’t yet recommend binning for that application.
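As a rough illustration of what quantizing does to a read's qualities, here is a minimal sketch that maps each score to the nearest of the bin values mentioned above. The nearest-bin rule and the function name are assumptions made for this example; the exact rounding behavior in the engine may differ.

```python
def bin_quality(q: int, bins: tuple[int, ...] = (10, 20, 30, 40)) -> int:
    """Map a base quality to the closest bin value (ties go to the lower bin)."""
    return min(bins, key=lambda b: abs(b - q))


# Hypothetical base qualities for one read.
quals = [2, 12, 27, 31, 33, 37, 38, 39]
binned = [bin_quality(q) for q in quals]

print(binned)
# Fewer distinct values means longer runs of identical numbers,
# which is what lets BAM compression shrink the file.
print(len(set(quals)), "distinct values before,", len(set(binned)), "after")
```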

Disable indel quality scores. The Base Recalibration process produces indel quality scores in addition to the regular base qualities. They are stored in the BI and BD tags of the read records, taking up a substantial amount of space in the resulting BAM files. There has been a lot of discussion about whether these indel quals are worth the file size inflation. Well, we’ve done a lot of testing and we’ve now decided that no, for most use cases the indel quals don’t make enough of a difference to justify the extra file size. The one exception to this is when processing PacBio data, where it seems that indel quals may help model the indel-related errors of that technology. But for the rest, we’re now comfortable recommending the use of the --disable_indel_quals argument when writing out the recalibrated BAM file with PrintReads.

Created 2015-05-22 01:58:47 | Updated 2015-07-09 22:27:40 |

Folks, I’m all out of banter for this one, so let’s go straight to the facts. GATK 3.4 contains a shedload of improvements and bug fixes, including some new functionality that we hope you’ll find useful. The full list is available in the detailed release notes.

None of the recent changes involves any disruption to the Best Practice workflow (I hear some sighs of relief) but you’ll definitely want to check out the tweaks we made to the joint discovery tools (HaplotypeCaller, CombineGVCFs and GenotypeGVCFs), which are rapidly maturing as they log more flight time at Broad and in the wild.

Key changes to the joint discovery tools

Let’s start at the very beginning with HaplotypeCaller (a very good place to start). On the usability front, we’ve finally given in to the nigh-universal complaint about the required variant indexing arguments (--variant_index_type LINEAR --variant_index_parameter 128000) being obnoxious and a waste of characters. So, tadaa, they are no longer required, as long as you name your output file with the extension .g.vcf so that the engine knows what level of compression to use to write the gVCF index (which leads to better performance in downstream tools). We think this naming convention makes a lot of sense anyway, as it’s a great way to distinguish gVCFs from regular VCFs on sight, so we hope most of you will adopt it. That said, we stopped short of making this convention mandatory (for now…) so you don’t have to change all your scripts and conventions if you don’t want to. All that will happen (assuming you still specify the variant index parameters as previously) is that you’ll get a warning in the log telling you that you could use the new convention.

Where we’ve been a bit more dictatorial is that we’ve completely disabled the use of -dcov with HaplotypeCaller because it was causing very buggy behavior due to an unforeseen complication in how different levels of downsampling are applied in HaplotypeCaller. We know that the default setting does the right thing, and there’s almost no legitimate reason to change it, so we’re disabling this for the greater good pending a fix (which may be a long time coming due to the complexity of the code involved).

Next up, CombineGVCFs gets a new option to break up reference blocks at every N sites. The new argument --breakBandsAtMultiplesOf N will ensure that no reference blocks in the combined gVCF span genomic positions that are multiples of N. This is meant to enable scatter-gather parallelization of joint genotyping on whole-genome data, as a workaround to some annoying limitations of the GATK engine that make it unsafe to use -L intervals that might start within the span of a block record. For exome data, joint genotyping can easily be parallelized by scatter-gathering across exome capture target intervals, because we know that there won’t be any hom-ref block records spanning the target interval boundaries. In contrast, in whole-genome data, there is no equivalent predictable termination of block records, so it’s not possible to know up front where it would be safe to set scatter-gather interval start and end points -- until now!
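The effect on a single reference block can be sketched as follows. This is an illustration of the idea rather than the tool's code: it assumes 1-based inclusive coordinates and that a new block starts exactly at each multiple of N, and `break_block` is a hypothetical name.

```python
def break_block(start: int, end: int, n: int) -> list[tuple[int, int]]:
    """Split the inclusive block [start, end] so a new block starts at each multiple of n."""
    pieces = []
    current = start
    while current <= end:
        next_break = (current // n + 1) * n   # next multiple of n after `current`
        piece_end = min(end, next_break - 1)  # stop just before that boundary
        pieces.append((current, piece_end))
        current = piece_end + 1
    return pieces


# A hom-ref block covering 950-2100, broken at multiples of 1000.
print(break_block(950, 2100, 1000))
```

With blocks broken this way, scatter-gather intervals that start at multiples of N never begin in the middle of a block record.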

And finally, GenotypeGVCFs gets an important bug fix, and a very useful new annotation.

The bug is something that has arisen mostly (though not exclusively) from large cohort studies. What happened is that, when a SNP occurred in sample A at a position that was in the middle of a deletion for sample B, GenotypeGVCFs would emit a homozygous reference genotype for sample B at that position -- which is obviously incorrect. The fix is that now, sample B will be genotyped as having a symbolic <*:DEL> allele representing the deletion.

The new annotation is called RGQ for Reference Genotype Quality. It is a new sample-level annotation that will be added by GenotypeGVCFs to monomorphic sites if you use the -allSites argument to emit non-variant sites to the output VCF. This is obviously super useful for evaluating the level of confidence of those sites called homozygous-reference.

New RNAseq tool: ASEReadCounter

This new coverage analysis tool is designed to count read depth in a way that is appropriate for allele-specific expression (ASE) analysis. It counts the number of reads that support the REF allele and the ALT allele, filtering out low-quality reads and bases and keeping only properly paired reads. The default output format produced by this tool is a structured text file intended to be consumed by the statistical analysis toolkit MAMBA. A paper by Stephane Castel and colleagues describing the complete ASE analysis workflow is available as a preprint on bioRxiv.

New documentation features: “Common Problems” and “Issue Tracker”

We’ve added two new documentation resources to the Guide.

One is a new category of documentation articles called Common Problems, to cover topics that are a specialized subset of FAQs: problems that many users encounter, which are typically due to misunderstandings about input requirements or about the expected behavior of the tools, or complications that arise from certain experimental designs. This category is being actively worked on and we welcome suggestions of additional topics that it should cover.

The second is an Issue Tracker that lists issues that have been reported as well as features or enhancements that have been requested. If you encounter a problem that you think might be a bug (or you have a feature request in mind), you can check this page to see if it’s something we already know about. If you have submitted a bug report, you can use the issue tracker to check whether your issue is in the backlog, in the queue, or is being actively worked on. In future we’ll add some functionality to enable voting on what issues or features should be prioritized, so stay tuned for an announcement on that!

Created 2014-10-23 23:09:56 | Updated |


Another season, another GATK release. Personally, Fall is my favorite season, and while I don’t want to play favorites with versions (though unlike with children, you’re allowed to say that the most recent one is the best --and you can tell I was a youngest child) this one is pretty special to me.

Because -ploidy! Yeah, that’s really all I need to say about that. I was a microbiologist once. And I expect many plant people will be happy too.

Other cool stuff detailed below includes: full functionality for the genotype refinement workflow tools; physical phasing and appropriate handling of dangly bits by HaplotypeCaller (must… resist… jokes…); a wealth of new documentation for variant annotations; and a slew of bug fixes that I won’t go over but are listed in the release notes.

Genotype refinement workflow with all the trimmings

As announced earlier this week, we recently developed a workflow for refining genotype calls, intended for researchers who need highly accurate genotype information as well as preliminary identification of possible de novo mutations (see the documentation for details). Although all the tools involved were already available in GATK 3.2, some functionalities were not, so we’re very happy to finally make all of them available in this new version. Plus, we like the new StrandOddsRatio annotation (which sort of replaces FisherStrand for estimating strand bias) so much that we made it a standard one, and it now gets annotated by default.

Non-diploids, rejoice!

This is also a feature that was announced a little while ago, but until now was only fully available in the nightly builds, which are technically unsupported unless we tell you to use them to get past a bad bug. In this new release, both HaplotypeCaller and GenotypeGVCFs are able to deal with non-diploid organisms (whether haploid or exotically polyploid). In the case of HaplotypeCaller, you need to specify the ploidy of your non-diploid sample with the -ploidy argument. HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately. On the bright side, you can combine the resulting files afterward. In particular, if you’re running the -ERC GVCF workflow, you’ll find that both CombineGVCFs and GenotypeGVCFs are able to handle mixed ploidies (between locations and between samples). Both tools are able to correctly work out the ploidy of any given sample at a given site based on the composition of the GT field, so they don’t require you to specify the -ploidy argument.
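As a rough illustration of what "working out the ploidy from the composition of the GT field" can mean, here is a small Python sketch. It is an assumption about the general idea, not the tools' actual implementation, and the function name is made up for the example.

```python
import re


def ploidy_from_gt(gt):
    """Infer ploidy by counting the alleles listed in a VCF GT field.

    Alleles are separated by "/" (unphased) or "|" (phased), so "0/1" is
    diploid, "1" is haploid and "0/0/1/1" is tetraploid.
    """
    return len(re.split(r"[/|]", gt))


# Hypothetical genotypes for one male sample: diploid autosome, haploid X.
for site, gt in [("chr1", "0/1"), ("chrX", "1")]:
    print(site, ploidy_from_gt(gt))
```

Because the count comes from each genotype itself, nothing needs to be declared up front, which is consistent with the downstream tools not requiring -ploidy.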

HaplotypeCaller gets physical

You know how HC performs a complete reassembly of reads in an ActiveRegion? (If you don’t, go read this now. Go on, we’ll wait for you.) Well, this involves building an assembly graph, of course (of course!), and it produces a list of haplotypes. Fast-forward a couple of steps, and you end up with a list of variants. That’s great, but until now, those variants were unphased, meaning the HC didn’t give you any information about whether any two variants’ alleles were on the same haplotype (meaning, on the same physical piece of DNA) or not. For example, you’d want to know whether you had this:

[figure not shown]

or this:

[figure not shown]

But HC wouldn’t tell you which it was in its output. Which was a shame, because the HC sees that information! It took a little tweaking to get it to talk, but now it emits physical phasing by default in its GVCF output (both banded GVCF and BP_RESOLUTION).

In a nutshell, phased records will look like this:

1   1372243  .  T  <NON_REF>  .  .  END=1372267  <snip>  <snip>
1   1372268  .  G  A,<NON_REF>  .  .  <snip>  GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:30,40,0:70:99:0|1:1372268_G_A:<snip>
1   1372269  .  G  T,<NON_REF>  .  .  <snip>  GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:30,41,0:71:99:0|1:1372268_G_A:<snip>
1   1372270  .  C  <NON_REF>  .  .  END=1372299  <snip>  <snip>

You see that the phasing info is encoded in two new sample-level annotations, PID (for phase identifier) and PGT (phased genotype). More than two variants can be phased in a group with the same PID, and that can include mixed types of variants (e.g. SNPs and indels).
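If you want to see which records were phased together, a few lines of Python are enough. This is only a sketch: the tuple layout is a simplified stand-in for parsed single-sample VCF lines, the function name is hypothetical, and the PL and SB columns are left out because they are snipped in the example above.

```python
from collections import defaultdict


def phase_groups(records):
    """Group variant records by their PID (phase identifier).

    `records` is an iterable of (pos, format_string, sample_string) tuples.
    Returns {pid: [(pos, pgt), ...]}; records without PID/PGT are skipped.
    """
    groups = defaultdict(list)
    for pos, fmt, sample in records:
        fields = dict(zip(fmt.split(":"), sample.split(":")))
        if "PID" in fields and "PGT" in fields:
            groups[fields["PID"]].append((pos, fields["PGT"]))
    return dict(groups)


# The two variant records from the example above, trimmed to the fields shown.
records = [
    (1372268, "GT:AD:DP:GQ:PGT:PID", "0/1:30,40,0:70:99:0|1:1372268_G_A"),
    (1372269, "GT:AD:DP:GQ:PGT:PID", "0/1:30,41,0:71:99:0|1:1372268_G_A"),
]
print(phase_groups(records))
# {'1372268_G_A': [(1372268, '0|1'), (1372269, '0|1')]}
```

Both records share the same PID and both have PGT 0|1, which tells you the two ALT alleles sit on the same haplotype.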

The one big caveat related to the physical phasing output by HC in GVCFs is that, like the GVCF itself, it is not intended to be used directly for analysis! You must run your GVCFs through GenotypeGVCFs in order to get the finalized, properly formatted, ready-for-analysis calls.

Heads or tails

Speaking of HaplotypeCaller getting more helpful all the time, here’s some more of that. This still has to do with the graph assembly, and specifically, with how HC handles the bits at the edges of the graph, which are called dangling heads and dangling tails. Without going too far into the details, let’s just say that sometimes you have a variant that’s near the edge of a covered region, and due to technical reasons (cough kmer size cough) the end of the variant path can’t be tied back into the reference path, so it just dangles there (like, say, Florida) and gets trimmed off in the next step (rising ocean levels). And thus the variant is lost (boo).

We originally started paying attention to this because it often happens at the edge of exons near splice junctions in RNAseq data, but it can also happen in DNA data. The solution was to give HC the ability to recover these cliff-dwelling variants by merging the dangling ends back into the graph using special logic tailored for those situations. If you have been using our RNAseq Best Practices, then you may recognize this as the logic invoked by the --recoverDanglingHeads argument. In the new version, the functionality has been improved further and is now enabled by default for all variant calling (so you no longer need to specify that argument for RNAseq analysis). The upshot is that sensitivity is improved, especially for RNAseq data but also for DNA.

Variant annotations finally make sense

Finally, I want to attract everyone’s attention to the Variant Annotations section of the Tool Documentation, which has just undergone a comprehensive overhaul. All annotations now have some kind of documentation outlining their general purpose, output, interpretation, caveats and some notes about how they’re calculated where applicable. Tell us what you think; we are feedback junkies.

Created 2014-07-30 20:26:12 | Updated |


Better late than never (right?), here are the version highlights for GATK 3.2. Overall, this release is essentially a collection of bug fixes and incremental improvements that we wanted to push out to not keep folks waiting while we're working on the next big features. Most of the bug fixes are related to the HaplotypeCaller and its "reference confidence model" mode (which you may know as -ERC GVCF). But there are also a few noteworthy improvements/changes in other tools which I'll go over below.

Working out the kinks in the "reference confidence model" workflow

The "reference confidence model" workflow, which I hope you have heard of by now, is that awesome new workflow we released in March 2014, which was the core feature of the GATK 3.0 version. It solves the N+1 problem and allows you to perform joint variant analysis on ridiculously large cohorts without having to enslave the entire human race and turn people into batteries to power a planet-sized computing cluster. More on that later (omg we're writing a paper on it, finally!).

You can read the full list of improvements we've made to the tools involved in the workflow (mainly HaplotypeCaller and GenotypeGVCFs) in Eric's (unusually detailed) Release Notes for this version. The ones you are most likely to care about are that the "missing PLs" bug is fixed, GenotypeGVCFs now accepts arguments that allow it to emulate the HC's genotyping capabilities more closely (such as --includeNonVariantSites), the AB annotation is fully functional, reference DPs are no longer dropped, and CatVariants now accepts lists of VCFs as input. OK, so that last one is not really specific to the reference model pipeline, but that's where it really comes in handy (imagine generating a command line with thousands of VCF filenames -- it's not pretty).

HaplotypeCaller now emits post-realignment coverage metrics

The coverage metrics (DP and AD) reported by HaplotypeCaller are now those calculated after the HC's reassembly step, based on the reads having been realigned to the most likely haplotypes. So the metrics you see in the variant record should match what you see if you use the -bamout option and visualize the reassembled ActiveRegion in a genome browser such as IGV. Note that if any of this is not making sense to you, say so in the comments and we'll point you to the new HaplotypeCaller documentation! Or, you know, look for it in the Guide.

R you up to date on your libraries?

We updated the plotting scripts used by BQSR and VQSR to use the latest version of ggplot2, to get rid of some deprecated function issues. If your Rscripts are suddenly failing, you'll need to update your R libraries.

A sincere apology to GATK-based tool developers

We're sorry for making you jump through all these hoops recently. As if the switch to Maven wasn't enough, we have now completed a massive reorganization/renaming of the codebase that will probably cause you some headaches when you port your tools to the newest version. But we promise this is the last big wave, and ultimately this will make your life easier once we get the GATK core framework to be a proper maven artifact.

In a nutshell, the base name of the codebase has changed from sting to gatk (which hopefully makes more sense), and the most common effect is that sting.gatk classpath segments have been renamed. This, by the way, is why we had a bunch of broken documentation links; most of these have been fixed (yay symlinks) but there may be a few broken URLs remaining. If you see something, say something, and we'll fix it.

Created 2014-03-18 00:36:21 | Updated 2014-03-20 14:10:47 |


This may seem crazy considering we released the big 3.0 version not two weeks ago, but yes, we have a new version for you already! It's a bit of a special case because this release is all about the hardware-based optimizations we had previously announced. What we hadn't announced yet was that this is the fruit of a new collaboration with a team at Intel (which you can read more about here), so we were waiting for everyone to be ready for the big reveal.

Intel inside GATK

So basically, the story is that we've started collaborating with the Intel Bio Team to enable key parts of the GATK to run more efficiently on certain hardware configurations. For our first project together, we tackled the PairHMM algorithm, which is responsible for a large proportion of the runtime of HaplotypeCaller analyses. The resulting optimizations, which are the main feature in version 3.1, produce significant speedups for HaplotypeCaller runs on a wide range of hardware.

We will continue working with Intel to further improve the performance of GATK tools that have historically been afflicted with performance issues and long runtimes (hello BQSR). As always, we hope these new features will make your life easier, and we welcome your feedback in the forum!

In practice

Note that these optimizations currently work on Linux systems only, and will not work on Mac or Windows operating systems. In the near future we will add support for Mac OS. We have no plans to add support for Windows since the GATK itself does not run on Windows.

Please note also that to take advantage of these optimizations, you need to opt in by adding the following flag to your GATK command: -pairHMM VECTOR_LOGLESS_CACHING.

Here is a handy little table of the speedups you can expect depending on the hardware and operating system you are using. The configurations given here are the minimum requirements for benefiting from the expected speedup ranges shown in the third column. Keep in mind that these numbers are based on tests in controlled conditions; in the wild, your mileage may vary.

Linux kernel version    Architecture / Processor                             Expected speedup   Instruction set
Any 64-bit Linux        Any x86 64-bit                                       1-1.5x             Non-vector
Linux 2.6 or newer      Penryn (Core 2 or newer)                             1.3-1.8x           SSE 4.1
Linux 2.6.30 or newer   SandyBridge (i3, i5, i7, Xeon E3, E5, E7 or newer)   2-2.5x             AVX

To find out exactly which processor is in your machine, you can run this command in the terminal:

$ cat /proc/cpuinfo | grep "model name"                                                                                    
model name  : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
model name  : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
model name  : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
model name  : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
model name  : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
model name  : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
model name  : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
model name  : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz

In this example, the machine has 4 cores (8 threads), so you see the answer 8 times. With the model name (here i7-2600) you can look up your hardware's relevant capabilities in the Wikipedia page on vector extensions.
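If you'd rather not look things up by model name, the same /proc/cpuinfo file also has a "flags" line listing the instruction sets the processor reports, including avx and sse4_1. The Python sketch below maps those flags to the rows of the speedup table above. The function name is hypothetical, and the sketch deliberately ignores the kernel version requirements from the table, so treat the answer as a hint rather than a guarantee.

```python
def expected_speedup(cpuinfo_text):
    """Map /proc/cpuinfo contents to a row of the speedup table.

    Reads the "flags" lines and returns (instruction set, expected speedup).
    Kernel version requirements are not checked here.
    """
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    if "avx" in flags:
        return ("AVX", "2-2.5x")
    if "sse4_1" in flags:
        return ("SSE 4.1", "1.3-1.8x")
    return ("Non-vector", "1-1.5x")
```

On a Linux machine you would feed it the real file, for example `expected_speedup(open("/proc/cpuinfo").read())`.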

Alternatively, Intel has provided us with some links to lists of processors categorized by architecture, in which you can look up your hardware:

Penryn processors

Sandy Bridge processors

Finally, a few notes to clarify some concepts regarding Linux kernels vs. distributions and processors vs. architectures:

  • SandyBridge and Penryn are microarchitectures; essentially, these are sets of instructions built into the CPU. Core 2, Core i3, i5, i7, Xeon E3, E5, E7 are the processors that implement a specific architecture to make use of the relevant improvements (see table above).

  • The Linux kernel version is not tied to the Linux distribution (e.g. Ubuntu, RedHat etc.). Any distribution can use any kernel it wants. There are "default kernels" shipped with each distribution, but covering those is beyond the scope of this article (there are at least 300 Linux distributions out there). In any case, you can always install whatever kernel version you want.

  • Kernel version 2.6.30 was released in 2009, so we expect every sane person or IT department out there to be using something newer than this.

Created 2014-03-17 23:32:16 | Updated |


Better late than never, here is the now-traditional "Highlights" document for GATK version 3.0, which was released two weeks ago. It will be a very short one since we've already gone over the new features in detail in separate articles --but it's worth having a recap of everything in one place. So here goes.

Work smarter, not harder

We are delighted to present our new Best Practices workflow for variant calling in which multisample calling is replaced by a winning combination of single-sample calling in gVCF mode and joint genotyping analysis. This allows us to both bypass performance issues and solve the so-called "N+1 problem" in one fell swoop. For full details of why and how this works, please see this document. In the near future, we will update our Best Practices page to make it clear that the new workflow is now the recommended way to go for calling variants on cohorts of samples. We've already received some pretty glowing feedback from early adopters, so be sure to try it out for yourself!

Jumping on the RNAseq bandwagon

All the cool kids were doing it, so we had to join the party. It took a few months of experimentation, a couple of new tools and some tweaks to the HaplotypeCaller, but you can now call variants on RNAseq with GATK! This document details our Best Practices recommendations for doing so, along with a non-trivial number of caveats that you should keep in mind as you go.

Goodbye to ReduceReads

Nice try, but no. This tool is obsolete now that we have the gVCF/reference model pipeline (see above). Note that this means that GATK 3.0 will not support BAM files that were processed using ReduceReads!

Changes for developers

We've switched the build system from Ant to Maven, which should make it much easier to use GATK as a library against which you can develop your own tools. And on a related note, we're also making significant changes to the internal structure of the GATK codebase. Hopefully this will not have too much impact on external projects, but there will be a doc very shortly describing how the new build system works and how the codebase is structured.

Hardware optimizations held for 3.1

For reasons that will be made clear in the near future, we decided to hold the previously announced hardware optimizations until version 3.1, which will be released very soon. Stay tuned!

Created 2013-12-19 21:17:54 | Updated 2014-02-07 18:37:02 |


Better late than never, here are the highlights of the most recent version release, GATK 2.8. This should be short and sweet because as releases go, 2.8 is light on new features, and is best described as a collection of bug fixes, which are all* dutifully listed in the corresponding release notes document. That said, two of the changes we've made deserve some additional explanation.

* Up to now (this release included) we have not listed updates/patches to Queue in the release notes, but will start doing so from the next version onward.

VQSR & bad variants: no more guessing games

In the last release (2.7, for those of you keeping score at home) we trumpeted that the old -percentBad argument of VariantRecalibrator had been replaced by the shiny new -numBad argument, and that this was going to be awesome for all sorts of good reasons, improve stability and whatnot. Weeeeeeell it turned out that wasn't quite the case. It worked really well on the subset of analyses that we tested it on initially, but once we expanded to different datasets (and the complaints started rolling in on the forum) we realized that it actually made things worse in some cases because the default value was less appropriate than what -percentBad would have produced. This left people guessing as to what value would work for their particular dataset, with a great big range to choose from and very little useful information to assist in the choice.

So, long story short, we (and by "we" I mean Ryan) built in a new function that allows the VariantRecalibrator to determine for itself the number of variants that is appropriate to use for the "bad" model depending on the data. So the short-lived -numBad argument is gone too, replaced by... nothing. No new argument to specify; just let the VariantRecalibrator do its thing.

Of course if you really want to, you can override the default behavior and tweak the internal thresholds. See the tool doc here; and remember that a good rule of thumb is that if you can't figure out which arguments are involved based on that doc, you probably shouldn't be messing with this advanced functionality.

Reference calculation model

This is still a rather experimental feature, so we're still making changes as we go. The two big changes worth mentioning here are that you can now run this on reduced reads, and that we've changed the indexing routine to optimize the compression level. The latter shouldn't have any immediate impact on normal users, but it was necessary for a new feature project we've been working on behind the scenes (the single-sample-to-joint-discovery pipeline we have been alluding to in recent forum discussions). The reason we're mentioning it now is that if you use -ERC GVCF output, you'll need to specify a couple of new arguments as well (-variant_index_type LINEAR and -variant_index_parameter 128000, with those exact values). This useful little fact didn't quite make it into the documentation before we released, and not specifying them leads to an error message, so... there you go. No error message for you!
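For those who script their runs, here is a hedged Python sketch that assembles such a command with the two index arguments in place. The file names are placeholders, the helper name is made up, and the overall shape assumes the usual `java -jar GenomeAnalysisTK.jar -T ...` invocation; only the -ERC GVCF, -variant_index_type and -variant_index_parameter parts come from the text above.

```python
def gvcf_command(reference, bam, output, jar="GenomeAnalysisTK.jar"):
    """Build a HaplotypeCaller argument list for -ERC GVCF output.

    The two -variant_index_* arguments use the exact values required
    when emitting GVCFs in this version.
    """
    return [
        "java", "-jar", jar,
        "-T", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-ERC", "GVCF",
        "-variant_index_type", "LINEAR",
        "-variant_index_parameter", "128000",
        "-o", output,
    ]


# Placeholder file names, for illustration only.
cmd = gvcf_command("reference.fasta", "sample.bam", "sample.g.vcf")
print(" ".join(cmd))
```

Keeping the arguments in a list means the same object can be handed to `subprocess.run(cmd)` without any shell quoting worries.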

What's up, doc?

That's all for tool changes. In addition to those, we have made a number of corrections in the tool documentation pages, updated the Best Practices (mostly layout, tiny bit of content update related to the VQSR -numBad deprecation) and made some minor changes to the website, e.g. updated the list of publications that cite the GATK and improved the Guide index somewhat (but that's still a work in progress).

Created 2013-08-30 13:55:05 | Updated |


Yay, August is over! Goodbye steamy hot days, hello mild temperatures and beautiful leaf-peeping season. We hope you all had a great summer (in the Northern hemisphere at least) and caught a bit of a vacation. For our part, we've been chained to our desks the whole time!

Well, not really, but we've got a feature-rich release for you nonetheless. Lots of new things; not all of them fully mature, so heed the caveats on the experimental features! We've also made some key improvements to VQSR that we're very excited about, some bug fixes to various tools of course, and a new way to boost calling performance. Full list in the release notes as usual, and highlights below.

Estimating the confidence of reference calls

When UnifiedGenotyper and HaplotypeCaller emit variant calls, they tell you how confident you can be that the variants are real. But how do you know how confident to be that the rest are reference, i.e. non-variant? It's actually a pretty hard problem… and this is our answer:

  • For HaplotypeCaller, we’ve developed a full-on reference model that produces reference confidence scores. To use it, you need to enable the --emitRefConfidence mode. This mode is a little bit complicated so be sure you read the method article before you try to use it.

  • For UnifiedGenotyper, we don’t have a completely fleshed-out model, but we’ve added the -allSitePLs argument which, in combination with the EMIT_ALL_SITES output mode, will enable calculation of PLs for all sites, including reference. This will give a measure of reference confidence and a measure of which alt alleles are more plausible (if any). Note that this only works with the SNP calling model. Again, this is not as good or as complete as the reference model in HaplotypeCaller, so we urge you to use HaplotypeCaller for this unless you really need to use UnifiedGenotyper.

These are two highly experimental features; they work in our tests, but your mileage may vary, so please examine your results carefully. We welcome your feedback!

Modelling PCR errors that cause indel artifacts

A common problem in calling indels is that you get false positives that are associated with PCR slippage around short tandem repeats (especially homopolymers). Until we can all switch to PCR-free amplification, we're stuck with this issue. So we thought it would be nice to be able to model this type of error and mitigate its impact on our indel calls. The new --pcr_indel_model argument allows the HaplotypeCaller to use a new feature called the PCR indel model to weed out false positive indels more or less aggressively depending on how much you care about sensitivity vs. specificity.

This feature too is highly experimental, so play with it at your own risk. And stay tuned, because we've already got some ideas on how to improve it further.

VariantRecalibrator gets an oil change and free tire rotation

Variant recalibration is one of the most challenging parts of the Best Practices workflow, and not just for users! We've been wrestling with some of its internal machinery to produce better, more consistent modeling results, especially with call sets that are on the lower end of the size scale.

One of the breakthroughs we made was separating the parameters for the positive and negative training models. You know (or should know) that the VariantRecalibrator builds two separate models: one to model what "good" variants (i.e. true positives) look like (the positive model), and one to model what "bad" variants (i.e. false positives) look like (the negative model). Until now, we applied parameters the same way to both, but we've now realized that it makes more sense to treat them differently.

Because of how relative amounts of good and bad variants tend to scale differently with call set size, we also realized it was a bad idea to have the selection of bad variants be based on a percentage (as it has been until now) and instead switched it to a hard number. You can change this setting with the --numBadVariants argument, which replaces the now-deprecated --percentBadVariants argument.

Finally, we also found that the order of annotations matters. Now, instead of applying the annotation dimensions to the training model in the order that they were specified at the command line, VariantRecalibrator first reorders them based on their standard deviation. This stabilizes the training model and produces much more consistent results.
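The idea of reordering can be sketched in a few lines of Python. To be clear about the assumptions: the sort direction (largest standard deviation first), the use of the population standard deviation, the function name and the toy values are all choices made for this illustration, not a description of what VariantRecalibrator does internally.

```python
from statistics import pstdev


def order_annotations(values_by_annotation, descending=True):
    """Return annotation names sorted by the standard deviation of their values.

    The result depends only on the data, not on the order in which the
    annotations were listed. Sort direction is an assumption of this sketch.
    """
    return sorted(
        values_by_annotation,
        key=lambda name: pstdev(values_by_annotation[name]),
        reverse=descending,
    )


# Toy values; the annotations are given in two different "command line" orders.
run_a = {"QD": [2.0, 10.0, 25.0], "FS": [0.0, 1.0, 60.0], "MQRankSum": [-0.5, 0.0, 0.5]}
run_b = {"MQRankSum": [-0.5, 0.0, 0.5], "FS": [0.0, 1.0, 60.0], "QD": [2.0, 10.0, 25.0]}
print(order_annotations(run_a) == order_annotations(run_b))
```

The point of the example is the last line: whatever order you typed the annotations in, the model sees them in the same order.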

New arguments and tools for finer control of data

Some of you have been clamoring for more flexibility in handling individual BAM files and samples without losing the convenience of processing them in batches. In response, we've added the following:

  • For general GATK use, the -sample_rename_mapping_file engine argument allows you to rename samples on-the-fly at runtime. It takes a file that maps bam files to sample names. Note that this does require that your BAM files contain single samples only, although multiple read groups are allowed.

  • For variant calling, the -onlyEmitSamples argument allows you to tell the UnifiedGenotyper to only emit calls for specific samples among a cohort that you're calling in multisample mode, without emitting the calls for the rest of the cohort. Keep in mind however that the calculations will still be made on the entire cohort, and the annotation values emitted for those calls will reflect that.

  • For VQSR, the --excludeFiltered flag tells the ApplyRecalibration tool not to emit sites that are filtered out by recalibration (i.e. do not write them to file).

And some of you went ahead and added the features you wanted yourselves!

  • Yossi Farjoun contributed a patch to enable allele-biased downsampling with different per-sample values for the HaplotypeCaller, emulating the equivalent functionality that was already available in the UnifiedGenotyper.

  • Louis Bergelson contributed a new read filter, LibraryReadFilter, which allows you to use only reads from a specific library in your analysis. This is the opposite (and somewhat more specific) functionality compared to the existing engine argument, --read_group_black_list, which allows you to exclude read groups based on specific tags (including but not limited to LB).

Better diagnostics when things go wrong

We have a new diagnostic tool, QualifyMissingIntervals, that allows you to collect metrics such as GC content, mapping quality etc. for a list of intervals of interest. This is something you'd typically want to use if you found (through other tools) that you're missing calls in certain intervals, and you want to find out what's going wrong in those regions.

FPGA support for the pairHMM model in UG and HC

Finally, those of you who have access to more sophisticated computing platforms, heads up! Version 2.7 comes with a version of the PairHMM algorithm (aka the bit that takes forever to run in HaplotypeCaller) that is optimized for running on FPGA chips. Credit goes to the fine folks at Convey Computer and Green Mountain Computing Systems who teamed up to develop this optimized version of the PairHMM, with a little help from our very own Tech Dev team. We're told further optimizations may be in store; in the meantime, they're seeing up to 300-fold speedups of HaplotypeCaller runs on Convey's platform. Not bad!

Created 2013-06-24 15:17:56 | Updated 2013-06-24 16:47:36 |


It's finally summer here in New England -- time for cave-dwelling developers to hit the beach and do the lobster dance (those of us who don't tan well anyway). We leave you with a new version of the GATK that includes a new(ish) plotting tool, some more performance improvements to the callers, a lot of feature tweaks and quite a few bug fixes. Be sure to check out the full list in the 2.6 Release Notes.

Highlights are below as usual, enjoy. There's one thing that we need to point out with particular emphasis: we have moved to Java 7, so you may need to update your system's Java version. Full explanation at the end of this document because it's a little long, but be sure to read it.

New(ish) plotting tool for Base Recalibration results

GATK old-timers may remember a tool called AnalyzeCovariates, which was part of the BQSR process in 1.x versions, many moons ago. Well, we've resurrected it to take over the plotting functionality of the BaseRecalibrator, to make it easier and faster to plot and compare the results of base recalibration. This also prevents issues with plot generation in scatter-gather mode. We'll update our docs on the BQSR workflow in the next few days, but in the meantime you can find full details of how to use this tool here.

HaplotypeCaller now so sensitive, it cries at the movies

We know you don't want to miss a single true variant, so for this release, we've put a lot of effort into making the HaplotypeCaller more sensitive. And it's paying off: in our tests, the HaplotypeCaller is now more sensitive than the UnifiedGenotyper for calling both SNPs and indels when run over whole genome datasets.

[graph to illustrate, coming soon]

UnifiedGenotyper: not out of the race yet

You might think all our focus is on improving the HaplotypeCaller these days; you would be wrong. The UnifiedGenotyper is still essential for calling large numbers of samples together, for dealing with exotic ploidies, and for calling pooled samples. So we've given it a turbo boost that makes it go twice as fast for calling indels on multiple samples.

The key change here is the updated Hidden Markov Model used by the UG. You can see on the graph that as the number of exomes being called jointly increases, the new HMM keeps runtimes down significantly compared to the old HMM.

Version tracking in the VCF header

Don’t you hate it when you go back to a VCF you generated some months ago, and you have no idea which version of GATK you used at the time? (And yes, versions matter. Sometimes a lot.) We sure do, so we added a function to add the GATK version number in the header of the VCFs generated by GATK.

Migration to Java 7

Speaking of software versions... As you probably know, the GATK runs on Java -- specifically, until now, version 6 of the Runtime Environment (which translates to version 1.6 if you ask java -version at the command prompt). But the Java language has been evolving under our feet; version 7 has been out and stable for some time now, and version 8 is on the horizon. We were happy as clams with Java 6… but now, newer computers with recent OS versions ship with Java 7, and on Mac OS X once you update the system it is difficult to go back to using Java 6. And since Java 7 is not fully backwards compatible, people have been running into version problems.

So, we have made the difficult but necessary decision to follow the tide, and migrate the GATK to Java 7. Starting with this release, GATK will now require Java 7 to run. If you try to run with Java 6, you will probably get an error like this:

Exception in thread "main" java.lang.UnsupportedClassVersionError: org/broadinstitute/sting/gatk/CommandLineGATK : Unsupported major.minor version 51.0

If you're not sure what version of Java you are currently using, you can find out very easily by typing the following command:

java -version

which should return something like this:

java version "1.7.0_17"
Java(TM) SE Runtime Environment (build 1.7.0_17-b02)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)

If not, you'll need to update your Java version. If you have any difficulty doing this, please don’t ask us in the forum -- you’ll get much better, faster help if you ask your local IT department.
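For the curious, the 51.0 in that error message is the class-file version that Java 7 produces (Java 6 produces 50.0 and Java 8 produces 52.0). The Python sketch below is not part of the GATK; it decodes both the error message and the first line of the java -version output, and its function names are made up for illustration.

```python
import re

# Class-file major versions for the Java releases discussed here.
CLASS_MAJOR_TO_JAVA = {50: 6, 51: 7, 52: 8}


def required_java(error_message):
    """Return the Java release an UnsupportedClassVersionError asks for, or None."""
    match = re.search(r"Unsupported major\.minor version (\d+)\.(\d+)", error_message)
    if match is None:
        return None
    return CLASS_MAJOR_TO_JAVA.get(int(match.group(1)))


def installed_java(version_line):
    """Parse a line such as 'java version "1.7.0_17"' and return 7, or None."""
    match = re.search(r'version "1\.(\d+)', version_line)
    return int(match.group(1)) if match else None


error = ('Exception in thread "main" java.lang.UnsupportedClassVersionError: '
         'org/broadinstitute/sting/gatk/CommandLineGATK : '
         'Unsupported major.minor version 51.0')
needed = required_java(error)
found = installed_java('java version "1.7.0_17"')
```

If `found` is lower than `needed`, the installed Java is too old for this release.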

Created 2013-05-09 15:57:50 | Updated 2013-09-16 20:43:47 |

This is going to be a short one, folks. The 2.5 release is pretty much all about bug fixes, with a couple of exceptions that we'll cover below.

Bug fixes

Remember how we said that version 2.4 was going to be the least buggy ever? Well, that might have been a bit optimistic. We had a couple of stumpers in there -- and a flurry of little ones that were probably not novel (i.e. not specific to version 2.4) but finally bubbled up to the surface. We're not going to go over the bug fixes in detail, since the release notes include a comprehensive list. Basically, those are all fixed.

Actual features!

Well, not exactly new features, but noteworthy improvements to existing tools.

- ReduceReads turns the squeeze dial up to eleven

In addition to countless bug fixes, we've made drastic improvements to ReduceReads' compression algorithm, so you can now achieve much better compression rates without compromising on the retention of informative data. Keep in mind of course that as always, you'll see much bigger gains on certain types of data sets -- the higher the coverage in your original BAM files, the bigger the savings in file size and the bigger the performance gains in the downstream tools.

- HaplotypeCaller is faster and more accurate! No, really!

We say this every time, and every time it's true: we've made some more improvements to the HaplotypeCaller that make it faster and more accurate. Well, it's still slower than the UnifiedGenotyper, in case you were going to ask (of course you were). But on the accuracy front, we say this without reservation or caveat: HC is now just as accurate as the UG for calling SNPs, and it is in a league of its own for calling indels. If you are even remotely interested in indels you should absolutely take it out for a spin. Go. Now.

- DiagnoseTargets, all grown up

Say goodbye to the mood swings and the pimples; it looks like this tool's awkward teenager phase is finally over. We've entirely reworked how DiagnoseTargets functions so it now uses a plugin system, which we think is much more convenient. This plugin system will be explained in detail in a forthcoming documentation article.

- Functional annotation recovers some functionality

You may be aware that we had imposed a freeze of sorts on the annotation database version that could be used with the snpEff annotation. Well, we're happy to report that the author of the snpEff software package has made some significant upgrades, including a feature called GATK compatibility mode. As a result there is no longer any version constraint. We'll be updating our documentation on using snpEff with GATK soon (-ish), but in the meantime, feel free to go forth and annotate away. Just make sure to consult the snpEff manual for relevant information on using it with GATK.

Deprecation alerts

Even as the dev team giveth, the dev team taketh away.

A few annotations were removed from the VariantAnnotator stables (as listed in the release notes), mainly because they didn't work properly. With all the caveats about how GATK is research software, we're still committed to providing quality tools that do something close to what they're advertised to do, at the bare minimum. If something doesn't fulfill that requirement, it's out.

We've also disabled the auto-generation of fai/dict files for fasta references. I can hear some of you groaning all the way from here. Yes, it was convenient -- but far too buggy. Come on people, it's a one-liner using Picard. Oh, and we're no longer allowing the use of compressed (.gz) references either -- also too buggy. The space savings were simply not worth the headaches.

Created 2013-02-27 22:11:57 | Updated 2016-08-21 03:52:36 |

We are very proud (and more than a little relieved) to finally present version 2.4 of the GATK! It's been a long time coming, but we're certain you'll find it well worth the wait. This release is bursting at the seams with new features and improvements, as you'll read below. It is also very probably going to be our least-buggy initial release yet, thanks to the phenomenal effort that went into adding extensive automated tests to the codebase.

Important note: Keep in mind that this new release comes with a brand new license, as we announced a few weeks ago here. Be sure to at least check out the figure that explains the different packages we (and our commercial partner Appistry) offer, and get the one that is appropriate for your use of the GATK.

With that disclaimer out of the way, here are the feature highlights of version 2.4!

Better, faster, more productive

Let's start with what everyone wants to hear about: improvements in speed and accuracy. There are in fact far more improvements in accuracy than are described here, again because of the extensive test coverage we've added to the codebase. But here are the ones that we believe will have the most impact on your work.

- Base Quality Score Recalibration gets a Bayesian boost

We realized that even though BaseRecalibrator was doing a fabulous job in general, the calculation for the empirical quality of a bin (e.g. all bases at the 33rd cycle of a read) was not always accurate. Specifically, we would draw the same conclusions from bins with many or few observations -- but in the latter case that was not necessarily correct (we were seeing some Q6s get recalibrated up to Q30s, for example). We changed this behavior so that the BaseRecalibrator now calculates a proper Bayesian estimate of the empirical quality. As a result, for bins with very little data, the likelihood is dwarfed by a prior probability that tends towards the original quality; there is no effect on large bins, which were already fine. This brings noticeable improvements in the genotype likelihoods being produced by the genotypers, in particular for the heterozygous state (as expected).
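To make the idea concrete, here is a rough Python sketch of a Bayesian estimate of this kind. It is an illustration under simplifying assumptions, not the BaseRecalibrator's actual code: the Gaussian-shaped prior, its width (sigma), the integer grid of candidate qualities and the function name are all hypothetical.

```python
import math


def bayesian_empirical_quality(errors, observations, reported_q, sigma=2.0, max_q=60):
    """Return the candidate quality (1..max_q) with the highest posterior score."""
    best_q, best_score = None, -math.inf
    for q in range(1, max_q + 1):
        p_error = 10 ** (-q / 10)
        # Binomial log-likelihood of the observed errors (constant term dropped).
        log_likelihood = (errors * math.log(p_error)
                          + (observations - errors) * math.log1p(-p_error))
        # Assumed prior: a Gaussian-shaped penalty for moving away from the reported quality.
        log_prior = -((q - reported_q) ** 2) / (2 * sigma ** 2)
        score = log_likelihood + log_prior
        if score > best_score:
            best_q, best_score = q, score
    return best_q


# A bin reported as Q6 with almost no data, and the same bin with plenty of data.
small_bin = bayesian_empirical_quality(errors=0, observations=10, reported_q=6)
large_bin = bayesian_empirical_quality(errors=1000, observations=1_000_000, reported_q=6)
```

With only 10 observations the estimate stays within a few points of the reported Q6, whereas with a million observations the data dominate the prior and the estimate follows the empirical error rate of 1 in 1,000 (Q30).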

- HaplotypeCaller catching up to UnifiedGenotyper on speed, gets ahead on accuracy

You may remember that in the highlights for version 2.2, we were excited to announce that the HaplotypeCaller was no longer operating on geological time scales. Well, now the HC has made another big leap forward in terms of speed -- and it is now almost as fast as the UnifiedGenotyper. If you were reluctant to move from the UG to the HC based on runtime, that shouldn't be an issue anymore! Or, if you were unconvinced by the merits of the new calling algorithm, you'll be interested to know that our internal tests show that the HaplotypeCaller is now more accurate in calling variants (SNPs as well as Indels) than the UnifiedGenotyper.

How did we make this happen? There are too many changes to list here, but one of the key modifications that makes the HaplotypeCaller much faster (without sacrificing any accuracy!) is that we've greatly optimized how local Smith-Waterman re-assembly is applied. Previously, when the HC encountered a region where reassembly was needed, it performed SW re-assembly on the entire region, which was computationally very demanding. In the new implementation, the HC generates a "bubble" (yes, that's the actual technical term) around each individual haplotype, and applies the SW re-assembly only within that bubble. This brings down the computational challenge by orders of magnitude.

New tools, extended capabilities

We're not just fluffing up the existing tools -- we're also adding new tools to extend the capabilities of our toolkit.

- New filtering options to better control your data  

A new Read Filter, ReassignOneMappingQualityFilter, allows you to -- well, it's in the name -- reassign one mapping quality. This is useful for example to process data output by programs like TopHat which use MAPQ = 255 to convey meaningful information. The GATK would normally ignore any reads with that mapping quality. With the new filter, you can selectively reassign that quality to something else so that those reads will get utilized, without affecting the rest of your dataset.
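Conceptually, the filter does something like the following Python sketch. The function name and the replacement value of 60 are illustrative assumptions, not the filter's actual implementation or a statement about its defaults.

```python
def reassign_one_mapping_quality(mapqs, from_quality=255, to_quality=60):
    """Replace exactly one MAPQ value and leave every other value untouched."""
    return [to_quality if q == from_quality else q for q in mapqs]


# Reads with MAPQ 255 become usable; the rest of the dataset is unaffected.
reassigned = reassign_one_mapping_quality([255, 30, 0, 255])
```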

In addition, the recently introduced contamination filter gets upgraded with the option to apply decontamination individually per sample.  

- Useful tool options get promoted to standalone tools

Version 2.4 includes several new tools that grew out of existing tool options. The rationale for making them standalone tools is that they represent particularly useful capabilities that merit expansion, and expanding them within their "mother tool" was simply too cumbersome.

  • GenotypeConcordance graduates from being a module of VariantEval, to being its own fully-fledged tool. This comes with many bug fixes and an overhaul of how the concordance results are tabulated, which we hope will cause less confusion than it has in the past!

  • RegenotypeVariants takes over -- and improves upon -- the functionality previously provided by the --regenotype option of SelectVariants. This tool allows you to refresh the genotype information in a VCF file after samples have been added or removed.

And we're also adding CatVariants, a tool to quickly combine multiple VCF files whose records are non-overlapping (e.g. as produced during scatter-gather using Queue). This should be a useful alternative to CombineVariants, which is primarily meant for more complex combination operations.

Nightly builds

Going forward, we have decided to provide nightly automated builds from our development tree. This means that you can get the very latest development version -- no need to wait weeks for bug fixes or new features anymore! However, this comes with a gigantic caveat emptor: these are bleeding-edge versions that are likely to contain bugs, and features that have never been tested in the wild. And they're automatically generated at night, so we can't even guarantee that they'll run. All we can say of any of them is that the code was able to compile -- beyond that, we're off the hook. We won't answer support questions about the new stuff. So in short: you want to try the nightlies, you do so at your own risk.

If any of the above scares or confuses you, no problem -- just stay well clear of the owl and you won't get bitten.

But hey, if you're feeling particularly brave or lucky, have fun :)

Documentation upgrades

The release of version 2.4 also coincides with some upgrades to the documentation that are significant enough to merit a brief mention.

- Every release gets a versioned Guide Book PDF

From here on, every release (including minor releases, such as 2.3-9) will be accompanied by the generation of a PDF Guide Book that contains the online documentation articles as they are at that time. It will not only allow you to peruse the documentation offline, but it will also serve as versioned documentation. This way, if in the future you need to go back and examine results you obtained with an older version of the GATK, you can easily find the documentation that was valid at that time. Note that the Technical Documentation (which contains the exhaustive lists of arguments for each tool) is not included in the Guide Book since it can be generated directly from the source code.

- Technical Documentation gets a Facelift

Speaking of the Technical Documentation, we are happy to announce that we've enriched those pages with additional information, including available parallelization options and default read filters for each tool, where applicable. We've also reorganized the main categories in the Technical Documentation index to make it easier to browse tools and find what you need.

Developer alert

Finally, a few words for developers who have previous experience with the GATK codebase. The VariantContext and related classes have been moved out of the GATK codebase and into the Picard public repository. The GATK now uses the resulting Variant.jar as an external library (currently version 1.85.1357). We've also updated the Picard and Tribble jars to version 1.84.1337.

Created 2012-12-18 23:38:33 | Updated 2015-10-14 20:54:21 |

Release version 2.3 is the last before the winter holidays, so we've done our best not to put in anything that will break easily. Which is not to say there's nothing important - this release contains a truckload of feature tweaks and bug fixes (see the release notes in the next tab for full list). And we do have one major new feature for you: a brand-spanking-new downsampler to replace the old one.

Feature improvement highlights

- Sanity check for mis-encoded quality scores

It has recently come to our attention that some datasets are not encoded in the standard format (Q0 == ASCII 33 according to the SAM specification, whereas in some datasets including older Illumina data, encoding starts at ASCII 64). This is a problem because the GATK assumes that it can use the quality scores as they are. If they are in fact encoded using a different scale, our tools will make an incorrect estimation of the quality of your data, and your analysis results will be off. To prevent this from happening, we've added a sanity check of the quality score encodings that will abort the program run if they are not standard. If this happens to you, you'll need to run again with the flag --fix_misencoded_quality_scores (-fixMisencodedQuals). What will happen is that the engine will simply subtract 31 from every quality score as it is read in, and proceed with the corrected values. Output files will include the correct scores where applicable.
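The arithmetic behind the fix is simple: 64 - 33 = 31, so a score stored with an offset of 64 but read with an offset of 33 comes out 31 too high. The Python sketch below illustrates that correction. The detection threshold (any score above 60) and the function names are assumptions made for this example, not a description of the engine's actual check.

```python
def decode_phred33(quality_string):
    """Decode a quality string assuming the standard offset (Q0 == ASCII 33)."""
    return [ord(char) - 33 for char in quality_string]


def looks_misencoded(scores, max_reasonable=60):
    """Assumed heuristic: scores this high suggest an ASCII 64 offset."""
    return any(q > max_reasonable for q in scores)


def fix_misencoded(scores):
    """Subtract 31 from every score, as described above."""
    fixed = [q - 31 for q in scores]
    if any(q < 0 for q in fixed):
        raise ValueError("negative quality after correction; data were probably not misencoded")
    return fixed


# 'h' is ASCII 104, i.e. Q40 when the encoding starts at ASCII 64.
raw = decode_phred33("hhh")
corrected = fix_misencoded(raw) if looks_misencoded(raw) else raw
```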

- Overall GATK performance improvement

Good news on the performance front: we eliminated a bottleneck in the GATK engine that increased the runtime of many tools by as much as 10x, depending on the exact details of the data being fed into the GATK. The problem was caused by the internal timing code invoking expensive system timing resources far too often. Imagine you looked at your watch every two seconds -- it would take you ages to get anything done, right? Anyway, if you see your tools running unusually quickly, don't panic! This may be the reason, and it's a good thing.

- Co-reducing BAMs with ReduceReads (Full version only)

You can now co-reduce separate BAM files by passing them in with multiple -I or as an input list. The motivation for this is that samples that you plan to analyze together (e.g. tumor-normal pairs or related cohorts) should be reduced together, so that if a disagreement is triggered at a locus for one sample, that locus will remain unreduced in all samples. You will therefore conserve the full depth of information for later analysis of that locus.

Downsampling, overhauled

The downsampler is the component of the GATK engine that handles downsampling, i.e. the process of removing a subset of reads from a pileup. The goal of this process is to speed up execution of the desired analysis, particularly in genome regions that are covered by excessive read depth.

In this release, we have replaced the old downsampler with a brand new one that extends some options and performs much better overall.

- Downsampling to coverage for read walkers

The GATK offers two different options for downsampling:

  • --downsample_to_coverage (-dcov) enables you to set the maximum amount of coverage to keep at any position
  • --downsample_to_fraction (-dfrac) enables you to remove a proportional amount of the reads at any position (e.g. take out half of all the reads)

Until now, it was not possible to use the --downsample_to_coverage (-dcov) option with read walkers; you were limited to using --downsample_to_fraction (-dfrac). In the new release, you will be able to downsample to coverage for read walkers.

However, please note that the process is a little different. The normal way of downsampling to coverage (e.g. for locus walkers) involves downsampling over the entire pileup of reads in one take. For technical reasons, it is still not possible to do that exact process for read walkers; instead the read-walker-compatible way of doing it involves downsampling within subsets of reads that are all aligned at the same starting position. This different mode of operation means you shouldn't use the same range of values; where you would use -dcov 100 for a locus walker, you may need to use -dcov 10 for a read walker. And these are general estimates - your mileage may vary depending on your dataset, so we recommend testing before applying on a large scale.
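The read-walker flavor of downsampling to coverage can be pictured with the Python sketch below. It is only an illustration of "cap the number of reads per start position"; the function name, the (name, start) read representation and the fixed random seed are assumptions, and the real engine code is certainly more involved.

```python
import random
from collections import defaultdict


def downsample_by_start(reads, max_per_start, seed=0):
    """Keep at most max_per_start reads among those sharing a start position.

    reads is an iterable of (name, start) tuples.
    """
    rng = random.Random(seed)
    by_start = defaultdict(list)
    for read in reads:
        by_start[read[1]].append(read)
    kept = []
    for start in sorted(by_start):
        group = by_start[start]
        if len(group) > max_per_start:
            group = rng.sample(group, max_per_start)  # random subset, no replacement
        kept.extend(group)
    return kept


# 50 reads starting at position 100 and 3 reads starting at position 101.
reads = [("r%d" % i, 100) for i in range(50)] + [("s%d" % i, 101) for i in range(3)]
kept = downsample_by_start(reads, max_per_start=10)
```

Because the depth at any one locus is made up of reads that started at many different positions, a per-start cap has to be much smaller than a per-locus cap to produce a comparable depth, which is why the two kinds of walkers call for different ranges of values.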

- No more downsampling bias!

One important property of the downsampling process is that it should be as random as possible to avoid introducing biases into the selection of reads that will be kept for analysis. Unfortunately our old downsampler - specifically, the part of the downsampler that performed the downsampling to coverage - suffered from some biases. The most egregious problem was that as it walked through the data, it tended to privilege more recently encountered reads and displace "older" reads. The new downsampler no longer suffers from these biases.
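For readers wondering how a streaming downsampler can avoid favoring recent reads at all, reservoir sampling is one standard technique: every item in the stream ends up in the kept set with the same probability, no matter when it was encountered. The Python sketch below shows the textbook version. This article does not say which algorithm the new downsampler actually uses, so treat this as a generic illustration with made-up names.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Keep k items from a stream so that each item is equally likely to be kept."""
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            slot = rng.randint(0, index)    # inclusive on both ends
            if slot < k:
                reservoir[slot] = item      # replace with probability k / (index + 1)
    return reservoir


sampled = reservoir_sample(range(1000), k=10)
everything = reservoir_sample(range(5), k=10)
```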

- More systematic testing

The old downsampler was embedded in the engine code in a way that made it hard to test in a systematic way. So when we implemented the new downsampler, we reorganized the code to make it a standalone engine component - the equivalent of promoting it from the cubicle farm to its own corner office. This has allowed us to cover it much better with systematic tests, so we have a better assessment of whether it's working properly.

- Option to revert to the old downsampler

The new downsampler is enabled by default and we are confident that it works much better than the old one. BUT as with all brand-spanking-new features, early adopters may run into unexpected rough patches. So we're providing a way to disable it and use the old one, which is still in the box for now: just add -use_legacy_downsampler to your command line. Obviously if you use this AND -dcov with a read walker, you'll get an error, since the old downsampler can't downsample to coverage for read walkers.

Created 2012-10-30 03:48:34 | Updated 2013-01-24 05:59:32 |

We're very excited to present release version 2.2 to the public. As those of you who have been with us for a while know, it's been a much longer time than usual since the last minor release (v 2.1). Ah, but don't let the "minor" name fool you - this release is chock-full of major improvements that are going to make a big difference to pretty much everyone's use of the GATK. That's why it took longer to put together; we hope you'll agree it was worth the wait!

The biggest changes in this release fall in two categories: enhanced performance and improved accuracy. This is rounded out by a gaggle of bug fixes and updates to the resource bundle.

Performance enhancements

We know y'all have variants to call and papers to publish, so we've pulled out all the stops to make the GATK run faster without costing 90% of your grant in computing hardware. First, we're introducing a new multi-threading feature called Nanoscheduler that we've added to the GATK engine to expand your options for parallel processing. Thanks to the Nanoscheduler, we're finally able to bring multi-threading back to the BaseRecalibrator. We've also made some seriously hard-core algorithm optimizations to ReduceReads and the two variant callers, UnifiedGenotyper and HaplotypeCaller, that will cut your runtimes down so much you won't know what to do with all the free time. Or, you'll actually be able to get those big multisample analyses done in a reasonable amount of time…

- Introducing the Nanoscheduler

This new multi-threading feature of the GATK engine allows you to take advantage of having multiple cores per machine, whether in your desktop computer or on your server farm. Basically, the Nanoscheduler creates clones of the GATK, assigns a subset of the job to each and runs it on a different core of the machine. Usage is similar to the -nt mode you may already be familiar with, except you call this one with the new -nct argument. Note that the Nanoscheduler currently reserves one thread for itself, which acts like a manager (it bosses the other threads around but doesn't get much work done itself) so to see any real performance gain you'll need to use at least -nct 3, which yields two "worker" threads. This is a limitation of the current implementation which we hope to resolve soon. See the updated document on Parallelism with the GATK (v2) (link coming soon) for more details of how the Nanoscheduler works, as well as recommendations on how to optimize parallelization for each of the main GATK tools.

- Multi-threading power returns to BaseRecalibrator

Many of you have complained that the rebooted BaseRecalibrator in GATK2 takes forever to run. Rightly so, because until now, you couldn't effectively run it in multi-threaded mode. The reason for that is fairly technical, but in essence, whenever a thread started working on a chunk of data it locked down access to the rest of the dataset, so any other threads would have to wait for it to finish working before they could begin. That's not really multi-threading, is it? No, we didn't think so either. So we rewrote the BaseRecalibrator to not do that anymore, and we gave it a much saner and more effective way of handling thread safety: each thread locks down just the chunk of data it's assigned to process, not the whole dataset. The graph below shows the performance gains of the new system over the old one. Note that in practice, this is operated by the Nanoscheduler (see above); so remember, if you want to parallelize BaseRecalibrator, use -nct, not -nt, and be sure to assign three or more threads.
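The difference between locking the whole dataset and locking one chunk can be sketched in a few lines of Python. This is a toy illustration of the locking pattern only: the Chunk class, the mismatch counting and the thread count are all hypothetical, and Python threads say nothing about the real performance of the Java implementation.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Chunk:
    """A slice of the data that carries its own lock."""

    def __init__(self, bases):
        self.bases = bases
        self.lock = threading.Lock()  # one lock per chunk, not one for the whole dataset
        self.mismatches = 0


def process_chunk(chunk, reference_base="A"):
    # Only this chunk is locked, so other threads can work on other chunks meanwhile.
    with chunk.lock:
        chunk.mismatches = sum(1 for base in chunk.bases if base != reference_base)
    return chunk.mismatches


def process_all(chunks, n_threads=3):
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(process_chunk, chunks))


counts = process_all([Chunk("AACA"), Chunk("AAAA"), Chunk("CCGA")])
```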

- Reduced runtimes for ReduceReads (Full version only)

Without going into the gory technical details, we optimized the underlying compression algorithm that powers ReduceReads, and we're seeing some very significant improvements in runtime. For a "best-case scenario" BAM file, i.e. a well-formatted BAM with no funny business, the average is about a three-fold decrease in runtime. Yes, it's three times faster! And if that doesn't impress you, you may be interested to know that for "worst-case scenario" BAM files (which are closer to what we see in the wild, so to speak, than in our climate-controlled test facility) we see orders of magnitude of difference in runtimes. That's tens to hundreds of times faster. To many of you, that will make the difference between being able to reduce reads or not. Considering how reduced BAMs can help bring down storage needs and runtimes in downstream operations as well -- it's a pretty big deal.

- Faster joint calling with UnifiedGenotyper

Ah, another algorithm optimization that makes things go faster. This one affects the EXACT model that underlies how the UG calls variants. We've modified it to use a new approach to multiallelic discovery, which greatly improves scalability of joint calling for multi-sample projects. Previously, the relationship between the number of possible alternate alleles and the difficulty of the calculation (which directly impacts runtime) was exponential. So you had to place strict limits on the number of alternate alleles allowed (like 3, tops) if you wanted the UG run to finish during your lifetime. With the updated model, the relationship is linear, allowing the UG to comfortably handle around 6 to 10 alternate alleles without requiring some really serious hardware to run on. This will mostly affect projects with very diverse samples (as opposed to more monomorphic ones).

- Making the HaplotypeCaller go Whoosh! (Full version only)

The last algorithm optimization for this release, but certainly not the least (there is no least, and no parent ever has a favorite child), this one affects the likelihood model used by the HaplotypeCaller. Previously, the HaplotypeCaller's HMM required calculations to be made in logarithmic space in order to maintain precision. These log-space calculations were very costly in terms of performance, and took up to 90% of the runtime of the HaplotypeCaller. Everyone and their little sister has been complaining that it operates on a geological time scale, so we modified it to use a new approach that gets rid of the log-space calculations without sacrificing precision. Words cannot express how well that worked, so here's a graph.

This graph shows runtimes for HaplotypeCaller and UnifiedGenotyper before (left side) and after (right side) the improvements described above. Note that the version numbers refer to development versions and do not map directly to the release versions.
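To give a feel for why log-space arithmetic is expensive, the Python sketch below runs the forward algorithm of a tiny two-state HMM in two ways: once entirely in log space, where every addition of probabilities needs exponentials and a logarithm, and once in ordinary space with a rescaling step that prevents underflow. Rescaling is one textbook way to avoid log-space sums; this article does not say which technique the HaplotypeCaller adopted, and the model parameters here are invented.

```python
import math

START = [0.5, 0.5]                   # hypothetical start probabilities
TRANS = [[0.9, 0.1], [0.2, 0.8]]     # hypothetical transition probabilities


def log_sum_exp(a, b):
    """log(exp(a) + exp(b)) without overflow: the costly step in log space."""
    top = max(a, b)
    return top + math.log(math.exp(a - top) + math.exp(b - top))


def forward_log_space(emissions):
    """emissions is a list of [p_state0, p_state1]; returns the log-likelihood."""
    log_alpha = [math.log(START[s]) + math.log(emissions[0][s]) for s in (0, 1)]
    for e in emissions[1:]:
        log_alpha = [
            log_sum_exp(log_alpha[0] + math.log(TRANS[0][s]),
                        log_alpha[1] + math.log(TRANS[1][s])) + math.log(e[s])
            for s in (0, 1)
        ]
    return log_sum_exp(log_alpha[0], log_alpha[1])


def forward_scaled(emissions):
    """Same quantity with plain multiplications plus one rescaling per step."""
    alpha = [START[s] * emissions[0][s] for s in (0, 1)]
    log_scale = 0.0
    for e in emissions[1:]:
        alpha = [(alpha[0] * TRANS[0][s] + alpha[1] * TRANS[1][s]) * e[s] for s in (0, 1)]
        total = alpha[0] + alpha[1]
        alpha = [a / total for a in alpha]   # keep values in a safe numeric range
        log_scale += math.log(total)         # remember what was divided out
    return math.log(alpha[0] + alpha[1]) + log_scale


observations = [[1e-3, 2e-3]] * 500
in_log_space = forward_log_space(observations)
rescaled = forward_scaled(observations)
```

Both functions return the same log-likelihood up to floating-point rounding, even though the raw likelihood here is far too small to be represented directly as a double.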

Accuracy improvements

Alright, going faster is great, I hear you say, but are the results any good? We're a little insulted that you asked, but we get it -- you have responsibilities, you have to make sure you get the best results humanly possible (and then some). So yes, the results are just as good with the faster tools -- and we've actually added a couple of features to make them even better than before. Specifically, the BaseRecalibrator gets a makeover that improves indel scores, and the UnifiedGenotyper gets equipped with a nifty little trick to minimize the impact of low-grade sample contamination.

- Seeing alternate realities helps BaseRecalibrator grok indel quality scores (Full version only)

When we brought multi-threading back to the BaseRecalibrator, we also revamped how the tool evaluates each read. Previously, the BaseRecalibrator accepted the read alignment/position issued by the aligner, and made all its calculations based on that alignment. But aligners make mistakes, so we've rewritten it to also consider other possible alignments and use a probabilistic approach to make its calculations. This delocalized approach leads to improved accuracy for indel quality scores.

- Pruning allele fractions with UnifiedGenotyper to counteract sample contamination (Full version only)

In an ideal world, your samples would never get contaminated by other DNA. This is not an ideal world. Sample contamination happens more often than you'd think; usually at a low-grade level, but still enough to skew your results. To counteract this problem, we've added a contamination filter to the UnifiedGenotyper. Given an estimated level of contamination, the genotyper will downsample reads by that fraction for each allele group. By default, this number is set at 5% for high-pass data. So in other words, for each allele it detects, the genotyper throws out 5% of reads that have that allele.

We realize this may raise a few eyebrows, but trust us, it works, and it's safe. This method respects allelic proportions, so if the actual contamination is lower, your results will be unaffected, and if a significant amount of contamination is indeed present, its effect on your results will be minimized. If you see differences between results called with and without this feature, you have a contamination problem.

Note that this feature is turned ON by default. However it only kicks in above a certain amount of coverage, so it doesn't affect low-pass datasets.
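The per-allele pruning described above can be illustrated with the following Python sketch. It is a simplified picture, not the UnifiedGenotyper's code: the function name, the (read name, allele) representation, the rounding rule and the fixed random seed are assumptions made for the example.

```python
import random
from collections import defaultdict


def prune_allele_fractions(reads, contamination=0.05, seed=0):
    """Drop the given fraction of reads from each allele group.

    reads is an iterable of (read_name, allele) tuples; returns {allele: kept names}.
    """
    rng = random.Random(seed)
    by_allele = defaultdict(list)
    for name, allele in reads:
        by_allele[allele].append(name)
    kept = {}
    for allele, names in by_allele.items():
        n_remove = round(contamination * len(names))
        kept[allele] = rng.sample(names, len(names) - n_remove)
    return kept


# 100 reads supporting the reference allele and 20 supporting an alternate allele.
pileup = [("ref%d" % i, "A") for i in range(100)] + [("alt%d" % i, "G") for i in range(20)]
pruned = prune_allele_fractions(pileup)
```

Since the same fraction is removed from every allele group, the proportions between alleles are preserved (here 95 to 19, still 5 to 1), while an allele supported only by a handful of contaminating reads loses part of its already thin support.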

Bug fixes

We've added a lot of systematic tests to the new tools and features that were introduced in GATK 2.0 and 2.1 (Full versions), such as ReduceReads and the HaplotypeCaller. This has enabled us to flush out a lot of the "growing pains" bugs, in addition to those that people have reported on the forum, so all that is fixed now. We realize many of you have been waiting a long time for some of these bug fixes, so we thank you for your patience and understanding. We've also fixed the few bugs that popped up in the mature tools; these are all fixed in both Full and Lite versions of course.

Details will be available in the new Change log shortly.

Resource bundle updates

Finally, we've updated the resource bundle with a variant callset that can be used as a standard for setting up your variant calling pipelines. Briefly, we generated this callset from the raw BAMs of our favorite trio (CEU Trio) according to our Best Practices (using the UnifiedGenotyper on unreduced BAMs). We additionally phased the calls using PhaseByTransmission. We've also updated the HapMap VCF.

Note that from now on, we plan to generate a new callset with each major and minor release, and the numbering of the bundle versions will follow the GATK version numbers to avoid any confusion.

Note: There are no version highlights available for versions earlier than 2.2.

These are the release notes issued for all major and minor version releases (for example, 3.4). At this time, we do not provide release notes for subversion changes (for example, 3.4-46) but they are typically accompanied by a blog post, and you can view the latest changes in the Change log (next tab).

Created 2016-12-12 14:55:14 | Updated 2016-12-12 17:06:37 |

GATK 3.7 was released on December 12, 2016. Itemized changes are listed below. For more details, see the user-friendly version highlights.

HaplotypeCaller + GGVCFs

  • 39da22b - Changes to use the median rather than the second best likelihood for the NON_REF allele
  • a8797f2 - Fixed merging of GVCF blocks by fixing rounding of GQ values in ReferenceConfidenceModel
  • ce4ed1f - Remove NON_REF from allSites VCF output
  • ba21b22 - Do not emit GVCF block definitions in the header of the final VCF emitted by GenotypeGVCFs
  • 6670d12 - Changed maximum allowed GQB value to 100
  • e7bd143 - Added exception for GQB values greater than MAX_GENOTYPE_QUAL and tests
  • 9ae9b26 - Deprecate -stand_emit_conf
  • 04a70bb - Remove -stand_emit_conf argument
  • 5b8bf1c - Change default value of STANDARD_CONFIDENCE_FOR_CALLING to 10
  • a8db074 - Backport new AFCalculator
  • ad3e4f4 - Backport numerics changes in new qual
  • dc0fa3f - Fixes NaN issue in new Qual calculator


MuTect2

  • d7f1a9c - Lots of small improvements to Mutect2 code
  • 408d31d - More small refactorings of Mutect2 code
  • ff1e3a3 - Finish porting MuTect1 clustered read position filter
  • ce7d4bd - Port the strand bias filter from M1 and refactored code around SomaticGenotypingEngine; added a new integration test
  • 1350c0e - Add new annotator for M1 clustered read position filter and M1 strand bias filter
  • 095a469 - Cleaned up SomaticGenotypingEngine::callMutations and added some TODOs
  • e6d0318 - Expose downsampling arguments in Mutect

Allele prioritization and culling

  • a557ff3 - RCM Variant sites merger won't output PL when there are too many alleles in order to avoid memory issues with large cohort runs
  • 7296dbf - Remove alt alleles, when genotype count is explosively large, based on alleles' highest supporting haplotype score; max tolerable genotype count is controlled by a default value overridable by user
  • 5b09639 - Impose a maximum allele list message length
  • 7cd8a66 - Make sure that multi-allelic uninformative PLs (0,0,...,0) stay uninformative after biallelization
  • f182fc1 - Make alt allele removal by likelihoods robust to ref allele indices
  • 1709765 - Fixed a max priority Q error while removing alt alleles when faced with high ploidy and allele count
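As background for why the genotype count can become "explosively large": the number of possible unordered genotypes is the number of ways to pick ploidy alleles, with repetition, from the alleles at the site, and PLs carry one value per genotype. The Python sketch below computes that count. It is a standalone illustration with a made-up function name, not code from the commits listed above.

```python
from math import comb


def genotype_count(ploidy, n_alleles):
    """Number of unordered genotypes: multisets of size ploidy drawn from n_alleles."""
    return comb(n_alleles + ploidy - 1, ploidy)


diploid_biallelic = genotype_count(ploidy=2, n_alleles=2)    # AA, AB, BB
diploid_triallelic = genotype_count(ploidy=2, n_alleles=3)
high_ploidy = genotype_count(ploidy=10, n_alleles=7)
```

A diploid site with 3 alleles has only 6 genotypes, but at ploidy 10 with 7 alleles the count is already 8008, which is the kind of growth that allele culling is meant to keep in check.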

Misc annotations

  • 1f4fa57 - Change to max value of ExcessHet
  • 186d616 - Fix for int overflow in RankSum calculation
  • 5931311 - Remove RankSumTest and RMSAnnotation from hom-ref sites
  • 4c1365f - BaseCountsBySample counting bases at a particular position
  • f0874d1 - Bypass spanning deletions in Rank Sum tests
  • a327c24 - Makes Fisher's exact test match R and GATK4 results
  • 92a5aad - Fixed logic error and tidied AlleleBalance and AlleleBalanceBySample

Misc tools

  • 41c9fed - Add read group identifier to column names in ReadLengthDistribution
  • d96c02b - Allows GatherBqsrReports to accept a .list file as input
  • 22a3008 - SelectVariants works with non-diploids
  • 0dd96ac - Added TreeReduce interface to VariantFiltration
  • b85fea3 - Fix for genotype filters issue in VariantFiltration
  • 4341a2e - Assorted documentation fixes

Engine options

  • 69af359 - Added option to merge GenomeLocs that are abutting (contiguous) rather than actually overlapping
  • 541243b - Make interval padding work for "exclude intervals"
  • 4e4ac94 - Enable control of reporting periodicity
  • 68f1822 - Set HTSJDK log level
  • 359f078 - Throw an exception if the BQSR input covariates file is not found

Under the hood

  • 975c17a - Throw an exception for invalid Picard intervals
  • ce1d4c8 - Upgrade Apache Commons Collections to version 3.2.2
  • 355d053 - Move htsjdk and picard to version 2.5.0
  • 08b8eab - Move htsjdk to ver 2.6.1 and picard to ver 2.6.0
  • 4d45102 - Move htsjdk to ver 2.8.1 and picard to ver 2.7.2
  • f410451 - Change HashMap to LinkedHashMap for predictable ordering
  • 5fac5c8 - Added support for directly reading SRA runs
  • a5cc81a - Remove SRA group
  • 899453b - Write saved WARN messages to stderr instead of stdout
  • bed804d - Replace SAMFileReader with calls to SamReaderFactory
  • a48d36d - Assign correct ambiguity code for * allele
  • 496564f - Removed spanning deletions if the deletion was removed when subsetting
  • a37886f - Fix adapter boundary for positive strand in handling of overlapping read pairs
  • f6b18c8 - Fix issue where VCF files with "bcf" in the name were output to BCF
  • 4d6d207 - Replace VariantContextWriterFactory with VariantContextWriterBuilder
  • 6b8740e - Make exit system file type message generic


  • e6d34af - Make ReadPosRankSumTest.isUsableRead() account for deletions
  • e4786ed - Added regression test for genotyping of spanning deletions in GenotypeGVCFs
  • f709898 - Add integration test using -maxNumPLValues for GenotypeGVCFs
  • dfcec64 - Fix BetaTestingAnnotation group; add test
  • f7ff6b8 - Change a truth VCF in VQSR tests
  • d5ad2f0 - Make getElementForRead() in RankSumTest robust

Created 2016-06-01 10:58:20 | Updated 2016-12-12 15:02:12 |

GATK 3.6 was released on June 1, 2016. Itemized changes are listed below. For more details, see the user-friendly version highlights.

Variant calling features

  • HaplotypeCaller will now emit a no-call (./.) for any sample where GQ is zero, in both normal and GVCF modes, instead of emitting a specific genotype in which we have zero confidence.

  • GenotypeGVCFs will now emit a QUAL value for hom-ref sites when run in -allSites mode.

  • Implemented tracking of dropped reads by HaplotypeCaller and MuTect2 (see highlights for details).

  • Assorted optimizations to the joint calling code, expected to speed up genotyping (not the overall tool run) by about 10 percent.

  • Enabled MuTect2 to annotate all the same regular (non-AS) annotations as HaplotypeCaller on request.

Assorted new functionality

  • New ranksum annotations (allele-specific insert size and MQ of mate).
  • New -AS mode to run VQSR in an allele-specific manner (both VariantRecalibrator and ApplyRecalibration) (still experimental).
  • VariantRecalibrator can now output the recalibration model to a file (in GATKReport format — use the R library gsalib for reading).
  • Added ability to have VariantRecalibrator retry building the recalibration model if it fails initially. Meant as a workaround for runs on small datasets that fail randomly because the model isn't robust enough. Default behavior remains a single try. Contributed by @depristo / Mark DePristo.
  • ValidateVariants can now perform validation checks specific to GVCFs with the option --gvcf.
  • VariantsToTable now determines each allele's type when -F TYPE and -SMA are specified together.
  • LeftAlignAndTrimVariants now retains genotypes that remain valid after splitting with --splitMultiallelics (previously all were discarded).
  • SelectVariants can now select sites based on the number or fraction of samples that have no-call genotypes (./.) using --maxNOCALLnumber and --maxNOCALLfraction, respectively.
  • DepthOfCoverage now supports collecting coverage statistics for overlapping exons/genes. Contributed by @seru71 / Pawel Sztromwasser.
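
As a rough illustration of the no-call selection criteria listed above, the following Python sketch counts ./. genotypes at a site and applies a count limit and a fraction limit. The function and parameter names are hypothetical, and the comparison used (a site passes when the value is less than or equal to the limit) is an assumption rather than the documented behavior of SelectVariants.

```python
def is_no_call(gt: str) -> bool:
    """Return True if every allele in a genotype string such as './.' is missing."""
    alleles = gt.replace("|", "/").split("/")
    return all(a == "." for a in alleles)


def passes_no_call_limits(genotypes, max_nocall_number=None, max_nocall_fraction=None):
    """Keep a site only if its no-call count and fraction are within the limits.

    Assumption: a site passes when the value is <= the limit.
    """
    n_nocall = sum(is_no_call(gt) for gt in genotypes)
    if max_nocall_number is not None and n_nocall > max_nocall_number:
        return False
    if max_nocall_fraction is not None and genotypes:
        if n_nocall / len(genotypes) > max_nocall_fraction:
            return False
    return True


site = ["0/1", "./.", "0/0", "./."]  # two of four samples are no-calls
print(passes_no_call_limits(site, max_nocall_number=2))       # True
print(passes_no_call_limits(site, max_nocall_fraction=0.25))  # False
```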

Assorted bug fixes

  • Handling of allele depths when the NON_REF allele is non-zero (see highlights for details)
  • A sample ploidy check that may have minor performance implications
  • Threshold evaluation in the max alt alleles filter of MuTect2
  • MQ annotation calculation when processing BP resolution GVCFs
  • RankSum calculations on small sample sizes
  • PrintReads’ ability to emit a @PG header record
  • Writing GVCFs to stdout instead of to file
  • Order of column headers in sample_gene_summary reports output by DepthOfCoverage
  • MNP-merging behavior of ReadBackedPhasing: treatment of spanning deletions and consecutive SNPs
  • SelectVariants and VariantFiltration’s ability to update genotype summary annotations (AC, AN and AF)
  • Subsetting alleles from StrandAlleleCountsBySample annotation

Workarounds for weird sites

  • Added an argument to HaplotypeCaller and GenotypeGVCFs, -maxNumPLValues, that controls the maximum number of PL values that can be emitted for a given site. If the number of PLs resulting from the combination of observed alleles and ploidy exceeds this value, no PLs will be emitted. This will cause subsetting errors in SelectVariants but empowers the user to identify and work around difficult sites where this happens.

  • Extended the functionality of the engine-level argument --reference_window_stop to set the reference window size used by VariantAnnotator when annotating homopolymers through the HomopolymerRun annotation. This makes it possible to deal with the problem of homopolymer stretches that are longer than the default window size.
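
To get a feel for why a cap on PL values is useful, here is a small Python sketch of how the number of PLs grows with allele count and ploidy. It relies on the standard combinatorial count of unordered genotypes, C(A + P - 1, P) for A alleles (REF included) and ploidy P. The function names and the cap of 100 used in the example are illustrative assumptions, not GATK code or defaults.

```python
from math import comb


def num_pl_values(num_alleles: int, ploidy: int = 2) -> int:
    """Number of unordered genotypes (one PL each) for a site.

    num_alleles counts the reference allele plus all alternate alleles.
    """
    return comb(num_alleles + ploidy - 1, ploidy)


def should_emit_pls(num_alleles: int, ploidy: int, max_num_pl_values: int) -> bool:
    """Hypothetical check: emit PLs only if their count does not exceed the cap."""
    return num_pl_values(num_alleles, ploidy) <= max_num_pl_values


print(num_pl_values(2))             # 3   (0/0, 0/1, 1/1)
print(num_pl_values(7))             # 28  diploid, REF + 6 ALTs
print(num_pl_values(7, ploidy=4))   # 210 tetraploid, REF + 6 ALTs
print(should_emit_pls(7, 4, 100))   # False: 210 PLs exceed the illustrative cap
```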

Deleted functionality

  • Removed Phone Home usage tracking system (see highlights for details)
  • Deprecated GenotypeAndValidate tool which was massively outdated and had no unit or integration tests

Tools moved to the open-source core of GATK

  • IndelRealigner and RealignerTargetCreator
  • Post-IR MQ reverter filter
  • BQSRGatherer and its dependencies

Core / engine functionality

  • Enabled Java 8 support (see highlights for details)
  • Updated htsjdk & picard to version 2.4.1
  • Tweaks to the genome coordinates parsing system and contig names to support the Hg38 reference
  • Assorted improvements in the handling of errors, warnings and log output. The engine will now output a summary of WARN messages encountered during a run so you don’t have to parse the full log to see if anything worrying-but-not-fatal happened.


  • Expose time between checks for whether new jobs can be submitted as a user-settable parameter on the CLI. Useful when testing pipelines to make idle time shorter. Contributed by @dakl / Daniel Klevebring.

  • Remove mem_free from resident memory request params for Queue because it doesn't work and wouldn't actually reserve memory.

Tool documentation

  • Improvements and clarifications to many tool docs
  • Refreshed organization and naming of tool categories
  • Fixed display of default values for arguments
  • Switched default doc output to html to make the tool docs provided for nightly builds more readable

Created 2015-11-25 07:10:45 | Updated 2016-02-17 06:37:17 |

GATK 3.5 was released on November 25, 2015. Itemized changes are listed below. For more details, see the user-friendly version highlights.

New tools

  • MuTect2: somatic SNP and indel caller based on HaplotypeCaller and the original MuTect.
  • ContEst: estimation of cross-sample contamination (primarily for use in somatic variant discovery).
  • GatherBqsrReports: utility to gather recalibration tables from scatter-parallelized BaseRecalibrator runs.

Variant Context Annotations

  • Added allele-specific version of existing annotations: AS_BaseQualityRankSumTest, AS_FisherStrand, AS_MappingQualityRankSumTest, AS_RMSMappingQuality, AS_RankSumTest, AS_ReadPosRankSumTest, AS_StrandOddsRatio, AS_QualByDepth and AS_InbreedingCoeff.

  • Added BaseCountsBySample annotation. Intended to provide insight into the pileup of bases used by HaplotypeCaller in the calling process, which may differ from the pileup observed in the original bam file because of the local realignment and additional filtering performed internally by HaplotypeCaller. Can only be requested from HaplotypeCaller, not VariantAnnotator.

  • Added ExcessHet annotation. Estimates excess heterozygosity in a population of samples. Related to but distinct from InbreedingCoeff, which estimates evidence for inbreeding in a population. ExcessHet scales more reliably to large cohort sizes.

  • Added FractionInformativeReads annotation. Reports the fraction of reads that were considered informative by HaplotypeCaller (over all samples).

  • Enforced calculating GenotypeAnnotations before InfoFieldAnnotations. This ensures that the AD value is available to use in the QD calculation.

  • Reorganized standard annotation groups processing to ensure that all default annotations always get annotated regardless of what is specified on the command line. This fixes a bug where default annotations were getting dropped when the command line included annotation requests.

  • Made GenotypeGVCFs subset StrandAlleleCounts intelligently, i.e. subset the SAC values to the called alleles. Previously, when the StrandAlleleCountsBySample (SAC) annotation was present in GVCFs, GenotypeGVCFs carried it over to the final VCF essentially unchanged. This was problematic because SAC includes the counts for all alleles originally present (including NON_REF) even when some are not called in the final VCF. When the full list of original alleles is no longer available, parsing SAC could become difficult if not impossible.

  • Added new MQ jittering functionality to improve how VQSR handles MQ. Note that HaplotypeCaller now calculates a new annotation called RAW_MQ per-sample, which is then integrated per-cohort by GenotypeGVCFs to produce the MQ annotation.

  • VariantAnnotator can now annotate FILTER field from an external resource. Usage: --resource:foo resource.vcf --expression foo.FILTER

  • VariantAnnotator can now check allele concordance when annotating with an external resource. Usage: --resourceAlleleConcordance

  • Bug fix: The annotation framework was improved to allow for the collection of sufficient statistics during GVCF creation, which are then used to compute the final annotation during genotyping. This avoids the use of the median as the representative annotation from the collection of values (one from each sample). TL;DR: annotations will be more accurate when using the GVCF workflow for joint discovery.

Variant manipulation tools

  • Allowed overriding hard-coded cutoff for allele length in ValidateVariants and in LeftAlignAndTrimVariants. Usage: --reference_window_stop N where N is the desired cutoff.

  • Also in LeftAlignAndTrimVariants, trimming multiallelic alleles is now the default behavior.

  • Fixed ability to mask out SNPs with --snpmask in FastaAlternateReferenceMaker.

  • Also in FastaAlternateReferenceMaker, fixed the merging of contiguous intervals, and made the tool produce more informative contig names.

  • Fixed a bug in CombineVariants that occurred when one record has a spanning deletion and needs a padded reference allele.

  • Added a new VariantEval evaluation module, MetricsCollection, that summarizes metrics from several EV modules.

  • Enabled family-level stratification in MendelianViolationEvaluator of VariantEval (if a ped file is provided), making it possible to count Mendelian violations for each family in a callset with multiple families.

  • Added the ability for SelectVariants to enforce output that complies with version 4.2 of the VCF spec when processing older files. Use case: the 4.2 spec specifies that GQ must be an integer; by default we don’t enforce it (so if reading an older file that used decimals, we don’t change it) but the new argument --forceValidOutput converts the values on request. Not made default because of some performance slowdown -- so writing VCFs is now fast by default, compliant by choice.

  • Improved VCF sequence dictionary validation. Note that as a side effect of the additional checks, some users have experienced an error that starts with "ERROR MESSAGE: Lexicographically sorted human genome sequence detected in variant." that is due to unintentional activation of a check that is not necessary. This will be fixed in the next release; in the meantime -U ALLOW_SEQ_DICT_INCOMPATIBILITY can be used (with caution) to override the check.

GVCF tools

  • Various improvements to the tools’ performance, especially HaplotypeCaller, by making the code more efficient and cutting out crud.

  • GenotypeGVCFs now emits a no-call (./.) when the evidence is too ambiguous to make a call at all (e.g. all the PLs are zero). Previously this would have led to a hom-ref call with RGQ=0.

  • Fixed a bug in GenotypeGVCFs that sometimes generated invalid VCFs for haploid callsets. The tool was carrying over the AD from alleles that had been trimmed out, causing field length mismatches.

  • Changed the genotyping implementation for haploid organisms to address performance problems reported when running GenotypeGVCFs on haploid callsets. Note that this change may lead to a slight loss of sensitivity at low-coverage sites -- let us know if you observe anything dramatic.
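
The no-call behavior described above for ambiguous evidence can be pictured with a short Python sketch. This is a simplified illustration under stated assumptions, not the GenotypeGVCFs implementation: the function name is hypothetical, and "too ambiguous" is reduced to the single case mentioned above (all PLs are zero).

```python
def call_genotype(pls, genotypes):
    """Pick the genotype with the lowest PL, or a no-call if the PLs are uninformative.

    pls and genotypes are parallel lists, e.g. PLs for 0/0, 0/1, 1/1.
    """
    if not pls or all(pl == 0 for pl in pls):
        # No genotype is favored over any other: emit a no-call
        # rather than a hom-ref call with zero confidence.
        return "./."
    best = min(range(len(pls)), key=lambda i: pls[i])
    return genotypes[best]


diploid = ["0/0", "0/1", "1/1"]
print(call_genotype([0, 0, 0], diploid))     # ./.
print(call_genotype([35, 0, 120], diploid))  # 0/1
```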

Genotyping engine tweaks

  • Ensured inputPriors get used if they are specified to the genotyper (previously they were ignored). Also improved docs on --heterozygosity and --indel_heterozygosity priors.

  • Fixed bug that affected the --ignoreInputSamples behavior of CalculateGenotypePosteriors.

  • Limited emission of the scary warning message about max number of alleles (“this tool is set to genotype at most x alleles but we found more; only x will be used”) to a single occurrence unless DEBUG logging mode is activated. Otherwise it fills up our output logs.

Miscellaneous tool fixes

  • Added option to OverclippedReadFilter to not require soft-clips on both ends. Contributed by Jacob Silterra.

  • Fixed a bug in IndelRealigner where the tool was incorrectly "fixing" mates when supplementary alignments are present. The patch involves ignoring supplementary alignments.

  • Fixed a bug in CatVariants. Previously, VCF files were being sorted solely on the base pair position of the first record, ignoring the chromosome. This can become problematic when merging files from different chromosomes, especially if you have multiple VCFs per chromosome. Contributed by John Wallace.
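
The CatVariants ordering problem described above is easy to reproduce in miniature. The Python sketch below compares sorting input files by the position of their first record alone against sorting by contig and then position. The file names, records and contig order are made up for illustration; in practice the contig order would come from the reference sequence dictionary.

```python
def sort_key(record, contig_order):
    """Order records by contig index first, then by position within the contig."""
    contig, pos = record
    return (contig_order[contig], pos)


# Hypothetical inputs: file name -> (contig, position) of the first record.
contig_order = {"chr1": 0, "chr2": 1}
first_records = {
    "b.vcf": ("chr2", 100),
    "a.vcf": ("chr1", 5000),
    "c.vcf": ("chr1", 200),
}

# Old behavior: position only, so a chr2 file can land before chr1 files.
by_pos_only = sorted(first_records, key=lambda f: first_records[f][1])

# Fixed behavior: contig first, then position.
by_contig_then_pos = sorted(
    first_records, key=lambda f: sort_key(first_records[f], contig_order)
)

print(by_pos_only)         # ['b.vcf', 'c.vcf', 'a.vcf']
print(by_contig_then_pos)  # ['c.vcf', 'a.vcf', 'b.vcf']
```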

Engine-level behaviors and capabilities

  • Support for reading and writing CRAM files. Some improvements are still expected in htsjdk. Contributed by Vadim Zalunin at EBI and collaborators at the Sanger Institute.

  • Made interval-list output format dependent on the file extension (for RealignerTargetCreator). If the extension is .interval_list, output will be formatted as a proper Picard interval list (with sequence dictionary). Otherwise it will be a basic GATK interval list as previously.

  • Added static binning capability for base recalibration (BQSR).


  • Added a new JobRunner called ParallelShell that will run jobs locally on one node concurrently as specified by the DAG, with the option to limit the maximum number of concurrently running jobs using the flag maximumNumberOfJobsToRunConcurrently. Contributed by Johan Dahlberg.

  • Updated extension for Picard CalculateHsMetrics to include PER_TARGET_COVERAGE argument and added extension for Picard CollectWgsMetrics.

Deprecation notice


  • BeagleOutputToVCF, VariantsToBeagleUnphased, ProduceBeagleInput. These are tools for handling Beagle data. The latest versions of Beagle support VCF input and output, so there is no longer any reason for us to provide converters.
  • ReadAdaptorTrimmer and VariantValidationAssessor. These were experimental tools which we think are not useful and not operating on a sufficiently sound basis.
  • BaseCoverageDistribution and CoveredByNSamplesSites. These tools were redundant with DiagnoseTargets and/or DepthOfCoverage.
  • LiftOverVariants and FilterLiftedVariants. The Picard liftover tool LiftoverVCF works better and is easier to operate.
  • Use Picard SortVCF instead.
  • ListAnnotations. This was intended as a utility for listing annotations easily from command line, but it has not proved useful.


  • Made various documentation improvements.
  • Updated date and street address in license text.
  • Moved htsjdk & picard to version 1.141

Created 2015-05-15 04:52:05 | Updated 2015-11-25 07:08:50 |

GATK 3.4 was released on May 15, 2015. Itemized changes are listed below. For more details, see the user-friendly version highlights.

New tool

  • ASEReadCounter: A tool to count read depth in a way that is appropriate for allele specific expression (ASE) analysis. It counts the number of reads that support the REF allele and the ALT allele, filtering low qual reads and bases and keeping only properly paired reads. See Highlights for more details.

HaplotypeCaller & GenotypeGVCFs

  • Important fix for genotyping positions over spanning deletions. Previously, if a SNP occurred in sample A at a position that was in the middle of a deletion for sample B, sample B would be genotyped as homozygous reference there (but it's NOT reference - there's a deletion). Now, sample B is genotyped as having a symbolic DEL allele. See Highlights for more details.
  • Deprecated --mergeVariantsViaLD argument in HaplotypeCaller since it didn’t work. To merge complex substitutions, use ReadBackedPhasing as a post-processing step.
  • Removed exclusion of MappingQualityZero, SpanningDeletions and TandemRepeatAnnotation from the list of annotators that cannot be annotated by HaplotypeCaller. These annotations are still not recommended for use with HaplotypeCaller, but this is no longer enforced by a hardcoded ban.
  • Clamp the HMM window starting coordinate to 1 instead of 0 (contributed by nsubtil).
  • Fixed the implementation of allowNonUniqueKmersInRef so that it applies to all kmer sizes. This resolves some assembly issues in low-complexity sequence contexts and improves calling sensitivity in those regions.
  • Initialize annotations so that --disableDithering actually works.
  • Automatic selection of indexing strategy based on .g.vcf file extension. See Highlights for more details.
  • Removed normalization of QD based on length for indels. Length-based normalization is now only applied if the annotation is calculated in UnifiedGenotyper.
  • Added the RGQ (Reference Genotype Quality) FORMAT annotation to monomorphic sites in the VCF output of GenotypeGVCFs. Now, instead of stripping out the GQs for monomorphic hom-ref sites, we transfer them to the RGQ. This is extremely useful for people who want to know how confident the hom-ref genotype calls are. See Highlights for more details.
  • Removed GenotypeSummaries from default annotations.
  • Added -uniquifySamples to GenotypeGVCFs to make it possible to genotype together two different datasets containing the same sample.
  • Disallow changing -dcov setting for HaplotypeCaller (pending a fix to the downsampling control system) to prevent buggy behavior. See Highlights for more details.
  • Raised per-sample limits on the number of reads in ART and HC. Active Region Traversal was using per sample limits on the number of reads that were too low, especially now that we are running one sample at a time. This caused issues with high confidence variants being dropped in high coverage data.
  • Removed explicit limitation (20) of the maximum ploidy of the reference-confidence model. Previously there was a fixed-size maximum ploidy indel RCM likelihood cache; this was changed to a dynamically resizable one. There are still some de facto limitations which can be worked around by lowering the max alt alleles parameter.
  • Made GQ of Hom-Ref Blocks in GVCF output be consistent with PLs.
  • Fixed a bug where HC was not realigning against the reference but against the best haplotype for the read.
  • Fixed a bug (in HTSJDK) that was causing GenotypeGVCFs to choke on sites with large numbers of alternate alleles (>140).
  • Modified the way GVCFBlock header lines are named because the new HTSJDK version disallows duplicate header keys (aside from special-cased keys such as INFO and FORMAT).


  • Added option to break blocks at every N sites. Using --breakBandsAtMultiplesOf N will ensure that no reference blocks span across genomic positions that are multiples of N. This is especially important in the case of scatter-gather where you don't want your scatter intervals to start in the middle of blocks (because of a limitation in the way -L works in the GATK for VCF records with the END tag). See Highlights for more details.
  • Fixed a bug that caused the tool to stop processing after the first contig.
  • Fixed a bug where the wrong REF allele was output to the combined gVCF.
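
The block-breaking option described above can be sketched in a few lines of Python. The sketch splits one reference block so that a new block starts at every position that is a multiple of N. That boundary convention, along with the function name, is an assumption made for illustration; the tool's exact handling of boundaries may differ.

```python
def split_block(start: int, end: int, n: int):
    """Split the inclusive interval [start, end] at multiples of n.

    Assumption: each multiple of n becomes the first position of a new block.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    blocks = []
    cur = start
    while cur <= end:
        next_break = (cur // n + 1) * n  # first multiple of n strictly after cur
        stop = min(end, next_break - 1)
        blocks.append((cur, stop))
        cur = stop + 1
    return blocks


print(split_block(95, 210, 100))  # [(95, 99), (100, 199), (200, 210)]
```

With blocks cut this way, scatter intervals that begin at multiples of N never start in the middle of a reference block.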


  • Switched VQSR tranches plot ordering rule (ordering is now based on tranche sensitivity instead of novel titv).
  • VQSR VCF header command line now contains annotations and tranche levels.


  • Added -trim argument to trim (simplify) alleles to a minimal representation.
  • Added -trimAlternates argument to remove all unused alternate alleles from variants. Note that this is pretty aggressive for monomorphic sites.
  • Changed the default behavior to trim (remove) remaining alleles when samples are subset, and added the -noTrim argument to preserve original alleles.
  • Added --keepOriginalDP argument.


  • Improvements to the allele trimming functionalities.
  • Added functionality to support multi-allelic sites when annotating a VCF with annotations from another callset. See Highlights for more details.


  • Fixed user-reported bug featuring "trio" family with two children, one parent.
  • Added error handling for genotypes that are called but have no PLs.

Various tools

  • BQSR: Fixed an issue where GATK would skip the entire read if a SNP is entirely contained within a sequencing adapter (contributed by nsubtil); and improved how uncommon platforms (as encoded in RG:PL tag) are handled.
  • DepthOfCoverage: Now logs a warning if incompatible arguments are specified.
  • SplitSamFile: Fixed a bug that caused a NullPointerException.
  • SplitNCigarReads: Fixed issue to make -fixNDN flag fully functional.
  • IndelRealigner: Fixed an issue that was due to reads that have an incorrect CIGAR length.
  • CombineVariants: Minor change to an error check that was put into 3.3 so that identical samples don't need -genotypeMergeOption.
  • VariantsToBinaryPED: Corrected swap between mother and father in PED file output.
  • GenotypeConcordance: Monomorphic sites in the truth set are no longer called "Mismatching Alleles" when the comp genotype has an alternate allele.
  • ReadBackedPhasing: Fixed a couple of bugs in MNP merging.
  • CatVariants: Now allows different input / output file types, and spaces in directory names.
  • VariantsToTable: Fixed a bug that affected the output of the FORMAT record lists when -SMA is specified. Note that FORMAT fields behave the same as INFO fields - if the annotation has a count of A (one entry per Alt Allele), it is split across the multiple output lines. Otherwise, the entire list is output with each field.
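
The splitting rule described above for VariantsToTable with -SMA can be illustrated with a small Python sketch: fields with one entry per alternate allele are distributed across the output lines, while every other field is repeated in full on each line. The function name and the way per-allele fields are identified (an explicit set of keys) are hypothetical; the real tool reads the count from the VCF header.

```python
def split_multi_allelic(alts, fields, per_alt_keys):
    """Produce one output row per alternate allele.

    fields maps an annotation name to its list of values; keys listed in
    per_alt_keys have exactly one value per alternate allele.
    """
    rows = []
    for i, alt in enumerate(alts):
        row = {"ALT": alt}
        for key, values in fields.items():
            if key in per_alt_keys:
                row[key] = values[i]  # one entry per ALT: split across rows
            else:
                row[key] = ",".join(str(v) for v in values)  # whole list on every row
        rows.append(row)
    return rows


rows = split_multi_allelic(
    alts=["T", "G"],
    fields={"AC": [3, 1], "AD": [10, 4, 2]},
    per_alt_keys={"AC"},
)
for row in rows:
    print(row)
# {'ALT': 'T', 'AC': 3, 'AD': '10,4,2'}
# {'ALT': 'G', 'AC': 1, 'AD': '10,4,2'}
```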

Read Filters

  • Added erroneous CIGAR length to criteria for BadCigarFilter.
  • Corrected logical expression in MateSameStrandFilter (contributed by user seru71).
  • Handle X and = CIGAR operators appropriately
  • Added -drf argument to disable default read filters. Limited to specific tools and specific filters (currently only DuplicateReadFilter).


  • Calculate StrandBiasBySample using all alternate alleles as “REF vs. any ALT”.
  • Modified InbreedingCoeff so that it works when genotyping uniquified samples (see GenotypeGVCFs changes).
  • Changed GC Content value type from Integer to Float.
  • Added StrandAlleleCountsBySample annotation. This annotation outputs the number of reads supporting each allele, stratified by sample and read strand; callable from HaplotypeCaller only.
  • Made annotators emit a warning if they can't be applied.

GATK Engine & common features

  • Fixed logging of 'out' command line parameter in VCF headers; changed []-type arrays to lists so argument parsing works in VCF header commandline output.
  • Modified GATK command line header for unique keys. The GATK command line header keys were being repeated in the VCF and subsequently lost to a single key value by HTSJDK. This resolves the issue by appending the name of the walker after the text "GATKCommandLine" and a number after that if the same walker was used more than once in the form: GATKCommandLine.(walker name) for the first occurrence of the walker, and GATKCommandLine.(walker name).# where # is the number of the occurrence of the walker (e.g. GATKCommandLine.SomeWalker.2 for the second occurrence of SomeWalker).
  • Handle X and = CIGAR operators appropriately.
  • Added barebones read/write CRAM support (no interval seeking!). See Highlights for more details.
  • Cleaned up logging outputs / streams; messages (including HMM log messages) that were going to stdout now go to stderr.
  • Improved error messages; when an error is related to a specific file, the engine now includes the file name in the error message.
  • Fixed BCF writing when FORMAT annotations contain arrays.
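
The header key naming scheme described above for GATKCommandLine entries can be sketched as follows. This Python snippet only mirrors the naming pattern given in the text; the function name is hypothetical and the walker names are examples.

```python
from collections import Counter


def command_line_keys(walkers):
    """Build a unique header key for each walker invocation, in order.

    First occurrence:  GATKCommandLine.<walker>
    Later occurrences: GATKCommandLine.<walker>.<n>, with n starting at 2.
    """
    seen = Counter()
    keys = []
    for walker in walkers:
        seen[walker] += 1
        suffix = "" if seen[walker] == 1 else f".{seen[walker]}"
        keys.append(f"GATKCommandLine.{walker}{suffix}")
    return keys


print(command_line_keys(["HaplotypeCaller", "SelectVariants", "SelectVariants"]))
# ['GATKCommandLine.HaplotypeCaller', 'GATKCommandLine.SelectVariants', 'GATKCommandLine.SelectVariants.2']
```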


  • Added -qsub-broad argument. When -qsub-broad is specified instead of -qsub, Queue will use the h_vmem parameter instead of h_rss to specify memory limit requests. This was done to accommodate changes to the Broad’s internal job scheduler. Also causes the GridEngine native arguments to be output by default to the logger, instead of only when in debug mode.
  • Fixed the scala wrapper for Picard MarkDuplicates (needed because MarkDuplicates was moved to a different package within Picard).
  • Added optional element "includeUnmapped" to the PartitionBy annotation. The value of this element (default true) determines whether Queue will explicitly run this walker over unmapped reads. This patch fixes a runtime error when FindCoveredIntervals was used with Queue.


  • Plentiful enhancements and fixes to various tool docs, especially annotations and read filters.

For developers

  • Upgraded SLF4J to allow new convenient logging syntaxes.
  • Patched maven pom file for slf4j-log4j12 version (contributed by user Biocyberman).
  • Updated HTSJDK version (now pulling it in from Maven Central); various edits made to match.
  • Collected VCF IDs and header lines into one place (GATKVCFConstants).
  • Made various changes that lead to reduced build times.

Created 2014-10-23 18:53:52 | Updated 2015-05-12 17:24:14 |

GATK 3.3 was released on October 23, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.

Haplotype Caller

  • Improved the accuracy of dangling head merging in the HC assembler (now enabled by default).
  • Physical phasing information is output by default in new sample-level PID and PGT tags.
  • Added the --sample_name argument. This is a shortcut for people who have multi-sample BAMs but would like to use -ERC GVCF mode with a particular one of those samples.
  • Support added for generalized ploidy. The global ploidy is specified with the -ploidy argument.
  • Fixed IndexOutOfBounds error associated with tail merging.

Variant Recalibrator

  • New --ignore_all_filters option. If specified, the variant recalibrator will ignore all input filters and treat sites as unfiltered.


  • Support added for generalized ploidy. The global ploidy is specified with the -ploidy argument.
  • Bug fix for the case when we assumed ADs were in the same order if the number of alleles matched.
  • Changed the default GVCF GQ Bands from 5,20,60 to 1 to 60 by 1s, 60 to 90 by 10s, and 99, in order to give finer resolution.
  • Bug fix in the exact model when calling multi-allelic variants. QUAL field is now more accurate.
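
For reference, the new default GQ band boundaries described above can be written out with a short Python sketch. The expansion below (every integer from 1 through 60, then 70, 80, 90, and finally 99) is one reading of the description in the text; the function name is hypothetical.

```python
def default_gq_bands():
    """Expand the described defaults: 1 to 60 by 1s, 60 to 90 by 10s, and 99."""
    ones = list(range(1, 61))        # 1, 2, ..., 60
    tens = list(range(70, 91, 10))   # 70, 80, 90 (60 is already included above)
    return ones + tens + [99]


old_bands = [5, 20, 60]
new_bands = default_gq_bands()
print(len(old_bands), len(new_bands))  # 3 64
print(new_bands[-5:])                  # [60, 70, 80, 90, 99]
```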

RNAseq analysis

  • Bug fixes for working with unmapped reads.


  • New annotation for low- and high-confidence possible de novos (only annotates biallelics).
  • FamilyLikelihoodsUtils now add joint likelihood and joint posterior annotations.
  • Restricted population priors based on discovered allele count to be valid for 10 or more samples.


  • Fixed rare bug triggered by hash collision between sample names.


  • Updated the --keepOriginalAC functionality in SelectVariants to work for sites that lose alleles in the selection.


  • Read groups that are excluded by sample_name, platform, or read_group arguments no longer appear in the header.
  • The performance penalty associated with filtering by read group has been essentially eliminated.


  • StrandOddsRatio is now a standard annotation that is output by default.
  • We used to output zero for FS if there was no data available at a site; now we omit FS.
  • Extensive rewrite of the annotation documentation.


  • Fixed Queue bug with bad localhost addresses.
  • Fixed issue related to spaces in job names that were fine in GridEngine 6 but break in (Son of) GE8.
  • Improved scatter contigs algorithm to be fairer when splitting many contigs into few parts (contributed by @smowton)


  • We now generate PHP files instead of HTML.
  • We now output a JSON version of the tool documentation that can be used to generate wrappers for GATK commands.


  • Output arguments --no_cmdline_in_header, --sites_only, and --bcf for VCF files, and --bam_compression, --simplifyBAM, --disable_bam_indexing, and --generate_md5 for BAM files moved to the engine level.
  • htsjdk updated to version 1.120.1620

Created 2014-07-15 03:54:06 | Updated 2014-10-23 17:58:36 |

GATK 3.2 was released on July 14, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.

We also want to take this opportunity to thank super-user Phillip Dexheimer for all of his excellent contributions to the codebase, especially for this release.

Haplotype Caller

  • Various improvements were made to the assembly engine and likelihood calculation, leading to more accurate genotype likelihoods (and hence better genotypes).
  • Reads are now realigned to the most likely haplotype before being used by the annotations, so AD and DP will now correspond directly to the reads that were used to generate the likelihoods.
  • The caller is now more conservative in low complexity regions, which significantly reduces false positive indels at the expense of a little sensitivity; mostly relevant for whole genome calling.
  • Small performance optimizations to the function to calculate the log of exponentials and to the Smith-Waterman code (thanks to Nigel Delaney).
  • Fixed small bug where indel discovery was inconsistent based on the active-region size.
  • Removed scary warning messages for "VectorPairHMM".
  • Made VECTOR_LOGLESS_CACHING the default implementation for PairHMM.
  • When we subset PLs because alleles are removed during genotyping we now also subset the AD.
  • Fixed bug where reference sample depth was dropped in the DP annotation.

Variant Recalibrator

  • The -mode argument is now required.
  • The plotting script now uses theme() instead of the opts() function to work with recent versions of the ggplot2 R library.


  • The plotting script now uses theme() instead of the opts() function to work with recent versions of the ggplot2 R library.

Variant Annotator

  • SB tables are created even if the ref or alt columns have no counts (used in the FS and SOR annotations).

Genotype GVCFs

  • Added missing arguments so that now it models more closely what's available in the Haplotype Caller.
  • Fixed recurring error about missing PLs.
  • No longer pulls the headers from all input rods including dbSNP, rather just from the input variants.
  • --includeNonVariantSites should now be working.

Select Variants

  • The dreaded "Invalid JEXL expression detected" error is now a kinder user error.

Indel Realigner

  • Now throws a user error when it encounters reads with I operators greater than the number of read bases.
  • Fixed bug where reads that are all insertions (e.g. 50I) were causing it to fail.

Calculate Genotype Posteriors

  • Now computes posterior probabilities only for SNP sites with SNP priors (other sites have flat priors applied).
  • Now computes genotype posteriors using likelihoods from all members of the trio.
  • Added annotations for calling potential de novo mutations.
  • Now uses PP tag instead of GP tag because posteriors are Phred-scaled.
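
As a rough illustration of what "Phred-scaled posteriors" means, the sketch below combines genotype likelihoods (given as PLs) with per-genotype priors and reports the result on the Phred scale, i.e. -10 * log10(probability). The function name, the prior format, and the rounding are assumptions made for the example; this is not the tool's actual implementation.

```python
import math


def phred_scaled_posteriors(pls, priors):
    """Turn Phred-scaled likelihoods plus priors into Phred-scaled posteriors.

    `pls` and `priors` are parallel lists with one entry per genotype.
    """
    # Undo the Phred scaling to get (unnormalized) likelihoods.
    likelihoods = [10 ** (-pl / 10.0) for pl in pls]
    # Bayes' rule: posterior is proportional to likelihood * prior.
    weighted = [lik * prior for lik, prior in zip(likelihoods, priors)]
    total = sum(weighted)
    posteriors = [w / total for w in weighted]
    # Back to the Phred scale, rounded to integers like PLs.
    return [int(round(-10.0 * math.log10(p))) for p in posteriors]
```

With flat priors the output matches the input PLs (up to rounding), while an informative prior can change which genotype gets the lowest, i.e. best, value.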

Cat Variants

  • Can now process .list files with -V.
  • Can now handle BCF and Block-Compressed VCF files.

Validate Variants

  • Now works with gVCF files.
  • By default, all strict validations are performed; use --validationTypeToExclude to exclude specific tests.

Fasta Alternate Reference Maker

  • Now use '--use_IUPAC_sample sample_name' to specify which sample's genotypes should be used for the IUPAC encoding with multi-sample VCF files.


  • Refactored maven directories and java packages replacing "sting" with "gatk".
  • Extended on-the-fly sample renaming feature to VCFs with the --sample_rename_mapping_file argument.
  • Added a new read transformer that refactors NDN cigar elements to one N element.
  • Now a Tabix index is created for block-compressed output formats.
  • Switched outputRoot in SplitSamFile to an empty string instead of null (thanks to Carlos Barroto).
  • Enabled the AB annotation in the reference model pipeline (thanks to John Wallace).
  • We now check that output files are specified in a writeable location.
  • We now allow blank lines in a (non-BAM) list file.
  • Added legibility improvements to the Progress Meter.
  • Allow for non-tab whitespace in sample names when performing on-the-fly sample-renaming (thanks to Mike McCowan).
  • Made IntervalSharder respect the IntervalMergingRule specified on the command line.
  • Sam, tribble, and variant jars updated to version 1.109.1722; htsjdk updated to version 1.112.1452.
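
The NDN read transformer mentioned in the list above can be pictured with a small sketch. It merges every N-D-N run in a CIGAR string into a single N element whose length is the sum of the three; because both N and D consume reference bases, the reference span of the read is unchanged. The function name and the string-based interface are hypothetical; the real transformer operates on the GATK's read objects.

```python
import re

# One CIGAR element: a length followed by an operator from the SAM spec.
_CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def collapse_ndn(cigar):
    """Replace each N-D-N run in a CIGAR string with one N element."""
    out = []
    for length, op in _CIGAR_ELEMENT.findall(cigar):
        out.append([int(length), op])
        # Merge as soon as the last three elements read N, D, N; the merged
        # N can itself start another N-D-N run, hence the loop.
        while len(out) >= 3 and [e[1] for e in out[-3:]] == ["N", "D", "N"]:
            total = sum(e[0] for e in out[-3:])
            del out[-3:]
            out.append([total, "N"])
    return "".join(f"{n}{op}" for n, op in out)
```

For instance, `collapse_ndn("10M5N2D3N10M")` gives `"10M10N10M"`.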

Created 2014-03-17 16:52:43 | Updated 2014-03-19 15:13:51 |

Comments (0)

GATK 3.1 was released on March 18, 2014. Highlights are listed below. Read the detailed version history overview here:

Haplotype Caller

  • Added new capabilities to the Haplotype Caller to use hardware-based optimizations. Can be enabled with --pair_hmm_implementation VECTOR_LOGLESS_CACHING. Please see the 3.1 Version Highlights for more details about expected speed ups and some background on the collaboration that made these possible.
  • Fixed bugs in computing the weights of edges in the assembly graph. This was causing bad genotypes to be output when running the Haplotype Caller over multiple samples simultaneously (as opposed to creating gVCFs in the new recommended pipeline, which was working as expected).

Variant Recalibrator

  • Fixed issue where output could be non-deterministic with very large data sets.


  • Fixed several bugs where bad input was causing the tool to crash instead of gracefully exiting with an error message.


  • RandomlySplitVariants can now output splits comprised of more than 2 output files.
  • FastaAlternateReferenceMaker can now output heterozygous sites using IUPAC ambiguity encoding.
  • Picard, Tribble, and Variant jars updated to version 1.109.1722.
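
To make the IUPAC ambiguity encoding of heterozygous sites concrete, here is a minimal sketch that maps the two alleles of a SNP genotype to the standard IUPAC nucleotide code. The function name is hypothetical, only single-base alleles are handled, and this is not FastaAlternateReferenceMaker's own code.

```python
# Standard IUPAC codes for the six two-base combinations.
IUPAC_HET = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def iupac_code(allele1, allele2):
    """Return the base to write for a diploid SNP genotype."""
    a, b = allele1.upper(), allele2.upper()
    if a == b:
        # Homozygous: no ambiguity, write the base itself.
        return a
    return IUPAC_HET[frozenset((a, b))]
```

An A/G heterozygote is therefore written as R regardless of allele order.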

Created 2014-03-04 06:37:01 | Updated 2014-03-17 22:48:07 |

Comments (2)

GATK 3.0 was released on March 5, 2014. Highlights are listed below. Read the detailed version history overview here:

One important change for those who prefer to build from source is that we now use maven instead of ant. See the relevant documentation for building the GATK with our new build system.

Split N Cigar Reads

  • This is a new GATK tool to be used for variant calling in RNA-seq data. Its purpose is to split reads that contain N Cigar operators (due to a limitation in the GATK that we will eventually handle internally) and to trim (and generally clean up) imperfect alignments.

Haplotype Caller

  • Fixed bug where dangling tail merging in the assembly graph occasionally created a cycle.
  • Added experimental code to retrieve dangling heads in the assembly graph, which is needed for calling variants in RNA-seq data.
  • Generally improved gVCF output by making it more accurate. This includes many updates so that the single sample gVCFs can be accurately genotyped together by GenotypeGVCFs.
  • Fixed a bug in the PairHMM class where the transition probability was miscalculated resulting in probabilities larger than 1.
  • Fixed bug in the function to find the best paths from an alignment graph which was causing bad genotypes to be emitted when running with multiple samples together.

Combine GVCFs

  • This is a new GATK tool to be used in the Haplotype Caller pipeline with large cohorts. Its purpose is to combine any number of gVCF files into a single merged gVCF. One would use this tool for hierarchical merges of the data when there are too many samples in the project to throw them all at once at GenotypeGVCFs.

Genotype GVCFs

  • This is a new GATK tool to be used in the Haplotype Caller pipeline. Its purpose is to take any number of gVCF files and to genotype them in order to produce a VCF with raw SNP and indel calls.

Simulate Reads For Variants

  • This is a new GATK tool that might be useful to some. Given a VCF file, this tool will generate simulated reads that support the variants present in the file.

Unified Genotyper

  • Fixed bug when clipping long reads in the HMM; some reads were incorrectly getting clipped.

Variant Recalibrator

  • Added the capability to pass in a single file containing a list of VCFs (must end in ".list") instead of having to enumerate all of the files on the command-line. Duplicate entries are not allowed in the list (but the same file can be present in separate lists).

Reduce Reads

  • Removed from the GATK. It was a valiant attempt, but ultimately we found a better way to process large cohorts. Reduced BAMs are no longer supported in the GATK.

Variant Annotator

  • Improved the FisherStrand (FS) calculation when used in large cohorts. When the table gets too large, we normalize it down to values that are more reasonable. Also, we don't include a particular sample's contribution unless we observe both ref and alt counts for it. We expect to improve on this even further in a future release.
  • Improved the QualByDepth (QD) calculation when used in large cohorts. Now, when the AD annotation is present for a given genotype, we only use its depth for QD if the variant depth > 1. Note that this only works in the gVCF pipeline for now.
  • In addition, fixed the normalization for indels in QD (which was over-penalizing larger events).

Combine Variants

  • Added the capability to pass in a single file containing a list of VCFs (must end in ".list") instead of having to enumerate all of the files on the command-line. Duplicate entries are not allowed in the list (but the same file can be present in separate lists).

Select Variants

  • Fixed a huge bug where selecting out a subset of samples while using multi-threading (-nt) caused genotype-level fields (e.g. AD) to get swapped among samples. This was a bad one.
  • Fixed a bug where selecting out a subset of samples at multi-allelic sites occasionally caused the alternate alleles to be re-ordered but the AD values were not updated accordingly.


  • Fixed bug where it wasn't checking for underflow and occasionally produced bad likelihoods.
  • It no longer strips out the AD annotation from genotypes.
  • AC/AF/AN counts are updated after fixing genotypes.
  • Updated to handle cases where the AC (and MLEAC) annotations are not good (e.g. they are greater than AN somehow).

Indel Realigner

  • Fixed bug where a realigned read can sometimes get partially aligned off the end of the contig.

Read Backed Phasing

  • Updated the tool to use the VCF 4.1 framework for phasing; it now uses HP tags instead of '|' to convey phase information.


  • Thanks to Phillip Dexheimer for several Queue related fixes and patches.
  • Thanks to Nicholas Clarke for patches to the timer which occasionally had negative elapsed times.
  • Providing an empty BAM list now results in a user error.
  • Fixed a bug in the gVCF writer where it was dropping the first few reference blocks at the beginnings of all but the first chromosome. Also, several unnecessary INFO field annotations were dropped from the output.
  • Logger output now goes to STDERR instead of STDOUT.
  • Picard, Tribble, and Variant jars updated to version 1.107.1683.

Created 2013-12-06 19:04:39 | Updated 2014-03-17 22:00:14 |

Comments (2)

GATK 2.8 was released on December 6, 2013. Highlights are listed below. Read the detailed version history overview here:

Note that this release is relatively smaller than previous ones. We are working hard on some new tools and frameworks that we are hoping to make available to everyone for our next release.

Unified Genotyper

  • Fixed bug where indels in very long reads were sometimes being ignored and not used by the caller.

Haplotype Caller

  • Improved the indexing scheme for gVCF outputs using the reference calculation model.
  • The reference calculation model now works with reduced reads.
  • Fixed bug where an error was being generated at certain homozygous reference sites because the whole assembly graph was getting pruned away.
  • Fixed bug for homozygous reference records that aren't GVCF blocks and were being treated incorrectly.

Variant Recalibrator

  • Disable tranche plots in INDEL mode.
  • Various VQSR optimizations in both runtime and accuracy. In particular: for very large whole-genome datasets with over 2M variants overlapping the training data, the training set used to build the model is randomly downsampled; annotations are now ordered by the difference in means between known and novel variants instead of by their standard deviation; the training set quality score threshold has been removed; 2 Gaussians are now used by default for the negative model; and the numBad argument has been removed, with the cutoffs now chosen by the model itself based on the LOD scores.

Reduce Reads

  • Fixed bug where mapping quality was being treated as a byte instead of an int, which caused high MQs to be treated as negative.
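
The reason a byte turns high mapping qualities negative is that a Java byte is a signed 8-bit value covering -128 to 127. The short Python sketch below mimics that wrap-around; the helper name is made up for the example and is not part of the GATK.

```python
def as_java_byte(value):
    """Interpret the low 8 bits of `value` as a signed Java byte would."""
    return ((value + 128) % 256) - 128


# Mapping qualities up to 127 survive the round trip; anything higher wraps
# around to a negative number.
for mq in (60, 127, 128, 200, 255):
    print(mq, "->", as_java_byte(mq))
```

A mapping quality of 200, for example, comes back as -56, and 255 comes back as -1.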

Diagnose Targets

  • Added calculation for GC content.
  • Added an option to filter the bases based on their quality scores.

Combine Variants

  • Fixed bug where annotation values were parsed as Doubles when they should be parsed as Integers due to implicit conversion; submitted by Michael McCowan.

Select Variants

  • Changed the behavior for PL/AD fields when it encounters a record that has lost one or more alternate alleles: instead of stripping them out these fields now get fixed.


  • SplitSamFile now produces an index with the BAM.
  • Length metric updates to QualifyMissingIntervals.
  • Provide close methods to clean up resources used while creating AlignmentContexts from BAM file regions; submitted by Brad Chapman.
  • Picard jar updated to version 1.104.1628.
  • Tribble jar updated to version 1.104.1628.
  • Variant jar updated to version 1.104.1628.

Created 2013-08-21 21:15:21 | Updated 2014-02-08 20:09:15 |

Comments (2)

GATK 2.7 was released on August 21, 2013. Highlights are listed below. Read the detailed version history overview here:

Reduce Reads

  • Changed the underlying convention of having unstranded reduced reads; instead there are now at least 2 compressed reads at every position, one for each strand (forward and reverse). This allows us to maintain strand information that is useful for downstream filtering.
  • Fixed bug where representative depths were arbitrarily being capped at 127 (instead of the expected 255).
  • Fixed bug where insertions downstream of a variant region weren't triggering a stop to the compression.
  • Fixed bug when using --cancer_mode where alignments were being emitted out of order (and causing the tool to fail).

Unified Genotyper

  • Added --onlyEmitSamples argument that, when provided, instructs the caller to emit only the selected samples into the VCF (even though the calling is performed over all samples present in the provided bam files).
  • FPGA support was added to the underlying HMM that is automatically used when the appropriate hardware is available on the machine.
  • Added a (very) experimental argument (allSitePLs) that will have the caller emit PLs for all sites (including reference sites). Note that this does not give a fully accurate reference model because it models only SNPs. For proper handling of the reference model, please use the Haplotype Caller.

Haplotype Caller

  • Added a still somewhat experimental PCR indel error model to the Haplotype Caller. By default this modeling is turned on and is very useful for removing false positive indel calls associated with PCR slippage around short tandem repeats (esp. homopolymers). Users have the option (with the --pcr_indel_model argument) of turning it off or making it even more aggressive (at the expense of losing some true positives too).
  • Added the ability to emit accurate likelihoods for non-variant positions (i.e. what we call a "reference model" that incorporates indels as well as SNP confidences at every position). The output format can be either a record for every position or use the gVCF style recording of blocks. See the --emitRefConfidence argument for more details; note that this replaces the use of "--output_mode EMIT_ALL_SITES" in the HaplotypeCaller.
  • Improvements to the internal likelihoods that are generated by the Haplotype Caller. Specifically, this tool now uses a tri-state correction like the Unified Genotyper, corrects for overlapping read pairs (from the same underlying fragment), and does not run contamination removal (allele-biased downsampling) by default.
  • Several small runtime performance improvements were added (although we are still hard at work on larger improvements that will allow calling to scale to many samples; we're just not there yet).
  • Fixed bug in how adapter clipping was performed (we now clip only after reverting soft-clipped bases).
  • FPGA support was added to the underlying HMM that is automatically used when the appropriate hardware is available on the machine.
  • Improved the "dangling tail" recovery in the assembly algorithm, which allows for higher sensitivity in calling variants at the edges of coverage (e.g. near the ends of targets in an exome).
  • Added the ability to run allele-biased downsampling with different per-sample values like the Unified Genotyper (contributed by Yossi Farjoun).

Variant Annotator

  • Fixed bug where only the last -comp was being annotated at a site.

Indel Realigner

  • Fixed bug that arises because of secondary alignments and that was causing the tool not to update the alignment start of the mate when a read was realigned.

Phase By Transmission

  • Fixed bug where multi-allelic records were being completely dropped by this tool. Now they are emitted unphased.

Variant Recalibrator

  • General improvements to the Gaussian modeling, mostly centered around separating the parameters for the positive and negative training models.
  • The percentBadVariants argument has been replaced with the numBad argument.
  • Added mode to not emit (at all) variant records that are filtered out.
  • This tool now automatically orders the annotation dimensions by their standard deviation instead of the order they were specified on the command-line in order to stabilize the training and have it produce optimal results.
  • Fixed bug where the tool occasionally produced bad log10 values internally.


  • General performance improvements to the VCF reading code contributed by Michael McCowan.
  • Error messages are much less verbose and "scary."
  • Added a LibraryReadFilter contributed by Louis Bergelson.
  • Fixed the ReadBackedPileup class to represent mapping qualities as ints, not (signed) bytes.
  • Added the engine-wide ability to do on-the-fly BAM file sample renaming at runtime (see the documentation for the --sample_rename_mapping_file argument for more details).
  • Fixed bug in how the GATK counts filtered reads in the traversal output.
  • Added a new tool called Qualify Intervals.
  • Fixed major bug in the BCF encoding (the previous version was producing problematic files that were failing when trying to be read back into the GATK).
  • Picard/sam/tribble/variant jars updated to version 1.96.1534.

Created 2013-06-17 14:41:43 | Updated 2013-06-20 03:43:19 |

Comments (0)

GATK 2.6 was released on June 20, 2013. Highlights are listed below. Read the detailed version history overview here:

Important note: with this release the GATK has officially moved to using Java 7.

Reduce Reads

  • Small runtime performance improvements contributed by Michael McCowan.
  • Added fix for the "Removed too many insertions, header is now negative" bug.
  • Fixed bug that arises in multi-sample mode and causes the tool to crash.
  • Added --cancer_mode argument to force the user to explicitly enable multi-sample mode.

Unified Genotyper

  • Runtime performance improvements when calling indels; calling indels in a single sample is almost 2x faster in our tests.
  • Fixed bug for bad AD values in some cases.
  • Fixed bug for GENOTYPE_GIVEN_ALLELES mode where it silently fails to genotype indels in some cases.

Haplotype Caller

  • We have been working hard to reduce the number of false negatives (i.e. missed sites) for the Haplotype Caller and as such added a bunch of improvements to this tool. The sensitivity is now better than that of the Unified Genotyper in all of our whole genome tests for both SNPs and indels. Feel free to peruse the detailed version history for more information.
  • The Haplotype Caller now annotates IDs from dbSNP properly.
  • The Haplotype Caller now emits per-sample DP.
  • Fixed bug for bad AD values in some cases.
  • Fixed bug with error: "Only one of refStart or refStop must be < 0, not both" that arose from soft-clipped reads at the beginning of contigs.
  • Implemented a much improved version of GENOTYPE_GIVEN_ALLELES mode in the Haplotype Caller.

Indel Realigner

  • Fixed bug where secondary alignments were not being handled correctly.

Genotype Concordance

  • Added an overall genotype concordance metric to the output.
  • Fixed a bug in the printout of molten data in how it treated the genotypes.

Diagnose Targets

  • Diagnose Targets now has an option to output missing intervals.
  • Fixed bug where sometimes intervals were emitted out of order.

Base Recalibrator

  • Fixed bug for reads with indel CIGAR operators (I or D) at the start/end of the read.
  • Introduced a new tool, AnalyzeCovariates, to generate the BQSR quality assessment plots as a separate step, instead of doing it through the BaseRecalibrator.

Combine Variants

  • We no longer add PASS to the FILTER field of unfiltered records.

Variant Annotator

  • The RMSMappingQuality annotation now works properly with reduced reads.
  • The various rank sum tests no longer use reduced reads in their calculations (because those reads do not represent distinct observations).
  • Fixed bug in the BaseQualityRankSumTest annotation where it was not actually using the base qualities.
  • Added a new annotation DepthPerSampleHC that is used by default in the HaplotypeCaller.


  • James Warren contributed a patch to have references with non-suffix ".fa" parse correctly.
  • We now emit the GATK version number in the header of VCFs that we produce.
  • Fixed bug in the up front downsampling used by the GATK: reduced reads are no longer allowed to be eliminated during downsampling.
  • dbSNP rsID matching is now smarter: variants are considered matching if they have the same reference allele and at least 1 common alternative allele.
  • We now warn users about using the GATK with RNA-seq data.
  • We now check that -compress arguments are within allowable range 0-9.
  • -rf ReassignMappingQuality can now be used to reassign mapping qualities to 60 before the engine filters them out with MappingQualityUnassigned.
  • Fixed bug where requesting gzip VCF output with multi-threading was causing the GATK to fail.
  • We now require a minimum -dcov value of 200 for Locus and ActiveRegion walkers when downsampling to coverage.
  • Zero-length and repeated cigar elements are collapsed down by default in the engine.
  • -ds option removed from PrintReads because it was redundant with the engine-level -dfrac argument.
  • Fixed bug where the --defaultBaseQualities argument didn't always work.
  • The engine now produces much more accurate read counts for Read traversals.
  • Count Reads now uses a Long instead of an Integer for counts to prevent overflows.
  • Locus Walkers now only try to clip adaptors when the two reads of a pair are on opposite strands.
  • Fixed VCF issue where PLs were capped at 32767.
  • Picard/Tribble/Variant jars updated to version 1.91.1453.
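
The smarter dbSNP rsID matching rule from the list above (same reference allele and at least one alternate allele in common) boils down to a simple check. The sketch below is only an illustration: the function name and the plain-string allele representation are assumptions, and it presumes both records start at the same position and use the same allele representation.

```python
def matches_dbsnp(call_ref, call_alts, dbsnp_ref, dbsnp_alts):
    """True if a called variant should pick up the dbSNP record's rsID.

    The records match when the reference alleles are identical and the two
    sets of alternate alleles share at least one allele.
    """
    same_ref = call_ref == dbsnp_ref
    shared_alts = set(call_alts) & set(dbsnp_alts)
    return same_ref and bool(shared_alts)
```

Under this rule a call with alternates T and G still matches a dbSNP record listing only G, but not one listing only C.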

Created 2013-04-30 20:18:26 | Updated 2013-05-06 15:51:39 |

Comments (4)

GATK 2.5 was released on April 30, 2013. Highlights are listed below. Read the detailed version history overview here:

Reduce Reads

  • DRASTIC improvements in the compression algorithm plus myriad bug fixes. Too many to list here; see detailed version history for more information.

Unified Genotyper

  • Fixed bug for indel calling with really long reads (assigning the wrong genotypes).
  • Automatic contamination fixing now works on reduced reads.
  • Fixed rare bug in the general ploidy SNP likelihood model when there are no informative reads in a pileup.
  • Fixed bug where haplotypes with 0 bases were being created.
  • Fixed problem where our internal PairHMM was generating positive likelihoods.

Haplotype Caller

  • Comprehensive performance improvements to the accuracy of calling both SNPs and indels; runtime is also much improved (but still slower than the Unified Genotyper; we expect it to be faster than UG in the next release though). See detailed version history for more information.
  • Fixed bug for calling on reduced reads (counts were not being assigned correctly).
  • Fixed problem where our internal PairHMM was generating positive likelihoods.
  • Can now write BAMs showing the assembled haplotypes.

Diagnose Targets

  • Significantly refactored this tool; it now works with a "plugin" system (see documentation for more information).
  • Fixed bug where LOW_MEDIAN_COVERAGE was output when no reads are covering the interval.
  • Fixed bug where intervals were skipped when they were not covered by any reads.

Base Recalibrator

  • Fixed the tool to work correctly with empty BQSR tables.
  • Fixed issue where Print Reads was running out of disk space when using the -BQSR option even for small bam files.
  • Fixed bug for RNA seq alignments with Ns.

Select Variants

  • Fixed bug where using the --exclude_sample_file argument was giving bad results.
  • Fixed bug when using the --keepOriginalAC argument which caused it to emit bad VCFs.
  • Fixed bug where maxIndelSize argument wasn't getting applied to deletions.

Variant Annotator

  • Added support for snpEff "GATK compatibility mode".
  • Can now list available annotations by doing java -cp GenomeAnalysisTK.jar
  • QualByDepth remaps QD values > 40 to a gaussian around 30.
  • Removed several deprecated annotations (AverageAltAlleleLength, MappingQualityZeroFraction, and TechnologyComposition) and others are no longer marked as experimental.

Variant Filtration

  • Don't allow users to specify keys and IDs that contain angle brackets or equals signs (which are not allowed in the VCF specification).
  • Added feature that allows one to filter sites outside of a given mask.

Left Align Variants

  • Renamed to LeftAlignAndTrimVariants.
  • Added ability to trim common bases in front of indels before left-aligning.
  • Added ability to split multiallelic records and then left align them.


  • We removed the auto-creation of fai/dict files for fasta references because it was too buggy.
  • Fixed bug where we could fail to find the intersection of unsorted/missorted interval lists.
  • Fixed @PG tag uniqueness issue with BAMs we were producing.
  • Fixed rare bug in GenotypeConcordance for multi-allelic sites.
  • Added check for reads without stored bases (i.e. that use '*') which we do not support.
  • Added support to reduce reads to CallableLoci.
  • Added a new walker to split MNPs into their allelic primitives (SNPs).
  • We no longer allow the use of compressed (.gz) references in the GATK.
  • Picard/Tribble/Variant jars updated to version 1.90.1442.
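
Splitting an MNP into its allelic primitives, as the new walker in the list above does, can be sketched in a few lines. The function below is a hypothetical illustration that works on a bare position and allele strings; it ignores genotypes, phasing, and annotations, which the real walker has to deal with.

```python
def split_mnp(pos, ref, alt):
    """Split an MNP into SNPs, returned as (position, ref_base, alt_base).

    Positions where the reference and alternate bases agree are skipped.
    """
    if len(ref) != len(alt):
        raise ValueError("an MNP must have ref and alt alleles of equal length")
    return [
        (pos + offset, r, a)
        for offset, (r, a) in enumerate(zip(ref, alt))
        if r != a
    ]
```

An MNP ACG>GCT at position 100 thus becomes two SNPs, A>G at 100 and G>T at 102.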

Created 2013-02-25 16:03:09 | Updated 2016-08-21 03:55:35 |

Comments (7)

GATK 2.4 was released on February 26, 2013. Highlights are listed below. Read the detailed version history overview here:

Important note 1 for this release: with this release comes an updated licensing structure for the GATK. Different files in our public repository are protected with different licenses, so please see the text at the top of any given file for details as to its particular license.

Important note 2 for this release: the GATK team spent a tremendous amount of time and engineering effort to add extensive tests for many of our core tools (a process that will continue into future releases). Unsurprisingly, as part of this process many small (and some not so small) bugs were uncovered during testing that we subsequently fixed. While we usually attempt to enumerate in our release notes all of the bugs fixed during a given release, that would entail quite a Herculean effort for release 2.4; so please just be aware that there were many smaller fixes that may be omitted from these notes.

Base Quality Score Recalibration

  • The underlying calculation of the recalibration has been improved and generalized so that the empirical quality is now calculated through a Bayesian estimate. This radically improves the accuracy in particular for bins with small numbers of observations.
  • Added many run time improvements so that this tool now runs much faster.
  • Print Reads writes a header when used with the -BQSR argument.
  • Added a check to make sure that BQSR is not being run on a reduced bam (which would be bad).
  • The --maximum_cycle_value argument can now be specified during the Print Reads step to prevent problems when running on bams with extremely long reads.
  • Fixed bug where reads with an existing BQ tag and soft-clipped bases could cause the tool to error out.

Unified Genotyper

  • Fixed the QUAL calculation for monomorphic (homozygous reference) sites (the math for previous versions was not correct).
  • Biased downsampling (i.e. contamination removal) values can now be specified as per-sample fractions.
  • Fixed bug where biased downsampling (i.e. contamination removal) was not being performed correctly in the presence of reduced reads.
  • The indel likelihoods calculation had several bugs (e.g. sometimes the log likelihoods were positive!) that manifested themselves in certain situations and these have all been fixed.
  • Small run time improvements were added.

Haplotype Caller

  • Extensive performance improvements were added to the Haplotype Caller. This includes run time enhancements (it is now much faster than previous versions) plus improvements in accuracy for both SNPs and indels. Internal assessment now shows the Haplotype Caller calling variants more accurately than the Unified Genotyper. The changes for this tool are so extensive that they cannot easily be enumerated in these notes.

Variant Annotator

  • The QD annotation is now divided by the average length of the alternate allele (weighted by the allele count); this does not affect SNPs but makes the calculation for indels much more accurate.
  • Fixed Fisher Strand annotation where p-values sometimes summed to slightly greater than 1.0.
  • Fixed Fisher Strand annotation for indels where reduced reads were not being handled correctly.
  • The Haplotype Score annotation no longer applies to indels.
  • Added the Variant Type annotation (not enabled by default) to annotate the VCF record with the variant type.
  • The DepthOfCoverage annotation has been renamed to Coverage.

Reduce Reads

  • Several small run time improvements were added to make this tool slightly faster.
  • By default this tool now uses a downsampling value of 40x per start position.

Indel Realigner

  • Fixed bug where some reads with soft clipped bases were not being realigned.

Combine Variants

  • Improved run time performance when using the PRIORITIZE or REQUIRE_UNIQUE options.

Select Variants

  • The --regenotype functionality has been removed from SelectVariants and transferred into its own tool: RegenotypeVariants.

Variant Eval

  • Removed the GenotypeConcordance evaluation module (which had many bugs) and converted it into its own tested, standalone tool (called GenotypeConcordance).


  • The VariantContext and related classes have been moved out of the GATK codebase and into Picard's public repository. The GATK now uses the variant.jar as an external library.
  • Added a new Read Filter to reassign just a particular mapping quality to another one (see the ReassignOneMappingQualityFilter).
  • Added the Regenotype Variants tool that allows one to regenotype a VCF file (which must contain likelihoods in the PL field) after samples have been added/removed.
  • Added the Genotype Concordance tool that calculates the concordance of one VCF file against another.
  • Bug fix for VariantsToVCF for records where old dbSNP files had '-' as the reference base.
  • The GATK now automatically converts IUPAC bases in the reference to Ns and errors out on other non-standard characters.
  • Fixed bug for the DepthOfCoverage tool which was not counting deletions correctly.
  • Added Cat Variants, a standalone tool to quickly combine multiple VCF files whose records are non-overlapping (e.g. as produced during scatter-gather).
  • The Somatic Indel Detector has been removed from our codebase and moved to the Broad Cancer group's private repository.
  • Fixed Validate Variants rsID checking which wasn't working if there were multiple IDs.
  • Picard jar updated to version 1.84.1337.
  • Tribble jar updated to version 1.84.1337.
  • Variant jar updated to version 1.85.1357.
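
The reference handling described in the list above (IUPAC bases become Ns, anything else non-standard is an error) can be illustrated with a small sketch. The function name is hypothetical, and details such as preserving the case of soft-masked bases are assumptions made for the example rather than documented GATK behavior.

```python
STANDARD_BASES = set("ACGTN")
# IUPAC ambiguity codes other than N.
IUPAC_AMBIGUITY = set("RYSWKMBDHV")


def mask_iupac(sequence):
    """Convert IUPAC ambiguity codes to N and reject other characters."""
    out = []
    for base in sequence:
        upper = base.upper()
        if upper in STANDARD_BASES:
            out.append(base)  # keep A/C/G/T/N as they are
        elif upper in IUPAC_AMBIGUITY:
            out.append("N")   # ambiguity code: treat as unknown
        else:
            raise ValueError(f"non-standard reference character: {base!r}")
    return "".join(out)
```

A stretch such as ACGTRY would be read as ACGTNN, while a character like * stops processing with an error.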

Created 2012-12-17 14:56:06 | Updated 2012-12-18 20:21:23 |

Comments (2)

GATK 2.3 was released on December 17, 2012. Highlights are listed below. Read the detailed version history overview here:

Base Quality Score Recalibration

  • Soft clipped bases are no longer counted in the delocalized BQSR.
  • The user can now set the maximum allowable cycle with the --maximum_cycle_value argument.

Unified Genotyper

  • Minor (5%) run time improvements to the Unified Genotyper.
  • Fixed bug for the indel model that occurred when long reads (e.g. Sanger) in a pileup led to a read starting after the haplotype.
  • Fixed bug in the exact AF calculation where log10pNonRefByAllele should really be log10pRefByAllele.

Haplotype Caller

  • Fixed the performance of GENOTYPE_GIVEN_ALLELES mode, which often produced incorrect output when passed complex events.
  • Fixed the interaction with the allele biased downsampling (for contamination removal) so that the removed reads are not used for downstream annotations.
  • Implemented minor (5-10%) run time improvements to the Haplotype Caller.
  • Fixed the logic for determining active regions, which misbehaved when intervals were used.

Variant Annotator

  • The FisherStrand annotation ignores reduced reads (because they are always on the forward strand).
  • Can now be run multi-threaded with -nt argument.

Reduce Reads

  • Fixed bug where the start position of a reduced read was sometimes less than 1.
  • ReduceReads now co-reduces BAMs if they're passed in together with multiple -I arguments.

Combine Variants

  • Fixed the case where the PRIORITIZE option is used but no priority list is given.

Phase By Transmission

  • Fixed bug where the AD wasn't being printed correctly in the MV output file.

Miscellaneous

  • A brand new version of the per site down-sampling functionality has been implemented that works much, much better than the previous version.
  • More efficient initial file seeking at the beginning of the GATK traversal.
  • Fixed the compression of VCF.gz where the output was too big because of an unnecessary call to flush().
  • The allele biased downsampling (for contamination removal) has been rewritten to be smarter; also, it no longer aborts if there's a reduced read in the pileup.
  • Made a major performance improvement to the GATK engine by fixing a problem with the NanoSchedule timing code.
  • Added checking in the GATK for mis-encoded quality scores.
  • Fixed downsampling in the ReadBackedPileup class.
  • Fixed the parsing of genome locations that contain colons in the contig names (which is allowed by the spec).
  • Made ID an allowable INFO field key in our VCF parsing.
  • Multi-threaded VCF to BCF writing no longer produces an invalid intermediate file that fails on merging.
  • Picard jar remains at version 1.67.1197.
  • Tribble jar updated to version 119.
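
Parsing genome locations whose contig names contain colons is ambiguous if the string is simply split at the first colon. The Python sketch below shows one way to resolve it, assuming the set of valid contig names is known (for example from the sequence dictionary). The function name and the resolution order are assumptions, not the GATK parser.

```python
def parse_interval(text, known_contigs):
    """Return (contig, start, stop); start and stop are None for a bare contig name."""
    # Prefer an exact contig match so names such as "HLA-A*01:01" survive intact.
    if text in known_contigs:
        return text, None, None
    # Otherwise treat everything before the LAST colon as the contig name.
    contig, sep, span = text.rpartition(":")
    if not sep or contig not in known_contigs:
        raise ValueError(f"cannot resolve contig in {text!r}")
    start_text, dash, stop_text = span.partition("-")
    start = int(start_text)
    stop = int(stop_text) if dash else start  # "contig:7" means a single position
    return contig, start, stop
```

With `known_contigs = {"chr1", "HLA-A*01:01"}`, the string `"HLA-A*01:01:5-10"` resolves to contig `HLA-A*01:01`, positions 5 to 10.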

Created 2012-10-31 00:05:44 | Updated 2012-11-19 13:41:24 |

GATK 2.2 was released on October 31, 2012. Highlights are listed below. Read the detailed version history overview here:

Base Quality Score Recalibration

  • Improved the algorithm around homopolymer runs to use a "delocalized context".
  • Massive performance improvements that allow these tools to run efficiently (and correctly) in multi-threaded mode.
  • Fixed bug where the tool failed for reads that begin with insertions.
  • Fixed bug in the scatter-gather functionality.
  • Added new argument to enable emission of the .pdf output file (see --plot_pdf_file).

Unified Genotyper

  • Massive runtime performance improvement for multi-allelic sites; -maxAltAlleles now defaults to 6.
  • The genotyper no longer emits the Strand Bias (SB) annotation by default. Use the --computeSLOD argument to enable it.
  • Added the ability to automatically down-sample out low grade contamination from the input bam files using the --contamination_fraction_to_filter argument; by default the value is set at 0.05 (5%).
  • Fixed annotations (AD, FS, DP) that were miscalculated when run on a Reduce Reads processed bam.
  • Fixed bug for the general ploidy model that occasionally caused it to choose the wrong allele when there are multiple possible alleles to choose from.
  • Fixed bug where the inbreeding coefficient was computed at monomorphic sites.
  • Fixed edge case bug where we could abort prematurely in the special case of multiple polymorphic alleles and samples with drastically different coverage.
  • Fixed bug in the general ploidy model where it wasn't counting errors in insertions correctly.
  • The FisherStrand annotation is now computed both with and without filtering low-qual bases (we compute both p-values and take the maximum one - i.e. least significant).
  • Fixed annotations (particularly AD) for indel calls; previous versions didn't bin reads into the reference or alternate sets correctly.
  • Generalized ploidy model now handles reference calls correctly.
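
The FisherStrand change in this list (compute the p-value with and without low-quality bases, then keep the larger, least significant one) can be sketched in Python as follows. This is an illustration only: the 2x2 table layout (ref forward, ref reverse, alt forward, alt reverse), the function names, and the two-sided test convention are assumptions, and the annotation GATK actually emits may be scaled differently.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the tables that are at most as likely as the observed one.
    p = sum(prob(x) for x in range(lowest, highest + 1)
            if prob(x) <= observed * (1 + 1e-7))
    return min(1.0, p)


def strand_bias_p_value(unfiltered_table, filtered_table):
    """Take the least significant (largest) of the two p-values."""
    return max(fisher_exact_two_sided(*unfiltered_table),
               fisher_exact_two_sided(*filtered_table))
```

Taking the maximum is the conservative choice: a site is only flagged as strand-biased if the bias is visible both with and without the low-quality bases.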

Haplotype Caller

  • Massive runtime performance improvement for multi-allelic sites; -maxAltAlleles now defaults to 6.
  • Massive runtime performance improvement to the HMM code which underlies the likelihood model of the HaplotypeCaller.
  • Added the ability to automatically down-sample out low grade contamination from the input bam files using the --contamination_fraction_to_filter argument; by default the value is set at 0.05 (5%).
  • Now requires at least 10 samples to merge variants into complex events.

Variant Annotator

  • Fixed annotations for indel calls; previous versions either didn't compute the annotations at all or did so incorrectly for many of them.

Reduce Reads

  • Fixed several bugs where certain reads were either dropped (fully or partially) or registered as occurring at the wrong genomic location.
  • Fixed bugs where, in rare cases, N bases were chosen as consensus over legitimate A, C, G, or T bases.
  • Significant runtime performance optimizations; the average runtime for a single exome file is now just over 2 hours.

Variant Filtration

  • Fixed a bug where DP couldn't be filtered from the FORMAT field, only from the INFO field.

Variant Eval

  • AlleleCount stratification now supports records with ploidy other than 2.

Combine Variants

  • Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file.
  • Now outputs the first non-missing QUAL, not the maximum.
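
The QUAL rule above differs from taking the maximum whenever an earlier input carries a lower score. A minimal Python sketch, assuming the QUAL values arrive in merge order and that a missing QUAL (`.` in the VCF) is represented as `None`; the function name is hypothetical:

```python
def merged_qual(quals):
    """Return the first QUAL that is not missing, or None if every input lacks one."""
    for qual in quals:
        if qual is not None:
            return qual
    return None
```

For inputs `[None, 30.0, 99.0]` this returns `30.0`, where the previous maximum-based behaviour would have reported `99.0`.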

Select Variants

  • Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the output file.
  • Removed the -number argument because it gave biased results.

Validate Variants

  • Added option to selectively choose particular strict validation options.
  • Fixed bug where mixed genotypes (e.g. ./1) would incorrectly fail.
  • Improved the error message around unused ALT alleles.

Somatic Indel Detector

  • Fixed several bugs, including missing AD/DP header lines and putting annotations in correct order (Ref/Alt).

Miscellaneous

  • New CPU "nano" parallelization option (-nct) added GATK-wide (see docs for more details about this cool new feature that allows parallelization even for Read Walkers).
  • Fixed raw HapMap file conversion bug in VariantsToVCF.
  • Added GATK-wide command line argument (-maxRuntime) to control the maximum runtime allowed for the GATK.
  • Fixed bug in GenotypeAndValidate where it couldn't handle both SNPs and indels.
  • Fixed bug where VariantsToTable did not handle lists and nested arrays correctly.
  • Fixed bug in BCF2 writer for case where all genotypes are missing.
  • Fixed bug in DiagnoseTargets when intervals with zero coverage were present.
  • Fixed bug in Phase By Transmission when there are no likelihoods present.
  • Fixed bug in fasta .fai generation.
  • Updated and improved version of the BadCigar read filter.
  • Picard jar remains at version 1.67.1197.
  • Tribble jar remains at version 110.

Created 2012-08-20 18:52:48 | Updated 2012-08-23 14:11:29 |

Base Quality Score Recalibration

  • Multi-threaded support in the BaseRecalibrator tool has been temporarily suspended for performance reasons; we hope to have this fixed for the next release.
  • Implemented support for SOLiD no call strategies other than throwing an exception.
  • Fixed smoothing in the BQSR bins.
  • Fixed plotting R script to be compatible with newer versions of R and ggplot2 library.

Unified Genotyper

  • Renamed the per-sample ML allelic fractions and counts so that they don't have the same name as the per-site INFO fields, and clarified the description in the VCF header.
  • UG now makes use of base insertion and base deletion quality scores if they exist in the reads (output from BaseRecalibrator).
  • Changed the -maxAlleles argument to -maxAltAlleles to make it more accurate.
  • In pooled mode, if haplotypes cannot be created from given alleles when genotyping indels (e.g. too close to a contig boundary), then do not try to genotype.
  • Added improvements to indel calling in pooled mode: we compute per-read likelihoods in the reference sample to determine whether a read is informative or not.

Haplotype Caller

  • Added LowQual filter to the output when appropriate.
  • Added some support for calling on Reduced Reads. Note that this is still experimental and may not always work well.
  • Now does a better job of capturing low frequency branches that are inside high frequency haplotypes.
  • Updated VQSR to work with the MNP and symbolic variants that are coming out of the HaplotypeCaller.
  • Made fixes to the likelihood based LD calculation for deciding when to combine consecutive events.
  • Fixed bug where non-standard bases from the reference would cause errors.
  • Better separation of arguments that are relevant to the Unified Genotyper but not the Haplotype Caller.

Reduce Reads

  • Fixed bug where reads were soft-clipped beyond the limits of the contig and the tool was failing with a NoSuchElement exception.
  • Fixed divide by zero bug when downsampler goes over regions where reads are all filtered out.
  • Fixed a bug where downsampled reads were not being excluded from the read window, causing them to trail back and get caught by the sliding window exception.

Variant Eval

  • Fixed support in the AlleleCount stratification when using the MLEAC (it is now capped by the AN).
  • Fixed incorrect allele counting in IndelSummary evaluation.

Combine Variants

  • Now outputs the first non-MISSING QUAL, instead of the maximum.
  • Now supports multi-threaded running (with the -nt argument).

Select Variants

  • Fixed behavior of the --regenotype argument to do proper selecting (without losing any of the alternate alleles).
  • No longer adds the DP INFO annotation if DP wasn't used in the input VCF.
  • If MLEAC or MLEAF is present in the original VCF and the number of samples decreases, remove those annotations from the output VC (since they are no longer accurate).

Miscellaneous

  • Updated and improved the BadCigar read filter.
  • GATK now generates a proper error when a gzipped FASTA is passed in.
  • Various improvements throughout the BCF2-related code.
  • Removed various parallelism bottlenecks in the GATK.
  • Added support for the X and = CIGAR operators to the GATK.
  • Catch NumberFormatExceptions when parsing the VCF POS field.
  • Fixed bug in FastaAlternateReferenceMaker when input VCF has overlapping deletions.
  • Fixed AlignmentUtils bug for handling Ns in the CIGAR string.
  • We now allow lower-case bases in the REF/ALT alleles of a VCF and convert them to upper case.
  • Added support for handling complex events in ValidateVariants.
  • Picard jar remains at version 1.67.1197.
  • Tribble jar remains at version 110.
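
For readers unfamiliar with the X and = CIGAR operators mentioned in this list: per the SAM specification, = (sequence match) and X (sequence mismatch) consume both read and reference bases, exactly as M does. The Python sketch below computes read and reference lengths from a CIGAR string; the function name is an illustration and is not part of the GATK.

```python
import re

CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_READ = set("MIS=X")       # operators that advance along the read
CONSUMES_REFERENCE = set("MDN=X")  # operators that advance along the reference


def cigar_lengths(cigar):
    """Return (read_length, reference_length) implied by a CIGAR string."""
    elements = CIGAR_ELEMENT.findall(cigar)
    # Reject strings containing anything other than well-formed elements.
    if "".join(length + op for length, op in elements) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    read_len = ref_len = 0
    for length, op in elements:
        if op in CONSUMES_READ:
            read_len += int(length)
        if op in CONSUMES_REFERENCE:
            ref_len += int(length)
    return read_len, ref_len
```

A read aligned as `5=1X4=` therefore spans the same 10 read bases and 10 reference bases as one aligned as `10M`.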

Created 2012-07-23 19:16:29 | Updated 2012-08-10 00:07:47 |

The GATK 2.0 release includes both the addition of brand-new (and often still experimental) tools and updates to the existing stable tools.

New Tools

  • Base Recalibrator (BQSR v2), an upgrade to CountCovariates/TableRecalibration that generates base substitution, insertion, and deletion error models.
  • Reduce Reads, a BAM compression algorithm that reduces file sizes by 20x-100x while preserving all information necessary for accurate SNP and indel calling. ReduceReads enables the GATK to call tens of thousands of deeply sequenced NGS samples simultaneously.
  • HaplotypeCaller, a multi-sample local de novo assembly and integrated SNP, indel, and short SV caller.
  • Plus powerful extensions to the Unified Genotyper to support variant calling of pooled samples, mitochondrial DNA, and non-diploid organisms. Additionally, the extended Unified Genotyper introduces a novel error modeling approach that uses a reference sample to build a site-specific error model for SNPs and indels that vastly improves calling accuracy.

Base Quality Score Recalibration

  • IMPORTANT: the Count Covariates and Table Recalibration tools (which comprise BQSRv1) have been retired! Please see the BaseRecalibrator tool (BQSRv2) for running recalibration with GATK 2.0.

Unified Genotyper

  • Handle exception generated when non-standard reference bases are present in the fasta.
  • Bug fix for indels: when checking the limits of a read to clip, it wasn't considering reads that may already have been clipped before.
  • Now emits the MLE AC and AF in the INFO field.
  • Don't allow N's in insertions when discovering indels.

Phase By Transmission

  • Multi-allelic sites are now correctly ignored.
  • Reporting of Mendelian violations is enhanced.
  • Corrected TP overflow.
  • Fixed bug that arose when no PLs were present.
  • Added option to output the father's allele first in phased child haplotypes.
  • Fixed a bug that caused the wrong phasing of child/father pairs.

Variant Eval

  • Improvements to the validation report module: if eval has genotypes and comp has genotypes, then subset the genotypes of comp down to the samples being evaluated when considering TP, FP, FN, TN status.
  • If present, the AlleleCount stratification uses the MLE AC by default (and otherwise drops down to use the greedy AC).
  • Fixed bugs in the VariantType and IndelSize stratifications.

Variant Annotator

  • FisherStrand annotation no longer hard-codes in filters for bases/reads (previously used MAPQ > 20 && QUAL > 20).
  • Miscellaneous bug fixes to experimental annotations.
  • Added a Clipping Rank Sum Test to detect when variants are present on reads with differential clipping.
  • Fixed the ReadPos Rank Sum Test annotation so that it no longer uses the un-hardclipped start as the alignment start.
  • Fixed bug in the NBaseCount annotation module.
  • The new TandemRepeatAnnotator is now a standard annotation while HRun has been retired.
  • Added PED support for the Inbreeding Coefficient annotation.
  • Don't compute QD if there is no QUAL.

Variant Quality Score Recalibration

  • The VCF index is now created automatically for the recalFile.

Variant Filtration

  • Now allows you to run with type-unsafe JEXL selects, which all default to false when matching.

Select Variants

  • Added an option which allows the user to re-genotype through the exact AF calculation model (if PLs are present) in order to recalculate the QUAL and genotypes.

Combine Variants

  • Added --mergeInfoWithMaxAC argument to keep info fields from the input with the highest AC value.

Somatic Indel Detector

  • GT header line is now output.

Indel Realigner

  • Automatically skips Ion reads just like it does with 454 reads.

Variants To Table

  • Genotype-level fields can now be specified.
  • Added the --moltenize argument to produce molten output of the data.

Depth Of Coverage

  • Fixed a NullPointerException that could occur if the user requested an interval summary but never provided a -L argument.

Miscellaneous

  • BCF2 support in tools that output VCFs (use the .bcf extension).
  • The GATK Engine no longer automatically strips the suffix "Walker" from the end of tool names; as such, all tools whose name ended with "Walker" have been renamed without that suffix.
  • Fixed bug when specifying a JEXL expression for a field that doesn't exist: we now treat the whole expression as false (whereas we were rethrowing the JEXL exception previously).
  • There is now a global --interval_padding argument that specifies how many basepairs to add to each of the intervals provided with -L (on both ends).
  • Removed all code associated with extended events.
  • Algorithmically faster version of DiffEngine.
  • Better down-sampling fixes edge case conditions that used to be handled poorly. Read Walkers can now use down-sampling.
  • GQ is now emitted as an int, not a float.
  • Fixed bug in the Beagle codec that was skipping the first line of the file when decoding.
  • Fixed bug in the VCF writer in the case where there are no genotypes for a record but there are genotypes in the header.
  • Miscellaneous fixes to the VCF headers being produced.
  • Fixed up the BadCigar read filter.
  • Removed the old deprecated genotyping framework revolving around the misordering of alleles.
  • Extensive refactoring of the GATKReports.
  • Picard jar updated to version 1.67.1197.
  • Tribble jar updated to version 110.
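
The effect of the --interval_padding argument described in this list can be sketched in a few lines of Python. The notes only state that the padding is added on both ends of each interval; clamping the result to the contig bounds (1-based, inclusive) is an assumption here, as is the function name.

```python
def pad_interval(start, stop, padding, contig_length):
    """Extend a 1-based inclusive interval by `padding` bases on each side."""
    # Assumption: padded coordinates are clamped to the contig rather than rejected.
    return max(1, start - padding), min(contig_length, stop + padding)
```

For example, padding the interval 100-200 by 50 on a 1000 bp contig gives 50-250, while padding 10-990 gives 1-1000.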

 Note: There are no release notes available for versions earlier than 2.0.

These are the latest commit messages logged in the Github repository. Commit messages are short summaries that describe the changes made to the codebase. You can view the complete development history here.

Commit date | Summary
1st December 2016 | Update pom versions for the 3.7 release
1st December 2016 | Merge remote-tracking branch 'unstable/master'
1st December 2016 | Merge pull request #1533 from broadinstitute/lb_update_jexl_behavior
1st December 2016 | Move htsjdk to ver 2.8.1 and picard to ver 2.7.2
30th November 2016 | Merge pull request #1529 from broadinstitute/ms_fs
30th November 2016 | Merge pull request #1528 from broadinstitute/rhl_validate_vcf.gz
30th November 2016 | Output files with the vcf.gz extension are gzipped, containing .bcf n…
30th November 2016 | Merge pull request #1496 from broadinstitute/db_m2_downsampling
30th November 2016 | Merge pull request #1520 from broadinstitute/gvda_more_docfixes_for_3.7
30th November 2016 | Makes Fisher's exact test match R and GATK4
30th November 2016 | Add downsampling arguments to Mutect
30th November 2016 | Documentation fixes
3rd November 2016 | Merge pull request #1527 from broadinstitute/rhl_vc_writer_factory_to…
3rd November 2016 | Replace VariantContextWriterFactory with VariantContextWriterBuilder
2nd November 2016 | Merge pull request #1522 from broadinstitute/rhl_sam_error_1427
2nd November 2016 | Make exit system file type message generic
1st November 2016 | Merge pull request #1517 from broadinstitute/rhl_adapter_boundary_err…
1st November 2016 | Merge pull request #1354 from broadinstitute/db_issue_1351
1st November 2016 | Fix adapter bounday for positive strand
1st November 2016 | Fixed logic error and tidied AlleleBalance and AlleleBalanceBySample

Click here to view older changes on Github

For 2.x releases, we collected the Guide documentation into a Guide Book in PDF format. Each Guide Book release was versioned, so if you performed some analyses with an older version of the GATK, you can go back and look at the documentation that matched that version exactly. Note however that the Tool Documentation (GATKDocs, containing detailed argument lists) was not included in the versioned Guide Book since it can be generated directly from the source code. We currently have no immediate plans to generate PDF Guide Books for 3.x versions due to low apparent interest from the community, but let us know if you think we should resurrect this feature.

 Note: There are no PDF files of the Guide Book available for versions earlier than 2.3-9.