What better way to start the summer than with a new GATK release?
Umm no don't answer that, there's loads of good options. You could have a barbecue, eat some ice cream, go on a hike if that's the sort of thing that floats your kayak... Or you might live somewhere where winter is just starting and everything I just said there was terribly insensitive. Sorry.
Ahem. As I mentioned in my recent sneak preview blog post, the bulk of our development effort (speed! copy number! unicorns!) is now going into the GATK4 project. Accordingly, development in the GATK3 framework is winding down, so this release consists mainly of bug fixes, added convenience functions, and relatively minor behavior tweaks.
That being said we do have a few new experimental features in the VQSR tools (which haven't yet been fully ported to GATK4, hence the ongoing development in GATK3) that are pushing the envelope of allele-specific filtering. So that's interesting, if not yet fully documented (someone should really get on that). And you'll probably care about some of those tweaks I casually mentioned above -- in fact I guarantee that at least one of these things will matter to you in some way. If you read through the whole thing and don't find anything relevant to you, tell me in the comments that I was wrong. That's what the internet is for.
As usual, here I go over the changes that matter the most / to the most; consider going through the release notes as well for a full list of changes.
Possibly the most sweeping change in this version is that it introduces support for Java 8. As noted recently, when we switched our test framework to Java 8 we encountered multiple failures in GATK 3.5 tests, which I discussed here. We fixed the underlying issues, so from version 3.6 onward GATK now runs reliably on Java 8. As a nice bonus, this puts us back into sync with HTSJDK and the Picard toolkit, which had been running on Java 8 for a few months already. If you were doing it right, you had both versions of Java installed and ran each toolkit, GATK and Picard, on the appropriate version. How much hassle? Too much hassle! — Now you can run everything on Java 8.
Here's another change that will affect everyone regardless of use case, in a good way: we removed the Phone Home usage reporting system. It served its purpose for many years, but we've outgrown it. So we ripped it all out. If you previously used a key to deactivate it with the
-et NO_ET and
—key <your_key> arguments, you can stop and take those out. Or if you're just too busy, leave them in -- the Phone Home code is all gone but we left in the argument definitions, so the parser shouldn't fail if you leave them in your commands. This shouldn't break any scripts.
In the next few days we'll be making some updates to our Best Practices documentation to reflect updates to our production pipelines, and one thing you'll notice is that local realignment around indels is no longer included in the pre-processing part of the pipelines. More on that in a follow-up post; in a nutshell, indel realignment is just not useful enough anymore when you're calling variants with haplotype-based tools like HaplotypeCaller and MuTect2. This does mean we will no longer support the indel realignment tools as actively; but since others may care more about preserving and possibly even expanding this functionality, we've decided to move the relevant tools, IndelRealigner and RealignerTargetCreator, and related classes to the part of the GATK that is open-sourced under the MIT license.
In response to popular demand, we made two frequently requested enhancements to the output of the variant discovery tools.
First, HaplotypeCaller and GenotypeGVCFs will now emit a no-call (./.) for any sample where GQ is zero, in both normal and GVCF modes, instead of emitting a specific genotype in which we have zero confidence. Note that these do not necessarily indicate that the variant call itself is bad, but that we are unable to choose between two (or more) genotypes for that particular sample.
Second, GenotypeGVCFs will now emit a QUAL value for invariant sites (i.e. where all samples are genotyped homozygous-variant) when run in
-allSites mode. By the way, we're aware that many of you would like to see a GVCF-like output option for the final VCF, in which you'd get essentially the same informational content as an all-sites VCF but with some form of interval-block compression. We still do not provide that functionality, and we do not have any immediate plans to work on implementing it in GATK3; however we are talking about how to make something like that happen in GATK4. Why the feet-dragging? Well, while it's not hard to do something like that badly, doing it correctly is quite a bit harder, so it will take some non-trivial development effort —- and since it's not something we currently need at Broad for our production needs, it's difficult to prioritize. But hey, the more you tell us you want this, the more we can justify putting it in the roadmap, so leave a comment below if you care about this.
Today's bug of infamy concerns the handling of allele depths (AD) in the joint variant discovery workflow, when the AD of the NON_REF allele is non-zero. This was an interesting little edge case. First we have HaplotypeCaller emitting a GVCF where, at a given position, the AD for the NON_REF allele is non-zero. This is fine, even expected. The problem arises during GVCF merging or joint genotyping, when CombineGVCFs or GenotypeGVCFs takes this NON_REF AD count and -- this is the bug -- copies it over to one of the alleles seen in another sample. This results in observing sum(AD) > DP, which is obviously not good. The new behavior, post-fix, is that we still record the NON_REF AD in the single-sample GVCF, but we don't copy it over to concrete alleles in CombineGVCFs or GenotypeGVCFs. Because, duh.
Some days it seems like the number one preoccupation of GATK forum denizens is tracking down what happened to every last read in a pileup. That's probably because HaplotypeCaller and MuTect2, the two stars of our little show, can be awfully picky about what reads they include at key stages of the calling process. Their (totally justified) tendency to discount reads that they don't consider useful often causes confusion and/or concerns that reads are somehow being dropped for no reason. That's a concern we'd like to help alleviate. HaplotypeCaller and MuTect2 both already have an option to output a bam file showing the post-assembly read alignments and candidate haplotypes, which we call a bamout and by default shows the state of the data used by HC and M2 to make their calls. Now we've added an option to include in that output any reads that were not considered informative or were dropped for any reason internal to the caller (separate from any read filtering done by the engine). The option is called
--emitDroppedReads and should be used in conjunction with
The last GATK 3.x release of the year 2015 has arrived!
The major feature in GATK 3.5 is the eagerly awaited MuTect2 (beta version), which brings somatic SNP and Indel calling to GATK. This is just the beginning of GATK’s scope expansion into the somatic variant domain, so expect some exciting news about copy number variation in the next few weeks! Meanwhile, more on MuTect2 awesomeness below.
In addition, we’ve got all sorts of variant context annotation-related treats for you in the 3.5 goodie bag -- both new annotations and new capabilities for existing annotations, listed below.
In the variant manipulation space, we enhanced or fixed functionality in several tools including LeftAlignAndTrimVariants, FastaAlternateReferenceMaker and VariantEval modules. And in the variant calling/genotyping space, we’ve made some performance improvements across the board to HaplotypeCaller and GenotypeGVCFs (mostly by cutting out crud and making the code more efficient) including a few improvements specifically for haploids. Read the detailed release notes for more on these changes. Note that GenotypeGVCFs will now emit no-calls at sites where RGQ=0 in acknowledgment of the fact that those sites are essentially uncallable.
We’ve got good news for you if you’re the type who worries about disk space (whether by temperament or by necessity): we finally have CRAM support -- and some recommendations for keeping the output of BQSR down to reasonable file sizes, detailed below.
Finally, be sure to check out the detailed release notes for the usual variety show of minor features (including a new Queue job runner that enables local parallelism), bug fixes and deprecation notices (a few tools have been removed from the codebase, in the spirit of slimming down ahead of the holiday season).
MuTect2 is the next-generation somatic SNP and indel caller that combines the DREAM challenge-winning somatic genotyping engine of the original MuTect with the assembly-based machinery of HaplotypeCaller.
The original MuTect (Cibulskis et al., 2013) was built on top of the GATK engine by the Cancer Genome Analysis group at the Broad Institute, and was distributed as a separate package. By all accounts it did a great job calling somatic SNPs, and was part of the winning entries for multiple DREAM challenges (including some submitted by groups outside the Broad). However it was not able to call indels; and the less said about the indel caller that accompanied it (first named SomaticIndelDetector then Indelocator) the better.
This new incarnation of MuTect leverages much of the HaplotypeCaller’s internal machinery (including the all-important graph assembly bit) to call both SNPs and indels together. Yet it retains key parts of the original MuTect’s internal genotyping engine that allow it to model somatic variation appropriately. This is a major differentiation point compared to HaplotypeCaller, which has expectations about ploidy and allele frequencies that make it unsuitable for calling somatic variants.
As a convenience add-on to MuTect2, we also integrated the cross-sample contamination estimation tool ContEst into GATK 3.5. Note that while the previous public version of this tool relied on genotyping chip data for its operation, this version of the tool has been upgraded to enable on-the-fly genotyping for the case where genotyping data is not available. Documentation of this feature will be provided in the near future. Both MuTect2 and ContEst are now featured in the Tool Documentation section of the Guide. Stay tuned for pipeline-level documentation on performing somatic variant discovery, to be added to the Best Practices docs in the near future.
Please note that this release of MuTect2 is a beta version intended for research purposes only and should not be applied in production/clinical work. MuTect2 has not yet undergone the same degree of scrutiny and validation as the original MuTect since it is so new. Early validation results suggest that MuTect2 has a tendency to generate more false positives as compared to the original MuTect; for example, it seems to overcall somatic mutations at low allele frequencies, so for now we recommend applying post-processing filters, e.g. by hard-filtering calls with low minor allele frequencies. Rest assured that data is being generated and the tools are being improved as we speak. We’re also looking forward to feedback from you, the user community, to help us make it better faster.
Finally, note also that MuTect2 is distributed under the same restricted license as the original MuTect; for-profit users are required to seek a license to use it (please email firstname.lastname@example.org). To be clear, while MuTect2 is released as part of GATK, the commercial licensing has not been consolidated under a single license. Therefore, current holders of a GATK license will still need to contact our licensing office if they wish to use MuTect2.
Whew that was a long wall of text on MuTect2, wasn’t it. Let’s talk about something else now. Annotations! Not functional annotations, mind you -- we’re not talking about e.g. predicting synonymous vs. non-synonymous mutations here. I mean variant context annotations, i.e. all those statistics calculated during the variant calling process which we mostly use to estimate how confident we are that the variants are real vs. artifacts (for filtering and related purposes).
So we have two new annotations, BaseCountsBySample (what it says on the can) and ExcessHet (for excess heterozygosity, i.e. the number of heterozygote calls made in excess of the Hardy-Weinberg expectations), as well as a set of new annotations that are allele-specific versions of existing annotations (with
AS_ prefix standing for Allele-Specific) which you can browse here. Right now we’re simply experimenting with these allele-specific annotations to determine what would be the best way to make use of them to improve variant filtering. In the meantime, feel free to play around with them (via e.g. VariantsToTable) and let us know if you come up with any interesting observations. Crowdsourcing is all the rage, let’s see if it gets us anywhere on this one!
We also made some improvements to the StrandAlleleCountsBySample annotation, to how VQSR handles MQ, and to how VariantAnnotator makes use of external resources -- and we fixed that annoying bug where default annotations were getting dropped. All of which you can read about in the detailed release notes.
CRAM support! Long-awaited by many, lovingly implemented by Vadim Zalunin at EBI and colleagues at the Sanger Institute. We haven’t done extensive testing, and there are a few tickets for improvements that are planned at the htsjdk level -- but it works well enough that we’re comfortable releasing it under a beta designation. Meaning have fun with it, but do your own thorough testing before putting it into production or throwing out your old BAMs!
Static binning of base quality scores. In a nutshell, binning (or quantizing) the base qualities in a BAM file means that instead of recording all possible quality values separately, we group them into bins represented by a single value (by default, 10, 20, 30 or 40). By doing this we end up having to record fewer separate numbers, which through the magic of BAM compression yields substantially smaller files. The idea is that we don’t actually need to be able to differentiate between quality scores at a very high resolution -- if the binning scheme is set up appropriately, it doesn’t make any difference to the variant discovery process downstream. This is not a new concept, but now the GATK engine has an argument to enable binning quality scores during the base recalibration (BQSR) process using a static binning scheme that we have determined produces optimal results in our hands. The level of compression is of course adjustable if you’d like to set your own tradeoff between compression and base quality resolution. We have validated that this type of binning (with our chosen default parameters) does not have any noticeable adverse effect on germline variant discovery. However we are still looking into some possible effects on somatic variant discovery, so we can’t yet recommend binning for that application.
Disable indel quality scores. The Base Recalibration process produces indel quality scores in addition to the regular base qualities. They are stored in the BI and BD tags of the read records, taking up a substantial amount of space in the resulting BAM files. There has been a lot of discussion about whether these indel quals are worth the file size inflation. Well, we’ve done a lot of testing and we’ve now decided that no, for most use cases the indel quals don’t make enough of a difference to justify the extra file size. The one exception to this is when processing PacBio data, it seems that indel quals may help model the indel-related errors of that technology. But for the rest, we’re now comfortable recommending the use of the
--disable_indel_quals argument when writing out the recalibrated BAM file with PrintReads.
Folks, I’m all out of banter for this one, so let’s go straight to the facts. GATK 3.4 contains a shedload of improvements and bug fixes, including some new functionality that we hope you’ll find useful. The full list is available in the detailed release notes.
None of the recent changes involves any disruption to the Best Practice workflow (I hear some sighs of relief) but you’ll definitely want to check out the tweaks we made to the joint discovery tools (HaplotypeCaller, CombineGVCFs and GenotypeGVCFs), which are rapidly maturing as they log more flight time at Broad and in the wild.
Let’s start at the very beginning with HaplotypeCaller (a very good place to start). On the usability front, we’ve finally given in to the nigh-universal complaint about the required variant indexing arguments (
--variant_index_type LINEAR --variant_index_parameter 128000) being obnoxious and a waste of characters. So, tadaa, they are no longer required, as long as you name your output file with the extension
.g.vcf so that the engine knows what level of compression to use to write the gVCF index (which leads to better performance in downstream tools). We think this naming convention makes a lot of sense anyway, as it’s a great way to distinguish gVCFs from regular VCFs on sight, so we hope most of you will adopt it. That said, we stopped short of making this convention mandatory (for now…) so you don’t have to change all your scripts and conventions if you don’t want to. All that will happen (assuming you still specify the variant index parameters as previously) is that you’ll get a warning in the log telling you that you could use the new convention.
Where we’ve been a bit more dictatorial is that we’ve completely disabled the use of
-dcov with HaplotypeCaller because it was causing very buggy behavior due to an unforeseen complication in how different levels of downsampling are applied in HaplotypeCaller. We know that the default setting does the right thing, and there’s almost no legitimate reason to change it, so we’re disabling this for the greater good pending a fix (which may be a long time coming due to the complexity of the code involved).
Next up, CombineGVCFs gets a new option to break up reference blocks at every N sites. The new argument
--breakBandsAtMultiplesOf Nwill ensure that no reference blocks in the combined gVCF span genomic positions that are multiples of N. This is meant to enable scatter-gather parallelization of joint genotyping on whole-genome data, as a workaround to some annoying limitations of the GATK engine that make it unsafe to use
-L intervals that might start within the span of a block record. For exome data, joint genotyping can easily be parallelized by scatter-gathering across exome capture target intervals, because we know that there won’t be any hom-ref block records spanning the target interval boundaries. In contrast, in whole-genome data, there is no equivalent predictable termination of block records, so it’s not possible to know up front where it would be safe to set scatter-gather interval start and end points -- until now!
And finally, GenotypeGVCFs gets an important bug fix, and a very useful new annotation.
The bug is something that has arisen mostly (though not exclusively) from large cohort studies. What happened is that, when a SNP occurred in sample A at a position that was in the middle of a deletion for sample B, GenotypeGVCFs would emit a homozygous reference genotype for sample B at that position -- which is obviously incorrect. The fix is that now, sample B will be genotyped as having a symbolic
<*:DEL> allele representing the deletion.
The new annotation is called
RGQ for Reference Genotype Quality. It is a new sample-level annotation that will be added by GenotypeGVCFs to monomorphic sites if you use the
-allSites argument to emit non-variant sites to the output VCF. This is obviously super useful for evaluating the level of confidence of those sites called homozygous-reference.
This new coverage analysis tool is designed to count read depth in a way that is appropriate for allele-specific expression (ASE) analysis. It counts the number of reads that support the REF allele and the ALT allele, filtering low qual reads and bases and keeping only properly paired reads. The default output format produced by this tool is a structured text file intended to be consumed by the statistical analysis toolkit MAMBA. A paper by Stephane Castel and colleagues describing the complete ASE analysis workflow is available as a preprint on bioarxiv.
We’ve added two new documentation resources to the Guide.
One is a new category of documentation articles called Common Problems, to cover topics that are a specialized subset of FAQs: problems that many users encounter, which are typically due to misunderstandings about input requirements or about the expected behavior of the tools, or complications that arise from certain experimental designs. This category is being actively worked on and we welcome suggestions of additional topics that it should cover.
The second is an Issue Tracker that lists issues that have been reported as well as features or enhancements that have been requested. If you encounter a problem that you think might be a bug (or you have a feature request in mind), you can check this page to see if it’s something we already know about. If you have submitted a bug report, you can use the issue tracker to check whether your issue is in the backlog, in the queue, or is being actively worked on. In future we’ll add some functionality to enable voting on what issues or features should be prioritized, so stay tuned for an announcement on that!
Another season, another GATK release. Personally, Fall is my favorite season, and while I don’t want to play favorites with versions (though unlike with children, you’re allowed to say that the most recent one is the best --and you can tell I was a youngest child) this one is pretty special to me.
-ploidy! Yeah, that’s really all I need to say about that. I was a microbiologist once. And I expect many plant people will be happy too.
Other cool stuff detailed below includes: full functionality for the genotype refinement workflow tools; physical phasing and appropriate handling of dangly bits by HaplotypeCaller (must… resist… jokes…); a wealth of new documentation for variant annotations; and a slew of bug fixes that I won’t go over but are listed in the release notes.
As announced earlier this week, we recently developed a workflow for refining genotype calls, intended for researchers who need highly accurate genotype information as well as preliminary identification of possible de novo mutations (see the documentation for details). Although all the tools involved were already available in GATK 3.2, some functionalities were not, so we’re very happy to finally make all of them available in this new version. Plus, we like the new StrandOddsRatio annotation (which sort of replaces FisherStrand for estimating strand bias) so much that we made it a standard one, and it now gets annotated by default.
This is also a feature that was announced a little while ago, but until now was only fully available in the nightly builds, which are technically unsupported unless we tell you to use them to get past a bad bug. In this new release, both HaplotypeCaller and GenotypeGVCFs are able to deal with non-diploid organisms (whether haploid or exotically polyploid). In the case of HaplotypeCaller, you need to specify the ploidy of your non-diploid sample with the
-ploidy argument. HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately. On the bright side, you can combine the resulting files afterward. In particular, if you’re running the
-ERC GVCF workflow, you’ll find that both CombineGVCFs and GenotypeGVCFs are able to handle mixed ploidies (between locations and between samples). Both tools are able to correctly work out the ploidy of any given sample at a given site based on the composition of the GT field, so they don’t require you to specify the
You know how HC performs a complete reassembly of reads in an ActiveRegion? (If you don’t, go read this now. Go on, we’ll wait for you.) Well, this involves building an assembly graph, of course (of course!), and it produces a list of haplotypes. Fast-forward a couple of steps, and you end up with a list of variants. That’s great, but until now, those variants were unphased, meaning the HC didn’t give you any information about whether any two variants’ alleles were on the same haplotype (meaning, on the same physical piece of DNA) or not. For example, you’d want to know whether you had this:
But HC wouldn’t tell you which it was in its output. Which was a shame, because the HC sees that information! It took a little tweaking to get it to talk, but now it emits physical phasing by default in its GVCF output (both banded GVCF and BP_RESOLUTION).
In a nutshell, phased records will look like this:
1 1372243 . T <NON_REF> . . END=1372267 <snip> <snip> 1 1372268 . G A,<NON_REF> . . <snip> GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:30,40,0:70:99:0|1:1372268_G_A:<snip> 1 1372269 . G T,<NON_REF> . . <snip> GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:30,41,0:71:99:0|1:1372268_G_A:<snip> 1 1372270 . C <NON_REF> . . END=1372299 <snip> <snip>
You see that the phasing info is encoded in two new sample-level annotations, PID (for phase identifier) and PGT (phased genotype). More than two variants can be phased in a group with the same PID, and that can include mixed types of variants (e.g. SNPs and indels).
The one big caveat related to the physical phasing output by HC in GVCFs is that, like the GVCF itself, it is not intended to be used directly for analysis! You must run your GVCFs through GenotypeGVCFs in order to get the finalized, properly formatted, ready-for-analysis calls.
Speaking of HaplotypeCaller getting more helpful all the time, here’s some more of that. This still has to do with the graph assembly, and specifically, with how HC handles the bits at the edges of the graph, which are called dangling heads and dangling tails. Without going too far into the details, let’s just say that sometimes you have a variant that’s near the edge of a covered region, and due to technical reasons (cough kmer size cough) the end of the variant path can’t be tied back into the reference path, so it just dangles there (like, say, Florida) and gets trimmed off in the next step (rising ocean levels). And thus the variant is lost (boo).
We originally started paying attention to this because it often happens at the edge of exons near splice junctions in RNAseq data, but it can also happen in DNA data. The solution was to give HC the ability to recover these cliff-dwelling variants by merging the dangling ends back into the graph using special logic tailored for those situations. If you have been using our RNAseq Best Practices, then you may recognize this as the logic invoked by the
--recoverDanglingHeads argument. In the new version, the functionality has been improved further and is now enabled by default for all variant calling (so you no longer need to specify that argument for RNAseq analysis). The upshot is that sensitivity is improved, especially for RNAseq data but also for DNA.
Finally, I want to attract everyone’s attention to the Variant Annotations section of the Tool Documentation, which has just undergone a comprehensive overhaul. All annotations now have some kind of documentation outlining their general purpose, output, interpretation, caveats and some notes about how they’re calculated where applicable. Tell us what you think; we are feedback junkies.
Better late than never (right?), here are the version highlights for GATK 3.2. Overall, this release is essentially a collection of bug fixes and incremental improvements that we wanted to push out to not keep folks waiting while we're working on the next big features. Most of the bug fixes are related to the HaplotypeCaller and its "reference confidence model" mode (which you may know as
-ERC GVCF). But there are also a few noteworthy improvements/changes in other tools which I'll go over below.
The "reference confidence model" workflow, which I hope you have heard of by now, is that awesome new workflow we released in March 2014, which was the core feature of the GATK 3.0 version. It solves the N+1 problem and allows you to perform joint variant analysis on ridiculously large cohorts without having to enslave the entire human race and turning people into batteries to power a planet-sized computing cluster. More on that later (omg we're writing a paper on it, finally!).
You can read the full list of improvements we've made to the tools involved in the workflow (mainly HaplotypeCaller and Genotype GVCFs) in Eric's (unusually detailed) Release Notes for this version. The ones you are most likely to care about are that the "missing PLs" bug is fixed, GenotypeGVCFs now accepts arguments that allow it to emulate the HC's genotyping capabilities more closely (such as
--includeNonVariantSites), the AB annotation is fully functional, reference DPs are no longer dropped, and CatVariants now accepts lists of VCFs as input. OK, so that last one is not really specific to the reference model pipeline, but that's where it really comes in handy (imagine generating a command line with thousands of VCF filenames -- it's not pretty).
The coverage metrics (DP and AD) reported by HaplotypeCaller are now those calculated after the HC's reassembly step, based on the reads having been realigned to the most likely haplotypes. So the metrics you see in the variant record should match what you see if you use the
-bamout option and visualize the reassembled ActiveRegion in a genome browser such as IGV. Note that if any of this is not making sense to you, say so in the comments and we'll point you to the new HaplotypeCaller documentation! Or, you know, look for it in the Guide.
We updated the plotting scripts used by BQSR and VQSR to use the latest version of ggplot2, to get rid of some deprecated function issues. If your Rscripts are suddenly failing, you'll need to update your R libraries.
We're sorry for making you jump through all these hoops recently. As if the switch to Maven wasn't enough, we have now completed a massive reorganization/renaming of the codebase that will probably cause you some headaches when you port your tools to the newest version. But we promise this is the last big wave, and ultimately this will make your life easier once we get the GATK core framework to be a proper maven artifact.
In a nutshell, the base name of the codebase has changed from
gatk (which hopefully makes more sense), and the most common effect is that
sting.gatk classpath segments are now
gatk.tools. This, by the way, is why we had a bunch of broken documentation links; most of these have been fixed (yay symlinks) but there may be a few broken URLs remaining. If you see something, say something, and we'll fix it.
This may seem crazy considering we released the big 3.0 version not two weeks ago, but yes, we have a new version for you already! It's a bit of a special case because this release is all about the hardware-based optimizations we had previously announced. What we hadn't announced yet was that this is the fruit of a new collaboration with a team at Intel (which you can read more about here), so we were waiting for everyone to be ready for the big reveal.
So basically, the story is that we've started collaborating with the Intel Bio Team to enable key parts of the GATK to run more efficiently on certain hardware configurations. For our first project together, we tackled the PairHMM algorithm, which is responsible for a large proportion of the runtime of HaplotypeCaller analyses. The resulting optimizations, which are the main feature in version 3.1, produce significant speedups for HaplotypeCaller runs on a wide range of hardware.
We will continue working with Intel to further improve the performance of GATK tools that have historically been afflicted with performance issues and long runtimes (hello BQSR). As always, we hope these new features will make your life easier, and we welcome your feedback in the forum!
Note that these optimizations currently work on Linux systems only, and will not work on Mac or Windows operating systems. In the near future we will add support for Mac OS. We have no plans to add support for Windows since the GATK itself does not run on Windows.
Please note also that to take advantage of these optimizations, you need to opt-in by adding the following flag to your GATK command:
Here is a handy little table of the speedups you can expect depending on the hardware and operating system you are using. The configurations given here are the minimum requirements for benefiting from the expected speedup ranges shown in the third column. Keep in mind that these numbers are based on tests in controlled conditions; in the wild, your mileage may vary.
|Linux kernel version||Architecture / Processor||Expected speedup||Instruction set|
|Any 64-bit Linux||Any x86 64-bit||1-1.5x||Non-vector|
|Linux 2.6 or newer||Penryn (Core 2 or newer)||1.3-1.8x||SSE 4.1|
|Linux 2.6.30 or newer||SandyBridge (i3, i5, i7, Xeon E3, E5, E7 or newer)||2-2.5x||AVX|
To find out exactly which processor is in your machine, you can run this command in the terminal:
$ cat /proc/cpuinfo | grep "model name" model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
In this example, the machine has 4 cores (8-threads), so you see the answer 8 times. With the model name (here i7-2600) you can look up your hardware's relevant capabilities in the Wikipedia page on vector extensions.
Alternatively, Intel has provided us with some links to lists of processors categorized by architecture, in which you can look up your hardware:
Finally, a few notes to clarify some concepts regarding Linux kernels vs. distributions and processors vs. architectures:
SandyBridge and Penryn are microarchitectures; essentially, these are sets of instructions built into the CPU. Core 2, core i3, i4, i7, Xeon e3, e5, e7 are the processors that will implement a specific architecture to make use of the relevant improvements (see table above).
The Linux kernel has no connection with Linux distribution (e.g. Ubuntu, RedHat etc). Any distribution can use any kernel they want. There are "default kernels" shipped with each distribution, but that's beyond the scope of this article to cover (there are at least 300 Linux distributions out there). But you can always install whatever kernel version you want.
Note: There are no version highlights available for versions earlier than 2.2.
GATK 3.6 was released on June 1, 2016. Itemized changes are listed below. For more details, see the user-friendly version highlights.
HaplotypeCaller will now emit a no-call (./.) for any sample where GQ is zero, in both normal and GVCF modes, instead of emitting a specific genotype in which we have zero confidence.
GenotypeGVCFs will now emit a QUAL value for hom-ref sites when run in
Implemented tracking of dropped reads by HaplotypeCaller and MuTect2 (see highlights for details).
Assorted optimizations to the joint calling code, expected to speed up genotyping (not the overall tool run) by about 10 percent.
-ASmode to run VQSR in an allele-specific manner (both VariantRecalibrator and ApplyRecalibration) (still experimental).
-SMAare specified together.
—splitMultiallelics(previously all were discarded).
Added an argument to HaplotypeCaller and GenotypeGVCFs,
-maxNumPLValues, that controls the maximum number of PL values that can be emitted for a given site. If the number of PLs resulting from the combination of observed alleles and ploidy exceeds this value, no PLs will be emitted. This will cause subsetting errors in SelectVariants but empowers the user to identify and work around difficult sites where this happens.
—reference_window_stopto set the reference window size used by VariantAnnotator when annotating hompolymers through the HomopolymerRun annotation. This makes it possible to deal with the problem of homopolymer stretches that are longer than the default window size.
Expose time between checks for whether new jobs can be submitted as a user-settable parameter on CLi. Useful when testing pipelines to make idle time shorter. Contributed by @dakl / Daniel Klevebring.
mem_freefrom resident memory request params for Queue because it doesn't work and wouldn't actually reserve memory.
GATK 3.5 was released on November 25, 2015. Itemized changes are listed below. For more details, see the user-friendly version highlights.
Added allele-specific version of existing annotations: AS_BaseQualityRankSumTest, AS_FisherStrand, AS_MappingQualityRankSumTest, AS_RMSMappingQuality, AS_RankSumTest, AS_ReadPosRankSumTest, AS_StrandOddsRatio, AS_QualByDepth and AS_InbreedingCoeff.
Added BaseCountsBySample annotation. Intended to provide insight into the pileup of bases used by HaplotypeCaller in the calling process, which may differ from the pileup observed in the original bam file because of the local realignment and additional filtering performed internally by HaplotypeCaller. Can only be requested from HaplotypeCaller, not VariantAnnotator.
Added ExcessHet annotation. Estimates excess heterozygosity in a population of samples. Related to but distinct from InbreedingCoeff, which estimates evidence for inbreeding in a population. ExcessHet scales more reliably to large cohort sizes.
Added FractionInformativeReads annotation. Reports the number of reads that were considered informative by HaplotypeCaller (over all samples).
Enforced calculating GenotypeAnnotations before InfoFieldAnnotations. This ensures that the AD value is available to use in the QD calculation.
Reorganized standard annotation groups processing to ensure that all default annotations always get annotated regardless of what is specified on the command line. This fixes a bug where default annotations were getting dropped when the command line included annotation requests.
Made GenotypeGVCFs subset StrandAlleleCounts intelligently, i.e. subset the SAC values to the called alleles. Previously, when the StrandAlleleCountsBySample (SAC) annotation was present in GVCFs, GenotypeGVCFs carried it over to the final VCF essentially unchanged. This was problematic because SAC includes the counts for all alleles originally present (including NON-REF) even when some are not called in the final VCF. When the full list of original alleles is no longer available, parsing SAC could become difficult if not impossible.
Added new MQ jittering functionality to improve how VQSR handles MQ. Note that HaplotypeCaller now calculates a new annotation called RAW_MQ per-sample, which is then integrated per-cohort by GenotypeGVCFs to produce the MQ annotation.
VariantAnnotator can now annotate FILTER field from an external resource. Usage:
--resource:foo resource.vcf --expression foo.FILTER
VariantAnnotator can now check allele concordance when annotating with an external resource. Usage:
Allowed overriding hard-coded cutoff for allele length in ValidateVariants and in LeftAlignAndTrimVariants. Usage:
--reference_window_stop N where N is the desired cutoff.
Also in LeftAlignAndTrimVariants, trimming multiallelic alleles is now the default behavior.
Fixed ability to mask out snps with
--snpmask in FastaAlternateReferenceMaker.
Also in FastaAlternateReferenceMaker, fixed merging of contiguous intervals properly, and made the tool produce more informative contig names.
Fixed a bug in CombineVariants that occurred when one record has a spanning deletion and needs a padded reference allele.
Added a new VariantEval evaluation module, MetricsCollection, that summarizes metrics from several EV modules.
Enabled family-level stratification in MendelianViolationEvaluator of VariantEval (if a ped file is provided), making it possible to count Mendelian violations for each family in a callset with multiple families.
Added the ability to SelectVariants to enforce 4.2 version output of the VCF spec when processing older files. Use case: the 4.2 spec specifies that GQ must be an integer; by default we don’t enforce it (so if reading an older file that used decimals, we don’t change it) but the new argument
--forceValidOutput converts the values on request. Not made default because of some performance slowdown -- so writing VCFs is now fast by default, compliant by choice.
-U ALLOW_SEQ_DICT_INCOMPATIBILITYcan be used (with caution) to override the check.
Various improvements to the tools’ performance, especially HaplotypeCaller, by making the code more efficient and cutting out crud.
GenotypeGVCFs now emits a no-call (./.) when the evidence is too ambiguous to make a call at all (e.g. all the PLs are zero). Previously this would have led to a hom-ref call with RGQ=0.
Fixed a bug in GenotypeGVCFs that sometimes generated invalid VCFs for haploid callsets. The tool was carrying over the AD from alleles that had been trimmed out, causing field length mismatches.
Ensured inputPriors get used if they are specified to the genotyper (previously they were ignored). Also improved docs on
--indel_ heterozygosity priors.
Fixed bug that affected the
--ignoreInputSamples behavior of CalculateGenotypePosteriors.
Added option to OverclippedReadFilter to not require soft-clips on both ends. Contributed by Jacob Silterra.
Fixed a bug in IndelRealigner where the tool was incorrectly "fixing" mates when supplementary alignments are present. The patch involves ignoring supplementary alignments.
Support for reading and writing CRAM files. Some improvements are still expected in htsjdk. Contributed by Vadim Zalunin at EBI and collaborators at the Sanger Institute.
Made interval-list output format dependent on the file extension (for RealignerTargetCreator). If the extension is
.interval_list, output will be formatted as a proper Picard interval list (with sequence dictionary). Otherwise it will be a basic GATK interval list as previously.
Added a new JobRunner called ParallelShell that will run jobs locally on one node concurrently as specified by the DAG, with the option to limit the maximum number of concurrently running jobs using the flag
maximumNumberOfJobsToRunConcurrently. Contributed by Johan Dahlberg.
PER_TARGET_COVERAGEargument and added extension for Picard CollectWgsMetrics.
GATK 3.4 was released on May 15, 2015. Itemized changes are listed below. For more details, see the user-friendly version highlights.
--mergeVariantsViaLDargument in HaplotypeCaller since it didn’t work. To merge complex substitutions, use ReadBackedPhasing as a post-processing step.
allowNonUniqueKmersInRefso that it applies to all kmer sizes. This resolves some assembly issues in low-complexity sequence contexts and improves calling sensitivity in those regions.
.g.vcffile extension. See Highlights for more details.
-uniquifySamplesto GenotypeGVCFs to make it possible to genotype together two different datasets containing the same sample.
-dcovsetting for HaplotypeCaller (pending a fix to the downsampling control system) to prevent buggy behavior. See Highlights for more details.
--breakBandsAtMultiplesOf Nwill ensure that no reference blocks span across genomic positions that are multiples of N. This is especially important in the case of scatter-gather where you don't want your scatter intervals to start in the middle of blocks (because of a limitation in the way
-Lworks in the GATK for VCF records with the END tag). See Highlights for more details.
-trimargument to trim (simplify) alleles to a minimal representation.
-trimAlternatesargument to remove all unused alternate alleles from variants. Note that this is pretty aggressive for monomorphic sites.
-noTrimargument to preserve original alleles.
-fixNDNflag fully functional.
-SMAis specified. Note that FORMAT fields behave the same as INFO fields - if the annotation has a count of A (one entry per Alt Allele), it is split across the multiple output lines. Otherwise, the entire list is output with each field.
-drfargument to disable default read filters. Limited to specific tools and specific filters (currently only DuplicateReadFilter).
-qsub-broadargument. When -qsub-broad is specified instead of
-qsub, Queue will use the
h_vmemparameter instead of
h_rssto specify memory limit requests. This was done to accommodate changes to the Broad’s internal job scheduler. Also causes the GridEngine native arguments to be output by default to the logger, instead of only when in debug mode.
slf4j-log4j12version (contributed by user Biocyberman).
GATK 3.3 was released on October 23, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.
--sample_nameargument. This is a shortcut for people who have multi-sample BAMs but would like to use
-ERC GVCFmode with a particular one of those samples.
--ignore_all_filtersoption. If specified, the variant recalibrator will ignore all input filters and treat sites as unfiltered.
--keepOriginalAC functionalityin SelectVariants to work for sites that lose alleles in the selection.
read_grouparguments no longer appear in the header.
--bcffor VCF files, and
--generate_md5for BAM files moved to the engine level.
GATK 3.2 was released on July 14, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.
We also want to take this opportunity to thank super-user Phillip Dexheimer for all of his excellent contributions to the codebase, especially for this release.
optfunctions to work with recent versions of the ggplot2 R library.
optfunctions to work with recent versions of the ggplot2 R library.
Note: There are no release notes available for versions earlier than 2.0.
|31st May 2016||Update pom versions for the 3.6 release|
|31st May 2016||Merge remote-tracking branch 'unstable/master'|
|31st May 2016||Merge pull request #1394 from broadinstitute/gvda_add_colt_dependency|
|31st May 2016||Add colt > cern.jet.normal dependency|
|31st May 2016||Merge pull request #1391 from broadinstitute/rhl_htsjdk_picard_2.4.1|
|3rd May 2016||Move Move htsjdk and picard to version 2.4.1|
|3rd May 2016||Merge pull request #1392 from broadinstitute/gvda_vqsr_retries_mdp|
|3rd May 2016||Ability to retry building VQSR model (contributed by mdp)|
|3rd May 2016||Merge pull request #1387 from broadinstitute/gvda_gatkdocs_enhancements|
|2nd May 2016||GATKDocs overhaul|
|2nd May 2016||Merge pull request #1388 from broadinstitute/gvda_disable_phone_home|
|2nd May 2016||Remove Phone Home|
|2nd May 2016||Merge pull request #1374 from broadinstitute/gvda_move_ReassignOriginalMQAfterIndelRealignment_toPublic_#1364|
|2nd May 2016||Add M2 to the HC annotations check|
|2nd May 2016||Improved article linking in error messages|
|2nd May 2016||Hide deletion arguments in ASEReadCounter|
|2nd May 2016||Fixed warnings in LAATV and ValidateVariants|
|2nd May 2016||Move indel realignment to public|
|2nd May 2016||Moved post-IR MQ reverter filter to public|
|2nd May 2016||Merge pull request #1377 from broadinstitute/vrr_non_ref_likelihood_recal|
Note: There are no PDF files of the Guide Book available for versions earlier than 2.3-9.