Previously, we covered the spirit of GATK 3.0 (what our intentions are for this new release, and what we’re hoping to achieve). Let’s now have a look at the top three features you can look forward to in 3.0, in no particular order:

  1. Optimized PairHMM algorithm to make GATK run faster
  2. Single-sample pipeline for joint variant discovery
  3. Best practices for calling variants on RNAseq data

1. Optimized PairHMM algorithm to make HaplotypeCaller faster

At this point everyone knows that the HaplotypeCaller is fabulous (you know this, right?) but beyond a certain number of samples that you’re trying to call jointly, it just grinds to a crawl, and any further movement is on the scale of continental drift. Obviously this is a major obstacle if you’re trying to do any kind of work at scale beyond a handful of samples, and that’s why it hasn’t been used in recent large-cohort projects despite showing best-in-class performance in terms of discovery power.

The major culprit in this case is the PairHMM algorithm, which takes up the lion’s share of HC runtime. With the help of external collaborators (to be credited in a follow-up post) we rewrote the code of the PairHMM to make it orders of magnitude faster, especially on specialized hardware like GPU and FPGA chips (but you’ll still see a speedup on “regular” hardware).

We plan to follow up on this by doing similar optimizations on the other “slowpoke” algorithms that are responsible for long runtimes in GATK tools.

2. Single-sample pipeline for joint variant discovery

Some problems in variant calling can’t be solved by Daft Punk hardware upgrades (better faster stronger) alone. Beyond the question of speed, a major issue with multi-sample variant discovery is that you have to wait until all the samples are available to call variants on them. Then, if later you want to add some more samples to your cohort, you have to re-call all of them together, old and new. This, also known as the “N+1 problem”, is a huge pain in the anatomy.

The underlying idea of the “single-sample pipeline for joint variant discovery” is to decouple the two steps in the variant calling process: identifying evidence of variation, and interpreting the evidence. Only the second step needs to be done jointly on all samples, while the first step can be done just as well (and a heck of a lot faster) on one sample at a time.

The new pipeline allows us to process each sample as it comes off the sequencing machine, up to the first step of variant calling. Cumulatively, this will produce a database of per-sample, per-site allele frequencies. Then it’s just a matter of running a joint analysis on the database, which can be done incrementally each time a new sample is added or at certain intervals or timepoints, depending on the research needs, because this step runs quickly and cheaply.

We’ll go into the details of exactly how this works in a follow-up post. For now, the take-home message is that it’s a “single-sample pipeline” because you do the heavy-lifting per-sample (and just once, ever), but you are empowered to perform “joint discovery” because you interpret the evidence from each sample in light of what you see in all the other samples, and you can do this at any point in the project timeline.

3. Best practices for calling variants on RNAseq

Our Best Practices recommendations for calling variants on DNA sequence data have proved to be wildly popular with the scientific community, presumably because it takes a lot of the guesswork out of running GATK, and provides a large degree of reproducibility.

Now, we’re excited to introduce our Best Practices recommendations for calling variants on RNAseq data. These recommendations are based on our classic DNA-focused Best Practices, with some key differences the early data processing steps, as well as in the calling step. We do not yet have RNAseq-specific recommendations for variant filtering/recalibration, but will be developing those in the coming weeks.

We’ll go into the details of the RNAseq Best Practices in a follow-up post, but in a nutshell, these are the key differences: use STAR for alignment, add an exon splitting and cleanup step, and tell the variant caller to take the splits into account. The latter involves some new code added to the variant callers; it is available to both HaplotypeCaller and UnifiedGenotyper, but UG is currently missing a whole lot of indels, so we do recommend using only HC in the immediate future.

Keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have only been working with RNAseq for a few months, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing. We will be improving these recommendations progressively as we go, and we hope that the researcher community will help us by providing feedback of their experiences applying our recommendations to their data.

Return to top


Am I correct in my reading of your poster that you've rewritten a fair ol' bit of haplotype caller in c++?

Mon 24 Feb 2014


You'll have to ask @Carneiro for the gory details (or wait for the upcoming in-detail post), but as far as I know it took quite a bit of re-implementing things in c++, yes. Though strictly speaking it's the PairHMM code that was rewritten (which is also used by UG), not the HaplotypeCaller overall.

Mon 24 Feb 2014


Is there any special option that I need to specify in order to take advantage of the GPU optimization?

Mon 24 Feb 2014


Apologies but the GPU stuff didn't make it into the 3.0 release. It's *this* close to being ready though so we are contemplating making a 3.1 release really soon with just this stuff added. Will let you know more when we figure out our plans.

Mon 24 Feb 2014


"GPU stuff ... close to being ready"
Does GPU stuff also include the AVX stuff on Mauricio's poster? If I understand AVX correctly, recent versions of Intel Xeon chips coupled with a recent version of Linux will see a performance boost, without any user intervention.

Mon 24 Feb 2014


Yup, that's it. Although initially we might ask users to opt into these performance boosts instead of having them on by default (because it's always safer to let our super-users try things out first and give us feedback).

Mon 24 Feb 2014

At a glance

Follow us on Twitter

GATK Dev Team


RT @BroadGenomics: Missed #GATK WKSP at #ASHG16? 10am meet Geraldine at booth 329 - Broad’s GATK Guru @gatk_dev #BroadGenomicsExperts https…
20 Oct 16
RT @NJL_NGS: Broad Institute Workbench workshop now at #ashg16.
19 Oct 16
RT @konradjk: We're rebranding a bit! gnomAD now adds WGS regions, as well as doubling the data depth and increasing diversity (5K ASJ!) #A…
19 Oct 16
RT @NJL_NGS: now launched!
19 Oct 16
RT @dgmacarthur: For those who missed it: we just posted variant frequencies from 126,216 exomes and 15,136 genomes at
19 Oct 16

Our favorite tweets from others

Asked a question, speaker jokes not to edit your BAM header and expect GATK to let you get away with it.
19 Oct 16
My new hobby: finding incomprehensible diagrams on office whiteboards and adding alarming conclusions to them
30 Sep 16
I've easily written my first custom ReadFilter for GATK. The @gatk_dev 's toolkit is a great example of programming.
21 Sep 16
@gatk_dev "make it so"
8 Sep 16
it's the nightly build owl for GATK :D
11 Aug 16
See more of our favorite tweets...
Search blog by tag

ad appistry ashg ashg16 benchmarks best-practices bug bug-fixed cancer cloud cluster cnv commandline commandlinegatk community compute conferences cram cromwell denovo depthofcoverage diagnosetargets error fix forum gatk3 gatk4 genotype genotype-refinement genotypegvcfs google grch38 gvcf haploid haplotypecaller hg38 holiday hts htsjdk ibm java8 job job-offer jobs license meetings mendelianviolations multithreading mutect mutect2 ngs nt outreach pairhmm parallelism patch performance phone-home picard pipeline plans ploidy polyploid poster presentations printreads profile promote release release-notes rnaseq runtime saas script search selectvariants sequencing service slides snow speed status sting support syntax talks team terminology third-party-tools topstory trivia troll tutorial unifiedgenotyper vcf-gz version-highlights versions vqsr wdl webinar workflow workshop