Previously, we covered the spirit of GATK 3.0 (what our intentions are for this new release, and what we’re hoping to achieve). Let’s now have a look at the top three features you can look forward to in 3.0, in no particular order:

  1. Optimized PairHMM algorithm to make GATK run faster
  2. Single-sample pipeline for joint variant discovery
  3. Best practices for calling variants on RNAseq data

1. Optimized PairHMM algorithm to make HaplotypeCaller faster

At this point everyone knows that the HaplotypeCaller is fabulous (you know this, right?) but beyond a certain number of samples that you’re trying to call jointly, it just grinds to a crawl, and any further movement is on the scale of continental drift. Obviously this is a major obstacle if you’re trying to do any kind of work at scale beyond a handful of samples, and that’s why it hasn’t been used in recent large-cohort projects despite showing best-in-class performance in terms of discovery power.

The major culprit in this case is the PairHMM algorithm, which takes up the lion’s share of HC runtime. With the help of external collaborators (to be credited in a follow-up post) we rewrote the code of the PairHMM to make it orders of magnitude faster, especially on specialized hardware like GPU and FPGA chips (but you’ll still see a speedup on “regular” hardware).

We plan to follow up on this by doing similar optimizations on the other “slowpoke” algorithms that are responsible for long runtimes in GATK tools.

2. Single-sample pipeline for joint variant discovery

Some problems in variant calling can’t be solved by Daft Punk hardware upgrades (better faster stronger) alone. Beyond the question of speed, a major issue with multi-sample variant discovery is that you have to wait until all the samples are available to call variants on them. Then, if later you want to add some more samples to your cohort, you have to re-call all of them together, old and new. This, also known as the “N+1 problem”, is a huge pain in the anatomy.

The underlying idea of the “single-sample pipeline for joint variant discovery” is to decouple the two steps in the variant calling process: identifying evidence of variation, and interpreting the evidence. Only the second step needs to be done jointly on all samples, while the first step can be done just as well (and a heck of a lot faster) on one sample at a time.

The new pipeline allows us to process each sample as it comes off the sequencing machine, up to the first step of variant calling. Cumulatively, this will produce a database of per-sample, per-site allele frequencies. Then it’s just a matter of running a joint analysis on the database, which can be done incrementally each time a new sample is added or at certain intervals or timepoints, depending on the research needs, because this step runs quickly and cheaply.

We’ll go into the details of exactly how this works in a follow-up post. For now, the take-home message is that it’s a “single-sample pipeline” because you do the heavy-lifting per-sample (and just once, ever), but you are empowered to perform “joint discovery” because you interpret the evidence from each sample in light of what you see in all the other samples, and you can do this at any point in the project timeline.

3. Best practices for calling variants on RNAseq

Our Best Practices recommendations for calling variants on DNA sequence data have proved to be wildly popular with the scientific community, presumably because it takes a lot of the guesswork out of running GATK, and provides a large degree of reproducibility.

Now, we’re excited to introduce our Best Practices recommendations for calling variants on RNAseq data. These recommendations are based on our classic DNA-focused Best Practices, with some key differences the early data processing steps, as well as in the calling step. We do not yet have RNAseq-specific recommendations for variant filtering/recalibration, but will be developing those in the coming weeks.

We’ll go into the details of the RNAseq Best Practices in a follow-up post, but in a nutshell, these are the key differences: use STAR for alignment, add an exon splitting and cleanup step, and tell the variant caller to take the splits into account. The latter involves some new code added to the variant callers; it is available to both HaplotypeCaller and UnifiedGenotyper, but UG is currently missing a whole lot of indels, so we do recommend using only HC in the immediate future.

Keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have only been working with RNAseq for a few months, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing. We will be improving these recommendations progressively as we go, and we hope that the researcher community will help us by providing feedback of their experiences applying our recommendations to their data.


Return to top

TechnicalVault


Am I correct in my reading of your poster that you've rewritten a fair ol' bit of haplotype caller in c++?

Mon 24 Feb 2014

Geraldine_VdAuwera


You'll have to ask @Carneiro for the gory details (or wait for the upcoming in-detail post), but as far as I know it took quite a bit of re-implementing things in c++, yes. Though strictly speaking it's the PairHMM code that was rewritten (which is also used by UG), not the HaplotypeCaller overall.

Mon 24 Feb 2014

blakeoft


Is there any special option that I need to specify in order to take advantage of the GPU optimization?

Mon 24 Feb 2014

ebanks


Apologies but the GPU stuff didn't make it into the 3.0 release. It's *this* close to being ready though so we are contemplating making a 3.1 release really soon with just this stuff added. Will let you know more when we figure out our plans.

Mon 24 Feb 2014

Oprah


"GPU stuff ... close to being ready"
Does GPU stuff also include the AVX stuff on Mauricio's poster? If I understand AVX correctly, recent versions of Intel Xeon chips coupled with a recent version of Linux will see a performance boost, without any user intervention.

Mon 24 Feb 2014

ebanks


Yup, that's it. Although initially we might ask users to opt into these performance boosts instead of having them on by default (because it's always safer to let our super-users try things out first and give us feedback).

Mon 24 Feb 2014





At a glance



Follow us on Twitter

GATK Dev Team

@gatk_dev

@geoffjentry @PatriciaMBrent Indel realn gone now; much faster BQSR + Sparkified tools in GATK4a. And wider scope of application/ use cases.
9 Dec 16
@geoffjentry @PatriciaMBrent B fair, other slide says results similar. Not uncommon for benchmark tests. But speed comparison v outdated 1/2
9 Dec 16
RT @geoffjentry: Talk by Stavros Papadopoulos about TileDB from @intel - used by @BroadGenomics to power our joint genotyping https://t.co/
8 Dec 16
@iskander @NJL_NGS Hm. Can you please post this in the forum?
8 Dec 16
@iskander @NJL_NGS Treat them like any other read, no special handling. Problems arise in CIGAR functions that don’t know how to handle Ns.
7 Dec 16

Our favorite tweets from others

Currently in a time-out for saying that duck fat had a certain "je ne sais quack" at the thanksgiving dinner table.
25 Nov 16
@dgmacarthur @BioMickWatson @StevenNHart @splon There's even a shop near Broad that apparently fixes Hail code erro… https://t.co/IZ4BcgRZYE
19 Nov 16
I'm very happy to be at GATK Workshop hands on with @gatk_dev ! There's always something to learn. Tip: there is free coffee. Lol
8 Nov 16
Have recently begun to think of slide editing as ‘downsampling my slides’. I suspect this indicates something is wrong with me.
1 Nov 16
See more of our favorite tweets...
Search blog by tag

appistry ashg ashg16 benchmarks best-practices bug bug-fixed cancer cloud cluster cnv collaboration commandline commandlinegatk community compute conferences cram cromwell denovo depthofcoverage diagnosetargets error fix forum gatk3 gatk4 genotype genotype-refinement genotypegvcfs google grch38 gvcf haploid haplotypecaller hg38 holiday hts htsjdk ibm intel java8 job job-offer jobs license meetings mendelianviolations mutect mutect2 ngs outreach pairhmm parallelism patch performance phone-home picard pipeline plans ploidy polyploid poster presentations printreads profile promote release release-notes rnaseq runtime saas script search sequencing service slides snow speed status sting support syntax talks team terminology third-party-tools topstory troll tutorial unifiedgenotyper vcf-gz version-highlights versions vqsr wdl webinar workflow workshop