Unboxing GATK4

Posted by Geraldine_VdAuwera on 24 May 2017

This is one of two posts announcing the imminent beta release of GATK4; for details about the open-source licensing, see this other post.

You've probably heard it by now: we are on the cusp of releasing GATK4 into beta status (targeting mid-June), and we plan to push out a general release shortly thereafter (targeting midsummer). That's great. So what's in the box?

Over two years of active development have gone into producing GATK4, and I'm happy to say we have plenty to show for it. Specifically, we've pushed the evolution of GATK on three fronts: (1) technical performance, i.e. speed and scalability; (2) new functionality and expanded scope of analysis, e.g. we can do CNVs now; and (3) openness to collaboration, through open-sourcing as well as general developer-friendliness (documented code! consistent APIs! clear contribution guidelines!).

Want more detail? Let me give you a tour of the highlights, using slides from the presentation I gave at Bio-IT earlier today (code reuse: it's not just for code anymore).

It's basically a Tesla now

Let's talk about the engine, i.e. the part of the GATK that does all the boring things like reading and writing files, understanding formats, applying multithreading, applying read filters, extracting data, doing basic math -- all the functionality that's common to many tools. The old ("classic"?) GATK engine was waaaay over-engineered, too complex for its own good really, and that's a big reason why GATK has always been a bit on the slow side.

So we rewrote it from scratch to be much more efficient, and it was faster, so that was cool. Then we gave it superpowers.

Like what? Well, how about some major speed enhancements through components like the Intel Genomics Kernel Library (GKL), which provides optimized code for things like file compression and decompression, as well as algorithms like PairHMM (a heavily-used component of the HaplotypeCaller's genotyping code). There's a hot new datastore component, also contributed by Intel, called GenomicsDB, which constitutes a quantum leap in scalability for joint genotyping (more on that in a bit). There's built-in support for Apache Spark to take advantage of industry-standard, robust parallelism, including multi-threading that doesn't suck (good riddance to -nt/-nct!). There's functionality to submit jobs to Google's Dataproc Spark execution service, and more generally to read/write files directly from Google Cloud Storage. And finally, there's increased flexibility in terms of how the engine can traverse the data, which opens up new types of analyses that weren't possible before.

Show me the numbers

We're putting together systematic benchmarks that we'll publish with the release to support all these wild claims of dramatically enhanced performance; and in the meantime let me just show you a few preliminary numbers to give you a sense of scale of the GATK3 vs GATK4 evolution. Keep in mind this is run on ordinary hardware with bog-standard parameters, so nothing fancy -- it's just meant to be a baseline comparison to illustrate what migrating to GATK4 gets you, at the very least, before you even start leveraging advanced speed-freak features like Spark.

The takeaway? Over a full pipeline we see up to 5x speedup. Not too shabby! Looking at a subset of tools that are traditionally long-running, and therefore have an outsize effect on overall runtime, we see that those that were ported over from GATK3 (BaseRecalibrator, HaplotypeCaller) ran about twice as fast on average from a combination of the engine enhancements and optimizations/cleanup made to the tools themselves, which is nice -- but almost pales in comparison to the massive 6x speedups we get from re-implementing key Picard tools in the pipeline, like MarkDuplicates and SortSam (which becomes generic Sort in GATK4).
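To see why the long-running tools dominate the overall figure, here is a minimal sketch of how per-tool speedups combine into a pipeline-level speedup. The 2x and 6x factors are the averages quoted above; the per-tool runtimes in hours are purely hypothetical placeholders, not benchmark results, and the function name is made up for illustration.

```python
from typing import Mapping


def overall_speedup(old_hours: Mapping[str, float],
                    speedups: Mapping[str, float]) -> float:
    """Pipeline-level speedup when each tool gets its own speedup factor."""
    old_total = sum(old_hours.values())
    # New runtime of each tool is its old runtime divided by its speedup.
    new_total = sum(hours / speedups[tool] for tool, hours in old_hours.items())
    return old_total / new_total


# HYPOTHETICAL GATK3-era runtimes (hours), chosen only to illustrate the math.
old_hours = {
    "MarkDuplicates": 12.0,
    "SortSam": 12.0,
    "BaseRecalibrator": 3.0,
    "HaplotypeCaller": 3.0,
}

# Speedup factors taken from the averages quoted in the post
# (assumption: the average applies to each tool individually).
speedups = {
    "MarkDuplicates": 6.0,
    "SortSam": 6.0,
    "BaseRecalibrator": 2.0,
    "HaplotypeCaller": 2.0,
}

pipeline_speedup = overall_speedup(old_hours, speedups)
print(f"{pipeline_speedup:.2f}x")  # 4.29x with the placeholder runtimes above
```

With these made-up runtimes, the tools that get the 6x boost account for most of the original wall-clock time, so the overall figure lands much closer to 6x than to 2x. Shift the hours toward the 2x tools and the overall number drops accordingly.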

Speed isn't everything of course; with collaborators like Daniel MacArthur and his crazy data aggregation projects (no offense, we love you Daniel) we find ourselves constantly scrambling to scale to larger cohort sizes. The latest incarnation of this, gnomAD, involved joint-calling 15,000 whole genomes. Now you may be thinking, meh, ExAC was what, 120,000 exomes? Yeah but a whole genome is about 100x larger than an exome, so you do the math. It's a lot of data. Anyway, we couldn't have done it with the existing GATK3 tools. During testing, we ran into a wall at about 3,000 WGS samples, which took 6 weeks to run and totally maxed out Eric Banks' credit cards. So it's a good thing we have friends in smart places, specifically our Intel collaborators who built us a new datastore called GenomicsDB. It allows GATK4 to run the joint genotyping computations far more efficiently than we could do in the old framework, and it's what enabled us to sidestep the wall and scale to the 15,000 WGS samples of gnomAD. Oh, and it only took 2 weeks to run. That was a nice touch.
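For anyone who does want to do the math, here is a quick back-of-the-envelope sketch using only the round numbers quoted above. The "exome-equivalents" unit and the samples-per-week comparison are illustrative simplifications (the two runs were not controlled benchmarks), and the variable names are made up for this example.

```python
# Figures as quoted in the post (round numbers, not exact cohort sizes).
EXAC_EXOMES = 120_000         # exomes in ExAC, as quoted
GNOMAD_GENOMES = 15_000       # whole genomes joint-called for gnomAD
GENOME_TO_EXOME_RATIO = 100   # "a whole genome is about 100x larger than an exome"

# Express the gnomAD genomes in exome-sized units to compare with ExAC.
gnomad_exome_equivalents = GNOMAD_GENOMES * GENOME_TO_EXOME_RATIO
scale_vs_exac = gnomad_exome_equivalents / EXAC_EXOMES

# Rough joint-calling throughput in samples per week.
gatk3_samples_per_week = 3_000 / 6    # the 3,000-sample wall took 6 weeks
gatk4_samples_per_week = 15_000 / 2   # the 15,000-sample run took 2 weeks
throughput_gain = gatk4_samples_per_week / gatk3_samples_per_week

print(f"{gnomad_exome_equivalents:,} exome-equivalents")  # 1,500,000 exome-equivalents
print(f"{scale_vs_exac:.1f}x the data of ExAC")           # 12.5x the data of ExAC
print(f"{throughput_gain:.0f}x samples per week")         # 15x samples per week
```

So by this crude measure the genome callset is an order of magnitude more data than the exome one, and the GenomicsDB-backed run got through roughly fifteen times as many samples per week as the GATK3 test did.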

Is it scope creep if it's on purpose?

In the GATK3 world, we have just one fully decked out Best Practices workflow (HaplotypeCaller GVCF for germline short variants) and maaaybe a couple of half-hearted wannabe-BPs (MuTect2-based and RNAseq) if you want to be generous and count them.

In GATK4, we still have our flagship HaplotypeCaller GVCF workflow of course, but we also have three new workflows that are either already Best Practices-grade (GATK CNV + ACNV for somatic CNVs and allelic CNVs) or getting there (MuTect2 for somatic short variants, much improved after the port, and GATK gCNV for germline CNVs which is brand new and still piping hot). Finally, coming down the pipe we also have tools and an eventual Best Practices workflow for Structural Variants. I guess this is what happens when you let your Pokemon get wet after midnight.

Share ALL THE THINGS (including workflows)

It's worth mentioning at this point that the GATK Best Practices have a dual nature. Their primary purpose is to be a “platonic ideal”: a set of guidelines that describe, for a particular use case like “discover germline SNPs and indels”, what processing/analysis should be applied to get the best possible results. But they also serve a secondary purpose, which is to describe in practice how you actually implement that workflow correctly as a pipeline. Are there any tweaks to be made depending on the data type? What can you parallelize, and how? From there, even trickier questions arise, like what kind of hardware you will need, how much memory, and so on. To assist the community in getting this stuff working, we're committing to sharing our own implementations as a reference, in the form of WDL scripts. We already have a few published in the WDL repository's scripts section, including the "raw reads to GVCF" single-sample workflow we use in our production pipeline to process the Broad's genomes.

We're now getting ready to tackle the task of whipping our GATK4 Best Practices workflows into a publishable state (document ALL THE THINGS), with the goal of having them ready for public consumption when GATK4 itself goes into general release. We encourage you to adopt them if they fit your use case and needs. When everyone uses methods that are the same or similar enough (i.e. functionally equivalent), that makes analysis results more comparable across studies, and as fellow humans, we all stand to benefit from that.

So that concludes our whirlwind tour of GATK4 highlights. There are various things I didn't cover, including what exactly is happening with Picard (we'll cover that separately, and soon) and whether there are changes at the command line level. To address the latter (spoiler: yes, yes there are), we are developing a comprehensive migration guide to help ease the transition. Make no mistake, this is a big change. But it's a change for the better at every level that matters.

hyeonlee17

Hi Geraldine, I am very excited to test and use your improved version of GATK. I've tested the speed improvements on MarkDuplicates and BaseRecalibrator in the GATK4 beta version, but HaplotypeCaller was slower than in the previous version, GATK3. Could you let me know the test environment and options that you used for the baseline benchmarking? Even though it is a beta release, I would like to get a rough check of the speed improvement.

Sheila

@hyeonlee17 Hi, Sorry for the delay. I will confirm with the team and get back to you. What was the exact command you ran? -Sheila

Geraldine_VdAuwera

Hi @hyeonlee17, iirc this was done on Intel hardware with the latest GKL optimizations in GATK4. I don't have the exact specs -- we'll have thoroughly detailed benchmarks available when we move to general release (September timeframe) but until then I'm afraid we don't have the bandwidth to provide any in-depth info.

pepe

Hi, Thanks for the great work you're doing developing GATK and also for now releasing it as open source. We will be updating our infrastructure in the near future, and I'm looking at different options. It seems that AMD-based systems will become a cost-effective alternative to Intel. Intel's interest in helping optimize GATK to run well on its CPUs is understandable. However, do you know how general these optimizations are? Will Epyc CPUs be heavily penalized? Have you been able to test GATK4 on Epyc? Kind regards, Pär Larsson, Umeå University/Umeå University Hospital, Sweden

Geraldine_VdAuwera

Hi @pepe, the optimizations are specific to chip architectures as far as I understand, so AMD cpus would currently not gain any benefits. It's possible someone may contribute additional optimized libraries (there are some available from IBM for their POWER8 chipsets) but I'm not aware of anything being developed to that effect for AMD/Epyc.

sklages

Hi, so you are actually optimizing for Intel (only). I wouldn't mind that too much if tools like HaplotypeCaller scaled better and more stably on multi-CPU servers. That would be the main gain in speed for me, as many tools scale horribly on multiple CPUs. Do you already have some experience to share on scalability/multithreading? best, Sven

Geraldine_VdAuwera

@sklages To be clear the optimizations are contributed by Intel, so yes it’s specific to their chips. We welcome optimizations by other manufacturers of course — we just don’t have the expertise or bandwidth to do that ourselves. The expectation is that using Spark will enable gains for multi-cpu architectures in general; we have some benchmarking efforts in progress but aren’t at the stage where we can present results yet.
