A while back, I posted this article about work done by the Intel Bio Team to benchmark the speed and resource utilization of each step in the per-sample segment of the germline variation pipeline (from BWA to HaplotypeCaller; FASTQ to GVCF). They published their results as a white paper on the Intel Life Sciences website, which has a section dedicated to GATK (which makes us feel all warm and tingly).

Now the Intel team has published an updated version of the white paper here that extends the work, originally done on a WGS trio, to a cohort of 50 exomes and adds the joint analysis segment of the pipeline (GenotypeGVCFs to VQSR; GVCFs to filtered multisample VCF) for both datasets.

As before, the paper does a great job of showing where the performance bottlenecks are and where you can get the biggest speed increases by parallelizing execution.

My commentary from the previous post still applies pretty much equally to this updated version, except now we have performance profiles for GenotypeGVCFs and the VQSR tools as well, which I'll comment on briefly (using WGS but the profiles are similar for exomes).

The biggest takeaway here is that the runtime of GenotypeGVCFs scales down almost linearly with how widely you parallelize it, which is obviously great news if you're in a rush and you have access to lots of machines. But pay attention here to the meaning of "thread count" in the context of the paper! As a reminder, most of the parallelization it presents is achieved through scatter-gather (parallelizing over predetermined genomic intervals), not by multithreading using -nt and/or -nct. In our own production pipelines we don't use -nt/-nct multithreading at all, and in GATK4 we're abandoning them and replacing the functionality with Spark support wherever it makes sense. Why am I pointing this out here? Because we're finding that GenotypeGVCFs is especially difficult to parallelize through multithreading, due to the complexity of dealing with overlapping events across multiple samples (the occurrence of which increases with cohort size). In the recent GATK 3.7 release, we added some functionality to deal better with overlapping deletions -- and now we're getting reports that this breaks when multithreading is turned on (cue the poop emoji). The safest way to deal with this? Don't use multithreading with GenotypeGVCFs; use scatter-gather instead (ask me how in the comments).
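To make the scatter-gather idea concrete, here is a minimal sketch in Python that only builds the command lines; it does not run anything. It assumes GATK 3.x-style arguments (`-T`, `-R`, `-V`, `-L`, `-o`, and the `CatVariants` utility for the gather step), so check them against the tool documentation for your version. The file names, the `shard_name` helper, and the choice of whole contigs as intervals are all hypothetical, picked for illustration rather than taken from the white paper or from our production pipelines.

```python
from typing import List

# Hypothetical jar path; substitute your own.
GATK_JAR = "GenomeAnalysisTK.jar"


def shard_name(interval: str) -> str:
    """Per-interval output file name (hypothetical naming scheme)."""
    return "shard." + interval.replace(":", "_") + ".vcf"


def scatter_commands(reference: str, gvcfs: List[str],
                     intervals: List[str], jar: str = GATK_JAR) -> List[List[str]]:
    """Scatter step: one GenotypeGVCFs command per genomic interval.

    Every command sees all of the GVCFs but is restricted to one interval with -L.
    """
    commands = []
    for interval in intervals:
        cmd = ["java", "-jar", jar, "-T", "GenotypeGVCFs", "-R", reference]
        for gvcf in gvcfs:
            cmd += ["-V", gvcf]
        cmd += ["-L", interval, "-o", shard_name(interval)]
        commands.append(cmd)
    return commands


def gather_command(reference: str, intervals: List[str],
                   output: str, jar: str = GATK_JAR) -> List[str]:
    """Gather step: concatenate the per-interval VCFs in interval order."""
    cmd = ["java", "-cp", jar, "org.broadinstitute.gatk.tools.CatVariants",
           "-R", reference]
    for interval in intervals:
        cmd += ["-V", shard_name(interval)]
    cmd += ["-out", output, "-assumeSorted"]
    return cmd
```

Each command returned by `scatter_commands` is independent, so you can hand them to whatever job scheduler you have (or to `subprocess.run` on separate machines) and run `gather_command` once all of the shards have finished. The intervals need to be listed in reference order for the concatenation to produce a sorted VCF.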

Also, don't parallelize VQSR. Look at the graph; it's not worth it. VQSR needs to see all of the things most of the time.

Finally, I should add that having the exome numbers to compare to the WGS numbers is a big upgrade -- it really gives you a sense of the scale of the practical implications of choosing to work with one datatype versus the other. All other sciencey considerations being equal (which they're not, but let's pretend), the computational resource commitment is massively different. Which is hardly news to our Ops team that processed Daniel MacArthur's ludicrously large gnomAD dataset, let me tell you -- for reference, the final joint VCF on that was ~22TB for 20K genomes. That's a big part of why we run our whole genomes on the cloud. It's real Big Data, no hype needed.


ying_sheng_1 on 5 Jan 2017

Nice presentation. Is it possible to have these data for GATK4?

Geraldine_VdAuwera on 5 Jan 2017

We haven't done this for GATK4 yet but we plan to do it once we have the full pipeline validated in GATK4. Stay tuned for an upcoming announcement of our calendar for GATK4 release.

Gossie on 5 Jan 2017

Thanks for your presentation. I have two questions:

1. How to parallelize GenotypeGVCFs by scatter-gather? (I guess you want us to ask ;) )
2. In the *white paper*, it used the following arguments:

```
GenotypeGVCFs: Merges gVCF's to create a genotyped VCF:
-nt "NUMTHREADS" -R "GENOMEREF" -D "DBSN VCF" -V "input"
```

It used `-nt`, which is different from your presentation, so I am confused. Looking forward to your reply. Thank you!

Sheila on 5 Jan 2017

@Gossie Hi, I will ask Geraldine to get back to you soon. -Sheila

Gossie on 5 Jan 2017

Thank you @Sheila

Sheila on 5 Jan 2017

@Gossie Hi again, Another teammate jumped in. It seems "For the Joint Analysis pipeline, VariantRecalibrator is currently unable to conduct process level parallelism and a comparison between both thread and process level parallelism techniques for the rest of the tools showed no significant improvement in time. Thus, all the tools in the Joint Analysis portion uses GATK's integrated -nt argument to apply thread level parallelism." This is from the [white paper](https://www.intel.com/content/www/us/en/healthcare-it/solutions/documents/deploying-gatk-best-practices-paper.html). I hope that helps. -Sheila

Geraldine_VdAuwera on 5 Jan 2017

To clarify @Sheila's comment, we realized that there was indeed a contradiction between my blog post and the white paper. It seems I misunderstood a communication I had at the time with the authors of the white paper. We had discussed the steps earlier in the pipeline that are parallelized using scatter-gather, and I thought that applied to all steps, but the joint calling part (including GenotypeGVCFs) is in fact parallelized using -nt. I will amend my blog post accordingly. My apologies for the confusion, and thank you for pointing out this error!

Gossie on 5 Jan 2017

I see. Thanks for your replies. @Sheila @Geraldine_VdAuwera
