By Laura Gauthier, lead GATK developer for germline short variant discovery

Q: What, there's a HaplotypeCaller paper?

A: Yes! We are super excited to announce the long-awaited release of The HaplotypeCaller Paper -- or rather, the preprint in bioRxiv. (Actually we announced it on Twitter a while back but we understand not everyone enjoys such an old-school way of keeping up with the news). Hopefully you’re as excited as we are, if not more so, but we understand that this probably raises a few questions for some of you, so we tried to address some of those below.

Q: Why did it take so long?!

A: Our mission is to develop the tools that get used by others to do groundbreaking scientific research. Benchmarking and validation are important parts of our prototyping and development cycle, but given that we’re not subject to the “publish or perish” culture of a research lab, submitting manuscripts presenting those results wasn’t a high priority for us.

Q: Are you going to submit it to a peer-reviewed journal?

A: Probably not.

Q: Why not?

A: Our main motivation for posting the HaplotypeCaller manuscript to bioRxiv was to provide something recent/reasonable to cite and to make more details of the methods public. Submitting to a peer-reviewed journal usually involves a lot of time working on revisions that we’d rather put towards working on further improvements to the tools.

Q: Is it still a preprint if it's never intended to go to print?

A: You tell us.

Q: What version of HaplotypeCaller does the paper describe?

A: The paper describes the GATK 3.4 version of the HaplotypeCaller (yes we started this a while back) but the HaplotypeCaller has not changed significantly in later 3.x versions so it's fair to say the paper covers up to version 3.8 completely.

Q: How do these results compare to GATK4?

A: At time of writing, the GATK4 version of HaplotypeCaller is still considered a beta version. The team is actively working on validating the GATK4 version to make sure that it’s guaranteed to be as good as or better than the GATK3 version described in the paper.

Q: How does the methodology compare to GATK4?

A: The GATK engine that parses the BAM and “shards” the data to pass to the tools has been rewritten for improved efficiency over GATK3, and the HaplotypeCaller code has been refactored for better organization and readability. So there's a lot that is different in terms of software implementation. However the algorithms and equations presented in the manuscript remain the same, so overall the paper's description of how the HaplotypeCaller operates also applies to the GATK4 beta version, and it is appropriate to use it as a citation for results derived from versions up to the current beta (4.beta.6).

Q: Does the release of this paper hint at a change in how the team prioritizes publication?

A: To some extent. The developers of the somatic variant caller Mutect2 and related tools have put in a lot of effort to prepare white papers on the methods involved (Mutect2 itself, the assembly process and the pairHMM algorithm), some of which are shared with the HaplotypeCaller. They hope to release a manuscript featuring Mutect2 somatic SNV and INDEL variant calling results in the near future. Additionally, the GATK development team as a whole aims to make more of our internal benchmarking and validation efforts more transparent and available to other tool developers; an effort that our colleague Yossi Farjoun kicked off in style in his blog post about the new "SynDip" benchmark last week.

Return to top

Wed 13 Dec 2017
Comment on this article

- Recent posts

- Upcoming events

See Events calendar for full list and dates

- Recent events

See Events calendar for full list and dates

- Follow us on Twitter

GATK Dev Team


Our very own Andrey Smirnov is delighted to answer questions about the #GATK4 pipeline for common and rare germline…
17 Oct 18
@cpj3131 Depending on the tool you're using, some filters get applied that may modify what ends up getting counted.…
15 Oct 18
@cpj3131 It's a variant context annotation that counts the number of A, C, G, and T bases across all samples, in th…
13 Oct 18
Big day for #GATK users on AWS -- this means you can now run our #OpenWDL Best Practices workflows a lot more easil…
1 Oct 18
@PoisonEcology @macmanes Hmm yeah that sounds potentially problematic, will take a look on Monday.
29 Sep 18

- Our favorite tweets from others

#ASHG18 VA: call with GATK @gatk_dev. Look for pathogenic / likely pathogenic. leverage ClinVar.
17 Oct 18
If you think your fascination with #GATK hit the roof wait until you meet @gatk_dev team! Has been a wonderful week…
21 Sep 18
@xdopazo @gatk_dev @ClinicalBioinfo @FProgresoysalud @INB_Official @CIBERER @jpflorido Thank you very much for such amazing time!
21 Sep 18
Workshop "From reads to disease variants". Big thanks to @gatk_dev staff for sharing #GATK4 variant calling apps' e…
21 Sep 18

See more of our favorite tweets...