HaplotypeCaller --output-mode EMIT_ALL_SITES does not emit all sites
open | Created 2019-07-26 | Last updated 2019-09-11| Posted by tfenne | See in Github


HaplotypeCaller bug


Bug Report

Affected tool(s) or class(es)

HaplotypeCaller --output-mode EMIT_ALL_SITES

Affected version(s)

  • Latest public release version [version?]
  • Latest master branch as of [date of test?]

Description

I'm trying to generate a VCF (not a gVCF) that contains calls spanning all the sites in my regions. Each region is small, and is more or less equivalent to a single variant. Ideally I'd use GENOTYPE_GIVEN_ALLELES, but I don't know the alleles, and in some cases the variant location is approximate (e.g. somewhere in this 10bp window).

I've been trying to use HaplotypeCaller to produce a VCF that contains calls covering my entire set of regions, but nothing seems to work. I started with just --output-mode and eventually ended up with:

gatk HaplotypeCaller \
  -R ref.fasta \
  -L regions.interval_list \
  --disable-optimizations \
  --force-active \
  --output-mode EMIT_ALL_SITES \
  -I my.bam \
  -O my.vcf.gz

This does output considerably more records, including a lot of hom-ref records, but still nowhere near to the full set of bases within my regions. E.g. in one test this emits variants spanning 3,468bp which is way better than the ~120bp I get without those options, but nowhere near the 293,570bp with the regions I'm supplying.

It would be great if --output-mode EMIT_ALL_SITES did as the documentation described, but if that's not possible, then perhaps that mode should simply be removed?

Steps to reproduce

Try calling a BAM file with HaplotypeCaller with a 100-1000bp region with --output-mode EMIT_ALL_SITES.

Expected behavior

VCF should contain records spanning the entire input region.

Actual behavior

VCF contains a minority of sites from the region.


Return to top