HaplotypeCaller --output-mode EMIT_ALL_SITES
I'm trying to generate a VCF (not a gVCF) that contains calls spanning all the sites in my regions. Each region is small, and is more or less equivalent to a single variant. Ideally I'd use GENOTYPE_GIVEN_ALLELES, but I don't know the alleles, and in some cases the variant location is approximate (e.g. somewhere in this 10bp window).
I've been trying to use HaplotypeCaller to produce a VCF that contains calls covering my entire set of regions, but nothing seems to work. I started with just --output-mode and eventually ended up with:
gatk HaplotypeCaller \
    -R ref.fasta \
    -L regions.interval_list \
    --disable-optimizations \
    --force-active \
    --output-mode EMIT_ALL_SITES \
    -I my.bam \
    -O my.vcf.gz
This does output considerably more records, including a lot of hom-ref records, but still nowhere near the full set of bases within my regions. E.g. in one test this emits records spanning 3,468bp, which is way better than the ~120bp I get without those options, but far short of the 293,570bp covered by the regions I'm supplying.
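For anyone wanting to reproduce the comparison, here is a rough Python sketch of how bp totals like these could be counted. It is not necessarily how the numbers above were produced. It assumes a Picard-style interval list (tab-separated, 1-based inclusive, `@` header lines) and that each VCF record spans POS through POS + len(REF) - 1, or through `END=` in INFO when present. The function names are made up for illustration.

```python
import gzip
from collections import defaultdict


def interval_list_spans(lines):
    """Yield (contig, start, end), 1-based inclusive, from interval_list lines."""
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and the SAM-style header
        contig, start, end = line.split("\t")[:3]
        yield contig, int(start), int(end)


def vcf_spans(lines):
    """Yield (contig, start, end), 1-based inclusive, for each VCF record."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines and the column header
        fields = line.rstrip("\n").split("\t")
        contig, pos, ref, info = fields[0], int(fields[1]), fields[3], fields[7]
        end = pos + len(ref) - 1  # assumption: the record covers its REF allele
        for entry in info.split(";"):
            if entry.startswith("END="):
                end = int(entry[4:])  # block-style records carry an explicit END
        yield contig, pos, end


def covered_bases(spans):
    """Count distinct bases covered, merging overlapping or adjacent spans."""
    by_contig = defaultdict(list)
    for contig, start, end in spans:
        by_contig[contig].append((start, end))
    total = 0
    for ivs in by_contig.values():
        ivs.sort()
        cur_start, cur_end = ivs[0]
        for start, end in ivs[1:]:
            if start <= cur_end + 1:
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
    return total
```

Usage would be along the lines of `covered_bases(interval_list_spans(open("regions.interval_list")))` versus `covered_bases(vcf_spans(gzip.open("my.vcf.gz", "rt")))`. Note that the sketch does not check that VCF records fall inside the intervals.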
It would be great if --output-mode EMIT_ALL_SITES did as the documentation described, but if that's not possible, then perhaps that mode should simply be removed?
Try calling a BAM file with HaplotypeCaller with a 100-1000bp region with
VCF should contain records spanning the entire input region.
VCF contains a minority of sites from the region.