This document describes the reference confidence model applied by HaplotypeCaller to generate a per-sample GVCF, invoked by
-ERC GVCF or
As explained here, HaplotypeCaller works by assembling the reads to create potential haplotypes, realigning the reads to their most likely haplotypes, and then projecting these reads back onto the reference sequence via their haplotypes to compute alignments of the reads to the reference. At that point, we can calculate the likelihoods of each possible genotype and emit variant calls.
What that article does not explain is how HaplotypeCaller additionally estimates the chance that some (unknown) non-reference allele is segregating at this position by examining the realigned reads that span the reference base. At this base we perform two calculations:
Based on this, we emit the genotype likelihoods (
PL) and compute the
GQ (from the
PLs) for the least confidence of these two models. We use a symbolic ALT allele,
<NON_REF>, to hold the likelihood that the site is not homozygous reference, as well as allele-specific
PL field values.
We do this at all sites in the territory covered by the analysis, including homozygous-reference sites, both inside and outside the ActiveRegions determined by HaplotypeCaller.