Bug in HaplotypeCaller: lack of consistency in the overlap criteria applied to reads to be considered for PL and AD/DP annotations.
open | Created 2018-11-19 | Last updated 2018-12-14| Posted by vruano | See in Github


HaplotypeCaller bug


Instructions

Initially reported by a user on the forum... Aparently some variants with non-zero quals have 0 AD and DPs. Other annotations are also missing from the INFO columns.

After some debugging it turns out that the criteria to determine whether a read should be considered for a variant in terms of alignment overlap are different for taking part of PL calculation and AD/DP calculation.

Where is not totally clear what is the best way to go in practice. It seems to me that we should be consistent here and both PL and AD/DP should use the same criterion. The offending code lines:

HaplotypeCallerGenotypingEngine.java ln171:

ReadLikelihoods<Allele> readAlleleLikelihoods = readLikelihoods.marginalize(alleleMapper, 
         new SimpleInterval(mergedVC).expandWithinContig(ALLELE_EXTENSION, header.getSequenceDictionary()));
            if (configuration.isSampleContaminationPresent()) {
                readAlleleLikelihoods.contaminationDownsampling(configuration.getSampleContamination());
            }

The code above decides the involvement in PL calculations. Notice that ALLELE_EXTENSION is set to 2.

For the AD/DP and so on the code responsible is in AssemblyBasedCallerGenotypingEngine.java ln366:

    // Otherwise (else part) we need to do it again.
        if (configuration.useFilteredReadMapForAnnotations || !configuration.isSampleContaminationPresent()) {
            readAlleleLikelihoodsForAnnotations = readAlleleLikelihoodsForGenotyping;
            readAlleleLikelihoodsForAnnotations.filterToOnlyOverlappingReads(loc);
        } else {
            readAlleleLikelihoodsForAnnotations = readHaplotypeLikelihoods.marginalize(alleleMapper, loc);
            if (emitReferenceConfidence) {
                readAlleleLikelihoodsForAnnotations.addNonReferenceAllele(Allele.NON_REF_ALLELE);
            }
        }

The filterToOnlyOverlappingReads(loc) is called then the overlap criterion is strict. (e.g. 0bp padding). This is also the case for the marginalize call if the conditional is false as the loc passed has not been padded.

It seems to me that setting the ALLELE_EXTENSION == 2 is a very deliberative action (so it was done for a reason) and perhaps this is the way to go... but in deed if the read really does not overlap the variant should be considered at all.

This come from a more complex discussion whether the in cases whether variants are totally linked in the assembly graph we should consider reads supporting another variant alleles as supporting this other variant linked allele or not. I think that user found it a bit strange that this would be the case and perhaps this is the reason why we are doing this read filtering in the first place.


Return to top