ReferenceConfidenceModel using the wrong break condition in calcNIndelInformativeReads()
open | Created 2019-02-05 | Last updated 2019-02-05 | Posted by jamesemery

HaplotypeCaller bug

While validating #5607, I noticed that the method isReadInformativeOfIndelOfSize() begins with the following breakout condition:

 if( read.getLength() - readStart < maxIndelSize || refBases.length - refStart < maxIndelSize ) {
    return false;
 }

This says that if readStart is too close to read.getLength(), the method breaks out without calculating the informativeness of the read. Unfortunately, readStart isn't an index into the read's bases; it's actually the "IGV view" offset for the read, generated by the pileup for a particular reference position. The length that actually matters to us is AlignmentUtils.getBasesAlignedOneToOne(read).length, which is computed later when we realign the read bases to the reference. This means that if a read happens to contain a long deletion, we end up prematurely marking the read bases as non-informative, despite there being more than enough bases to work with when doing computations. Furthermore, since we realign the read bases later in the codepath, the bases in the gap between the realigned length and read.getLength() are still used to compute mismatch likelihoods for bases before that point in the read.

An example of this issue: I have a read with the cigar "77M10D24M". At position 92 of the read (the IGV offset, so in reality the 5th base into the last element of the cigar), the code returns false due to this condition, because read.getLength() is 101 (77 + 24) and read.getLength() - readStart is only 9. In reality, AlignmentUtils.getBasesAlignedOneToOne(read).length - readStart is 19, so the read is still comparable, since there are more than 10 bases left in the read to test.
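
To make the arithmetic in this example concrete, here is a minimal, standalone sketch. It does not use GATK; the class CigarLengthSketch and all of its methods are hypothetical helpers written for illustration only. It assumes that read.getLength() counts the M/I/S/=/X elements of the cigar, that the one-to-one aligned array has one entry per M/D/=/X position, and that maxIndelSize is 10. Only the read half of the breakout condition is modeled; the refBases half is left out.

```java
// Hypothetical illustration only: none of these names exist in GATK.
final class CigarLengthSketch {

    /** Sums the lengths of the cigar elements whose operator appears in ops. */
    static int sumLengths(final String cigar, final String ops) {
        int total = 0;
        int length = 0;
        for (final char c : cigar.toCharArray()) {
            if (Character.isDigit(c)) {
                length = length * 10 + (c - '0');   // accumulate the element length
            } else {
                if (ops.indexOf(c) >= 0) {
                    total += length;                // operator counts toward this sum
                }
                length = 0;                         // start the next element
            }
        }
        return total;
    }

    /** Number of bases stored in the read (operators that consume read bases). */
    static int readLength(final String cigar) {
        return sumLengths(cigar, "MIS=X");
    }

    /** Assumed length of the one-to-one aligned array (M, D, =, X positions). */
    static int alignedOneToOneLength(final String cigar) {
        return sumLengths(cigar, "MD=X");
    }

    /** Read half of the existing breakout condition. */
    static boolean currentCheckRejects(final String cigar, final int readStart, final int maxIndelSize) {
        return readLength(cigar) - readStart < maxIndelSize;
    }

    /** Same check, but measured against the one-to-one aligned length. */
    static boolean alignedCheckRejects(final String cigar, final int readStart, final int maxIndelSize) {
        return alignedOneToOneLength(cigar) - readStart < maxIndelSize;
    }

    public static void main(final String[] args) {
        final String cigar = "77M10D24M";
        final int readStart = 92;      // pileup ("IGV view") offset from the example
        final int maxIndelSize = 10;   // assumed value

        System.out.println(readLength(cigar));                                    // 101
        System.out.println(alignedOneToOneLength(cigar));                         // 111
        System.out.println(currentCheckRejects(cigar, readStart, maxIndelSize));  // true  (101 - 92 = 9)
        System.out.println(alignedCheckRejects(cigar, readStart, maxIndelSize));  // false (111 - 92 = 19)
    }
}
```

Under these assumptions, the existing check rejects the read at offset 92 while the check against the aligned length does not, which matches the behavior described above.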

I have duplicated this behavior in #5607; perhaps it would be easiest to get that branch in first before tackling this issue, just so validation for that refactor is easier.
