HaplotypeCaller emitting wrong calls when using --max-mnp-distance
open | Created 2019-02-20 | Last updated 2019-04-04| Posted by nh13 | See in Github


HaplotypeCaller bug


Bug Report

Affected tool(s) or class(es)

  • HaplotypeCaller with the arguments --max-mnp-distance 5.

Affected version(s)

  • gatk version 4.0.12.0

Description

For a genomes in a bottle reference sample, I have observed a genotype discordance as follows:

ref: CA
alleles: CG and TG

Callset Genotype
GIAB 1/2
Test 0/2

As you can see, the test dataset improperly calls a reference allele (CA) when it should have called the first alternate instead.

The GIAB callset has two phased records:

GIAB callset
chr2    241815307    .    C    T    50    PASS    platforms=3;platformnames=Illumina,CG,10X;datasets=4;datasetnames=HiSeqPE100x,CGnormal,10XChromium,HiSeqMatePair;callsets=6;callsetnames=HiSeqPE100xSentieon,CGnormal,HiSeqPE100xfreebayes,10XSentieonhaplo,HiSeqMatePairSentieon,HiSeqMatePairfreebayes;callable=CS_HiSeqPE100xSentieon_callable,CS_CGnormal_callable,CS_HiSeqPE100xfreebayes_callable,CS_10XSentieonhaplo_callable;filt=CS_CGnormal_filt    GT:PS:DP:ADALL:AD:GQ    0|1:241815307_C_T:286:53,44:47,41:1082
chr2    241815308    .    A    G    50    PASS    platforms=3;platformnames=Illumina,CG,10X;datasets=4;datasetnames=HiSeqPE100x,CGnormal,10XChromium,HiSeqMatePair;callsets=6;callsetnames=HiSeqPE100xSentieon,CGnormal,HiSeqPE100xfreebayes,10XSentieonhaplo,HiSeqMatePairSentieon,HiSeqMatePairfreebayes;callable=CS_HiSeqPE100xSentieon_callable,CS_CGnormal_callable,CS_HiSeqPE100xfreebayes_callable,CS_10XSentieonhaplo_callable;filt=CS_CGnormal_filt    GT:PS:DP:ADALL:AD:GQ    1|1:241815307_C_T:287:0,98:0,89:1

while the test callset has the following single record:

Test callset
chr2    241815307   .   CA  TG  1756.77 PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-0.802;ClippingRankSum=0.521;DP=85;ExcessHet=3.0103;FS=2.902;MLEAC=1;MLEAF=0.5;MQ=60.0;MQRankSum=0.0;QD=20.67;ReadPosRankSum=-1.858;SOR=0.571 GT:AD:DP:F1R2:F2R1:GQ:PL    0/1:37,48:85:16,25,0:21,23,0:99:1785,0,1406

Next, I looked at the reads from the haplotype assembly BAM supporting each of the three possible alleles in the test callset:

Allele # of reads
ref (CA) 1 read
alt1 (CG) 36 reads
alt 2 (TG) 48 reads

You can see the sum of the count of reads in the table is equal to the depth and sum of allelic depths in the VCF record. It is clear from the haplotype assembly BAM that the call should be 1/2 (CG/TG), but the finall call is 0/2 (CA/TG).

I then re-ran HaplotypeCaller without the --max-mnp-distance 5 argument and got the variant call I expected, which matches the GIAB variant call:

Test callset without `--max-mnp-distance 5`
chr2    241815307       .       C       T,<NON_REF>     1717.77 .       BaseQRankSum=-1.137;ClippingRankSum=1.185;DP=83;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQandDP=298800,83;ReadPosRankSum=-1.967 GT:AD:DP:F1R2:F2R1:GQ:PGT:PID:PL:PS:SB  0|1:36,47,0:83:15,24,0:21,23,0:99:0|1:241815307_C_T:1746,0,1375,1855,1516,3371:241815307:15,21,24,23
chr2    241815308       .       A       G,<NON_REF>     3429.77 .       BaseQRankSum=0.293;ClippingRankSum=0.601;DP=83;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;MQRankSum=0.000;RAW_MQandDP=298800,83;ReadPosRankSum=-1.420   GT:AD:DP:F1R2:F2R1:GQ:PGT:PID:PL:PS:SB  1|1:1,82,0:83:1,38,0:0,44,0:99:0|1:241815307_C_T:3458,206,0,3461,246,3500:241815307:1,0,38,44

I have a work-around where I can post-GATK join phased variants into an MNP, but it looks like there is a bug in this version of HaplotypeCaller.


Return to top