(How to) Call somatic mutations using GATK4 Mutect2
Tutorials | Created 2019-05-24 | Last updated 2019-07-18

Comments (41)

This tutorial is applicable to Mutect2 version and higher. Post suggestions/questions in the Ask the GATK team section.

GATK contains several major advances in the Mutect2 pipeline, which is good, but we have had to change command lines in a few places, which we usually try to avoid. Here are the major changes.

Bug Fixes

We fixed several bugs with error messages about invalid log probabilities, infinities, NaNs etc that were introduced by neglecting to account for finite precision errors. We also resolved an issue where CalculateContamination worked poorly on very small gene panels.

New Required Inputs to FilterMutectCalls

1) FilterMutectCalls now requires a reference, which should be the same fasta file input to Mutect2. 2) FilterMutectCalls also requires a stats file from Mutect2. When you run Mutect2 with output -O unfiltered.vcf, for example, it produces a file called unfiltered.vcf.stats. FilterMutectCalls automatically looks for this file, so if you are running locally no change is needed. That is,

gatk Mutect2 -R ref.fasta -I tumor.bam -O unfiltered.vcf

followed by

gatk FilterMutectCalls -R ref.fasta -V unfiltered.vcf -O filtered.vcf

works because Mutect2 creates unfiltered.vcf.stats behind the scenes and FilterMutectCalls knows to look for it. However, if you are running on a cluster or the cloud you need to keep track of the stats file. For example, you need to delocalize it from a VM, as is done in the Mutect2 WDL. You can explicitly input the stats file with the -stats argument in FilterMutectCalls. If you are scattering Mutect2 over multiple nodes you must merge the stats files with MergeMutectStats:

gatk MergeMutectStats -stats unfiltered1.vcf.stats -stats unfiltered2.vcf.stats -O merged.stats

and pass merged.stats to FilterMutectCalls.

New Filtering Strategy in FilterMutectCalls

FilterMutectCalls now filters based on a single quantity, the probability that a variant is a somatic mutation. Previously, different reasons why this might not be the case each had their own thresholds. We have removed parameters such as -normal-artifact-lod, -max-germline-posterior, -max-strand-artifact-probability, and -max-contamination-probability. Even the previously fundamental -tumor-lod is gone. Rather than replace these with a single threshold for the error probability, FilterMutectCalls replaces them with nothing at all. It automatically determines the threshold that optimizes the "F score", the harmonic mean of sensitivity and precision. In order to tweak results in favor of more sensitivity users may set -f-score-beta to a value greater than its default of 1 (beta is the relative weight of sensitivity versus precision in the harmonic mean). Setting it lower biases results toward greater precision.

You can think of these changes as doing two things. Unifying filtering thresholds puts all the filters at the same place on a ROC curve of sensitivity vs precision. Previously, one threshold might be sacrificing a lot of sensitivity for a small gain in precision while another might be doing the opposite, the result being poor sensitivity and precision that fell below the potential ROC curve. Once we’re on that ROC curve, we achieve a good balance between sensitivity and precision by optimizing the F score.

New Somatic Clustering Model in FilterMutectCalls

We had long suspected that modeling the spectrum of subclonal allele fractions would help distinguish somatic variants from errors. For example, if every somatic variant in a tumor were a het occurring in 40% of cells, we would know to reject anything with an allele fraction significantly different from 20%. In the Bayesian framework of Mutect2 this means that the likelihood for somatic variants is given by a binomial distribution. We account for an unknown number of subclones with a Dirichlet process binomial mixture model. This is still an oversimplification because CNVs, small subclones, and genetic drift of passenger mutations all contribute allele fractions that don’t match a few discrete values. Therefore, we include a couple of beta-binomials in the mixture to account for a background spread of allele fractions while still benefiting from clustering. Finally, we use these binomial and beta-binomial likelihoods to refine the tumor log odds calculated by Mutect2, which assume a uniform distribution of allele fractions.

In addition to clustering allele fractions we also learn the overall prior probabilities of somatic SNVs and indels so that we can be more skeptical of apparent variants in a quiet tumor, for example. We learn the parameters of this model with a stochastic EM approach, where the E steps consist of Chinese Restaurant Process sampling of the allele fraction clusters. In case you were wondering, we have tested this new approach on some of the off-label, non-cancer uses of Mutect2, such as mitochondria, and it works very well. FilterMutectCalls reports the learned parameters of somatic clustering in a new .filtering_stats.tsv output. This file also contains information on the probability threshold chosen to optimize the F score and the number of false positives and false negatives from each filter to be expected from this choice.

A step-by-step guide to the new Mutect2 Read Orientation Artifacts Workflow

If you suspect any of your samples of substitution errors that occur on a single strand before sequencing you should definitely use Mutect2's orientation bias filter. This applies to all FFPE tumor samples and samples sequenced on Illumina Novaseq machines, among others. In fact, with the optimizations in you can run the filter even when you're not suspicious. It won't hurt accuracy and the CPU cost is now quite small.

There are three steps to the filter. First, run Mutect2 with the --f1r2-tar-gz argument. This creates an output with raw data used to learn the orientation bias model. Previously this was done by CollectF1R2Counts. By absorbing it into Mutect2, we eliminated the cost of CollectF1R2Counts with almost no effect on the runtime of Mutect2. When multiple tumor samples are specified, you only need a single --f1r2-tar-gz output, which contains data for each tumor sample.

gatk Mutect2 -R ref.fasta \
   -L intervals.interval_list \
   -I tumor.bam \
   -germline-resource af-only-gnomad.vcf \
   -pon panel_of_normals.vcf   \
   --f1r2-tar-gz f1r2.tar.gz \
   -O unfiltered.vcf

Next, pass this raw data to LearnReadOrientationModel:

gatk LearnReadOrientationModel -I f1r2.tar.gz -O read-orientation-model.tar.gz

Run GetPileupSummaries to summarize read support for a set number of known variant sites.

gatk GetPileupSummaries \
    -I tumor.bam \
    -V chr17_small_exac_common_3_grch38.vcf.gz \
    -L chr17_small_exac_common_3_grch38.vcf.gz \
    -O getpileupsummaries.table

Estimate contamination with CalculateContamination.

gatk CalculateContamination \
    -I getpileupsummaries.table \
    -tumor-segmentation segments.table \
    -O calculatecontamination.table

Finally, pass the learned read orientation model to FilterMutectCallswith the -ob-priors argument:

gatk FilterMutectCalls -V unfiltered.vcf \
   [--tumor-segmentation segments.table] \
   [--contamination-table contamination.table] \
   --ob-priors read-orientation-model.tar.gz \
   -O filtered.vcf

Advanced note: if you are scattering Mutect2 over nodes in a cluster or on the cloud, you must input the --f1r2-tar-gz output from each Mutect2 scatter to LearnReadOrientationModel. This is done automatically in the Mutect2 wdl in the gatk repo and on Terra. For example:

for chromosome in {1..22}; do
gatk Mutect2 -R ref.fasta -I tumor.bam -L $chromosome --f1r2-tar-gz ${chromosome}-f1r2.tar.gz -O ${chromosome}-unfiltered.vcf``
all_f1r2_input=`for chromosome in {1..22}; do printf -- "-I ${chromosome}-f1r2.tar.gz "; done`
LearnReadOrientationModel $all_f1_r2_input -O read-orientation-model.tar.gz

A step-by-step guide to the new Mutect2 Panel of Normals Workflow

We rewrote CreateSomaticPanelOfNormals to use GenomicsDB for scalability. We also added the INFO fields FRACTION (the fraction of normal samples with an artifact) and BETA (the shape parameters of the beta distribution of artifact allele fractions among samples with the artifact) to the panel of normals. FilterMutectCalls doesn’t use these yet, but we hope to experiment with them in the near future. Furthermore, we never liked how germline variants, which are handled in a more principled way with our germline filter, ended up as hard-filtered pon sites, so the panel of normals workflow now optionally excludes germline events from its output, keeping only technical artifacts.

The three steps to create a panel of normals are:

Step 1: Run Mutect2 in tumor-only mode for each normal sample:

gatk Mutect2 -R reference.fasta -I normal1.bam   --max-mnp-distance 0 -O normal1.vcf.gz
gatk Mutect2 -R reference.fasta -I normal2.bam   --max-mnp-distance 0 -O normal2.vcf.gz

Step 2: Create a GenomicsDB from the normal Mutect2 calls:

gatk GenomicsDBImport -R reference.fasta -L intervals.interval_list \
  --genomicsdb-workspace-path pon_db \
  -V normal1.vcf.gz \
  -V normal2.vcf.gz \
  -V normal3.vcf.gz

Step 3: Combine the normal calls using CreateSomaticPanelOfNormals:

gatk CreateSomaticPanelOfNormals -R reference.fasta \
  --germline-resource af-only-gnomad.vcf.gz \
  -V gendb://pon_db \
  -O pon.vcf.gz

In the third step, the AF-only gnomAD resource is the same public GATK resource used by Mutect2 (when calling tumor samples, but not in Step 1, above).

Return to top Comment on this article