Identify somatic short variants (SNVs and Indels) in one or more tumor samples from a single individual, with or without a matched normal sample.
|Somatic short variants tumor-normal pair||T-N BAMs to VCF||universal||yes||b37|
|Somatic short variants PON creation||Normal BAMs to PON||universal||yes||b37|
This workflow requires BAM files for each input tumor and normal sample. Input BAMs should be pre-processed as described in the GATK Best Practices for data pre-processing.
There are two main steps to this workflow. First we generate a large set of candidate somatic variants, then we filter them to obtain a more confident set of somatic variant calls.
Like HaplotypeCaller, Mutect2 calls SNVs and indels simultaneously via local de-novo assembly of haplotypes in an active region. That is, when Mutect2 encounters a region showing signs of somatic variation, it discards the existing mapping information and completely reassembles the reads in that region in order to generate candidate variant haplotypes. Like HaplotypeCaller, Mutect2 then aligns each read to each haplotype via the Pair-HMM algorithm to obtain a matrix of likelihoods. Finally, it applies a Bayesian somatic likelihoods model to obtain the log odds for alleles to be somatic variants versus sequencing errors.
This step emits an estimate of the fraction of reads due to cross-sample contamination for each tumor sample and an estimate of the allelic copy number segmentation of each tumor sample. Unlike other contamination tools, CalculateContamination is designed to work well without a matched normal even in samples with significant copy number variation and makes no assumptions about the number of contaminating samples.
This tool uses an optional F1R2 counts output of Mutect2 to learn the parameters of a model for orientation bias. It finds prior probabilities of single-stranded substitution errors prior to sequencing for each trinucleotide context. This is extremely important for FFPE tumor samples.
Mutect2’s somatic likelihoods model assumes that read errors are independent, so that, for example, four reads each with an error probability of 1/1000 yield a log odds of roughly 1000^4 in favor of being a real variant versus a sequencing error. FilterMutectCalls accounts for correlated errors, that is, the possibility that all variant reads at a site were due to some common source of error. It accomplishes this through several hard filters to detect alignment artifacts and probabilistic models for strand and orientation bias artifacts, polymerase slippage artifacts, germline variants, and contamination. Additionally, it learns a Bayesian model for the overall SNV and indel mutation rate and allele fraction spectrum of the tumor to refine the log odds emitted by Mutect2. It then automatically sets a filtering threshold to optimize the F score, the harmonic mean of sensitivity and precision.
At this step we run tools to add information to the discovered variants in our dataset. One of those tools, Funcotator, can be used to add gene-level information to each variant. Funcotator is a functional annotation tool in the core GATK toolset and was designed to handle both somatic and germline use cases. Funcotator reads in a VCF file, labels each variant with one of twenty-three distinct variant classifications, produces gene information (e.g. affected gene, predicted variant amino acid sequence, etc.), and associations to information in datasources. Supported datasources include GENCODE (gene information and protein change prediction), dbSNP, gnomAD, and COSMIC (among others). The corpus of datasources is extensible and user-configurable and includes cloud-based datasources supported with Google Cloud Storage. Funcotator produces either a Variant Call Format (VCF) file (with annotations in the INFO field) or a Mutation Annotation Format (MAF) file.