To better understand what somatic calling entails, we contrast it with germline calling. We delve into the technical details of what is the same and what is different between Mutect2 and HaplotypeCaller, a somatic caller and a germline caller, respectively. We also provide some historical context that explains some quirks of somatic calling.
Operationally, Mutect2 works similarly to HaplotypeCaller in that they share the active region-based processing, assembly-based haplotype reconstruction and pairHMM alignment of reads to haplotypes. However, they use fundamentally different models for estimating variant likelihoods and genotypes. The HaplotypeCaller model uses ploidy in its genotype likelihood calculations. The Mutect2 model does not.
The main difference is that HaplotypeCaller is designed to call germline variants, while Mutect2 is designed to call somatic variants. Neither is appropriate for the other use case.
Germline variants are straightforward. They vary against the reference. Germline calling typically assumes a fixed ploidy and includes genotyping sites. HaplotypeCaller allows setting a ploidy other than diploid with the -ploidy argument. HaplotypeCaller can call germline variants on one or multiple samples, and the tool can use evidence of variation across the samples to increase confidence in a variant call. For this discussion, it is noteworthy that HaplotypeCaller does not necessarily rely on a balance in the alleles in genotyping, e.g. it can call what may be considered a low allele fraction alternate allele as part of a heterozygous genotype. Furthermore, if the number of alleles at a site surpasses the ploidy assumption, then HaplotypeCaller's reference confidence mode (-ERC GVCF) may detect and call these alleles and their respective AD allele depths, even if the GT genotype call uses only a subset of the alleles to fit the ploidy assumption.
Somatic variants contrast between two samples against the reference. What do we mean by somatic? The Greek word soma refers to parts of an organism other than the reproductive cells. For example, our skin cells are soma-tic and accumulate mutations from sun exposure that presumably our seed or germ cells are protected from. In this example, variants in skin cells that are not variant in the blood cells are somatic.
Mutect2 works primarily by contrasting the presence or absence of evidence for variation between two samples, the tumor and matched normal, from the same individual. The tool can run on unmatched tumors but this produces high rates of false positives. Technically speaking, somatic variants are both (i) different from the control sample and (ii) different from the reference. What this means is that if a site is variant in the control but in the somatic sample reverts to the reference allele, then it is not a somatic variant.
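The two-part definition above can be written as a small predicate. This is a toy sketch on hard allele sets, not how Mutect2 evaluates evidence; the function name and inputs are hypothetical, and the sketch compares alleles only, without any of the read-level modeling a real caller performs.

```python
def somatic_alleles(ref, tumor_alleles, normal_alleles):
    """Return tumor alleles that differ from both the reference and the matched normal.

    Toy illustration only: real callers weigh read evidence probabilistically.
    """
    return {a for a in tumor_alleles if a != ref and a not in normal_alleles}


# Tumor gains a T that the normal lacks: a somatic candidate.
print(somatic_alleles("C", {"C", "T"}, {"C"}))
# Normal is variant, tumor shows only the reference allele: a reversion, not somatic.
print(somatic_alleles("C", {"C"}, {"C", "T"}))
# Allele shared with the normal: not somatic.
print(somatic_alleles("C", {"C", "T"}, {"C", "T"}))
```

Only the first case returns a non-empty set, which matches the point that a reversion to the reference allele does not count as a somatic variant.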
--af-of-alleles-not-in-resource defines the germline variant prior, which Mutect2 uses in likelihood calculations of a variant being germline.
--log-somatic-prior defines the somatic variant prior, which Mutect2 uses in likelihood calculations of a variant being somatic.
--normal-lod defines the filter threshold for variants in the tumor not being in the normal, i.e. the germline risk factor.
--tumor-lod-to-emit defines the cutoff for a tumor variant to appear in a callset.
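To make the roles of the two LOD thresholds concrete, here is a deliberately simplified decision rule. It is not Mutect2's model: the class, the function, and the numeric values are placeholders chosen for illustration, and the sketch assumes that a higher normal LOD means stronger evidence that the allele is absent from the normal.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    # Placeholder values for illustration, not recommended settings.
    tumor_lod_to_emit: float = 3.0
    normal_lod: float = 2.2


def classify(tumor_lod, normal_lod, t=Thresholds()):
    """Toy rule: emit on tumor evidence first, then filter on normal evidence.

    Assumption: a higher normal LOD means the allele is less likely to be in the normal.
    """
    if tumor_lod < t.tumor_lod_to_emit:
        return "not emitted"
    if normal_lod < t.normal_lod:
        return "emitted, filtered"
    return "emitted, PASS"


print(classify(tumor_lod=1.0, normal_lod=5.0))
print(classify(tumor_lod=6.0, normal_lod=1.0))
print(classify(tumor_lod=6.0, normal_lod=5.0))
```

The ordering mirrors the distinction in the list: one threshold decides whether a candidate appears in the callset at all, the other decides whether an emitted candidate is flagged as a germline risk.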
Somatic calling is NOT a simple subtraction of control variant alleles from case sample variant alleles. The reason for this stems from the original intent for somatic callsets in cancer research.
Somatic callers reflect these two preferences in their stringent filtering, either upfront such that a variant call is not emitted or downstream such that a site is annotated in the FILTER column with the filter name.
A somatic caller should detect low fraction alleles, make no explicit ploidy assumption and omit genotyping in the traditional sense. Mutect2 adheres to all of these criteria. A number of cancer sample characteristics necessitate such caller features. First, biopsied tumor samples are commonly contaminated with normal cells, and the normal fraction can be much higher than the tumor fraction of a sample. Second, a tumor can be heterogeneous in its mutations. Third, these mutations not uncommonly include aneuploid events that change the copy number of a cell's genome in patchwork fashion.
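A quick calculation shows why normal contamination and copy number changes push somatic alleles toward low fractions. The sketch below uses a common back-of-the-envelope formula rather than anything taken from Mutect2; the function and parameter names are hypothetical, and it assumes reads sample tumor and normal DNA in proportion to cell fraction times copy number, with the mutation present in every tumor cell.

```python
def expected_allele_fraction(purity, mutated_copies=1, tumor_copy_number=2,
                             normal_copy_number=2):
    """Expected fraction of reads carrying a somatic allele in a mixed sample.

    purity: fraction of cells in the sample that are tumor cells (0 to 1).
    Assumes the mutation is clonal, i.e. present in every tumor cell.
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    total = purity * tumor_copy_number + (1.0 - purity) * normal_copy_number
    return purity * mutated_copies / total


# 20% tumor cells, heterozygous mutation, copy-neutral locus.
print(expected_allele_fraction(0.2))
# Same purity, locus amplified to 4 copies with 1 mutated copy.
print(expected_allele_fraction(0.2, mutated_copies=1, tumor_copy_number=4))
```

Under these assumptions a heterozygous mutation in a sample with 20% tumor cells is expected at an allele fraction of about 0.1, and amplification of the unmutated copies lowers it further. A mutation carried by only a subclone of the tumor would sit lower still.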
A variant allele in the case sample is not called if the site is variant in controls. We explain an exception for GATK4 Mutect2 in a bit.
Historically, somatic callers have called somatic variants at the site level. That is, if a variant site in the case is also variant in the matched control or in a population resource, e.g. dbSNP, it is discounted from the somatic callset even if the variant allele differs from that in the control or resource. This practice stems in part from cancer study designs where the control normal sample is sequenced at much lower depth than the case tumor sample. Because of the assumption that mutations strike randomly, cancer geneticists view mutations at sites of common germline variation with skepticism. Remember that for humans, common germline variant sites occur on average roughly once in a thousand reference bases. So if a commonly variant site accrues additional mutations, we must weigh the chance of it having arisen from a true somatic event against it being something else that will likely not add value to downstream analyses. For most sites and typical analyses, the latter is the case. The variant is unlikely to have arisen from a somatic event and more likely to be some artifact or germline variant, e.g. from mapping or cross-sample contamination.
GATK4 Mutect2 still applies this practice in part. The tool discounts variant sites shared with the panel of normals or with a matched normal control's unambiguously variant site. If the matched normal's variant allele is supported by few reads, at low allele fraction, then the tool accounts for the possibility of the site not being a germline variant.
When it comes to the population germline resource, GATK4 Mutect2 distinguishes between the variant alleles in the germline resource and the case sample. That is, Mutect2 will call a variant site somatic if the allele differs from that in the germline resource. Blog#10911 explains this in a bit more detail and explains how Mutect2 factors germline variant allele frequencies in calling.
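The difference between the historical site-level practice and an allele-aware comparison can be sketched as two lookups. This is a simplification with hypothetical names and data: Mutect2 factors germline allele frequencies into a probabilistic model rather than performing a hard set lookup.

```python
def discount_site_level(site, known_sites):
    """Historical practice: discount any call at a position that is variant in the resource."""
    return site in known_sites


def discount_allele_level(site, alt, known_alleles):
    """Allele-aware practice: discount only if the same alternate allele is in the resource."""
    return alt in known_alleles.get(site, set())


# Hypothetical resource: chr1:1000 carries a known C>T germline variant.
resource = {("chr1", 1000): {"T"}}
# The case sample shows C>G at the same position.
site, alt = ("chr1", 1000), "G"

print(discount_site_level(site, resource))         # True
print(discount_allele_level(site, alt, resource))  # False
```

The same call is discarded under the site-level rule and retained under the allele-level rule, which is the distinction drawn here for the population germline resource.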
Somatic workflows filter case sites with multiple variant alleles. By a similar logic to that outlined above, and with the assumption that common variant sites are biallelic, any site that presents multiple variant alleles in the case sample is suspect. Mutect2 still calls such sites and the contrasting variant alleles; however, in the next step of the workflow, FilterMutectCalls filters such sites with the multiallelic filter. It is possible a multiallelic site in the case sample represents a somatic event, but it is more likely the site is a germline variant site or an artifactual site.
The panel of normals helps filter systematic artifacts of sequencing. Artifacts are seeming variants in the read data that are in fact false positives. Sequencing technology's artifacts are not all random. Some artifacts come from sample preparation and present in specific sequence contexts. Other artifacts come from mapping. These artifacts often appear like low allele fraction somatic mutations. When somatic callsets are gathered in a cohort, these artifacts can present a strong signal, as they occur systematically in some fraction of samples. To remove such false signals, Mutect2 filters sites present in a given panel of normals (specified with -pon). Typically, a PoN is constructed with germline normal samples. First, calls are made using the same sensitivity as that used in somatic calling, i.e. with Mutect2. Second, the multiple normal samples are gathered into a cohort. Finally, the panel retains sites called in two or more samples. GATK4's CreateSomaticPanelOfNormals performs these latter two steps. Use of a PoN constructed from germline normals has the added benefit of filtering common germline variant sites. This is especially useful for somatic analysis of species that lack a common germline variant resource.
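The gather-and-retain logic of panel construction can be sketched in a few lines. This is an illustration of the idea only, not the implementation of CreateSomaticPanelOfNormals; the function names and the (contig, position) site representation are assumptions made for the example.

```python
from collections import Counter


def build_panel(per_sample_sites, min_samples=2):
    """Keep sites called in at least `min_samples` normal samples.

    per_sample_sites: one collection of (contig, position) sites per normal sample.
    """
    counts = Counter()
    for sites in per_sample_sites:
        counts.update(set(sites))  # count each site at most once per sample
    return {site for site, n in counts.items() if n >= min_samples}


def filter_calls(tumor_sites, panel):
    """Split tumor calls into those kept and those filtered by the panel."""
    kept = [s for s in tumor_sites if s not in panel]
    filtered = [s for s in tumor_sites if s in panel]
    return kept, filtered


# Hypothetical calls made on four normal samples.
normals = [
    [("chr1", 100), ("chr2", 500)],
    [("chr1", 100)],
    [("chr3", 42), ("chr2", 500)],
    [("chr4", 7)],
]
panel = build_panel(normals)
print(sorted(panel))
print(filter_calls([("chr1", 100), ("chr5", 9)], panel))
```

Sites seen in a single normal are left out of the panel, so a one-off call in one normal sample does not by itself mask that position in tumor callsets.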