GATK3 MuTect2 is in BETA state and will not be developed further. A newer and much-improved version of this tool is included in GATK4; we recommend you use that tool rather than this one. Accordingly, this document is provided as is and with no guarantees that everything will work correctly.
In this hands-on tutorial, we will call somatic mutations, both single nucleotide and indels, using GATK v3.7 MuTect2. The calling is done for a tumor sample against a matched normal sample, both of which are aligned to GRCh38, and uses a panel of normals to filter additional artifactual calls. We then subset the calls to indels to focus our manual review on a new feature of MuTect2--MuTect2 calls indels whereas MuTect1 does not.
MuTect1 is a somatic pileup caller that requires joint indel realignment of the tumor and normal sample. Because MuTect2 uses reassembly and pairHMM in calling, it is possible to skip indel realignment during preprocessing.
The tutorial gives a broad overview of the steps and touches lightly on some considerations to give a feel for what somatic variant calling entails. Because MuTect2 was under active development at the time of this writing, results from this tutorial pertain only to a specific version of MuTect2. Sections 1, 4, 5 and 6 are essential to the MuTect2 workflow and include instructions on how to create a panel of normals (PoN). Sections 2 and 3 are optional and collect metrics that measure cross-sample contamination and sequence-context related sequencing artifacts, respectively.
Originally, @shlee developed the tutorial and its data in January of 2017 for presentation at the 2017 Belgium workshop. The workshop folder is labeled 1702 in the GATK Workshops folder. The data preprocessing deviated from standard practices to match and enable the use of publicly available germline data for the panel of normals.
► For differences between GATK3 MuTect2 and GATK4 Mutect2, see Blog#10911. ► For a qualitatively different and pipelined workflow using both MuTect1 and MuTect2, see the FireCloud Mutation Calling workflow described in FireCloud Article#7512. The FireCloud tutorial uses the same sample data as this tutorial, HCC1143, but aligned to GRCh37. It takes the SNV calls from MuTect1 and the indel calls from MuTect2 for a composite somatic callset.
Download tutorial_9183.tar.gz, either from the ftp site or from the GoogleDrive. To access the ftp site, leave the password field blank. If the GoogleDrive link is broken, please let us know. The tutorial also requires the GRCh38 reference FASTA, dictionary, and index. These are available from the GATK Resource Bundle.
The data tarball contains 23 files plus a folder of 24 precomputed files. Within the file names, 1kg refers to the 1000 Genomes Project, pon to panel of normals, T to tumor and N to the tumor-matched normal. HCC1143 refers to a Hamon Cancer Center patient.
Example data are based on a breast cancer cell line and its matched normal cell line derived from blood. Both cell lines are consented and known as HCC1143 and HCC1143_BL, respectively. The Broad Cancer Genome Analysis (CGA) group has graciously provided 2x76 paired-end whole exome sequence data from the two cell lines (C835.HCC1143_2 and C835.HCC1143_BL.4), and we have reverted and aligned these to GRCh38 using alt-aware alignment and post-alt processing as described in Tutorial#8017. During preprocessing, we omitted the MergeBamAlignment step and included indel realignment, to match the toolchain of the PoN normals.
Tutorial example data are just that--examples we use to illustrate tool features. Data are inappropriate for deriving biological significance. Although we have aligned and post-alt processed reads to GRCh38, in this tutorial we focus only on results from the primary assembly.
The PoN allows additional filtering of calls, e.g. those that arise from technical artifacts. Therefore, it is important that the PoN consist of samples that are technically similar to and representative of the tumor samples, e.g. that were sequenced on the same platform using the same chemistry and analyzed using the same toolchain. For the tutorial, we have already created a PoN using forty 1000 Genomes Project exome samples aligned to GRCh38. We chose these randomly, downloaded CRAM files from the 1000 Genomes Project GRCh38 FTP site, converted to BAMs and indexed, and used the data directly with MuTect2 without further modification.
Within the tutorial bundle, the 1kg_40_m2pon_subset50k.vcf.gz and 1kg_40_m2pon_sitesonly_subset50k.vcf.gz files represent the 40-sample PoN that we will use in Section 5.
We emulate the PoN creation steps below but with small substitute data.
Run MuTect2 on each normal sample with the --artifact_detection_mode option to generate a per-normal-sample callset.
java -jar $GATK \
    -T MuTect2 \
    -I:tumor hcc1143_N_subset50k.bam \
    --dbsnp dbSNP142_GRCh38_subset50k.vcf.gz \
    --artifact_detection_mode \
    -o 1_normalforpon.vcf.gz \
    -L chr6:33,413,000-118,315,000 \
    -R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta
We include the dbSNP resource so as to annotate sites already present in the database. Here we go through the motions of this step using our matched normal. In the real world, you would perform this step on other germline samples that are NOT your tumor's matched normal.
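The per-normal step above is run once for every germline sample going into the PoN. A minimal sketch of how that might be scripted, using made-up file names (pon_normals/*.bam) and echoing rather than executing the MuTect2 commands:

```shell
# Hypothetical sketch: run artifact_detection_mode once per PoN normal BAM.
# Directory and BAM names are illustrative stand-ins, not tutorial files.
GATK="${GATK:-GenomeAnalysisTK.jar}"   # placeholder path to the GATK3 jar
mkdir -p pon_normals
touch pon_normals/germline1.bam pon_normals/germline2.bam
for bam in pon_normals/*.bam; do
  name=$(basename "$bam" .bam)
  # echo lets us preview each invocation without actually running GATK
  echo java -jar "$GATK" -T MuTect2 \
    -I:tumor "$bam" \
    --artifact_detection_mode \
    -o "${name}_forpon.vcf.gz" \
    -R Homo_sapiens_assembly38.fasta
done
```

Dropping the echo (and pointing at real BAMs and the real reference) would execute the per-normal calls.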
java -jar $GATK \
    -T CombineVariants \
    --arg_file inputs.list \
    -minN 2 \
    --setKey "null" \
    --filteredAreUncalled \
    --filteredrecordsmergetype KEEP_IF_ANY_UNFILTERED \
    -o 2_pon_combinevariants.vcf.gz \
    -R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta \
    --genotypemergeoption UNIQUIFY
We keep only the variant sites that appear in at least two samples, using the -minN 2 option. We omit filtered calls with the --filteredAreUncalled option, but ensure a site is kept if at least two samples pass it (-minN 2). Also, instead of supplying each of the forty VCFs with -V, we pass in an argument file ending in .list with the --arg_file option. This file lists all the input arguments on a single line, e.g. "-V file1.vcf -V file2.vcf …".
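As a sketch, such an argument file can be generated with a one-liner; the VCF names below are illustrative stand-ins for the forty per-normal callsets:

```shell
# Hypothetical sketch: build the --arg_file list from per-normal VCFs.
# File names are made up for illustration.
mkdir -p pon_vcfs
touch pon_vcfs/1_normalforpon.vcf.gz pon_vcfs/2_normalforpon.vcf.gz
# printf repeats its format once per glob match, yielding one line of -V args
printf -- '-V %s ' pon_vcfs/*.vcf.gz > inputs.list
cat inputs.list
```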
In the real world, you would NOT use the last option, --genotypemergeoption UNIQUIFY. Here it allows us to use the same normal VCF twice, as different samples, to illustrate this step with the data at hand. Without it, the command errors because the read group sample (RGSM) tags are identical for our two VCFs. This generates a PoN with 13 records.
In the command in section 1.2, we could have used the --minimalVCF option to generate a simplified sites-only VCF. However, in this tutorial we deviate from standard practice because we are in learning mode, and so we care to retain information when possible. Using the --minimalVCF option with CombineVariants is faster but produces a qualitatively different VCF that removes TLOD scores. Here we instead generate a sites-only VCF that retains TLOD scores by removing the sample columns with Picard's MakeSitesOnlyVcf.
java -jar $PICARD MakeSitesOnlyVcf \
    I=2_pon_combinevariants.vcf.gz \
    O=3_pon_siteonly.vcf.gz
The command also generates an index, 3_pon_siteonly.vcf.gz.tbi. The resulting records have only eight columns. Compare the before and after using:

gzcat 3_pon_siteonly.vcf.gz | grep -v '##' | less
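To make the eight-column point concrete, here is a toy illustration (not Picard itself) on a single fabricated record: sites-only means keeping the first eight fixed VCF columns and dropping FORMAT plus per-sample columns.

```shell
# Fabricated one-record table: 8 fixed VCF columns, then FORMAT + two samples.
printf 'chr6\t100\t.\tA\tT\t.\tPASS\tTLOD=6.3\tGT:AD\t0/1:20,5\t0/1:22,4\n' > full_record.tsv
# "Sites-only" keeps columns 1-8; INFO annotations such as TLOD survive.
cut -f1-8 full_record.tsv > sitesonly_record.tsv
awk -F'\t' '{ print NF " columns" }' sitesonly_record.tsv
```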
ContEst detects cross-sample contamination and, to some extent, sample swaps. Even though our sample data are from cultured cell lines, so we need not account for tumor purity and tumor heterogeneity, we still need to consider contamination from other human sources and whether the extent of contamination is prohibitive or within an acceptable range that allows us to proceed with analyzing a sample. We need to estimate contamination for both the tumor and the normal.
ContEst informs your downstream filtering. Consider tumor types in which you expect low purity. It would be good to know whether you have 0–2% or 2–5% contamination, as, depending on expected mutation rates, one result allows you to progress in analysis and the other requires planning manual review or even tossing the data.
For on-the-fly genotyping of the normal, ContEst will call any site with >80% of bases showing an alternate allele and at least 50x coverage as homozygous-variant. If your normal sample has lower coverage overall, you can alternatively provide ContEst with a genotyped VCF using the --genotypes option.
java -jar $GATK -T ContEst \
    -I:eval hcc1143_T_subset50k.bam \
    -I:genotype hcc1143_N_subset50k.bam \
    --popfile hapmap_3.3_grch38_pop_stratified_af_subset50k.vcf.gz \
    --interval_set_rule INTERSECTION \
    -o 4_T_contest.txt \
    -R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta
For the normal, use a genotyped VCF of the normal. Our genotypes are from HaplotypeCaller.
java -jar $GATK -T ContEst \
    -I:eval hcc1143_N_subset50k.bam \
    --genotypes hcc1143_N_haplotypecaller50k.vcf.gz \
    --popfile hapmap_3.3_grch38_pop_stratified_af_subset50k.vcf.gz \
    --interval_set_rule INTERSECTION \
    -o 5_N_contest.txt \
    -R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta
If there is insufficient data to make a meaningful estimate, the resulting file will contain a warning. Take a look at each result using cat. Of the eight columns, the fourth, labeled contamination, gives the percent contamination.
Converting 1.9% to a fraction gives us 0.019. So we know going forward that any calls with an alternate allele fraction below ~0.02 are suspect. For the tutorial, we omit the contamination estimate from the MuTect2 analysis.
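The thresholding logic above can be sketched as follows; the allele fractions are made up for illustration:

```shell
# A 1.9% ContEst estimate becomes a 0.019 allele-fraction floor: calls whose
# AF falls below it are suspect. The AF values here are fabricated examples.
contamination=0.019
for af in 0.010 0.018 0.250; do
  awk -v af="$af" -v c="$contamination" \
    'BEGIN { print af, (af + 0 < c + 0 ? "suspect" : "ok") }'
done
```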
Could this represent the fraction of a tumor subpopulation? If we estimate contamination for the normal sample using the tumor BAM to genotype on-the-fly, then ContEst estimates 42.2% contamination using 2,070 sites. Why does this happen?
The handling differs between versions of MuTect. MuTect1 incorporates the contamination estimate, given with the --fraction_contamination option, as a hard cutoff for calling in all samples. For the same parameter, v3.7 MuTect2 uses the fraction to downsample reads for each alternate allele: if a site has X coverage depth, MuTect2 removes X times the contamination fraction of reads for each alternate allele.
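A worked example of that downsampling arithmetic, with illustrative numbers:

```shell
# At a site with 150x depth and a contamination fraction of 0.02,
# v3.7 MuTect2 would remove 150 * 0.02 = 3 reads per alternate allele.
awk 'BEGIN { depth = 150; frac = 0.02; printf "%d reads removed per alt allele\n", depth * frac }'
```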
Picard's CollectSequencingArtifactMetrics provides an estimate of the extent of FFPE and OxoG artifacts, as well as other potential artifacts related to sequence context. The tool categorizes these as artifacts that occur before hybrid selection (preadapter) and those that occur during hybrid selection (baitbias). It provides a global view of the genome that empowers decision making in ways that site-specific analyses cannot. For example, the metrics can help you decide whether you should consider downstream filtering.
java -jar $PICARD CollectSequencingArtifactMetrics \
    I=hcc1143_T_subset50k.bam \
    O=6_T_artifact \
    R=~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta

java -jar $PICARD CollectSequencingArtifactMetrics \
    I=hcc1143_N_subset50k.bam \
    O=7_N_artifact \
    R=~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta
Of the multiple metrics files (five per sample), examine the pre_adapter_summary_metrics using cat. Here are the metrics for the full callsets, which show results similar to those of our small data.
The TOTAL_QSCORE is Phred-scaled, such that lower scores equate to a higher probability that the change is artifactual; e.g., 40 translates to a 1 in 10,000 probability. For OxoG, a rough cutoff for concern is 30. If scores fall below 30, plan to filter your somatic calls further.
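The Phred scaling can be checked directly with p = 10^(-Q/10), shown here for Q30 (the rough OxoG cutoff) and Q40 (the 1-in-10,000 example):

```shell
# Phred-to-probability conversion behind TOTAL_QSCORE: p = 10^(-Q/10).
for q in 30 40; do
  awk -v q="$q" 'BEGIN { printf "Q%d -> p = %g\n", q, 10 ^ (-q / 10) }'
done
```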
FFPE stands for formalin-fixed, paraffin-embedded. Formaldehyde deaminates cytosines and thereby results in C→T transition mutations.
For an explanation of OxoG, see its Dictionary entry. Briefly, oxidation of guanine to 8-oxoguanine results in G→T transversion mutations during library preparation. Another Picard tool, CollectOxoGMetrics, similarly gives Phred-scaled scores for the 16 three-base extended sequence contexts. For even more details on this artifact, see the 2013 Costello et al. article in Nucleic Acids Research. For SNVs in our MuTect2 callset, the FOXOG annotation refers to the fraction of oxoG. Downstream tools such as Broad CGA's D-ToxoG use this information. For more discussion, see Article#8771 and Article#8183.
Note MuTect2's -minPruning option. The default is set to two. For our exome data, where coverage occurs in mountainous peaks and falls off steeply around the edges, this means that at least two reads must support a path in the assembly graph for it to be considered, whether in the normal or the tumor. Thus, the depth of coverage at a site has implications for calling contaminant alleles as well as low-fraction alleles.
For the large dataset, we call on the entirety of the primary assembly regions, without an intervals file. For our tutorial, because calling is compute-intensive, we make somatic calls for a small region only. We provide all three resource files--dbSNP, COSMIC and the PoN. The PoN is made from forty 1000 Genomes Project exomes.
java -jar $GATK -T MuTect2 \
    -I:tumor hcc1143_T_subset50k.bam \
    -I:normal hcc1143_N_subset50k.bam \
    --dbsnp dbSNP142_GRCh38_subset50k.vcf.gz \
    --cosmic CosmicCodingMuts_grch38_subset50k.vcf.gz \
    --normal_panel 1kg_40_m2pon_sitesonly_subset50k.vcf.gz \
    --output_mode EMIT_VARIANTS_ONLY \
    -o 8_mutect2.vcf.gz \
    -L chr6:33,413,000-118,315,000 \
    -R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta
This produces a VCF with four records. Let's examine the callset with:

gzcat 8_mutect2.vcf.gz | grep -v '##'
We see three passing sites (PASS) and one filtered site (alt_allele_in_normal). Three of the sites are deletions and one is a SNV. Each of the tumor and normal samples has genotype calls in the format:
What does each of these metrics represent?
The callset contains both passing and filtered SNVs and indels. Let us subset out just indel calls for further scrutiny, using the full callset (n=4152).
java -jar $GATK -T SelectVariants \
    -V hcc1143_cloud.vcf.gz \
    -o 9_mutect2_indels.vcf.gz \
    -selectType INDEL \
    -R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta
This gives us 581 variant records. We count the passing calls with:
gzcat 9_mutect2_indels.vcf.gz | grep -v '#' | grep 'PASS' | wc -l
How many somatic indel calls do we have? How many bases is the largest indel?
The table lists the nine filters MuTect2 emits to the FILTER column. It also counts how many times each filter occurs for indel calls, as well as how many times a filter is the sole reason an indel site does not pass.
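A sketch of how such tallies can be computed from the FILTER column; the filter strings below are fabricated examples, not the tutorial's actual counts, and the real input would be column 7 of 9_mutect2_indels.vcf.gz:

```shell
# Fabricated FILTER values; a record can carry several filters, ';'-separated.
cat > indel_filters.txt <<'EOF'
PASS
alt_allele_in_normal
alt_allele_in_normal;germline_risk
homologous_mapping_event
EOF
# Every occurrence of each filter:
tr ';' '\n' < indel_filters.txt | grep -v PASS | sort | uniq -c
# Records where a single filter is the sole reason the site does not pass:
grep -v PASS indel_filters.txt | grep -vc ';'
```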
Which filter(s) appear to have the greatest impact? Is this what you expect? What type of filtering do you trust the most? What types of calls do you think compel manual review?
Examine the passing records and pay attention to the AD and AF fields.
gzcat 9_mutect2_indels.vcf.gz | grep -v '#' | grep 'PASS' | less
What can you say about the sensitivity of the caller? For ~200Mb genome-wide coverage, how many germline SNPs and indels do you expect? Somatic SNVs and indels?
At first glance, we see additional calls in the GRCh38 callset compared to an older GRCh37 callset. However, the GRCh37 analysis restricted MuTect2 calling to target capture intervals, whereas our GRCh38 calling does not. If we confine the counts to the capture regions, we see MuTect2 gives a similar number of passing calls for both references (Table).
The above comparison is limited. How would you compare calling on GRCh37 versus on GRCh38? Given what you know about alt-aware alignment and post-alt processing, is there an additional step that this analysis is missing? Below are the six GRCh37 passing indel calls. Do they differ from the six GRCh38 passing indel calls? If so, how?
Examining the records, we find the GRCh38 calls differ qualitatively from those of GRCh37. This simplistic comparison is meant to show how deriving a good somatic callset involves comparing callsets, e.g. from different callers, manually reviewing passing and filtered calls and, as alluded to, additional filtering.
Manual review extends from deciphering call record annotations to the nitty-gritty of reviewing read alignments using a visualizer.
To review read alignments, we must generate alignments that reflect MuTect2's graph-assembly results. We call these bamouts, after the parameter we invoke. To demonstrate, we use a small interval.
java -jar $GATK -T MuTect2 \
    -I:tumor hcc1143_T_subset50k.bam \
    -I:normal hcc1143_N_subset50k.bam \
    --dbsnp dbSNP142_GRCh38_subset50k.vcf.gz \
    --cosmic CosmicCodingMuts_grch38_subset50k.vcf.gz \
    --normal_panel 1kg_40_m2pon_sitesonly_subset50k.vcf.gz \
    --disableOptimizations --dontTrimActiveRegions --forceActive \
    -L 8_mutect2.vcf.gz \
    -ip 1000 \
    -bamout 10_m2_bamout.bam \
    -R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta
Notice the special options --disableOptimizations, --dontTrimActiveRegions, and --forceActive (many thanks to @Sheila for these). To easily generate bamouts for regions surrounding call sites, we use the four-record VCF from step 4 with the -L option and add padding around the sites, e.g. 1000 bases before and after each site, with the -ip option.
Similarly, we have pre-generated a larger bamout for you using a VCF of the 76 solo-filtered indel sites and 22 passing indel sites.
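As a toy check of the interval-padding arithmetic used when generating bamouts (the coordinate borrows the tutorial's example locus, and -ip 1000 grows each site by 1000 bases on either side):

```shell
# Padded interval for a single call site: [pos - pad, pos + pad].
awk 'BEGIN { pos = 118314029; pad = 1000; printf "chr6:%d-%d\n", pos - pad, pos + pad }'
```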
In the remaining exercise, we will use data subset around the 76-solo-filtered and 22 passing indel sites. We will review data in IGV. First, load GRCh38 as the reference in IGV. Then load these files in order:
Go to the SLC35F1 locus at chr6:118314029 by typing the genomic coordinates into IGV. If these alignments seem hard to decipher, it is because we need to tweak some settings.
a. In the View>Preferences>Alignments panel, turn off downsampling and turn on the center line.
b. Center the track on the deletion you see by dragging the track so that the center line is at the leftmost edge of the deletion. You may have to scroll to find the deletion.
c. In the tracks-view, right-click on the bamout track and select the following:
- Group by sample
- Color by tag: HC
- Sort by base
What are the three grouped tracks for the bamout? How do you feel about this indel call?
Here are ten variant records to start.
For any of these sites, would you reverse a PASS or filter call? Why?
You can scroll through variant calls in IGV using keyboard shortcuts. Select the _98_m2indels.vcf.gz track and press F to jump forward to the next variant record or B to jump back to the previous one.
As you experienced, understanding sequencing technology and its artifact modes, as well as tools and their quirks, is crucial to discerning good calls from bad calls. This is all the more important in somatic calling due to the confounding factors of (i) tumor purity, (ii) tumor heterogeneity and (iii) the needle-in-a-haystack rate of somatic mutations compared to the inherent noisiness of sequencing artifacts. We have yet to develop machine-learning algorithms that can help us in this process.
What is the next step after you generate a carefully manicured somatic callset?
Take a look at the dSKY plot at https://figshare.com/articles/D_SKY_for_HCC1143/2056665. It shows somatic copy number alterations for our HCC1143 sample and its colorful results remind us that calling SNVs and indels is only one part of cancer genome analyses. We cover calling somatic copy number alterations in Tutorial#9143.
Many thanks to @Geraldine_VdAuwera for the workflow diagram at top.