(How to) Call somatic copy number variants using GATK4 CNVGATK 4 Beta | Created 2017-03-08 | Last updated 2018-05-16

A more recent CNV tutorial using v4.0.1.1 has been posted in two parts elsewhere at:

The first part mostly recapitulates the workflow on this page, while the second part adds detection of allelic ratios. Although the v4.0.1.1 tutorial is under review as of May 2, 2018, we recommend you update to the official workflow, especially if performing CNV analyses on WGS data. The official workflow has algorithmic improvements to the GATK4.beta workflow illustrated here.

This demonstrative tutorial provides instructions and example data to detect somatic copy number variation (CNV) using a panel of normals (PoN). The workflow is optimized for Illumina short-read whole exome sequencing (WES) data. It is not suitable for whole genome sequencing (WGS) data nor for germline calling.

The tutorial recapitulates the GATK demonstration given at the 2016 ASHG meeting in Vancouver, Canada, for a beta version of the CNV workflow. Because we are still actively developing the CNV tools (writing as of March 2017), the underlying algorithms and current workflow options, e.g. syntax, may change. However, the presented basic approach and general concepts will still be germaine. Please check the forum for updates.

Many thanks to Samuel Lee (@slee) for developing the example data, data figures and discussion that set the backbone of this tutorial.

► For a similar example workflow that pertains to earlier releases of GATK4, see Article#6791. ► For the mathematics behind the workflow, see this whitepaper.

Different data types come with their own caveats. WGS, while providing even coverage that enables better CNV detection, is costly. SNP arrays, while the standard for CNV detection, may not be part of an analysis protocol. Being able to resolve CNVs from WES, which additionally introduces artifacts and variance in the target capture step, requires sophisticated denoising.

• This tutorial uses a beta version of the CNV workflow tools within the GATK4 gatk-protected-1.0.0.0-alpha1.2.3 pre-release (Version:0288cff-SNAPSHOT from September 2016). We previously made the program jar specially available alongside the data bundle in the workshops directory here. The original worksheets are in the 1610 folder. However, the data bundle was only available to workshop attendees. Note other tools in this program release may be unsuitable for analyses.

The example data is whole exome capture sequence data for chromosomes 1–7 of matched normal and tumor samples aligned to GRCh37. Because the data is from real cancer patients, we have anonymized them at multiple levels. The anonymization process preserves the noise inherent in real samples. The data is representative of Illumina sequencing technology from 2011.

• R (install from https://www.r-project.org/) and certain R components. After installing R, install the components with the following command.

Rscript install_R_packages.R 

We include install_R_packages.R in the tutorial data bundle. Alternatively, download it from here.

• XQuartz for optional section 5. Your system may already have this installed.
• The tutorial does not require reference files. The optional plotting step that uses the PlotSegmentedCopyRatio tool plots against GRCh37 and should NOT be used for other reference assemblies.

1. Collect proportional coverage using target intervals and read data using CalculateTargetCoverage

In this step, we collect proportional coverage using target intervals and read data. We have actually pre-computed this for you and we provide the command here for reference.

We process each BAM, whether normal or tumor. The tool collects coverage per read group at each target and divides these counts by the total number of reads per sample.

java -jar gatk4.jar CalculateTargetCoverage \
-I <input_bam_file> \
-T <input_target_tsv> \
-transform PCOV \
-groupBy SAMPLE \
-targetInfo FULL \
–keepdups \
-O <output_pcov_file>
• The target file -T is a padded intervals list of the baited regions. You can add padding to a target list using the GATK4 PadTargets tool. For our example data, padding each exome target 250bp on either side increases sensitivity.
• Setting the -targetInfo option to FULL keeps the original target names from the target list.
• The –keepdups option asks the tool to include alignments flagged as duplicate.

The top plot shows the raw proportional coverage for our tumor sample for chromosomes 1–7. Each dot represents a target. The y-axis plots proportional coverage and the x-axis targets. The middle plot shows the data after a median-normalization and log2-transformation. The bottom plot shows the tumor data after normalization against its matched-normal.

For each of these progressions, how certain are you that there are copy-number events? How many copy-number variants are you certain of? What is contributing to your uncertainty?

2. Create the CNV PoN using CombineReadCounts and CreatePanelOfNormals

In this step, we use two commands to create the CNV panel of normals (PoN).

The normals should represent the same sequencing technology, e.g. sample preparation and capture target kit, as that of the tumor samples under scrutiny. The PoN is meant to encapsulate sequencing noise and may also capture common germline variants. Like any control, you should think carefully about what sample set would make an effective panel. At the least, the PoN should consist of ten normal samples that were ideally subject to the same batch effects as that of the tumor sample, e.g. from the same sequencing center. Our current recommendation is 40 or more normal samples. Depending on the coverage depth of samples, adjust the number.

What is better, tissue-matched normals or blood normals of tumor samples? What makes a better background control, a matched normal sample or a panel of normals?

The first step combines the proportional read counts from the multiple normal samples into a single file. The -inputList parameter takes a file listing the relative file paths, one sample per line, of the proportional coverage data of the normals.

java -jar gatk4.jar CombineReadCounts \
-inputList normals.txt \
-O sandbox/combined-normals.tsv

The second step creates a single CNV PoN file. The PoN stores information such as the median proportional coverage per target across the panel and projections of systematic noise calculated with PCA (principal component analysis). Our tutorial’s PoN is built with 39 normal blood samples from cancer patients from the same cohort (not suffering from blood cancers).

java -jar gatk4.jar CreatePanelOfNormals \
-I sandbox/combined-normals.tsv \
-O sandbox/normals.pon \
-noQC \
--disableSpark \
--minimumTargetFactorPercentileThreshold 5 

This results in two files, the CNV PoN and a target_weights.txt file that typical workflows can ignore. Because we have a small number of normals, we include the -noQC option and change the --minimumTargetFactorPercentileThreshold to 5%.

Based on what you know about PCA, what do you think are the effects of using more normal samples? A panel with some profiles that are outliers?

3. Normalize a raw proportional coverage profile against the PoN using NormalizeSomaticReadCounts

In this step, we normalize the raw proportional coverage (PCOV) profile using the PoN. Specifically, we normalize the tumor coverage against the PoN’s target medians and against the principal components of the PoN.

java -jar gatk4.jar NormalizeSomaticReadCounts \
-I cov/tumor.tsv \
-PON sandbox/normals.pon \
-PTN sandbox/tumor.ptn.tsv \
-TN sandbox/tumor.tn.tsv

This produces the pre-tangent-normalized file -PTN and the tangent-normalized file -TN, respectively. Resulting data is log2-transformed.

Denoising with a PoN is critical for calling copy-number variants from WES coverage profiles. It can also improve calls from WGS profiles that are typically more evenly distributed and subject to less noise. Furthermore, denoising with a PoN can greatly impact results for (i) samples that have more noise, e.g. those with lower coverage, lower purity or higher activity, (ii) samples lacking a matched normal and (iii) detection of smaller events that span only a few targets.

4. Segment the normalized coverage profile using PerformSegmentation

Here we segment the normalized coverage profile. Segmentation groups contiguous targets with the same copy ratio.

java -jar gatk4.jar PerformSegmentation \
-TN sandbox/tumor.tn.tsv \
-O sandbox/tumor.seg \
-LOG

For our tumor sample, we reduce the ~73K individual targets to 14 segments. The -LOG parameter tells the tool that the input coverages are log2-transformed.

View the resulting file with cat sandbox/tumor.seg.

Which chromosomes have events?

☞ I get an error at this step!

This command will error if you have not installed R and certain R components. Take a few minutes to install R from https://www.r-project.org/. Then install the components with the following command.

Rscript install_R_packages.R 

We include install_R_packages.R in the tutorial data bundle. Alternatively, download it from here.

5. (Optional) Plot segmented coverage using PlotSegmentedCopyRatio

This is an optional step that plots segmented coverage.

This command requires XQuartz installation. If you do not have this dependency, then view the results in the precomputed_results folder instead. Currently plotting only supports human assembly b37 autosomes. Going forward, this tool will accommodate other references and the workflow will support calling on sex chromosomes.

java -jar gatk4.jar PlotSegmentedCopyRatio \
-TN sandbox/tumor.tn.tsv \
-PTN sandbox/tumor.ptn.tsv \
-S sandbox/tumor.seg \
-O sandbox \
-pre tumor \
-LOG

The -O defines the output directory, and the -pre defines the basename of the files. Again, the -LOGparameter tells the tool that the inputs are log2- transformed. The output folder contains seven files--three PNG images and four text files.

• Before_After.png (shown above) plots copy-ratios pre (top) and post (bottom) tangent-normalization across the chromosomes. The plot automatically adjusts the y-axis to show all available data points. Dotted lines represent centromeres.
• Before_After_CR_Lim_4.png shows the same but fixes the y-axis range from 0 to 4 for comparability across samples.
• FullGenome.png colors differential copy-ratio segments in alternating blue and orange. The horizontal line plots the segment mean. Again the y-axis ranges from 0 to 4.

Open each of these images. How many copy-number variants do you see?

☞ What is the QC value?

Each of the four text files contain a single quality control (QC) value. This value is the median of absolute differences (MAD) in copy-ratios of adjacent targets. Its calculation is robust to actual copy-number variants and should decrease post tangent-normalization.

• preQc.txt gives the QC value before tangent-normalization.
• postQc.txt gives the post-tangent-normalization QC value.
• dQc.txt gives the difference between pre and post QC values.
• scaled_dQc.txt gives the fraction difference (preQc - postQc)/(preQc).

6. Call segmented copy number variants using CallSegments

In this final step, we call segmented copy number variants. The tool makes one of three calls for each segment--neutral (0), deletion (-) or amplification (+). These deleted or amplified segments could represent somatic events.

java -jar gatk4.jar CallSegments \
-TN sandbox/tumor.tn.tsv \
-S sandbox/tumor.seg \
-O sandbox/tumor.called

View the results with cat sandbox/tumor.called.

Besides the last column, how is this result different from that of step 4?

7. Discussion of interest to some

☞ Why can't I use just a matched normal?

Let’s compare results from the raw coverage (top), from normalizing using the matched-normal only (middle) and from normalizing using the PoN (bottom).

What is different between the plots? Look closely.

The matched-normal normalization appears to perform well. However, its noisiness brings uncertainty to any call that would be made, even if visible by eye. Furthermore, its level of noise obscures detection of the 4th variant that the PoN normalization reveals.

☞How do the results compare to SNP6 analyses?

As with any algorithmic analysis, it’s good to confirm results with orthogonal methods. If we compare calls from the original unscrambled tumor data against GISTIC SNP6 array analysis of the same sample, we similarly find three deletions and a single large amplification.