Document is currently under review and in
BETA. It is incomplete and may contain inaccuracies. Expect changes to the content.
This workflow is broken into two tutorials. You are currently on the second part. See Tutorial#11682 for the first part.
For this second part, at the heart is segmentation, performed by ModelSegments. In segmentation, contiguous copy ratios are grouped together into segments. The tool performs segmentation for both copy ratios and for allelic copy ratios, given allelic counts. The segmentation is informed by both types of data, i.e. the tool uses allelic data to refine copy ratio segmentation and vice versa. The tutorial refers to this multi-data approach as joint segmentation. The presented commands showcase full features of tools. It is possible to perform segmentation for each data type independently, i.e. based solely on copy ratios or based solely on allelic counts.
The tutorial illustrates the workflow using a paired sample set. Specifically, detection of allelic copy ratios uses a matched control, i.e. the HCC1143 tumor sample is analyzed using a control, the HCC1143 blood normal. It is possible to run the workflow without a matched-control. See section 8.1 for considerations in interpreting allelic copy ratio results for different modes and for different purities.
The GATK4 CNV workflow offers a multitude of levers, e.g. towards fine-tuning analyses and towards controls. Researchers are expected to tune workflow parameters on samples with similar copy number profiles as their case sample under scrutiny. Refer to each tool's documentation for descriptions of parameters.
CollectAllelicCounts will tabulate counts of the reference allele and counts of the dominant alternate allele for each site in a given genomic intervals list. The tutorial performs this step for both the case sample, the HCC1143 tumor, and the matched-control, the HCC1143 blood normal. This allele-specific coverage collection is just that--raw coverage collection without any statistical inferences. In the next section, ModelSegments uses the allele counts towards estimating allelic copy ratios, which in turn the tool uses to refine segmentation.
Collect allele counts for the case and the matched-control alignments independently with the same intervals. For the matched-control analysis, the allelic count sites for the case and control must match exactly. Otherwise, ModelSegments, which takes the counts in the next step, will error. Here we use an intervals list that subsets gnomAD biallelic germline SNP sites to those within the padded, preprocessed exome target intervals .
The tutorial has already collected allele counts for full length sample BAMs. To demonstrate coverage collection, the following command uses the small BAMs originally made for Tutorial#11136 . The tutorial does not use the resulting files in subsequent steps.
Collect counts at germline variant sites for the matched-control
gatk --java-options "-Xmx3g" CollectAllelicCounts \ -L chr17_theta_snps.interval_list \ -I normal.bam \ -R /gatk/ref/Homo_sapiens_assembly38.fasta \ -O sandbox/hcc1143_N_clean.allelicCounts.tsv
Collect counts at the same sites for the case sample
gatk --java-options "-Xmx3g" CollectAllelicCounts \ -L chr17_theta_snps.interval_list \ -I tumor.bam \ -R /gatk/ref/Homo_sapiens_assembly38.fasta \ -O sandbox/hcc1143_T_clean.allelicCounts.tsv
This results in counts table files. Each data file has header lines that start with an
@ asperand symbol, e.g.
@RG lines, followed by a table of data with six columns. An example snippet is shown.
Comments on select parameters
-L. The intervals can be either a Picard-style intervals list or a VCF. See Article#1109 for descriptions of formats. The sites should represent sites of common and/or sample-specific germline variant SNPs-only sites. Omit indel-type and mixed-variant-type sites.
-R, and aligned reads, specified with
--minimum-base-qualityto twenty. As a result, the tool will include reads with MAPQ20 and above in the analysis .
Another GATK tool, GetPileupSummaries, similarly counts reference and alternate alleles. The resulting summaries are meant for use with CalculateContamination in estimating cross-sample contamination. GetPileupSummaries limits counts collections to those sites with population allele frequencies set by the parameters
--maximum-population-allele-frequency. Details are here.
CollectAllelicCounts employs fewer engine-level read filters than GetPileupSummaries. Of note, both tools use the MappingQualityReadFilter. However, each sets a different threshold with the filter. GetPileupSummaries uses a
--minimum-mapping-quality threshold of 50. In contrast, CollectAllelicCounts sets the
--minimum-mapping-quality parameter to 30. In addition, CollectAllelicCounts filters on base quality. The base quality threshold is set with the
--minimum-base-quality parameter, whose default is 20.
ModelSegments groups together copy and allelic ratios that it determines are contiguous on the same segment. A Gaussian-kernel binary-segmentation algorithm differentiates ModelSegments from a GATK4.beta tool, PerformSegmentation, which GATK4 ModelSegments replaces. The older tool used a CBS (circular binary-segmentation) algorithm. ModelSegment's kernel algorithm enables efficient segmentation of dense data, e.g. that of whole genome sequences. A discussion of preliminary algorithm performance is here.
The algorithm performs segmentation for both copy ratios and for allelic copy ratios jointly when given both datatypes together. For allelic copy ratios, ModelSegments uses only those sites it determines are heterozygous, either in the control in a paired analysis or in the case in a case-only analysis . In the paired analysis, the tool models allelic copy ratios in the case using sites for which the control is heterozygous. The workflow defines allelic copy ratios in terms of alternate-allele fraction, where total allele fractions for reference allele and alternate allele add to one for each site.
For the following command, be sure to specify an existing
--output directory or
. for the current directory.
gatk --java-options "-Xmx4g" ModelSegments \ --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \ --allelic-counts hcc1143_T_clean.allelicCounts.tsv \ --normal-allelic-counts hcc1143_N_clean.allelicCounts.tsv \ --output sandbox \ --output-prefix hcc1143_T_clean
This produces nine files each with the basename
hcc1143_T_clean in the current directory and listed below. The
param files contain global parameters for copy ratios (cr) and allele fractions (af), and the
seg files contain data on the segments. For either type of data, the tool gives data before and after segmentation smoothing. The tool documentation details what each file contains. The last two files, labeled
hets, contain the allelic counts for the control's heterogygous sites. Counts are for the matched control (
normal) and the case.
The tool has numerous adjustable parameters and these are described in the ModelSegments tool documentation. The tutorial uses the default values for all of the parameters. Adjusting parameters can change the resolution and smoothness of the segmentation results.
Comments on select parameters
--denoised-copy-ratios) or allelic-counts (
--allelic-counts) data. The matched-control allelic counts (
--normal-allelic-counts) is optional. If given both types of data, then copy ratios and allelic counts data together inform segmentation for both copy ratio and allelic segments. If given only one type of data, then segmentation is based solely on the given type of data.
--minimum-total-allele-countis set to 30 by default. This means the tool only considers sites with 30 or more read depth coverage for allelic copy ratios.
--genotyping-homozygous-log-ratio-thresholdoption is set to -10.0 by default. Increase this to increase the number of sites assumed to be heterozygous for modeling.
--maximum-number-of-smoothing-iterationsoption caps smoothing iterations to 25. MCMC model sampling is also set to 100, for both copy-ratio and allele-fraction sampling by the
--number-of-samples-allele-fractionoptions, respectively. Finally,
--number-of-smoothing-iterations-per-fitis set to zero by default to disable model refitting between iterations. What this means is that the tool will generate only two MCMC fits--an initial and a final fit.
--number-of-smoothing-iterations-per-fitto one gives 311 segments after seven smoothing iterations. Section 8 plots these alternative results.
Section 8 shows the results of segmentation, the result from changing
--number-of-smoothing-iterations-per-fit and the result of allelic segmentation modeled from allelic counts data alone. Section 8.1 details considerations depending on analysis approach and purity of samples. Section 8.2 shows the results of changing the advanced smoothing parameters given in .
ModelSegments runs in the following three stages.
hets.normal.tsv. For the same sites in the case, allelic counts are written to
modelBegin.seg, posterior summaries for copy-ratio global parameters to
modelBegin.cr.paramand allele-fraction global parameters to
modelFinal.seg, posterior summaries for copy-ratio global parameters to
modelFinal.cr.param, posterior summaries for allele-fraction global parameters to
modelFinal.af.paramand final copy-ratio segments to
At the second stage, the tutorial data generates the following message.
INFO MultidimensionalKernelSegmenter - Found 638 segments in 23 chromosomes.
At the third stage, the tutorial data generates the following message.
INFO MultidimensionalModeller - Final number of segments after smoothing: 398
For tutorial data, the initial number of segments before smoothing is 638 segments over 23 contigs. After smoothing with default parameters, the number of segments is 398 segments.
CallCopyRatioSegments allows for systematic calling of copy-neutral, amplified and deleted segments. This step is not required for plotting segmentation results. Provide the tool with the
cr.seg segmentation result from ModelSegments.
gatk CallCopyRatioSegments \ --input hcc1143_T_clean.cr.seg \ --output sandbox/hcc1143_T_clean.called.seg
called.seg data adds the sixth column to the provided copy ratio segmentation table. The tool denotes amplifications with a
+ plus sign, deletions with a
- minus sign and neutral segments with a
Here is a snippet of the results.
Comments on select parameters
--neutral-segment-copy-ratio-lower-bound(default 0.9) and
--neutral-segment-copy-ratio-upper-bound(default 1.1) together set the copy ratio range for copy-neutral segments. These two parameters replace the GATK4.beta workflow’s
PlotModeledSegments visualizes copy and allelic ratio segmentation results.
gatk PlotModeledSegments \ --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \ --allelic-counts hcc1143_T_clean.hets.tsv \ --segments hcc1143_T_clean.modelFinal.seg \ --sequence-dictionary Homo_sapiens_assembly38.dict \ --minimum-contig-length 46709983 \ --output sandbox/plots \ --output-prefix hcc1143_T_clean
This produces plots in the
plots folder. The plots represent final modeled segments for both copy ratios and alternate allele fractions. If we are curious about the extent of smoothing provided by MCMC, then we can similarly plot initial kernel segmentation results by substituting in
Comments on select parameters
--sequence-dictionarythat matches the GRCh38 reference used in mapping .
--minimum-contig-lengthfrom the default value of 1,000,000 to 46,709,983, the length of the smallest of GRCh38's primary assembly contigs.
As of this writing, it is NOT possible to subset plotting with genomic intervals, i.e. with the
-L parameter. To interactively visualize data, consider the following options.
Below are three sets of results for the HCC1143 tumor cell line in order of increasing smoothing. The top plot of each set shows the copy ratio segments. The bottom plot of each set shows the allele fraction segments.
8A. Initial segmentation before MCMC smoothing gives 638 segments.
8B. Default smoothing gives 398 segments.
8C. Enabling additional smoothing iterations per fit gives 311 segments. See section 6 for a description of the
Smoothing accounts for data points that are outliers. Some of these outliers could be artifactual and therefore not of interest, while others could be true copy number variation that would then be missed. To understand the impact of joint copy ratio and allelic counts segmentation, compare the results of 8B to the single-data segmentation results below. Each plot below shows the results of modeling segmentation on a single type of data, either copy-ratios or allelic counts, using default smoothing parameters.
8D. Copy ratio segmentation based on copy ratios alone gives 235 segments.
8E. Allelic segmentation result based on allelic counts alone in the matched case gives 105 segments.
Compare chr1 and chr2 segmentation for the various plots. In particular, pay attention to the p arm (left side) of chr1 and q arm (right side) of chr2. What do you think is happening when adjacent segments are slightly shifted from each other in some sets but then seemingly at the same copy ratio for other sets?
For allelic counts, ModelSegments retains 16,872 sites that are heterozygous in the control. Of these, the case presents 15,486 usable sites. In allelic segmentation using allelic counts alone, the tool uses all of the usable sites. In the matched-control scenario, ModelSegments emits the following message.
INFO MultidimensionalKernelSegmenter - Using first allelic-count site in each copy-ratio interval (12668 / 15486) for multidimensional segmentation...
The message informs us that for the matched-control scenario, ModelSegments uses the first allele-count site for each genomic interval towards allelic modeling. For the tutorial data, this is 12,668 out of the 15,486 or 81.8% of the usable allele-count sites. The exclusion of ~20% of allelic-counts sites, together with the lack of copy ratio data informing segmentation, account for the difference we observe in this and the previous allelic segmentation plot.
In the allele fraction plot, some of the alternate-allele fractions are around 0.35/0.65 and some are at 0/1. We also see alternate-allele fractions around 0.25/0.75 and 0.5. These suggest ploidies that are multiples of one, two, three and four.
Is it possible a copy ratio of one is not diploid but represents some other ploidy?
For the plots above, focus on chr4, chr5 and chr17. Based on both the copy ratio and allelic results, what is the zygosity of each of the chromosomes? What proportion of each chromosome could be described as having undergone copy-neutral loss of heterozygosity?
For allelic copy ratio analysis, the matched-control is a sample from the same individual as the case sample. In the somatic case, the matched-control is the germline normal sample and the case is the tumor sample from the same individual.
The matched-control case presents the following considerations.
Here are some considerations for detecting loss of heterozygosity regions.
If the case is contaminated with matched control, whether the analysis is matched or not, then the range of alternate-allele fractions becomes squished so to speak in that the contaminating normal's heterozygous sites add to the allele fractions. In this case, putative LOH segments still appear at the top and bottom edges of the allelic plot, at the lowest and highest alternate-allele fractions. For a given depth of coverage, the fraction of reads that differentiates zygosity is narrower in range and therefore harder to differentiate visually.
8F. Case-only analysis of tumor contaminated with normal still allows for LOH detection. Here, we bluntly added together the tutorial tumor and normal sample reads. Results for the matched-control analysis are similar.
In the tumor-only case, if the tumor is pure, because ModelSegments drops homozygous sites from consideration and only models sites it determines are heterozygous, the workflow cannot ascertain LOH segments. Such LOH regions may present as an absence of allelic data or as low confidence segments, i.e. having a wide confidence interval on the allelic plot. Compare such a result below to that of the matched case in 8E above.
8G. Allelic segmentation result based on allelic counts alone for case-only, when the case is pure, can produce regions of missing representation and low confidence allelic fraction segments.
Compare results. Focus on chr4, chr5 and chr17. While the matched-case gives homozygous zygosity for each of these chromosomes, the case-only allelic segmentation either presents an absence of segments for regions or gives low confidence allelic fraction segments at alternate allele fractions that are inaccurate, i.e. do not represent actual zygosity. This is particularly true for tumor samples where aneuploidy and LOH are common. Interpret case-only allelic results with caution.
Finally, remember the tutorial analyses above utilize allelic counts from gnomAD sites of common population variation that have been lifted-over from GRCh37 to GRCh38. For allelic count sites, use of sample-specific germline variant sites may incrementally increase resolution. Also, use of confident variant sites from a callset derived from alignments to the target reference may help decrease noise. Confident germline variant sites can be derived with HaplotypeCaller calling on alignments and subsequent variant filtration. Alternatively, it is possible to fine-tune ModelSegments smoothing parameters to dampen noise.
This section shows plotting results of changing some advanced smoothing parameters. The parameters and their defaults are given below, in the order of recommended consideration .
--number-of-changepoints-penalty-factor 1.0 \ --kernel-variance-allele-fraction 0.025 \ --kernel-variance-copy-ratio 0.0 \ --kernel-scaling-allele-fraction 1.0 \ --smoothing-credible-interval-threshold-allele-fraction 2.0 \ --smoothing-credible-interval-threshold-copy-ratio 2.0 \
The first four parameters impact segmentation while the last two parameters impact modeling. The following plots show the results of changing these smoothing parameters. The tutorial chose argument values arbitrarily, for illustration purposes. Results should be compared to that of 8B, which gives 398 segments.
8H. Increasing changepoints penalty factor from 1.0 to 5.0 gives 140 segments.
8I. Increasing kernel variance parameters each to 0.8 gives 144 segments. Changing
--kernel-variance-copy-ratio alone to 0.025 increases the number of segments greatly, to 1,266 segments. Changing it to 0.2 gives 414 segments.
8J. Decreasing kernel scaling from 1.0 to 0 gives 236 segments. Conversely, increasing kernel scaling from 1.0 to 5.0 gives 551 segments.
8K. Increasing both smoothing parameters each from 2.0 to 10.0 gives 263 segments.
 The GATK Resource Bundle provides two variations of a SNPs-only gnomAD project resource VCF. Both VCFs are sites-only eight-column VCFs but one retains the
AC allele count and
AF allele frequency variant-allele-specific annotations, while the other removes these to reduce file size.
AFallele frequency only for biallelic sites. Non-biallelic sites make up ~3% of the gnomAD SNPs-only resource.
 In particular, the tool considers only heterozygous sites that have counts for both the reference allele and the alternate allele. If multiple alternate alleles are present, the tool uses the alternate allele with the highest count and ignores any other alternate allele(s).
--number-of-changepoints-penalty-factorfrom its default of 1.0.
--kernel-variance-copy-ratio(default 0.0) and
--kernel-variance-allele-fraction(default 0.025), or change the weighting of the allele-fraction data by changing
--smoothing-credible-interval-threshold-copy-ratio(default 2.0) and
--smoothing-credible-interval-threshold-allele-fraction(default 2.0). Increasing these will more aggressively merge adjacent segments.
 In particular, uses Gibbs sampling, a type of MCMC sampling, towards both allele-fraction modeling and copy-ratio modeling, and additionally uses slice sampling towards allele-fraction modeling. @slee details the following substeps.
modelBeginfiles, including those for global parameters.
Many thanks to Samuel Lee (@slee), no relation, for his patient explanations of the workflow parameters and mathematics behind the workflow. Researchers may find reading through @slee's comments on the forum, a few of which the tutorials link to, insightful.