**BETA** GermlineCNVCaller

Calls copy-number variants in germline samples given their counts and the output of DetermineGermlineContigPloidy.

Category Copy Number Variant Discovery


Overview

Calls copy-number variants in germline samples given their counts and the corresponding output of DetermineGermlineContigPloidy. The former should be either HDF5 or TSV count files generated by CollectFragmentCounts.

Introduction

Reliable detection of copy-number variation (CNV) from read-depth ("coverage" or "counts") data such as whole exome sequencing (WES), whole genome sequencing (WGS), and gene panel coverage profiles requires a comprehensive model of library preparation and sequencing biases. The Bayesian model and the associated inference scheme implemented in GermlineCNVCaller include provisions for inferring and explaining away much of the technical variation and for automatically determining CNV calling confidence along the genome.

The parameters of the probabilistic model for read-depth bias and variance (hereafter, "the coverage model") can be automatically inferred by GermlineCNVCaller when it is provided with a cohort of germline samples sequenced using the same sequencing platform and library preparation protocol (in the case of WES, the same capture kit). We refer to this mode as the COHORT mode. The number of samples required for the COHORT mode depends on many factors, such as the quality of the sequenced samples and how strictly the library preparation and sequencing protocols were followed. For WES and WGS, we recommend including at least 30 samples.

The parametrized coverage model can be used for CNV detection on future case samples provided that they are strictly compatible (in terms of library preparation and sequencing protocol) with the cohort used to generate the model parameters. We refer to this mode as the CASE mode. There is no lower limit on the number of samples for running GermlineCNVCaller in the CASE mode.

In both modes, the output calls of DetermineGermlineContigPloidy are required for all samples. The germline contig ploidy estimates are used for choosing the baseline copy-number state (in particular, for sex chromosomes).

Tool run modes

COHORT mode:

The tool is run in the COHORT mode by passing the argument --run-mode COHORT. In this mode, coverage model parameters are inferred simultaneously with the CNV events. Depending on available memory, it may be necessary to run the tool over a subset of all intervals, which can be specified by -L and must be present in all of the count files. The output will contain two subdirectories, one ending with "-model" and the other with "-calls".

The model subdirectory contains the inferred parameters of the coverage model, which may be used later for CNV calling in one or more similarly-sequenced samples. If a previously obtained coverage model parameter bundle is provided via --model <previous_model_path> in this mode, those parameters will only be used for initialization and a new parameter bundle will be generated based on the provided cohort. Furthermore, the range of genomic intervals is set to the range used for creating the previous parameter bundle and interval-related arguments will be ignored.
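
The warm-start case described above can be sketched as a small wrapper. This is an illustration only: the function name and its positional parameters are hypothetical, while the flags are the ones documented on this page. No -L is passed, since interval-related arguments are ignored when a previous model is supplied.

```shell
# Hypothetical wrapper: COHORT run initialized from a previous model bundle.
# Usage: cohort_warm_start PLOIDY_CALLS_DIR PREVIOUS_MODEL_DIR OUTPUT_DIR PREFIX COUNTS_FILE...
cohort_warm_start() {
  ploidy_calls=$1; previous_model=$2; output_dir=$3; prefix=$4
  shift 4
  # Rewrite the remaining arguments (count files) as "--input FILE" pairs.
  n=$#
  while [ "$n" -gt 0 ]; do
    set -- "$@" --input "$1"
    shift
    n=$((n - 1))
  done
  gatk GermlineCNVCaller \
    --run-mode COHORT \
    --contig-ploidy-calls "$ploidy_calls" \
    --model "$previous_model" \
    "$@" \
    --output "$output_dir" \
    --output-prefix "$prefix"
}
```

The new "-model" subdirectory written by such a run is a fresh parameter bundle; the bundle passed via --model only seeds the inference.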

The calls subdirectory contains one subdirectory for each sample, listing various sample-specific quantities such as the probability of various copy-number states for each interval, the GC curve, sample-specific unexplained variance, read depth, and loadings of various coverage bias factors.

CASE mode:

The tool is run in the CASE mode by passing the argument --run-mode CASE. The path to a previously obtained coverage model parameter bundle must be provided via --model <previous_model_path>. The range of genomic intervals is set to the range used for creating the parameter bundle and interval-related arguments will be ignored. The output of the CASE mode is only the "-calls" subdirectory.

Important Remarks

Choice of hyperparameters:

The quality of inferred coverage model parameters and germline CNV events is sensitive to the choice of model hyperparameters, such as the prior probability of alternative copy-number states, prevalence of active regions, the coherence length of CNV events and active/silent domains, and the typical scale of interval- and sample-specific unexplained variance. These hyperparameters are not universal and must be properly tuned for each sequencing protocol.
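
As a sketch, the hyperparameters named above appear to correspond to the flags below; that mapping is our reading of the argument summaries on this page. The wrapper name is hypothetical, and the values are the listed tool defaults, shown only to make the knobs explicit. They are not tuned recommendations.

```shell
# Hypothetical wrapper that spells out the main hyperparameters.
# Values are the documented defaults; tune them per sequencing protocol.
# Remaining required arguments (--input, --output, ...) are passed through "$@".
run_cohort_with_hyperparameters() {
  gatk GermlineCNVCaller \
    --run-mode COHORT \
    --p-alt 1.0E-6 \
    --p-active 0.01 \
    --cnv-coherence-length 10000.0 \
    --class-coherence-length 10000.0 \
    --interval-psi-scale 0.001 \
    --sample-psi-scale 1.0E-4 \
    "$@"
}
```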

Running GermlineCNVCaller on a subset of intervals:

As mentioned earlier, it may be necessary to run the tool over a subset of all intervals depending on available memory. The number of intervals must be large enough to include a contextually diverse set of regions for reliable inference of the GC bias curve, as well as other bias factors. For WES and WGS, we recommend no fewer than 10000 consecutive intervals spanning at least 10 - 50 Mb.
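
A quick pre-flight check of a candidate subset might look like the sketch below. It assumes a Picard-style interval list (header lines starting with "@", then tab-separated contig, start, end with 1-based inclusive coordinates). Both function names are hypothetical, and "span" is interpreted here as the genomic extent from the first to the last interval on each contig, which is our reading of the recommendation.

```shell
# Hypothetical helper: print "<interval count> <genomic span in bp>" for an interval list.
interval_list_summary() {
  awk -F '\t' '
    /^@/ { next }                                  # skip header lines
    NF >= 3 {
      n++
      s = $2 + 0; e = $3 + 0                       # force numeric comparison
      if (!($1 in lo) || s < lo[$1]) lo[$1] = s    # leftmost start per contig
      if (!($1 in hi) || e > hi[$1]) hi[$1] = e    # rightmost end per contig
    }
    END {
      span = 0
      for (c in lo) span += hi[c] - lo[c] + 1      # extent per contig, summed
      printf "%d %.0f\n", n, span
    }
  ' "$1"
}

# Succeeds only if the subset has at least 10000 intervals and spans at
# least 10 Mb (the lower end of the recommended 10 - 50 Mb range).
interval_subset_ok() {
  set -- $(interval_list_summary "$1")
  [ "$1" -ge 10000 ] && [ "$2" -ge 10000000 ]
}
```

For example, `interval_subset_ok shard.interval_list || echo "subset too small"` would flag a shard that falls below either threshold (the file name is a placeholder).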

Memory Requirements for the python subprocess ("gcnvkernel"):

The computation done by this tool is, for the most part, performed outside of the JVM in a spawned python subprocess. The Java heap memory is only used for loading sample counts and preparing raw data for the python subprocess. The user must ensure that the machine has enough free physical memory for spawning and executing the python subprocess. Generally speaking, the resource requirements of this tool scale linearly with each of the number of samples, the number of modeled intervals, the highest copy number state, the number of bias factors, and the number of knobs on the GC curve. For example, the python subprocess requires approximately 16 GB of RAM for modeling 10000 intervals for 100 samples, with a maximum of 16 bias factors and explicit GC bias modeling.
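
For rough planning, the reference point above can be extrapolated with a back-of-the-envelope helper. This is a crude sketch, not a measured model: it assumes memory is simply proportional to samples x intervals x maximum bias factors, holds every other setting fixed, and ignores constant overhead. The function name is hypothetical.

```shell
# Crude estimate (in GB, rounded up) of python subprocess memory, ASSUMING simple
# proportionality anchored at ~16 GB for 100 samples, 10000 intervals, 16 bias factors.
# Usage: estimate_gcnv_memory_gb SAMPLES INTERVALS [MAX_BIAS_FACTORS]
estimate_gcnv_memory_gb() {
  samples=$1; intervals=$2; bias_factors=${3:-16}
  num=$((16 * samples * intervals * bias_factors))
  den=$((100 * 10000 * 16))
  # Integer ceiling division.
  echo $(( (num + den - 1) / den ))
}
```

`estimate_gcnv_memory_gb 100 10000 16` reproduces the quoted 16, and doubling both the samples and the intervals gives 64. Treat such numbers as a starting point and confirm against actual memory usage.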

Usage examples

COHORT mode:

 gatk GermlineCNVCaller \
   --run-mode COHORT \
   -L intervals.interval_list \
   --contig-ploidy-calls path_to_contig_ploidy_calls \
   --input normal_1.counts.hdf5 \
   --input normal_2.counts.hdf5 \
   ... \
   --output output_dir \
   --output-prefix normal_cohort_run
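
With 30 or more samples, typing one --input per file becomes tedious. The sketch below generates the arguments from a directory. The helper name is hypothetical, and the `*.counts.hdf5` pattern is an assumption based on the file names in the example above. It also warns when the cohort is smaller than the 30 samples recommended for WES and WGS.

```shell
# Hypothetical helper: print "--input" and each counts file on separate lines,
# and warn on stderr when fewer than 30 samples are found.
cohort_input_args() {
  counts_dir=$1
  n=0
  for f in "$counts_dir"/*.counts.hdf5; do
    [ -e "$f" ] || continue          # the glob stays literal when nothing matches
    printf '%s\n' --input "$f"
    n=$((n + 1))
  done
  if [ "$n" -lt 30 ]; then
    echo "warning: only $n samples; at least 30 are recommended for COHORT mode" >&2
  fi
}
```

Provided the paths contain no whitespace, the output can be spliced into the command through word splitting, for example `gatk GermlineCNVCaller --run-mode COHORT ... $(cohort_input_args counts_dir)`, where `counts_dir` is a placeholder.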
 

CASE mode:

 gatk GermlineCNVCaller \
   --run-mode CASE \
   -L intervals.interval_list \
   --contig-ploidy-calls path_to_contig_ploidy_calls \
   --model previous_model_path \
   --input normal_1.counts.hdf5 \
   ... \
   --output output_dir \
   --output-prefix normal_case_run
 

GermlineCNVCaller specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table.

| Argument name(s) | Default value | Summary |
|---|---|---|
| Required Arguments | | |
| --contig-ploidy-calls | null | Input contig-ploidy calls directory (output of DetermineGermlineContigPloidy). |
| --input | [] | Input read-count files containing integer read counts in genomic intervals for all samples. All intervals specified via -L must be contained; if none are specified, then intervals must be identical and in the same order for all samples. |
| --output | null | Output directory. |
| --output-prefix | null | Prefix for output filenames. |
| --run-mode | null | Tool run-mode. |
| Optional Tool Arguments | | |
| --active-class-padding-hybrid-mode | 50000 | If copy-number-posterior-expectation-mode is set to hybrid, pad active intervals determined at any time by this value (in the units of bp) in order to obtain the set of intervals on which copy number posterior expectation is performed exactly. |
| --adamax-beta-1 | 0.9 | Adamax optimizer first moment estimation forgetting factor. |
| --adamax-beta-2 | 0.99 | Adamax optimizer second moment estimation forgetting factor. |
| --annotated-intervals | null | Input annotated-interval file containing annotations for GC content in genomic intervals (output of AnnotateIntervals). All intervals specified via -L must be contained. This input should not be provided if an input denoising-model directory is given (the latter already contains the annotated-interval file). |
| --arguments_file | [] | read one or more arguments files and add them to the command line |
| --caller-admixing-rate | 0.75 | Admixing ratio of new and old caller posteriors (between 0 and 1; higher means using more of the new posterior) |
| --caller-update-convergence-threshold | 0.001 | Maximum tolerated calling update size for convergence. |
| --class-coherence-length | 10000.0 | Coherence length of CNV class domains (in the units of bp). |
| --cnv-coherence-length | 10000.0 | Coherence length of CNV events (in the units of bp). |
| --convergence-snr-averaging-window | 500 | Averaging window for calculating training SNR for evaluating convergence. |
| --convergence-snr-countdown-window | 10 | The number of ADVI iterations during which the SNR is required to stay below the set threshold for convergence. |
| --convergence-snr-trigger-threshold | 0.1 | The SNR threshold to be reached for triggering convergence. |
| --copy-number-posterior-expectation-mode | HYBRID | The strategy for calculating copy number posterior expectations in the denoising model. |
| --depth-correction-tau | 10000.0 | Precision of read depth pinning to its global value. |
| --disable-annealing | false | (advanced) Disable annealing. |
| --disable-caller | false | (advanced) Disable caller. |
| --disable-sampler | false | (advanced) Disable sampler. |
| --enable-bias-factors | true | Enable discovery of bias factors. |
| --gc-curve-standard-deviation | 1.0 | Prior standard deviation of the GC curve from flat. |
| --gcs-max-retries, -gcs-retries | 20 | If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection |
| --help, -h | false | display the help message |
| --init-ard-rel-unexplained-variance | 0.1 | Initial value of ARD prior precision relative to the typical interval-specific unexplained variance scale. |
| --initial-temperature | 2.0 | Initial temperature (for DA-ADVI). |
| --interval-merging-rule, -imr | ALL | Interval merging rule for abutting intervals |
| --interval-psi-scale | 0.001 | Typical scale of interval-specific unexplained variance. |
| --intervals, -L | [] | One or more genomic intervals over which to operate |
| --learning-rate | 0.05 | Adamax optimizer learning rate. |
| --log-emission-samples-per-round | 50 | Log emission samples drawn per round of sampling. |
| --log-emission-sampling-median-rel-error | 0.005 | Maximum tolerated median relative error in log emission sampling. |
| --log-emission-sampling-rounds | 10 | Log emission maximum sampling rounds. |
| --log-mean-bias-standard-deviation | 0.1 | Standard deviation of log mean bias. |
| --mapping-error-rate | 0.01 | Typical mapping error rate. |
| --max-advi-iter-first-epoch | 100 | Maximum ADVI iterations in the first epoch. |
| --max-advi-iter-subsequent-epochs | 100 | Maximum ADVI iterations in the subsequent epochs. |
| --max-bias-factors | 5 | Maximum number of bias factors. |
| --max-calling-iters | 10 | Maximum number of calling internal self-consistency iterations. |
| --max-copy-number | 5 | Highest considered copy-number. |
| --max-training-epochs | 50 | Maximum number of training epochs. |
| --min-training-epochs | 10 | Minimum number of training epochs. |
| --model | null | Input denoising-model directory. In the COHORT mode, this argument is optional; if provided, a new model will be built using this input model for initialization. In the CASE mode, the denoising model parameters are set to this input model and therefore this argument is required. |
| --num-gc-bins | 20 | Number of knobs on the GC curves. |
| --num-thermal-epochs | 20 | Number of thermal epochs (for DA-ADVI). |
| --p-active | 0.01 | Prior probability of treating an interval as CNV-active |
| --p-alt | 1.0E-6 | Prior probability of alt copy-number with respect to contig baseline state in the reference copy number. |
| --sample-psi-scale | 1.0E-4 | Typical scale of sample-specific unexplained variance. |
| --version | false | display the version number for this tool |
| Optional Common Arguments | | |
| --exclude-intervals, -XL | [] | One or more genomic intervals to exclude from processing |
| --gatk-config-file | null | A configuration file to use with the GATK. |
| --interval-exclusion-padding, -ixp | 0 | Amount of padding (in bp) to add to each interval you are excluding. |
| --interval-padding, -ip | 0 | Amount of padding (in bp) to add to each interval you are including. |
| --interval-set-rule, -isr | UNION | Set merging approach to use for combining interval inputs |
| --QUIET | false | Whether to suppress job-summary info on System.err. |
| --TMP_DIR | [] | Undocumented option |
| --use-jdk-deflater, -jdk-deflater | false | Whether to use the JdkDeflater (as opposed to IntelDeflater) |
| --use-jdk-inflater, -jdk-inflater | false | Whether to use the JdkInflater (as opposed to IntelInflater) |
| --verbosity | INFO | Control verbosity of logging. |
| Advanced Arguments | | |
| --showHidden | false | display hidden arguments |

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.


--active-class-padding-hybrid-mode / NA

If copy-number-posterior-expectation-mode is set to hybrid, pad active intervals determined at any time by this value (in the units of bp) in order to obtain the set of intervals on which copy number posterior expectation is performed exactly.

int  50000  [ [ -∞  ∞ ] ]


--adamax-beta-1 / NA

Adamax optimizer first moment estimation forgetting factor.

double  0.9  [ [ 0  1 ] ]


--adamax-beta-2 / NA

Adamax optimizer second moment estimation forgetting factor.

double  0.99  [ [ 0  1 ] ]


--annotated-intervals / NA

Input annotated-interval file containing annotations for GC content in genomic intervals (output of AnnotateIntervals). All intervals specified via -L must be contained. This input should not be provided if an input denoising-model directory is given (the latter already contains the annotated-interval file).

File  null


--arguments_file / NA

read one or more arguments files and add them to the command line

List[File]  []


--caller-admixing-rate / NA

Admixing ratio of new and old caller posteriors (between 0 and 1; higher means using more of the new posterior)

double  0.75  [ [ 0  ∞ ] ]


--caller-update-convergence-threshold / NA

Maximum tolerated calling update size for convergence.

double  0.001  [ [ 0  ∞ ] ]


--class-coherence-length / NA

Coherence length of CNV class domains (in the units of bp).

double  10000.0  [ [ 0  ∞ ] ]


--cnv-coherence-length / NA

Coherence length of CNV events (in the units of bp).

double  10000.0  [ [ 0  ∞ ] ]


--contig-ploidy-calls / NA

Input contig-ploidy calls directory (output of DetermineGermlineContigPloidy).

R String  null


--convergence-snr-averaging-window / NA

Averaging window for calculating training SNR for evaluating convergence.

int  500  [ [ 0  ∞ ] ]


--convergence-snr-countdown-window / NA

The number of ADVI iterations during which the SNR is required to stay below the set threshold for convergence.

int  10  [ [ 0  ∞ ] ]


--convergence-snr-trigger-threshold / NA

The SNR threshold to be reached for triggering convergence.

double  0.1  [ [ 0  ∞ ] ]


--copy-number-posterior-expectation-mode / NA

The strategy for calculating copy number posterior expectations in the denoising model.

The --copy-number-posterior-expectation-mode argument is an enumerated type (CopyNumberPosteriorExpectationMode), which can have one of the following values:

MAP
EXACT
HYBRID

CopyNumberPosteriorExpectationMode  HYBRID


--depth-correction-tau / NA

Precision of read depth pinning to its global value.

double  10000.0  [ [ 0  ∞ ] ]


--disable-annealing / NA

(advanced) Disable annealing.

boolean  false


--disable-caller / NA

(advanced) Disable caller.

boolean  false


--disable-sampler / NA

(advanced) Disable sampler.

boolean  false


--enable-bias-factors / NA

Enable discovery of bias factors.

boolean  true


--exclude-intervals / -XL

One or more genomic intervals to exclude from processing
Use this argument to exclude certain parts of the genome from the analysis (like -L, but the opposite). This argument can be specified multiple times. You can use samtools-style intervals either explicitly on the command line (e.g. -XL 1 or -XL 1:100-200) or by loading in a file containing a list of intervals (e.g. -XL myFile.intervals).

List[String]  []


--gatk-config-file / NA

A configuration file to use with the GATK.

String  null


--gc-curve-standard-deviation / NA

Prior standard deviation of the GC curve from flat.

double  1.0  [ [ 0  ∞ ] ]


--gcs-max-retries / -gcs-retries

If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection

int  20  [ [ -∞  ∞ ] ]


--help / -h

display the help message

boolean  false


--init-ard-rel-unexplained-variance / NA

Initial value of ARD prior precision relative to the typical interval-specific unexplained variance scale.

double  0.1  [ [ 0  ∞ ] ]


--initial-temperature / NA

Initial temperature (for DA-ADVI).

double  2.0  [ [ 0  ∞ ] ]


--input / NA

Input read-count files containing integer read counts in genomic intervals for all samples. All intervals specified via -L must be contained; if none are specified, then intervals must be identical and in the same order for all samples.

R List[File]  []


--interval-exclusion-padding / -ixp

Amount of padding (in bp) to add to each interval you are excluding.
Use this to add padding to the intervals specified using -XL. For example, '-XL 1:100' with a padding value of 20 would turn into '-XL 1:80-120'. This is typically used to add padding around targets when analyzing exomes.

int  0  [ [ -∞  ∞ ] ]


--interval-merging-rule / -imr

Interval merging rule for abutting intervals
By default, the program merges abutting intervals (i.e. intervals that are directly side-by-side but do not actually overlap) into a single continuous interval. However you can change this behavior if you want them to be treated as separate intervals instead.

The --interval-merging-rule argument is an enumerated type (IntervalMergingRule), which can have one of the following values:

ALL
OVERLAPPING_ONLY

IntervalMergingRule  ALL


--interval-padding / -ip

Amount of padding (in bp) to add to each interval you are including.
Use this to add padding to the intervals specified using -L. For example, '-L 1:100' with a padding value of 20 would turn into '-L 1:80-120'. This is typically used to add padding around targets when analyzing exomes.

int  0  [ [ -∞  ∞ ] ]
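
The padding arithmetic in the example above can be sketched as a small helper. This illustrates the stated rule only and is not the tool's implementation. The function name is hypothetical; it clamps the start at 1 and leaves the end unclamped because the contig length is not known to it.

```shell
# Hypothetical helper: pad "contig:pos" or "contig:start-end" by PAD bases on each side.
# Contig names that themselves contain ":" are not handled.
pad_interval() {
  contig=${1%%:*}
  range=${1#*:}
  pad=$2
  case $range in
    *-*) start=${range%%-*}; end=${range#*-} ;;   # explicit start-end
    *)   start=$range; end=$range ;;              # single position
  esac
  start=$((start - pad))
  [ "$start" -lt 1 ] && start=1                   # coordinates are 1-based
  echo "$contig:$start-$((end + pad))"
}
```

`pad_interval 1:100 20` prints `1:80-120`, matching the example in the text.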


--interval-psi-scale / NA

Typical scale of interval-specific unexplained variance.

double  0.001  [ [ 0  ∞ ] ]


--interval-set-rule / -isr

Set merging approach to use for combining interval inputs
By default, the program will take the UNION of all intervals specified using -L and/or -XL. However, you can change this setting for -L, for example if you want to take the INTERSECTION of the sets instead. E.g. to perform the analysis only on chromosome 1 exomes, you could specify -L exomes.intervals -L 1 --interval-set-rule INTERSECTION. However, it is not possible to modify the merging approach for intervals passed using -XL (they will always be merged using UNION). Note that if you specify both -L and -XL, the -XL interval set will be subtracted from the -L interval set.

The --interval-set-rule argument is an enumerated type (IntervalSetRule), which can have one of the following values:

UNION
Take the union of all intervals
INTERSECTION
Take the intersection of intervals (the subset that overlaps all intervals specified)

IntervalSetRule  UNION


--intervals / -L

One or more genomic intervals over which to operate

List[String]  []


--learning-rate / NA

Adamax optimizer learning rate.

double  0.05  [ [ 0  ∞ ] ]


--log-emission-samples-per-round / NA

Log emission samples drawn per round of sampling.

int  50  [ [ 0  ∞ ] ]


--log-emission-sampling-median-rel-error / NA

Maximum tolerated median relative error in log emission sampling.

double  0.005  [ [ 0  ∞ ] ]


--log-emission-sampling-rounds / NA

Log emission maximum sampling rounds.

int  10  [ [ 0  ∞ ] ]


--log-mean-bias-standard-deviation / NA

Standard deviation of log mean bias.

double  0.1  [ [ 0  ∞ ] ]


--mapping-error-rate / NA

Typical mapping error rate.

double  0.01  [ [ 0  ∞ ] ]


--max-advi-iter-first-epoch / NA

Maximum ADVI iterations in the first epoch.

int  100  [ [ 0  ∞ ] ]


--max-advi-iter-subsequent-epochs / NA

Maximum ADVI iterations in the subsequent epochs.

int  100  [ [ 0  ∞ ] ]


--max-bias-factors / NA

Maximum number of bias factors.

int  5  [ [ 0  ∞ ] ]


--max-calling-iters / NA

Maximum number of calling internal self-consistency iterations.

int  10  [ [ 0  ∞ ] ]


--max-copy-number / NA

Highest considered copy-number.

int  5  [ [ 0  ∞ ] ]


--max-training-epochs / NA

Maximum number of training epochs.

int  50  [ [ 0  ∞ ] ]


--min-training-epochs / NA

Minimum number of training epochs.

int  10  [ [ 0  ∞ ] ]


--model / NA

Input denoising-model directory. In the COHORT mode, this argument is optional; if provided, a new model will be built using this input model for initialization. In the CASE mode, the denoising model parameters are set to this input model and therefore this argument is required.

String  null


--num-gc-bins / NA

Number of knobs on the GC curves.

int  20  [ [ 1  ∞ ] ]


--num-thermal-epochs / NA

Number of thermal epochs (for DA-ADVI).

int  20  [ [ 0  ∞ ] ]


--output / NA

Output directory.

R String  null


--output-prefix / NA

Prefix for output filenames.

R String  null


--p-active / NA

Prior probability of treating an interval as CNV-active

double  0.01  [ [ 0  ∞ ] ]


--p-alt / NA

Prior probability of alt copy-number with respect to contig baseline state in the reference copy number.

double  1.0E-6  [ [ 0  ∞ ] ]


--QUIET / NA

Whether to suppress job-summary info on System.err.

Boolean  false


--run-mode / NA

Tool run-mode.

The --run-mode argument is an enumerated type (RunMode), which can have one of the following values:

COHORT
CASE

R RunMode  null


--sample-psi-scale / NA

Typical scale of sample-specific unexplained variance.

double  1.0E-4  [ [ 0  ∞ ] ]


--showHidden / -showHidden

display hidden arguments

boolean  false


--TMP_DIR / NA

Undocumented option

List[File]  []


--use-jdk-deflater / -jdk-deflater

Whether to use the JdkDeflater (as opposed to IntelDeflater)

boolean  false


--use-jdk-inflater / -jdk-inflater

Whether to use the JdkInflater (as opposed to IntelInflater)

boolean  false


--verbosity / -verbosity

Control verbosity of logging.

The --verbosity argument is an enumerated type (LogLevel), which can have one of the following values:

ERROR
WARNING
INFO
DEBUG

LogLevel  INFO


--version / NA

display the version number for this tool

boolean  false



GATK version 4.0.0.0 built at 10-58-2018 11:58:10.