Showing tool doc from version 4.0.3.0 | The latest version is 4.1.3.0

**BETA** GermlineCNVCaller

Calls copy-number variants in germline samples given their counts and the output of DetermineGermlineContigPloidy

Category Copy Number Variant Discovery


Overview

Calls copy-number variants in germline samples given their counts and the corresponding output of DetermineGermlineContigPloidy. The former should be either HDF5 or TSV count files generated by CollectReadCounts.

Introduction

Reliable detection of copy-number variation (CNV) from read-depth ("coverage" or "counts") data such as whole exome sequencing (WES), whole genome sequencing (WGS), and custom gene sequencing panels requires a comprehensive model to account for technical variation in library preparation and sequencing. The Bayesian model and the associated inference scheme implemented in GermlineCNVCaller includes provisions for inferring and explaining away much of the technical variation. Furthermore, CNV calling confidence is automatically adjusted for each sample and genomic interval.

The parameters of the probabilistic model for read-depth bias and variance estimation (hereafter, "the coverage model") can be automatically inferred by GermlineCNVCaller by providing a cohort of germline samples sequenced using the same sequencing platform and library preparation protocol (in case of WES, the same capture kit). We refer to this run mode of the tool as the COHORT mode. The number of samples required for the COHORT mode depends on several factors such as the sequencing depth, tissue type/quality and similarity in the cohort, and the stringency of following the library preparation and sequencing protocols. For WES and WGS samples, we recommend including at least 30 samples.

The parametrized coverage model can be used for CNV calling on future case samples provided that they are strictly compatible with the cohort used to generate the model parameters (in terms of tissue type(s), library preparation and sequencing protocols). We refer to this mode as the CASE run mode. There is no lower limit on the number of samples for running GermlineCNVCaller in the case mode.

In both tool run modes, GermlineCNVCaller requires karyotyping and global read-depth information for all samples. Such information can be automatically generated by running DetermineGermlineContigPloidy on all samples, and passed on to GermlineCNVCaller by providing the ploidy output calls using the argument contig-ploidy-calls. The ploidy state of a contig is used as the baseline ("default") copy-number state of all intervals contained in that contig for the corresponding sample. All other copy-number states are treated as alternative states and get equal shares from the total alternative state probability (set using the p-alt argument).

Python environment setup

The computation done by this tool, aside from input data parsing and validation, is performed outside of the Java Virtual Machine and using the gCNV computational python module, namely gcnvkernel. It is crucial that the user has properly set up a python conda environment with gcnvkernel and its dependencies installed. If the user intends to run GermlineCNVCaller using one of the official GATK Docker images, the python environment is already set up. Otherwise, the environment must be created and activated as described in the main GATK README.md file.

Tool run modes

COHORT mode:

The tool will be run in the COHORT mode using the argument run-mode COHORT. In this mode, coverage model parameters are inferred simultaneously with the CNV states. Depending on available memory, it may be necessary to run the tool over a subset of all intervals, which can be specified by -L. The specified intervals must be present in all of the input count files. The output will contain two subdirectories, one ending with "-model" and the other with "-calls".

The model subdirectory contains a snapshot of the inferred parameters of the coverage model, which may be used later for CNV calling in one or more similarly-sequenced samples as mentioned earlier. Optionally, the path to a previously obtained coverage model parametrization can be provided via the model argument in the COHORT mode, in which case, the provided parameters will be only used for model initialization and a new parametrization will be generated based on the input count files. Furthermore, the genomic intervals are set to those used in creating the previous parameterization and interval-related arguments will be ignored. Note that the newly obtained parametrization ultimately reflects the input count files from the last run, regardless of whether or not an initialization parameter set is given. If the users wishes to model coverage data from two or more cohorts simultaneously, all of the input counts files must be given to the tool at once.

The calls subdirectory contains sequentially-ordered subdirectories for each sample, each listing various sample-specific estimated quantities such as the probability of various copy-number states for each interval, a parametrization of the GC curve, sample-specific unexplained variance, read-depth, and loadings of coverage bias factors.

CASE mode:

The tool will be run in the CASE mode using the argument run-mode CASE. The path to a previously obtained coverage model parameter bundle must be provided via the model argument in this mode. The modeled intervals are specified by the parameter bundle, all interval-related arguments are ignored in this mode, and all model intervals must be present in all of the input count files. The tool output in the CASE mode is only the "-calls" subdirectory and is organized similarly to the COHORT mode.

Note that at the moment, this tool does not automatically verify the compatibility of the provided parametrization with the provided count files. Model compatibility may be assessed a posteriori by inspecting the magnitude of sample-specific unexplained variance of each sample, and asserting that they lie within the same range as those obtained from the cohort used to generate the parametrization. This manual step is expected to be made automatic in the future.

Important Remarks

Choice of hyperparameters:

The quality of coverage model parametrization and the sensitivity/precision of germline CNV calling are sensitive to the choice of model hyperparameters, including the prior probability of alternative copy-number states (set using the p-alt argument), prevalence of active (i.e. CNV-rich) intervals (set via the p-active argument), the coherence length of CNV events and active/silent domains across intervals (set using the cnv-coherence-length and class-coherence-length arguments, respectively), and the typical scale of interval- and sample-specific unexplained variance (set using the interval-psi-scale and sample-psi-scale arguments, respectively). It is crucial to note that these hyperparameters are not universal and must be tuned for each sequencing protocol and properly set at runtime.

Running GermlineCNVCaller on a subset of intervals:

As mentioned earlier, it may be necessary to run the tool over a subset of all intervals depending on available memory. The number of intervals must be large enough to include a genomically diverse set of regions for reliable inference of the GC bias curve, as well as other bias factors. For WES and WGS, we recommend no less than 10000 consecutive intervals spanning at least 10 - 50 mb.

Memory requirements for the python subprocess ("gcnvkernel"):

The computation done by this tool, for the most part, is performed outside of JVM and via a spawned python subprocess. The Java heap memory is only used for loading sample counts and preparing raw data for the python subprocess. The user must ensure that the machine has enough free physical memory for spawning and executing the python subprocess. Generally speaking, the resource requirements of this tool scale linearly with each of the number of samples, the number of modeled intervals, the highest copy number state, the number of bias factors, and the number of knobs on the GC curve. For example, the python subprocess requires approximately 16GB of physical memory for modeling 10000 intervals for 100 samples, with 16 maximum bias factors, maximum copy-number state of 10, and explicit GC bias modeling.

Usage examples

COHORT mode:

 gatk GermlineCNVCaller \
   --run-mode COHORT \
   -L intervals.interval_list \
   --contig-ploidy-calls path_to_contig_ploidy_calls
   --input normal_1.counts.hdf5 \
   --input normal_2.counts.hdf5 \
   ... \
   --output output_dir \
   --output-prefix normal_cohort_run
 

CASE mode:

 gatk GermlineCNVCaller \
   --run-mode CASE \
   -L intervals.interval_list \
   --contig-ploidy-calls path_to_contig_ploidy_calls
   --model previous_model_path \
   --input normal_1.counts.hdf5 \
   ... \
   --output output_dir \
   --output-prefix normal_case_run
 

GermlineCNVCaller specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.

Argument name(s) Default value Summary
Required Arguments
--contig-ploidy-calls
null Input contig-ploidy calls directory (output of DetermineGermlineContigPloidy).
--input
 -I
[] Input read-count files containing integer read counts in genomic intervals for all samples. All intervals specified via -L must be contained; if none are specified, then intervals must be identical and in the same order for all samples.
--output
 -O
null Output directory.
--output-prefix
null Prefix for output filenames.
--run-mode
null Tool run-mode.
Optional Tool Arguments
--active-class-padding-hybrid-mode
50000 If copy-number-posterior-expectation-mode is set to HYBRID, CNV-active intervals determined at any time will be padded by this value (in the units of bp) in order to obtain the set of intervals on which copy number posterior expectation is performed exactly.
--adamax-beta-1
0.9 Adamax optimizer first moment estimation forgetting factor.
--adamax-beta-2
0.99 Adamax optimizer second moment estimation forgetting factor.
--annotated-intervals
null Input annotated-interval file containing annotations for GC content in genomic intervals (output of AnnotateIntervals). All intervals specified via -L must be contained. This input should not be provided if an input denoising-model directory is given (the latter already contains the annotated-interval file).
--arguments_file
[] read one or more arguments files and add them to the command line
--caller-admixing-rate
0.75 Admixing ratio of new and old called posteriors (between 0 and 1; larger values implies using more of the new posterior and less of the old posterior).
--caller-update-convergence-threshold
0.001 Maximum tolerated calling update size for convergence.
--class-coherence-length
10000.0 Coherence length of CNV-active and CNV-silent domains (in the units of bp).
--cnv-coherence-length
10000.0 Coherence length of CNV events (in the units of bp).
--convergence-snr-averaging-window
500 Averaging window for calculating training signal-to-noise ratio (SNR) for convergence checking.
--convergence-snr-countdown-window
10 The number of ADVI iterations during which the SNR is required to stay below the set threshold for convergence.
--convergence-snr-trigger-threshold
0.1 The SNR threshold to be reached before triggering the convergence countdown.
--copy-number-posterior-expectation-mode
HYBRID The strategy for calculating copy number posterior expectations in the coverage denoising model.
--depth-correction-tau
10000.0 Precision of read depth pinning to its global value.
--disable-annealing
false (advanced) Disable annealing.
--disable-caller
false (advanced) Disable caller.
--disable-sampler
false (advanced) Disable sampler.
--enable-bias-factors
true Enable discovery of bias factors.
--gc-curve-standard-deviation
1.0 Prior standard deviation of the GC curve from flat.
--gcs-max-retries
 -gcs-retries
20 If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection
--help
 -h
false display the help message
--init-ard-rel-unexplained-variance
0.1 Initial value of ARD prior precisions relative to the scale of interval-specific unexplained variance.
--initial-temperature
2.0 Initial temperature (for DA-ADVI).
--interval-merging-rule
 -imr
ALL Interval merging rule for abutting intervals
--interval-psi-scale
0.001 Typical scale of interval-specific unexplained variance.
--intervals
 -L
[] One or more genomic intervals over which to operate
--learning-rate
0.05 Adamax optimizer learning rate.
--log-emission-samples-per-round
50 Log emission samples drawn per round of sampling.
--log-emission-sampling-median-rel-error
0.005 Maximum tolerated median relative error in log emission sampling.
--log-emission-sampling-rounds
10 Log emission maximum sampling rounds.
--log-mean-bias-standard-deviation
0.1 Standard deviation of log mean bias.
--mapping-error-rate
0.01 Typical mapping error rate.
--max-advi-iter-first-epoch
100 Maximum ADVI iterations in the first epoch.
--max-advi-iter-subsequent-epochs
100 Maximum ADVI iterations in subsequent epochs.
--max-bias-factors
5 Maximum number of bias factors.
--max-calling-iters
10 Maximum number of internal self-consistency iterations within each calling step.
--max-copy-number
5 Highest allowed copy-number state.
--max-training-epochs
50 Maximum number of training epochs.
--min-training-epochs
10 Minimum number of training epochs.
--model
null Input denoising-model directory. In the COHORT mode, this argument is optional and if provided,a new model will be built using this input model to initialize. In the CASE mode, the denoising model parameters set to this input model and therefore, this argument is required.
--num-gc-bins
20 Number of bins used to represent the GC-bias curves.
--num-thermal-epochs
20 Number of thermal epochs (for DA-ADVI).
--p-active
0.01 Prior probability of treating an interval as CNV-active (in a CNV-active domains, all copy-number states are equally likely to be called).
--p-alt
1.0E-6 Total prior probability of alternative copy-number states (the reference copy-number is set to the contig integer ploidy)
--sample-psi-scale
1.0E-4 Typical scale of sample-specific correction to the unexplained variance.
--version
false display the version number for this tool
Optional Common Arguments
--exclude-intervals
 -XL
[] One or more genomic intervals to exclude from processing
--gatk-config-file
null A configuration file to use with the GATK.
--interval-exclusion-padding
 -ixp
0 Amount of padding (in bp) to add to each interval you are excluding.
--interval-padding
 -ip
0 Amount of padding (in bp) to add to each interval you are including.
--interval-set-rule
 -isr
UNION Set merging approach to use for combining interval inputs
--QUIET
false Whether to suppress job-summary info on System.err.
--TMP_DIR
[] Undocumented option
--use-jdk-deflater
 -jdk-deflater
false Whether to use the JdkDeflater (as opposed to IntelDeflater)
--use-jdk-inflater
 -jdk-inflater
false Whether to use the JdkInflater (as opposed to IntelInflater)
--verbosity
INFO Control verbosity of logging.
Advanced Arguments
--showHidden
false display hidden arguments

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.


--active-class-padding-hybrid-mode / NA

If copy-number-posterior-expectation-mode is set to HYBRID, CNV-active intervals determined at any time will be padded by this value (in the units of bp) in order to obtain the set of intervals on which copy number posterior expectation is performed exactly.

int  50000  [ [ -∞  ∞ ] ]


--adamax-beta-1 / NA

Adamax optimizer first moment estimation forgetting factor.

double  0.9  [ [ 0  1 ] ]


--adamax-beta-2 / NA

Adamax optimizer second moment estimation forgetting factor.

double  0.99  [ [ 0  1 ] ]


--annotated-intervals / NA

Input annotated-interval file containing annotations for GC content in genomic intervals (output of AnnotateIntervals). All intervals specified via -L must be contained. This input should not be provided if an input denoising-model directory is given (the latter already contains the annotated-interval file).

File  null


--arguments_file / NA

read one or more arguments files and add them to the command line

List[File]  []


--caller-admixing-rate / NA

Admixing ratio of new and old called posteriors (between 0 and 1; larger values implies using more of the new posterior and less of the old posterior).

double  0.75  [ [ 0  ∞ ] ]


--caller-update-convergence-threshold / NA

Maximum tolerated calling update size for convergence.

double  0.001  [ [ 0  ∞ ] ]


--class-coherence-length / NA

Coherence length of CNV-active and CNV-silent domains (in the units of bp).

double  10000.0  [ [ 0  ∞ ] ]


--cnv-coherence-length / NA

Coherence length of CNV events (in the units of bp).

double  10000.0  [ [ 0  ∞ ] ]


--contig-ploidy-calls / NA

Input contig-ploidy calls directory (output of DetermineGermlineContigPloidy).

R String  null


--convergence-snr-averaging-window / NA

Averaging window for calculating training signal-to-noise ratio (SNR) for convergence checking.

int  500  [ [ 0  ∞ ] ]


--convergence-snr-countdown-window / NA

The number of ADVI iterations during which the SNR is required to stay below the set threshold for convergence.

int  10  [ [ 0  ∞ ] ]


--convergence-snr-trigger-threshold / NA

The SNR threshold to be reached before triggering the convergence countdown.

double  0.1  [ [ 0  ∞ ] ]


--copy-number-posterior-expectation-mode / NA

The strategy for calculating copy number posterior expectations in the coverage denoising model.

The --copy-number-posterior-expectation-mode argument is an enumerated type (CopyNumberPosteriorExpectationMode), which can have one of the following values:

MAP
EXACT
HYBRID

CopyNumberPosteriorExpectationMode  HYBRID


--depth-correction-tau / NA

Precision of read depth pinning to its global value.

double  10000.0  [ [ 0  ∞ ] ]


--disable-annealing / NA

(advanced) Disable annealing.

boolean  false


--disable-caller / NA

(advanced) Disable caller.

boolean  false


--disable-sampler / NA

(advanced) Disable sampler.

boolean  false


--enable-bias-factors / NA

Enable discovery of bias factors.

boolean  true


--exclude-intervals / -XL

One or more genomic intervals to exclude from processing
Use this argument to exclude certain parts of the genome from the analysis (like -L, but the opposite). This argument can be specified multiple times. You can use samtools-style intervals either explicitly on the command line (e.g. -XL 1 or -XL 1:100-200) or by loading in a file containing a list of intervals (e.g. -XL myFile.intervals).

List[String]  []


--gatk-config-file / NA

A configuration file to use with the GATK.

String  null


--gc-curve-standard-deviation / NA

Prior standard deviation of the GC curve from flat.

double  1.0  [ [ 0  ∞ ] ]


--gcs-max-retries / -gcs-retries

If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection

int  20  [ [ -∞  ∞ ] ]


--help / -h

display the help message

boolean  false


--init-ard-rel-unexplained-variance / NA

Initial value of ARD prior precisions relative to the scale of interval-specific unexplained variance.

double  0.1  [ [ 0  ∞ ] ]


--initial-temperature / NA

Initial temperature (for DA-ADVI).

double  2.0  [ [ 0  ∞ ] ]


--input / -I

Input read-count files containing integer read counts in genomic intervals for all samples. All intervals specified via -L must be contained; if none are specified, then intervals must be identical and in the same order for all samples.

R List[File]  []


--interval-exclusion-padding / -ixp

Amount of padding (in bp) to add to each interval you are excluding.
Use this to add padding to the intervals specified using -XL. For example, '-XL 1:100' with a padding value of 20 would turn into '-XL 1:80-120'. This is typically used to add padding around targets when analyzing exomes.

int  0  [ [ -∞  ∞ ] ]


--interval-merging-rule / -imr

Interval merging rule for abutting intervals
By default, the program merges abutting intervals (i.e. intervals that are directly side-by-side but do not actually overlap) into a single continuous interval. However you can change this behavior if you want them to be treated as separate intervals instead.

The --interval-merging-rule argument is an enumerated type (IntervalMergingRule), which can have one of the following values:

ALL
OVERLAPPING_ONLY

IntervalMergingRule  ALL


--interval-padding / -ip

Amount of padding (in bp) to add to each interval you are including.
Use this to add padding to the intervals specified using -L. For example, '-L 1:100' with a padding value of 20 would turn into '-L 1:80-120'. This is typically used to add padding around targets when analyzing exomes.

int  0  [ [ -∞  ∞ ] ]


--interval-psi-scale / NA

Typical scale of interval-specific unexplained variance.

double  0.001  [ [ 0  ∞ ] ]


--interval-set-rule / -isr

Set merging approach to use for combining interval inputs
By default, the program will take the UNION of all intervals specified using -L and/or -XL. However, you can change this setting for -L, for example if you want to take the INTERSECTION of the sets instead. E.g. to perform the analysis only on chromosome 1 exomes, you could specify -L exomes.intervals -L 1 --interval-set-rule INTERSECTION. However, it is not possible to modify the merging approach for intervals passed using -XL (they will always be merged using UNION). Note that if you specify both -L and -XL, the -XL interval set will be subtracted from the -L interval set.

The --interval-set-rule argument is an enumerated type (IntervalSetRule), which can have one of the following values:

UNION
Take the union of all intervals
INTERSECTION
Take the intersection of intervals (the subset that overlaps all intervals specified)

IntervalSetRule  UNION


--intervals / -L

One or more genomic intervals over which to operate

List[String]  []


--learning-rate / NA

Adamax optimizer learning rate.

double  0.05  [ [ 0  ∞ ] ]


--log-emission-samples-per-round / NA

Log emission samples drawn per round of sampling.

int  50  [ [ 0  ∞ ] ]


--log-emission-sampling-median-rel-error / NA

Maximum tolerated median relative error in log emission sampling.

double  0.005  [ [ 0  ∞ ] ]


--log-emission-sampling-rounds / NA

Log emission maximum sampling rounds.

int  10  [ [ 0  ∞ ] ]


--log-mean-bias-standard-deviation / NA

Standard deviation of log mean bias.

double  0.1  [ [ 0  ∞ ] ]


--mapping-error-rate / NA

Typical mapping error rate.

double  0.01  [ [ 0  ∞ ] ]


--max-advi-iter-first-epoch / NA

Maximum ADVI iterations in the first epoch.

int  100  [ [ 0  ∞ ] ]


--max-advi-iter-subsequent-epochs / NA

Maximum ADVI iterations in subsequent epochs.

int  100  [ [ 0  ∞ ] ]


--max-bias-factors / NA

Maximum number of bias factors.

int  5  [ [ 0  ∞ ] ]


--max-calling-iters / NA

Maximum number of internal self-consistency iterations within each calling step.

int  10  [ [ 0  ∞ ] ]


--max-copy-number / NA

Highest allowed copy-number state.

int  5  [ [ 0  ∞ ] ]


--max-training-epochs / NA

Maximum number of training epochs.

int  50  [ [ 0  ∞ ] ]


--min-training-epochs / NA

Minimum number of training epochs.

int  10  [ [ 0  ∞ ] ]


--model / NA

Input denoising-model directory. In the COHORT mode, this argument is optional and if provided,a new model will be built using this input model to initialize. In the CASE mode, the denoising model parameters set to this input model and therefore, this argument is required.

String  null


--num-gc-bins / NA

Number of bins used to represent the GC-bias curves.

int  20  [ [ 1  ∞ ] ]


--num-thermal-epochs / NA

Number of thermal epochs (for DA-ADVI).

int  20  [ [ 0  ∞ ] ]


--output / -O

Output directory.

R String  null


--output-prefix / NA

Prefix for output filenames.

R String  null


--p-active / NA

Prior probability of treating an interval as CNV-active (in a CNV-active domains, all copy-number states are equally likely to be called).

double  0.01  [ [ 0  ∞ ] ]


--p-alt / NA

Total prior probability of alternative copy-number states (the reference copy-number is set to the contig integer ploidy)

double  1.0E-6  [ [ 0  ∞ ] ]


--QUIET / NA

Whether to suppress job-summary info on System.err.

Boolean  false


--run-mode / NA

Tool run-mode.

The --run-mode argument is an enumerated type (RunMode), which can have one of the following values:

COHORT
CASE

R RunMode  null


--sample-psi-scale / NA

Typical scale of sample-specific correction to the unexplained variance.

double  1.0E-4  [ [ 0  ∞ ] ]


--showHidden / -showHidden

display hidden arguments

boolean  false


--TMP_DIR / NA

Undocumented option

List[File]  []


--use-jdk-deflater / -jdk-deflater

Whether to use the JdkDeflater (as opposed to IntelDeflater)

boolean  false


--use-jdk-inflater / -jdk-inflater

Whether to use the JdkInflater (as opposed to IntelInflater)

boolean  false


--verbosity / -verbosity

Control verbosity of logging.

The --verbosity argument is an enumerated type (LogLevel), which can have one of the following values:

ERROR
WARNING
INFO
DEBUG

LogLevel  INFO


--version / NA

display the version number for this tool

boolean  false


Return to top


See also General Documentation | Tool Docs Index Tool Docs Index | Support Forum

GATK version 4.0.3.0 built at 09-43-2018 09:43:10.