**BETA** DetermineGermlineContigPloidy

Determines the baseline contig ploidy for germline samples given counts data.

Category Copy Number Variant Discovery


Overview

Determines the baseline contig ploidy for germline samples given counts data. The counts data should be provided as either HDF5 or TSV count files generated by CollectFragmentCounts.

Important Remark: The Bayesian model underlying this tool assumes integer ploidy states (in contrast to fractional/variable ploidy states). Therefore, the tool is to be used strictly for germline samples and for the purpose of sex genotyping and detecting germline aneuploidy in autosomal contigs. The presence of significant somatic events and mosaicism (e.g. sex chromosome loss and somatic trisomy) will naturally lead to unreliable results. We strongly recommend inspecting genotyping qualities (GQ) from the tool output and considering dropping low-GQ contigs in downstream analyses. Finally, given the Bayesian nature of this tool, we suggest including as many high-quality germline samples as possible when building a ploidy-model in the COHORT mode (see below). This will downplay the role of questionable samples and will yield a more reliable estimation of genuine sequencing biases.
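The GQ recommendation above can be sketched as a simple threshold filter. This is a minimal illustration that assumes the per-contig calls have already been parsed into records; the record keys ("contig", "ploidy", "gq"), the helper name, and the threshold of 20 are hypothetical and are not defined by the tool.

```python
def drop_low_gq_contigs(calls, min_gq):
    """Split per-contig ploidy calls into (kept, dropped) by genotyping quality.

    `calls` is a list of dicts with hypothetical keys "contig", "ploidy", "gq".
    """
    kept = [c for c in calls if c["gq"] >= min_gq]
    dropped = [c for c in calls if c["gq"] < min_gq]
    return kept, dropped


# Illustrative values only; these are not real tool output.
calls = [
    {"contig": "1", "ploidy": 2, "gq": 95.0},
    {"contig": "X", "ploidy": 1, "gq": 12.5},
    {"contig": "Y", "ploidy": 1, "gq": 88.0},
]

kept, dropped = drop_low_gq_contigs(calls, min_gq=20.0)
print("kept:", [c["contig"] for c in kept])
print("dropped:", [c["contig"] for c in dropped])
```

A suitable threshold depends on the data and should be chosen by inspecting the GQ values actually reported.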

This tool has two modes as described below:

COHORT mode:
If a ploidy-model directory is not provided via the model argument, the tool is run in the COHORT mode. In this mode, ploidy-model parameters (e.g. coverage bias and variance for each contig) are inferred, along with baseline contig ploidy states of each sample. A tab-separated table specifying prior probabilities for each ploidy state and for each contig is required in this mode and must be specified via the contig-ploidy-priors argument. The following shows an example of such a table:
CONTIG_NAME PLOIDY_PRIOR_0 PLOIDY_PRIOR_1 PLOIDY_PRIOR_2 PLOIDY_PRIOR_3
1 0.01 0.01 0.97 0.01
2 0.01 0.01 0.97 0.01
X 0.01 0.49 0.49 0.01
Y 0.50 0.50 0.00 0.00
Note that the contig names under the CONTIG_NAME column must match contig names in the counts files, and all contigs appearing in the counts files must have a corresponding entry in the priors table. The order of contigs is immaterial in this table. The highest ploidy state is determined by the prior table (3 in the above example). A ploidy state can be strictly forbidden by setting its prior probability to 0. For example, the Y contig in the above example can only assume 0 and 1 ploidy states.
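A priors table can be sanity-checked before running the tool. The sketch below parses the example table, checks that every contig in the counts files has an entry, and lists which ploidy states each contig can assume. The helper names are hypothetical, and the check that each row sums to 1 is an assumption about well-formed probabilities rather than a documented requirement of the tool.

```python
import csv
import io

# The example priors table from above, written with tab separators.
PRIORS_TSV = (
    "CONTIG_NAME\tPLOIDY_PRIOR_0\tPLOIDY_PRIOR_1\tPLOIDY_PRIOR_2\tPLOIDY_PRIOR_3\n"
    "1\t0.01\t0.01\t0.97\t0.01\n"
    "2\t0.01\t0.01\t0.97\t0.01\n"
    "X\t0.01\t0.49\t0.49\t0.01\n"
    "Y\t0.50\t0.50\t0.00\t0.00\n"
)


def read_priors(handle):
    """Return {contig: [prior for ploidy 0, 1, 2, ...]} from a priors table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if header[0] != "CONTIG_NAME":
        raise ValueError("first column must be CONTIG_NAME")
    return {row[0]: [float(p) for p in row[1:]] for row in reader if row}


def check_priors(priors, count_contigs, tol=1e-6):
    """Hypothetical sanity checks; this is not the tool's own validation."""
    missing = set(count_contigs) - set(priors)
    if missing:
        raise ValueError(f"contigs without priors: {sorted(missing)}")
    for contig, probs in priors.items():
        # Assumption: the priors of each contig should form a distribution.
        if abs(sum(probs) - 1.0) > tol:
            raise ValueError(f"priors for {contig} sum to {sum(probs)}")


def allowed_states(priors):
    """Ploidy states with a non-zero prior, per contig."""
    return {c: [k for k, p in enumerate(probs) if p > 0]
            for c, probs in priors.items()}


priors = read_priors(io.StringIO(PRIORS_TSV))
check_priors(priors, count_contigs=["1", "2", "X", "Y"])
states = allowed_states(priors)
print(states)
```

To check a real file, pass an open file handle to read_priors instead of the in-memory string.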

The tool output in the COHORT mode will contain two subdirectories, one ending with "-model" and the other ending with "-calls". The model subdirectory contains the inferred parameters of the ploidy-model, which may be used later on for ploidy determination in one or more similarly-sequenced samples (see below). The calls subdirectory contains one subdirectory for each sample, listing various sample-specific quantities such as the global read-depth, average ploidy, per-contig baseline ploidies, and per-contig coverage variance estimates.
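The output layout described above can be located programmatically by matching the directory-name suffixes. This is a minimal sketch: the helper name is hypothetical, and the directory and sample names created in the demonstration are placeholders, not names the tool is guaranteed to produce.

```python
import tempfile
from pathlib import Path


def find_ploidy_outputs(output_dir):
    """Locate the "-model" and "-calls" subdirectories and per-sample call dirs."""
    out = Path(output_dir)
    subdirs = sorted(p for p in out.iterdir() if p.is_dir())
    model_dirs = [p for p in subdirs if p.name.endswith("-model")]
    calls_dirs = [p for p in subdirs if p.name.endswith("-calls")]
    sample_dirs = sorted(p for d in calls_dirs for p in d.iterdir() if p.is_dir())
    return {"model": model_dirs, "calls": calls_dirs, "samples": sample_dirs}


# Demonstration on a throwaway directory tree with placeholder names.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "normal_cohort-model").mkdir()
    (Path(tmp) / "normal_cohort-calls" / "sample_a").mkdir(parents=True)
    (Path(tmp) / "normal_cohort-calls" / "sample_b").mkdir()
    found = find_ploidy_outputs(tmp)
    layout = {key: [p.name for p in paths] for key, paths in found.items()}

print(layout)
```

For a run that produced no "-model" subdirectory, the "model" list is simply empty.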

CASE mode:
If a directory containing previously inferred ploidy-model parameters is provided via the model argument, then the tool is run in the CASE mode. In this mode, the ploidy-model parameters are loaded from the provided directory and only sample-specific quantities are inferred. Consequently, the output directory will only contain the "-calls" subdirectory.

In the CASE mode, the contig ploidy prior table is taken directly from the provided model parameters path and must not be provided again.
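The rules that separate the two modes can be captured in a small wrapper that assembles the command line and rejects invalid combinations before the tool is launched. The function below is a hypothetical helper, not part of GATK; it only builds an argument list from the options documented on this page and does not run anything.

```python
def build_ploidy_command(inputs, output, output_prefix, model=None, priors=None):
    """Assemble a DetermineGermlineContigPloidy argument list (hypothetical helper)."""
    if not inputs:
        raise ValueError("at least one --input counts file is required")
    cmd = ["gatk", "DetermineGermlineContigPloidy"]
    if model is not None:
        # CASE mode: the priors table is taken from the model directory.
        if priors is not None:
            raise ValueError("CASE mode: do not pass --contig-ploidy-priors with --model")
        cmd += ["--model", model]
    else:
        # COHORT mode: priors are required, and a single sample needs a model.
        if priors is None:
            raise ValueError("COHORT mode requires --contig-ploidy-priors")
        if len(inputs) < 2:
            raise ValueError("a single sample requires an input ploidy-model directory")
        cmd += ["--contig-ploidy-priors", priors]
    for path in inputs:
        cmd += ["--input", path]
    cmd += ["--output", output, "--output-prefix", output_prefix]
    return cmd


cohort_cmd = build_ploidy_command(
    ["normal_1.counts.hdf5", "normal_2.counts.hdf5"],
    "output_dir",
    "normal_cohort",
    priors="a_valid_ploidy_priors_table.tsv",
)
case_cmd = build_ploidy_command(
    ["normal_1.counts.hdf5"],
    "output_dir",
    "normal_case",
    model="a_valid_ploidy_model_dir",
)
print(" ".join(cohort_cmd))
print(" ".join(case_cmd))
```

The returned list can be handed to subprocess.run if the wrapper is used to launch the tool.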

Usage examples

COHORT mode:

 gatk DetermineGermlineContigPloidy \
   --input normal_1.counts.hdf5 \
   --input normal_2.counts.hdf5 \
   ... \
   --contig-ploidy-priors a_valid_ploidy_priors_table.tsv \
   --output output_dir \
   --output-prefix normal_cohort
 

CASE mode:

 gatk DetermineGermlineContigPloidy \
   --model a_valid_ploidy_model_dir \
   --input normal_1.counts.hdf5 \
   --input normal_2.counts.hdf5 \
   ... \
   --output output_dir \
   --output-prefix normal_case
 

DetermineGermlineContigPloidy specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.

Argument name(s) Default value Summary
Required Arguments
--input
[] Input read-count files containing integer read counts in genomic intervals for all samples. Intervals must be identical and in the same order for all samples. If only a single sample is specified, an input ploidy-model directory must also be specified.
--output
null Output directory for sample contig-ploidy calls and the contig-ploidy model parameters for future use.
--output-prefix
null Prefix for output filenames.
Optional Tool Arguments
--adamax-beta-1
0.9 Adamax optimizer first moment estimation forgetting factor.
--adamax-beta-2
0.999 Adamax optimizer second moment estimation forgetting factor.
--arguments_file
[] read one or more arguments files and add them to the command line
--caller-admixing-rate
0.75 Admixing ratio of new and old caller posteriors (between 0 and 1; higher means using more of the new posterior)
--caller-update-convergence-threshold
0.001 Maximum tolerated calling update size for convergence.
--contig-ploidy-priors
null Input file specifying contig-ploidy priors. If only a single sample is specified, this input should not be provided. If multiple samples are specified, this input is required.
--convergence-snr-averaging-window
5000 Averaging window for calculating training SNR for evaluating convergence.
--convergence-snr-countdown-window
10 The number of ADVI iterations during which the SNR is required to stay below the set threshold for convergence.
--convergence-snr-trigger-threshold
0.1 The SNR threshold to be reached for triggering convergence.
--disable-annealing
false (advanced) Disable annealing.
--disable-caller
false (advanced) Disable caller.
--disable-sampler
false (advanced) Disable sampler.
--gcs-max-retries
 -gcs-retries
20 If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection
--global-psi-scale
0.001 Global contig-level unexplained variance scale. If a single sample is provided, this input will be ignored.
--help
 -h
false display the help message
--initial-temperature
2.0 Initial temperature (for DA-ADVI).
--learning-rate
0.05 Adamax optimizer learning rate.
--log-emission-samples-per-round
2000 Log emission samples drawn per round of sampling.
--log-emission-sampling-median-rel-error
5.0E-4 Maximum tolerated median relative error in log emission sampling.
--log-emission-sampling-rounds
100 Log emission maximum sampling rounds.
--mapping-error-rate
0.01 Typical mapping error rate.
--max-advi-iter-first-epoch
1000 Maximum ADVI iterations in the first epoch.
--max-advi-iter-subsequent-epochs
1000 Maximum ADVI iterations in the subsequent epochs.
--max-calling-iters
1 Maximum number of calling internal self-consistency iterations.
--max-training-epochs
100 Maximum number of training epochs.
--mean-bias-standard-deviation
0.01 Contig-level mean bias standard deviation. If a single sample is provided, this input will be ignored.
--min-training-epochs
20 Minimum number of training epochs.
--model
null Input ploidy-model directory. If only a single sample is specified, this input is required. If multiple samples are specified, this input should not be provided.
--num-thermal-epochs
20 Number of thermal epochs (for DA-ADVI).
--sample-psi-scale
1.0E-4 Sample-specific contig-level unexplained variance scale.
--version
false display the version number for this tool
Optional Common Arguments
--gatk-config-file
null A configuration file to use with the GATK.
--QUIET
false Whether to suppress job-summary info on System.err.
--TMP_DIR
[] Undocumented option
--use-jdk-deflater
 -jdk-deflater
false Whether to use the JdkDeflater (as opposed to IntelDeflater)
--use-jdk-inflater
 -jdk-inflater
false Whether to use the JdkInflater (as opposed to IntelInflater)
--verbosity
INFO Control verbosity of logging.
Advanced Arguments
--showHidden
false display hidden arguments

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.


--adamax-beta-1 / NA

Adamax optimizer first moment estimation forgetting factor.

double  0.9  [ [ 0  1 ] ]


--adamax-beta-2 / NA

Adamax optimizer second moment estimation forgetting factor.

double  0.999  [ [ 0  1 ] ]


--arguments_file / NA

read one or more arguments files and add them to the command line

List[File]  []


--caller-admixing-rate / NA

Admixing ratio of new and old caller posteriors (between 0 and 1; higher means using more of the new posterior)

double  0.75  [ [ 0  ∞ ] ]
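One way to read the admixing ratio is as the weight of a convex combination between the previous and the newly computed posterior. The sketch below assumes simple linear mixing of probabilities; this is an illustration of the description above, not a statement of how the tool implements the update, and the function name is hypothetical.

```python
def admix_posteriors(old, new, rate):
    """Mix two discrete posteriors; a higher rate uses more of the new posterior.

    Assumes linear mixing: rate * new + (1 - rate) * old.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("admixing rate must be between 0 and 1")
    if len(old) != len(new):
        raise ValueError("posteriors must have the same number of states")
    return [rate * n + (1.0 - rate) * o for o, n in zip(old, new)]


old_posterior = [0.5, 0.5]
new_posterior = [0.1, 0.9]
mixed = admix_posteriors(old_posterior, new_posterior, rate=0.75)
print(mixed)
```

With a rate of 1 the old posterior is discarded entirely; with a rate of 0 the update has no effect.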


--caller-update-convergence-threshold / NA

Maximum tolerated calling update size for convergence.

double  0.001  [ [ 0  ∞ ] ]


--contig-ploidy-priors / NA

Input file specifying contig-ploidy priors. If only a single sample is specified, this input should not be provided. If multiple samples are specified, this input is required.

File  null


--convergence-snr-averaging-window / NA

Averaging window for calculating training SNR for evaluating convergence.

int  5000  [ [ 0  ∞ ] ]


--convergence-snr-countdown-window / NA

The number of ADVI iterations during which the SNR is required to stay below the set threshold for convergence.

int  10  [ [ 0  ∞ ] ]


--convergence-snr-trigger-threshold / NA

The SNR threshold to be reached for triggering convergence.

double  0.1  [ [ 0  ∞ ] ]


--disable-annealing / NA

(advanced) Disable annealing.

boolean  false


--disable-caller / NA

(advanced) Disable caller.

boolean  false


--disable-sampler / NA

(advanced) Disable sampler.

boolean  false


--gatk-config-file / NA

A configuration file to use with the GATK.

String  null


--gcs-max-retries / -gcs-retries

If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection

int  20  [ [ -∞  ∞ ] ]


--global-psi-scale / NA

Global contig-level unexplained variance scale. If a single sample is provided, this input will be ignored.

double  0.001  [ [ 0  ∞ ] ]


--help / -h

display the help message

boolean  false


--initial-temperature / NA

Initial temperature (for DA-ADVI).

double  2.0  [ [ 0  ∞ ] ]


--input / NA

Input read-count files containing integer read counts in genomic intervals for all samples. Intervals must be identical and in the same order for all samples. If only a single sample is specified, an input ploidy-model directory must also be specified.

R List[File]  []


--learning-rate / NA

Adamax optimizer learning rate.

double  0.05  [ [ 0  ∞ ] ]


--log-emission-samples-per-round / NA

Log emission samples drawn per round of sampling.

int  2000  [ [ 0  ∞ ] ]


--log-emission-sampling-median-rel-error / NA

Maximum tolerated median relative error in log emission sampling.

double  5.0E-4  [ [ 0  ∞ ] ]


--log-emission-sampling-rounds / NA

Log emission maximum sampling rounds.

int  100  [ [ 0  ∞ ] ]


--mapping-error-rate / NA

Typical mapping error rate.

double  0.01  [ [ 0  ∞ ] ]


--max-advi-iter-first-epoch / NA

Maximum ADVI iterations in the first epoch.

int  1000  [ [ 0  ∞ ] ]


--max-advi-iter-subsequent-epochs / NA

Maximum ADVI iterations in the subsequent epochs.

int  1000  [ [ 0  ∞ ] ]


--max-calling-iters / NA

Maximum number of calling internal self-consistency iterations.

int  1  [ [ 0  ∞ ] ]


--max-training-epochs / NA

Maximum number of training epochs.

int  100  [ [ 0  ∞ ] ]


--mean-bias-standard-deviation / NA

Contig-level mean bias standard deviation. If a single sample is provided, this input will be ignored.

double  0.01  [ [ 0  ∞ ] ]


--min-training-epochs / NA

Minimum number of training epochs.

int  20  [ [ 0  ∞ ] ]


--model / NA

Input ploidy-model directory. If only a single sample is specified, this input is required. If multiple samples are specified, this input should not be provided.

String  null


--num-thermal-epochs / NA

Number of thermal epochs (for DA-ADVI).

int  20  [ [ 0  ∞ ] ]


--output / NA

Output directory for sample contig-ploidy calls and the contig-ploidy model parameters for future use.

R String  null


--output-prefix / NA

Prefix for output filenames.

R String  null


--QUIET / NA

Whether to suppress job-summary info on System.err.

Boolean  false


--sample-psi-scale / NA

Sample-specific contig-level unexplained variance scale.

double  1.0E-4  [ [ 0  ∞ ] ]


--showHidden / -showHidden

display hidden arguments

boolean  false


--TMP_DIR / NA

Undocumented option

List[File]  []


--use-jdk-deflater / -jdk-deflater

Whether to use the JdkDeflater (as opposed to IntelDeflater)

boolean  false


--use-jdk-inflater / -jdk-inflater

Whether to use the JdkInflater (as opposed to IntelInflater)

boolean  false


--verbosity / -verbosity

Control verbosity of logging.

The --verbosity argument is an enumerated type (LogLevel), which can have one of the following values:

ERROR
WARNING
INFO
DEBUG

LogLevel  INFO


--version / NA

display the version number for this tool

boolean  false



GATK version 4.0.0.0 built at 10-58-2018 11:58:10.