CNVDiscoveryPipeline documentation

CNVDiscoveryPipeline

The CNVDiscoveryPipeline Queue script implements a pipeline for discovering and copy number variants by seeding on read depth of coverage.

The CNV discovery pipline is a large and complex pipeline that recursively invokes the Queue workflow manager during many of the processing steps. See the CNV pipeline overview for a more complete description of the pipeline stages and processing steps.

This pipeline is complementary to the older deletion discovery pipeline, which seeds based on aberrantly spaced read pairs. Both pipelines will find large deletion sites, but the CNV pipeline will also find larger duplications and multi-allelic copy number variants. The deletion discovery pipeline will be more sensitive than the CNV pipeline for shorter deletions (at the same sequencing depth), but less sensitive for larger deletions where the breakpoints occur in more repetitive regions of the genome.

Window sizing and sequencing depth

The CNV discovery pipeline works by scanning overlapping windows across the genome looking for evidence of polymorphism based on analysis of read depth of coverage. The best window sizes to use depend on the sequencing depth in the data set being analyzed. Although a range of window sizes will typically work for any data set, window sizes that are too large may reduce sensitivity for shorter CNVs that would otherwise be detectable, while window sizes that are too small increase processing time and, in the extreme, can also reduce sensitivity.

Note that the tiling window size does not directly limit the size of variants that can be detected. In practice, it is possible to call variants that are some fraction of the tiling window size (down to approximately 50-75% of the initial tiling window size).

For deeply sequenced genomes at 30-40x coverage, we have found the following parameters to be reasonable:

   -tilingWindowSize 1000
   -tilingWindowOverlap 500
   -maximumReferenceGapLength 1000
   -boundaryPrecision 100
   -minimumRefinedLength 500
For more lightly sequenced genomes, you should modify the window sizing parameters roughly in proportion to sequencing depth. As an example, for the 1000 Genomes Project, which has sequencing depths in the range of 6-8x, we used the following parameters to call CNVs of length 3kb and up:
   -tilingWindowSize 5000
   -tilingWindowOverlap 2500
   -maximumReferenceGapLength 2500
   -boundaryPrecision 200
   -minimumRefinedLength 2500

It is a good idea to set the minimumRefinedLength parameter to a value somewhat below the minimum CNV length that can be called accurately in a given dataset and then filter to select variants at a slightly longer length scale. For instance, in the 1000 Genomes example above, a minimumRefinedLength of 2500 (bp) was used but the final call set consisted of variants 3kb and larger.

Example

The CNVDiscoveryPipeline is a script that is run using the Queue workflow engine. See the QCommandLine documentation for additional information.

The parameters used in the following example show typical values for deeply sequenced whole genomes, but these should be adjusted based on sequencing depth.

 classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"
 java -Xmx4g -cp ${classpath} \
     org.broadinstitute.gatk.queue.QCommandLine \
     -S ${SV_DIR}/qscript/discovery/cnv/CNVDiscoveryPipeline.q \
     -S ${SV_DIR}/qscript/SVQScript.q \
     -cp ${classpath} \
     -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
     -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
     -R path_to_rmd_dir/reference_genome.fasta \
     -I input_bam_files.list \
     -genderMapFile gender_map_file.txt \
     -md input_metadata_directory \
     -runDirectory run1 \
     -jobLogDir run1/logs \
     -intervalList reference_chromosomes.list \
     -tilingWindowSize 1000 \
     -tilingWindowOverlap 500 \
     -maximumReferenceGapLength 1000 \
     -boundaryPrecision 100 \
     -minimumRefinedLength 500 \
     -run

CNVDiscoveryPipeline specific arguments

Name Type Default value Summary
Required Inputs
-I List[String] NA One or more bam input files or input file lists (with extension .list)
Required Parameters
-boundaryPrecision Integer NA The desired precision on the boundary optimization phase of the pipeline
-cp String NA The java classpath
-gatk File NA The path to the GenomeAnalysisTK.jar file
-maximumReferenceGapLength Integer NA Maximum size of gaps in the reference genome that will be included within a window
-md List[String] NA Location of metadata directory created during preprocessing
-minimumRefinedLength Integer NA Minimum length of CNVs to call (in units of alignable bases)
-R File NA The reference genome sequence for the input bam files (indexed fasta format)
-runDirectory File NA Directory for running the genotyping pipeline
-tilingWindowOverlap Integer NA Overlap between adjacent windows used in searching the genome
-tilingWindowSize Integer NA Initial window size for searching the genome
Optional Parameters
-genderMapFile List[File] NA File(s) containing the declared gender for each sample
-genotypingParallelRecords Integer 1000 The number of parallel records to use in all genotyping phases
-inputFileIndexCache String NA Location where the index files for any remote input BAM files are cached
-intervalList List[String] NA Interval(s) or .list for which intervals to process (default is to process the whole genome)
-jobLogDir File NA Directory for output log files from Queue
-maximumRefinedLength Integer NA Maximum length of CNVs to call (in units of alignable bases)
-P List[String] NA Override individual parameters from the configuration property file
Advanced Parameters
-configFile File NA Configuration property file with default settings
-excludeReadGroup List[String] NA Read group(s) to exclude or .list files of read groups to exclude
-genomeMaskFile List[File] NA The genome mask file(s) for the reference genome
-lastStage Integer NA For debugging, specify the last pipeline stage to be executed
-ploidyMapFile File NA Map file defining the gender-specific ploidy of each region of the reference genome
-rmd File NA Path to the directory containing data files based on the reference genome (ploidy map, gc-bias file, etc.)
Advanced Flags
-produceAuxiliaryFiles Flag NA Flag specifying whether to produce auxiliary files and retain them

Argument details

--boundaryPrecision / -boundaryPrecision ( required Integer )

The desired precision on the boundary optimization phase of the pipeline.

This parameter determines the resolution (in alignable base pairs) at which CNV boundaries are tested during the boundary optimization phase of the pipeline.

The boundary optimization phase uses read depth, so values below 100 will have limited effect. Larger values will speed up the boundary optimization process to some extent. Larger values may be more appropriate when using low coverage sequencing data.

--classpath / -cp ( required String )

The java classpath.

--configFile / -configFile ( File )

Configuration property file with default settings.

Specifies the path to the default Genome STRiP configuration file.

The configuration file contains internal algorithm parameters and advanced settings, most of which are not documented and should rarely if ever be changed from their default values. If it is necessary to override an advance config file settings, the preferred method is to use the -P argument, not to edit or make copies the default configuration file.

--excludeReadGroup / -excludeReadGroup ( List[String] )

Read group(s) to exclude or .list files of read groups to exclude.

This option allows you to exclude specific read groups from processing (as defined by the @RG identifiers in the input files).

This can be used to "downsample" the input data on a per-readgroup basis. This can be useful if one of the read groups is found to have a high sequencing error rate or other quality issues.

To use this option, the read group identifiers in the input files must be unique.

--gatkJarFile / -gatk ( required File )

The path to the GenomeAnalysisTK.jar file.

--genderMapFile / -genderMapFile ( List[File] )

File(s) containing the declared gender for each sample.

The gender map file should be a tab-delimited file with two columns: sample identifier (as specified in the input alignment files) and sample gender (either Male/Female or M/F). Multiple files may be supplied (but should contain disjoint sets of samples). If you supply gender, it must be supplied for all samples in the input files.

A gender map (and a ploidy map file) is required for calling variants on the sex chromosomes.

SVPreprocess now estimates the gender of each sample based on read depth and produces a gender map file suitable for discovery and genotyping. Currently, we do not use these estimated gender calls by default. You can choose to either use the estimated gender, or a file based on the declared gender of each sample or reconcile the two prior to running discovery or genotyping.

--genomeMaskFile / -genomeMaskFile ( List[File] )

The genome mask file(s) for the reference genome.

This argument supplies a genome mask that is used to mask positions of the genome that should be ignored for analysis of read depth, typically because alignments of reads to these positions are not reliable. The default value for the genome mask is rmd/reference.svmask.fasta, an indexed fasta file with each position marked as 0 (unmasked) or 1 (masked).

In the current implementation, the default genome mask file is built based on a fixed k-mer length that should correspond roughly to the minimum read length in the input data set. The k-mer size used can usually be determined by inspecting the file names in the reference metadata directory you are using. If you data set contains especially short or long reads, you may want to override the default genome mask to use a mask with a different k-mer size. See ComputeGenomeMask for additional details.

In some specialized applications, such as the CNV discovery pipeline, an additional genome mask can be specified. When multiple masks are present, the union of the masked positions will not be used for read depth estimation.

The format of this file and the behavior of this argument may change in a future release.

--genotypingParallelRecords / -genotypingParallelRecords ( Integer with default value 1000 )

The number of parallel records to use in all genotyping phases.

This pipeline applies the Genome STRiP genotyping module many thousands of times during pipeline execution. This parameter controls how many windows or prospective CNV sites are genotyped in each parallel job. Higher values will cause there to be fewer total jobs, but each job will run longer. Lower values do the opposite: more parallel jobs, each running for less time. This parameter is the most effective way to control how the processing is partitioned between jobs. See also the -parallelRecords option to SVGenotyper.

--inputFile / -I ( required List[String] )

One or more bam input files or input file lists (with extension .list).

This argument can be repeated multiple times can contain a mixture of file locations and .list files. Each .list file should be a plain text file containing a list of file locations, one per line. If a .list file is used, the extension must be .list.

The input locations can be file paths or they can be URLs that use supported protocols. The supported protocols for input files include http, ftp and the Amazon s3 protocol.

--inputFileIndexCache / -inputFileIndexCache ( String )

Location where the index files for any remote input BAM files are cached.

This argument specifies the location (usually a local file system path) where the index files that correspond to the input alignment files are cached (or should be cached). This parameter is used with remote input files accessed over web protocols like http or ftp.

For some public data sets, pre-computed Genome STRiP metadata is available and this metadata may include the bam index files for the dataset. If the precomputed metadata has been downloaded to the local file system, this parameter is used to indicate where the bam indexes for the remote bam files can be found locally.

--intervalList / -intervalList ( List[String] )

Interval(s) or .list for which intervals to process (default is to process the whole genome).

This parameter allows you to restrict processing to a subset of the genome.

For example, you can supply a list file containing just the names of the autosomal chromosomes (as specified in your reference genome) to limit processing to the autosome. You can also use this parameter to exclude processing of unplaced contigs or decoy sequence in your reference genome (recommended).

--jobLogDirectory / -jobLogDir ( File )

Directory for output log files from Queue.

This directory is used to store log files from the parallel jobs run by Queue during execution of the discovery pipeline. The log files contain information that can be helpful for debugging or performance tuning.

If not supplied, the default is to use the current working directory. A typical strategy is to make a log directory underneath the run directory for each SVDiscovery run and direct the log files there.

--lastStage / -lastStage ( Integer )

For debugging, specify the last pipeline stage to be executed.

Forces the pipeline to stop after N stages, rather than running to completion.

--maximumReferenceGapLength / -maximumReferenceGapLength ( required Integer )

Maximum size of gaps in the reference genome that will be included within a window.

This pipeline starts by looking for evidence of variation in windows across the genome. This parameter defines the length of gap on the reference sequence that will be included within a window. Larger gaps will cause a new window to be started.

--maximumRefinedLength / -maximumRefinedLength ( Integer )

Maximum length of CNVs to call (in units of alignable bases).

This parameter controls the search during the boundary refinement phase of the pipeline. No CNVs larger than this length will be called by the pipeline. Default is no limit on the maximum length.

--metaDataLocation / -md ( required List[String] )

Location of metadata directory created during preprocessing.

This argument specifies the location of the Genome STRiP preprocessing output for the input bam files.

If preprocessing was run in batches, multiple -md arguments may be supplied. This allows joint calling to be run across multiple batches of input data sets. In practice, the number of -md arguments should be kept fairly small, as the metadata directories will be merged on the fly.

Joint calling can only be performed across compatible sets of input data (e.g. data aligned to the same reference sequence).

--minimumRefinedLength / -minimumRefinedLength ( required Integer )

Minimum length of CNVs to call (in units of alignable bases).

This parameter controls the search during the boundary refinement phase of the pipeline. No CNVs shorter than this length will be called by the pipeline. Setting this parameter too low can lead to loss of sensitivity during boundary refinement.

--parameter / -P ( List[String] )

Override individual parameters from the configuration property file.

This argument can be used to override advanced settings from the configuration property file on a case-by-case basis. The syntax is -P name:value where name is one of the parameters in the configuration file.

--ploidyMapFile / -ploidyMapFile ( File )

Map file defining the gender-specific ploidy of each region of the reference genome.

Although technically optional, the ploidy map file is required by some of the pipelines in Genome STRiP, including SVPreprocess.

The ploidy map file is generally present in the reference metadata directory along with the reference genome. The ploidy map file is used to indicate which parts of the reference genome have gender-dependent ploidy. This is used in conjunction with the gender of each sample to process sex chromosomes.

If you are using a reference genome that does not have a complete metadata directory (for example, a non-human reference), you will need to create your own ploidy map if you want to process sex chromosomes.

If no ploidy map file is supplied, some of the time Genome STRiP will default to behave as if the entire input genome has ploidy 2 (i.e. is diploid) in all individuals. You can also force this behavior by creating a ploidy map with a single line containing "* * * * 2" to indicate that all chromosomes are diploid.

--produceAuxilliaryFiles / -produceAuxiliaryFiles ( Flag )

Flag specifying whether to produce auxiliary files and retain them.

If this flag is supplied, then the pipeline will keep auxilliary files at all stages of processing. This can be useful for debugging, but requires more working storage in the run directory.

--referenceFile / -R ( required File )

The reference genome sequence for the input bam files (indexed fasta format).

This must be the reference genome sequence that was used to align the input bam files.

In addition, this argument sets the default value for the reference metadata directory (see -rmd) which contains additional information about the reference genome that is required by Genome STRiP. Reference metadata directories are supplied with Genome STRiP for common human reference assemblies and generally you should set -R to refer to one of these directories that is appropriate for your input data.

--referenceMetaDataLocation / -rmd ( File )

Path to the directory containing data files based on the reference genome (ploidy map, gc-bias file, etc.).

The -rmd location defaults to the directory of the reference genome, as specified by -R, and normally does not need to be supplied.

This argument can be used when you need to replace the reference genome with another file but want to continue to use the other auxilliary files in the reference metadata directory. In this case, the replacement reference genome must be very nearly identical to the original reference genome.

If you cannot use one of the standard sets of reference metadata supplied with Genome STRiP, for example because you are processing non-human data or using a different human reference genome, then you will need to generate the equivalent data for your reference genome.

--runDirectory / -runDirectory ( required File )

Directory for running the genotyping pipeline.

The CNVDiscovery pipeline uses the Queue pipeline manager to run many processes in parallel. The run directory contains files and subdirectories used to track the stages of the pipeline, to track the status of parallel jobs within each stage, and to hold intermediate results.

The run directory must be exclusive to a single pipeline run on a single input data set. Attempting to run multiple pipeline jobs in the same run directory, either serially or in parallel, will lead to conflicts over the intermediate files and status files in the run directory.

--tilingWindowOverlap / -tilingWindowOverlap ( required Integer )

Overlap between adjacent windows used in searching the genome.

This pipeline starts by looking for evidence of variation in windows across the genome. This parameter specifies the overlap between adjacent windows (in units of alignable bases). Typical practice is to set this to half of the tilingWindowSize.

--tilingWindowSize / -tilingWindowSize ( required Integer )

Initial window size for searching the genome.

This pipeline starts by looking for evidence of variation in windows across the genome. This parameter specifies the initial window size (in units of alignable bases).