GenerateDepthProfiles documentation

GenerateDepthProfiles

The GenerateDepthProfiles Queue script implements a pipeline for generating read depth profiles.

Overview

Read depth profiles summarize the read depth of coverage in each sample in bins across the genome. The bins all have an equal "effective length", but not necessarily an equal genomic length. The effective length is the number of uniquely (or reliably) alignable bases in an interval. Typical sizes for read depth profiles are 100Kb, 10Kb and sometimes 100bp. Read depth profiles are used in some downstream pipelines, such as the LCNVDiscoveryPipeline.
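
To make the distinction between effective length and genomic length concrete, the following toy sketch walks a short mask string (one character per base, 0 = alignable, 1 = masked) and closes a bin each time it has collected a fixed number of alignable bases. The function name, the string representation and the exact boundary rule are assumptions made for illustration; this is not Genome STRiP's implementation.

```shell
# Toy illustration only, not the Genome STRiP implementation.
# $1: mask string, one character per base (0 = alignable, 1 = masked)
# $2: target effective length (alignable bases per bin)
# Prints 1-based "start-end" intervals, one bin per line.
effective_length_bins() {
    awk -v mask="$1" -v size="$2" 'BEGIN {
        start = 1; effective = 0
        for (i = 1; i <= length(mask); i++) {
            # Only alignable bases count toward the effective length.
            if (substr(mask, i, 1) == "0") effective++
            if (effective == size) {
                print start "-" i
                start = i + 1
                effective = 0
            }
        }
    }'
}

effective_length_bins 000110100000 3
```

With this toy input the bins are 1-3, 4-9 and 10-12. Each has an effective length of 3, but the middle bin spans 6 genomic bases because three of them are masked.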

Example

GenerateDepthProfiles is a script that is run using the Queue workflow engine. See the QCommandLine documentation for additional information.

The parameters used in the following example show typical values for deeply sequenced whole genomes.

 classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"
 java -Xmx4g -cp ${classpath} \
     org.broadinstitute.gatk.queue.QCommandLine \
     -S ${SV_DIR}/qscript/profiles/GenerateDepthProfiles.q \
     -S ${SV_DIR}/qscript/SVQScript.q \
     -cp ${classpath} \
     -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
     -R path_to_rmd_dir/reference_genome.fasta \
     -md input_metadata_directory \
     -profileBinSize 10000 \
     -maximumReferenceGapLength 1000 \
     -runDirectory profiles_10000 \
     -jobLogDir profiles_10000/logs \
     -run
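
Profiles are often wanted at more than one bin size. A minimal sketch, assuming the same inputs as the example above and a separate run directory per bin size, is a shell loop that builds one command line per size. The sketch only prints the commands rather than running them. The /path/to/svtoolkit fallback for SV_DIR is a placeholder, and carrying -maximumReferenceGapLength 1000 over to the 100000 bin size is an assumption, not a recommendation.

```shell
# Placeholder install location; set SV_DIR to your own Genome STRiP directory.
SV_DIR=${SV_DIR:-/path/to/svtoolkit}
classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"

# Print (do not run) the command line for the bin size given as $1.
profile_command() {
    echo java -Xmx4g -cp "${classpath}" \
        org.broadinstitute.gatk.queue.QCommandLine \
        -S "${SV_DIR}/qscript/profiles/GenerateDepthProfiles.q" \
        -S "${SV_DIR}/qscript/SVQScript.q" \
        -cp "${classpath}" \
        -gatk "${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar" \
        -R path_to_rmd_dir/reference_genome.fasta \
        -md input_metadata_directory \
        -profileBinSize "$1" \
        -maximumReferenceGapLength 1000 \
        -runDirectory "profiles_$1" \
        -jobLogDir "profiles_$1/logs" \
        -run
}

for bin_size in 10000 100000; do
    profile_command "${bin_size}"
done
```

Once the placeholder paths have been replaced, removing the echo runs the commands instead of printing them.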

GenerateDepthProfiles specific arguments

Name                        Type        Default value  Summary
Required Parameters
-cp                         String      NA             The java classpath
-gatk                       File        NA             The path to the GenomeAnalysisTK.jar file
-profileBinSize             Integer     100000         Bin size of the read depth profiles
-R                          File        NA             The reference genome sequence for the input bam files (indexed fasta format)
Optional Parameters
-jobLogDir                  File        NA             Directory for output log files from Queue
-maximumReferenceGapLength  Integer     NA             An upper limit on the length of a reference gap that will be included within a bin
Advanced Parameters
-genomeMaskFile             List[File]  NA             The genome mask file(s) for the reference genome
-ploidyMapFile              File        NA             Map file defining the gender-specific ploidy of each region of the reference genome
-rmd                        File        NA             Path to the directory containing data files based on the reference genome (ploidy map, gc-bias file, etc.)

Argument details

--classpath / -cp ( required String )

The java classpath.

--gatkJarFile / -gatk ( required File )

The path to the GenomeAnalysisTK.jar file.

--genomeMaskFile / -genomeMaskFile ( List[File] )

The genome mask file(s) for the reference genome.

This argument supplies a genome mask that is used to mask positions of the genome that should be ignored for analysis of read depth, typically because alignments of reads to these positions are not reliable. The default value for the genome mask is rmd/reference.svmask.fasta, an indexed fasta file with each position marked as 0 (unmasked) or 1 (masked).

In the current implementation, the default genome mask file is built based on a fixed k-mer length that should correspond roughly to the minimum read length in the input data set. The k-mer size used can usually be determined by inspecting the file names in the reference metadata directory you are using. If your data set contains especially short or long reads, you may want to override the default genome mask to use a mask with a different k-mer size. See ComputeGenomeMask for additional details.

In some specialized applications, such as the CNV discovery pipeline, an additional genome mask can be specified. When multiple masks are present, any position that is masked in at least one of them (the union of the masked positions) is excluded from read depth estimation.
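
The effect of combining masks can be sketched on toy data. The function below takes two equal-length mask strings in the 0 (unmasked) / 1 (masked) convention described above and prints their position-wise union. The function name and the use of plain strings rather than indexed fasta files are assumptions for illustration only.

```shell
# Toy illustration only: masks are plain 0/1 strings, not indexed fasta files.
# Prints a mask in which a position is 1 if it is masked in either input.
mask_union() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        if (length(a) != length(b)) {
            print "mask lengths differ" > "/dev/stderr"
            exit 1
        }
        out = ""
        for (i = 1; i <= length(a); i++) {
            masked = (substr(a, i, 1) == "1" || substr(b, i, 1) == "1")
            out = out (masked ? "1" : "0")
        }
        print out
    }'
}

mask_union 0011000 0000110
```

For these two inputs the combined mask is 0011110, so only positions 1, 2 and 7 would contribute to read depth.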

The format of this file and the behavior of this argument may change in a future release.

--jobLogDirectory / -jobLogDir ( File )

Directory for output log files from Queue.

This directory is used to store log files from the parallel jobs run by Queue during execution of the pipeline. The log files contain information that can be helpful for debugging or performance tuning.

If not supplied, the default is to use the current working directory. A typical strategy is to make a log directory underneath the run directory for each run and direct the log files there.
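
A minimal sketch of that strategy, assuming the run directory name used in the example on this page. Whether Queue creates a missing log directory on its own is not stated here, so the sketch creates it up front, which is harmless if it already exists.

```shell
# Directory names follow the example on this page; adjust to your own run.
run_dir="profiles_10000"
log_dir="${run_dir}/logs"

# -p creates the run directory as well and does not fail if either exists.
mkdir -p "${log_dir}"

echo "-runDirectory ${run_dir} -jobLogDir ${log_dir}"
```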

--maximumReferenceGapLength / -maximumReferenceGapLength ( Integer )

An upper limit on the length of a reference gap that will be included within a bin.

This parameter determines the maximum length of a gap (run of Ns) in the reference sequence that will be allowed in any bin. If a gap longer than this value is encountered, then the bin prior to the gap will be truncated below the target bin size and the next bin will begin after the reference gap.
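
The truncation rule can be illustrated on a toy sequence. In the sketch below, bins are filled with a fixed number of non-N bases, a run of Ns no longer than the limit stays inside the current bin, and a longer run closes the current bin early and starts the next bin after the gap. The function name, the use of non-N bases as the bin size measure and the handling of the final partial bin are assumptions for illustration, not a description of the actual implementation.

```shell
# Toy illustration only, not the Genome STRiP implementation.
# $1: reference sequence, $2: target bases per bin, $3: maximum gap length
# Prints 1-based "start-end" intervals, one bin per line.
gap_aware_bins() {
    awk -v seq="$1" -v size="$2" -v max_gap="$3" 'BEGIN {
        n = length(seq); start = 1; count = 0; i = 1
        while (i <= n) {
            if (substr(seq, i, 1) == "N") {
                # Measure the run of Ns beginning at position i.
                j = i
                while (j <= n && substr(seq, j, 1) == "N") j++
                if (j - i > max_gap) {
                    # Long gap: truncate the current bin, restart after the gap.
                    if (count > 0) print start "-" (i - 1)
                    start = j
                    count = 0
                }
                i = j
                continue
            }
            count++
            if (count == size) {
                print start "-" i
                start = i + 1
                count = 0
            }
            i++
        }
        # Emit whatever is left as a final, possibly short, bin.
        if (count > 0) print start "-" n
    }'
}

gap_aware_bins ACGTNACGTNNNNACGTAC 6 2
```

With a target of 6 bases and a maximum gap of 2, the bins are 1-7, 8-9 and 14-19. The single N at position 5 is absorbed into the first bin, while the four Ns at positions 10-13 truncate the second bin to two bases and the third bin starts at position 14.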

--ploidyMapFile / -ploidyMapFile ( File )

Map file defining the gender-specific ploidy of each region of the reference genome.

Although technically optional, the ploidy map file is required by some of the pipelines in Genome STRiP, including SVPreprocess.

The ploidy map file is generally present in the reference metadata directory along with the reference genome. The ploidy map file is used to indicate which parts of the reference genome have gender-dependent ploidy. This is used in conjunction with the gender of each sample to process sex chromosomes.

If you are using a reference genome that does not have a complete metadata directory (for example, a non-human reference), you will need to create your own ploidy map if you want to process sex chromosomes.

If no ploidy map file is supplied, Genome STRiP will in some cases default to behaving as if the entire input genome has ploidy 2 (i.e. is diploid) in all individuals. You can also force this behavior by creating a ploidy map with a single line containing "* * * * 2" to indicate that all chromosomes are diploid.
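
A minimal sketch of creating such an all-diploid ploidy map from the shell. The file name is hypothetical; only the single line of content comes from the description above.

```shell
# Hypothetical file name; any path can be used.
ploidy_map="all_diploid.ploidymap.txt"

# Write the single wildcard line that marks every chromosome as diploid.
printf '%s\n' '* * * * 2' > "${ploidy_map}"

cat "${ploidy_map}"
```

The resulting file can then be supplied with -ploidyMapFile.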

--profileBinSize / -profileBinSize ( required Integer with default value 100000 )

Bin size of the read depth profiles.

This parameter specifies the bin size (most commonly, 10000 or 100000) to be used in generating the read depth profiles.

--referenceFile / -R ( required File )

The reference genome sequence for the input bam files (indexed fasta format).

This must be the reference genome sequence that was used to align the input bam files.

In addition, this argument sets the default value for the reference metadata directory (see -rmd), which contains additional information about the reference genome that is required by Genome STRiP. Reference metadata directories are supplied with Genome STRiP for common human reference assemblies, and generally you should set -R to the reference genome file in whichever of these directories is appropriate for your input data.

--referenceMetaDataLocation / -rmd ( File )

Path to the directory containing data files based on the reference genome (ploidy map, gc-bias file, etc.).

The -rmd location defaults to the directory of the reference genome, as specified by -R, and normally does not need to be supplied.
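
As a small illustration of this default, the directory that -rmd falls back to can be computed from the -R path with dirname. The function name is hypothetical and the path is the placeholder from the example on this page.

```shell
# Hypothetical helper: the default -rmd is the directory containing the -R file.
default_rmd_dir() {
    dirname "$1"
}

default_rmd_dir path_to_rmd_dir/reference_genome.fasta
```

For the placeholder path this prints path_to_rmd_dir, so an explicit -rmd is only needed when the metadata lives somewhere else.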

This argument can be used when you need to replace the reference genome with another file but want to continue to use the other auxiliary files in the reference metadata directory. In this case, the replacement reference genome must be very nearly identical to the original reference genome.

If you cannot use one of the standard sets of reference metadata supplied with Genome STRiP, for example because you are processing non-human data or using a different human reference genome, then you will need to generate the equivalent data for your reference genome.