LCNVDiscoveryPipeline documentation

LCNVDiscoveryPipeline

The LCNVDiscoveryPipeline Queue script implements a pipeline for discovering large (100Kb+) germline and mosaic/somatic copy number variants.

Overview

The algorithm analyzes read depth-of-coverage in fixed size bins across the genome to detect copy number changes. Multiple samples are normalized together, but each sample is called separately. The prerequsite to running this pipeline is to generate read depth profiless for a given bin size (usually, 10,000 or 100,000 bp), see GenerateDepthProfiles for instructions.

The algorithm works by scoring each sample on a subset of bin intervals and finding local maxima in those scores that are above a certain threshold. The subset of the bin intervals that get scored is controlled by the argument maxDepth. If it is set to 0 ("complete scan"), the score will be computed for all of the bin intervals. Otherwise, the scores are computed only for the bin intervals that are at most maxDepth bins in length ("progressive scan"). Since intervals of length maxDeoth may be shorter than the CNV events, the method recursively merges the profiles to create a larger bin size (2 times the previous bin size) and then the scores get computed for the larger bin size intervals. This process is repeated until merged bin size times maxDepth is greater than or equal to the length of the chromosome.

Standard practice is to use bin sizes of 10Kb ("10Kb profiles"), although 100Kb profiles (such as those generated during preprocessing) can also be used. For 10kb bins, we recommend generating the bins using a maximumReferenceGapLength of 1000. It is computationally expensive to score all the bin intervals (complexity of O(n^2)), so when using the 10Kb profiles we recommend a positive value for maxDepth, usually 50 (which is the default for 10kb profiles).

Output Format

Unlike other GenomeStrip pipelines, the LCNV pipeline outputs the calls as a tab-delimited file, not as a vcf file.

Variants are reported in each sample separately. The columns in the output file are:

CHROM
chromosome
START
start coordinate of the variant segment
END
end coordinate of the variant segment
SAMPLE
sample carrying this variant
SCORE
the score for this call
COVERAGE
the estimated copy number for this sample in the interval CHROM:START-END
NBINS
the length of the call in bins
STARTBIN
the start bin of the call (numbered starting from 1)
ENDBIN
the end bin of the call (numbered starting from 1)
MERGE
the merge value at which the call was made (see maxDepth)

Filtering

The calls emitted by the pipeline are not filtered. Additional filters need to be applied to achieve good specificity.

Standard practice is to filter the calls using these conditions: NBINS >= 10 && SCORE >= 1000

This filtering usually works pretty well for the samples that don't have excessive variability in the read depth coverage. For "noisier" samples, however, this will produce a significant number of false positive calls and more stringent filtering will be required. Higher specificity can also be achieved by filtering out shorter calls with small copy number changes (i.e. mosaic/somatic calls with COVERAGE close to 2), where there is less power to reliably detect these events.

Example

The LCNVDiscoveryPipeline is a script that is run using the Queue workflow engine. See the QCommandLine documentation for additional information.

The parameters used in the following example show typical values for deeply sequenced whole genomes.

 classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"
 java -Xmx4g -cp ${classpath} \
     org.broadinstitute.gatk.queue.QCommandLine \
     -S ${SV_DIR}/qscript/discovery/lcnv/LCNVDiscoveryPipeline.q \
     -S ${SV_DIR}/qscript/SVQScript.q \
     -cp ${classpath} \
     -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
     -R path_to_rmd_dir/reference_genome.fasta \
     -md input_metadata_directory \
     -profilesDir path_to_profiles_10000 \
     -runDirectory lcnv_output \
     -jobLogDir lcnv_output/logs \
     -run

LCNVDiscoveryPipeline specific arguments

Name Type Default value Summary
Required Parameters
-cp String NA The java classpath
-gatk File NA The path to the GenomeAnalysisTK.jar file
-profilesDir List[File] NA Path to the directory containing the input read depth profiles
-R File NA The reference genome sequence for the input bam files (indexed fasta format)
Optional Parameters
-backgroundSample List[String] NA Sample(s) or .list files of samples to be used as background samples for normalization
-genderMapFile File NA Gender map file for the samples being scanned
-jobLogDir File NA Directory for output log files from Queue
-maxDepth Integer NA The maximum length of merged bin intervals to be scored at each merge level
-sample List[String] NA Sample(s) or .list files of samples to be scanned for large CNV events
Advanced Parameters
-genomeMaskFile List[File] NA The genome mask file(s) for the reference genome
-ploidyMapFile File NA Map file defining the gender-specific ploidy of each region of the reference genome
-rmd File NA Path to the directory containing data files based on the reference genome (ploidy map, gc-bias file, etc.)

Argument details

--backgroundSample / -backgroundSample ( List[String] )

Sample(s) or .list files of samples to be used as background samples for normalization.

Background sample(s) or .list files of samples which will be used as background samples when computing the scores for each target sample. This parameter can be used to exclude some samples from the cross-sample normalization step. If this parameter is omitted, it will be defaulted to all the samples in the read depth profiles.

--classpath / -cp ( required String )

The java classpath.

--gatkJarFile / -gatk ( required File )

The path to the GenomeAnalysisTK.jar file.

--genderMapFile / -genderMapFile ( File )

Gender map file for the samples being scanned.

Gender map file for the samples being scanned. This parameter is required if the scan includes sex chromosome(s)

--genomeMaskFile / -genomeMaskFile ( List[File] )

The genome mask file(s) for the reference genome.

This argument supplies a genome mask that is used to mask positions of the genome that should be ignored for analysis of read depth, typically because alignments of reads to these positions are not reliable. The default value for the genome mask is rmd/reference.svmask.fasta, an indexed fasta file with each position marked as 0 (unmasked) or 1 (masked).

In the current implementation, the default genome mask file is built based on a fixed k-mer length that should correspond roughly to the minimum read length in the input data set. The k-mer size used can usually be determined by inspecting the file names in the reference metadata directory you are using. If you data set contains especially short or long reads, you may want to override the default genome mask to use a mask with a different k-mer size. See ComputeGenomeMask for additional details.

In some specialized applications, such as the CNV discovery pipeline, an additional genome mask can be specified. When multiple masks are present, the union of the masked positions will not be used for read depth estimation.

The format of this file and the behavior of this argument may change in a future release.

--jobLogDirectory / -jobLogDir ( File )

Directory for output log files from Queue.

This directory is used to store log files from the parallel jobs run by Queue during execution of the discovery pipeline. The log files contain information that can be helpful for debugging or performance tuning.

If not supplied, the default is to use the current working directory. A typical strategy is to make a log directory underneath the run directory for each SVDiscovery run and direct the log files there.

--maxDepth / -maxDepth ( Integer )

The maximum length of merged bin intervals to be scored at each merge level.

This parameter determines the maximum length of the merged bin intervals for which the score will be computed at each merge level. By default, if binSize is 100000 or more, maxDepth will default to 0, otherwise maxDepth will default to 50. Explicitly setting this parameter to 0 will result in all of the possible bin intervals being evaluated and scored, which is slower but more thorough.

--ploidyMapFile / -ploidyMapFile ( File )

Map file defining the gender-specific ploidy of each region of the reference genome.

Although technically optional, the ploidy map file is required by some of the pipelines in Genome STRiP, including SVPreprocess.

The ploidy map file is generally present in the reference metadata directory along with the reference genome. The ploidy map file is used to indicate which parts of the reference genome have gender-dependent ploidy. This is used in conjunction with the gender of each sample to process sex chromosomes.

If you are using a reference genome that does not have a complete metadata directory (for example, a non-human reference), you will need to create your own ploidy map if you want to process sex chromosomes.

If no ploidy map file is supplied, some of the time Genome STRiP will default to behave as if the entire input genome has ploidy 2 (i.e. is diploid) in all individuals. You can also force this behavior by creating a ploidy map with a single line containing "* * * * 2" to indicate that all chromosomes are diploid.

--profilesDir / -profilesDir ( required List[File] )

Path to the directory containing the input read depth profiles.

This parameter specifies the location(s) of the read depth profiles for some bin size (e.g. 10000 or 100000 bin size profiles). If multiple locations are specified they must contain read depth profiles for the same bin size, and the sets of samples from each location must be disjoint.

--referenceFile / -R ( required File )

The reference genome sequence for the input bam files (indexed fasta format).

This must be the reference genome sequence that was used to align the input bam files.

In addition, this argument sets the default value for the reference metadata directory (see -rmd) which contains additional information about the reference genome that is required by Genome STRiP. Reference metadata directories are supplied with Genome STRiP for common human reference assemblies and generally you should set -R to refer to one of these directories that is appropriate for your input data.

--referenceMetaDataLocation / -rmd ( File )

Path to the directory containing data files based on the reference genome (ploidy map, gc-bias file, etc.).

The -rmd location defaults to the directory of the reference genome, as specified by -R, and normally does not need to be supplied.

This argument can be used when you need to replace the reference genome with another file but want to continue to use the other auxilliary files in the reference metadata directory. In this case, the replacement reference genome must be very nearly identical to the original reference genome.

If you cannot use one of the standard sets of reference metadata supplied with Genome STRiP, for example because you are processing non-human data or using a different human reference genome, then you will need to generate the equivalent data for your reference genome.

--sample / -sample ( List[String] )

Sample(s) or .list files of samples to be scanned for large CNV events.

Sample(s) or .list files of samples for which the scan of large CNV events will be performed. If this parameter is omitted, it will be defaulted to all the samples in the read depth profiles.