SVDiscovery documentation

SVDiscovery

The SVDiscovery Queue script implements the original Genome STRiP deletion discovery pipeline.

The SVDiscovery deletion pipeline scans the genome for polymorphic sites of large deletions. The method is seeded iniitally on aberrantly spaced read pairs and then uses read depth as secondary support for variant sites.

This pipeline is complementary to the newer CNV discovery pipeline, which seeds based on read depth. Both pipelines will find large deletion sites, but the deletion discovery pipeline will be more sensitive than the CNV pipeline for shorter deletions (at the same sequencing depth) and less sensitive for larger deletions where the breakpoints occur in more repetitive regions of the genome.

Example

The SVDiscovery pipeline is a script that is run using the Queue workflow engine. See the QCommandLine documentation for additional information.

 classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"
 java -Xmx4g -cp ${classpath} \
     org.broadinstitute.gatk.queue.QCommandLine \
     -S ${SV_DIR}/qscript/SVDiscovery.q \
     -S ${SV_DIR}/qscript/SVQScript.q \
     -cp ${classpath} \
     -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
     -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
     -R path_to_rmd_dir/reference_genome.fasta \
     -I input_bam_files.list \
     -genderMapFile gender_map_file.txt \
     -md input_metadata_directory \
     -runDirectory run1 \
     -jobLogDir run1/logs \
     -O run1/svdiscovery.dels.vcf \
     -minimumSize 100 \
     -maximumSize 100000 \
     -run

SVDiscovery specific arguments

Name Type Default value Summary
Required Inputs
-I List[String] NA One or more bam input files or input file lists (with extension .list)
Required Outputs
-O File NA The output vcf file
Required Parameters
-cp String NA The java classpath
-gatk File NA The path to the GenomeAnalysisTK.jar file
-md List[String] NA Location of metadata directory created during preprocessing
-R File NA The reference genome sequence for the input bam files (indexed fasta format)
-runDirectory File NA Directory for running the discovery pipeline
Optional Parameters
-debug Boolean false Enable verbose output for debugging
-maximumSize Integer NA Maximum event size for SV deletion discovery
-minimumSize Integer NA Minimum event size for SV deletion discovery
-windowPadding Integer NA Window padding for SV deletion discovery
-windowSize Integer NA Window size for SV deletion discovery
-genderMapFile List[File] NA File(s) containing the declared gender for each sample
-L String NA Genome interval to process for SV discovery
-inputFileIndexCache String NA Location where the index files for any remote input BAM files are cached
-jobLogDir File NA Directory for output log files from Queue
-P List[String] NA Override individual parameters from the configuration property file
Advanced Parameters
-configFile File NA Configuration property file with default settings
-excludeReadGroup List[String] NA Read group(s) to exclude or .list files of read groups to exclude
-genomeMaskFile List[File] NA The genome mask file(s) for the reference genome
-minimumInsertSizeMapFile File NA File containing a manual overrides of the minimum insert size used to select read pairs for deletion discovery
-partition List[String] NA Specific partition(s) to rerun
-ploidyMapFile File NA Map file defining the gender-specific ploidy of each region of the reference genome
-rmd File NA Path to the directory containing data files based on the reference genome (ploidy map, gc-bias file, etc.)
-vcfCachingDistance Integer 10000 Caching distance for VCF records to preserve sort order (default 10kb)
Advanced Flags
-suppressVCFCommandLines Flag NA Flag indicating whether to suppress the command line headers in the output VCF file

Argument details

--classpath / -cp ( required String )

The java classpath.

--configFile / -configFile ( File )

Configuration property file with default settings.

Specifies the path to the default Genome STRiP configuration file.

The configuration file contains internal algorithm parameters and advanced settings, most of which are not documented and should rarely if ever be changed from their default values. If it is necessary to override an advance config file settings, the preferred method is to use the -P argument, not to edit or make copies the default configuration file.

--debug / -debug ( Boolean with default value false )

Enable verbose output for debugging.

--discoveryMaximumSize / -maximumSize ( Integer )

Maximum event size for SV deletion discovery.

The deletion discovery pipeline processes the genome in parallel in a set of overlapping intervals (each referred to as a partition). The intervals are defined by four inter-related parameters: -discoveryWindowSize, -discoveryWindowPadding, -discoveryMinimumSize and -discoveryMaximumSize.

For performance reasons, each run of the deletion discovery pipeline will search for deletions within a specified size range. The default is no maximum size, which will cause each chromosome to be called as a single window. This will be very slow and it is recommended to always specify a maximum deletion size.

For best scalability, it is recommended to run multiple SVDiscovery jobs to cover different size ranges of deletions. For example, one run to call deletions between 100bp and 100Kb, and a separate run to call deletions between 100Kb and 10Mb. The deletion discovery pipeline is not optimized for calling deletions larger than 10Mb.

--discoveryMinimumSize / -minimumSize ( Integer )

Minimum event size for SV deletion discovery.

The deletion discovery pipeline processes the genome in parallel in a set of overlapping intervals (each referred to as a partition). The intervals are defined by four inter-related parameters: -discoveryWindowSize, -discoveryWindowPadding, -discoveryMinimumSize and -discoveryMaximumSize.

For performance reasons, each run of the deletion discovery pipeline will search for deletions within a specified size range. The default is no minimum size.

For best scalability, it is recommended to run multiple SVDiscovery jobs to cover different size ranges of deletions. For example, one run to call deletions between 100bp and 100Kb, and a separate run to call deletions between 100Kb and 10Mb.

--discoveryWindowPadding / -windowPadding ( Integer )

Window padding for SV deletion discovery.

The deletion discovery pipeline processes the genome in parallel in a set of overlapping intervals (each referred to as a partition). The intervals are defined by four inter-related parameters: -discoveryWindowSize, -discoveryWindowPadding, -discoveryMinimumSize and -discoveryMaximumSize.

The window padding is intended to account for the window overlap needed to encompass all normally spaced read pairs, based on the expected distribution of DNA fragment lengths in the input data set. This value does not have a strong effect on run time, so is recommended to set it to a safely large value (for example 3 Kb for typical libraries with insert sizes under 1 Kb). If you may have libraries with longer insert sizes, however, you should set this parameter to a value that is safely above the longest expected insert size.

--discoveryWindowSize / -windowSize ( Integer )

Window size for SV deletion discovery.

The deletion discovery pipeline processes the genome in parallel in a set of overlapping intervals (each referred to as a partition). The intervals are defined by four inter-related parameters: -discoveryWindowSize, -discoveryWindowPadding, -discoveryMinimumSize and -discoveryMaximumSize.

The discovery window size specifies the "base" length of each genomic window. The default is to process each chromosome as a single window, which is inefficient. A typical value for the discovery window size is 1-5 Mb, but the optimum value can depend on the available compute infrastructure and input data set size. The actual window sizes used will be extended so that windows overlap by a minimum of the -discoveryWindowPadding plus -discoveryMaximumSize.

--excludeReadGroup / -excludeReadGroup ( List[String] )

Read group(s) to exclude or .list files of read groups to exclude.

This option allows you to exclude specific read groups from processing (as defined by the @RG identifiers in the input files).

This can be used to "downsample" the input data on a per-readgroup basis. This can be useful if one of the read groups is found to have a high sequencing error rate or other quality issues.

To use this option, the read group identifiers in the input files must be unique.

--gatkJarFile / -gatk ( required File )

The path to the GenomeAnalysisTK.jar file.

--genderMapFile / -genderMapFile ( List[File] )

File(s) containing the declared gender for each sample.

The gender map file should be a tab-delimited file with two columns: sample identifier (as specified in the input alignment files) and sample gender (either Male/Female or M/F). Multiple files may be supplied (but should contain disjoint sets of samples). If you supply gender, it must be supplied for all samples in the input files.

A gender map (and a ploidy map file) is required for calling variants on the sex chromosomes.

SVPreprocess now estimates the gender of each sample based on read depth and produces a gender map file suitable for discovery and genotyping. Currently, we do not use these estimated gender calls by default. You can choose to either use the estimated gender, or a file based on the declared gender of each sample or reconcile the two prior to running discovery or genotyping.

--genomeInterval / -L ( String )

Genome interval to process for SV discovery.

This parameter can be used to restrict deletion discovery to a specific genomic interval. The discovery windows will be chosen to cover this interval rather than the whole genome.

Note that if preprocessing of the input data set was limited to a set of restricted genome intervals, then any interval supplied during discovery must be completely contained within those restricted genome intervals.

--genomeMaskFile / -genomeMaskFile ( List[File] )

The genome mask file(s) for the reference genome.

This argument supplies a genome mask that is used to mask positions of the genome that should be ignored for analysis of read depth, typically because alignments of reads to these positions are not reliable. The default value for the genome mask is rmd/reference.svmask.fasta, an indexed fasta file with each position marked as 0 (unmasked) or 1 (masked).

In the current implementation, the default genome mask file is built based on a fixed k-mer length that should correspond roughly to the minimum read length in the input data set. The k-mer size used can usually be determined by inspecting the file names in the reference metadata directory you are using. If you data set contains especially short or long reads, you may want to override the default genome mask to use a mask with a different k-mer size. See ComputeGenomeMask for additional details.

In some specialized applications, such as the CNV discovery pipeline, an additional genome mask can be specified. When multiple masks are present, the union of the masked positions will not be used for read depth estimation.

The format of this file and the behavior of this argument may change in a future release.

--inputFile / -I ( required List[String] )

One or more bam input files or input file lists (with extension .list).

This argument can be repeated multiple times can contain a mixture of file locations and .list files. Each .list file should be a plain text file containing a list of file locations, one per line. If a .list file is used, the extension must be .list.

The input locations can be file paths or they can be URLs that use supported protocols. The supported protocols for input files include http, ftp and the Amazon s3 protocol.

--inputFileIndexCache / -inputFileIndexCache ( String )

Location where the index files for any remote input BAM files are cached.

This argument specifies the location (usually a local file system path) where the index files that correspond to the input alignment files are cached (or should be cached). This parameter is used with remote input files accessed over web protocols like http or ftp.

For some public data sets, pre-computed Genome STRiP metadata is available and this metadata may include the bam index files for the dataset. If the precomputed metadata has been downloaded to the local file system, this parameter is used to indicate where the bam indexes for the remote bam files can be found locally.

--jobLogDirectory / -jobLogDir ( File )

Directory for output log files from Queue.

This directory is used to store log files from the parallel jobs run by Queue during execution of the discovery pipeline. The log files contain information that can be helpful for debugging or performance tuning.

If not supplied, the default is to use the current working directory. A typical strategy is to make a log directory underneath the run directory for each SVDiscovery run and direct the log files there.

--metaDataLocation / -md ( required List[String] )

Location of metadata directory created during preprocessing.

This argument specifies the location of the Genome STRiP preprocessing output for the input bam files.

If preprocessing was run in batches, multiple -md arguments may be supplied. This allows joint calling to be run across multiple batches of input data sets. In practice, the number of -md arguments should be kept fairly small, as the metadata directories will be merged on the fly.

Joint calling can only be performed across compatible sets of input data (e.g. data aligned to the same reference sequence).

--minimumInsertSizeFile / -minimumInsertSizeMapFile ( File )

File containing a manual overrides of the minimum insert size used to select read pairs for deletion discovery.

By default, the threshold for detecting aberrantly space read pairs is determined based on the empirical distribution of the insert size lengths. This parameter allows selective override of this default calculation on a per-readgroup basis.

If supplied, the input file should be a tab-delimited file with four columns: SAMPLE, LIBRARY, READGROUP, ISIZE. The first three columns uniquely identify the read group and the fourth column specifies the minimum insert size which should be considered too widely spaced. A file header with the above column names is optional. Any read groups not specified in this input file will use the default criteria for aberrant read pair selection.

Note that to override the minimum insert size for a library, it is necessary to override it for all read groups sequnced from that library.

--outputFile / -O ( required File )

The output vcf file.

The output file is a site-only VCF containing all of the potential deletion sites evaluated by the deletion discovery pipeline.

Note that only the PASS sites (lines with PASS in the FILTER field in the output VCF file) should be considered confident deletion calls. The other lines indicate potential deletion sites that were evaluated but that did not pass the default filtering criteria. It is normal for there to be many such non-passing sites, especially when there are many input samples.

For advanced users, it is possible to increase sensitivity (or specificity), by re-filtering the output VCF file to apply a different set of customized site selection filters based on the annotations in the INFO field of each record. To successfully apply customized filters will typically require a deeper understanding of your data set and of the deletion discovery methods used in Genome STRiP.

--parameter / -P ( List[String] )

Override individual parameters from the configuration property file.

This argument can be used to override advanced settings from the configuration property file on a case-by-case basis. The syntax is -P name:value where name is one of the parameters in the configuration file.

--partition / -partition ( List[String] )

Specific partition(s) to rerun.

The deletion discovery pipeline runs many parallel jobs that process different sections of the reference genome. Each section is called a partition and is identified by an ID of the for Pnnnn (e.g. P1000). This argument allows you to force the pipeline to run a single partition or set of partitions and is typically used for debugging.

--ploidyMapFile / -ploidyMapFile ( File )

Map file defining the gender-specific ploidy of each region of the reference genome.

Although technically optional, the ploidy map file is required by some of the pipelines in Genome STRiP, including SVPreprocess.

The ploidy map file is generally present in the reference metadata directory along with the reference genome. The ploidy map file is used to indicate which parts of the reference genome have gender-dependent ploidy. This is used in conjunction with the gender of each sample to process sex chromosomes.

If you are using a reference genome that does not have a complete metadata directory (for example, a non-human reference), you will need to create your own ploidy map if you want to process sex chromosomes.

If no ploidy map file is supplied, some of the time Genome STRiP will default to behave as if the entire input genome has ploidy 2 (i.e. is diploid) in all individuals. You can also force this behavior by creating a ploidy map with a single line containing "* * * * 2" to indicate that all chromosomes are diploid.

--referenceFile / -R ( required File )

The reference genome sequence for the input bam files (indexed fasta format).

This must be the reference genome sequence that was used to align the input bam files.

In addition, this argument sets the default value for the reference metadata directory (see -rmd) which contains additional information about the reference genome that is required by Genome STRiP. Reference metadata directories are supplied with Genome STRiP for common human reference assemblies and generally you should set -R to refer to one of these directories that is appropriate for your input data.

--referenceMetaDataLocation / -rmd ( File )

Path to the directory containing data files based on the reference genome (ploidy map, gc-bias file, etc.).

The -rmd location defaults to the directory of the reference genome, as specified by -R, and normally does not need to be supplied.

This argument can be used when you need to replace the reference genome with another file but want to continue to use the other auxilliary files in the reference metadata directory. In this case, the replacement reference genome must be very nearly identical to the original reference genome.

If you cannot use one of the standard sets of reference metadata supplied with Genome STRiP, for example because you are processing non-human data or using a different human reference genome, then you will need to generate the equivalent data for your reference genome.

--runDirectory / -runDirectory ( required File )

Directory for running the discovery pipeline.

The SVDiscovery pipeline uses the Queue pipeline manager to run many processes in parallel to scan the genome. The run directory contains a number of intermediate files and files that track the status of the parallel jobs.

The run directory must be exclusive to a single SVDiscovery run on a single input data set. Attempting to run multiple SVDiscovery jobs (or other Genome STRiP pipelines) in the same run directory, either serially or in parallel, will lead to conflicts over the intermediate files and status files in the run directory.

--suppressVCFCommandLines / -suppressVCFCommandLines ( Flag )

Flag indicating whether to suppress the command line headers in the output VCF file.

--vcfCachingDistance / -vcfCachingDistance ( Integer with default value 10000 )

Caching distance for VCF records to preserve sort order (default 10kb).

The output VCF file from SVDiscovery must be sorted in coordinate order, however the internal processing may consider potential sites in a slightly different order. This parameter controls the amount of in-memory buffering that is used to reorder the output VCF records before they are emitted to the output file. If the amount of buffering is insufficient, the processing of the current partition will fail with an error message.

The default value is usually sufficient to prevent out-of-order records from being omitted, but in complex regions of the genome the buffering may need to be increased depending on the input data set.