SVPreprocess documentation

SVPreprocess

The SVPreprocess Queue script computes summary metadata for a data set that is required by other Genome STRiP pipelines.

The preprocessing pipeline in Genome STRiP computes summary metadata for one or more bam files (referred to as an input "data set"). The computed summary data is stored in a dataset metadata directory specific to the dataset. All of the other Genome STRiP pipelines and most of the associated tools require this metadata.

Most of the Genome STRiP pipelines will accept multiple dataset metadata directories as input (using the -md argument). This allows preprocessing to be performed in batches, which can then be combined for joint calling in various pipelines. The number of metadata directories should be kept reasonably small, as the metadata needs to be merged on the fly at each program invocation. The settings used to compute the metadata for each data set need to be the same if the metadata directories will be combined for joint calling.
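As a rough illustration of combining batches, the following sketch collects the -md arguments for a downstream command line. It assumes that -md is simply repeated once per metadata directory, which this page does not spell out; the function name and the directory names in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Collect "-md <dir>" pairs for several preprocessing batches.
# ASSUMPTION: downstream pipelines accept -md repeated once per directory.
md_args=()

add_metadata_dirs() {
    local dir
    for dir in "$@"; do
        # Keep each directory as its own array element so paths with spaces survive.
        md_args+=(-md "$dir")
    done
}
```

After calling, for example, `add_metadata_dirs batch1_metadata batch2_metadata`, the array can be expanded as `"${md_args[@]}"` inside the java command line of a downstream pipeline.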

Space management

The preprocessing pipeline will create a number of data files and sub-directories underneath the metadata output directory. The sub-directories generally contain intermediate files, for example per-input-file summaries that are later rolled up into one top-level file. After preprocessing has successfully completed, any sub-directory of the metadata directory (and its contents) can be deleted to reclaim space, although if you do this and then rerun preprocessing, all of the intermediate files will need to be recreated.
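One way to reclaim space after a successful run is sketched below. It removes only the immediate sub-directories of the metadata directory and leaves the top-level data files in place. The function name is hypothetical, and the optional -n argument is a convenience of this sketch, not a Genome STRiP option.

```shell
#!/usr/bin/env bash
# Remove the intermediate sub-directories of a metadata directory,
# keeping the top-level data files. Pass "-n" as the second argument
# for a dry run that only lists what would be deleted.
clean_metadata_dir() {
    local md_dir="$1" dry_run="${2:-}"
    if [ ! -d "$md_dir" ]; then
        echo "Not a directory: $md_dir" >&2
        return 1
    fi
    if [ "$dry_run" = "-n" ]; then
        # List the immediate sub-directories without touching them.
        find "$md_dir" -mindepth 1 -maxdepth 1 -type d -print
    else
        # Delete each immediate sub-directory together with its contents.
        find "$md_dir" -mindepth 1 -maxdepth 1 -type d -exec rm -rf {} +
    fi
}
```

Running `clean_metadata_dir output_metadata_directory -n` first shows what would be removed; dropping the -n performs the deletion. Only do this once preprocessing has completed successfully.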

Example

The SVPreprocess pipeline is a script that is run using the Queue workflow engine. See the QCommandLine documentation for additional information.

 classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"
 java -Xmx4g -cp ${classpath} \
     org.broadinstitute.gatk.queue.QCommandLine \
     -S ${SV_DIR}/qscript/SVPreprocess.q \
     -S ${SV_DIR}/qscript/SVQScript.q \
     -cp ${classpath} \
     -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
     -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
     -R path_to_rmd_dir/reference_genome.fasta \
     -I input_bam_files.list \
     -md output_metadata_directory \
     -bamFilesAreDisjoint true \
     -jobLogDir logDir \
     -run

SVPreprocess specific arguments

Name                            Type          Default value  Summary
Required Inputs
-I                              List[String]  NA             One or more bam input files or input file lists (with extension .list)
Required Outputs
-md                             File          NA             Location of metadata directory created during preprocessing
Required Parameters
-R                              File          NA             The reference genome sequence for the input bam files (indexed fasta format)
Optional Parameters
-bamFilesAreDisjoint            Boolean       false          Reduces memory footprint if each input file contains separate samples/libraries/read groups
-inputFileIndexCache            String        NA             Location where the index files for any remote input BAM files are cached
-P                              List[String]  NA             Override individual parameters from the configuration property file
Advanced Parameters
-computeGCProfiles              Boolean       true           True to calculate a per-library model for read depth bias as a function of genomic G+C content
-computeReadCounts              Boolean       true           True to generate compact and efficient summaries of read depth of coverage
-configFile                     File          NA             Configuration property file with default settings
-copyNumberMaskFile             File          NA             Genome mask file for regions of potentially polymorphic copy number
-depthMaximumInsertSizeRadius   Double        10.0           Maximum insert size radius for read depth analysis
-depthMinimumMappingQuality     Integer       10             Minimum mapping quality for read depth analysis
-genderMaskBedFile              File          NA             Bed file defining the sex chromosome regions that should be used in gender estimation
-genomeMaskFile                 List[File]    NA             The genome mask file(s) for the reference genome
-ploidyMapFile                  File          NA             Map file defining the gender-specific ploidy of each region of the reference genome
-readDepthMaskFile              File          NA             Genome mask file specifying regions over which the sequencing depth should be estimated
-reduceInsertSizeDistributions  Boolean       true           True to perform lossy compression on the fragment length distributions from each library to increase scalability
-rmd                            File          NA             Path to the directory containing data files based on the reference genome (ploidy map, gc-bias file, etc.)

Argument details

--bamFilesAreDisjoint / -bamFilesAreDisjoint ( Boolean with default value false )

Reduces memory footprint if each input file contains separate samples/libraries/read groups.

If set to true, this argument reduces the memory footprint and some processing time when all of the input files contain disjoint data (in other words, there are no DNA samples, libraries or read groups that have reads in more than one of the input alignment files). If you are unsure whether your input data is disjoint by input file, then it is safer to let this parameter default to false.
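If you want a quick sanity check before setting this argument to true, the sketch below compares the read group declarations of several BAM headers. It assumes each header has already been saved to its own text file (for example with `samtools view -H`), and it looks only at the ID, SM and LB tags of the @RG lines. The function name is hypothetical, and the check is deliberately conservative: any value that appears in more than one header is reported, even if it turns out to be an unrelated name collision.

```shell
#!/usr/bin/env bash
# Report read group IDs, samples (SM) and libraries (LB) that are declared
# in more than one header file. Each argument is a text file holding the
# header of one BAM file. Prints "<tag><TAB><value>" for every shared value
# and returns 1 if any are found, 0 otherwise.
find_shared_read_groups() {
    awk -F'\t' '
        $1 == "@RG" {
            for (i = 2; i <= NF; i++) {
                tag = substr($i, 1, 2)
                if (tag == "ID" || tag == "SM" || tag == "LB") {
                    key = tag "\t" substr($i, 4)
                    # Count each value at most once per header file.
                    if (!((key, FILENAME) in seen)) {
                        seen[key, FILENAME] = 1
                        count[key]++
                    }
                }
            }
        }
        END {
            shared = 0
            for (key in count) {
                if (count[key] > 1) {
                    print key
                    shared = 1
                }
            }
            exit shared
        }
    ' "$@"
}
```

If the function prints nothing and returns 0, the headers declare no common read groups, samples or libraries. This only inspects headers, so when in doubt it remains safer to leave -bamFilesAreDisjoint at its default of false.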

--computeGCProfiles / -computeGCProfiles ( Boolean with default value true )

True to calculate a per-library model for read depth bias as a function of genomic G+C content.

By default, the G+C bias in read depth coverage is measured during preprocessing on a per-library basis. If this option is disabled, this information will not be available to downstream pipelines.

--computeReadCounts / -computeReadCounts ( Boolean with default value true )

True to generate compact and efficient summaries of read depth of coverage.

By default, during preprocessing a "read count cache" of summarized read counts is produced for each sample (and read group). This read count cache is used heavily by downstream pipelines to increase performance and eliminate the need in many cases to access the input alignment files.

--configFile / -configFile ( File )

Configuration property file with default settings.

Specifies the path to the default Genome STRiP configuration file.

The configuration file contains internal algorithm parameters and advanced settings, most of which are not documented and should rarely if ever be changed from their default values. If it is necessary to override an advanced configuration file setting, the preferred method is to use the -P argument, not to edit or make copies of the default configuration file.

--copyNumberMaskFile / -copyNumberMaskFile ( File )

Genome mask file for regions of potentially polymorphic copy number.

This argument supplies a genome mask that is used during estimation of GC-bias in each sequencing library. The default value for this argument is rmd/reference.gcmask.fasta, an indexed fasta file with each position marked as 0 (unmasked) or 1 (masked). This file is typically provided along with the other reference metadata in the reference metadata directory.

If you are using a reference genome that does not have a complete metadata directory (for example, a non-human reference), you will want to create your own mask file to use. For the human genome, the default mask aggressively removes all portions of the human genome that are in duplicated or repetitive sequence, sex chromosomes, unplaced contigs, or any regions that have been reported in DGV as variable in copy number between individuals.

--depthMaximumInsertSizeRadius / -depthMaximumInsertSizeRadius ( Double with default value 10.0 )

Maximum insert size radius for read depth analysis.

This option specifies the threshold used to filter out aberrantly spaced read pairs during preprocessing. It is expressed in units of the estimated (robust) standard deviation. It is recommended to leave this at its default value.

--depthMinimumMappingQuality / -depthMinimumMappingQuality ( Integer with default value 10 )

Minimum mapping quality for read depth analysis.

This option specifies a filter on mapping quality applied during preprocessing. It is recommended to leave this at its default value.

--genderMaskBedFile / -genderMaskBedFile ( File )

Bed file defining the sex chromosome regions that should be used in gender estimation.

This argument supplies a genome mask that is used during estimation of sample gender. The default value for this argument is rmd/reference.gendermask.bed, a bed file specifying the set of intervals to use. This file is typically provided along with the other reference metadata in the reference metadata directory.

This file specifies a set of "clean" regions on the sex chromosomes that are used for gender estimation. SVPreprocess estimates sample gender from normalized read depth as part of preprocessing. A report file is produced with sample gender and sex chromosome "dosage", but this information is currently not used in downstream processing by default. The user must explicitly specify a file containing the gender of each sample, which can be based on the read depth gender estimation or on the reported gender of each sample.

--genomeMaskFile / -genomeMaskFile ( List[File] )

The genome mask file(s) for the reference genome.

This argument supplies a genome mask that is used to mask positions of the genome that should be ignored for analysis of read depth, typically because alignments of reads to these positions are not reliable. The default value for the genome mask is rmd/reference.svmask.fasta, an indexed fasta file with each position marked as 0 (unmasked) or 1 (masked).

In the current implementation, the default genome mask file is built based on a fixed k-mer length that should correspond roughly to the minimum read length in the input data set. The k-mer size used can usually be determined by inspecting the file names in the reference metadata directory you are using. If your data set contains especially short or long reads, you may want to override the default genome mask to use a mask with a different k-mer size. See ComputeGenomeMask for additional details.

In some specialized applications, such as the CNV discovery pipeline, an additional genome mask can be specified. When multiple masks are present, any position that is masked in at least one of them (the union of the masked positions) is excluded from read depth estimation.

The format of this file and the behavior of this argument may change in a future release.

--inputFile / -I ( required List[String] )

One or more bam input files or input file lists (with extension .list).

This argument can be repeated multiple times and can contain a mixture of file locations and .list files. Each .list file should be a plain text file containing a list of file locations, one per line. If a .list file is used, the extension must be .list.

The input locations can be file paths or they can be URLs, but preprocessing remote files is not recommended due to the amount of I/O that will be required.
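A .list file of the kind described above can be generated rather than written by hand. The following sketch collects every file ending in .bam under a directory and writes one location per line. The function name is hypothetical, and the assumption that all inputs live under a single local directory is only for illustration.

```shell
#!/usr/bin/env bash
# Write a .list file containing one BAM file location per line.
# Usage: make_bam_list <directory_with_bams> <output_file.list>
make_bam_list() {
    local bam_dir="$1" list_file="$2"
    # The input list must carry the .list extension to be recognized as a list.
    case "$list_file" in
        *.list) ;;
        *) echo "List file must end in .list: $list_file" >&2; return 1 ;;
    esac
    # Index files such as sample.bam.bai do not match '*.bam' and are skipped.
    find "$bam_dir" -type f -name '*.bam' | sort > "$list_file"
}
```

The paths written are prefixed with whatever directory was passed in, so giving an absolute directory yields absolute paths in the list. The result can then be supplied as `-I input_bam_files.list`.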

--inputFileIndexCache / -inputFileIndexCache ( String )

Location where the index files for any remote input BAM files are cached.

This argument specifies the location (usually a local file system path) where the index files that correspond to the input alignment files are cached (or should be cached). This parameter is used with remote input files accessed over web protocols like http or ftp.

--metaDataLocation / -md ( required File )

Location of metadata directory created during preprocessing.

This argument specifies the location of the Genome STRiP preprocessing output for the input bam files.

The preprocessing pipeline will create a collection of files and subdirectories underneath the metadata directory.

--parameter / -P ( List[String] )

Override individual parameters from the configuration property file.

This argument can be used to override advanced settings from the configuration property file on a case-by-case basis. The syntax is -P name:value where name is one of the parameters in the configuration file.
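To make the name:value syntax concrete, the sketch below assembles -P arguments from name and value pairs. The helper name is hypothetical, and the parameter name shown in the usage note is a placeholder, not a real Genome STRiP setting; substitute a name that actually appears in your configuration file.

```shell
#!/usr/bin/env bash
# Collect "-P name:value" pairs for the command line.
override_args=()

add_override() {
    local name="$1" value="$2"
    if [ -z "$name" ]; then
        echo "Parameter name must not be empty" >&2
        return 1
    fi
    # Each override becomes two array elements: the flag and name:value.
    override_args+=(-P "${name}:${value}")
}
```

For example, `add_override example.parameterName 42` (placeholder name) adds `-P example.parameterName:42`, and the array can be expanded as `"${override_args[@]}"` in the SVPreprocess command line.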

--ploidyMapFile / -ploidyMapFile ( File )

Map file defining the gender-specific ploidy of each region of the reference genome.

Although technically optional, the ploidy map file is required by some of the pipelines in Genome STRiP, including SVPreprocess.

The ploidy map file is generally present in the reference metadata directory along with the reference genome. The ploidy map file is used to indicate which parts of the reference genome have gender-dependent ploidy. This is used in conjunction with the gender of each sample to process sex chromosomes.

If you are using a reference genome that does not have a complete metadata directory (for example, a non-human reference), you will need to create your own ploidy map if you want to process sex chromosomes.

If no ploidy map file is supplied, Genome STRiP will in some cases default to behaving as if the entire input genome has ploidy 2 (i.e. is diploid) in all individuals. You can also force this behavior by creating a ploidy map with a single line containing "* * * * 2" to indicate that all chromosomes are diploid.
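The all-diploid ploidy map mentioned above is a one-line file, so it can be created directly from the shell. In this sketch the function name and the output file name are hypothetical; only the line content comes from this page.

```shell
#!/usr/bin/env bash
# Create a ploidy map that declares every chromosome diploid in all individuals.
# Usage: write_diploid_ploidy_map <output_file>
write_diploid_ploidy_map() {
    local map_file="$1"
    # Single quotes keep the shell from expanding the asterisks as globs.
    printf '%s\n' '* * * * 2' > "$map_file"
}
```

After `write_diploid_ploidy_map diploid.ploidymap.txt`, the file can be passed to the pipeline with `-ploidyMapFile diploid.ploidymap.txt`.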

--readDepthMaskFile / -readDepthMaskFile ( File )

Genome mask file specifying regions over which the sequencing depth should be estimated.

This argument supplies a genome mask that is used during estimation of sequencing coverage in each sequencing run. The default value for this argument is rmd/reference.rdmask.bed, a bed file specifying the set of intervals to use. This file is typically provided along with the other reference metadata in the reference metadata directory.

If you are using a reference genome that does not have a complete metadata directory (for example, a non-human reference), you will want to create your own mask file to use. For the human genome, the default mask includes chromosomes 1-22, avoiding the sex chromosomes and any unplaced reference sequences.

--reduceInsertSizeDistributions / -reduceInsertSizeDistributions ( Boolean with default value true )

True to perform lossy compression on the fragment length distributions from each library to increase scalability.

By default, the empirical insert size (fragment length) distributions from each library are compressed during preprocessing. If this option is disabled, the number of samples that can be processed together is limited.

--referenceFile / -R ( required File )

The reference genome sequence for the input bam files (indexed fasta format).

This must be the reference genome sequence that was used to align the input bam files.

In addition, this argument sets the default value for the reference metadata directory (see -rmd) which contains additional information about the reference genome that is required by Genome STRiP. Reference metadata directories are supplied with Genome STRiP for common human reference assemblies and generally you should set -R to refer to the reference fasta file in whichever of these directories is appropriate for your input data.

--referenceMetaDataLocation / -rmd ( File )

Path to the directory containing data files based on the reference genome (ploidy map, gc-bias file, etc.).

The -rmd location defaults to the directory of the reference genome, as specified by -R, and normally does not need to be supplied.

This argument can be used when you need to replace the reference genome with another file but want to continue to use the other auxiliary files in the reference metadata directory. In this case, the replacement reference genome must be very nearly identical to the original reference genome.

If you cannot use one of the standard sets of reference metadata supplied with Genome STRiP, for example because you are processing non-human data or using a different human reference genome, then you will need to generate the equivalent data for your reference genome.