SVGenotyper documentation

SVGenotyper

The SVGenotyper Queue script implements a generic interface to genotype a defined set of structural variants with Genome STRiP.

This pipeline can be used to genotype variants discovered using Genome STRiP (e.g. SVDiscovery or CNVDiscoveryPipeline) or known polymorphic sites from other sources (e.g. the 1000 Genomes Project). It can also be used to genotype variants discovered in one cohort in additional cohorts.

The input to SVGenotyper is a site VCF - a VCF format file containing in the first eight columns providing information about a potentially variant genomic site. Any additional information columns in the input VCF file (supplying genotypes for specific samples) are ignored.

The output from SVGenotyper is a genotype VCF - a VCF format file containing the input site descriptions, additional site-specific information and a called genotype for each input sample. The per-sample genotype information may contain different fields, as described below, but genotypes are generally reported in terms of likelihoods (the likelihood of each potential genotype for each sample). Genotypes with a 95% posterior probability are reported as "confident" genotype calls.

For imprecise variants (those with symbolic alleles), the genotyping method applied to the input sites is driven in part by the SVTYPE INFO tag on each input variant, which selects the genotyping model to apply. If the SVTYPE tag specifies a deletion ("DEL"), then genotyping restricts the set of allowable alleles to just two: The reference allele (assumed to correspond to 1 copy) and an alternative deletion allele in which the reference sequence is missing. If the SVTYPE tag specifies a general copy number variant ("CNV"), then a different model is applied in which potentially many different copy number alleles are considered. The output genotypes may classify the site as a pure deletion (only two alleles, reference and copy-number zero), a pure duplication (two alleles, reference and copy-number 2) or a multi-allelic CNV with many different copy-number states.

This script supports parallel processing of the input VCF file across a compute cluster. The desired parallelism can be specified in one of two different ways: The -parallelRecords argument specified how many input records should be processed in each job, while the -parallelJobs argument specifies how many parallel jobs should be run (dividing the input sites between them). If neither argument is specified, then parallel execution is disabled and all input sites will be genotyped serially in a single job.

The -runDirectory argument specifies a directory where intermediate files are generated. These files track the status of the parallel jobs and also contain in some cases auxilliary output that can be utilitized by certain other Genome STRiP applications. For example, the PlotGenotypingResults utility and a few of the annotators utilize information from these auxilliary files. These intermediate files can also be deleted to conserve space.

Example

The SVGenotyper pipeline is a script that is run using the Queue workflow engine. See the QCommandLine documentation for additional information.

 classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"
 java -Xmx4g -cp ${classpath} \
     org.broadinstitute.gatk.queue.QCommandLine \
     -S ${SV_DIR}/qscript/SVGenotyper.q \
     -S ${SV_DIR}/qscript/SVQScript.q \
     -cp ${classpath} \
     -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
     -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
     -R path_to_rmd_dir/reference_genome.fasta \
     -I input_bam_files.list \
     -genderMapFile gender_map_file.txt \
     -md input_metadata_directory \
     -runDirectory run1 \
     -jobLogDir run1/logs \
     -vcf input_file.sites.vcf \
     -O run1/output_file.genotypes.vcf \
     -parallelRecords 100 \
     -run

SVGenotyper specific arguments

Name Type Default value Summary
Required Inputs
-I List[String] NA One or more bam input files or input file lists (with extension .list)
-vcf File NA Input site VCF file containing the sites to genotype
Required Outputs
-O File NA The output genotype VCF file
Required Parameters
-cp String NA The java classpath
-gatk File NA The path to the GenomeAnalysisTK.jar file
-md List[String] NA Location of metadata directory created during preprocessing
-R File NA The reference genome sequence for the input bam files (indexed fasta format)
-runDirectory File NA Directory for running the genotyping pipeline
Optional Parameters
-genderMapFile List[File] NA File(s) containing the declared gender for each sample
-inputFileIndexCache String NA Location where the index files for any remote input BAM files are cached
-jobLogDir File NA Directory for output log files from Queue
-parallelJobs Integer NA Number of parallel jobs to run
-parallelRecords Integer NA Number of records to process in each parallel job
-P List[String] NA Override individual parameters from the configuration property file
Advanced Parameters
-configFile File NA Configuration property file with default settings
-excludeReadGroup List[String] NA Read group(s) to exclude or .list files of read groups to exclude
-genomeMaskFile List[File] NA The genome mask file(s) for the reference genome
-partition List[String] NA Specific partition(s) to rerun
-ploidyMapFile File NA Map file defining the gender-specific ploidy of each region of the reference genome
-rmd File NA Path to the directory containing data files based on the reference genome (ploidy map, gc-bias file, etc.)
-sample List[String] NA Sample(s) to process or .list files of samples

Argument details

--classpath / -cp ( required String )

The java classpath.

--configFile / -configFile ( File )

Configuration property file with default settings.

Specifies the path to the default Genome STRiP configuration file.

The configuration file contains internal algorithm parameters and advanced settings, most of which are not documented and should rarely if ever be changed from their default values. If it is necessary to override an advance config file settings, the preferred method is to use the -P argument, not to edit or make copies the default configuration file.

--excludeReadGroup / -excludeReadGroup ( List[String] )

Read group(s) to exclude or .list files of read groups to exclude.

This option allows you to exclude specific read groups from processing (as defined by the @RG identifiers in the input files).

This can be used to "downsample" the input data on a per-readgroup basis. This can be useful if one of the read groups is found to have a high sequencing error rate or other quality issues.

To use this option, the read group identifiers in the input files must be unique.

--gatkJarFile / -gatk ( required File )

The path to the GenomeAnalysisTK.jar file.

--genderMapFile / -genderMapFile ( List[File] )

File(s) containing the declared gender for each sample.

The gender map file should be a tab-delimited file with two columns: sample identifier (as specified in the input alignment files) and sample gender (either Male/Female or M/F). Multiple files may be supplied (but should contain disjoint sets of samples). If you supply gender, it must be supplied for all samples in the input files.

A gender map (and a ploidy map file) is required for calling variants on the sex chromosomes.

SVPreprocess now estimates the gender of each sample based on read depth and produces a gender map file suitable for discovery and genotyping. Currently, we do not use these estimated gender calls by default. You can choose to either use the estimated gender, or a file based on the declared gender of each sample or reconcile the two prior to running discovery or genotyping.

--genomeMaskFile / -genomeMaskFile ( List[File] )

The genome mask file(s) for the reference genome.

This argument supplies a genome mask that is used to mask positions of the genome that should be ignored for analysis of read depth, typically because alignments of reads to these positions are not reliable. The default value for the genome mask is rmd/reference.svmask.fasta, an indexed fasta file with each position marked as 0 (unmasked) or 1 (masked).

In the current implementation, the default genome mask file is built based on a fixed k-mer length that should correspond roughly to the minimum read length in the input data set. The k-mer size used can usually be determined by inspecting the file names in the reference metadata directory you are using. If you data set contains especially short or long reads, you may want to override the default genome mask to use a mask with a different k-mer size. See ComputeGenomeMask for additional details.

In some specialized applications, such as the CNV discovery pipeline, an additional genome mask can be specified. When multiple masks are present, the union of the masked positions will not be used for read depth estimation.

The format of this file and the behavior of this argument may change in a future release.

--inputFile / -I ( required List[String] )

One or more bam input files or input file lists (with extension .list).

This argument can be repeated multiple times can contain a mixture of file locations and .list files. Each .list file should be a plain text file containing a list of file locations, one per line. If a .list file is used, the extension must be .list.

The input locations can be file paths or they can be URLs that use supported protocols. The supported protocols for input files include http, ftp and the Amazon s3 protocol.

--inputFileIndexCache / -inputFileIndexCache ( String )

Location where the index files for any remote input BAM files are cached.

This argument specifies the location (usually a local file system path) where the index files that correspond to the input alignment files are cached (or should be cached). This parameter is used with remote input files accessed over web protocols like http or ftp.

For some public data sets, pre-computed Genome STRiP metadata is available and this metadata may include the bam index files for the dataset. If the precomputed metadata has been downloaded to the local file system, this parameter is used to indicate where the bam indexes for the remote bam files can be found locally.

--jobLogDirectory / -jobLogDir ( File )

Directory for output log files from Queue.

This directory is used to store log files from the parallel jobs run by Queue during execution of the discovery pipeline. The log files contain information that can be helpful for debugging or performance tuning.

If not supplied, the default is to use the current working directory. A typical strategy is to make a log directory underneath the run directory for each SVDiscovery run and direct the log files there.

--metaDataLocation / -md ( required List[String] )

Location of metadata directory created during preprocessing.

This argument specifies the location of the Genome STRiP preprocessing output for the input bam files.

If preprocessing was run in batches, multiple -md arguments may be supplied. This allows joint calling to be run across multiple batches of input data sets. In practice, the number of -md arguments should be kept fairly small, as the metadata directories will be merged on the fly.

Joint calling can only be performed across compatible sets of input data (e.g. data aligned to the same reference sequence).

--outputFile / -O ( required File )

The output genotype VCF file.

The output file is a genotype VCF containing all of the input sites plus genotypes for each input sample.

--parallelJobs / -parallelJobs ( Integer )

Number of parallel jobs to run.

This Queue script supports two alternative methods for parallelizing the processing of the input VCF file: The -parallelJobs N argument specifies that the VCF file should be divided into N approximately equal subsets and that each of these subsets should be processed in parallel. See also -parallelRecords.

If neither -parallelJobs or -parallelRecords is specified, then the input VCF file is submitted as a single job. If both are specified, -parallelRecords overrides -parallelJobs.

--parallelRecords / -parallelRecords ( Integer )

Number of records to process in each parallel job.

This Queue script supports two alternative methods for parallelizing the processing of the input VCF file: The -parallelRecords N argument specifies that the VCF file should be divided into M jobs where each job (except the last) processes N records from the input VCF file. See also -parallelJobs.

If neither -parallelJobs or -parallelRecords is specified, then the input VCF file is submitted as a single job. If both are specified, -parallelRecords overrides -parallelJobs.

--parameter / -P ( List[String] )

Override individual parameters from the configuration property file.

This argument can be used to override advanced settings from the configuration property file on a case-by-case basis. The syntax is -P name:value where name is one of the parameters in the configuration file.

--partition / -partition ( List[String] )

Specific partition(s) to rerun.

The genotyping pipeline runs many parallel jobs that process different sections of the input file. Each section is called a partition and is identified by an ID of the for Pnnnn (e.g. P1000). This argument allows you to force the pipeline to run a single partition or set of partitions and is typically used for debugging.

if you change the parameters defining the parallelism strategy, then the number of partitions and the partition identifiers will change.

--ploidyMapFile / -ploidyMapFile ( File )

Map file defining the gender-specific ploidy of each region of the reference genome.

Although technically optional, the ploidy map file is required by some of the pipelines in Genome STRiP, including SVPreprocess.

The ploidy map file is generally present in the reference metadata directory along with the reference genome. The ploidy map file is used to indicate which parts of the reference genome have gender-dependent ploidy. This is used in conjunction with the gender of each sample to process sex chromosomes.

If you are using a reference genome that does not have a complete metadata directory (for example, a non-human reference), you will need to create your own ploidy map if you want to process sex chromosomes.

If no ploidy map file is supplied, some of the time Genome STRiP will default to behave as if the entire input genome has ploidy 2 (i.e. is diploid) in all individuals. You can also force this behavior by creating a ploidy map with a single line containing "* * * * 2" to indicate that all chromosomes are diploid.

--referenceFile / -R ( required File )

The reference genome sequence for the input bam files (indexed fasta format).

This must be the reference genome sequence that was used to align the input bam files.

In addition, this argument sets the default value for the reference metadata directory (see -rmd) which contains additional information about the reference genome that is required by Genome STRiP. Reference metadata directories are supplied with Genome STRiP for common human reference assemblies and generally you should set -R to refer to one of these directories that is appropriate for your input data.

--referenceMetaDataLocation / -rmd ( File )

Path to the directory containing data files based on the reference genome (ploidy map, gc-bias file, etc.).

The -rmd location defaults to the directory of the reference genome, as specified by -R, and normally does not need to be supplied.

This argument can be used when you need to replace the reference genome with another file but want to continue to use the other auxilliary files in the reference metadata directory. In this case, the replacement reference genome must be very nearly identical to the original reference genome.

If you cannot use one of the standard sets of reference metadata supplied with Genome STRiP, for example because you are processing non-human data or using a different human reference genome, then you will need to generate the equivalent data for your reference genome.

--runDirectory / -runDirectory ( required File )

Directory for running the genotyping pipeline.

The SVGenotyper pipeline uses the Queue pipeline manager to run many processes in parallel to genotype the input sites. The run directory contains a number of intermediate files and files that track the status of these parallel jobs.

The run directory must be exclusive to a single SVGenotyper run on a single input data set. Attempting to run multiple SVGenotyper jobs (or other Genome STRiP pipelines) in the same run directory, either serially or in parallel, will lead to conflicts over the intermediate files and status files in the run directory.

--sample / -sample ( List[String] )

Sample(s) to process or .list files of samples.

By default, genotypes are generated for all of the samples (individuals) in the input files (specified by the -I argument). These samples must also have available metadata (specified by the -md argument), although the -md arguments can specify metadata directories containing information on additional samples. This argument allows you to specify that only the indicated list of samples should be used for genotyping. This must be a subset of the samples listed in the input alignment files.

Genome STRiP genotyping is population-based and gains power by genotyping many samples together. An alternative to using -sample is to genotype based on all of the samples and then subset the output VCF by dropping columns pertaining to specific samples.

--vcfFile / -vcf ( required File )

Input site VCF file containing the sites to genotype.

This argument specifies the set of structual variant sites that you want to genotype. For each site, a genotype for each input sample will be generated for that site. The input file is typically just a site VCF (i.e. only 8 columns). Any per-sample columns in the input file are ignored. Each input site is genotyped independently.

This queue script supports using a scatter/gather pattern to allow sites to be genotyped in parallel across a compute farm.