SVDiscovery walker
GenomeSTRiP Documentation | Created 2012-09-12 | Last updated 2012-09-21

1. Introduction

The SVDiscovery walker traverses a set of BAM files to perform structural variation discovery. This walker is the main component of the SVDiscovery pipeline.

Currently, only discovery of deletions relative to the reference is implemented.

2. Inputs / Arguments

  • -I <bam-file> : The set of input BAM files.

  • -runDirectory <directory> : The directory where auxiliary output files will be written (default is the current directory).

  • -md <directory> : The metadata directory containing metadata about the input data set. See SVPreprocess.

  • -R <fasta-file> : Reference sequence. : An indexed fasta file containing the reference sequence that the input BAM files were aligned against. The fasta file must be indexed with 'samtools faidx' or the equivalent.

  • -genomeMaskFile <mask-file> : Mask file that describes the alignability of the reference sequence. : See Genome Mask Files.

  • -configFile <configuration-file> : This file contains specialized settings that do not normally need to be changed. : A default configuration file is provided in conf/genstrip_parameters.txt.

  • -partitionName <string> : This specifies the name of the partition being computed during parallel runs. : The output files will be prefixed with the name of the partition.

  • -searchLocus <interval> : The genomic locus being searched. : Only structural variations that fit within the specified locus will be output. If non-overlapping search loci are used, then the union of the discovered variants should be non-redundant.

  • -searchWindow <interval> : The interval to be used for searching the input BAM files. : This is typically larger than the search locus to avoid missing events due to boundary effects. : This argument should typically be set to the same value as the GATK -L argument.

  • -searchMinimumSize <size> : The minimum length of a deletion event for it to be included in the output.

  • -searchMaximumSize <size> : The maximum length of a deletion event for it to be included in the output.
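The relationship between -partitionName, -searchLocus and -searchWindow in a parallel run can be sketched as follows. This is only an illustration: the function name make_partitions, the partition naming scheme (P0001, P0002, ...), the locus size and the padding value are all assumptions made for the example, not conventions defined by Genome STRiP. The sketch splits one chromosome into non-overlapping search loci and pads each locus to form a larger search window.

```shell
#!/usr/bin/env bash
# Sketch only: partition names, locus size and padding are illustrative
# assumptions, not Genome STRiP defaults.

# Print one tab-separated row per partition: name, search locus, search window.
make_partitions() {
    local chrom=$1 chrom_len=$2 locus_size=$3 padding=$4
    local start=1 index=1 end win_start win_end
    while [ "$start" -le "$chrom_len" ]; do
        # Non-overlapping locus, clamped to the end of the chromosome.
        end=$(( start + locus_size - 1 ))
        if [ "$end" -gt "$chrom_len" ]; then end=$chrom_len; fi
        # The search window pads the locus on both sides to limit boundary
        # effects, and is also clamped to the chromosome.
        win_start=$(( start - padding ))
        if [ "$win_start" -lt 1 ]; then win_start=1; fi
        win_end=$(( end + padding ))
        if [ "$win_end" -gt "$chrom_len" ]; then win_end=$chrom_len; fi
        printf 'P%04d\t%s:%d-%d\t%s:%d-%d\n' "$index" \
            "$chrom" "$start" "$end" "$chrom" "$win_start" "$win_end"
        start=$(( end + 1 ))
        index=$(( index + 1 ))
    done
}

# Example: a 2.5 Mb region in 1 Mb loci with 100 kb of padding.
make_partitions chr20 2500000 1000000 100000
```

Each row would supply the values for -partitionName, -searchLocus and -searchWindow of one job, with the window also passed as the GATK -L argument. Because the loci do not overlap, the per-partition outputs should not contain duplicate sites.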

3. Outputs

  • -O <vcf-file> : The main output is a VCF file containing descriptions of the variant sites along with annotations about the evidence for the variability of the site. : The output VCF file will need to be filtered, based on the annotations, to select a final set of high specificity variants.

Depending on settings in the configuration file, this walker will also produce a number of auxiliary output files. These files are mostly useful for debugging. The content and format of these files are subject to change.
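Before filtering the output VCF, it can help to see which annotations it actually declares. The sketch below relies only on the standard VCF header layout (##INFO=<ID=...> lines and #-prefixed header lines). The helper names list_info_ids and count_sites are hypothetical, and no particular Genome STRiP annotation names are assumed.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; the parsing relies on the
# standard VCF header format, not on anything specific to Genome STRiP.

# Print the ID of every INFO annotation declared in the VCF header.
list_info_ids() {
    grep '^##INFO=<ID=' "$1" | sed -e 's/^##INFO=<ID=//' -e 's/[,>].*$//'
}

# Print the number of variant site records (non-header lines).
count_sites() {
    grep -vc '^#' "$1"
}

# Usage (file name taken from the example command in this document):
#   list_info_ids output.sites.vcf
#   count_sites output.sites.vcf
```

The listed IDs are the candidates for filter criteria. Which thresholds to apply depends on the data set and is not covered here.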

4. Running

Currently, this walker needs to be invoked through a special wrapper around the GATK command line interface. This wrapper accepts all of the standard GATK command line options. An example is shown below.

java -Xmx4g -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    -T SVDiscovery \
    -configFile conf/genstrip_parameters.txt \
    -md metadata \
    -R Homo_sapiens_assembly18.fasta \
    -genomeMaskFile Homo_sapiens_assembly18.mask.36.fasta \
    -I input1.bam -I input2.bam \
    -O output.sites.vcf \
    -runDirectory run1 \
    -searchMinimumSize 100 \
    -searchMaximumSize 1000000 \
    -searchLocus chr20:1-1000000 \
    -L chr20:1-1000000 \
    -searchWindow chr20:1-1000000

5. Dependencies

The SV Discovery code uses some R scripts. R needs to be installed and the Rscript executable needs to be on your path to run this walker.
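A quick pre-flight check can catch a missing dependency before a long run starts. The sketch below tests that Rscript is on the PATH and that the reference FASTA has the .fai index file that 'samtools faidx' writes next to it. The function names (have_command, have_fasta_index, preflight) are hypothetical helpers for this example and are not part of the SVToolkit.

```shell
#!/usr/bin/env bash
# Sketch only: these helper functions are hypothetical and not shipped with
# Genome STRiP.

# Succeed if the named executable can be found on PATH.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

# Succeed if a non-empty samtools-style index (<fasta>.fai) exists.
have_fasta_index() {
    [ -s "$1.fai" ]
}

# Report every missing prerequisite; return non-zero if any check fails.
preflight() {
    local reference=$1 status=0
    if ! have_command Rscript; then
        echo "Rscript not found on PATH" >&2
        status=1
    fi
    if ! have_fasta_index "$reference"; then
        echo "missing index: $reference.fai (try: samtools faidx $reference)" >&2
        status=1
    fi
    return "$status"
}

# Usage (reference name taken from the example command in this document):
#   preflight Homo_sapiens_assembly18.fasta && echo "ready to run"
```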
