Genome STRiP Overview
Genome STRiP (Genome STRucture In Populations) is a suite of tools for discovery and genotyping of structural variation using whole-genome sequencing data. The methods used in Genome STRiP are designed to find shared variation using data from multiple individuals. Genome STRiP looks both across and within a set of sequenced genomes to detect variation.
Genome STRiP requires genomes from multiple individuals in order to detect or genotype variants. Typically an absolute minimum of 20 to 30 genomes is required. Analyzing more individuals together improves accuracy. It is possible to use publicly available reference data (e.g. sequence data from the 1000 Genomes Project) as a background population to call events in single genomes, but this strategy has been neither widely tried nor thoroughly evaluated, and batch effects are possible. If you are calling with deeply sequenced genomes, it will be beneficial to have a background population that is also deeply sequenced.
Genome STRiP uses the GATK and the Queue workflow manager originally developed for use with GATK. Included with Genome STRiP are pre-defined pipelines that use the Queue workflow manager to run analyses.
The original version of Genome STRiP (Genome STRiP 1.x) was focused on the discovery and genotyping of deletions (relative to a reference genome sequence). The latest version (Genome STRiP 2.0) includes additional pipelines and tools for duplications and multi-allelic copy number variants (mCNVs).
Genome STRiP is under active development and improvement. To report bugs, please use the support channels on our web site http://www.broadinstitute.org/software/genomestrip. Before posting a bug report, please review the latest FAQ.
Module Structure
Genome STRiP consists of a number of modules, related as shown below.
The SVPreprocess pipeline is a pre-requisite for any of the other pipelines in Genome STRiP. The original Genome STRiP 1.x pipeline for large deletions consisted of the SVPreprocess, SVDiscovery and SVGenotyper pipelines. To discover and genotype deletions, you run all three of these steps, in order. You can also genotype a set of known variant sites in a new cohort by using SVGenotyper directly (after running SVPreprocess on the new cohort).
Genome STRiP 2.0 adds the new CNVDiscovery pipeline. This pipeline will detect deletions, duplications and multi-allelic CNVs (mCNVs). The new CNV pipeline is complementary to the deletion discovery pipeline. Both pipelines will detect deletions, but with certain tradeoffs in the deletion sites that will be found. The deletion discovery pipeline seeds on read pairs and then uses read depth as auxiliary evidence. The CNV pipeline seeds on read depth. The CNV pipeline can discover deletions in more repetitive regions of the genome, so it will discover deletions missed by the deletion pipeline. The deletion pipeline is able to discover shorter deletions, but only at sites where the deletion breakpoints are in sufficiently unique sequence.
The CNV pipeline uses genotyping extensively during the discovery process and produces a genotype VCF directly, whereas the deletion pipeline produces a sites-only VCF which can then be used as input to genotyping.
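The ordering constraints described above can be summarized in a small lookup table. The following is only an illustrative Python sketch, not part of Genome STRiP: the goal labels (discover_deletions, genotype_known_sites, discover_cnvs) and the function plan_pipelines are hypothetical names chosen for this example, and the pipeline names follow the list of Queue scripts given in this document.

```python
# Hypothetical helper: maps an analysis goal to the Genome STRiP pipelines
# to run, in order. The goal labels are invented for this sketch; the
# pipeline names are the Queue scripts named in this document.
PIPELINE_ORDER = {
    # Deletion discovery and genotyping: all three steps, in order.
    "discover_deletions": ["SVPreprocess", "SVDiscovery", "SVGenotyper"],
    # Genotyping known sites in a new cohort: preprocess, then genotype.
    "genotype_known_sites": ["SVPreprocess", "SVGenotyper"],
    # CNV discovery produces a genotype VCF directly, so no separate
    # genotyping step is listed here.
    "discover_cnvs": ["SVPreprocess", "CNVDiscoveryPipeline"],
}


def plan_pipelines(goal):
    """Return the ordered list of pipelines for the given goal label."""
    if goal not in PIPELINE_ORDER:
        known = ", ".join(sorted(PIPELINE_ORDER))
        raise ValueError(f"unknown goal {goal!r}; expected one of: {known}")
    # Return a copy so callers cannot modify the table by accident.
    return list(PIPELINE_ORDER[goal])


if __name__ == "__main__":
    for goal in PIPELINE_ORDER:
        print(goal, "->", " then ".join(plan_pipelines(goal)))
```

In every case SVPreprocess comes first, which reflects its role as a pre-requisite for all of the other pipelines.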
Inputs and Outputs
Genome STRiP requires aligned sequence data in BAM format. Support for CRAM (an alternative alignment format) is under development.
The primary outputs from Genome STRiP are polymorphic sites of structural variation and/or genotypes for these sites, both of which are represented in VCF format.
Genome STRiP also requires a FASTA file containing the reference genome sequence used to align the input reads as well as additional files that are based on the reference genome sequence. As of release 2.0, it is recommended to download and use one of the pre-packaged reference metadata bundles. These bundles contain all of the associated files needed by Genome STRiP. If your reference genome is not supported, it is possible to create your own reference metadata bundle, but this is slightly more advanced.
Downloading and Installation
Current and previous binary releases are available from our website http://www.broadinstitute.org/software/genomestrip.
To install, download the tarball and decompress into a suitable directory. You will need to install pre-requisite software as described below. There is a 15-minute installation/verification test in the installtest subdirectory.
The install test scripts also serve as example pipelines for running Genome STRiP.
Environment Variables
Currently, Genome STRiP requires you to set the SV_DIR environment variable to the installation directory. See the installtest scripts for details.
Software Dependencies
The following pre-requisite software dependencies need to be downloaded and installed separately. Other dependencies required by Genome STRiP are bundled with each release.
- Java
- R
- Samtools and Tabix
- Cluster management software (LSF, SGE)
Genome STRiP is written mostly in Java and is packaged as a jar file (SVToolkit.jar). You will need Java 1.7.
Genome STRiP uses some R scripts internally. To run Genome STRiP, R must be installed and the Rscript executable must be on your path. Genome STRiP will run with R 3.0 or newer. It may also run with older versions (2.x), but we do not test with R 2.x.
Some of the Genome STRiP pipelines use 'samtools index' to index BAM files and 'tabix index' to index other binary files. These tools are both part of the HTSlib software. You will need to install these tools separately. See http://www.htslib.org.
To run the pipelines in Genome STRiP efficiently, you generally need access to a compute cluster and some software for managing compute jobs, such as Platform Computing's Load Sharing Facility (LSF) or Sun Grid Engine (SGE). The Queue workflow manager supports both of these options.
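Before running the install test, it can be useful to confirm that the settings described above are in place. The following Python sketch is an assumption-laden illustration rather than a tool shipped with Genome STRiP: the function check_environment and the list REQUIRED_TOOLS are hypothetical names, and "java" is assumed to be the name of the Java launcher on your system. It checks only that SV_DIR is set and that the executables are findable; it does not verify versions or the presence of cluster management software.

```python
import os
import shutil

# Executables named in the text above. "java" as the launcher name is an
# assumption; adjust this list for your own system.
REQUIRED_TOOLS = ["java", "Rscript", "samtools", "tabix"]


def check_environment(env=None, which=shutil.which):
    """Return a list of problems found; an empty list means all checks passed.

    env   -- mapping of environment variables (defaults to os.environ)
    which -- callable that resolves a tool name to a path or None
             (defaults to shutil.which, which searches the current PATH)
    """
    env = os.environ if env is None else env
    problems = []

    # SV_DIR must point at the Genome STRiP installation directory.
    sv_dir = env.get("SV_DIR")
    if not sv_dir:
        problems.append("SV_DIR is not set")
    elif not os.path.isdir(sv_dir):
        problems.append(f"SV_DIR is not a directory: {sv_dir}")

    # Each separately installed tool must be discoverable.
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")

    return problems


if __name__ == "__main__":
    found = check_environment()
    for problem in found:
        print("WARNING:", problem)
    if not found:
        print("Basic environment checks passed.")
```

The which parameter exists so that the check can be exercised without the real tools installed; in normal use the default is sufficient.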
Bundled Dependencies
- GATK
- Picard
- BWA
Genome STRiP is integrated with the Genome Analysis Toolkit (GATK) and requires GenomeAnalysisTK.jar in order to run. Genome STRiP only bundles and relies on the freely available version of GATK that has no restrictions on use. The pipelines that automate running Genome STRiP are written in Queue (a workflow manager originally developed by the developers of GATK) and these pipelines require Queue.jar to run. A compatible version of GATK (and Queue) is included with each Genome STRiP release. We can't guarantee compatibility with any other version of the GATK software.
The Genome STRiP pipelines also use some Picard utilities. These are included with each Genome STRiP release.
Internally, Genome STRiP uses the bwa aligner in some pipelines. A compatible version of BWA (which is actually quite old) is included with each Genome STRiP release in the bwa subdirectory, both as a bwa executable and a shared library libbwa.o. The executable will need to be on your path, and the shared library on LD_LIBRARY_PATH, when running some of the Genome STRiP pipelines. These older versions of bwa are not compatible with newer versions of bwa.
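One way to make the bundled bwa visible to a pipeline run is to prepend the bwa subdirectory to both variables in the environment passed to the child process. The sketch below is a minimal illustration under stated assumptions: the function name with_bundled_bwa is hypothetical, and it assumes the bwa subdirectory sits directly under the installation directory (the value of SV_DIR).

```python
import os


def with_bundled_bwa(sv_dir, env):
    """Return a copy of env with <sv_dir>/bwa prepended to PATH and
    LD_LIBRARY_PATH. The input mapping is left unchanged."""
    # Assumption: the bundled bwa lives in a "bwa" directory directly
    # under the installation directory.
    bwa_dir = os.path.join(sv_dir, "bwa")
    new_env = dict(env)
    for var in ("PATH", "LD_LIBRARY_PATH"):
        existing = new_env.get(var, "")
        # Prepend so the bundled version is found before any newer,
        # incompatible bwa installed elsewhere on the system.
        new_env[var] = bwa_dir + os.pathsep + existing if existing else bwa_dir
    return new_env


if __name__ == "__main__":
    sv_dir = os.environ.get("SV_DIR")
    if sv_dir:
        run_env = with_bundled_bwa(sv_dir, os.environ)
        print("PATH =", run_env["PATH"])
        print("LD_LIBRARY_PATH =", run_env["LD_LIBRARY_PATH"])
    else:
        print("SV_DIR is not set")
```

The returned mapping can be passed as the env argument of subprocess.run when launching a pipeline. Prepending rather than appending matters here because a newer bwa elsewhere on the path would otherwise take precedence.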
Running Genome STRiP
Before attempting to run Genome STRiP on your own data, please run the short installation test in the installtest subdirectory. This will ensure that your environment is set up properly. The test scripts run some Genome STRiP analyses on a small test data set. The test scripts also offer an example of how to organize your run directory structure and some sample end-to-end pipelines.
A number of pre-defined Queue pipelines are provided to run the different phases of analysis in Genome STRiP. Queue is a flexible Scala-based system for writing processing pipelines that can be distributed on compute farms. Each Genome STRiP pipeline is defined as a Queue script:
- SVPreprocess: Preprocesses a set of input BAM files to generate data set metadata used by other Genome STRiP modules. This is a pre-requisite for all other Genome STRiP pipelines.
- SVDiscovery: Runs the original Genome STRiP deletion discovery algorithm on a set of input BAM files, producing a VCF file of potentially variant sites.
- SVGenotyper: Genotypes a set of polymorphic structural variation loci described in an input VCF file.
- CNVDiscoveryPipeline: Runs the Genome STRiP 2.0 pipeline for discovery and genotyping of CNVs (including deletions, duplications and mCNVs), seeding on read depth.