LCNV Discovery Pipeline Overview
The LCNVDiscoveryPipeline Queue script implements a pipeline for detecting large (100kb+) copy number variants, including somatic/mosaic CNVs, from read depth of coverage in whole-genome sequencing data.
The goal of the LCNV pipeline is to detect chromosomal aneuploidies and "array-sized" copy number variants with high specificity, even in difficult parts of the genome. The pipeline normalizes many samples together, but then calls variants in each sample individually. Each variant call is assigned a score, and high-scoring calls should have high specificity.
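The idea of normalizing jointly and then calling per sample can be illustrated with a minimal sketch. This is not the Genome STRiP implementation: the function names, the median-based normalization, and the 0.75/1.25 thresholds are all assumptions chosen only to make the two stages concrete.

```python
from statistics import median

def normalize_profiles(depth):
    """depth maps sample -> list of per-bin read counts (same bins for all samples).

    Hypothetical two-stage normalization, assuming nonzero medians:
    scale each sample by its own median depth, then divide each bin
    by the cross-sample median for that bin.
    """
    scaled = {}
    for sample, counts in depth.items():
        sample_median = median(counts)
        scaled[sample] = [c / sample_median for c in counts]
    n_bins = len(next(iter(scaled.values())))
    bin_medians = [median(v[i] for v in scaled.values()) for i in range(n_bins)]
    return {s: [x / m for x, m in zip(v, bin_medians)] for s, v in scaled.items()}

def call_sample(ratios, low=0.75, high=1.25):
    """Return (first_bin, last_bin, state) runs for ONE sample (illustrative thresholds)."""
    calls, start, state = [], None, None
    for i, r in enumerate(list(ratios) + [1.0]):  # sentinel closes a trailing run
        s = "gain" if r > high else "loss" if r < low else None
        if s != state:
            if state is not None:
                calls.append((start, i - 1, state))
            start, state = i, s
    return calls

depth = {
    "s1": [100, 100, 100, 100, 100, 100],
    "s2": [200, 200, 200, 200, 200, 200],  # twice the coverage, no CNV
    "s3": [100, 100, 150, 150, 100, 100],  # elevated depth in bins 2-3
}
normalized = normalize_profiles(depth)
calls = {s: call_sample(r) for s, r in normalized.items()}
# calls == {'s1': [], 's2': [], 's3': [(2, 3, 'gain')]}
```

Joint normalization removes the coverage difference between s1 and s2, while the per-sample scan still reports the event that is private to s3.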
The pipeline emits both high-scoring calls and less confident calls. The output files should be post-filtered according to the desired trade-off between sensitivity and specificity. See the detailed pipeline documentation for recommendations.
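Post-filtering by score might look like the sketch below. The column names (SAMPLE, CHR, START, END, SCORE), the example rows, and the threshold of 100 are hypothetical placeholders, not the pipeline's actual output format or a recommended cutoff; consult the detailed documentation for both.

```python
import csv
import io

def filter_calls(handle, min_score, score_column="SCORE"):
    """Keep rows of a tab-delimited call file whose score is at least min_score.

    score_column is an assumed header name; adjust it to the real file.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row[score_column]) >= min_score]

# Made-up example input standing in for a real output file.
example = io.StringIO(
    "SAMPLE\tCHR\tSTART\tEND\tSCORE\n"
    "s1\t1\t1000000\t2500000\t850.0\n"
    "s2\t7\t500000\t700000\t12.5\n"
)
strict = filter_calls(example, min_score=100.0)
```

Raising min_score trades sensitivity for specificity; lowering it does the reverse.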
As with all Genome STRiP pipelines, all samples must first go through SVPreprocess. After this, LCNV calling consists of two steps implemented by two separate Queue pipelines. First, GenerateDepthProfiles is run to generate a set of read depth profiles for a group of samples to be analyzed together. During this step, a bin size is chosen (typically 10,000 bp), which sets the minimum granularity for boundary resolution of the called CNV segments. Then, the LCNVDiscoveryPipeline Queue script takes these read depth profiles as input and generates a tab-delimited file listing the CNV calls in each sample. It is also possible to use the 100kb profiles generated during SVPreprocess instead of running GenerateDepthProfiles, but in this case the output will generally have high specificity only for events larger than 1Mb.
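To see why the bin size limits boundary resolution, consider a simplified sketch in which a call is reported as the span of whole bins that overlap the true event. The helper names and the 1-based inclusive coordinate convention are assumptions made for illustration, not details of the pipeline.

```python
def position_to_bin(pos, bin_size):
    """Index of the fixed-width bin containing a 1-based position."""
    return (pos - 1) // bin_size

def bins_to_interval(first_bin, last_bin, bin_size):
    """1-based inclusive coordinates covered by a run of bins."""
    return first_bin * bin_size + 1, (last_bin + 1) * bin_size

def reported_interval(start, end, bin_size):
    """Interval seen at bin resolution for a true event spanning start..end."""
    return bins_to_interval(
        position_to_bin(start, bin_size),
        position_to_bin(end, bin_size),
        bin_size,
    )

# A made-up event of roughly 222 kb.
fine = reported_interval(1_234_567, 1_456_789, 10_000)     # (1230001, 1460000)
coarse = reported_interval(1_234_567, 1_456_789, 100_000)  # (1200001, 1500000)
```

With 10,000 bp bins the boundaries are off by a few kilobases. With 100kb bins the same event covers only three bins, two of them partially, which is consistent with the coarser profiles being reliable mainly for much larger events.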
The level of specificity achieved by the pipeline depends strongly on how "clean" the read depth of coverage signal is. Samples with degraded DNA tend to have excess variability in read depth, which leads to more CNV calls, so these samples require more stringent filtering. Similarly, cell lines, including lymphoblastoid cell lines (LCLs), exhibit excess variability in read depth of coverage due to the effects of replication timing during active cell replication. Results can be improved by calling samples derived from blood and samples derived from LCLs in separate batches.
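One way to get a rough sense of how noisy a sample's depth signal is would be to compute a robust spread of its normalized per-bin depth ratios. The sketch below uses the median absolute deviation; this metric, the function name, the example values, and the 0.1 cutoff are all assumptions for illustration and are not part of Genome STRiP.

```python
from statistics import median

def depth_noise(ratios):
    """Median absolute deviation of per-bin normalized depth ratios."""
    center = median(ratios)
    return median(abs(r - center) for r in ratios)

# Made-up normalized ratios (1.0 = expected depth).
profiles = {
    "clean_sample": [1.0, 1.01, 0.99, 1.0, 1.02, 0.98],
    "noisy_sample": [1.0, 1.3, 0.7, 1.2, 0.8, 1.0],
}

# Hypothetical cutoff: flag samples that may need stricter filtering.
flagged = [s for s, r in profiles.items() if depth_noise(r) > 0.1]
```

Samples flagged this way are candidates for a higher score threshold or for calling in a separate batch.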