RedundancyAnnotator documentation
RedundancyAnnotator
Annotator used to detect and filter redundant structural variation calls with similar coordinates.
Category: Variant Annotators
The Redundancy annotator is invoked through the SVAnnotator framework, which defines arguments common to all annotators.
Introduction
One difficulty in structural variant calling is that different methods (or even the same method) often call the same variant more than once, with slightly different coordinates. This annotator addresses the problem by detecting, ranking and scoring potentially duplicate calls based on the genotypes from the genotyped population. The resulting duplicate scores can then be used to filter duplicate sites from the variant call set.
The Redundancy annotator computes a "redundancy" score and related annotations for each site in the input VCF. The redundancy score indicates how likely it is that a site is a duplicate of another site in the comparison VCF file, based on the genotype likelihoods at both sites in the samples they have in common. The comparison VCF file can be, and usually is, the same as the input VCF file. The Redundancy annotator also indicates which other sites in the comparison VCF are potential duplicates and, among these potential duplicates, which site is preferred to keep based on the strength of the genotype likelihoods. Stronger genotype likelihoods usually indicate a better call with more accurate boundaries.
This annotator is normally used as part of the Genome STRiP Queue pipelines for variant calling and filtration and is not often used separately. After running this annotator, the output VCF is usually filtered based on GSDUPLICATESCORE to remove duplicates. Variants are only compared if they overlap by more than a threshold given by the -duplicateOverlapThreshold parameter. Setting this parameter to zero will cause all overlapping variants to be evaluated as potential duplicates. Combining this with a very low filtering value (e.g. -1000) will result in a non-overlapping output call set, with the strongest calls retained.
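The filtering step described above can be sketched in Python. This is a minimal illustration, not part of Genome STRiP. The function names `duplicate_score` and `drop_duplicates` are hypothetical. The sketch assumes that GSDUPLICATESCORE appears as a `KEY=VALUE` entry in the VCF INFO column, that sites without a numeric score are kept, and that a site is treated as a duplicate when its score is at or above the cutoff (the direction implied by the -1000 example, where a very low cutoff removes the most sites).

```python
import math


def duplicate_score(info):
    """Return GSDUPLICATESCORE from a VCF INFO string, or None if absent or not numeric."""
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if key == "GSDUPLICATESCORE" and sep:
            try:
                score = float(value)
            except ValueError:
                return None  # e.g. "NA" or "."
            return None if math.isnan(score) else score
    return None


def drop_duplicates(vcf_lines, cutoff):
    """Yield header lines and the records that are not flagged as duplicates.

    Assumption: a record is a duplicate when its score is >= cutoff.
    Records without a score are kept.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        score = duplicate_score(fields[7])  # INFO is the 8th VCF column
        if score is None or score < cutoff:
            yield line
```

Under these assumptions, `drop_duplicates(lines, -1000)` keeps only records that carry no duplicate score, while a cutoff of `0.0` removes only the records scored at or above zero.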
Output Formats
This annotator can produce the following outputs: annotated VCF, report file.
The following VCF annotations are produced by this annotator:
- GSDUPLICATESCORE: LOD score that this site is distinct based on the genotypes of the most discordant sample at each comparison site.
- GSDUPLICATEOVERLAP: Highest overlap with a duplicate comparison site.
- GSDUPLICATES: List of duplicate sites preferred to this site.
The report file contains one line for every pairwise comparison between two sites (one from the input VCF, one from the comparison VCF). The following columns are produced in the report file:
- SITE1: The site from the input VCF.
- SITE2: The site from the comparison VCF.
- LENGTH1: The genomic length of site 1.
- LENGTH2: The genomic length of site 2.
- OVERLAP: The length of the genomic overlap between the two sites.
- SCORE: The duplicate score between these two sites.
- LEGACY_SCORE: An older method for scoring duplicate sites (retained for backwards compatibility).
- LODSUM: Sum of the LOD scores across all samples (not used).
- DOSAGE_CORRELATION: The correlation coefficient between genotype dosages of common samples at the two sites.
- DISCORDANT_GENOTYPES: The number of discordant genotype calls at common samples between the two sites.
- DUPLICATE_OVERLAP: Fractional overlap between the two sites.
- DUPLICATE_NONOVERLAP: Fraction of non-overlapping bases between the two sites.
- PREFERRED: Which of the two sites is the better site to retain.
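The report columns listed above lend themselves to simple post-processing. The following Python sketch selects site pairs with high fractional overlap and few discordant genotypes. It is illustrative only: the function name `candidate_pairs` and its default thresholds are hypothetical, and it assumes the report is tab-delimited, has a header row using the column names listed above, and holds plain numeric values in the columns it reads.

```python
import csv


def candidate_pairs(report_lines, min_overlap=0.5, max_discordant=0):
    """Return (SITE1, SITE2, SCORE) for pairs passing both thresholds.

    report_lines is any iterable of text lines, such as an open file.
    Assumption: tab-delimited text with a header row.
    """
    reader = csv.DictReader(report_lines, delimiter="\t")
    pairs = []
    for row in reader:
        overlap = float(row["DUPLICATE_OVERLAP"])
        discordant = int(row["DISCORDANT_GENOTYPES"])
        if overlap >= min_overlap and discordant <= max_discordant:
            pairs.append((row["SITE1"], row["SITE2"], float(row["SCORE"])))
    return pairs
```

A typical call would pass an open file handle for the report written to the directory given by -reportDirectory; the report's file name is not specified here and must be supplied by the user.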
Example
java -Xmx4g -cp SVToolkit.jar \
    org.broadinstitute.sv.main.SVAnnotator \
    -A Redundancy \
    -R human_g1k_v37.fasta \
    -vcf input.vcf \
    -comparisonFile input.vcf \
    -duplicateOverlapThreshold 0.5 \
    -O output.vcf \
    -writeReport true \
    -reportDirectory reportdir
RedundancyAnnotator specific arguments
Name | Type | Default value | Summary |
---|---|---|---|
Required Parameters | | | |
-comparisonFile | File | NA | VCF file of variants for comparison |
Optional Parameters | | | |
-discordantGenotypeThreshold | Integer | NA | Maximum number of discordant genotypes allowed for redundant events (default: any) |
-duplicateOverlapDenominator | String | SHORTEST | Denominator for measuring overlap, UNION (reciprocal overlap) or SHORTEST (default, length of shortest event) |
-duplicateOverlapThreshold | Double | NA | Minimum overlap required to determine two events are redundant |
-duplicateScoreThreshold | Double | NA | Concordance LOD score threshold (minimum per-sample LOD score at which events are considered non-redundant) |
-filterGenotypes | Boolean | true | True to ignore genotypes that have been filtered |
-filterVariants | Boolean | true | True to ignore variants that have been filtered |
Argument details
--comparisonFile / -comparisonFile ( required File )
VCF file of variants for comparison.
--discordantGenotypeThreshold / -discordantGenotypeThreshold ( Integer )
Maximum number of discordant genotypes allowed for redundant events (default: any).
--duplicateOverlapDenonminator / -duplicateOverlapDenominator ( String with default value SHORTEST )
Denominator for measuring overlap, UNION (reciprocal overlap) or SHORTEST (default, length of shortest event).
--duplicateOverlapThreshold / -duplicateOverlapThreshold ( Double )
Minimum overlap required to determine two events are redundant.
--duplicateScoreThreshold / -duplicateScoreThreshold ( Double )
Concordance LOD score threshold (minimum per-sample LOD score at which events are considered non-redundant).
--filterGenotypes / -filterGenotypes ( Boolean with default value true )
True to ignore genotypes that have been filtered.
--filterVariants / -filterVariants ( Boolean with default value true )
True to ignore variants that have been filtered.
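The interaction between -duplicateOverlapDenominator and -duplicateOverlapThreshold can be illustrated with a short Python sketch. This is not the tool's implementation. The function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and UNION is interpreted here as the number of bases covered by either site; the annotator may compute its reciprocal overlap differently.

```python
def overlap_fraction(start1, end1, start2, end2, denominator="SHORTEST"):
    """Fractional overlap of two closed intervals [start, end]."""
    length1 = end1 - start1 + 1
    length2 = end2 - start2 + 1
    # Number of shared bases; zero when the intervals are disjoint.
    overlap = max(0, min(end1, end2) - max(start1, start2) + 1)
    if denominator == "SHORTEST":
        denom = min(length1, length2)
    elif denominator == "UNION":
        # Assumed meaning: bases covered by either interval.
        denom = length1 + length2 - overlap
    else:
        raise ValueError("denominator must be SHORTEST or UNION")
    return overlap / denom


def is_compared(start1, end1, start2, end2, threshold, denominator="SHORTEST"):
    """True if the overlap exceeds the threshold (strictly greater than)."""
    return overlap_fraction(start1, end1, start2, end2, denominator) > threshold
```

With SHORTEST, a small site nested entirely inside a large one has overlap 1.0, whereas with UNION the same pair scores low, so UNION is the stricter choice for sites of very different lengths. A threshold of zero admits every pair that shares at least one base.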