RedundancyAnnotator documentation

RedundancyAnnotator

Annotator used to detect and filter redundant structural variation calls with similar coordinates.

Category: Variant Annotators

The Redundancy annotator is invoked through the SVAnnotator framework, which defines arguments common to all annotators.

Introduction

One difficulty in structural variant calling is that different methods (or even the same method) often call the same variant more than once, with slightly different coordinates. This annotator addresses the problem by detecting, ranking and scoring potentially duplicate calls, based on the genotypes from the genotyped population. These duplicate scores can then be used to filter duplicate sites from the variant call set.

The Redundancy annotator computes a "redundancy" score and related annotations for each site in the input VCF. The redundancy score indicates how likely it is that a site is a duplicate of another site in the comparison VCF file, based on the genotype likelihoods at both sites in overlapping samples. The comparison VCF file can be, and usually is, the same as the input VCF file. The Redundancy annotator also reports which other sites in the comparison VCF are potential duplicates and, among those potential duplicates, which site is preferred to keep, based on the strength of the genotype likelihoods. Sites with stronger genotype likelihoods usually indicate a better call with more accurate boundaries.

This annotator is normally used as part of the Genome STRiP Queue pipelines for variant calling and filtration and is not often used separately. After running this annotator, the output VCF is usually filtered based on GSDUPLICATESCORE to remove duplicates. Variants are only compared if they overlap by more than a threshold given by the -duplicateOverlapThreshold parameter. Setting this parameter to zero will cause all overlapping variants to be evaluated as potential duplicates. Combining this with a very low filtering value (e.g. -1000) will result in a non-overlapping output call set, with the strongest calls retained.
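
The filtering step described above can be sketched in Python. This is only an illustration, not the filter used by the Genome STRiP pipelines. It assumes that GSDUPLICATESCORE is written to the INFO column of the output VCF, that a site is treated as a duplicate when its score is at or above the chosen threshold (which is consistent with the remark that a very low value such as -1000 leaves a non-overlapping call set), and that sites with a missing or non-numeric score are kept. The names info_value and filter_duplicates are hypothetical.

```python
import math


def info_value(info_field, key):
    """Return the value of key in a VCF INFO string, or None if absent."""
    for entry in info_field.split(";"):
        name, sep, value = entry.partition("=")
        if name == key and sep:
            return value
    return None


def filter_duplicates(vcf_lines, threshold):
    """Yield VCF lines, dropping sites whose GSDUPLICATESCORE >= threshold.

    Assumption: the score is stored in the INFO column (column 8) and a
    score at or above the threshold marks the site as a duplicate.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # header lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            yield line  # not a complete record: leave it untouched
            continue
        raw = info_value(fields[7], "GSDUPLICATESCORE")
        try:
            score = float(raw)
        except (TypeError, ValueError):
            score = None  # annotation missing or non-numeric (e.g. "NA")
        if score is None or math.isnan(score) or score < threshold:
            yield line
```

With a threshold of -1000, every site that carries a numeric score is dropped under these assumptions, so only sites without a scored duplicate remain.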

Output Formats

This annotator can produce the following outputs: annotated VCF, report file.

The following VCF annotations are produced by this annotator:

GSDUPLICATESCORE
LOD score that this site is distinct based on the genotypes of the most discordant sample at each comparison site.
GSDUPLICATEOVERLAP
Highest overlap with a duplicate comparison site.
GSDUPLICATES
List of duplicate sites preferred to this site.

The report file contains one line for every pairwise comparison between two sites (one from the input VCF, one from the comparison VCF). The following columns are produced in the report file:

SITE1
The site from the input VCF.
SITE2
The site from the comparison VCF.
LENGTH1
The genomic length of site 1.
LENGTH2
The genomic length of site 2.
OVERLAP
The length of the genomic overlap between the two sites.
SCORE
The duplicate score between these two sites.
LEGACY_SCORE
An older method for scoring duplicate sites (retained for backwards compatibility).
LODSUM
Sum of the LOD scores across all samples (not used).
DOSAGE_CORRELATION
The correlation coefficient between genotype dosages of common samples at the two sites.
DISCORDANT_GENOTYPES
The number of discordant genotype calls at common samples between the two sites.
DUPLICATE_OVERLAP
Fractional overlap between the two sites.
DUPLICATE_NONOVERLAP
Fraction of non-overlapping bases between the two sites.
PREFERRED
Which of the two sites is the better site to retain.
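
The report can be inspected with a short script such as the following sketch. The file layout is an assumption: the code expects a tab-delimited file with a header line whose column names match those listed above, which this documentation does not state explicitly. The function names read_report and pairs_above_overlap are hypothetical.

```python
import csv


def read_report(stream):
    """Parse report rows into dictionaries keyed by column name.

    Assumption: tab-delimited text with a header line.
    """
    return list(csv.DictReader(stream, delimiter="\t"))


def pairs_above_overlap(rows, min_fraction):
    """Return (SITE1, SITE2, fraction) for rows whose DUPLICATE_OVERLAP
    is at least min_fraction."""
    selected = []
    for row in rows:
        try:
            fraction = float(row["DUPLICATE_OVERLAP"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without a numeric overlap value
        if fraction >= min_fraction:
            selected.append((row["SITE1"], row["SITE2"], fraction))
    return selected
```

Passing an open file handle to read_report and then calling pairs_above_overlap(rows, 0.5) would list the site pairs that overlap by at least half, under the layout assumed here.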

Example

 java -Xmx4g -cp SVToolkit.jar \
     org.broadinstitute.sv.main.SVAnnotator \
     -A Redundancy \
     -R human_g1k_v37.fasta \
     -vcf input.vcf \
     -comparisonFile input.vcf \
     -duplicateOverlapThreshold 0.5 \
     -O output.vcf \
     -writeReport true \
     -reportDirectory reportdir


RedundancyAnnotator specific arguments

Name                            Type     Default value  Summary

Required Parameters
-comparisonFile                 File     NA             VCF file of variants for comparison

Optional Parameters
-discordantGenotypeThreshold    Integer  NA             Maximum number of discordant genotypes allowed for redundant events (default: any)
-duplicateOverlapDenominator    String   SHORTEST       Denominator for measuring overlap, UNION (reciprocal overlap) or SHORTEST (default, length of shortest event)
-duplicateOverlapThreshold      Double   NA             Minimum overlap required to determine two events are redundant
-duplicateScoreThreshold        Double   NA             Concordance LOD score threshold (minimum per-sample LOD score at which events are considered non-redundant)
-filterGenotypes                Boolean  true           True to ignore genotypes that have been filtered
-filterVariants                 Boolean  true           True to ignore variants that have been filtered

Argument details

--comparisonFile / -comparisonFile ( required File )

VCF file of variants for comparison.

--discordantGenotypeThreshold / -discordantGenotypeThreshold ( Integer )

Maximum number of discordant genotypes allowed for redundant events (default: any).

--duplicateOverlapDenonminator / -duplicateOverlapDenominator ( String with default value SHORTEST )

Denominator for measuring overlap, UNION (reciprocal overlap) or SHORTEST (default, length of shortest event).

--duplicateOverlapThreshold / -duplicateOverlapThreshold ( Double )

Minimum overlap required to determine two events are redundant.

--duplicateScoreThreshold / -duplicateScoreThreshold ( Double )

Concordance LOD score threshold (minimum per-sample LOD score at which events are considered non-redundant).

--filterGenotypes / -filterGenotypes ( Boolean with default value true )

True to ignore genotypes that have been filtered.

--filterVariants / -filterVariants ( Boolean with default value true )

True to ignore variants that have been filtered.
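
The two overlap denominators accepted by -duplicateOverlapDenominator can be illustrated with a small Python sketch. It is not the annotator's implementation. It assumes 1-based inclusive coordinates and reads UNION as the overlap divided by the length of the union of the two intervals; the function name overlap_fraction is hypothetical.

```python
def overlap_fraction(start1, end1, start2, end2, denominator="SHORTEST"):
    """Fractional overlap of two 1-based inclusive intervals.

    SHORTEST divides the overlap by the length of the shorter interval;
    UNION divides it by the length of the union (an assumed reading of
    "reciprocal overlap").
    """
    length1 = end1 - start1 + 1
    length2 = end2 - start2 + 1
    # Number of shared bases, or zero when the intervals are disjoint.
    overlap = max(0, min(end1, end2) - max(start1, start2) + 1)
    if denominator == "SHORTEST":
        return overlap / min(length1, length2)
    if denominator == "UNION":
        return overlap / (length1 + length2 - overlap)
    raise ValueError("denominator must be SHORTEST or UNION")
```

For example, a short event nested entirely inside a longer one has an overlap of 1.0 with SHORTEST but a much smaller value with UNION, so SHORTEST passes a given -duplicateOverlapThreshold more easily for nested calls.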