TagVariantsAnnotator documentation
TagVariantsAnnotator
Annotator that evaluates tagging SNPs (or other markers) based on pairwise r2.
Category: Variant Annotators
The TagVariants annotator is invoked through the SVAnnotator framework, which defines arguments common to all annotators.
Introduction
The TagVariants annotator compares the input variant file to a comparison file of variants (e.g. VCF) specified by -tagFile, which can be the same file or a different file. For each variant in the input file, a pairwise r2 value is computed against every variant in the tag file that lies within a fixed genomic distance of the input variant (1 megabase by default; see -tagWindowSize).
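As an illustration of the window step, the following Python sketch selects the candidate tag variants for one input variant. It is not the tool's implementation: the `Variant` record, the function names, the use of the gap between intervals as the distance, and the treatment of overlapping variants (modeled on -tagIncludeOverlapping) are all assumptions made for the example.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record (not a type from the toolkit)."""
    vid: str
    chrom: str
    start: int
    end: int


def gap(a: Variant, b: Variant) -> int:
    """Bases separating two intervals on one chromosome; 0 if they touch or overlap."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def overlaps(a: Variant, b: Variant) -> bool:
    """True when the two intervals share at least one position."""
    return max(a.start, b.start) <= min(a.end, b.end)


def candidates(query: Variant, tags: List[Variant],
               window: int = 1_000_000,
               include_overlapping: bool = False) -> List[Variant]:
    """Tag variants on the same chromosome within `window` bases of `query`."""
    selected = []
    for tag in tags:
        if tag.chrom != query.chrom:
            continue
        if not include_overlapping and overlaps(query, tag):
            continue  # assumed reading of -tagIncludeOverlapping false
        if gap(query, tag) <= window:
            selected.append(tag)
    return selected
```

Under these assumptions, a variant compared against its own file is skipped by default because it overlaps itself.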
This annotator can process either bi-allelic variants (with GT/GQ fields) or copy-number variants (with CN/CNQ fields). In the case of copy number variants, r2 is computed with respect to diploid copy number.
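To make the quantity concrete, here is a minimal Python sketch of r2 as the squared Pearson correlation between two per-sample vectors, skipping samples where either value is missing. The encoding is an assumption: alternate-allele counts (0, 1, 2) for GT-based variants and integer copy number for CN-based variants. Because Pearson correlation is unchanged by adding a constant, using copy number directly or its deviation from the diploid value of 2 gives the same r2. The tool's exact formula may differ.

```python
from math import fsum
from typing import Optional, Sequence


def r_squared(x: Sequence[Optional[int]],
              y: Sequence[Optional[int]]) -> Optional[float]:
    """Squared Pearson correlation over samples called at both sites.

    Returns None when fewer than two paired values remain or when either
    vector is constant, since the correlation is undefined in those cases.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = fsum(a for a, _ in pairs) / n
    mean_y = fsum(b for _, b in pairs) / n
    sxx = fsum((a - mean_x) ** 2 for a, _ in pairs)
    syy = fsum((b - mean_y) ** 2 for _, b in pairs)
    sxy = fsum((a - mean_x) * (b - mean_y) for a, b in pairs)
    if sxx == 0 or syy == 0:
        return None
    return (sxy * sxy) / (sxx * syy)


# Illustrative values only: SNP dosages against copy numbers for six samples,
# with one missing genotype (None) that is dropped from the comparison.
snp_dosage = [0, 1, 2, 0, 1, None]
copy_number = [2, 3, 4, 2, 3, 2]
print(r_squared(snp_dosage, copy_number))
```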
Multiple populations (potentially overlapping) may be specified and then the r2 calculations are performed separately in each population. If no populations are defined, the samples in the VCF are treated as a single population. The r2 values are only calculated for samples that are present in both the input VCF file and the tag VCF file.
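The sample selection described above can be sketched as follows. This is a hedged illustration, not the toolkit's code: the function name and the `"ALL"` label used when no populations are defined are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Set


def samples_per_population(
    input_samples: Sequence[str],
    tag_samples: Sequence[str],
    populations: Optional[Dict[str, Set[str]]] = None,
) -> Dict[str, List[str]]:
    """Samples usable for r2 in each population.

    Only samples present in both files are kept. Populations may overlap,
    so one sample can appear under several population identifiers.
    """
    tag_set = set(tag_samples)
    shared = [s for s in input_samples if s in tag_set]
    if not populations:
        # No populations defined: treat all shared samples as one population.
        return {"ALL": shared}
    return {
        pop: [s for s in shared if s in members]
        for pop, members in populations.items()
    }
```

r2 would then be computed once per key of the returned dictionary, using only the listed samples.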
This annotator produces both a summary report and a detailed report. The detailed report contains r2 values for all pairs of variants above a user-specified r2 threshold (see -tagR2Threshold). The summary report contains the best r2 value for each input marker in each population.
The detailed report file contains a line for each pair of variants and each population where the r2 value is above the specified threshold. If this threshold is low and the tag window size is large, the report file may contain many lines for each variant.
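The relationship between the two reports can be sketched in Python as below. The record layout and function name are hypothetical, and two behaviors are assumptions rather than documented facts: that a pair exactly at the threshold is kept, and that the summary is computed from all pairs regardless of the threshold.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# (input variant ID, tag variant ID, population, r2)
PairRecord = Tuple[str, str, str, float]


def build_reports(
    pairs: Iterable[PairRecord],
    r2_threshold: Optional[float] = None,
) -> Tuple[List[PairRecord], Dict[Tuple[str, str], float]]:
    """Split pairwise results into detailed rows and per-variant best r2."""
    detailed: List[PairRecord] = []
    best: Dict[Tuple[str, str], float] = {}
    for record in pairs:
        variant, _tag, population, r2 = record
        # Detailed report: one row per pair and population passing the threshold.
        if r2_threshold is None or r2 >= r2_threshold:
            detailed.append(record)
        # Summary report: best r2 per input variant and population.
        key = (variant, population)
        if key not in best or r2 > best[key]:
            best[key] = r2
    return detailed, best
```

With no threshold, every pair inside the window produces a detailed row, which is why a large window can make the report very long.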
Input Formats
Population map files are tab-delimited files with two columns. The first column specifies the sample identifier and the second column specifies a population identifier. A header line is optional, but if present the column names should be SAMPLE and POPULATION.
Multiple population map files may be provided and these may assign multiple populations to the same sample. This allows multiple levels of population structure or overlapping populations to be described.
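A minimal Python sketch of reading this format is shown below. The function name is hypothetical, and skipping blank or malformed lines is a choice made for the example; the tool may handle such lines differently. Each source is any iterable of lines, so an open file handle works.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def read_population_maps(sources: Iterable[Iterable[str]]) -> Dict[str, Set[str]]:
    """Merge one or more population maps into {population: set of samples}."""
    populations: Dict[str, Set[str]] = defaultdict(set)
    for lines in sources:
        for line in lines:
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) < 2 or not fields[0]:
                continue  # blank or malformed line (assumed to be ignorable)
            sample, population = fields[0], fields[1]
            if (sample, population) == ("SAMPLE", "POPULATION"):
                continue  # optional header line
            populations[population].add(sample)
    return dict(populations)


# Illustrative maps: one with a header and broad labels, one with finer labels.
broad = ["SAMPLE\tPOPULATION\n", "NA12878\tEUR\n", "NA19238\tAFR\n"]
fine = ["NA12878\tCEU\n", "NA19238\tYRI\n"]
print(read_population_maps([broad, fine]))
```

Because the result is keyed by population, a sample listed in several maps simply appears in several sets, which is how overlapping or nested populations are represented here.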
Output Formats
This annotator can produce the following outputs: summary file, report file.
The summary file contains one line for each input variant. The populations are reported in separate columns.
The columns in the summary file are:
- VARIANT: The ID of the variant.
- CHR: The chromosome of the variant.
- START: The start position of the variant.
- END: The end position of the variant.
- MAF: The minor allele frequency of the variant (only present when there is a single population).
- R2: The maximum r2 value found to the best tagging marker. This column is displayed as "R2" when there is only one population. If there are multiple populations being tested, then there will be one column for each population and the column header specifies the population identifier.
The columns in the detailed report file are:
- VARIANT: The ID of the input variant.
- CHR: The chromosome of the input variant.
- START: The start position of the input variant.
- END: The end position of the input variant.
- MAF: The minor allele frequency of the input variant (in this population).
- TAGVARIANT: The ID of the tagging variant.
- TAGCHR: The chromosome of the tagging variant.
- TAGSTART: The start position of the tagging variant.
- TAGEND: The end position of the tagging variant.
- POP: The population used to compute this r2 value.
- NGENOTYPES: The number of genotypes used to compute r2. This reflects taking the intersection of the called genotypes at the two variant sites in the specified population.
- NREFALLELES: The number of reference alleles observed in the input variant.
- TAGNREFALLELES: The number of reference alleles observed in the tagging variant.
- OVERLAPMAF: The minor allele frequency of the input variant measured only at overlapping samples with the tagging variant and where both the input and tagging variant have called genotypes. This may be different than MAF, which considers all samples in this population for which there are called genotypes.
- TAGOVERLAPMAF: The minor allele frequency of the tagging variant measured only at overlapping samples with the input variant where both the input and tagging variant have called genotypes.
- R2: The squared correlation (r2) between the input and tagging variant.
- MAJOR: Integer specifying which allele in the input VCF file is higher frequency in this comparison. By comparing MAJOR and TAGMAJOR, you can tell whether the reference allele of the input variant corresponds to the reference or alternate allele of the tagging variant. Will be -1 for copy number variants.
- TAGMAJOR: Integer specifying which allele in the tagging VCF file is higher frequency in this comparison.
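The MAJOR/TAGMAJOR comparison can be sketched as a small Python helper. This rests on an assumption not stated in the column descriptions: that both values are allele indices in which 0 denotes the reference allele. The function name and return labels are hypothetical. The result is informative mainly when r2 is high, since r2 itself carries no sign.

```python
def allele_orientation(major: int, tag_major: int) -> str:
    """Compare the MAJOR and TAGMAJOR values from one detailed-report row.

    Assumes both values are allele indices (0 = reference allele).
    """
    if major < 0 or tag_major < 0:
        # -1 is reported for copy number variants, so no orientation applies.
        return "undefined"
    if major == tag_major:
        # Major alleles share an index: reference pairs with reference.
        return "same"
    # Major alleles differ: reference of one pairs with alternate of the other.
    return "flipped"
```

For example, a row with MAJOR 0 and TAGMAJOR 1 would suggest that the input variant's reference allele travels with the tagging variant's alternate allele.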
Interpretation
It is important to evaluate the significance of a correlation based on the number of data points used to estimate the correlation. If NGENOTYPES is low, many high r2 values would be expected by chance.
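One rough way to quantify this is the large-sample approximation that, when two sites are independent, NGENOTYPES times r2 follows a chi-square distribution with one degree of freedom. The Python sketch below applies that approximation; the function name is hypothetical, the calculation is not something the annotator reports, and the approximation is crude when NGENOTYPES is small.

```python
from math import erfc, sqrt


def r2_p_value(r2: float, n_genotypes: int) -> float:
    """Approximate probability of an r2 this large arising by chance.

    Uses n * r2 ~ chi-square(1 df) under independence. The upper tail of a
    chi-square variable with one degree of freedom is erfc(sqrt(x / 2)).
    """
    statistic = n_genotypes * r2
    return erfc(sqrt(statistic / 2.0))


# The same r2 is far less convincing with few genotypes than with many.
for n in (5, 20, 100):
    print(n, r2_p_value(0.8, n))
```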
It is important to use reasonable window sizes, especially when evaluating the correlation of rare variants. Singletons are by definition perfectly correlated with other singletons in the same sample.
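The singleton effect is easy to reproduce. The Python sketch below uses a plain squared Pearson correlation on alternate-allele counts (an assumed encoding, not necessarily the tool's) and compares two singletons carried by the same sample with two singletons carried by different samples. The sample count and carrier indices are arbitrary illustrative values.

```python
from typing import List, Sequence


def r_squared(x: Sequence[int], y: Sequence[int]) -> float:
    """Squared Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def singleton(carrier: int, n_samples: int) -> List[int]:
    """Genotype vector in which exactly one sample carries one alternate allele."""
    return [1 if i == carrier else 0 for i in range(n_samples)]


n_samples = 100
same_sample = r_squared(singleton(7, n_samples), singleton(7, n_samples))
different_samples = r_squared(singleton(7, n_samples), singleton(8, n_samples))
print("same carrier:", same_sample)
print("different carriers:", different_samples)
```

The genomic distance between the two sites never enters the calculation, so a very large window gives every singleton more chances to be "tagged" by an unrelated singleton in the same sample.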
Example
java -Xmx4g -cp SVToolkit.jar \
    org.broadinstitute.sv.main.SVAnnotator \
    -A TagVariants \
    -R human_g1k_v37.fasta \
    -vcf input.vcf \
    -tagFile tag_variants.vcf \
    -tagR2Threshold 0.5 \
    -populationMapFile 1000G_populations.map \
    -population CEU \
    -writeReport true \
    -writeSummary true \
    -reportDirectory reportdir
TagVariantsAnnotator specific arguments
Name | Type | Default value | Summary |
---|---|---|---|
Optional Parameters | |||
-filterGenotypes | Boolean | true | True to ignore genotypes that have been filtered. Applies to both the input variants and tag variants. |
-filterVariants | Boolean | true | True to ignore variants that have been filtered. Applies to both the input variants and tag variants. |
-genotypeQualityThreshold | Double | NA | Ignore genotypes below this genotype quality GQ value (default no threshold). Applies to both the input variants and tag variants. |
-population | List[String] | NA | Population(s) or .list file of populations to process |
-populationMapFile | List[File] | NA | Map file (or files) containing sample to population assignments |
-sample | List[String] | NA | Sample(s) or .list file(s) of sample names |
-tagIncludeOverlapping | Boolean | false | True to calculate r2 for overlapping variants |
-tagR2Threshold | Double | NA | Minimum r2 value to report in report file (default no threshold) |
-tagWindowSize | Integer | 1000000 | Size of window to use for LD evaluation (default 1Mb) |
Argument details
--filterGenotypes / -filterGenotypes ( Boolean with default value true )
True to ignore genotypes that have been filtered. Applies to both the input variants and tag variants.
--filterVariants / -filterVariants ( Boolean with default value true )
True to ignore variants that have been filtered. Applies to both the input variants and tag variants.
--genotypeQualityThreshold / -genotypeQualityThreshold ( Double )
Ignore genotypes below this genotype quality GQ value (default no threshold). Applies to both the input variants and tag variants.
--population / -population ( List[String] )
Population(s) or .list file of populations to process.
--populationMapFile / -populationMapFile ( List[File] )
Map file (or files) containing sample to population assignments.
--sample / -sample ( List[String] )
Sample(s) or .list file(s) of sample names.
--tagIncudeOverlapping / -tagIncludeOverlapping ( Boolean with default value false )
True to calculate r2 for overlapping variants.
--tagR2Threshold / -tagR2Threshold ( Double )
Minimum r2 value to report in report file (default no threshold).
--tagWindowSize / -tagWindowSize ( Integer with default value 1000000 )
Size of window to use for LD evaluation (default 1Mb).