TagVariantsAnnotator documentation

TagVariantsAnnotator

Annotator that evaluates tagging SNPs (or other markers) based on pairwise r².

Category: Variant Annotators

The TagVariants annotator is invoked through the SVAnnotator framework, which defines arguments common to all annotators.

Introduction

The TagVariants annotator compares the input variant file to a comparison file of variants (e.g. a VCF file) specified by -tagFile, which can be the same file or a different file. For each variant in the input file, a pairwise r² value is computed against every variant in the tag file that lies within a certain genomic distance of the input variant (by default 1 megabase; see -tagWindowSize).

This annotator can process either bi-allelic variants (with GT/GQ fields) or copy-number variants (with CN/CNQ fields). In the case of copy number variants, r² is computed with respect to diploid copy number.

Multiple populations (potentially overlapping) may be specified and then the r² calculations are performed separately in each population. If no populations are defined, the samples in the VCF are treated as a single population. The r² values are only calculated for samples that are present in both the input VCF file and the tag VCF file.
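
As an illustration of the calculation described above, the following Python sketch computes r² as the squared Pearson correlation of per-sample dosages, restricted to samples that are present and called in both files, and repeats the calculation for each population. It is not the annotator's own code: the dosage encoding (alternate allele count for GT, integer copy number for CN), the function names, and the sample and population identifiers are all assumptions made for the example.

```python
def r_squared(x, y):
    """Squared Pearson correlation of two equal-length dosage lists.

    None marks a missing (uncalled or filtered) genotype; a sample is used
    only if both variants have a value. Returns (r2, n_used), with r2 = None
    when fewer than two samples remain or either variant shows no variation.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None, n
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    if sxx == 0 or syy == 0:
        return None, n
    return sxy * sxy / (sxx * syy), n


def r_squared_by_population(input_calls, tag_calls, populations):
    """input_calls and tag_calls map sample -> dosage (or None).
    populations maps population -> list of samples (may overlap)."""
    results = {}
    for pop, samples in populations.items():
        # Only samples present in both files contribute.
        shared = [s for s in samples if s in input_calls and s in tag_calls]
        x = [input_calls[s] for s in shared]
        y = [tag_calls[s] for s in shared]
        results[pop] = r_squared(x, y)
    return results


# Hypothetical data: S5 is uncalled in the input file and absent from the
# tag file; S6 is absent from the input file.
input_calls = {"S1": 0, "S2": 1, "S3": 2, "S4": 1, "S5": None}
tag_calls = {"S1": 0, "S2": 1, "S3": 2, "S4": 0, "S6": 1}
populations = {"ALL": ["S1", "S2", "S3", "S4", "S5", "S6"],
               "POP_A": ["S1", "S2", "S3"]}
results = r_squared_by_population(input_calls, tag_calls, populations)
```

Because Pearson correlation is unchanged by adding a constant, using the raw copy number or the copy number relative to diploid gives the same r² in this sketch.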

This annotator produces both a summary report and a detailed report. The detailed report contains r² values for all pairs of variants above a user-specified r² threshold (see -tagR2Threshold). The summary report contains the best r² value for each input marker in each population.

The detailed report file contains a line for each pair of variants and each population where the r² value is above the specified threshold. If this threshold is low and the tag window size is large, the report file may contain many lines for each variant.

Input Formats

Population map files are tab-delimited files with two columns. The first column specifies the sample identifier and the second column specifies a population identifier. A header line is optional, but if present the column names should be SAMPLE and POPULATION.

Multiple population map files may be provided and these may assign multiple populations to the same sample. This allows multiple levels of population structure or overlapping populations to be described.
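
The map format described above can be read with a few lines of Python. The sketch below is only an illustration of the format, not the tool's parser: the function name and the sample and population identifiers are hypothetical, and in-memory text stands in for real map files.

```python
import io
from collections import defaultdict


def read_population_maps(sources):
    """Merge one or more population maps into population -> set of samples.

    Each source is an iterable of text lines (for example an open file).
    A sample may be assigned to several populations across the maps.
    """
    populations = defaultdict(set)
    for lines in sources:
        for line in lines:
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) < 2 or not fields[0]:
                continue  # skip blank or malformed lines
            sample, population = fields[0], fields[1]
            if (sample, population) == ("SAMPLE", "POPULATION"):
                continue  # optional header line
            populations[population].add(sample)
    return dict(populations)


# Two hypothetical maps describing two levels of population structure.
continental = io.StringIO(
    "SAMPLE\tPOPULATION\nsample1\tEUR\nsample2\tEUR\nsample3\tAFR\n")
fine_scale = io.StringIO("sample1\tCEU\nsample2\tGBR\n")
populations = read_population_maps([continental, fine_scale])
```

With real files, each source would be an open file handle, e.g. `open("populations.map")`, where the file name is again only a placeholder.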

Output Formats

This annotator can produce the following outputs: summary file, report file.

The summary file contains one line for each input variant. The populations are reported in separate columns.

The columns in the summary file are:

VARIANT: The ID of the variant.
CHR: The chromosome of the variant.
START: The start position of the variant.
END: The end position of the variant.
MAF: The minor allele frequency of the variant (only present when there is a single population).
R2: The maximum r² value found for this variant, i.e. the r² with its best tagging marker. This column is named "R2" when there is only one population. If multiple populations are being tested, there is one column for each population and the column header is the population identifier.

The columns in the detailed report file are:

VARIANT: The ID of the input variant.
CHR: The chromosome of the input variant.
START: The start position of the input variant.
END: The end position of the input variant.
MAF: The minor allele frequency of the input variant (in this population).
TAGVARIANT: The ID of the tagging variant.
TAGCHR: The chromosome of the tagging variant.
TAGSTART: The start position of the tagging variant.
TAGEND: The end position of the tagging variant.
POP: The population used to compute this r² value.
NGENOTYPES: The number of genotypes used to compute r². This reflects taking the intersection of the called genotypes at the two variant sites in the specified population.
NREFALLELES: The number of reference alleles observed in the input variant.
TAGNREFALLELES: The number of reference alleles observed in the tagging variant.
OVERLAPMAF: The minor allele frequency of the input variant measured only over samples shared with the tagging variant where both the input and tagging variant have called genotypes. This may differ from MAF, which considers all samples in this population for which there are called genotypes.
TAGOVERLAPMAF: The minor allele frequency of the tagging variant measured only at overlapping samples with the input variant where both the input and tagging variant have called genotypes.
R2: The squared correlation (r²) between the input and tagging variant.
MAJOR: Integer specifying which allele in the input VCF file has the higher frequency in this comparison. By comparing MAJOR and TAGMAJOR, you can tell whether the reference allele of the input variant corresponds to the reference or alternate allele of the tagging variant. The value is -1 for copy number variants.
TAGMAJOR: Integer specifying which allele in the tagging VCF file has the higher frequency in this comparison.
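
To make the difference between MAF and OVERLAPMAF concrete, here is a small Python sketch for a bi-allelic variant. It assumes diploid genotypes encoded as alternate allele counts (0, 1, 2) with None for an uncalled genotype, and that NGENOTYPES counts the samples called at both sites; the encoding, the function name and the data are illustrative assumptions, not the annotator's implementation.

```python
def minor_allele_frequency(dosages):
    """MAF from diploid alternate allele counts; None values are skipped."""
    called = [d for d in dosages if d is not None]
    if not called:
        return None
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


# Hypothetical genotypes for six samples in one population.
input_gt = [0, 1, 1, 0, 2, None]
tag_gt = [0, 1, None, None, 2, 1]

# MAF uses every called genotype at the input variant.
maf = minor_allele_frequency(input_gt)

# OVERLAPMAF and TAGOVERLAPMAF use only samples called at both sites.
both = [(a, b) for a, b in zip(input_gt, tag_gt)
        if a is not None and b is not None]
n_genotypes = len(both)
overlap_maf = minor_allele_frequency([a for a, _ in both])
tag_overlap_maf = minor_allele_frequency([b for _, b in both])
```

Here MAF is 0.4 (four alternate alleles among ten), while OVERLAPMAF is 0.5 because only three samples are called at both sites.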

Interpretation

It is important to evaluate the significance of a correlation based on the number of data points used to estimate the correlation. If NGENOTYPES is low, many high r² values would be expected by chance.
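
One rough way to take NGENOTYPES into account is the standard large-sample approximation that, in the absence of any association, n × r² follows a chi-square distribution with one degree of freedom. The Python sketch below applies that approximation; it is not something the annotator reports, the function name is hypothetical, and the approximation is poor for small n or rare variants.

```python
from math import erfc, sqrt


def r2_p_value(r2, n_genotypes):
    """Approximate probability of an r2 at least this large arising by
    chance, using n * r2 ~ chi-square with 1 degree of freedom."""
    statistic = n_genotypes * r2
    # Survival function of chi-square(1): P(X > s) = erfc(sqrt(s / 2)).
    return erfc(sqrt(statistic / 2))


# The same r2 is far less convincing when few genotypes support it.
p_few = r2_p_value(0.5, 10)
p_many = r2_p_value(0.5, 100)
```

With r² = 0.5, ten genotypes give a p-value of roughly 0.025, which is easily reached by chance when each input variant is compared against many tag variants in a large window, whereas one hundred genotypes give a far smaller value.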

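It is important to use reasonable window sizes, especially when evaluating the correlation of rare variants. A singleton is by definition perfectly correlated with any other singleton carried by the same sample, so a large window can yield r² values of 1 between variants that are otherwise unrelated.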
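
The singleton effect is easy to reproduce. The following Python sketch builds singleton variants in a hypothetical cohort of 100 samples (one heterozygous carrier each, encoded as alternate allele counts) and compares the r² when the carrier is the same sample with the r² when it is not. The cohort size, carrier positions and helper names are assumptions chosen for the example.

```python
def r2(x, y):
    """Squared Pearson correlation of two equal-length dosage lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


N_SAMPLES = 100


def singleton(carrier_index):
    """Dosages for a variant carried once, by a single sample."""
    return [1 if i == carrier_index else 0 for i in range(N_SAMPLES)]


# Two unrelated singletons that happen to fall in the same sample.
same_sample = r2(singleton(7), singleton(7))
# Two singletons carried by different samples.
different_sample = r2(singleton(7), singleton(42))
```

The first pair gives an r² of 1 regardless of how far apart the variants are, while the second gives 1/99², about 0.0001.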

Example

 java -Xmx4g -cp SVToolkit.jar \
     org.broadinstitute.sv.main.SVAnnotator \
     -A TagVariants \
     -R human_g1k_v37.fasta \
     -vcf input.vcf \
     -tagFile tag_variants.vcf \
     -tagR2Threshold 0.5 \
     -populationMapFile 1000G_populations.map \
     -population CEU \
     -writeReport true \
     -writeSummary true \
     -reportDirectory reportdir

TagVariantsAnnotator specific arguments

Name	Type	Default value	Summary
Optional Parameters
-filterGenotypes	Boolean	true	True to ignore genotypes that have been filtered. Applies to both the input variants and tag variants.
-filterVariants	Boolean	true	True to ignore variants that have been filtered. Applies to both the input variants and tag variants.
-genotypeQualityThreshold	Double	NA	Ignore genotypes below this genotype quality GQ value (default no threshold). Applies to both the input variants and tag variants.
-population	List[String]	NA	Population(s) or .list file of populations to process
-populationMapFile	List[File]	NA	Map file (or files) containing sample to population assignments
-sample	List[String]	NA	Sample(s) or .list file(s) of sample names
-tagIncludeOverlapping	Boolean	false	True to calculate r² for overlapping variants
-tagR2Threshold	Double	NA	Minimum r² value to report in report file (default no threshold)
-tagWindowSize	Integer	1000000	Size of window to use for LD evaluation (default 1Mb)

Argument details

--filterGenotypes / -filterGenotypes ( Boolean with default value true )

True to ignore genotypes that have been filtered. Applies to both the input variants and tag variants.

--filterVariants / -filterVariants ( Boolean with default value true )

True to ignore variants that have been filtered. Applies to both the input variants and tag variants.

--genotypeQualityThreshold / -genotypeQualityThreshold ( Double )

Ignore genotypes below this genotype quality GQ value (default no threshold). Applies to both the input variants and tag variants.

--population / -population ( List[String] )

Population(s) or .list file of populations to process.

--populationMapFile / -populationMapFile ( List[File] )

Map file (or files) containing sample to population assignments.

--sample / -sample ( List[String] )

Sample(s) or .list file(s) of sample names.

--tagIncudeOverlapping / -tagIncludeOverlapping ( Boolean with default value false )

True to calculate r² for overlapping variants.

--tagR2Threshold / -tagR2Threshold ( Double )

Minimum r² value to report in report file (default no threshold).

--tagWindowSize / -tagWindowSize ( Integer with default value 1000000 )

Size of window to use for LD evaluation (default 1Mb).
