TagVariantsAnnotator

Annotator that evaluates tagging SNPs (or other markers) based on pairwise r2.

Category: Variant Annotators

The TagVariants annotator is invoked through the SVAnnotator framework, which defines arguments common to all annotators.

Introduction

The TagVariants annotator compares the variants in the input file to the variants in a comparison file (e.g. a VCF) specified by -tagFile, which may be the input file itself or a different file. For each input variant, a pairwise r2 value is computed against every variant in the tag file that lies within a fixed genomic distance of the input variant (1 megabase by default; see -tagWindowSize).

This annotator can process either bi-allelic variants (with GT/GQ fields) or copy-number variants (with CN/CNQ fields). In the case of copy number variants, r2 is computed with respect to diploid copy number.
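
To make the quantity concrete, here is a minimal sketch of a pairwise r2 computed as the squared Pearson correlation over the samples called at both sites. It is not the annotator's actual code. The function name and the encoding are assumptions: bi-allelic genotypes are represented as alternate-allele counts (0, 1, 2), copy-number variants as the total copy number per sample, and a missing or filtered genotype as None.

```python
from typing import Optional, Sequence


def pairwise_r2(
    input_dosage: Sequence[Optional[float]],
    tag_dosage: Sequence[Optional[float]],
) -> Optional[float]:
    """Squared Pearson correlation over samples called at both sites.

    Both sequences hold one value per sample, in the same sample order.
    Returns None when r2 is undefined (fewer than two shared genotypes,
    or one site shows no variation among the shared samples).
    """
    # Keep only samples with a called genotype at both sites.
    pairs = [(x, y) for x, y in zip(input_dosage, tag_dosage)
             if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0 or syy == 0:
        return None
    return (sxy * sxy) / (sxx * syy)


# Hypothetical data: a SNP (allele counts) against a CNV (copy numbers).
snp = [0, 1, 2, 0, None, 1]
cnv = [2, 3, 4, 2, 3, 3]
r2 = pairwise_r2(snp, cnv)  # close to 1.0: copy number tracks the SNP allele
```

Because correlation is unchanged by adding a constant to either variable, expressing copy number as an offset from 2 instead of as a raw count would give the same r2 in this sketch.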

Multiple populations (potentially overlapping) may be specified, in which case the r2 calculations are performed separately in each population. If no populations are defined, the samples in the VCF are treated as a single population. The r2 values are calculated only for samples that are present in both the input VCF file and the tag VCF file.

This annotator produces both a summary report and a detailed report. The detailed report contains r2 values for all pairs of variants above a user-specified r2 threshold (see -tagR2Threshold). The summary report contains the best r2 value for each input marker in each population.

The detailed report file contains a line for each pair of variants and each population where the r2 value is above the specified threshold. If this threshold is low and the tag window size is large, the report file may contain many lines for each variant.
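
The relationship between the two reports can be sketched as follows. This is an illustration, not the annotator's implementation: the PairR2 type, the function names, and the variant and population identifiers are hypothetical, the threshold comparison is assumed to be inclusive, and the summary is assumed to be taken over all pairs rather than only those above the threshold.

```python
from typing import Dict, List, NamedTuple, Tuple


class PairR2(NamedTuple):
    variant: str      # ID of the input variant
    tag_variant: str  # ID of the tagging variant
    pop: str          # population identifier
    r2: float


def detailed_rows(pairs: List[PairR2], threshold: float) -> List[PairR2]:
    """Pairs kept for the detailed report (inclusive cutoff is an assumption)."""
    return [p for p in pairs if p.r2 >= threshold]


def summary_rows(pairs: List[PairR2]) -> Dict[Tuple[str, str], float]:
    """Best r2 for each (input variant, population) combination."""
    best: Dict[Tuple[str, str], float] = {}
    for p in pairs:
        key = (p.variant, p.pop)
        best[key] = max(best.get(key, 0.0), p.r2)
    return best


# Hypothetical values for illustration only.
pairs = [
    PairR2("sv1", "snp1", "POP_A", 0.82),
    PairR2("sv1", "snp2", "POP_A", 0.35),
    PairR2("sv1", "snp1", "POP_B", 0.41),
]
report = detailed_rows(pairs, threshold=0.5)  # keeps only sv1/snp1 in POP_A
summary = summary_rows(pairs)                 # one value per (variant, population)
```

The detailed list grows with the number of variant pairs and populations, while the summary always has one value per input variant and population.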

Input Formats

Population map files are tab delimited files with two columns. The first column specifies the sample identifier and the second column specifies a population identifier. A header line is optional, but if present the column names should be SAMPLE and POPULATION.

Multiple population map files may be provided and these may assign multiple populations to the same sample. This allows multiple levels of population structure or overlapping populations to be described.
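
As an illustration of the format, the sketch below reads map files into a population-to-samples table. The function name and the sample names are hypothetical, and the handling of blank or incomplete lines is an assumption; the two-column tab-delimited layout and the optional SAMPLE/POPULATION header follow the description above.

```python
from typing import Dict, Iterable, Set


def read_population_map(lines: Iterable[str],
                        populations: Dict[str, Set[str]]) -> None:
    """Add the sample-to-population assignments from one map file."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or incomplete lines (assumed behavior)
        sample, population = fields[0], fields[1]
        if sample == "SAMPLE" and population == "POPULATION":
            continue  # optional header line
        populations.setdefault(population, set()).add(sample)


# Hypothetical sample names; two map files held in memory as lists of lines.
fine_scale = ["SAMPLE\tPOPULATION", "sample1\tCEU", "sample2\tYRI"]
continental = ["sample1\tEUR", "sample2\tAFR"]

populations: Dict[str, Set[str]] = {}
for map_lines in (fine_scale, continental):
    read_population_map(map_lines, populations)
# sample1 now belongs to both CEU and EUR.
```

To read from disk, pass an open file handle in place of the list of lines, since iterating over a text file also yields one line at a time.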

Output Formats

This annotator can produce the following outputs: summary file, report file.

The summary file contains one line for each input variant. The populations are reported in separate columns.

The columns in the summary file are:

VARIANT
The ID of the variant.
CHR
The chromosome of the variant.
START
The start position of the variant.
END
The end position of the variant.
MAF
The minor allele frequency of the variant (only present when there is a single population).
R2
The r2 value to the best tagging marker, i.e. the maximum r2 found for this variant. This column is displayed as "R2" when there is only one population. If multiple populations are being tested, there is one column for each population and the column header specifies the population identifier.

The columns in the detailed report file are:

VARIANT
The ID of the input variant.
CHR
The chromosome of the input variant.
START
The start position of the input variant.
END
The end position of the input variant.
MAF
The minor allele frequency of the input variant (in this population).
TAGVARIANT
The ID of the tagging variant.
TAGCHR
The chromosome of the tagging variant.
TAGSTART
The start position of the tagging variant.
TAGEND
The end position of the tagging variant.
POP
The population used to compute this r2 value.
NGENOTYPES
The number of genotypes used to compute r2. This reflects taking the intersection of the called genotypes at the two variant sites in the specified population.
NREFALLELES
The number of reference alleles observed in the input variant.
TAGNREFALLELES
The number of reference alleles observed in the tagging variant.
OVERLAPMAF
The minor allele frequency of the input variant, measured only in samples shared with the tagging variant where both the input and tagging variant have called genotypes. This may differ from MAF, which considers all samples in this population that have called genotypes.
TAGOVERLAPMAF
The minor allele frequency of the tagging variant, measured only in samples shared with the input variant where both the input and tagging variant have called genotypes.
R2
The squared correlation (r2) between the input and tagging variant.
MAJOR
Integer specifying which allele in the input VCF file has the higher frequency in this comparison. By comparing MAJOR and TAGMAJOR, you can tell whether the reference allele of the input variant corresponds to the reference or alternate allele of the tagging variant. The value is -1 for copy number variants.
TAGMAJOR
Integer specifying which allele in the tagging VCF file has the higher frequency in this comparison.

Interpretation

It is important to evaluate the significance of a correlation based on the number of data points used to estimate the correlation. If NGENOTYPES is low, many high r2 values would be expected by chance.
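
One rough way to gauge this is the standard large-sample approximation in which, under no association, NGENOTYPES times r2 follows a chi-square distribution with one degree of freedom. The sketch below applies that approximation. It is not something the annotator computes, the function name is hypothetical, and the approximation is unreliable for small sample sizes and rare variants.

```python
from math import erfc, sqrt


def r2_p_value(r2: float, n_genotypes: int) -> float:
    """Approximate p-value for an r2 at least this large under no association.

    Treats n_genotypes * r2 as chi-square with 1 degree of freedom, whose
    upper tail probability at x equals erfc(sqrt(x / 2)).
    """
    statistic = n_genotypes * r2
    return erfc(sqrt(statistic / 2.0))


# The same r2 of 0.5 is weak evidence with 8 genotypes
# and very strong evidence with 100 genotypes.
p_small = r2_p_value(0.5, 8)    # roughly 0.046
p_large = r2_p_value(0.5, 100)  # on the order of 1e-12
```

Each input variant is compared against every tag variant in its window, so a threshold on such a p-value would also need to account for the number of comparisons made.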

It is important to use reasonable window sizes, especially when evaluating the correlation of rare variants. Singletons are by definition perfectly correlated with other singletons in the same sample.
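
The singleton effect is easy to reproduce. In the sketch below (hypothetical genotypes, encoded as alternate-allele counts for six samples), two singletons carried by the same sample give an r2 of 1 regardless of any biological relationship, while singletons carried by different samples give an r2 near zero.

```python
from typing import Sequence


def r2(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation of two equal-length dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


input_singleton = [0, 0, 1, 0, 0, 0]  # only the third sample carries the allele
same_sample_tag = [0, 0, 1, 0, 0, 0]  # another singleton in the same sample
other_sample_tag = [0, 0, 0, 0, 1, 0]  # a singleton in a different sample

same = r2(input_singleton, same_sample_tag)    # 1.0
other = r2(input_singleton, other_sample_tag)  # about 0.04
```

The larger the window, the more tag singletons it contains, and the more likely it is that one of them happens to fall in the same sample as a rare input variant.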

Example

 java -Xmx4g -cp SVToolkit.jar \
     org.broadinstitute.sv.main.SVAnnotator \
     -A TagVariants \
     -R human_g1k_v37.fasta \
     -vcf input.vcf \
     -tagFile tag_variants.vcf \
     -tagR2Threshold 0.5 \
     -populationMapFile 1000G_populations.map \
     -population CEU \
     -writeReport true \
     -writeSummary true \
     -reportDirectory reportdir

TagVariantsAnnotator specific arguments

Name                       Type          Default value  Summary
Optional Parameters
-filterGenotypes           Boolean       true           True to ignore genotypes that have been filtered. Applies to both the input variants and tag variants.
-filterVariants            Boolean       true           True to ignore variants that have been filtered. Applies to both the input variants and tag variants.
-genotypeQualityThreshold  Double        NA             Ignore genotypes below this genotype quality GQ value (default no threshold). Applies to both the input variants and tag variants.
-population                List[String]  NA             Population(s) or .list file of populations to process
-populationMapFile         List[File]    NA             Map file (or files) containing sample to population assignments
-sample                    List[String]  NA             Sample(s) or .list file(s) of sample names
-tagIncludeOverlapping     Boolean       false          True to calculate r2 for overlapping variants
-tagR2Threshold            Double        NA             Minimum r2 value to report in report file (default no threshold)
-tagWindowSize             Integer       1000000        Size of window to use for LD evaluation (default 1Mb)

Argument details

--filterGenotypes / -filterGenotypes ( Boolean with default value true )

True to ignore genotypes that have been filtered. Applies to both the input variants and tag variants.

--filterVariants / -filterVariants ( Boolean with default value true )

True to ignore variants that have been filtered. Applies to both the input variants and tag variants.

--genotypeQualityThreshold / -genotypeQualityThreshold ( Double )

Ignore genotypes with a genotype quality (GQ) below this value (default no threshold). Applies to both the input variants and tag variants.

--population / -population ( List[String] )

Population(s) or .list file of populations to process.

--populationMapFile / -populationMapFile ( List[File] )

Map file (or files) containing sample to population assignments.

--sample / -sample ( List[String] )

Sample(s) or .list file(s) of sample names.

--tagIncludeOverlapping / -tagIncludeOverlapping ( Boolean with default value false )

True to calculate r2 for overlapping variants.

--tagR2Threshold / -tagR2Threshold ( Double )

Minimum r2 value to report in report file (default no threshold).

--tagWindowSize / -tagWindowSize ( Integer with default value 1000000 )

Size of window to use for LD evaluation (default 1Mb).