Annotator for performing "in silico" validation of copy number variants using SNP array intensity data.
Category: Variant Annotators
The IntensityRankSum annotator is invoked through the SVAnnotator framework, which defines arguments common to all annotators.
The IntensityRankSum annotator uses SNP array intensity data to perform a form of "in silico" validation of copy number variants. The general approach is to apply a Wilcoxon rank sum test to the intensity data and calculate a p-value as follows. For each array probe underlying the variant, each sample is assigned an integral rank for that probe. The ranks from all probes are then pooled and treated as a single set of observations for the Wilcoxon rank sum test. If there is more than one probe, there will certainly be ties (e.g. some sample will be rank 1 with respect to each probe, so rank 1 occurs once per probe). Ties are broken randomly to assign the final ranking. The random seed is based on the input data, so identical inputs will always produce identical results. The Wilcoxon rank sum test is applied to the final combined ranking to test whether the event-carrying samples (see below) are shifted with respect to the non-event-carrying samples. This is a one-tailed test of a negative shift (for deletions) or a positive shift (for duplications).
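The ranking procedure described above can be sketched in a few lines of Python. This is an illustration only, not the annotator's implementation: the function name, the CRC32-based seed, and the normal approximation used for the p-value are all assumptions made for this example (the source states only that the seed is derived from the input data and that the test is one-tailed).

```python
import math
import random
import zlib


def combined_rank_sum_pvalue(intensities, carriers, shift="negative"):
    """One-tailed rank sum p-value over ranks pooled across probes.

    intensities: dict mapping sample name -> list of per-probe intensities
                 (all lists the same length).
    carriers:    set of sample names believed to carry the event.
    shift:       "negative" (deletions) or "positive" (duplications).
    Returns None if either group is empty.
    """
    samples = sorted(intensities)
    n_probes = len(intensities[samples[0]])

    # Assumption: a stand-in for "seed based on the input data".
    seed = zlib.crc32(repr(sorted(intensities.items())).encode("utf-8"))
    rng = random.Random(seed)

    # Rank the samples separately for each probe, then pool the ranks.
    observations = []  # (per-probe rank, random tie-breaker, is_carrier)
    for probe in range(n_probes):
        ordered = sorted(samples, key=lambda s: intensities[s][probe])
        for rank, sample in enumerate(ordered, start=1):
            observations.append((rank, rng.random(), sample in carriers))

    # Equal ranks from different probes are ordered by the random tie-breaker.
    observations.sort(key=lambda obs: (obs[0], obs[1]))

    n1 = sum(1 for obs in observations if obs[2])
    n2 = len(observations) - n1
    if n1 == 0 or n2 == 0:
        return None

    # Rank sum of the carrier observations in the final combined ranking.
    w = sum(pos for pos, obs in enumerate(observations, start=1) if obs[2])
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean) / sd

    # Assumption: normal approximation to the rank sum distribution.
    if shift == "negative":
        return 0.5 * math.erfc(-z / math.sqrt(2))  # P(Z <= z)
    return 0.5 * math.erfc(z / math.sqrt(2))       # P(Z >= z)


# Hypothetical data: samples "a" and "b" have the lowest intensity at all 3 probes.
example = {
    "a": [0.10, 0.20, 0.10], "b": [0.20, 0.30, 0.20],
    "c": [1.00, 1.10, 0.90], "d": [1.10, 1.00, 1.00],
    "e": [0.90, 1.20, 1.10], "f": [1.20, 0.90, 1.20],
}
p_deletion = combined_rank_sum_pvalue(example, {"a", "b"}, shift="negative")
p_duplication = combined_rank_sum_pvalue(example, {"a", "b"}, shift="positive")
```

In this toy input the carrier samples occupy the lowest ranks at every probe, so the negative-shift p-value is small and the positive-shift p-value is close to one.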
For CNVs, two tests are performed (and two p-values generated). First, the appropriate reference ploidy is determined. The tests are only performed on intervals of uniform gender-independent ploidy or, as a special case, on intervals of uniform gender-dependent ploidy if gender information is available and all of the input samples share the same gender. The first test is a negative-shift test comparing samples with copy number less than the reference ploidy (typically two, i.e. diploid) to all samples whose copy number matches the reference ploidy. The second is a positive-shift test comparing samples with copy number greater than the reference ploidy to all samples whose copy number matches the reference ploidy.
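As a rough sketch of how the two comparison groups are formed, the following Python function partitions samples by copy number relative to the reference ploidy. The function name and the sample names are hypothetical, and the reference ploidy is simply passed in rather than derived from the interval.

```python
def split_by_copy_number(copy_numbers, reference_ploidy=2):
    """Partition samples into (lower, reference, upper) groups.

    copy_numbers: dict mapping sample name -> integer copy number.
    """
    lower = sorted(s for s, cn in copy_numbers.items() if cn < reference_ploidy)
    reference = sorted(s for s, cn in copy_numbers.items() if cn == reference_ploidy)
    upper = sorted(s for s, cn in copy_numbers.items() if cn > reference_ploidy)
    return lower, reference, upper


# Hypothetical genotyped site on a diploid interval.
lower, reference, upper = split_by_copy_number(
    {"S1": 1, "S2": 2, "S3": 2, "S4": 3, "S5": 0}
)
# Negative-shift test: lower vs. reference.
# Positive-shift test: upper vs. reference.
```

Note that the reference-ploidy samples serve as the comparison group in both tests, and samples in the opposite group are excluded from each test.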
This annotator can be run in one of two modes, depending on whether you have genotypes for each sample or simply a list of samples believed to carry the variant (either as homozygotes or heterozygotes). The -irsUseGenotypes flag indicates which mode to use.
If you do not use genotypes (i.e. you are evaluating a set of discovered sites without genotypes), then the annotator will expect an INFO tag on each variant to indicate the set of samples that are thought to carry the variant (either as homozygotes or heterozygotes). The variants must be either deletions (SVTYPE=DEL) or duplications (SVTYPE=DUP). In addition, you must supply a list of samples that were evaluated during discovery (using -sample) so that the list of non-carrier samples can be determined for each variant.
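The carrier and non-carrier sets in this mode could be derived roughly as follows. This sketch assumes the INFO tag holds a comma-separated list of sample names, as is conventional for multi-valued VCF INFO fields; the source does not spell out the exact encoding, and the function and sample names are hypothetical.

```python
def carriers_and_noncarriers(info_field, evaluated_samples, tag="IRSSAMPLES"):
    """Split evaluated samples into carriers and non-carriers for one variant.

    info_field:        raw INFO column of a VCF record.
    evaluated_samples: names of all samples evaluated during discovery.
    tag:               INFO tag listing carriers (assumed comma-separated).
    """
    fields = dict(
        entry.split("=", 1) for entry in info_field.split(";") if "=" in entry
    )
    carriers = set(fields.get(tag, "").split(",")) - {""}
    evaluated = set(evaluated_samples)
    # Non-carriers are the evaluated samples not listed in the tag.
    return sorted(carriers & evaluated), sorted(evaluated - carriers)


carrier_list, noncarrier_list = carriers_and_noncarriers(
    "SVTYPE=DEL;END=12345;IRSSAMPLES=S1,S3", ["S1", "S2", "S3", "S4"]
)
```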
If you use genotypes, then the annotator will process any kind of CNV (SVTYPE=DEL, SVTYPE=DUP or SVTYPE=CNV). Genotypes will be taken from the GT/GQ fields (if present) or from the CN/CNQ fields, and these genotypes will be used to determine carrier and non-carrier samples. For CNVs, if GT/GQ fields are present then the alleles must be symbolic copy number alleles (e.g. <CNx>).
Input File Formats
The array intensity input file is a tab-delimited text file consisting of four fixed columns (ID, CHROM, START, END) and a variable number of additional columns, one per sample that has array data. The file consists of a header line specifying these four columns and then the name of each sample. After the header line, the file contains one line per array-probe location and must be sorted in reference-sequence order. Each array-probe location should have a single intensity value. For SNP probes, this is usually the sum of the A and B normalized probe intensities. The START and END coordinates are generally the same, corresponding to the SNP position.
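To make the layout concrete, here is a minimal Python reader for a file in this format. It is a sketch under stated assumptions: the probe IDs, sample names, and intensity values are invented for illustration, every intensity cell is assumed to be numeric, and the reader does not check sort order.

```python
import csv
import io


def read_intensity_matrix(handle):
    """Read a tab-delimited intensity matrix.

    Returns (sample_names, probes), where each probe is a dict with keys
    "id", "chrom", "start", "end" and "intensity" (sample name -> float).
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if header[:4] != ["ID", "CHROM", "START", "END"]:
        raise ValueError("unexpected fixed columns: %r" % header[:4])
    sample_names = header[4:]
    probes = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        probes.append({
            "id": row[0],
            "chrom": row[1],
            "start": int(row[2]),
            "end": int(row[3]),
            "intensity": dict(zip(sample_names, map(float, row[4:]))),
        })
    return sample_names, probes


# Invented example content: two SNP probes, two samples.
example_text = (
    "ID\tCHROM\tSTART\tEND\tS1\tS2\n"
    "probe1\t1\t1000\t1000\t1.25\t0.75\n"
    "probe2\t1\t2000\t2000\t1.0\t1.5\n"
)
sample_names, probes = read_intensity_matrix(io.StringIO(example_text))
```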
This annotator can produce the following outputs: Annotated output VCF, report file.
This annotator produces the following INFO field annotations for each VCF record (or the corresponding column in the report file):
- IRS_NPROBES (NPROBES)
  - The number of array probes used for this variant.
- IRS_NSAMPLES (NSAMPLES)
  - The number of measured samples (either affected or unaffected).
- IRS_LOWERNSAMPLES (LOWERNSAMPLES)
  - The number of measured samples for the deletion p-value.
- IRS_LOWERPVALUE (LOWERPVALUE)
  - The p-value based on testing deleted samples (carrier samples with copy number less than two).
- IRS_UPPERNSAMPLES (UPPERNSAMPLES)
  - The number of measured samples for the duplication p-value.
- IRS_UPPERPVALUE (UPPERPVALUE)
  - The p-value based on testing duplicated samples (carrier samples with copy number greater than two).
- IRS_PVALUE (PVALUE)
  - The lesser of IRS_LOWERPVALUE and IRS_UPPERPVALUE. You should not use this minimum p-value for FDR estimation unless you compute and apply a suitable null expectation. Instead, you should stratify your variants based on type (deletion, duplication, mixed) and estimate FDR separately using IRS_LOWERPVALUE and IRS_UPPERPVALUE.
The IntensityRankSum annotator provides basic statistics for each site, but the most robust way to interpret the results is to estimate the false discovery rate (FDR) of a large call set as a whole using the following procedure: For each IRS p-value (LOWER or UPPER), calculate the ratio of the number of sites with a p-value >= 0.5 to the number of sites with a valid p-value (i.e. sites with probes) and double this ratio to estimate the FDR of the call set as a whole. This callset-wide estimation is more robust than trying to assess individual sites, as the p-values on individual sites are not always well-calibrated due to inflation in the tails of the p-value distribution. As a result, using a p-value cutoff of 0.01 may yield a call set with a true FDR higher than 1%.
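The call-set FDR estimate described above amounts to a one-line calculation. The doubling reflects the expectation that p-values at false sites are roughly uniform, so about half of them fall at or above 0.5. The sketch below assumes the p-values for one stratum (for example, the LOWERPVALUE column for deletions) have already been loaded into a list, with None marking sites that have no valid p-value; the function name and the example values are hypothetical.

```python
def estimate_callset_fdr(pvalues):
    """Estimate the call-set FDR as 2 * (#sites with p >= 0.5) / (#sites with a valid p).

    pvalues: iterable of floats, with None for sites lacking a valid p-value.
    Returns None if no site has a valid p-value.
    """
    valid = [p for p in pvalues if p is not None]
    if not valid:
        return None
    high = sum(1 for p in valid if p >= 0.5)
    return 2.0 * high / len(valid)


# Hypothetical stratum: 10 sites with probes (one with p >= 0.5), 2 without probes.
fdr = estimate_callset_fdr(
    [0.001, 0.0002, 0.03, 0.8, 0.004, 0.01, 0.2, 0.0001, 0.05, 0.002, None, None]
)
```

Here the estimate is 2 * 1 / 10, i.e. an FDR of about 20% for this made-up stratum.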
Example for assessing discovery, using INFO tags to determine carrier and non-carrier samples.
    # Discovery example (using INFO tags)
    java -Xmx4g -cp SVToolkit.jar \
        org.broadinstitute.sv.main.SVAnnotator \
        -A IntensityRankSum \
        -R human_g1k_v37.fasta \
        -vcf input.vcf \
        -O output.vcf \
        -arrayIntensityFile ALL.genome.Omni25_probe_intensity_matrix.20110425.dat \
        -sample discovery_samples.list \
        -irsSampleTag IRSSAMPLES \
        -writeReport true \
        -reportFile irs_output.report.dat
Example for assessing genotyped variants, using the genotypes in the input VCF file to determine carrier and non-carrier samples.
    # Genotyped example (using genotype fields)
    java -Xmx4g -cp SVToolkit.jar \
        org.broadinstitute.sv.main.SVAnnotator \
        -A IntensityRankSum \
        -R human_g1k_v37.fasta \
        -vcf input.vcf \
        -O output.vcf \
        -arrayIntensityFile ALL.genome.Omni25_probe_intensity_matrix.20110425.dat \
        -irsUseGenotypes true \
        -writeReport true \
        -reportFile irs_output.report.dat
IntensityRankSumAnnotator specific arguments
|Name|Type|Default|Description|
|---|---|---|---|
|-arrayIntensityFile|File|NA|Path to file containing matrix of array intensity values|
|-genotypeQualityThreshold|Double|NA|When using genotypes, ignore genotypes below this genotype quality GQ value (default no threshold)|
|-irsBoundaryPadding|Integer|0|Probes within this distance of inner site boundaries are not used|
|-irsPermute|String|false|Whether to permute samples to generate null distribution|
|-irsSampleTag|String|IRSSAMPLES|VCF INFO tag containing list of carrier samples|
|-irsUseGenotypes|String|false|Whether to use genotypes or INFO tags to select carrier samples|
|-sample|List[String]|NA|Sample(s) or .list file of sample names. Used to determine the non-carrier samples when not using genotypes.|