ApplyRecalibration

Apply a score cutoff to filter variants based on a recalibration table

Category Variant Discovery Tools

Traversal LocusWalker

PartitionBy LOCUS


Overview

This tool performs the second pass in a two-stage process called Variant Quality Score Recalibration (VQSR); the first pass is performed by the VariantRecalibrator tool. In brief, the first pass consists of creating a Gaussian mixture model by looking at the distribution of annotation values over a high-quality subset of the input call set, and then scoring all input variants according to the model. The second pass consists of filtering variants based on score cutoffs identified in the first pass.

Using the tranche file and recalibration table generated by the previous step, the ApplyRecalibration tool looks at each variant's VQSLOD value and decides which tranche it falls in. Variants in tranches that fall below the specified truth sensitivity filter level have their FILTER field annotated with the corresponding tranche level. This will result in a call set that is filtered to the desired level but retains the information necessary to increase sensitivity if needed.

To be clear, please note that by "filtered", we mean that variants failing the requested tranche cutoff are marked as filtered in the output VCF; they are not discarded.
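
The tranche lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the tranche table, its cutoff values, and the filter labels (`Tranche99.0to99.9` and so on) are hypothetical, and the sketch assumes the requested sensitivity level appears in the table.

```python
# Hypothetical tranche table: (truth sensitivity %, minimum VQSLOD), ordered
# from most stringent to most lenient. All values are made up for illustration.
TRANCHES = [(90.0, 5.0), (99.0, 0.5), (99.9, -2.0), (100.0, -10.0)]


def assign_filter(vqslod, ts_filter_level, tranches=TRANCHES):
    """Return "PASS" or a (hypothetical) label for the tranche a variant falls in."""
    cutoffs = dict(tranches)
    # Variants scoring at or above the cutoff of the requested level pass.
    if vqslod >= cutoffs[ts_filter_level]:
        return "PASS"
    # Otherwise walk through the more lenient tranches and report the first
    # one whose cutoff the score still clears.
    lower = ts_filter_level
    for ts, min_lod in tranches:
        if ts <= ts_filter_level:
            continue
        if vqslod >= min_lod:
            return f"Tranche{lower}to{ts}"
        lower = ts
    # Score is below every cutoff in the table.
    return f"Tranche{lower}+"


for score in (1.0, -1.0, -5.0, -20.0):
    print(score, assign_filter(score, 99.0))
```

Note that every variant is returned with a label; nothing is dropped, which mirrors the point that filtered variants stay in the output VCF.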

VQSR is probably the hardest part of the Best Practices to get right, so be sure to read the method documentation, parameter recommendations and tutorial to really understand what these tools do and how to use them for best results on your own data.

Input

  • The raw input variants to be filtered.
  • The recalibration table file that was generated by the VariantRecalibrator tool.
  • The tranches file that was generated by the VariantRecalibrator tool.

Output

  • A recalibrated VCF file in which each variant of the requested type is annotated with its VQSLOD and marked as filtered if the score is below the desired quality level.
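
To check how many records were marked as filtered, the FILTER column (the seventh tab-separated column of a VCF data line) can be tallied. The following is a rough sketch using only the Python standard library; the sample records and the filter name `LowTranche` are invented for illustration, and a real pipeline would more likely use a dedicated VCF parser.

```python
from collections import Counter


def tally_filters(vcf_lines):
    """Count FILTER column values across VCF data lines, skipping headers."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        counts[fields[6]] += 1  # FILTER is the 7th column
    return counts


def get_vqslod(info):
    """Extract the VQSLOD value from an INFO string, or None if absent."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "VQSLOD":
            return float(value)
    return None


# Invented sample records; "LowTranche" is a placeholder filter name.
SAMPLE = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\tVQSLOD=4.2;culprit=QD",
    "1\t200\t.\tC\tT\t30\tLowTranche\tVQSLOD=-3.1;culprit=FS",
    "1\t300\t.\tG\tA\t60\tPASS\tVQSLOD=7.9;culprit=MQ",
]

print(tally_filters(SAMPLE))
```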

Usage example for filtering SNPs

 java -jar GenomeAnalysisTK.jar \
   -T ApplyRecalibration \
   -R reference.fasta \
   -input raw_variants.vcf \
   --ts_filter_level 99.0 \
   -tranchesFile output.tranches \
   -recalFile output.recal \
   -mode SNP \
   -o path/to/output.recalibrated.filtered.vcf
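
When the command is launched from a script, it can help to assemble the arguments programmatically. The sketch below builds the same argument list as the example above; the helper name `build_apply_recal_cmd` and its parameters are hypothetical, and only flags shown in this document are used.

```python
import shlex


def build_apply_recal_cmd(input_vcf, recal_file, tranches_file, output_vcf,
                          mode="SNP", ts_filter_level=99.0,
                          reference="reference.fasta",
                          jar="GenomeAnalysisTK.jar"):
    """Return the ApplyRecalibration command as a list of arguments."""
    return [
        "java", "-jar", jar,
        "-T", "ApplyRecalibration",
        "-R", reference,
        "-input", input_vcf,
        "--ts_filter_level", str(ts_filter_level),
        "-tranchesFile", tranches_file,
        "-recalFile", recal_file,
        "-mode", mode,
        "-o", output_vcf,
    ]


cmd = build_apply_recal_cmd(
    "raw_variants.vcf", "output.recal", "output.tranches",
    "path/to/output.recalibrated.filtered.vcf",
)
# Print a shell-safe rendering; the list itself could be handed to
# subprocess.run(cmd, check=True) to execute it.
print(" ".join(shlex.quote(arg) for arg in cmd))
```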
 

Allele-specific usage

 java -jar GenomeAnalysisTK.jar \
   -T ApplyRecalibration \
   -R reference.fasta \
   -input raw_variants.withASannotations.vcf \
   -AS \
   --ts_filter_level 99.0 \
   -tranchesFile output.AS.tranches \
   -recalFile output.AS.recal \
   -mode SNP \
   -o path/to/output.recalibrated.ASfiltered.vcf
 
Each allele will be annotated by its corresponding entry in the AS_FilterStatus INFO field annotation. Allele-specific VQSLOD and culprit are also carried through from VariantRecalibrator and stored in the AS_VQSLOD and AS_culprit INFO fields, respectively. The site-level filter is set to the most lenient of any of the allele filters. That is, if one allele passes, the whole site will be PASS. If no alleles pass, the site-level filter will be set to the lowest sensitivity tranche among all the alleles. Note that the .tranches and .recal files should be derived from an allele-specific run of VariantRecalibrator. Also note that the AS_culprit, AS_FilterStatus, and AS_VQSLOD fields will have placeholder values (NA or NaN) for alleles of a type that have not yet been processed by ApplyRecalibration. The spanning deletion allele (*) will not be recalibrated because it represents missing data. Its VQSLOD will remain NaN and its culprit and FilterStatus will be NA.
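
The site-level rule can be illustrated with a short Python sketch. It is an illustration of the rule as stated, not the tool's code: the tranche labels and their sensitivities are hypothetical, and returning "NA" when every allele is a placeholder is an assumption of this sketch.

```python
# Hypothetical filter labels mapped to the truth sensitivity of their tranche.
TRANCHE_SENSITIVITY = {
    "Tranche99.0to99.9": 99.9,
    "Tranche99.9to100.0": 100.0,
}


def site_filter(allele_statuses):
    """Derive a site-level filter from per-allele filter statuses."""
    # Skip placeholder entries (e.g. the spanning deletion allele).
    scored = [status for status in allele_statuses if status != "NA"]
    if not scored:
        return "NA"  # assumption of this sketch, not documented behaviour
    # One passing allele is enough for the whole site to pass.
    if "PASS" in scored:
        return "PASS"
    # Otherwise use the lowest sensitivity tranche among the alleles.
    return min(scored, key=TRANCHE_SENSITIVITY.__getitem__)


print(site_filter(["PASS", "Tranche99.9to100.0"]))
print(site_filter(["Tranche99.9to100.0", "Tranche99.0to99.9"]))
```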

Caveats

  • The tranche values used in the examples above are only meant to be a general example. You should determine the level of sensitivity that is appropriate for your specific project. Remember that higher sensitivity (more power to detect variants, yay!) comes at the cost of specificity (more false positives, boo!). You have to choose at what point you want to set the tradeoff.
  • In order to create the tranche reporting plots (which are only generated for SNPs, not indels!) Rscript needs to be in your environment PATH (this is the scripting version of R, not the interactive version).
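
The sensitivity tradeoff in the first caveat can be made concrete with a toy calculation. The sketch below is a simplification of the idea behind truth sensitivity tranches, not the VariantRecalibrator implementation: it picks the VQSLOD cutoff that retains a target percentage of truth-site variants, then counts how many calls in a call set clear that cutoff. All scores are invented.

```python
import math


def vqslod_cutoff(truth_scores, target_sensitivity):
    """Lowest VQSLOD cutoff that retains target_sensitivity % of truth variants."""
    ranked = sorted(truth_scores, reverse=True)
    keep = max(math.ceil(len(ranked) * target_sensitivity / 100.0), 1)
    return ranked[keep - 1]


def count_passing(scores, cutoff):
    """Number of calls whose VQSLOD is at or above the cutoff."""
    return sum(1 for score in scores if score >= cutoff)


# Invented VQSLOD scores for truth-site variants and for the full call set.
TRUTH = [9.0, 8.0, 7.5, 6.0, 5.5, 4.0, 3.0, 1.0, -0.5, -4.0]
CALLS = [8.5, 6.2, 2.0, 0.1, -1.0, -3.0, -6.0]

for level in (50.0, 90.0, 100.0):
    cutoff = vqslod_cutoff(TRUTH, level)
    print(level, cutoff, count_passing(CALLS, cutoff))
```

Raising the sensitivity level lowers the cutoff, so more calls pass, including lower-scoring ones that are more likely to be false positives.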

Additional Information

Read filters

These Read Filters are automatically applied to the data by the Engine before processing by ApplyRecalibration.

Parallelism options

This tool can be run in multi-threaded mode using this option.


Command-line Arguments

Engine arguments

All tools inherit arguments from the GATK Engine's "CommandLineGATK" argument collection, which can be used to modify various aspects of the tool's function. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals, and the -rf argument allows you to apply certain read filters to exclude some of the data from the analysis.

ApplyRecalibration specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.

Argument name(s)                            Default value   Summary

Required Inputs
--input                                     NA              The raw input variants to be recalibrated
--recal_file, -recalFile                    NA              The input recal file used by ApplyRecalibration

Optional Inputs
--tranches_file, -tranchesFile              NA              The input tranches file describing where to cut the data

Optional Outputs
--out, -o                                   stdout          The output filtered and recalibrated VCF file in which each variant is annotated with its VQSLOD value

Optional Parameters
--ignore_filter, -ignoreFilter              NA              If specified, the recalibration will be applied to variants marked as filtered by the specified filter name in the input VCF file
--mode                                      SNP             Recalibration mode to employ: 1.) SNP for recalibrating only SNPs (emitting indels untouched in the output VCF); 2.) INDEL for indels; and 3.) BOTH for recalibrating both SNPs and indels simultaneously.
--ts_filter_level                           NA              The truth sensitivity level at which to start filtering

Optional Flags
--excludeFiltered, -ef                      false           Don't output filtered loci after applying the recalibration
--ignore_all_filters, -ignoreAllFilters     false           If specified, the variant recalibrator will ignore all input filters. Useful to rerun the VQSR from a filtered output file.
--useAlleleSpecificAnnotations, -AS         false           If specified, the tool will attempt to apply a filter to each allele based on the input tranches and allele-specific .recal file.

Advanced Parameters
--lodCutoff                                 NA              The VQSLOD score below which to start filtering

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Engine arguments above.


--excludeFiltered / -ef

Don't output filtered loci after applying the recalibration

boolean  false


--ignore_all_filters / -ignoreAllFilters

If specified, the variant recalibrator will ignore all input filters. Useful to rerun the VQSR from a filtered output file.

boolean  false


--ignore_filter / -ignoreFilter

If specified, the recalibration will be applied to variants marked as filtered by the specified filter name in the input VCF file.
For this to work properly, the -ignoreFilter argument should also be applied to the VariantRecalibrator command.

String[]  NA


--input / -input

The raw input variants to be recalibrated.
These calls should be unfiltered and annotated with the error covariates that are intended to be used for modeling.

This argument supports reference-ordered data (ROD) files in the following formats: BCF2, VCF, VCF3

R List[RodBinding[VariantContext]]  NA


--lodCutoff / -lodCutoff

The VQSLOD score below which to start filtering

Double  NA


--mode / -mode

Recalibration mode to employ: 1.) SNP for recalibrating only SNPs (emitting indels untouched in the output VCF); 2.) INDEL for indels; and 3.) BOTH for recalibrating both SNPs and indels simultaneously.

The --mode argument is an enumerated type (Mode), which can have one of the following values:

SNP
INDEL
BOTH

Mode  SNP


--out / -o

The output filtered and recalibrated VCF file in which each variant is annotated with its VQSLOD value

VariantContextWriter  stdout


--recal_file / -recalFile

The input recal file used by ApplyRecalibration

This argument supports reference-ordered data (ROD) files in the following formats: BCF2, VCF, VCF3

R RodBinding[VariantContext]  NA


--tranches_file / -tranchesFile

The input tranches file describing where to cut the data

File  NA


--ts_filter_level / -ts_filter_level

The truth sensitivity level at which to start filtering

Double  NA


--useAlleleSpecificAnnotations / -AS

If specified, the tool will attempt to apply a filter to each allele based on the input tranches and allele-specific .recal file.
Filter the input file based on allele-specific recalibration data. See the Allele-specific usage section above for site-level and allele-level filtering details. Requires a .recal file produced using an allele-specific run of VariantRecalibrator.

boolean  false


GATK version 3.7-0-gcfedb67 built at 2017/02/09 12:35:06.