(How to) Filter variants either with VQSR or by hard-filtering
Tutorials | Created 2018-12-14 | Last updated 2019-02-14

Comments (7)

Document is in BETA. It may be incomplete and/or inaccurate. Post suggestions to the Comments section and be sure to read about updates also within the Comments section.

This article outlines two different approaches to site-level variant filtration. Site-level filtering involves using INFO field annotations in filtering. Section 1 outlines steps in Variant Quality Score Recalibration (VQSR) and section 2 outlines steps in hard-filtering. See Article#6925 for in-depth descriptions of the different variant annotations.

The GATK Best Practices recommends filtering germline variant callsets with VQSR. For WDL script implementations of pipelines the Broad Genomics Platform uses in production, i.e. reference implementations, see links, e.g. provided on https://software.broadinstitute.org/gatk/best-practices/workflow?id=11145 to both the gatk-workflows WDL script repository and to FireCloud workspaces. Each includes example data as well as publically available recommended human population resources.

Hard-filtering is useful when the data cannot support VQSR or when an analysis requires manual filtering. Additionally, hard-filtering allows for filtering on sample-level annotations, i.e. FORMAT field annotations, which this article does not cover. See Tutorial#12350 to filter on FORMAT field attributes and to change the genotypes of such filtered sample sites to NULL ./..

► GATK4 offers a deep learning method to filter germline variants that is applicable to single sample callsets. As of this writing, the CNN workflow is in experimental status (check here for an update). See Blog#10996 for an overview and initial benchmarking results, and see the gatk4-cnn-variant-filter repository for the WDL pipeline. ► For more complex variant filtering and annotation, see the Broad Hail.is framework at https://hail.is/index.html. ► After variant filtration, if downstream analyses require high-quality genotype calls, consider genotype refinement, e.g. filtering posterior-corrected GQ<20 genotypes. See Article#11074 for an overview.

Jump to a section

  1. VQSR: filter a cohort callset with VariantRecalibrator & ApplyVQSR ☞ 1.1 How can I parallelize VQSR?
  2. Hard filter a cohort callset with VariantFiltration
  3. Evaluate the filtered callset

1. VQSR: filter a cohort callset with VariantRecalibrator & ApplyVQSR

This section outlines the VQSR filtering steps performed in the 1.1.1 version of the broad-prod-wgs-germline-snps-indels pipeline. Note the workflow hard-filters on the ExcessHet annotation before filtering with VQSR with the expectation that the callset represents many samples.

[A] Hard-filter a large cohort callset on ExcessHet using VariantFiltration ExcessHet filtering applies only to callsets with a large number of samples, e.g. hundreds of unrelated samples. Small cohorts should not trigger ExcessHet filtering as values should remain small. Note cohorts of consanguinous samples will inflate ExcessHet, and it is possible to limit the annotation to founders for such cohorts by providing a pedigree file during variant calling.

gatk --java-options "-Xmx3g -Xms3g" VariantFiltration \
-V cohort.vcf.gz \
--filter-expression "ExcessHet > 54.69" \
--filter-name ExcessHet \
-O cohort_excesshet.vcf.gz 

This produces a VCF callset where any record with ExcessHet greater than 54.69 is filtered with the ExcessHet label in the FILTER column. The phred-scaled 54.69 corresponds to a z-score of -4.5. If a record lacks the ExcessHet annotation, it will pass filters.

[B] Create sites-only VCF with MakeSitesOnlyVcf Site-level filtering requires only site-level annotations. We can speed up the analysis in the modeling step by using a VCF that drops sample-level columns.

gatk MakeSitesOnlyVcf \
-I cohort_excesshet.vcf.gz \
-O cohort_sitesonly.vcf.gz

This produces a VCF that retains the first eight-columns.

[C] Calculate VQSLOD tranches for indels using VariantRecalibrator All of the population resource files are publically available at gs://broad-references/hg38/v0. The parameters in this article reflect those in the 1.1.1 version of the broad-prod-wgs-germline-snps-indels pipeline and are thus tuned for WGS samples. For recommendations specific to exome samples, reasons why SNPs versus indels require different filtering and additional discussion of training sets and arguments, see Article#1259. For example, the article states:

[For filtering indels, m]ost annotations related to mapping quality have been removed since there is a conflation with the length of an indel in a read and the degradation in mapping quality that is assigned to the read by the aligner. This covariation is not necessarily indicative of being an error in the same way that it is for SNPs.

gatk --java-options "-Xmx24g -Xms24g" VariantRecalibrator \
-V cohort_sitesonly.vcf.gz \
--trust-all-polymorphic \
-tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 97.0 -tranche 96.0 -tranche 95.0 -tranche 94.0 -tranche 93.5 -tranche 93.0 -tranche 92.0 -tranche 91.0 -tranche 90.0 \
-an FS -an ReadPosRankSum -an MQRankSum -an QD -an SOR -an DP \      
-mode INDEL \
--max-gaussians 4 \
-resource mills,known=false,training=true,truth=true,prior=12:Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
-resource axiomPoly,known=false,training=true,truth=false,prior=10:Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz \
-resource dbsnp,known=true,training=false,truth=false,prior=2:Homo_sapiens_assembly38.dbsnp138.vcf \
-O cohort_indels.recal \
--tranches-file cohort_indels.tranches

The --max-gaussians parameter sets the expected number of clusters in modeling. If a dataset gives fewer distinct clusters, e.g. as can happen for smaller data, then the tool will tell you there is insufficient data with a No data found error message. In this case, try decrementing the --max-gaussians value.

[D] Calculate VQSLOD tranches for SNPs using VariantRecalibrator

gatk --java-options "-Xmx3g -Xms3g" VariantRecalibrator \
-V cohort_sitesonly.vcf.gz \
--trust-all-polymorphic \
-tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.8 -tranche 99.6 -tranche 99.5 -tranche 99.4 -tranche 99.3 -tranche 99.0 -tranche 98.0 -tranche 97.0 -tranche 90.0 \
-an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an SOR -an DP \
-mode SNP \
--max-gaussians 6 \
-resource hapmap,known=false,training=true,truth=true,prior=15:hapmap_3.3.hg38.vcf.gz \
-resource omni,known=false,training=true,truth=true,prior=12:1000G_omni2.5.hg38.vcf.gz \
-resource 1000G,known=false,training=true,truth=false,prior=10:1000G_phase1.snps.high_confidence.hg38.vcf.gz \
-resource dbsnp,known=true,training=false,truth=false,prior=7:Homo_sapiens_assembly38.dbsnp138.vcf \
-O cohort_snps.recal \
--tranches-file cohort_snps.tranches

Each step, C and D, produces a .recal recalibration table and a .tranches tranches table. In the filtering step, ApplyVQSR will use both types of data.

  • To additionally produce the optional tranche plot, specify the --rscript-file parameter. See the VariantRecalibrator tool documentation for details and this discussion thread for an example plot.
  • For allele-specific recalibration of an allele-specific callset, a beta feature as of this writing, add the -AS parameter.

☞ 1.1 How can I parallelize VQSR?

For cohorts with more than 10,000 WGS samples, it is possible to break down the analysis across genomic regions for parallel processing. The 1.1.1 version of the broad-prod-wgs-germline-snps-indels pipeline does so first by increasing --java-options to "-Xmx100g -Xms100g" and second by add the following parameters to the command to subsample variants and to produce a file of the VQSR model.

--sample-every-Nth-variant 10 \
--output-model ${model_report_filename} \

The pipeline then applies the resulting model to each genomic interval with the same parameters as above with two additions. It provides the resulting model report to VariantRecalibrator with --input-model and specifies the flag --output-tranches-for-scatter. The pipeline then collates the resulting per-interval tranches with GatherTranches. Refer to the pipeline script for implementation details.

Successively apply the indel and SNP recalibrations to the full callset that has already undergone ExcessHet filtering.

[E] Filter indels on VQSLOD using ApplyVQSR

gatk --java-options "-Xmx5g -Xms5g" \
ApplyVQSR \
-V cohort_excesshet.vcf.gz \
--recal-file cohort_indels.recal \
--tranches-file cohort_indels.tranches \
--truth-sensitivity-filter-level 99.7 \
--create-output-variant-index true \
-mode INDEL \
-O indel.recalibrated.vcf.gz

This produces an indel-filtered callset. At this point, SNP-type variants remain unfiltered.

[F] Filter SNPs on VQSLOD using ApplyVQSR

gatk --java-options "-Xmx5g -Xms5g" \
ApplyVQSR \
-V indel.recalibrated.vcf.gz \
--recal-file ${snps_recalibration} \
--tranches-file ${snps_tranches} \
--truth-sensitivity-filter-level 99.7 \
--create-output-variant-index true \
-mode SNP \
-O snp.recalibrated.vcf.gz \

This produces a SNP-filtered callset. Given the indel-filtered callset, this results in the final filtered callset.

back to top

2. Hard filter a cohort callset with VariantFiltration

This section of the tutorial provides generic hard-filtering thresholds and example commands for site-level manual filtering. A typical scenario requiring manual filtration is small cohort callsets, e.g. less than thirty exomes. See the GATK3 hard filtering Tutorial#2806 for additional discussion.

Researchers are expected to fine-tune hard-filtering thresholds for their data. Towards gauging the relative informativeness of specific variant annotations, the GATK hands-on hard-filtering workshop tutorial demonstrates how to plot distributions of annotation values for variant calls stratified against a truthset.

As with VQSR, hard-filter SNPs and indels separately. As of this writing, SelectVariants subsets SNP-only records, indel-only records or mixed-type, i.e. SNP and indel alternate alleles in the same record, separately. Therefore, when subsetting to SNP-only or indel-only records, mixed-type records are excluded. See this GitHub ticket for the status of a feature request to apply VariantFiltration directly on types of variants.

To avoid the loss of mixed-type variants, break up the multiallelic records into biallelic records before proceeding with the following subsetting. Alternatively, to process mixed-type variants with indel filtering thresholds similar to VQSR, add -select-type MIXED to the second command [B].

[A] Subset to SNPs-only callset with SelectVariants

gatk SelectVariants \
-V cohort.vcf.gz \
-select-type SNP \
-O snps.vcf.gz

This produces a VCF with records with SNP-type variants only.

[B] Subset to indels-only callset with SelectVariants

gatk SelectVariants \
-V cohort.vcf.gz \
-select-type INDEL \
-O indels.vcf.gz

This produces a VCF with records with indel-type variants only.

[C] Hard-filter SNPs on multiple expressions using VariantFiltration The GATK does not recommend use of compound filtering expressions, e.g. the logical || "OR". For such expressions, if a record is null for or missing a particular annotation in the expression, the tool negates the entire compound expression and so automatically passes the variant record even if it fails on one of the expressions. See this issue ticket for details.

Provide each expression separately with the -filter parameter followed by the –-filter-name. The tool evaluates each expression independently. Here we show basic filtering thresholds researchers may find useful to start.

gatk VariantFiltration \
-V snps.vcf.gz \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "SOR > 3.0" --filter-name "SOR3" \
-filter "FS > 60.0" --filter-name "FS60" \
-filter "MQ < 40.0" --filter-name "MQ40" \
-filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
-filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
-O snps_filtered.vcf.gz

This produces a VCF with the same variant records now annotated with filter status. Specifically, if a record passes all the filters, it receives a PASS label in the FILTER column. A record that fails a filter receives the filter name in the FILTER column, e.g. SOR3. If a record fails multiple filters, then each failing filter name appears in the FILTER column separated by semi-colons ;, e.g. MQRankSum-12.5;ReadPosRankSum-8.

[D] Similarly, hard-filter indels on multiple expressions using VariantFiltration

gatk VariantFiltration \ 
-V indels.vcf.gz \ 
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "FS > 200.0" --filter-name "FS200" \
-filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \ 
-O indels_filtered.vcf.gz

This produces a VCF with the same variant records annotated with filter names for failing records. At this point, consider merging the separate callsets together. Select comments follow.

  • RankSum annotations can only be calculated for REF/ALT heterozygous sites and therefore will be absent from records that do not present read counts towards heterozygous genotypes.
  • By default, GATK HaplotypeCaller and GenotypeGVCFs do not emit variants with QUAL < 10. The --standard-min-confidence-threshold-for-calling (-stand-call-conf) parameter adjusts this threshold. GATK recommends filtering variants with QUAL less than 30. The lower default QUAL threshold of the callers allows for more negative training data in VQSR filtering.
  • When providing filtering thresholds, the tool expects the value to match the type specified in the ##INFO lines of the callset. For example, an Integer type is a whole number without decimals, e.g. 0, and a Float type is a number with decimals, e.g. 0.0. If the expected type mismatches, the tool will give a java.lang.NumberFormatException error.
  • If a filter expression is misspelled, the tool does not give a warning, so be sure to carefully review filter expressions for correctness.

back to top

3. Evaluate the filtered callset

Filtering is about balancing sensitivity and precision for research aims. ​For example, genome-wide association studies can afford to maximize sensitivity over precision such that there are more false positives in the callset. Conversely, downstream analyses that require high precision, e.g. those that cannot tolerate false positive calls because validating variants is expensive, maximize precision over sensitivity such that the callset loses true positives.

Two tools enable site-level evaluation--CollectVariantCallingMetrics and VariantEval. Another tool, GenotypeConcordance, measures sample-level genotype concordance and is not covered here. For an overview of all three tools, see Article#6308.

Compare callset against a known population callset using CollectVariantCallingMetrics

gatk CollectVariantCallingMetrics \
-I filtered.vcf.gz \
--DBSNP Homo_sapiens_assembly38.dbsnp138.vcf \
-SD Homo_sapiens_assembly38.dict \
-O metrics 

This produces detailed and summary metrics report files. The summary metrics provide cohort-level variant metrics and the detailed metrics segment variant metrics for each sample in the callset. The detail metrics give the same metrics as the summary metrics for the samples plus several additional metrics. These are explained in detail at https://broadinstitute.github.io/picard/picard-metric-definitions.html.

Compare callset against a known population callset using VariantEval As of this writing, VariantEval is in beta status in GATK v4.1. And so we provide an example GATK3 command, where the tool is in production status. GATK3 Dockers are available at https://hub.docker.com/r/broadinstitute/gatk3.

java -jar gatk3.jar \
-T VariantEval \
-R Homo_sapiens_assembly38.fasta \
-eval cohort.vcf.gz \
-D Homo_sapiens_assembly38.dbsnp138.vcf \
-noEV \
-EV CompOverlap -EV IndelSummary -EV TiTvVariantEvaluator \
-EV CountVariants -EV MultiallelicSummary \
-o cohortEval.txt

This produces a file containing a table for each of the evaluation modules, e.g. CompOverlap.

Please note the GA4GH (​Global Alliance for Genomics and Health​) recommends using ​hap.py​ for stratified variant evaluations (1, 2). One approach using hap.py​ wraps the ​vcfeval​ module of ​RTG-Tools​. The module accounts for differences in variant representation by matching variants mapped back to the reference.

back to top

Return to top Comment on this article