June 2016 - GATK 3.6
This document is an appendix to the GATK Tutorial :: Variant Discovery module worksheet. It contains a summary introduction to the scientific context of the tutorial.
The GATK Best Practices workflows provide step-by-step recommendations for performing variant discovery analysis in high-throughput sequencing (HTS) data. The following diagram illustrates the GATK Best Practices workflow for germline SNP and Indel discovery in whole genomes and exomes. It includes three phases: pre-processing, variant discovery, and callset refinement.
Figure 1: Best Practices workflow for germline SNP and Indel discovery in whole-genomes and exomes.
Pre-Processing starts from raw sequence data, either in FASTQ or uBAM format, and produces analysis-ready BAM files. Processing steps include alignment to a reference genome as well as some data cleanup operations to correct for technical biases and make the data suitable for analysis.
Variant Discovery starts from analysis-ready BAM files and produces a callset in VCF format. Processing involves identifying sites where one or more individuals display possible genomic variation, and applying filtering methods appropriate to the experimental design. The Best Practices version 3.x include key innovations that enable joint analysis of multiple samples in a way that is scalable and allows incremental processing of the sequencing data. Those innovations are the focus of this tutorial.
Callset Refinement starts and ends with a VCF callset. Processing involves using metadata such as previously validated callsets to assess and improve genotyping accuracy, attach additional information and evaluate the overall quality of the callset. Learn more about the GATK Best Practices here.
In this context, joint analysis means that we consider evidence from multiple samples in order to determine the genotype of each sample at each site, rather than looking at only one sample at a time in isolation. Considering evidence from multiple samples empowers variant discovery and allows us to detect variants with great sensitivity and genotype samples as accurately as possible. Specifically, we have determined that joint analysis conveys the following benefits:
There are specific data contexts in which performing joint analysis makes an especially important difference. Two such cases are illustrated below.
Figure 2: Two cases where joint analysis provides important information that improves either the genotype determination or the interpretation of results.
Left: Power of joint analysis in finding mutations at low coverage sites. The variant allele is present in only two of the N samples, in both cases with such low coverage that the variant is not callable when processed separately. Joint calling allows evidence to be accumulated over all samples and renders the variant callable.
Right: Importance of joint analysis to square off the genotype matrix, using an example of two disease-relevant variants. If we call these samples independently and produce a variants-only output, neither sample will have records for these two sites, for different reasons: the first sample is homozygous reference while the second sample has no data. Therefore, merging the results from single sample calling will incorrectly treat both of these samples identically as being non-informative.
Learn more about joint analysis here.
However, that method suffers from two major flaws: the N+1 problem and really bad scaling.
When you’re getting a large-ish number of samples sequenced (especially clinical samples), you typically get them in small batches over an extended period of time. In the past, this was handled by doing batch calling, i.e. analyze the samples in batches and combine the resulting VCF callsets as they become available. But that’s not a true joint analysis, and it doesn’t give you the same significant gains that calling variants jointly can yield (on top of producing batch effects). If you wanted to do a true joint analysis using the multisample variant calling approach, you have to re-call all samples from scratch every time you get even one new sample sequence. And the more you add samples, the more computationally intensive it gets, bringing us to the next problem: really bad scaling.
Calling variants jointly across samples scales very badly. This is because the calculations involved in variant calling (especially by sophisticated tools like the HaplotypeCaller that perform a graph assembly step) become exponentially more computationally costly as you add samples to the cohort. If you don't have a lot of compute available, you run into limitations very quickly. Even at Broad, where we have fairly ridiculous amounts of compute available, we can't brute-force our way through the numbers for the large cohort sizes that we're called on to handle like the 92,000 exomes of the ExAC dataset (see this page).
The good news is that you don’t actually have to call variants on all your samples together to perform a joint analysis. We have developed a workflow that allows us to decouple the initial identification of potential variant sites, i.e. the variant calling, from the genotyping step, which is the only part that really needs to be done jointly. Since GATK 3.0, you can use the HaplotypeCaller to call variants individually per-sample in a special mode invoked by adding -ERC GVCF to your command line, generating an intermediate file called a GVCF (for Genomic VCF). You then run a joint genotyping step on all the GVCF files generated for the samples in the cohort. This achieves what we call incremental joint discovery, providing you with all the benefits of classic joint calling (as described below) without the drawbacks.
Figure 4. The new approach to joint analysis allows incremental processing of samples and scales much better than the traditional approach of calling variants on all samples simultaneously.
This uniquely innovative workflow solves both the scaling problems and the N+1 problem that plague traditional methods of joint analysis.
From here on out we will refer to this single-sample calling + joint genotyping workflow as the GVCF workflow because it involves the intermediate GVCF file, which uniquely distinguishes it from other methods.