The GATK Best Practices provide step-by-step recommendations for performing variant discovery analysis in high-throughput sequencing (HTS) data. There are several different Best Practices workflows tailored to particular applications, depending on the type of variation of interest and the technology employed. The Best Practices documentation aims to describe in detail the key principles of the processing and analysis steps required to go from raw reads coming off the sequencing machine all the way to an appropriately filtered variant callset that can be used in downstream analyses. Wherever we can, we try to provide guidance regarding experimental design, quality control (QC) and pipeline implementation options, but please understand that these depend on many factors, including the sequencing technology and hardware infrastructure at your disposal, so you may need to adapt our recommendations to your specific situation.
Our Best Practices workflows are each tailored to a particular application (type of variation and experimental design), but overall they follow similar patterns, comprising two or three analysis phases depending on the application.

1. Pre-processing starts from raw sequence data, either in FASTQ or uBAM format, and produces analysis-ready BAM files. This involves alignment to a reference genome as well as some data cleanup operations to correct for technical biases and make the data suitable for analysis.

2. Variant Discovery starts from analysis-ready BAM files and produces a callset in VCF format. This involves identifying genomic variation in one or more individuals and applying filtering methods appropriate to the experimental design.

3. Callset Refinement (where applicable) starts and ends with a VCF callset. This involves using meta-data to assess and improve genotyping accuracy, attach additional information, and evaluate the overall quality of the callset.
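To make the phases concrete, here is a minimal, hypothetical sketch of the first two phases as a command sequence. It assumes GATK4-style tool invocations; all file names (ref.fasta, known.vcf.gz, the FASTQ inputs, the QD filter threshold) are placeholders, not prescribed resources or recommended values. By default it only prints each command (RUN=echo) rather than executing anything, since real runs require the tools, a reference, and known-sites resources to be in place:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Dry-run driver: set RUN= (empty) to actually execute the commands.
RUN=${RUN:-echo}

REF=ref.fasta        # placeholder reference genome
SAMPLE=sample1       # placeholder sample name

# (1) Pre-processing: align to the reference, mark duplicates,
#     then recalibrate base quality scores against known variants.
$RUN bwa mem -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA" "$REF" r1.fq.gz r2.fq.gz -o aligned.sam
$RUN gatk MarkDuplicates -I aligned.bam -O dedup.bam -M dup_metrics.txt
$RUN gatk BaseRecalibrator -I dedup.bam -R "$REF" --known-sites known.vcf.gz -O recal.table
$RUN gatk ApplyBQSR -I dedup.bam -R "$REF" --bqsr-recal-file recal.table -O analysis_ready.bam

# (2) Variant discovery: call variants, then apply filtering.
$RUN gatk HaplotypeCaller -I analysis_ready.bam -R "$REF" -O raw.vcf.gz
$RUN gatk VariantFiltration -V raw.vcf.gz --filter-name QD2 --filter-expression "QD < 2.0" -O filtered.vcf.gz
```

Note that this omits intermediate sorting/indexing steps and the whole Callset Refinement phase; it is meant only to show the shape of the pipeline, not to replace the per-step documentation.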
We develop and validate these workflows in collaboration with many investigators within the Broad Institute's network of affiliated institutions. They are deployed at scale in the Broad's production pipelines -- a very large scale indeed. So as a general rule, the command-line arguments and parameters given in the documentation are meant to be broadly applicable (so to speak). However, our testing focuses largely on data from human whole-genome or whole-exome samples sequenced with Illumina technology, so if you are working with different types of data, organisms or experimental designs, you may need to adapt certain branches of the workflow, as well as certain parameter selections and values. In addition, several key steps make use of external resources, including validated databases of known variants. If there are few or no such resources available for your organism, you may need to bootstrap your own or use alternative methods. We have documented useful methods for doing this wherever possible, but be aware that some issues currently remain without a good solution.
Not that other use cases are bad -- they're just not use cases that we cover. The canonical Best Practices workflows (as run in production at the Broad) are designed specifically for human genome research and are optimized for the instrumentation (overwhelmingly Illumina) and needs of the Broad Institute sequencing facility. They can be adapted for the analysis of non-human organisms of all kinds, including non-diploids, and of different data types, with varying degrees of effort depending on how divergent the use case and data type are. However, any workflow that has been significantly adapted or customized, whether for performance reasons or to fit a use case that we do not explicitly cover, should not be called "GATK Best Practices". The correct way to refer to such workflows is as "based on" or "adapted from" GATK Best Practices. When in doubt about whether a particular customization constitutes a significant divergence, feel free to ask us in the forum.
If someone hands you a script and tells you "this runs the GATK Best Practices", start by asking what version of GATK it uses, when it was written, and what key steps it includes. Both our software and our usage recommendations evolve in step with the rapid pace of technological and methodological innovation in the field of genomics, so what was Best Practice last year (let alone in 2010) may no longer be applicable. And even if all the steps seem to be in accordance with our docs (same tools in the same order), you should still check every single argument in the commands. If anything is unfamiliar to you, figure out what it does. If you can't find it in the docs, ask us in the forum. It's one or two hours of your life that can save you days of troubleshooting, so please protect yourself by being thorough.