GATK, pronounced "Gee Ay Tee Kay" (not "Gat-Kay"), stands for Genome Analysis Toolkit.
It is a collection of command-line tools for analyzing high-throughput sequencing (HTS) data in formats such as SAM/BAM/CRAM and VCF, with a focus on variant discovery. The relevant file formats are defined in the hts-specs repository; see especially the SAM specification and the VCF specification.
The following instructions provide the minimum requirements for getting started with GATK. Additional instructions are provided elsewhere for installing software required
The GATK command-line tools are provided as a single executable jar file. You can download a bzipped package containing the jar file from the Download page. The file name will be of the format GenomeAnalysisTK-x.y-z.tar.bz2. You will need to register for a free account on the forum and accept the licensing terms in order to access the software download.
There is no installation necessary in the traditional sense, since we provide a precompiled jar file that should work on any POSIX platform (NOT Microsoft Windows!) equipped with the appropriate version of Java (see below). You'll simply need to open the downloaded package and place the folder containing the GenomeAnalysisTK.jar file in a convenient directory on your hard drive (or server). Unlike C-compiled programs such as Samtools, GATK cannot simply be added to your PATH, so we recommend setting up an environment variable to act as a shortcut.
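As a minimal sketch of the environment-variable shortcut mentioned above, the snippet below sets $GATK to the full path of the jar file. The directory name and the GATK_HOME variable are assumptions made for illustration; point them at wherever you actually placed the unpacked folder.

```shell
# Hypothetical location (an assumption): the folder you unpacked from the download.
GATK_HOME="${GATK_HOME:-$HOME/tools/GenomeAnalysisTK}"

# The shortcut used in this guide: $GATK holds the full path to the jar file.
export GATK="$GATK_HOME/GenomeAnalysisTK.jar"

# Warn rather than fail if the jar is not where we expect it.
if [ ! -f "$GATK" ]; then
    echo "warning: no jar file found at $GATK" >&2
fi
```

To keep the shortcut across terminal sessions, the export line can be added to your shell startup file (for example ~/.bashrc, if you use bash).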
For the tools to run properly, you must have Oracle Java 1.8 installed. To check your Java version, open your terminal application and run the following command:
java -version
If the output looks something like java version "1.8.x", you are good to go. If not, you may need to change your version; see the Oracle Java website to download an appropriate JDK. Note also that OpenJDK is NOT supported.
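If you want to script this check, something like the following sketch may help. The helper name is_java_1_8 is made up for this example, and the pattern assumes the Oracle banner begins with java version "1.8. while OpenJDK builds begin with openjdk version, so the latter are rejected.

```shell
# Hypothetical helper: succeed only if the banner looks like Oracle Java 1.8.x.
is_java_1_8() {
    case "$1" in
        'java version "1.8.'*) return 0 ;;
        *) return 1 ;;
    esac
}

# `java -version` writes its banner to stderr, hence the 2>&1 redirect.
if command -v java >/dev/null 2>&1; then
    banner="$(java -version 2>&1 | head -n 1)"
    if is_java_1_8 "$banner"; then
        echo "Java 1.8 detected: $banner"
    else
        echo "Unexpected Java version: $banner" >&2
    fi
else
    echo "java not found on PATH" >&2
fi
```

The check only reads the first line of the banner, so treat a negative result as a prompt to inspect the output of java -version yourself rather than as a definitive answer.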
To test that you can run GATK tools, run the following command in your terminal application, providing either the full path to the GenomeAnalysisTK.jar file:
java -jar /path/to/GenomeAnalysisTK.jar -h
or the environment variable that you set up as a shortcut (here we are using $GATK):
java -jar $GATK -h
You should see a complete list of all the tools in the GATK toolkit. If you don't, read on to the section on getting help further below.
The tools, which are all listed in the Tool Documentation section, are invoked as follows:
java jvm-args -jar GenomeAnalysisTK.jar -R reference.fasta -T GATKToolName -OPTION1 value1 -OPTION2 value2 ...
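To make the template above concrete, here is a sketch that fills it in for HaplotypeCaller. The file names, the -Xmx4g memory setting, the run_gatk wrapper, and the DRY_RUN switch are all assumptions chosen for illustration. By default the wrapper only prints the command so that you can inspect it before running anything.

```shell
# Hypothetical file names; substitute your own reference, input, and output.
REF="reference.fasta"
BAM="sample.bam"
VCF="sample.vcf"
JAR="${GATK:-GenomeAnalysisTK.jar}"

# Hypothetical wrapper: print the command unless DRY_RUN is set to 0.
run_gatk() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "java $*"
    else
        java "$@"
    fi
}

# jvm-args (-Xmx4g), then the jar, the tool name (-T), the reference (-R),
# and tool-specific options (-I input, -o output).
run_gatk -Xmx4g -jar "$JAR" -T HaplotypeCaller -R "$REF" -I "$BAM" -o "$VCF"
```

Setting DRY_RUN=0 in the environment makes the wrapper execute the command instead of printing it.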
The GATK Best Practices are workflows that provide step-by-step recommendations for performing variant discovery analysis in high-throughput sequencing (HTS) data. We have several such workflows tailored to project aims (by type of variants of interest) and experimental designs (by type of sequencing approach). Although they were originally designed for human genome research, the GATK Best Practices can be adapted for analysis of non-human organisms of all kinds, including non-diploids.
You may notice we have a lot of documentation -- to the point that it can be overwhelming to newcomers. So in addition to going through the Best Practices, be sure to explore the documentation categories listed in the left-side menu to at least get a sense of the topics covered. This is guaranteed to save you time and effort down the road.
Most of the work involved in processing sequence data and performing variant discovery can be automated in the form of pipeline scripts, which often include some form of parallelization to speed up execution. There are two preferred options for writing and running GATK pipelines. The traditional option is Queue, a companion utility that is tightly integrated with GATK and optimized for execution on local computing clusters. The newer option is Cromwell + WDL, a pipelining solution that was recently developed at Broad to overhaul our production pipelines, optimized for cloud-based computing infrastructure. See here for a side-by-side comparison.
We provide all support through our very active community forum. You can ask questions and report problems that you might encounter while using GATK and related tools such as Picard (for source code-related questions, post an issue on GitHub instead), with the following guidelines:
Before posting to the Forum, please do the following:
When asking a question about a problem, please include the following:
We typically get back to you with a response within one or two business days, but be aware that more complex issues (or unclear reports) may take longer to address.
Note that the information in this documentation guide is targeted at end-users. For developers, the source code and related resources are available on GitHub.