Getting Started
All you need to start using GATK today


GATK, pronounced "Gee Ay Tee Kay" (not "Gat-Kay"), stands for GenomeAnalysisToolkit.

It is a collection of command-line tools for analyzing high-throughput sequencing (HTS) data in formats such as SAM/BAM/CRAM and VCF, with a focus on variant discovery. The relevant file formats are defined in the hts-specs repository; see especially the SAM specification and the VCF specification.

The following instructions provide the minimum requirements for getting started with GATK. Additional instructions are provided elsewhere for installing software required to run GATK Best Practices workflows and to attend a hands-on GATK workshop. For information about the complete analysis workflows we have developed for variant discovery, see the Best Practices documentation.


Download the software

The GATK command-line tools are provided as a single executable jar file. You can download a bzipped package containing the jar file from the Download page. The file name will be of the format GenomeAnalysisTK-x.y-z.tar.bz2. You will need to register for a free account on the forum and accept the licensing terms in order to access the software download.


Install it

There is no installation necessary in the traditional sense, since we provide a precompiled jar file that should work on any POSIX platform (NOT Microsoft Windows!) equipped with the appropriate version of Java (see below). You'll simply need to open the downloaded package and place the folder containing the GenomeAnalysisTK.jar file in a convenient directory on your hard drive (or server). Unlike C-compiled programs such as Samtools, GATK cannot simply be added to your PATH, so we recommend setting up an environment variable to act as a shortcut.
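
As a minimal sketch of the shortcut mentioned above, the lines below define a GATK environment variable that points at the jar file. The directory $HOME/tools/gatk is a hypothetical location chosen for illustration; substitute the folder where you actually placed the jar.

```shell
# Assumption: the folder containing the jar was placed in $HOME/tools/gatk
# (a hypothetical location; adjust to match your system).
export GATK="$HOME/tools/gatk/GenomeAnalysisTK.jar"

# Warn early if the variable does not point at an existing file.
if [ ! -f "$GATK" ]; then
    echo "warning: $GATK not found; adjust the path" >&2
fi
```

To make the shortcut available in every new terminal session, you can add the export line to your shell startup file (for example ~/.bashrc if you use bash).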

Important note about Java version

For the tools to run properly, you must have Oracle Java 1.8 installed. To check your Java version, open your terminal application and run the following command:

java -version

If the output looks something like java version "1.8.x", you are good to go. If not, you may need to change your version; see the Oracle Java website to download an appropriate JDK. Note also that OpenJDK is NOT supported.
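
If you want to perform this check inside a setup script, something like the following sketch may help. The function name is_java_18 is our own invention for illustration, and the pattern assumes the version line has the form java version "1.8.x" described above; it will therefore reject output that starts with openjdk version.

```shell
# Hypothetical helper: succeeds only if the text contains a line of the
# form: java version "1.8.x"
is_java_18() {
    case "$1" in
        *'java version "1.8.'*) return 0 ;;
        *) return 1 ;;
    esac
}

# `java -version` writes to stderr, so redirect it in order to capture the text.
if command -v java >/dev/null 2>&1; then
    java_output=$(java -version 2>&1 || true)
    if is_java_18 "$java_output"; then
        echo "Java 1.8 detected"
    else
        echo "Java 1.8 not detected; found: $java_output" >&2
    fi
else
    echo "java not found on PATH" >&2
fi
```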


Test that it works

To test that you can run GATK tools, run the following command in your terminal application, providing either the full path to the GenomeAnalysisTK.jar file:

java -jar /path/to/GenomeAnalysisTK.jar -h

or the environment variable that you set up as a shortcut (here we are using $GATK):

java -jar $GATK -h

You should see a complete list of all the tools in GATK. If you don't, read on to the section on getting help further below.


Use GATK tools

The tools, which are all listed in the Tool Documentation section, are invoked as follows:

java jvm-args -jar GenomeAnalysisTK.jar -R reference.fasta -T GATKToolName -OPTION1 value1 -OPTION2 value2 ...

See the FAQ article on GATK command syntax for more details, as well as the Tool Documentation for standard options and a complete list of tools with usage recommendations, options, and example commands.
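
To make the general syntax above more concrete, here is a small dry-run sketch that assembles a command line from its parts and prints it without running anything. The function name build_gatk_command and the file names reference.fasta, sample.bam and raw_variants.vcf are placeholders for illustration; -Xmx4g is just one example of a JVM argument, and HaplotypeCaller with -I and -o is used only as a sample tool invocation. Check the Tool Documentation for the options each tool really expects.

```shell
# Placeholder settings; adjust them to match your own files and memory needs.
JVM_ARGS="-Xmx4g"
GATK_JAR="GenomeAnalysisTK.jar"
REFERENCE="reference.fasta"

# Hypothetical helper: prints the command for a given tool and its options.
# The first argument is the tool name; the rest are passed through unchanged.
build_gatk_command() {
    tool=$1
    shift
    printf 'java %s -jar %s -R %s -T %s' "$JVM_ARGS" "$GATK_JAR" "$REFERENCE" "$tool"
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}

# Dry run: show the command that would call variants on one BAM file.
build_gatk_command HaplotypeCaller -I sample.bam -o raw_variants.vcf
```

Printing the command first makes it easy to confirm that the JVM arguments come before -jar and that the tool options come after the tool name, which is the order the syntax above requires.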


Follow the Best Practices

The GATK Best Practices are workflows that provide step-by-step recommendations for performing variant discovery analysis in high-throughput sequencing (HTS) data. We have several such workflows tailored to project aims (by type of variants of interest) and experimental designs (by type of sequencing approach). And although they were originally designed for human genome research, the GATK Best Practices can be adapted for analysis of non-human organisms of all kinds, including non-diploids.

You may notice we have a lot of documentation -- to the point that it can be overwhelming to newcomers. So in addition to going through the Best Practices, be sure to explore the documentation categories listed in the left-side menu to at least get a sense of the topics covered. This is guaranteed to save you time and effort down the road.


Set up pipelines

Most of the work involved in processing sequence data and performing variant discovery can be automated in the form of pipeline scripts, which often include some form of parallelization to speed up execution. There are two preferred options for writing and running GATK pipelines. The traditional option is Queue, a companion utility that is tightly integrated with GATK and optimized for execution on local computing clusters. The newer option is Cromwell + WDL, a pipelining solution that was recently developed at Broad to overhaul our production pipelines, optimized for cloud-based computing infrastructure. See here for a side-by-side comparison.


Get Help

We provide all support through our very active community forum, where you can ask questions and report problems that you encounter while using GATK and related tools such as Picard. For questions about the source code, post an issue on GitHub instead. Please follow these guidelines:

Before Asking For Help

Before posting to the Forum, please do the following:

  • Try the latest version of the software.
  • See if your problem is covered in the Frequently Asked Questions or elsewhere in our (fairly extensive) documentation.
  • Search the GATK Forums to see if a similar problem has previously been discussed there (search box in top right corner).
  • Validate your input files (if applicable). Attempt to resolve or at least understand any problems reported.

When Asking For Help

When asking a question about a problem, please include the following:

  • Version(s) of GATK you tried with
  • Command line(s) you ran
  • Program console output and metrics files (repetitive console output may be abbreviated)
  • Entire stack trace if one was produced
  • Version of JVM you are using (obtained by running java -version)
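
One possible way to collect most of the items listed above in a single file is sketched below. The function run_and_log is a hypothetical helper, not part of GATK: it records the command line, the output of java -version when java is available, all console output (including any stack trace written to stderr), and the exit status.

```shell
# Hypothetical helper: run a command and save the details a support
# request usually needs into the given log file.
run_and_log() {
    log_file=$1
    shift
    status=0
    {
        echo "Command: $*"
        # Record the JVM version if java is on the PATH.
        if command -v java >/dev/null 2>&1; then
            java -version 2>&1 || true
        fi
    } > "$log_file"
    # Append stdout and stderr so that stack traces are captured too.
    "$@" >> "$log_file" 2>&1 || status=$?
    echo "Exit status: $status" >> "$log_file"
    return "$status"
}
```

You would call it with a log file name followed by the full command you want to run, then attach or paste the resulting log when you post your question. The GATK version you used still needs to be stated separately.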

We typically get back to you with a response within one or two business days, but be aware that more complex issues (or unclear reports) may take longer to address.


Note that the information in this documentation guide is targeted at end-users. For developers, the source code and related resources are available on GitHub.