GATK, pronounced "Gee Ay Tee Kay" (not "Gat-Kay"), stands for Genome Analysis Toolkit. It is a collection of command-line tools for analyzing high-throughput sequencing data with a primary focus on variant discovery. The tools can be used individually or chained together into complete workflows. We provide end-to-end workflows, called GATK Best Practices, tailored for specific use cases.
Starting with version 4.0, GATK contains a copy of the Picard toolkit, so all Picard tools are available from within GATK itself and their documentation is available in the Tool Documentation section of this website.
Invoke GATK through the gatk wrapper script rather than calling either jar directly. The basic syntax is
gatk [--java-options "-Xmx4G"] ToolName [GATK args]; full details here.
Most GATK4 tools have fairly simple software requirements: a Unix-style OS and Java 1.8. However, a subset of tools have additional R and/or Python dependencies. These dependencies (as well as the base system requirements) are described in detail here. Because those extra dependencies can be tedious to set up, we strongly recommend using the Docker container system, if that's an option on your infrastructure, rather than a custom installation. All released versions of GATK4 can be found as prepackaged container images on Docker Hub here. If you can't use Docker, do yourself a favor and use the Conda environment that we provide to manage dependencies, as described in the GitHub repository README.
You will also need Python 2.6 or greater to run the gatk wrapper script (described below).
If you run into difficulties with the Java version requirement, see this article for help.
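As a rough way to confirm the base requirements above before going further, the following sketch reports whether `java` and `python` can be found on your PATH. The helper name `check_tool` is made up for this illustration, and the check only establishes that the commands exist; it does not verify versions, which you would still confirm yourself (for example with `java -version`).

```shell
#!/bin/sh
# check_tool is a hypothetical helper for this example, not part of GATK.
# It prints whether a command can be resolved on PATH and returns
# non-zero when it cannot.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Base requirements named in this guide: Java 1.8 and Python 2.6 or greater.
check_tool java   || echo "Install Java 1.8 before running GATK4."
check_tool python || echo "The gatk wrapper script needs Python 2.6 or greater."
```

If either line reports "missing", sort that out first; the tool-specific R and Python dependencies are a separate matter and are best handled through Docker or the Conda environment.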
You can download the GATK package here OR get the Docker image here. The instructions below will assume you downloaded the GATK package to your local machine and are planning to run it directly. For instructions on how to go the Docker route, see this tutorial.
Once you have downloaded and unzipped the package (named gatk-[version]), you will find four files inside the resulting directory:
gatk
gatk-package-[version]-local.jar
gatk-package-[version]-spark.jar
README.md
Now you may ask, why are there two jars? As the names suggest, gatk-package-[version]-spark.jar is the jar for running Spark tools on a Spark cluster, while gatk-package-[version]-local.jar is the jar that is used for everything else (including running Spark tools "locally", i.e. on a regular server or cluster).
So does that mean you have to specify which one you want to run each time? Nope! See the gatk file in there? That's an executable wrapper script that you invoke and that will choose the appropriate jar for you based on the rest of your command line. You could still invoke a specific jar if you wanted, but using gatk is easier, and it will also take care of setting some parameters that you would otherwise have to specify manually.
There is no installation necessary in the traditional sense, since the precompiled jar files should work on any POSIX platform that satisfies the requirements listed above. You'll simply need to open the downloaded package and place the folder containing the jar files and launch script in a convenient directory on your hard drive (or server filesystem). Although the jars themselves cannot simply be added to your PATH, you can do so with the gatk wrapper script. Please look up instructions depending on the terminal shell you use; in bash the typical syntax is
export PATH=$PATH:/path/to/gatk-package/
where /path/to/gatk-package/ is the path to the directory that contains the gatk executable (PATH entries are directories, not files). Note that the jars must remain in the same directory as gatk for it to work.
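A minimal sketch of that PATH setup follows, assuming the package was unzipped to a hypothetical `$HOME/tools/gatk-package` directory (substitute your own location). Since PATH entries are directories, the entry added is the folder that holds the gatk script, and the guard avoids appending it twice if the snippet is sourced repeatedly.

```shell
# Hypothetical install location; replace with wherever you unzipped the package.
GATK_DIR="$HOME/tools/gatk-package"

# Append the directory to PATH only if it is not already listed.
case ":$PATH:" in
  *":$GATK_DIR:"*) ;;
  *) PATH="$PATH:$GATK_DIR"; export PATH ;;
esac

# Report whether the shell can now resolve the wrapper script.
command -v gatk || echo "gatk not found; check that $GATK_DIR is correct"
```

To make the change persistent in bash, you would typically put the same lines in your `~/.bashrc` or `~/.bash_profile`.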
To test that you can successfully invoke the GATK, run the following command in your terminal application. Here we assume that you have added gatk to your PATH as recommended above:
gatk --help
This should output a summary of the invocation syntax, options for listing tools and invoking a specific tool's help documentation, and main Spark options if applicable.
Available tools are listed and described in some detail in the Tool Documentation section, along with available options. The basic syntax for invoking any GATK or Picard tool is the following:
gatk [--java-options "jvm args like -Xmx4G go here"] ToolName [GATK args go here]
So for example, a simple GATK command would look like:
gatk --java-options "-Xmx8G" HaplotypeCaller -R reference.fasta -I input.bam -O output.vcf
You can find more information about GATK command-line syntax here.
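When the same command has to be applied to several inputs, a small loop can generate one invocation per file. The sketch below only prints the commands so you can review them before running anything; the helper name `hc_command` and the sample file names are invented for illustration, and the memory setting is simply copied from the example above.

```shell
# hc_command is a hypothetical helper: it prints (does not run) a
# HaplotypeCaller command for one BAM file, naming the output VCF
# after the BAM's base name.
hc_command() {
  bam=$1
  ref=$2
  sample=$(basename "$bam" .bam)
  printf 'gatk --java-options "-Xmx8G" HaplotypeCaller -R %s -I %s -O %s.vcf\n' \
    "$ref" "$bam" "$sample"
}

# Placeholder input names; substitute your own BAM files and reference.
for bam in sampleA.bam sampleB.bam; do
  hc_command "$bam" reference.fasta
done
```

Once the printed commands look right, you could pipe the output to `sh` or replace the `printf` with the command itself. For anything beyond a handful of samples, the WDL workflows described further down are the better fit.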
When used from within GATK, all Picard tools use the same syntax as GATK. The conversion relative to the "Picard-style" syntax is very straightforward; wherever you used to do e.g. I=input.bam, you now do -I input.bam. So for example, a simple Picard command would look like:
gatk ValidateSamFile -I input.bam -MODE SUMMARY
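To make the conversion rule concrete, here is a small sketch that rewrites Picard-style `KEY=value` arguments into the `-KEY value` form described above. The function name `picard_to_gatk_args` is hypothetical, and the sketch handles only plain `KEY=value` pairs with no spaces in the values; it is meant to illustrate the mapping, not to serve as a general-purpose converter.

```shell
# Hypothetical helper: turn each KEY=value argument into "-KEY value"
# and pass anything else through unchanged.
picard_to_gatk_args() {
  out=""
  for arg in "$@"; do
    case "$arg" in
      ?*=*) out="$out -${arg%%=*} ${arg#*=}" ;;  # split at the first "="
      *)    out="$out $arg" ;;
    esac
  done
  printf '%s\n' "${out# }"  # drop the leading space
}

# Rewrites the arguments of an old-style ValidateSamFile call.
picard_to_gatk_args I=input.bam MODE=SUMMARY
```

The result is the argument list used in the ValidateSamFile example above, `-I input.bam -MODE SUMMARY`.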
The GATK Best Practices are end-to-end workflows that are meant to provide step-by-step recommendations for performing variant discovery analysis in high-throughput sequencing (HTS) data. We have several such workflows tailored to project aims (by type of variants of interest) and experimental designs (by type of sequencing approach). And although they were originally designed for human genome research, the GATK Best Practices can be adapted for analysis of non-human organisms of all kinds, including non-diploids.
The documentation for the Best Practices includes high-level descriptions of the processes involved, various types of documents that explain deeper details and adaptations that can be made depending on constraints and use cases, a set of actual pipeline implementations of these recommendations, and, perhaps most important, workshop materials including slide decks, videos, and tutorials that walk you through every step.
Most of the work involved in processing sequence data and performing variant discovery can be automated in the form of pipeline scripts, which often include some form of parallelization to speed up execution. We provide scripted implementations of the GATK Best Practices workflows plus some additional helper/accessory scripts in order to make it easier for everyone to run these sometimes rather complex workflows.
These workflows are written in WDL and intended to be run on any platform that supports WDL execution. Options are listed in the Pipelining section of the User Guide. Our preferred option is the Cromwell execution engine, which, like GATK, is developed by the Broad's Data Sciences Platform (DSP) and is available as a service on our cloud-based portal, FireCloud. Note that if you choose to run GATK workflows through FireCloud, you don't really need to do any of the above, since everything is already preloaded in a ready-to-run form (the software, the scripts, even some example data). At this point FireCloud is the easiest way to run the workflows exactly as we do in our own work.
We provide all support through our very active community forum. You can ask questions and report any problems that you might encounter, with the following guidelines:
Before posting to the Forum, please do the following:
When asking a question about a problem, please include the following:
We will typically get back to you with a response within one or two business days, but be aware that more complex issues (or unclear reports) may take longer to address. In addition, some times of the year are especially busy for us and we may take longer than usual to answer your question.
We may ask you to submit a formal bug report, which involves sending us some test data that we can use to reproduce the problem ourselves. This is often required for debugging. Rest assured we treat all data transferred to us as private and confidential. In some cases we may ask for your permission to include a snippet of your test case in our testing framework, which is publicly accessible. In such a case, YOU are responsible for verifying with whoever owns the data whether you are authorized to allow us to make that data public.
Note that the information in this documentation guide is targeted at end-users. For developers, the source code and related resources are available on GitHub.