Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.
The GATK is the industry standard for identifying SNPs and indels in germline DNA and RNAseq data. Its scope is now expanding to include somatic variant calling tools, and to tackle copy number (CNV) and structural variation (SV). In addition to the variant callers themselves, the GATK also includes many utilities to perform related tasks such as processing and quality control of high-throughput sequencing data.
These tools were primarily designed to process exomes and whole genomes generated with Illumina sequencing technology, but they can be adapted to handle a variety of other technologies and experimental designs. And although it was originally developed for human genetics, the GATK has since evolved to handle genome data from any organism, with any level of ploidy.
The GATK, designed for human genome and exome analysis and extended to handle other organisms.
When you're isolating DNA in the lab, you don't treat the work as a series of isolated, disconnected tasks. Every task is a step in a well-documented protocol, carefully developed to optimize yield and purity and to ensure reproducibility and consistency across all samples and experiments. We believe sequencing data should be handled with the same thoroughness.
That's why the GATK comes with complete reads-to-results Best Practices workflow recommendations, battle-tested in production at the Broad Institute and optimized to produce the most accurate results as efficiently as possible.
Best Practices for SNP and Indel discovery in germline DNA - leveraging groundbreaking methods for combined power and scalability.
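To make that concrete, the heart of that germline workflow is per-sample calling in GVCF mode followed by joint genotyping across samples. The sketch below is a minimal outline only, with placeholder file and sample names; refer to the Best Practices documentation for the complete, authoritative workflow.
# Per-sample calling in GVCF mode (file names are placeholders)
java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R genome_reference.fasta \
-I sample1.bam \
--emitRefConfidence GVCF \
-o sample1.g.vcf
# Joint genotyping across all samples of the cohort
java -jar GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R genome_reference.fasta \
-V sample1.g.vcf \
-V sample2.g.vcf \
-o joint_calls.vcf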
The GATK is designed to run on Linux and other POSIX-compatible platforms. Yes, that includes MacOS X! If you are on any of the above, see the Downloads section for download and installation instructions. Windows systems are not supported. And no, there are no plans to port the GATK to Android or iOS in the near future ;-)
You will need to have Java 1.8 installed to run the GATK, and some tools additionally require R to generate PDF plots. Detailed version requirements and installation instructions for both can be found in the Documentation Guide.
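A quick sanity check of the prerequisites, assuming java and R are already on your PATH (the R check only matters if you need the plot-generating tools):
java -version    # should report a 1.8 runtime
R --version      # only needed for tools that produce PDF plots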
Versions of the GATK up to and including version 3 were optimized to run in traditional research computing environments such as local clusters and servers. The next generation of GATK tools (GATK4, available today as an alpha preview) is being developed to run best in cloud environments and to leverage Spark architectures wherever possible.
The GATK is designed to run on Linux and other POSIX-compatible platforms, including MacOS X.
At the heart of the GATK is an industrial-strength infrastructure and engine that handle data access, conversion and traversal, as well as high-performance computing features. On top of that lives a rich ecosystem of specialized tools, called walkers, that you can use out of the box, individually or chained into scripted workflows, to perform anything from simple data diagnostics to complex reads-to-results analyses. See the Tool Docs for a complete list of tools and their capabilities.
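For example, two walkers can be chained so that the output of one becomes the input of the next. The sketch below is illustrative only and uses placeholder file names; it pulls the SNPs out of a callset and then flags low-quality sites with a hard filter.
# Step 1: select only the SNPs from an existing callset
java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R genome_reference.fasta \
-V variants.vcf \
-selectType SNP \
-o snps.vcf
# Step 2: flag low-quality sites with a simple hard filter
java -jar GenomeAnalysisTK.jar \
-T VariantFiltration \
-R genome_reference.fasta \
-V snps.vcf \
--filterExpression "QD < 2.0" \
--filterName "lowQD" \
-o snps_filtered.vcf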
Many GATK tools can be parallelized by multithreading for faster execution. See this article for more details on parallelism with the GATK.
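In GATK 3 this is typically controlled with the engine arguments -nt (data threads) and -nct (CPU threads per data thread). Not every tool supports every mode, so check each tool's documentation first; the thread counts and file names below are placeholders.
# Data-thread parallelism (e.g. UnifiedGenotyper)
java -jar GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
-R genome_reference.fasta \
-I sequencing_reads.bam \
-nt 4 \
-o variants.vcf
# CPU-thread parallelism (e.g. BaseRecalibrator)
java -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-R genome_reference.fasta \
-I sequencing_reads.bam \
-knownSites known_variants.vcf \
-nct 8 \
-o recal_data.table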
The complete toolkit source code is made available on GitHub. Note that you can also access the source code for the engine and development framework alone, which is fully open source under an MIT license, in a separate GitHub repository provided for the convenience of third-party application developers.
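For instance, cloning the repositories is enough to browse or build the code locally; the repository paths below are our assumption, so follow the GitHub links above for the authoritative URLs.
# Full toolkit source (repository path assumed)
git clone https://github.com/broadgsa/gatk-protected.git
# Engine and development framework only, MIT-licensed (repository path assumed)
git clone https://github.com/broadgsa/gatk.git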
The toolkit provides a wide set of tools that can be chained into workflows, taking advantage of the common architecture and powerful engine.
GATK does not have a graphical user interface. All the GATK tools are run from the command line using the same basic command structure. The java -jar invocation launches the GATK engine itself, and the -T argument tells it which tool you want to run. Arguments like -R for the genome reference and -I for the input file are also given to the GATK engine and can be used with all the tools (see the complete list of available arguments for the GATK engine). Most tools also take additional arguments that are specific to their function; these are listed for each tool on that tool's documentation page, all easily accessible through the Tool Documentation index.
java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R genome_reference.fasta \
-I sequencing_reads.bam \
-o variants.vcf
All tools are run using the same Java command-line syntax.
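Engine-level and tool-specific arguments are simply appended to that same structure. As a sketch (the interval and threshold values are placeholders, and argument availability should be confirmed on the tool's documentation page), restricting HaplotypeCaller to chromosome 20 and setting its calling confidence threshold might look like this:
java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R genome_reference.fasta \
-I sequencing_reads.bam \
-L 20 \
-stand_call_conf 30 \
-o variants.vcf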
As of Wednesday, May 25, we are open-sourcing the upcoming new version, GATK4, which is available today as an alpha preview and will be released into beta status in June. See the GATK blog for a detailed announcement. This page will be updated shortly to reflect the new licensing policy.
If you are currently holding a commercial license for the current version or an older version of GATK, please contact softwarelicensing@broadinstitute.org to discuss what this means for you.
The text of the academic license can be viewed here. If your usage qualifies for this license, you can download the program and start using it right away.
We provide licensing directly to commercial/for-profit organizations that will be running the GATK or MuTect internally or as part of their own hardware offering. To inquire about licensing GATK for commercial use and/or redistribution of GATK as a service, please contact softwarelicensing@broadinstitute.org.
The GATK has a reputation for being wicked complicated, and it's not entirely undeserved. With great power comes great responsibility... er, complexity. But we're here to help.
The toolkit comes with extensive documentation about the tools themselves, the underlying methods and algorithms, and a lot of information about how to apply them to your data for best results. For the major use cases, we provide best-practice workflow recommendations that describe how to chain the tools together into processing and analysis pipelines. This documentation is further enriched by a regularly updated collection of frequently asked questions and solutions to common problems, a dictionary of technical terms, and tutorials that explain step by step how to run the tools and apply our workflow recommendations.
Be sure to check out the Presentations from our recurring workshop series. In addition to the slide decks, we provide recordings of the workshops that we hold at the Broad; you can view them on the Broad website or on the Broad education channels on YouTube and iTunesU.
Finally, if you've exhausted all these avenues and still haven't found the answer to your question, check out the forum! You may find that others have run into the same problem and that the solution has already been posted. If not, let us know and we'll do our best to address your problems quickly and accurately. If something's not clearly documented, we'll answer your question and improve the docs accordingly. If you think you found a bug, we'll track it down and fix it. Just ask the team.
