Developed by the Data Science and Data Engineering group at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.
The GATK is the industry standard for identifying SNPs and indels in germline DNA and RNAseq data. Its scope is now expanding to include somatic variant calling tools, and to tackle copy number variation (CNV) and structural variation (SV). In addition to the variant callers themselves, the GATK also includes many utilities to perform related tasks such as processing and quality control of high-throughput sequencing data.
These tools were primarily designed to process exomes and whole genomes generated with Illumina sequencing technology, but they can be adapted to handle a variety of other technologies and experimental designs. And although it was originally developed for human genetics, the GATK has since evolved to handle genome data from any organism, with any level of ploidy.
When you're isolating DNA in the lab, you don't treat the work as a series of isolated, disconnected tasks. Every task is a step in a well-documented protocol, carefully developed to optimize yield and purity and to ensure reproducibility and consistency across all samples and experiments. We believe work with sequencing data should be treated in the same thorough manner.
That's why GATK comes with complete reads-to-results Best Practices workflow recommendations, battle-tested in production at the Broad Institute and optimized to produce the most accurate results with the most computational efficiency.
The GATK is designed to run on Linux and other POSIX-compatible platforms. Yes, that includes MacOS X! If you are on any of the above, see the Downloads section for downloading and installation instructions. Windows systems are not supported. And no, there are no plans to port the GATK to Android or iOS in the near future ;-)
You will need to have Java 1.8 installed to run the GATK, and some tools additionally require R to generate PDF plots. Detailed version requirements and installation instructions for both can be found in the Documentation Guide.
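Before running any tool, it can help to confirm which Java version is on your PATH. The following is a minimal sketch, not part of the GATK itself: the function name `is_java_8` is hypothetical, and it assumes that `java -version` reports its version in the usual `version "1.8.…"` form on stderr.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given `java -version` output
# reports a 1.8.x version string.
is_java_8() {
    case "$1" in
        *'version "1.8.'*) return 0 ;;
        *) return 1 ;;
    esac
}

# `java -version` writes to stderr, so redirect it to stdout to capture it.
if command -v java >/dev/null 2>&1; then
    if is_java_8 "$(java -version 2>&1)"; then
        echo "Java 1.8 found"
    else
        echo "Java found, but it is not version 1.8"
    fi
else
    echo "java not found on PATH"
fi
```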
Versions of GATK up to 3 were optimized to run in traditional research computing environments such as local clusters and servers. The next generation of GATK tools (GATK4, available today as an alpha preview) is being developed to run best in cloud environments and to leverage Spark architectures wherever possible.
At the heart of the GATK is an industrial-strength infrastructure and engine that handle data access, conversion and traversal, as well as high-performance computing features. On top of that lives a rich ecosystem of specialized tools, called walkers, that you can use out of the box, individually or chained into scripted workflows, to perform anything from simple data diagnostics to complex reads-to-results analyses. See the Tool Docs for a complete list of tools and their capabilities.
Many GATK tools can be parallelized by multithreading for faster execution. See this article for more details on parallelism with the GATK.
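As a rough illustration, the sketch below prints a command that requests several CPU threads. It assumes GATK 3.x, where `-nct` sets the number of CPU threads per data thread (a separate `-nt` argument sets the number of data threads). Support varies from tool to tool, so check the tool's documentation before relying on either. The function name `gatk_threaded_cmd` is hypothetical and the file names are placeholders; the function only prints the command so that it can be reviewed before it is run.

```shell
#!/bin/sh
# Hypothetical helper: prints a HaplotypeCaller command that requests
# the given number of CPU threads via -nct (assumes GATK 3.x).
# Usage: gatk_threaded_cmd NUM_THREADS
gatk_threaded_cmd() {
    printf 'java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R genome_reference.fasta -I sequencing_reads.bam -o variants.vcf -nct %s\n' "$1"
}

# Dry run: print the command for four CPU threads.
gatk_threaded_cmd 4
```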
The complete toolkit source code is available on GitHub. Note that you can also access the source code for the engine and development framework alone, which are fully open source under an MIT license, in a separate GitHub repository provided for the convenience of third-party application developers.
GATK does not have a graphical user interface. All the GATK tools are run from the command line using the same basic command structure. The -jar argument invokes the GATK engine itself, and the -T argument tells it which tool you want to run. Arguments like -R for the genome reference and -I for the input file are also given to the GATK engine and can be used with all the tools (see complete list of available arguments for the GATK engine). Most tools also take additional arguments that are specific to their function. These are listed for each tool on that tool's documentation page, all easily accessible through the Tool Documentation index.
java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R genome_reference.fasta \
    -I sequencing_reads.bam \
    -o variants.vcf
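Because every tool shares this command structure, the command can be generated from a few parameters. The sketch below is only an illustration: the function name `gatk_cmd` is hypothetical, the file names are the placeholders from the example above, and the function prints the command instead of running it so that it can be inspected first.

```shell
#!/bin/sh
# Hypothetical helper: prints a GATK command line for a given tool.
# Usage: gatk_cmd TOOL REFERENCE INPUT OUTPUT
gatk_cmd() {
    printf 'java -jar GenomeAnalysisTK.jar -T %s -R %s -I %s -o %s\n' \
        "$1" "$2" "$3" "$4"
}

# Dry run: reproduce the HaplotypeCaller command shown above.
gatk_cmd HaplotypeCaller genome_reference.fasta sequencing_reads.bam variants.vcf
```

Note that the printed command is plain text, so paths containing spaces would need quoting before it is pasted into a shell.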
GATK is released under a mixed licensing model: researchers at academic and non-profit organizations using GATK for non-commercial purposes can access the tools and source code for free while for-profit organizations are required to purchase a license.
The revenue generated by commercial licensing is used to fund and build out our support team and infrastructure to accommodate the demand for support in the community, as well as invest more resources to improve development speed, functionality and stability overall.
The text of the academic license can be viewed here. If your usage qualifies for this license, you can download the program and start using it right away.
We provide licensing directly to commercial/for-profit organizations that will be running the GATK or MuTect internally or as part of their own hardware offering. To inquire about licensing GATK for commercial use and/or redistribution of GATK as a service, please contact firstname.lastname@example.org.
The GATK has a reputation for being wicked complicated, and it's not entirely undeserved. With great power comes great complexity... But we're here to help.
The toolkit comes with extensive documentation about the tools themselves, the underlying methods and algorithms, and a lot of information about how to apply them to your data for best results. For the major use cases, we provide best-practice workflow recommendations that describe how to chain the tools together into processing and analysis pipelines. This documentation is further enriched by a regularly updated collection of frequently asked questions and solutions to common problems, a dictionary of technical terms, and tutorials that explain step by step how to run the tools and apply our workflow recommendations.
Be sure to check out the Presentations from our recurring workshop series. In addition to the slide decks, we provide recordings of the workshops that we hold at the Broad; you can view them on the Broad website or on the Broad education channels on YouTube and iTunesU.
Finally, if you've exhausted all these avenues and still haven't found the answer to your question, check out the forum! You may find that others have run into the same problem and that the solution has already been posted. If not, let us know and we'll do our best to address your problems quickly and accurately. If something's not clearly documented, we'll answer your question and improve the docs accordingly. If you think you found a bug, we'll track it down and fix it. Just ask the team.