Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.
The GATK is the industry standard for identifying SNPs and indels in germline DNA and RNAseq data. Its scope is now expanding to include somatic short variant calling, and to tackle copy number (CNV) and structural variation (SV). In addition to the variant callers themselves, the GATK also includes many utilities to perform related tasks such as processing and quality control of high-throughput sequencing data, and bundles the popular Picard toolkit.
These tools were primarily designed to process exomes and whole genomes generated with Illumina sequencing technology, but they can be adapted to handle a variety of other technologies and experimental designs. And although it was originally developed for human genetics, the GATK has since evolved to handle genome data from any organism, with any level of ploidy.
When you're isolating DNA in the lab, you don't treat the work like isolated, disconnected tasks. Every task is a step in a well-documented protocol, carefully developed to optimize yield, purity and to ensure reproducibility as well as consistency across all samples and experiments. We believe working with the sequencing data should be treated in the same thorough manner.
That's why GATK comes with complete reads-to-results Best Practices workflow recommendations, battle-tested in production at the Broad Institute and optimized to produce the most accurate results with the most computational efficiency.
GATK4 includes Best Practices workflows for all major classes of variants for genomic analysis in gene panels, exomes and whole genomes. In addition to the industry standard GATK Best Practices workflow for germline short variant discovery, GATK4 offers Best Practices workflows for somatic short variants, somatic and germline copy number variants, and structural variation discovery tools are in active development.
The GATK is designed to run on Linux and other POSIX-compatible platforms, which includes MacOS X. Windows systems are not supported. The major system requirement is Java 1.8; some tools have additional R or Python dependencies. We recommend using Docker containers for ease of deployment; an official docker container is available on Dockerhub. If you prefer to run the software directly, see the Download section for download and installation instructions.
In addition to supporting traditional compute environments such as local clusters, the next generation of GATK tools has been engineered to also play well with cloud environments and to leverage Spark architectures wherever possible. See the Pipelining Options documentation for more information on supported platforms and available optimizations.
At the heart of the GATK is an industrial-strength infrastructure and engine that handle data access, conversion and traversal, as well as high-performance computing features. This includes parallelization using Apache Spark and optimized usage of cloud infrastructure. On top of that lives a rich ecosystem of specialized tools that you can use out of the box, individually or chained into scripted workflows, to perform anything from simple data diagnostics to complex reads-to-variants analyses. See the Tool Docs for a complete list of tools and their capabilities.
Starting in GATK4, the GATK executable also bundles the popular Picard toolkit for manipulation and quality control of high-throughput sequencing data. All Picard tools are now available directly from the GATK command-line, with a harmonized command syntax and consolidated user guide.
GATK does not have a graphical user interface. All the GATK tools are run from the command-line using the same basic command structure. A convenient wrapper script invokes Java and the GATK program itself; you specify the tool you want to run, and any applicable arguments and input parameters. Arguments like
-R for the genome reference and
-I for the input file are also given to the GATK engine and can be used with all the tools. Most tools also take additional arguments that are specific to their function. These are listed for each tool on that tool's documentation page, all easily accessible through the Tool Documentation index.
./gatk HaplotypeCaller \ -R genome_reference.fasta \ -I sequencing_reads.bam \ -O variants.vcf
As described on Github, it is a permissive license similar to the BSD 2-Clause License, but with a 3rd clause that prohibits others from using the name of the project or its contributors to promote derived products without written consent.
The full text of the license can be viewed here.
If you have any questions about the licensing terms of older versions of GATK, please address them to firstname.lastname@example.org.
The GATK has a reputation for being wicked complicated, and it's not entirely undeserved. With great power comes great
responsibility complexity... But we're here to help.
The toolkit comes with extensive documentation about the tools themselves, the underlying methods and algorithms, and a lot of information about how to apply them to your data for best results. For the major use cases, we provide best-practice workflow recommendations that describe how to chain the tools together into processing and analysis pipelines. This documentation is further enriched by a regularly updated collection of frequently asked questions and solutions to common problems, a dictionary of technical terms, and tutorials that explain step by step how to run the tools and apply our workflow recommendations.
Be sure to check out the Presentations from our recurring workshop series. In addition to the slide decks, we provide recordings of the workshops that we hold at the Broad; you can view them on the Broad website or on the Broad education channels on YouTube and iTunesU.
Finally, if you've exhausted all these avenues and still haven't found the answer to your question, check out the forum! You may find that others have run into the same problem and that the solution has already been posted. If not, let us know and we'll do our best to address your problems quickly and accurately. If something's not clearly documented, we'll answer your question and improve the docs accordingly. If you think you found a bug, we'll track it down and fix it. Just ask the team.