GATK4 is the first and only open-source software package that covers all major variant classes for both germline and cancer genome analysis. This new version of GATK has been completely reengineered to solve key performance bottlenecks, increasing speed and scalability without sacrificing its trademark accuracy.
GATK4 is fully open-source and can be deployed on local computing infrastructure as well as cloud environments.
Earlier versions of GATK focused on germline short variant discovery, which was central to their success. This new version adds somatic short variant calling with Mutect2, which combines the widely used somatic modeling algorithm Mutect with the haplotype-centric logic of GATK's leading germline caller, HaplotypeCaller. Extending beyond short variant discovery, GATK4 adds full discovery pipelines for both germline and somatic copy number variants (CNV), as well as estimation of tumor heterogeneity based on somatic allelic CNV (ACNV). These pipelines are engineered to scale seamlessly from gene panels and exomes to whole genome sequencing (WGS) datasets.
The GATK 4.0 package also includes early-access versions of tools currently in development for structural variation (SV) discovery; germline CNV discovery using machine learning approaches; and a new pipeline for germline short variant filtering based on Convolutional Neural Networks (CNN), as discussed in this recent blog post.
This major new version benefits from the Broad Institute's combined scientific and operational experience running genomic pipelines at scale, and from the engineering excellence of computational industry leaders including Intel, Google Cloud, Cloudera, Microsoft Genomics, IBM Research, Amazon Web Services and Alibaba Cloud, all of which have contributed to its development and/or cloud deployment.
Formalized by the creation in 2017 of the Intel-Broad Center for Genomic Data Engineering, the Broad Institute's collaboration with Intel has yielded major performance optimizations to key pipeline steps. Notably, Intel's creation of the GenomicsDB datastore dramatically improved the scalability of the GVCF-based germline joint-calling pipeline, allowing the Broad Institute team to complete a variant calling analysis on 15,000 WGS samples in 2 weeks, whereas GATK3 tools required 6 weeks to process a maximum of 3,000 WGS samples.
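To put those joint-calling figures side by side, here is a back-of-the-envelope calculation in Python. It uses only the sample counts and durations quoted above. The function name is hypothetical, and the comparison assumes the two runs are directly comparable, which the figures alone do not establish (hardware and configuration may have differed).

```python
def samples_per_week(samples: int, weeks: float) -> float:
    """Average number of WGS samples processed per week."""
    return samples / weeks


# Figures quoted in the text above.
gatk4_rate = samples_per_week(15_000, 2)  # GATK4 with GenomicsDB
gatk3_rate = samples_per_week(3_000, 6)   # GATK3 tools

# Ratio of average throughputs; a rough indicator, not a benchmark.
throughput_ratio = gatk4_rate / gatk3_rate

if __name__ == "__main__":
    print(f"GATK4: {gatk4_rate:.0f} samples/week")
    print(f"GATK3: {gatk3_rate:.0f} samples/week")
    print(f"Throughput ratio: {throughput_ratio:.0f}x")
```

On these numbers the average throughput works out to 7,500 versus 500 samples per week, a roughly 15-fold difference. Note that the GATK3 figure is also described as a ceiling on cohort size, so the ratio understates the change for larger cohorts.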
Thanks to the contributions of Cloudera engineers, GATK4 now uses Apache Spark under the hood both for traditional local multithreading and for parallelization on Spark-capable compute infrastructure and services such as Google Dataproc. In complementary work, Google Cloud engineers gave GATK4 the ability to stream data directly from Google Cloud Storage (GCS) through the NIO protocol, enabling considerable savings of time and money in cloud executions.
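As a rough illustration of the two capabilities just described, the Python sketch below assembles (but does not run) GATK4 command lines: one that reads its input directly from a gs:// path, and one that runs a Spark-enabled tool with local multithreading. The helper names, bucket, and file names are hypothetical. The tool names and the --spark-runner and --spark-master arguments reflect common GATK4 usage and should be checked against the documentation for your installed version.

```python
import shlex
from typing import List


def streaming_command(tool: str, gcs_input: str, local_output: str) -> List[str]:
    """Build a command whose input is read directly from Google Cloud Storage."""
    if not gcs_input.startswith("gs://"):
        raise ValueError("expected a gs:// input path")
    return ["gatk", tool, "-I", gcs_input, "-O", local_output]


def local_spark_command(tool: str, input_bam: str, output_bam: str,
                        threads: int) -> List[str]:
    """Build a command for a Spark-enabled tool using local threads."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    # Arguments after the bare "--" are passed to the Spark runner (assumed usage).
    spark_args = ["--spark-runner", "LOCAL", "--spark-master", f"local[{threads}]"]
    return ["gatk", tool, "-I", input_bam, "-O", output_bam, "--"] + spark_args


if __name__ == "__main__":
    # Hypothetical bucket and file names, for illustration only.
    print(shlex.join(streaming_command(
        "PrintReads", "gs://my-bucket/sample.bam", "sample.local.bam")))
    print(shlex.join(local_spark_command(
        "MarkDuplicatesSpark", "sample.bam", "sample.dedup.bam", 4)))
```

Building the argument list separately from execution keeps the sketch safe to run on a machine without GATK installed. The resulting list can be passed to subprocess.run once the flags have been verified.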
You can run GATK4 the old-fashioned way (via download or Docker on your own infrastructure), or you can use our cloud-based analysis portal, FireCloud, where everything is already set up for you, including ready-to-run examples (see box).