Introducing GATK 4.0

GATK4 is the first and only open-source software package that covers all major variant classes for both germline and cancer genome analysis. This new version of GATK has been completely reengineered to solve key performance bottlenecks, increasing speed and scalability without sacrificing its trademark accuracy.

GATK4 is fully open-source and can be deployed on local computing infrastructure as well as cloud environments.


GATK4 includes both well-established pipelines and new tools that take advantage of the latest developments in machine learning and neural network algorithms.

Earlier versions of GATK focused on germline short variant discovery, which was central to their success. This new version adds somatic short variant calling with Mutect2, which combines the widely used somatic modeling algorithm of Mutect with the haplotype-centric logic of GATK's leading germline caller, HaplotypeCaller. Extending beyond short variant discovery, GATK4 adds complete discovery pipelines for both germline and somatic copy number variants (CNV), as well as somatic allelic CNV (ACNV)-based estimation of tumor heterogeneity. These pipelines are engineered to scale seamlessly from gene panels and exomes to whole genome sequencing (WGS) datasets.

The GATK 4.0 package also includes early-access versions of tools currently in development for structural variation (SV) discovery; germline CNV discovery using machine learning approaches; and a new pipeline for germline short variant filtering based on Convolutional Neural Networks (CNN), as discussed in this recent blog post.

GATK4 has been extensively optimized for performance, flexibility, speed and scalability, and includes end-to-end pipeline scripts that can be run on any local or cloud compute infrastructure.

This major new version combines the Broad Institute's scientific and operational experience of running genomic pipelines at scale with the engineering excellence of computational industry leaders including Intel, Google Cloud, Cloudera, Microsoft Genomics, IBM Research, Amazon Web Services and Alibaba Cloud, who have all contributed to its development and/or cloud deployment.

Formalized by the creation in 2017 of the Intel-Broad Center for Genomic Data Engineering, the Broad Institute's collaboration with Intel has yielded major performance optimizations to key pipeline steps. Notably, Intel's GenomicsDB datastore dramatically improved the scalability of the GVCF-based germline joint-calling pipeline: the Broad Institute team completed a joint variant calling analysis of 15,000 WGS samples in 2 weeks, whereas the GATK3 tools required 6 weeks to process a maximum of 3,000 WGS samples.
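To make the GenomicsDB-based joint-calling step more concrete, here is a minimal Python sketch that assembles the two GATK4 command lines typically involved: GenomicsDBImport to consolidate per-sample GVCFs into a workspace, then GenotypeGVCFs to joint-call from that workspace. The helper name `joint_calling_commands` and all file names, the interval and the workspace path are hypothetical placeholders, not part of GATK. The sketch only builds and prints the commands; it does not run them.

```python
import shlex
from typing import List


def joint_calling_commands(
    reference: str,
    gvcfs: List[str],
    workspace: str,
    interval: str,
    output_vcf: str,
) -> List[List[str]]:
    """Build argv lists for a GenomicsDB-based joint-calling run.

    Hypothetical helper for illustration; paths and names are placeholders.
    """
    # Step 1: consolidate the per-sample GVCFs into a GenomicsDB workspace.
    import_cmd = ["gatk", "GenomicsDBImport"]
    for gvcf in gvcfs:
        import_cmd += ["-V", gvcf]
    import_cmd += ["--genomicsdb-workspace-path", workspace, "-L", interval]

    # Step 2: joint-call the cohort, reading from the workspace via gendb://.
    genotype_cmd = [
        "gatk", "GenotypeGVCFs",
        "-R", reference,
        "-V", f"gendb://{workspace}",
        "-O", output_vcf,
    ]
    return [import_cmd, genotype_cmd]


if __name__ == "__main__":
    commands = joint_calling_commands(
        reference="reference.fasta",                      # placeholder
        gvcfs=["sample1.g.vcf.gz", "sample2.g.vcf.gz"],   # placeholders
        workspace="my_database",                          # placeholder
        interval="chr20",                                 # placeholder
        output_vcf="cohort.vcf.gz",                       # placeholder
    )
    for cmd in commands:
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

In a real cohort the import step would usually be repeated per interval, which is one way the workload can be spread across many machines.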


Scaling up joint calling with GenomicsDB

What's new in GATK4? In this short video, Laura Gauthier explains how the speed and scalability of joint calling are dramatically improved in GATK4 thanks to the Intel GenomicsDB datastore.

Thanks to the contributions of Cloudera engineers, GATK4 now uses Apache Spark under the hood for both traditional local multithreading and for parallelization on Spark-capable compute infrastructure and services such as Google Dataproc. In complementary work, Google Cloud engineers gave GATK4 the ability to stream data directly from Google Cloud Storage (GCS) through the NIO protocol, enabling considerable savings of time and money in cloud executions.
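As a rough illustration of how the same Spark-enabled tool can target either a local machine or a Google Dataproc cluster, the Python sketch below assembles a `gatk` wrapper command line. Tool arguments come first, followed by a `--` separator and the Spark arguments. The helper name `spark_command`, the bucket paths and the cluster name are hypothetical. The tool shown (PrintReadsSpark) is just one example of a Spark tool. The sketch only builds the command; it does not execute it.

```python
from typing import List, Optional


def spark_command(
    tool: str,
    tool_args: List[str],
    runner: str = "LOCAL",
    spark_master: Optional[str] = None,
    cluster: Optional[str] = None,
) -> List[str]:
    """Build an argv list for a Spark-enabled GATK4 tool.

    Hypothetical helper for illustration only.
    """
    if runner not in ("LOCAL", "SPARK", "GCS"):
        raise ValueError(f"unsupported runner: {runner}")

    # Tool arguments go before "--"; Spark arguments go after it.
    cmd = ["gatk", tool, *tool_args, "--", "--spark-runner", runner]

    if runner == "GCS":
        # Dataproc submissions need the name of an existing cluster.
        if not cluster:
            raise ValueError("a cluster name is required for the GCS runner")
        cmd += ["--cluster", cluster]
    elif spark_master:
        # e.g. 'local[4]' for four local threads, or a Spark master URL.
        cmd += ["--spark-master", spark_master]
    return cmd


if __name__ == "__main__":
    # Local multithreading on four cores (file names are placeholders).
    print(spark_command(
        "PrintReadsSpark",
        ["-I", "input.bam", "-O", "output.bam"],
        runner="LOCAL",
        spark_master="local[4]",
    ))
    # Dataproc run reading directly from GCS (bucket and cluster are placeholders).
    print(spark_command(
        "PrintReadsSpark",
        ["-I", "gs://my-bucket/input.bam", "-O", "gs://my-bucket/output.bam"],
        runner="GCS",
        cluster="my-dataproc-cluster",
    ))
```

The `gs://` input in the second call reflects the direct streaming from Google Cloud Storage described above: the data does not need to be copied to local disk first.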

Two ways to get started with GATK4

You can run GATK4 the old-fashioned way (via download or Docker on your own infrastructure) or you can use our cloud-based analysis portal, FireCloud, where everything is already set up for you including ready-to-run examples (see box).


Run the GATK4 Best Practices in FireCloud

We invite researchers of all backgrounds, including those without computational training, to use these pipelines through FireCloud, the Broad Institute's cloud-based analysis portal, where the GATK 4.0 pipelines are available fully configured and ready to run on preloaded example datasets. Access to the portal and pipelines is free of charge for all users, and through a partnership with Google Cloud, we are offering the first 1,000 applicants a $250 credit per user toward compute and storage costs.



GATK4 Package



Analysis Features

  • Consolidated pre-processing for all variant discovery
  • Germline SNPs & Indels
  • Somatic SNVs & Indels
  • Somatic CNV and Allelic CNV
  • Germline CNV (preview)
  • Neural network approach to germline short variant filtering (preview)

Performance and Engineering Features

  • Parallelization by Apache Spark
  • Cloud deployment support
  • Hardware-optimized versions of bottleneck algorithms
  • GenomicsDB datastore for germline joint-calling

Convenience Features

  • Picard tools bundled in GATK4
  • Wrapper script handles Java invocation and Spark parametrization
  • Docker images for easy deployment
  • Best Practices pipelines runnable out of the box