Production-ready tools to call copy-number variants

Early adopters of GATK4 will recall that somatic and germline copy-number variation (CNV) pipelines were among the first to be developed. The current generation of these pipelines still bear traits that reflect their evolutionary beginnings, but have also acquired adaptations that take them far beyond their predecessors’ limitations. With the release of GATK4.1, we are excited to bring the latest and greatest versions of these pipelines out of beta and to officially add CNV calling to GATK’s ever-growing set of capabilities.

Evolution of the CNV pipelines Beta versions of the GATK CNV pipelines were heavily influenced by methods previously developed at the Broad. For example, the GATK4(beta) CNV/AllelicCNV pipeline bore strong resemblances to the exome ReCapSeg/AllelicCapSeg pipeline developed by the Cancer Genome Analysis Group. The germline GATK4(beta) XHMMSegmentCaller pipeline was a near-direct port of the XHMM (eXome-Hidden Markov Model) tool. Vestiges of these venerable ancestors still remain in GATK4.1’s ModelSegments and GermlineCNVCaller pipelines; however, new innovations yield dramatically improved performance and enable scalability from exomes to genomes.

CNV calling in a nutshell To appreciate these innovations, let’s review the problem of calling CNVs from sequencing read-depth data---which can be a tough nut to crack! Like Darwin’s finches, different CNV tools have evolved a variety of different ways to crack this nut, but their overall function is largely the same. CNV tools typically break down the problem into more manageable tasks:

  1. Denoising: Distinguishing the signal from CNV events from systematic sequencing noise can be quite a challenge. Many CNV tools employ denoising strategies to learn patterns of noise from a panel of control samples and remove them. For example, both ReCapSeg and XHMM use principal components analysis (PCA) denoising.

  2. Segmentation: The signal from CNV events can vary both in genomic length and amplitude. Algorithms like the circular binary segmentation (CBS) method used by ReCapSeg can identify genomic segments that contain somatic CNV signal. For germline calling, where the signal appears at amplitudes corresponding to integer copy-number states, a Hidden Markov Model (HMM) like the one used by XHMM can work well.

GATK4.1 ModelSegments: A next-generation CNV caller GATK4.1’s ModelSegments pipeline is a streamlined, modernized, and highly evolved version of the ReCapSeg pipeline from which it descended. Like its ancestor, the ModelSegments pipeline uses PCA denoising and a panel of control samples to remove systematic sequencing noise. However, we’ve optimized our denoising code to drastically reduce both runtime and memory requirements. Panels that used to take upwards of an entire day to build using ReCapSeg can now be built in under an hour---and at ~100x higher resolution, to boot!

We’ve also developed a new kernel-segmentation method to replace the workhorse algorithm CBS. This method enables scaling to high-resolution whole genome data as well as segmentation of multidimensional data. Combined with the improvements to denoising, the new segmentation method allows ModelSegments to run well on both exomes and genomes.

GATK4.1 GermlineCNVCaller: A new species of CNV caller GATK4.1’s GermlineCNVCaller pipeline introduces even more novel methods---representing a saltational step in the evolution of CNV tools.

Taking advantage of computational frameworks from the world of probabilistic programming, (i.e., PyMC3 and Theano), GermlineCNVCaller is able to simultaneously model both systematic biases and CNV events. More naive approaches to denoising (such as PCA) cannot always distinguish between signal and noise, and sometimes inadvertently subtract the signal. In contrast, our new modeling approach yields high sensitivity---especially in genomic regions of common CNV activity.

GermlineCNVCaller also introduces a hierarchical HMM method for segmentation, which learns these regions of common CNV activity across multiple samples while simultaneously calling CNVs in each sample. GermlineCNVCaller’s abilities shine on noisy exome data, but can scale to genomes by harnessing the power of Cromwell and WDL.

An animation of GermlineCNVCaller inference performed on a cohort of simulated exome samples. Video by: Mehrtash Babadi

The sample-by-target heatmaps in the center column show 1) count data generated from 2) underlying copy-number (CN) events; GermlineCNVCaller infers 3) CN calls in each sample, while also identifying 4) regions of common CNV activity (indicated in yellow). Counts and inferred CN calls are plotted for a single sample on the right, while various quantities which determine model convergence are tracked over learning iterations on the left.

Though they owe a lot to their prototypical predecessors, GATK4.1’s CNV pipelines have evolved substantially to yield dramatically improved performance and augmented capabilities. GATK CNV tool development is ongoing, so stay tuned for the next stage of evolution!

Return to top

Fri 8 Feb 2019

SkyWarrior on 8 Feb 2019

Thank you all for getting this tool ready for production. I have been using it since the day its inception and saw the direct advantage over existing tools. Only a few kinks here and there but I would say this will become the ultimate tool for most I believe.

jin0008 on 8 Feb 2019

I have trouble with AnnotateInterval. I can't find the right one of segmental_duplicates.bed file in hg38. Can you provide one at resource bundle sites?

- Recent posts

- Upcoming events

See Events calendar for full list and dates

- Recent events

See Events calendar for full list and dates

- Follow us on Twitter

GATK Dev Team


Getting started with GATK? It can be easier with @terrabioapp. Read the intro blog at
13 Jun 19
@E_TaylorCox Seems like everything is up now — are you still having trouble?
13 Jun 19
Would anyone in the Boston/Cambridge area be interested in an informal meetup about GATK? Chat about current tools,…
6 Jun 19
This #OpenScience contest is a great opportunity to win a trip to BOSC in Basel, Switzerland this summer! Just set…
28 May 19
Love to write, hate our docs? Join our Education team! We're looking for a science writer to help us improve our ed…
24 May 19

- Our favorite tweets from others

"GATK has always been kind of a beast to get started with..." #documentation
13 Jun 19
The award for best documentation headline goes to @gatk_dev @broadinstitute 😆
7 Jun 19
@sciencegurlz0 @gatk_dev documentation, tutorials and videos were very useful to me
14 May 19
Having a great introduction day! Looking forward the coming practicals and discussions!
14 May 19

See more of our favorite tweets...