Production-ready tools to call copy-number variants

Early adopters of GATK4 will recall that somatic and germline copy-number variation (CNV) pipelines were among the first to be developed. The current pipelines still bear traits that reflect their evolutionary beginnings, but they have also acquired adaptations that take them far beyond their predecessors’ limitations. With the release of GATK4.1, we are excited to bring the latest and greatest versions of these pipelines out of beta and to officially add CNV calling to GATK’s ever-growing set of capabilities.

Evolution of the CNV pipelines

Beta versions of the GATK CNV pipelines were heavily influenced by methods previously developed at the Broad. For example, the GATK4(beta) CNV/AllelicCNV pipeline bore a strong resemblance to the exome ReCapSeg/AllelicCapSeg pipeline developed by the Cancer Genome Analysis Group. The germline GATK4(beta) XHMMSegmentCaller pipeline was a near-direct port of the XHMM (eXome-Hidden Markov Model) tool. Vestiges of these venerable ancestors remain in GATK4.1’s ModelSegments and GermlineCNVCaller pipelines; however, recent innovations yield dramatically improved performance and enable scalability from exomes to genomes.

CNV calling in a nutshell

To appreciate these innovations, let’s review the problem of calling CNVs from sequencing read-depth data, which can be a tough nut to crack! Like Darwin’s finches, CNV tools have evolved a variety of ways to crack this nut, but their overall function is largely the same. CNV tools typically break the problem down into more manageable tasks:

  1. Denoising: Distinguishing the signal of CNV events from systematic sequencing noise can be quite a challenge. Many CNV tools employ denoising strategies to learn patterns of noise from a panel of control samples and remove them. For example, both ReCapSeg and XHMM use principal components analysis (PCA) denoising.

  2. Segmentation: The signal from CNV events can vary in both genomic length and amplitude. Algorithms like the circular binary segmentation (CBS) method used by ReCapSeg can identify genomic segments that contain somatic CNV signal. For germline calling, where the signal appears at amplitudes corresponding to integer copy-number states, a Hidden Markov Model (HMM) like the one used by XHMM can work well.
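
To make the HMM idea in step 2 concrete, here is a minimal sketch of Viterbi decoding over integer copy-number states. It is not the XHMM or GATK implementation: the function name, the Gaussian emission model, and the values of `sigma` and `stay_prob` are all assumptions chosen for illustration, and the input is assumed to be already-denoised copy ratios where 1.0 corresponds to two copies.

```python
import math

def viterbi_copy_number(ratios, states=(0, 1, 2, 3, 4), sigma=0.15, stay_prob=0.95):
    """Most likely integer copy-number path for per-bin copy ratios (1.0 == 2 copies)."""
    n_states = len(states)
    # Hypothetical transition model: stay with high probability, else switch uniformly.
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n_states - 1))

    def log_emission(ratio, cn):
        # Gaussian log-likelihood around the expected ratio cn / 2 (constants dropped).
        return -0.5 * ((ratio - cn / 2.0) / sigma) ** 2

    def log_transition(i, j):
        return log_stay if i == j else log_switch

    # Uniform prior over states for the first bin.
    score = [log_emission(ratios[0], cn) for cn in states]
    backpointers = []
    for ratio in ratios[1:]:
        new_score, pointers = [], []
        for j, cn in enumerate(states):
            best_i = max(range(n_states), key=lambda i: score[i] + log_transition(i, j))
            new_score.append(score[best_i] + log_transition(best_i, j) + log_emission(ratio, cn))
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)

    # Trace the best path back from the final bin.
    state = max(range(n_states), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [states[i] for i in path]

# Toy data: a three-bin deletion followed by a two-bin duplication.
ratios = [1.02, 0.97, 1.05, 0.52, 0.48, 0.55, 0.98, 1.01, 1.47, 1.55, 1.02, 0.99]
calls = viterbi_copy_number(ratios)
print(calls)
```

Because the hidden states are integers, the decoded path doubles as a segmentation: runs of identical states are the called segments.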

GATK4.1 ModelSegments: A next-generation CNV caller

GATK4.1’s ModelSegments pipeline is a streamlined, modernized, and highly evolved version of the ReCapSeg pipeline from which it descended. Like its ancestor, the ModelSegments pipeline uses PCA denoising and a panel of control samples to remove systematic sequencing noise. However, we’ve optimized our denoising code to drastically reduce both runtime and memory requirements. Panels that used to take upwards of an entire day to build using ReCapSeg can now be built in under an hour, and at ~100x higher resolution, to boot!
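
As a rough illustration of panel-based PCA denoising, the sketch below learns a single noise axis from a toy panel of control samples and projects it out of a case sample. It is a simplified stand-in, not the GATK code: the helper names, the use of power iteration, the single component, and the toy log2 copy-ratio values are all assumptions made for the example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def leading_axis(panel, iterations=50):
    """Unit vector along the first principal component of the panel (rows = samples)."""
    n_bins = len(panel[0])
    means = [sum(col) / len(panel) for col in zip(*panel)]
    centered = [[x - m for x, m in zip(row, means)] for row in panel]
    # Deterministic, non-uniform start vector (assumed not orthogonal to the answer).
    vec = [float(i + 1) for i in range(n_bins)]
    for _ in range(iterations):
        # Multiply by the unnormalized covariance: sum over rows of (row . vec) * row.
        new = [0.0] * n_bins
        for row in centered:
            weight = dot(row, vec)
            new = [acc + weight * x for acc, x in zip(new, row)]
        norm = math.sqrt(dot(new, new))
        vec = [x / norm for x in new]
    return vec

def denoise(sample, axis):
    """Remove the part of the sample that lies along the learned noise axis."""
    weight = dot(sample, axis)
    return [x - weight * a for x, a in zip(sample, axis)]

# Toy panel: every control sample carries the same per-bin bias at a different strength.
noise = [0.3, -0.2, 0.1, -0.1, 0.2, -0.3]
panel = [[scale * x for x in noise] for scale in (1.0, 0.5, -0.7, 1.4)]

# Case sample: the same bias plus a deletion (log2 ratio of -1) in bins 2 and 3.
event = [0.0, 0.0, -1.0, -1.0, 0.0, 0.0]
case = [0.8 * n + e for n, e in zip(noise, event)]

denoised = denoise(case, leading_axis(panel))
print([round(x, 3) for x in denoised])
```

The toy event was deliberately chosen to be orthogonal to the noise pattern, so the deletion survives intact. When a real event overlaps a learned component, the same projection removes part of the signal along with the noise.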

We’ve also developed a new kernel-segmentation method to replace CBS, the workhorse algorithm. This method enables scaling to high-resolution whole-genome data as well as segmentation of multidimensional data. Combined with the improvements to denoising, the new segmentation method allows ModelSegments to run well on both exomes and genomes.
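
The sketch below shows the general idea behind kernel-based changepoint detection on multidimensional data. It is a brute-force illustration that finds a single changepoint, not the method used in ModelSegments, which has to handle many changepoints efficiently. The Gaussian kernel, its bandwidth, the function names, and the toy (log2 copy ratio, alternate-allele fraction) points are all assumptions.

```python
import math

def gaussian_kernel(a, b, bandwidth=0.5):
    """Similarity between two points of any dimension (1.0 when identical)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))

def segment_cost(points, kernel=gaussian_kernel):
    """Kernel within-segment scatter: sum_i k(x_i, x_i) - (1/n) * sum_{i,j} k(x_i, x_j)."""
    n = len(points)
    diagonal = sum(kernel(p, p) for p in points)
    total = sum(kernel(p, q) for p in points for q in points)
    return diagonal - total / n

def best_changepoint(points):
    """Index t (1 <= t < n) that splits points into [0:t] and [t:] with minimal total cost."""
    return min(
        range(1, len(points)),
        key=lambda t: segment_cost(points[:t]) + segment_cost(points[t:]),
    )

# Toy bins as (log2 copy ratio, alternate-allele fraction): five neutral, then four deleted.
points = [
    (0.02, 0.49), (-0.03, 0.51), (0.01, 0.50), (0.04, 0.48), (-0.02, 0.52),
    (-0.61, 0.31), (-0.58, 0.29), (-0.63, 0.30), (-0.59, 0.32),
]
changepoint = best_changepoint(points)
print(changepoint)
```

Since the cost depends on the data only through the kernel, the same code works unchanged whether each bin is a single copy ratio or a tuple of several measurements.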

GATK4.1 GermlineCNVCaller: A new species of CNV caller

GATK4.1’s GermlineCNVCaller pipeline introduces even more novel methods, representing a saltational step in the evolution of CNV tools.

Taking advantage of computational frameworks from the world of probabilistic programming (i.e., PyMC3 and Theano), GermlineCNVCaller is able to simultaneously model both systematic biases and CNV events. More naive approaches to denoising (such as PCA) cannot always distinguish between signal and noise, and sometimes inadvertently subtract the signal. In contrast, our new modeling approach yields high sensitivity, especially in genomic regions of common CNV activity.

GermlineCNVCaller also introduces a hierarchical HMM method for segmentation, which learns these regions of common CNV activity across multiple samples while simultaneously calling CNVs in each sample. GermlineCNVCaller shines on noisy exome data, but it can also scale to genomes by harnessing the power of Cromwell and WDL.

An animation of GermlineCNVCaller inference performed on a cohort of simulated exome samples. Video by: Mehrtash Babadi

The sample-by-target heatmaps in the center column show 1) count data generated from 2) underlying copy-number (CN) events; GermlineCNVCaller infers 3) CN calls in each sample, while also identifying 4) regions of common CNV activity (indicated in yellow). Counts and inferred CN calls are plotted for a single sample on the right, while various quantities which determine model convergence are tracked over learning iterations on the left.

Though they owe a lot to their prototypical predecessors, GATK4.1’s CNV pipelines have evolved substantially to yield dramatically improved performance and augmented capabilities. GATK CNV tool development is ongoing, so stay tuned for the next stage of evolution!


Fri 8 Feb 2019

SkyWarrior on 8 Feb 2019

Thank you all for getting this tool ready for production. I have been using it since the day of its inception and have seen its direct advantage over existing tools. There are only a few kinks here and there, but I believe this will become the ultimate tool for most.

jin0008 on 8 Feb 2019

I am having trouble with AnnotateIntervals. I can't find the right segmental_duplicates.bed file for hg38. Can you provide one on the resource bundle sites?

Begali on 8 Feb 2019

Hi @slee @SkyWarrior @Mehrtash, thanks for the information, but how can I obtain a plot like the one shown here? Is there any tutorial for that? Best regards
