Production-ready tools to call copy-number variants

Early adopters of GATK4 will recall that the somatic and germline copy-number variation (CNV) pipelines were among the first to be developed. The current generation of these pipelines still bears traits that reflect its evolutionary beginnings, but it has also acquired adaptations that take it far beyond its predecessors’ limitations. With the release of GATK4.1, we are excited to bring the latest and greatest versions of these pipelines out of beta and to officially add CNV calling to GATK’s ever-growing set of capabilities.

Evolution of the CNV pipelines

Beta versions of the GATK CNV pipelines were heavily influenced by methods previously developed at the Broad. For example, the GATK4(beta) CNV/AllelicCNV pipeline strongly resembled the exome ReCapSeg/AllelicCapSeg pipeline developed by the Cancer Genome Analysis Group. The germline GATK4(beta) XHMMSegmentCaller pipeline was a near-direct port of the XHMM (eXome-Hidden Markov Model) tool. Vestiges of these venerable ancestors remain in GATK4.1’s ModelSegments and GermlineCNVCaller pipelines; however, innovations in both yield dramatically improved performance and enable scalability from exomes to genomes.


CNV calling in a nutshell

To appreciate these innovations, let’s review the problem of calling CNVs from sequencing read-depth data, which can be a tough nut to crack! Like Darwin’s finches, different CNV tools have evolved a variety of ways to crack this nut, but their overall function is largely the same. CNV tools typically break down the problem into two more manageable tasks:

  1. Denoising: Separating true CNV signal from systematic sequencing noise can be quite a challenge. Many CNV tools employ denoising strategies that learn patterns of noise from a panel of control samples and then remove them. For example, both ReCapSeg and XHMM use principal components analysis (PCA) denoising.

  2. Segmentation: The signal from CNV events can vary both in genomic length and amplitude. Algorithms like the circular binary segmentation (CBS) method used by ReCapSeg can identify genomic segments that contain somatic CNV signal. For germline calling, where the signal appears at amplitudes corresponding to integer copy-number states, a Hidden Markov Model (HMM) like the one used by XHMM can work well.
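To make the germline case concrete, here is a minimal Python sketch of HMM-style calling on toy read-depth data. It is not XHMM's or GATK's implementation: the function name `viterbi_copy_number`, the Gaussian emission model, and the parameters `depth_per_copy`, `sigma`, and `stay_prob` are all illustrative assumptions.

```python
import math


def viterbi_copy_number(depths, states=(0, 1, 2, 3, 4), depth_per_copy=50.0,
                        sigma=10.0, stay_prob=0.95):
    """Most likely integer copy-number path for a list of per-bin read depths.

    Toy model (all assumptions): the expected depth is cn * depth_per_copy
    with Gaussian noise of width sigma, and the chain stays in its current
    state with probability stay_prob.
    """
    if not depths:
        return []
    n_states = len(states)
    switch_prob = (1.0 - stay_prob) / (n_states - 1)

    def log_emission(depth, cn):
        # Gaussian log-density, up to an additive constant.
        return -0.5 * ((depth - cn * depth_per_copy) / sigma) ** 2

    def log_transition(i, j):
        return math.log(stay_prob if i == j else switch_prob)

    # Uniform prior over states for the first bin.
    scores = [log_emission(depths[0], cn) for cn in states]
    backpointers = []
    for depth in depths[1:]:
        new_scores, pointers = [], []
        for j, cn in enumerate(states):
            best_i = max(range(n_states),
                         key=lambda i: scores[i] + log_transition(i, j))
            new_scores.append(scores[best_i] + log_transition(best_i, j)
                              + log_emission(depth, cn))
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda i: scores[i])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [states[i] for i in path]
```

With these toy settings, `viterbi_copy_number([100, 104, 97, 52, 49, 55, 101, 99])` returns `[2, 2, 2, 1, 1, 1, 2, 2]`: a three-bin heterozygous deletion flanked by diploid bins. The transition penalty is what keeps a single noisy bin from being called as an event.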

GATK4.1 ModelSegments: A next-generation CNV caller

GATK4.1’s ModelSegments pipeline is a streamlined, modernized, and highly evolved version of the ReCapSeg pipeline from which it descended. Like its ancestor, the ModelSegments pipeline uses PCA denoising and a panel of control samples to remove systematic sequencing noise. However, we’ve optimized our denoising code to drastically reduce both runtime and memory requirements. Panels that used to take upwards of an entire day to build using ReCapSeg can now be built in under an hour, and at ~100x higher resolution to boot!
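The idea behind panel-based PCA denoising can be sketched in a few lines of plain Python. This is a simplified illustration, not the ModelSegments code: it assumes the inputs are already log2 copy ratios per bin, removes only the single leading principal component (found by power iteration), and the helper names `center_columns`, `top_component`, and `denoise` are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def center_columns(panel):
    """Subtract each bin's panel mean; return the centered panel and the means."""
    n = len(panel)
    means = [sum(col) / n for col in zip(*panel)]
    centered = [[x - m for x, m in zip(row, means)] for row in panel]
    return centered, means


def top_component(centered, n_iter=200):
    """Leading principal axis (unit vector over bins) via power iteration."""
    # Start from the largest row so the start vector lies in the panel's row space.
    v = max(centered, key=lambda row: dot(row, row))
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("panel has no variation to learn from")
    v = [x / norm for x in v]
    for _ in range(n_iter):
        # Apply X^T X to v without forming the bins-by-bins matrix.
        scores = [dot(row, v) for row in centered]
        w = [sum(s * row[j] for s, row in zip(scores, centered))
             for j in range(len(v))]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def denoise(sample, panel):
    """Remove the panel's leading noise component from one sample."""
    centered, means = center_columns(panel)
    axis = top_component(centered)
    residual = [x - m for x, m in zip(sample, means)]
    loading = dot(residual, axis)
    return [r - loading * a for r, a in zip(residual, axis)]
```

If every panel sample is a multiple of one alternating noise pattern and the case sample carries that same pattern plus a deletion that happens to be orthogonal to it, `denoise` returns the deletion signal alone. When a real CNV overlaps the learned noise axis, part of it is subtracted too, which is the known weakness of this style of denoising.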

We’ve also developed a new kernel-segmentation method to replace CBS, the long-standing workhorse algorithm. This method enables scaling to high-resolution whole-genome data as well as segmentation of multidimensional data. Combined with the improvements to denoising, the new segmentation method allows ModelSegments to run well on both exomes and genomes.
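As a rough illustration of what kernel-based segmentation means, the Python sketch below finds a single change point by minimizing the within-segment scatter in kernel feature space. It is a textbook-style toy, not the ModelSegments algorithm: the Gaussian kernel, its `bandwidth`, and the function names are assumptions, and a production method would need to find multiple change points and avoid the full quadratic Gram matrix, neither of which is attempted here.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def segment_cost(gram, start, end):
    """Scatter of points [start, end) around their mean in kernel feature space."""
    n = end - start
    diag = sum(gram[i][i] for i in range(start, end))
    block = sum(gram[i][j] for i in range(start, end) for j in range(start, end))
    return diag - block / n


def best_changepoint(values, kernel=gaussian_kernel):
    """Index t that best splits values into values[:t] and values[t:].

    Requires at least two values; min() raises ValueError otherwise.
    """
    n = len(values)
    gram = [[kernel(a, b) for b in values] for a in values]
    return min(range(1, n),
               key=lambda t: segment_cost(gram, 0, t) + segment_cost(gram, t, n))
```

For example, `best_changepoint([0.0, 0.1, -0.1, 0.05, 1.0, 1.1, 0.9, 1.05])` returns `4`. Because the cost depends on the data only through pairwise kernel values, the same code works on multidimensional points once `kernel` is swapped for one that compares vectors.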

GATK4.1 GermlineCNVCaller: A new species of CNV caller

GATK4.1’s GermlineCNVCaller pipeline introduces even more novel methods, representing a saltational step in the evolution of CNV tools.

Taking advantage of computational frameworks from the world of probabilistic programming (namely, PyMC3 and Theano), GermlineCNVCaller is able to model systematic biases and CNV events simultaneously. Simpler approaches to denoising (such as PCA) cannot always distinguish between signal and noise, and sometimes inadvertently subtract the signal. In contrast, our new modeling approach yields high sensitivity, especially in genomic regions of common CNV activity.

GermlineCNVCaller also introduces a hierarchical HMM method for segmentation, which learns these regions of common CNV activity across multiple samples while simultaneously calling CNVs in each sample. GermlineCNVCaller shines on noisy exome data, and it can also scale to genomes by harnessing the power of Cromwell and WDL.

An animation of GermlineCNVCaller inference performed on a cohort of simulated exome samples. Video by: Mehrtash Babadi

The sample-by-target heatmaps in the center column show 1) count data generated from 2) underlying copy-number (CN) events; GermlineCNVCaller infers 3) CN calls in each sample, while also identifying 4) regions of common CNV activity (indicated in yellow). Counts and inferred CN calls are plotted for a single sample on the right, while various quantities that determine model convergence are tracked over learning iterations on the left.

Though they owe a lot to their prototypical predecessors, GATK4.1’s CNV pipelines have evolved substantially to yield dramatically improved performance and augmented capabilities. GATK CNV tool development is ongoing, so stay tuned for the next stage of evolution!



Fri 8 Feb 2019

SkyWarrior on 8 Feb 2019


Thank you all for getting this tool ready for production. I have been using it since the day of its inception and saw the direct advantage over existing tools. There are only a few kinks here and there, but I believe this will become the ultimate tool for most users.

jin0008 on 8 Feb 2019


I have trouble with AnnotateInterval. I can't find the right segmental_duplicates.bed file for hg38. Can you provide one at the resource bundle sites?



