We're planning to upgrade one of the main servers at the heart of FireCloud to unlock some performance enhancements ahead of a training event taking place later this week. We plan to perform this operation between 6:30 and 7:00 am EST on March 20, 2019, and expect it to take about 5 to 10 minutes to complete.
If you would like to be notified of all service incidents or upcoming scheduled maintenance, click Follow on this page.
Announcing a brand new “Best Practices” pipeline for calling SNPs and INDELs in the mitochondrial genome! Calling low-VAF (variant allele fraction) alleles in the mitochondrial genome presents special problems, but comes with great rewards, including diagnosing rare diseases and identifying asymptomatic carriers of pathogenic variants. We’re excited to begin using this pipeline on tens of thousands of diverse samples from the gnomAD project (http://gnomad.broadinstitute.org/about) to gain greater understanding of population genetics from the perspective of mitochondrial DNA.
The short variants pipeline features a new QUAL score model, which has been around for a while but is now the default in GATK 4.1. The QUAL score of an allele is the probability that at least one sample’s genotype contains at least one copy of that allele. Previously, HaplotypeCaller and GenotypeGVCFs used an exhaustive calculation that explicitly considered the probabilities of zero, one, two copies of the allele... all the way up to every sample being homozygous for that allele. This approach was slow -- quadratic in the number of samples. Worse, it was exponentially slower for multiallelic sites, which are common in modern studies like ExAC and gnomAD. In practice the QUAL score relied on dodgy heuristics at multiallelic sites to keep runtime reasonable, with unfortunate consequences like two good alleles cancelling each other’s evidence and rare alleles being lost at sites with more common alleles.
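To see why the exhaustive calculation scales quadratically, here is a minimal sketch (our illustration, not GATK's actual code; names are made up) that convolves per-sample allele-count probabilities into a distribution over the cohort's total allele count. Each new sample's update touches the entire running distribution, so n samples cost O(n^2):

```python
def total_allele_count_distribution(per_sample_probs):
    """per_sample_probs: one [P(0 copies), P(1 copy), P(2 copies)] list per sample.
    Returns the distribution over the total allele count across the cohort."""
    dist = [1.0]  # before any samples, P(total count = 0) = 1
    for probs in per_sample_probs:
        # convolve the running distribution with this sample's allele-count probs;
        # this inner double loop over the whole distribution is the quadratic cost
        new = [0.0] * (len(dist) + 2)
        for total, p_total in enumerate(dist):
            for copies, p_copies in enumerate(probs):
                new[total + copies] += p_total * p_copies
        dist = new
    return dist

# toy cohort of three diploid samples; P(no variant in cohort) = 0.9 ** 3
samples = [[0.9, 0.09, 0.01]] * 3
dist = total_allele_count_distribution(samples)
p_at_least_one_copy = 1.0 - dist[0]
```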
The new QUAL model is very similar to the somatic likelihoods model that underlies Mutect2. The idea is that each allele has some unknown frequency in the population and that genotypes are generated by sampling alleles independently (that is, assuming Hardy-Weinberg equilibrium) from these frequencies. Combining these elements with the genotype likelihoods from HaplotypeCaller gives us a probabilistic graphical model that we can solve using the technique of mean-field variational Bayesian inference. In Mutect2 tumor allele fractions replace population allele frequencies and reads replace genotypes but otherwise it’s the same model. For users joint calling many samples, samples with high ploidy, or highly multi-allelic samples, the new QUAL model will dramatically decrease the GenotypeGVCFs memory usage, and everyone will benefit from its increased accuracy.
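In rough notation (ours, not drawn from the GATK source), the generative model can be sketched as latent allele frequencies, Hardy-Weinberg genotypes drawn from those frequencies, and HaplotypeCaller's genotype likelihoods, with a mean-field factorization for inference:

```latex
% sketch of the generative model; notation is illustrative
f \sim \mathrm{Dirichlet}(\alpha)
    \qquad \text{(latent allele frequencies)} \\
g_i \mid f \;\sim\; \prod_{\text{alleles } a \in g_i} f_a
    \qquad \text{(Hardy-Weinberg: alleles drawn independently)} \\
r_i \mid g_i \;\sim\; P(r_i \mid g_i)
    \qquad \text{(HaplotypeCaller's genotype likelihoods)} \\
% mean-field variational approximation to the joint posterior:
P(f, g_1, \ldots, g_n \mid r) \;\approx\; q(f) \prod_{i=1}^{n} q(g_i)
```

The QUAL score then falls out as one minus the posterior probability that every sample's genotype is homozygous reference.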
An important HaplotypeCaller bug fix in the 4.1 release was prompted by a comparatively humble call set of 200 samples. An increased prevalence of "no call" genotypes for samples that were clearly homozygous reference was biasing the allele frequency estimate for an important pathogenic mutation in the study's key disease gene. In GVCF mode, after HaplotypeCaller genotypes variant sites, it calculates a "reference confidence" for all other positions in the active region. The reference confidence likelihoods take the less confident of the likelihood of an uncalled SNP or an uncalled indel. Uncalled SNPs are modeled using the base qualities as a representation of the probability of a sequencing error at a variant base that just happened to match the reference. The indel model is more complicated. Reads are considered informative for the reference at a given position if the alignment based on the HaplotypeCaller assembly has fewer mismatches than any alternate alignment containing an indel of 10bp or fewer at that position. For example, at the edge of a CAAA repeat (as shown below), a read spanning the repeat and ending in unique sequence on the other side (such as the bottom two reads) will have more mismatches if an indel is introduced, and is thus informative for the reference likelihood calculation. Reads that end inside a repeat (such as the top two) could have the same or fewer mismatches if an indel is introduced (e.g. at the position marked with dashed lines) and are thus not informative with respect to the reference likelihoods. This heuristic works reasonably well in practice, but led to biased reference likelihoods near indels. In the old code, the read CIGAR was not accounted for, so reads containing indels (like the two on the bottom) had many mismatches in their original alignment.
A hypothetical indel (like the 4bp deletion in the example) would then reduce the number of mismatches and make that a better alignment than the original, in which case the read cannot be said to support the reference. For the reads in the disease samples of interest, a homozygous deletion combined with the CIGAR bug left every sample with no informative reads at positions surrounding the indel (depending on the reference context), leading to many GQ0 calls and allele frequency overestimates due to incorrect genotypes. Since we changed the code to respect the reads' CIGARs, these no-call genotypes are fixed and reference confidences around indels are much more consistent. Cohorts of any size can benefit from improved reference confidences.
Figure caption: The indel reference confidence model becomes more important in repetitive regions. The top two reads are considered uninformative at the position indicated with the dashed lines. The first read has one mismatch (highlighted in green by IGV). If a one-base insertion is added at that position, there will still be a single mismatch. Since that’s not worse than the reference alignment, we can’t say with confidence that there isn’t a 1bp insertion, so we don’t count that read towards the indel reference confidence. However, the bottom two reads are informative. If a one-base insertion is added, the TGTT sequence will be shifted one position to the right, resulting in eight additional mismatches. Since the reference alignment has no mismatches, we can be confident that it is more likely than an alignment with a 1bp insertion. Mismatches are evaluated similarly for 1 to 10 base pair insertions and deletions for each read at each position.
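As a rough sketch of the mismatch-counting heuristic (our simplified illustration, ignoring base qualities and CIGAR handling; all names are made up), a read is informative at a position if no hypothetical indel of 10bp or fewer aligns it at least as well as the gapless reference alignment:

```python
REF = "TT" + "CAAA" * 3 + "TGTT"  # a CAAA repeat flanked by unique sequence

def mismatches(seq, ref, start):
    # count mismatches of seq against ref beginning at ref[start];
    # bases running off the end of the reference count as mismatches
    return sum(1 for i, base in enumerate(seq)
               if start + i >= len(ref) or base != ref[start + i])

def informative_at(read, ref, ref_start, pos, max_indel=10):
    """True if every hypothetical indel of <= max_indel bp at pos aligns the
    read strictly worse than the gapless reference alignment does."""
    ref_mm = mismatches(read, ref, ref_start)
    offset = pos - ref_start                  # read index of the indel position
    prefix_mm = mismatches(read[:offset], ref, ref_start)
    for size in range(1, max_indel + 1):
        # deletion of `size` reference bases: the read tail aligns further right
        del_mm = prefix_mm + mismatches(read[offset:], ref, pos + size)
        # insertion of `size` read bases: skip them, tail aligns at pos
        ins_mm = prefix_mm + mismatches(read[offset + size:], ref, pos)
        if min(del_mm, ins_mm) <= ref_mm:
            return False                      # an indel explains the read as well
    return True

# a read spanning the repeat into unique sequence is informative;
# a read ending inside the repeat is not
```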
A few years ago we celebrated the Broad Institute’s tenth birthday with a call for introspection: what have we done well? What should we keep doing? What should we change? By then we had grown to over 4000 employees and affiliates. Out of that introspection grew “Be Big, Feel Small” -- a committee to ensure that the Broad enjoys all the advantages of “being big” while still maintaining a culture and community that “feels small”. Yes, it’s nice to have a community large enough to capture diversity of culture, research interests, and experience, but it’s a little disheartening when you don’t know the name of the person in the office across the hall.
The GATK user community faces some of the same challenges. There’s certainly a wide variety of research questions and bioinformatic applications, but sometimes it seems like 1% of the users process 99% of the data. While working on some of the Broad flagship projects like gnomAD, we have made a lot of improvements to benefit the Broad production pipeline. At the same time, we’ve done our best to make sure that we don’t lose sight of the needs of the community working with smaller cohorts or single-sample data.
For the past year Funcotator has been a beta tool in GATK. With this new 4.1 release, Funcotator is (finally) out of beta and ready for use in a production environment. So... what exactly is Funcotator, and why should you care?
Funcotator is a functional annotator (FUNCtional annOTATOR) that reads in variants and adds useful information about their potential effects. Its uses range from answering the question ‘in which gene (if any) does this variant occur?’ to predicting an amino acid change string for variants that result in different protein structures. Accurate functional annotation is critical to turning vast amounts of genomic data into a better understanding of protein function.
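At its core, answering "in which gene does this variant occur?" is an interval lookup against known gene coordinates. Here is a conceptual toy illustration (not Funcotator's code; the gene table and names are invented):

```python
import bisect

# hypothetical sorted gene intervals on one contig: (start, end, name)
GENES = [(100, 500, "GENE_A"), (800, 1200, "GENE_B")]

def annotate(pos):
    """Return the name of the gene containing pos, or 'intergenic'."""
    starts = [g[0] for g in GENES]
    i = bisect.bisect_right(starts, pos) - 1   # rightmost gene starting at or before pos
    if i >= 0 and GENES[i][0] <= pos <= GENES[i][1]:
        return GENES[i][2]
    return "intergenic"
```

Funcotator layers transcript models, protein change prediction, and cross-references onto this basic idea using its configured data sources.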
Funcotator was created to be a fast, functional, and accurate annotation tool that supports the hg38 genome, and many recent updates have made it more robust and more correct. Of particular note - the protein change string algorithm can now account for protein changes that do not occur at the same amino acid position as a variant (such as when deletions occur in short tandem repeats). If you have a set of variants and wish to identify the genes affected and/or the protein amino acid sequence change, or if you simply wish to cross-reference your variants with a list of variants thought to be implicated in disease - Funcotator is the tool for you.
We publish two sets of data sources to go with Funcotator (including Gencode, ClinVar, gnomAD, and more) so it can be used out of the box and with minimal effort to add annotations to either germline or somatic variants. Best of all, it can be updated by you, the user, to include your favorite annotation data sources when you annotate your VCF files (with some caveats).
By Laura Gauthier and David Benjamin
I like to think of Mutect2 and HaplotypeCaller as two sisters. One is a little older, but they live in the same house and share a fair amount of genetic material. One is gregarious and surrounds herself with friends. The other prefers the intimate company of a few special people. But they both have a formidable intellect, a determined work ethic, and a selfless instinct to help. Mutect2 and HaplotypeCaller both follow the same basic recipe: determine the active region, assemble the haplotypes, evaluate per-read allele likelihoods, and calculate variant likelihoods. Because their code is fundamentally alike for the first three steps, many changes in one tool are likely to impact the other. Recently Mutect2 development has been much more active than HaplotypeCaller development, but both tools are able to benefit. In his blog post (see https://software.broadinstitute.org/gatk/blog?id=23400), David Benjamin outlined the new Mutect2 features in more detail. Below you’ll see how many of them have consequences for HaplotypeCaller as well.
After training on a massive amount of data, the neural network models at the heart of the CNN variant filtering tools have learned enough to be released from beta.
This suite of tools for single-sample variant filtration uses machine learning to differentiate between good variants and artifacts of the sequencing process, a fairly new approach that is especially effective at correctly calling indels. To enable the models to accurately filter and score variants from VCF files, we trained on validated VCFs (from truth sets including SynDip, Genome in a Bottle, and Platinum Genomes) together with unvalidated VCFs aligned to different reference builds (hg19, hg38), sequenced on different machines, using different protocols. The result is a single-sample tool that complements, rather than replaces, VQSR (our "traditional" cohort-based variant filtering algorithm).
What’s in the CNN toolbox? Pre-trained and train-your-own
CNNScoreVariants is a pre-trained convolutional neural network ready to score variants. After scoring, you can filter your VCF by applying a sensitivity threshold with the tool FilterVariantTranches. Or you can train your own models with the tools CNNVariantWriteTensors and CNNVariantTrain, as long as you have validated VCFs to use as training data.
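The sensitivity-threshold idea can be sketched like this (a toy illustration, not the FilterVariantTranches implementation; names are made up): choose the score cutoff at which a target fraction of variants from a truth resource still passes.

```python
def tranche_cutoff(truth_scores, target_sensitivity=0.99):
    """Score threshold such that target_sensitivity of the truth-set
    variants score at or above it."""
    ranked = sorted(truth_scores, reverse=True)     # best scores first
    keep = max(1, round(target_sensitivity * len(ranked)))
    return ranked[keep - 1]

# with truth scores 1..10 and a 90% target, the cutoff keeps the top nine
```

Call-set variants scoring at or above the cutoff pass the tranche; everything below is flagged as filtered.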
While the current release is only recommended for single sample variant filtration, in the future we hope to explore ways a similar model could be applied to joint-called VCFs.
Under the hood
For details on how our deep learning tool works and how it stacks up to other variant filtering tools (both Deep Learning and traditional), as well as a bit of history on Deep Learning in genomics, see our other blog post.
Early adopters of GATK4 will recall that somatic and germline copy-number variation (CNV) pipelines were among the first to be developed. The current generation of these pipelines still bears traits that reflect its evolutionary beginnings, but has also acquired adaptations that take it far beyond its predecessors’ limitations. With the release of GATK 4.1, we are excited to bring the latest and greatest versions of these pipelines out of beta and to officially add CNV calling to GATK’s ever-growing set of capabilities.
Evolution of the CNV pipelines
Beta versions of the GATK CNV pipelines were heavily influenced by methods previously developed at the Broad. For example, the GATK4(beta) CNV/AllelicCNV pipeline bore a strong resemblance to the exome ReCapSeg/AllelicCapSeg pipeline developed by the Cancer Genome Analysis Group. The germline GATK4(beta) XHMMSegmentCaller pipeline was a near-direct port of the XHMM (eXome-Hidden Markov Model) tool. Vestiges of these venerable ancestors still remain in GATK 4.1’s ModelSegments and GermlineCNVCaller pipelines; however, new innovations yield dramatically improved performance and enable scalability from exomes to genomes.
By Tom White & James Emery
We continue to improve our support for users who want to run on Apache Spark with GATK 4.1. Spark is a popular engine for large-scale data processing that can be deployed in clusters or on a standalone machine with multiple cores - a good way to speed up GATK tools without having to invest in a cluster.
The newest release introduces MarkDuplicatesSpark, a newly out-of-beta replacement redesigned to take advantage of multicore parallelism and run on cloud platforms. When run on large multicore machines on Google Compute Engine, the new and improved version is cost-competitive with Picard MarkDuplicates + SortSam while being significantly faster overall. Other Spark improvements include the full ReadsPipelineSpark, powered by a brand new Spark I/O library, Disq.
As of GATK 4.0, Mutect2 did not work well with the extremely high depths seen in mitochondrial samples and cfDNA liquid biopsies. Thanks to a new active region calculation and a few new filters, Mutect2 now works off the shelf for these cases.
This work emboldened us to improve the local assembly code shared with HaplotypeCaller (HC). Adaptive assembly graph pruning advances both high- and low-depth calling. This feature (default in Mutect2 and optional in HC) uses a probabilistic model that accounts for depth, removing spurious paths in the assembly graph while retaining real mutations. Processing fewer false paths reduces runtime and improves accuracy, even in an assembly region with varying depth, such as one that spans the end of an exome capture target.
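The intuition behind depth-aware pruning can be sketched with a simple binomial model (ours, not the actual GATK algorithm; the error rate and threshold are invented): keep a path only if real variation explains its read support much better than sequencing error does, so that three supporting reads can be noise at 1000x depth yet a real signal at 30x.

```python
import math

def log_binom_pmf(k, n, p):
    """Log of the binomial probability mass function."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def keep_path(support, depth, error_rate=0.001, log_odds_threshold=2.0):
    """Keep an assembly-graph path if its read support is far more likely
    under real variation than under the sequencing-error model."""
    log_error = log_binom_pmf(support, depth, error_rate)
    # crude alternative hypothesis: the empirical support fraction
    log_real = log_binom_pmf(support, depth, max(support / depth, error_rate))
    return log_real - log_error > log_odds_threshold
```

A fixed support threshold cannot make this distinction, which is why the adaptive model helps in regions of highly variable depth.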
Mutect2 can now monitor a tumor after chemotherapy using the new “GGA” (genotype given alleles) mode. Providing pre-treatment somatic variants in a VCF as the given alleles forces Mutect2 to genotype those alleles in addition to those that it would otherwise discover (in contrast to HC’s GGA mode). You can also use GGA mode to give special attention to calls from other tools, as in a head-to-head comparison, or to known driver mutations.
Mutect2 now allows joint somatic genotyping of an arbitrary number of tumor and normal samples. With this change, you only need to specify which samples are normal. Mutect2 assumes anything else is a tumor sample. Allowing multiple samples from the same individual enables tracking a tumor and possibly its metastases over time. The feature, currently in beta, will become important as liquid biopsies increase in popularity.
Several tools in the orbit of Mutect2 are also entirely new or overhauled:
Both sensitivity and precision are better than they were a year ago. In the coming weeks and months, watch out for additional improvements, including:
Improvements in Mutect2 that account for PCR error and allow you to keep both reads. Mutect2 currently discards one read of a mate pair that overlaps a variant site, on the grounds that the reads are not independent sources of evidence, since they might arise from the same PCR error.
A somatic likelihoods model that will force a read and its mate to support the same haplotype, since they must be from the same piece of DNA.
A Bayesian model that enables FilterMutectCalls to learn the spectrum of tumor allele fractions and better distinguish between false positives and real somatic variants.
On the software engineering side of things, we have made Mutect2 faster and cheaper. Whole-genome pairs of 60x tumors and 60x normals take ~24 hours of total CPU time and cost about a dollar on Google Cloud in our public validations on FireCloud. The Mutect2 WDL can parallelize this to reduce the wall clock time to an hour or so! The WDL supports direct reading from Google Cloud Storage buckets, so a Mutect2 run streams only the reads it needs.
This is just the tip of our wish list for the next year, and we look forward to getting a lot done. While the joint assembly and genotyping work well, our initial attempt at multi-sample filtering will need improvement. We would be very grateful for suggestions for improvement and feature requests from early adopters.
See Events calendar for full list and dates