Part 2 of a series on the theme of synthetic data | Start with part 1: Fake your data -- for Science!
Last year my team and I collaborated with a genetics researcher, Dr. Matthieu Miossec, to reproduce an analysis from a paper he had co-authored, as a case study for a workshop. The paper reported variants associated with a particular type of congenital heart disease called Tetralogy of Fallot; a pretty classic example of a germline analysis that uses GATK for variant calling. Between the information provided in the preprint and Matthieu's generous cooperation, we had everything we needed to reimplement the analysis in a more portable, fully shareable form… Except the data, because as with many human genetics studies, the original exome data was private.
In illo tempore, with the exuberance of youth filling my sails, I thought it should be simple enough to create some synthetic data to stand in for the original. I am wiser now, and grayer.
Part 1 of a series on the theme of synthetic data
Don't get me wrong, I'm not suddenly advocating for fraudulent research. What I'm talking about is creating synthetic sequence data for testing pipelines, sharing tools and generally increasing the computational reproducibility of published studies, so that we can all more easily build on each other's work.
The majority of the effort around computational reproducibility has so far focused on better ways to share and run code, as far as I can tell. With great results -- it's been transformative to see the community adopt tooling like version control, containers and Jupyter notebooks. Yet you can give me all the containers and notebooks in the world; if I don't have appropriate data to run that code on, none of it helps me.
Most of the genomic data that gets generated for human biomedical research is subject to very strict access restrictions. These protections exist for good reason, but on the downside, they make it much harder to train researchers in key methodologies until after they have been granted access to specific datasets — if they can get access at all. There are certainly open datasets like 1000 Genomes and ENCODE that can be used beyond their original research purposes for some types of training and testing. However they don't cover the full range of what is needed in the field in terms of technical characteristics (eg exome vs WGS, amount of coverage, number of samples for scale testing etc); not by a long way.
That's where fake data comes in -- we can create synthetic datasets to use as proxies for the real data. This is not a new idea of course; people have been using synthetic data for some time, as in the ICGC-TCGA DREAM Mutation challenges, and there is already a rather impressive range of command-line software packages available for generating synthetic genomic data. It's even possible to introduce (or "spike in") variants into sequencing data, real or fake, on demand. So that's all pretty cool. But in practice these packages tend to be mostly used by savvy tool developers for small-scale testing and benchmarking purposes, and rarely (if ever? send me links!) by biomedical researchers for providing reproducible research supplements.
And frankly, it's no surprise. It's actually kinda hard.
We recently rolled out a fresh new edition of our popular 4-day GATK bootcamp, which we trialed at the Broad, and now we're excited to take it on the road! We have six locations lined up, one every month for the next six months. We're going to be bouncing around Europe quite a bit, with a couple of side trips to Central and South America:
(registration links will be added as they become available)
In this bootcamp-style workshop, we take you through the basics of working with genomic sequence data and a high-level overview of variant discovery so that by the end of the first day, you have a solid understanding of all the main pieces and processes. Then over the course of the next two days we delve into the details of variant calling for the most mature supported use cases -- germline SNPs and Indels on Day 2, somatic SNPs and Indels and somatic copy number (CNVs) on Day 3 -- through a combination of lectures and hands-on tutorials. Finally, on the fourth day we cover the mechanics of assembling and executing analysis workflows, as well as some useful tools and newer use cases that are hot off the development press.
The most innovative feature of this season is that we'll be running all hands-on tutorials on the Broad's new cloud-based analysis platform, Terra (the next-gen version of FireCloud), which you can learn more about at https://terra.bio. Terra is currently in a closed beta phase until May 1st, but you can already request access by filling in a contact form at https://terra.bio/contact. Going forward Terra will be our primary channel for delivering tutorials, working examples of GATK-based analyses and more, so be sure to check it out even if you don't plan to move your work to the cloud.
The other day I decided to disassemble the bathroom doorknob. Efforts included chipping away layers of paint and recruiting some muscle to remove the screws the chipping had revealed. When I levered out the latch system and took it apart, I noticed two things. First, parts of it had a beautiful copper color. Second, the internal spring was broken into three parts. The latter explained the sticky latch you had to jiggle to stay closed and a door that had started to pop open randomly as if possessed.
I made the fix with a new spring.
It may be a compulsion to want to fix broken things. I think it stems from the same curiosity that makes you want to take things apart to understand how they work . When I joined the GATK some 3.5 years ago as a technical writer, this compulsion surfaced and drove me to the effort that resulted in the pieces I wrote. Below, at the end of this post, is a sampling of my most viewed articles.
I would like to thank you for allowing me to serve you these past years. I have learned much in the process. The knowledge I have gained in genomics comes not only from these writing projects but also just as much from answering your questions on the forum. From holding to answering just one forum question a day, I am proud to have earned over 250 likes and points aplenty for the forum’s five-star ranking.
Looking back, 2018 was a busy year. Geraldine asked I help out at the July Cambridge UK workshop and also at the December Taiwan workshop . Each workshop brings with it a torrent of activity creating and updating materials. It is always insightful and rewarding to interact firsthand with researchers, to hear about sticking points and to see reactions to the tutorials I develop and write.
Since returning from the December workshop, I have been submarined pouring effort into finalizing the gCNV tutorial in time for my departure. I hope you find it useful. This tutorial has been the most challenging to develop so far in that exploring the results involved more creative solutions than usual, as you will see in the tutorial’s companion Jupyter Notebook reports here and here .
Before I start searching for a new job, this month I will spend some time visiting friends and family and remembering my Ph.D. advisor at his memorial. If you would like to lend your support, I would love to have your endorsement on LinkedIn . If you need to get in touch with me, please ping me on GitHub, in the broadinstitute/gatk repository. My handle is @sooheelee and I will be checking in intermittently.
It has been a privilege.
By Mark Fleharty & Madeleine Duran
Accurate detection of somatic events from liquid biopsies has the potential to revolutionize precision medicine in cancer. But identifying variants at the low allele fractions in blood requires higher base qualities than are typically reported by the Illumina platform. We are happy to unveil a pipeline that overcomes this challenge with an improved Mutect2 and a custom lab process that uses duplex consensus reads to reduce false-negatives. Our pipeline is able to call SNPs from liquid biopsies with 90% sensitivity for somatic SNPs present at a 1% allele fraction - while calling no more than one false positive per megabase, welcome news for patients who today must endure invasive biopsies to detect and track cancer.
Announcing a brand new “Best Practices” pipeline for calling SNPs and INDELs in the mitochondrial genome! Calling low VAF alleles (variant allele fraction) in the mitochondrial genome presents special problems, but come with great rewards, including diagnosing rare diseases and identifying asymptomatic carriers of pathogenic diseases. We’re excited to begin using this pipeline on tens of thousands of diverse samples from the gnomAD project (http://gnomad.broadinstitute.org/about) to gain greater understanding of population genetics from the perspective of mitochondrial DNA.
The short variants pipeline features a new QUAL score model, which has been around for a while but is now the default in GATK 4.1. The QUAL score of an allele is the probability that at least one sample’s genotype contains at least one copy of that allele. Previously, HaplotypeCaller and GenotypeGVCFs used an exhaustive calculation that explicitly considered the probabilities of zero, one, two copies of the allele. . . all the way up to every sample being homozygous for that allele. This approach was slow -- quadratic in the number of samples. Worse, it was exponentially slower for multiallelic sites, which are common in modern studies like ExAC and gnomAD. In practice the QUAL score relied on dodgy heuristics for multiallelics to keep runtime reasonable, which had unfortunate consequences like two good alleles cancelling each other’s evidence and rare alleles being lost at sites with more common alleles.
The new QUAL model is very similar to the somatic likelihoods model that underlies Mutect2. The idea is that each allele has some unknown frequency in the population and that genotypes are generated by sampling alleles independently (that is, assuming Hardy-Weinberg equilibrium) from these frequencies. Combining these elements with the genotype likelihoods from HaplotypeCaller gives us a probabilistic graphical model that we can solve using the technique of mean-field variational Bayesian inference. In Mutect2 tumor allele fractions replace population allele frequencies and reads replace genotypes but otherwise it’s the same model. For users joint calling many samples, samples with high ploidy, or highly multi-allelic samples, the new QUAL model will dramatically decrease the GenotypeGVCFs memory usage, and everyone will benefit from its increased accuracy.
An important HaplotypeCaller bug fix in the 4.1 release was prompted by a comparatively humble call set of 200 samples. An increased prevalence of ”no call” genotypes for samples that were clearly homozygous reference was biasing their allele frequency estimates for an important pathogenic mutation in their key disease gene. In GVCF mode, after HaplotypeCaller genotypes variant sites, it calculates a “reference confidence” for all other positions in the active region. The reference confidence likelihoods take the less confident of the likelihood of an uncalled SNP or an uncalled indel. Uncalled SNPs are modeled using the base qualities as a representation of the probability of a sequencing error at a variant base that just happened to match the reference. The indel model is more complicated. Reads are considered to be informative for the reference at a given position if the alignment based on the HaplotypeCaller assembly has fewer mismatches than any alternate alignment containing an indel of 10bp or fewer at that position. For example, at the edge of a CAAA repeat (as shown below), a read spanning the repeat and ending in unique sequence on the other side (such as the bottom two reads) will have more mismatches if an indel is introduced and is thus informative for reference likelihood calculation. Reads that end inside a repeat (such as the top two) could potentially have the same or fewer mismatches if an indel is introduced (e.g. at the position marked with dashed lines) and are thusly not informative with respect to the reference likelihoods. This heuristic works reasonably well in practice, but led to biased reference likelihoods near indels. In the old code, the read CIGAR was not being accounted for, so reads containing indels (like the two on the bottom) had a lot of mismatches in their original alignment. The hypothetical insertion of an indel (like the 4 bp deletion in the example) would then reduce the number of mismatches and make that a better alignment than the original, in which case the read cannot be said to support the reference. For the reads in the disease samples of interest, a homozygous deletion caused all samples to have no informative reads positions surrounding the indel (depending on the reference context) because of the CIGAR bug, leading to many GQ0 calls and allele frequency overestimates due to incorrect genotypes. Since making the change respecting the reads’ CIGARs, these no-call genotypes are fixed and reference confidences around indels are much more consistent. Cohorts of any size can benefit from improved reference confidences.
Figure caption: The indel reference confidence model becomes more important in repetitive regions. The top two reads are considered uninformative at the position indicated with the dashed lines. The first read has one mismatch (highlighted in green by IGV). If a one base insertion is added at that position, there will still be a single mismatch. Since that’s not worse than the reference alignment, we can’t say with confidence that there isn’t a 1bp insertion, so we don’t count that read towards the indel reference confidence. However the bottom two reads are informative. If a one base insertion is added, the TGTT sequence will get shifted one to the right, resulting in eight additional mismatches. Since the reference alignment has no mismatches, we can be confident that that alignment is more likely than a 1bp insertion. Mismatches are evaluated similarly for 1 to 10 base pair insertions and deletions for each read at each position.
A few years ago we celebrated the Broad Institute’s tenth birthday with a call for introspection: what have we done well? What should we keep doing? What should we change? By then we had grown to over 4000 employees and affiliates. Out of that concern grew “Be Big, Feel Small” -- a committee to ensure that the Broad enjoys all the advantages of “being big” while still maintaining a culture and community that “feels small”. Yes it’s nice to have a community large enough to capture diversity of culture, research interests, and experience, but it’s a little disheartening when you don’t know the name of the person in the office across the hall.
The GATK user community faces some of the same challenges. There’s certainly a wide variety of research questions and bioinformatic applications, but sometimes it seems like 1% of the users process 99% of the data. While working on some of the Broad flagship projects like gnomAD, we have made a lot of improvements to benefit the Broad production pipeline. At the same time, we’ve done our best to make sure that we don’t lose sight of the needs of the community working with smaller cohorts or single-sample data.
For the past year Funcotator has been a beta tool in GATK. With this new 4.1 release, Funcotator is (finally) out of beta and ready for use in a production environment. So... what exactly is Funcotator, and why should you care?
Funcotator is a functional annotator (FUNCtional annOTATOR) that reads in variants and adds useful information about their potential effects. It’s uses range from answering the question ‘in which gene (if any) does this variant occur’ to predicting an amino acid change string for variants that result in different protein structures. Accurate functional annotation is critical to turning vast amounts of genomic data into a better understanding of protein function.
Created to be a fast, functional, and accurate annotation tool that supports the hg38 genome, many recent updates have made Funcotator more robust and more correct. Of particular note - the protein change string algorithm can now account for protein changes that do not occur at the same amino acid position as a variant (such as when deletions occur in short tandem repeats). If you have a set of variants and wish to identify the genes affected and/or the protein amino acid sequence change, or if you simply wish to cross-reference your variants with a list of variants thought to be implicated in disease - Funcotator is the tool for you.
We publish two sets of data sources to go with Funcotator (including Gencode, ClinVar, gnomAD, and more) so it can be used out of the box and with minimal effort to add annotations to either germline or somatic variants. Best of all, it can be updated by you, the user, to include your favorite annotation data sources when you annotate your VCF files (with some caveats).
Authors of the blog: Laura Gauthier and David Benjamin
I like to think of Mutect2 and HaplotypeCaller as two sisters. One is a little older, but they live in the same house and share a fair amount of genetic material. One is gregarious and surrounds herself with friends. The other prefers the intimate company of a few special people. But they both have a formidable intellect, a determined work ethic, and a selfless instinct to help. Mutect2 and HaplotypeCaller both follow the same basic recipe: determine the active region, assemble the haplotypes, evaluate per-read allele likelihoods, and calculate variant likelihoods. Because their code is fundamentally alike for the first three steps, many changes in one tool are likely to impact the other. Recently Mutect2 development has been much more active than HaplotypeCaller development, but both tools are able to benefit. In his blog post (see https://software.broadinstitute.org/gatk/blog?id=23400), David Benjamin outlined the new Mutect2 features in more detail. Below you’ll see how many of them have consequences for HaplotypeCaller as well.
See Events calendar for full list and dates
See Events calendar for full list and dates