Two weeks ago, for the official release of GATK version 4.0, we held a live online event that was both a launch party and a comprehensive if condensed overview of everything that's new in GATK4. Over the course of two hours, members of the GATK development team and a great lineup of external guests gave presentations about the new capabilities, discussed their implications in small panels and answered questions from the online audience.
I had the privilege of serving as host -- and unintentional comic relief, between forgetting panelists' names and bumping into the set furniture -- so I'm probably biased, but I'd say it was the most fun-yet-informative event we've done so far on GATK. Not that we do a lot of events -- mostly just workshops -- but this one felt pretty special. We had a great time doing it, and lots of people showed up to watch and ask questions. So we're now considering doing more in a similar vein, though each would be focused on a specific topic and have more time for answering questions from the online audience. If that sounds like something you'd be interested in, let us know in the comments!
The brand new 4.0 version of the GATK was released -- at long last! -- on Tuesday Jan 9, 2018.
In lieu of our traditional version highlights, for this release we have collected the following resources:
Coming soon: The GATK4 migration guide will detail the key differences at the level of tools and command lines that you should watch out for when you upgrade to using GATK4 in your own work.
In just a few days, we'll be releasing GATK4 into general availability -- that's right, the big 4.0! To mark the occasion we are hosting a launch event that will be livestreamed on the Broad Institute's Facebook page. Here's a short URL if you'd like to share it: broad.io/facebook.
The launch event is going to be a two-hour whistle-stop tour of what's new and shiny in GATK4. My fellow members of the Data Sciences Platform and GATK development team will give short presentations on key features, then we'll have some panel discussions to dig a bit deeper into the technical underpinnings and implications of these features. For the panels we'll be joined by a really exciting lineup of special guests from the University of California Santa Cruz, Yale School of Medicine, Intel, IBM Research, Verily Life Sciences, Amazon Web Services, Cloudera, Alibaba Cloud, and Microsoft Genomics. Details below the fold.
We should also have some time to take questions from the online audience, so be sure to log in and ask your questions in the comments section of the livestream. We'll also be checking the forums and Twitter for those of you who don't have a Facebook account. To be clear, you don't need an account to watch the video stream.
We hope you'll join us to celebrate this important milestone!
With less than a week to go before the big day (aaaaaaah), we're putting the finishing touches on some important updates to the website and the documentation.
Starting Tuesday Jan 9, the primary supported version will be 4.0, so all the documentation displayed by default on the website will be the 4.0 documentation. That covers not just the Tool Docs, which have always been systematically versioned, but also the forum-based peripheral docs that are more general and typically do not change from one version to the next. In the case of the move to GATK4, a majority of these peripheral doc articles are affected by a range of changes, from minor points of syntax to major shifts in functionality (e.g. switching from `-nct` to Spark for multithreading). Here's how we're planning to deal with that.
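To give a sense of what that multithreading change looks like in practice, here is a rough before-and-after sketch. Note that this is an illustration based on typical GATK3 vs. GATK4 command conventions, not an official migration recipe -- the specific tool, file names, and flags shown are placeholders, and the migration guide will have the authoritative details:

```shell
# GATK3 style: thread-based parallelism via engine flags such as -nct
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R ref.fasta -I sample.bam -knownSites dbsnp.vcf \
    -nct 8 -o recal.table

# GATK4 style: parallelism comes from the Spark variant of the tool instead,
# e.g. running on a local Spark master with 8 cores
gatk BaseRecalibratorSpark \
    -R ref.fasta -I sample.bam --known-sites dbsnp.vcf \
    -O recal.table \
    -- --spark-runner LOCAL --spark-master 'local[8]'
```

The key shift is that multithreading is no longer an engine-level flag you bolt onto any tool; it's provided by dedicated Spark editions of the tools, which can also scale out to a cluster rather than just across local cores.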
What's new in GATK4? In this short video, Laura Gauthier explains how the speed and scalability of joint calling is dramatically improved in GATK4 thanks to the Intel GenomicsDB datastore.
With the GATK4 release just around the corner, we wanted to make it easy for everyone to try out the new pipelines without going through a whole lot of setup. So we're setting them all up in ready-to-run workspaces on FireCloud, which is a secure, freely-accessible, open-source analysis portal we built on Google Cloud (think Galaxy but more scalable). The pipelines are preconfigured according to our Best Practices, so it'll be just a matter of a few clicks to run any pipeline you like on the preloaded example datasets -- or, with a few more (simple) steps, to run them on your own data. All this without ever touching a command line, unless you're the CLI-over-GUI type, in which case you're welcome to use the FireCloud APIs via Swagger or the FISS Python bindings to do all this programmatically.
But that's not all -- we're super excited to announce that we're giving out free credits for running the pipelines! Normally you would have to pay Google for the compute and storage costs -- we make the portal and tools available for free, but Google runs the machines, and they charge you for what you use. However, if you apply ASAP, you can get $250 worth of credits for free! That's more than enough to test out the new pipelines -- enough, in fact, to get some real work done toward your research. And you can run any pipelines you want as long as they're written in WDL, so you can run other tools besides GATK.
The FireCloud free credits program starts January 9th, 2018, when GATK4 is released and the new pipelines are made available in FireCloud. We have secured funding to give out $250 worth of credits each to 1,000 people. Credits will be allocated on a first-come, first-served basis, so the sooner you sign up, the more likely you are to receive credits.
To take advantage of this unique opportunity, all you need to do is register for an account on the FireCloud portal (which is itself always free and open to all) and sign up for the free credits program. Read on below for details about signing up and which pipelines will be featured.
By Sam Friedman, deep learning developer in GATK4
Over the past couple of weeks, there's been a lot of chatter online -- and in the press! -- about the applicability of deep learning to variant calling. This is exciting to me because I've been working on developing a deep learning approach to variant filtering based on Convolutional Neural Networks (CNNs), with the goal of replacing the VQSR filtering step in the GATK germline short variants pipeline. In fact, multiple groups have been picking up on the promise of deep learning and applying it to genomics.
As far as I'm aware, the first group to publish a deep learning-based approach for variant calling was the Campagne lab at Cornell, who released their variationanalysis software in December 2016. There's also a group at Illumina that has been doing some interesting work with deep learning for predicting functional effects of variants, and some of my colleagues are currently working with researchers at Microsoft to see if CNNs can be used to discover complex Structural Variations (SVs) from short sequencing reads.
When the Google team made the source for DeepVariant public last week, I ran a few tests to compare it to the tool that I've been working on for the last six months (GATK4 CNN). The results are summarized in the table below.
Both my tool and DeepVariant outperform VQSR (our "traditional" variant filtering algorithm), especially when VQSR is run on a single sample rather than on a cohort, as our current best practices recommend. The delta isn't all that large on SNPs, but that's expected because germline SNPs are largely a solved problem, so all tools tend to do great there. The harder problem is indel calling, and that's where we see more separation between the tools. The good news is we're all doing better than VQSR on calling indels, which means progress for the research community, whatever else happens. It doesn't hurt my mood that these preliminary findings suggest GATK4 CNN is doing even better than DeepVariant :)
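For context, comparisons like the one above are typically scored by matching each tool's calls against a truth set and computing sensitivity (recall), precision, and their harmonic mean, the F1 score. Here is a minimal sketch of that arithmetic -- the counts below are made up purely for illustration and are not the actual benchmark numbers:

```python
def score(tp, fp, fn):
    """Score a callset against a truth set.

    tp = calls that match a truth variant
    fp = calls with no match in the truth set
    fn = truth variants the tool missed
    """
    recall = tp / (tp + fn)          # sensitivity
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# Illustrative (made-up) indel counts for two hypothetical filtering tools
recall_a, precision_a, f1_a = score(tp=9500, fp=300, fn=500)
recall_b, precision_b, f1_b = score(tp=9200, fp=250, fn=800)

print(f"Tool A: recall={recall_a:.3f} precision={precision_a:.3f} F1={f1_a:.3f}")
print(f"Tool B: recall={recall_b:.3f} precision={precision_b:.3f} F1={f1_b:.3f}")
```

F1 is the usual single-number summary because it penalizes a tool for buying sensitivity with false positives (or vice versa), which is exactly the trade-off a variant filter is making.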
But keep in mind it's early days yet for deep learning in genomics, so a lot could still change as we all figure out how best to take advantage of these methods. Read on if you want to know more about how these results were generated and how my tool works under the hood.
To celebrate the release of GATK 4.0, a project more than two years in the making, we're planning a Facebook Live event during which we'll livestream a series of presentations and discussion panels. These will feature GATK developers as well as special guests who will talk about their experience either using GATK in their research or contributing to its development. There will be ample opportunity to ask questions in the event discussion thread and have them answered live by the presenters and panelists.
The event will take place from 2 PM to 4 PM Eastern Time, and the livestream video will be accessible to all on the Broad Institute's Facebook page. Exact details including URL and guest lineup will be posted on this blog the first week of January.
We're hoping you will join us online for this event!
By Laura Gauthier, lead GATK developer for germline short variant discovery
A: Yes! We are super excited to announce the long-awaited release of The HaplotypeCaller Paper -- or rather, the preprint in bioRxiv. (Actually we announced it on Twitter a while back but we understand not everyone enjoys such an old-school way of keeping up with the news). Hopefully you’re as excited as we are, if not more so, but we understand that this probably raises a few questions for some of you, so we tried to address some of those below.
By Yossi Farjoun, Associate Director of computational research methods in the Data Sciences Platform
A note to explain the context of the new paper by Heng Li, myself, and others, “New synthetic-diploid benchmark for accurate variant calling evaluation”, available as a preprint in bioRxiv.
Developing new tools and algorithms for genome analysis relies heavily on the availability of so-called "truth sets" that are used to evaluate performance (accuracy, sensitivity etc.). This has long been a sticking point, though recently the situation has improved dramatically with the availability of several public, high-quality truth sets such as Genome In A Bottle from NIST and Platinum Genomes from Illumina. Yet even these resources, which have been produced through painstaking analysis and curation, are not immune to the lack of “orthogonality” that plagues most available truth sets. Chief among the resulting problems is that the failure modes of Illumina sequencing are usually masked out, so the data are biased toward the easier parts of the genome.
The paper I linked above introduces a new dataset that we developed to be less biased. It is based solely on PacBio sequencing, and thus its error modes are less correlated with Illumina’s error modes. Using this dataset for benchmarking has given us high confidence in the accuracy of our validations and has enabled us to improve our methods with less concern of overfitting.
See Events calendar for full list and dates