Given my years as a biochemist, when handed two samples to compare, my first impulse is to ask what the functional differences are, i.e. differences in the proteins expressed between the two samples. I am interested in genomic alterations that ripple down the central dogma to transform a cell.

Please note the workflow that follows is NOT a part of the Best Practices. This is an illustrative, unsupported workflow. For the official Somatic Short Variant Calling Best Practices workflow, see Tutorial#11136.

To call every allele that is different between two samples, I have devised a two-pass workflow that takes advantage of Mutect2 features. This workflow uses Mutect2 in tumor-only mode and appropriates the --germline-resource argument to supply a single-sample VCF with allele fractions instead of population allele frequencies. The workflow assumes that the two case samples being compared originate from the same parental line and that their ploidy and mutation rates make it unlikely that any site accumulates more than one allele change.
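As a rough illustration only, a single tumor-only Mutect2 invocation that passes a single-sample VCF through --germline-resource might be assembled as in the sketch below. The helper name, file names and sample name are hypothetical, the code only builds the command line without running it, and it does not reproduce the full two-pass workflow, whose details are in the complete post.

```python
from typing import List


def mutect2_tumor_only_cmd(reference: str, bam: str, sample_name: str,
                           resource_vcf: str, out_vcf: str) -> List[str]:
    """Build (but do not run) a Mutect2 tumor-only command line.

    Hypothetical helper for illustration. resource_vcf is assumed to be
    a single-sample VCF carrying allele fractions, standing in for the
    population allele frequencies a germline resource normally holds.
    """
    return [
        "gatk", "Mutect2",
        "-R", reference,                      # reference FASTA
        "-I", bam,                            # reads for the one case sample
        "-tumor", sample_name,                # sample name (expected by GATK 4.0)
        "--germline-resource", resource_vcf,  # the appropriated argument
        "-O", out_vcf,                        # output callset
    ]


# Example with made-up file names: call sample B against sample A's VCF.
cmd = mutect2_tumor_only_cmd(
    reference="ref.fasta",
    bam="sampleB.bam",
    sample_name="sampleB",
    resource_vcf="sampleA.af.vcf.gz",
    out_vcf="sampleB_vs_A.vcf.gz",
)
print(" ".join(cmd))
```

Because no matched normal is supplied, this corresponds to tumor-only mode; which sample plays which role in each pass is an assumption here, not something this excerpt specifies.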


Over the past two weeks and a bit, the GATK 4.0(.0.0) package has been downloaded nearly eight thousand times. That's... not too shabby! Let's see if y'all can take it to 8,000 before we cut the release :)

Yes, I plan to explain the version numbering system in an upcoming blog post.

Looking back at our download records, it outdoes any previous release we've ever done by a factor of nearly four. Interestingly, it comes after a major slump in download numbers over the past six months, a.k.a. since we announced the GATK4 beta and the open-sourcing at the Bio-IT World meeting in May 2017. It looks like a lot of people were holding their breath waiting for the 4.0 release... I hope it was worth the wait.


Two weeks ago, for the official release of GATK version 4.0, we held a live online event that was both a launch party and a comprehensive if condensed overview of everything that's new in GATK4. Over the course of two hours, members of the GATK development team and a great lineup of external guests gave presentations about the new capabilities, discussed their implications in small panels and answered questions from the online audience.

I had the privilege of serving as host -- and unintentional comic relief, between forgetting panelists and bumping into the set furniture -- so I'm probably biased, but I'd say it was the most fun-yet-informative event we've done so far on GATK. Not that we do a lot of events -- and it's mostly just workshops -- but this felt pretty special. We had a great time doing it, and lots of people showed up to watch and ask questions. So we're now considering doing others in a similar vein, though they would each be focused on a specific topic and have more time for answering questions from the online audience. If that sounds like something you'd be interested in, let us know in the comments!

Anyway, the recording of the full event is now available on YouTube, and you can always find it again on our Presentations page.


The brand new 4.0 version of the GATK was released -- at long last! -- on Tuesday Jan 9, 2018.

In lieu of our traditional version highlights, for this release we have collected the following resources:

Coming soon: The GATK4 migration guide will detail the key differences at the level of tools and command lines that you should watch out for when you upgrade to using GATK4 in your own work.


In just a few days, we'll be releasing GATK4 into general availability -- that's right, the big 4.0! To mark the occasion we are hosting a launch event that will be livestreamed on the Broad Institute's Facebook page. Here's a short URL if you'd like to share it:

The launch event is going to be a two-hour whistle-stop tour of what's new and shiny in GATK4. My fellow members of the Data Sciences Platform and GATK development team will give short presentations on key features, then we'll have some panel discussions to dig a bit deeper into the technical underpinnings and implications of these features. For the panels we'll be joined by a really exciting lineup of special guests from the University of California Santa Cruz, Yale School of Medicine, Intel, IBM Research, Verily Life Sciences, Amazon Web Services, Cloudera, Alibaba Cloud, and Microsoft Genomics. Details below the fold.

We should also have some time to take questions from the online audience, so be sure to log in and ask your questions in the comments section of the livestream. We'll also be checking the forums and Twitter for those of you who don't have a Facebook account. To be clear, you don't need an account to watch the video stream.

We hope you'll join us to celebrate this important milestone!


With less than a week to go before the big day (aaaaaaah), we're putting the finishing touches on some important updates to the website and the documentation.

Starting Tuesday Jan 9, the primary supported version will be 4.0, so all the documentation displayed by default on the website will be the 4.0 documentation. That covers not just the Tool Docs, which have always been systematically versioned, but also the forum-based peripheral docs that are more general and typically do not change from one version to the next. In the case of the move to GATK4, a majority of these peripheral doc articles are affected by a range of changes, from minor points of syntax to major shifts in functionality (e.g. switching from -nt/-nct to Spark for multithreading). Here's how we're planning to deal with that.


What's new in GATK4? In this short video, Laura Gauthier explains how the speed and scalability of joint calling is dramatically improved in GATK4 thanks to the Intel GenomicsDB datastore.


With the GATK4 release just around the corner, we wanted to make it easy for everyone to try out the new pipelines without going through a whole lot of setup. So we're setting them all up in ready-to-run workspaces on FireCloud, which is a secure, freely-accessible, open-source analysis portal we built on Google Cloud (think Galaxy but more scalable). The pipelines are preconfigured according to our Best Practices, so it'll be just a matter of a few clicks to run any pipeline you like on the preloaded example datasets -- or, with a few more (simple) steps, to run them on your own data. All this without ever touching a command line, unless you're the CLI-over-GUI type, in which case you're welcome to use the FireCloud APIs via Swagger or the FISS Python bindings to do all this programmatically.

But that's not all -- we're super excited to announce that we're giving out free credits for running the pipelines! Normally you would have to pay Google for the compute and storage costs -- we make the portal and tools available for free, but Google runs the machines, and they charge you for what you use. However, if you apply ASAP, you can get $250 worth of credits for free! That should be more than enough to test out the new pipelines; with that amount of credits you should be able to get real work done toward your research. And you can run any pipelines you want as long as they're written in WDL, so you can run other tools besides GATK.

The FireCloud free credits program starts January 9th, 2018, when GATK4 is released and the new pipelines are made available in FireCloud. We have secured funding to give out $250 worth of credits each to 1,000 people. Credits will be allocated on a first-come, first-served basis, so the sooner you sign up, the more likely you are to receive credits.

To take advantage of this unique opportunity, all you need to do is register for an account on the FireCloud portal (which is itself always free and open to all) and sign up for the free credits program. Read on below for details about signing up and which pipelines will be featured.


Deep learning in GATK4

Posted by samwell on 21 Dec 2017 (25)

By Sam Friedman, deep learning developer in GATK4

Over the past couple of weeks, there's been a lot of chatter online --and in the press!-- about the applicability of deep learning to variant calling. This is exciting to me because I've been working on developing a deep learning approach to variant filtering based on Convolutional Neural Networks (CNNs), with the goal of replacing the VQSR filtering step in the GATK germline short variants pipeline. In fact, multiple groups have been picking up on the promise of deep learning and applying it to genomics.

As far as I'm aware, the first group to publish a deep learning-based approach for variant calling was the Campagne lab at Cornell, who released their variationanalysis software in December 2016. There's also a group at Illumina that has been doing some interesting work with deep learning for predicting functional effects of variants, and some of my colleagues are currently working with researchers at Microsoft to see if CNNs can be used to discover complex Structural Variations (SVs) from short sequencing reads.

When the Google team made the source for DeepVariant public last week, I ran a few tests to compare it to the tool that I've been working on for the last six months (GATK4 CNN). The results are summarized in the table below.

Both my tool and DeepVariant outperform VQSR (our "traditional" variant filtering algorithm), especially when VQSR is run on a single sample rather than on a cohort according to our current best practices. The delta isn't all that large on SNPs, but that's expected because germline SNPs are largely a solved problem, so all tools tend to do great there. The harder problem is indel calling, and that's where we see more separation between the tools. The good news is we're all doing better than VQSR on calling indels, which means progress for the research community, whatever else happens. It doesn't hurt my mood that these preliminary findings suggest GATK4 CNN is doing even better than DeepVariant :)

But keep in mind it's early days yet for deep learning in genomics, so a lot could still change as we all figure out how best to take advantage of these methods. Read on if you want to know more about how these results were generated and how my tool works under the hood.


To celebrate the release of GATK 4.0, a project more than two years in the making, we're planning a Facebook Live event during which we'll livestream a series of presentations and discussion panels. These will feature GATK developers as well as special guests who will talk about their experience either using GATK in their research or contributing to its development. There will be ample opportunity to ask questions in the event discussion thread and have them answered live by the presenters and panelists.

The event will take place from 2 PM to 4 PM Eastern Time, and the livestream video will be accessible to all on the Broad Institute's Facebook page. Exact details including URL and guest lineup will be posted on this blog the first week of January.

We're hoping you will join us online for this event!

