A while back, I posted this article about work done by the Intel Bio Team to benchmark the speed and resource utilization of each step in the per-sample segment of the germline variation pipeline (from BWA to HaplotypeCaller; FASTQ to GVCF). They published their results as a white paper on the Intel Life Sciences website, which has a section dedicated to GATK (which makes us feel all warm and tingly).
Now the Intel team has published an updated version of the white paper here that extends the work, originally done on a WGS trio, to a cohort of 50 exomes and adds the joint analysis segment of the pipeline (GenotypeGVCFs to VQSR; GVCFs to filtered multisample VCF) for both datasets.
Here it is at last… as in, last release for 2016, and possibly the last point release of GATK 3 ever!
Aside from the usual pile of bug fixes, the new features in this version are actually (almost) all features or improvements that were developed for GATK 4. We backported them to the GATK 3 framework to make them widely available sooner rather than later, since we still have some work to do to make GATK 4 complete enough to become the new standard. Of course there are a lot of other new things in the GATK 4 alpha version that we can't backport (especially those related to speed/performance improvements) because they depend on the new framework. But what we could backport, we did.
The hottest change here is a new model for calculating the QUAL score, but be aware it's there on an opt-in basis, not enabled by default. This also comes with a lower default value for the -stand_call_conf threshold, and deprecation of the confusing and ultimately rather pointless threshold -stand_emit_conf. We're also introducing some logic for prioritizing alleles to improve performance in messy regions. And we've got some improvements for MuTect2, although that tool does remain in beta status for now.
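If you wrap your GATK 3 commands in scripts, the deprecation of -stand_emit_conf is a good reason to check what arguments those scripts pass. Here is a minimal Python sketch of one way to do that. The helper name, the file names, the jar path and the threshold value of 20 are all hypothetical placeholders for illustration, not recommendations or defaults; only the -stand_call_conf and -stand_emit_conf argument names come from the notes above.

```python
import shlex

# Arguments that this post describes as deprecated.
DEPRECATED_FLAGS = {"-stand_emit_conf"}


def build_haplotypecaller_cmd(reference, bam, output,
                              stand_call_conf=None, extra_args=()):
    """Assemble a GATK 3 HaplotypeCaller command as a list of tokens.

    Hypothetical helper: it only builds the command, it does not run it.
    The jar path is an assumption about the local installation.
    """
    deprecated = [arg for arg in extra_args if arg in DEPRECATED_FLAGS]
    if deprecated:
        raise ValueError("deprecated argument(s): " + ", ".join(deprecated))

    cmd = [
        "java", "-jar", "GenomeAnalysisTK.jar",  # assumed jar location
        "-T", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-o", output,
    ]
    # Leave the threshold out entirely to get whatever default the tool uses.
    if stand_call_conf is not None:
        cmd += ["-stand_call_conf", str(stand_call_conf)]
    cmd += list(extra_args)
    return cmd


if __name__ == "__main__":
    # Placeholder file names and an arbitrary threshold, for illustration only.
    tokens = build_haplotypecaller_cmd(
        "reference.fasta", "sample.bam", "sample.vcf", stand_call_conf=20
    )
    print(shlex.join(tokens))
```

Leaving stand_call_conf unset means the command simply inherits the tool's own default, which is usually what you want unless you have a specific reason to override it.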
As usual, see the release notes for a full list of changes, and read on below for details on what we think you'll care most about.
GATK 3.7 was released on December 12, 2016. Itemized changes are listed below. For more details, see the user-friendly version highlights.
Specifically, I tested the alpha release of the Google Genomics Pipelines API, which is driven from the command line. Down the road, we will post similar write-ups for the UI-driven systems FireCloud and Workbench. In this particular challenge, my aim is to first genotype a trio and then a cohort of 17 whole genome BAMs that are available in the cloud. I need the resulting VCF callsets within a week.
GATK workshops bring you the latest in our methods development. The materials we prepare for workshops often serve as a base for our documentation on new or improved tools and workflows. So not only do GATK workshops cover our established Best Practices, they also give you a taste of what is to come. And let me just say a lot of changes are pouring out of the jar, especially with GATK4.
Let’s get into the logistics of workshops.
Please join the new gatk-workshop group at https://groups.google.com/a/broadinstitute.org/forum/?hl=en#!forum/gatk-workshop to receive emails about upcoming workshops. These emails are different from the group's email updates, so your group membership settings should have the Email delivery preference set to Don’t send email updates. You may also browse the posts in the GATK Blog for mention of our workshop schedule.
We post information and links for upcoming workshops on our forum. Look for the announcement box at the top of the GATK Forum homepage. Depending on the hosting institution, a workshop may be open to non-affiliates, and may or may not charge a fee to offset hosting costs.
You may have noticed we’ve been talking about this new thing called WDL--the Workflow Definition Language. We've published a tutorial using WDL to run some GATK tasks, as well as a pipeline implementation of the Best Practices for germline short variant discovery written in WDL. These fully-baked WDL scripts assume you already know what to do with them, but you may be wondering where to start. Whether you need a few pointers to get you started, or you’re completely new to this, we’ve got you covered. (And if you’re just looking for how to run pre-written WDLs, head on over to the executions section. You can still learn a lot from reading the rest of this article too though!)
WDL is designed to be easy to use--"human readable and writable" is our promise. You should think of building a pipeline with WDL like building with legos. The final product (like that full pipeline script I linked before) can look quite complex, but it is a simple matter of going step by step with your WDL building blocks.
I would recommend that you get started by reading our user guide. If you read through it and click to the next article at the bottom of each page, the user guide will introduce you to all the pieces you can use in your lego-pipeline--from what pieces you'll need all the way through how to test & run your pipeline once you've finished it.
Once you've got a handle on what WDL can do, head over to the tutorials section. In these sequential tutorials, I walk you through how to use those building blocks to implement a small part of the GATK pipeline. Each tutorial builds on the previous one to help you learn to use WDL in new ways without repeating all of your earlier work.
You've read the user guide and you've run through the tutorials; you now have all you need to get started writing your very own WDLs. If you get stuck on something, you can always see how we do things in these real WDL scripts. If you have a more specific question, don't hesitate to post it on our WDL forum. Happy building!
Here's the scoop. We've been working with Intel engineers for some time now, and we've all been enjoying it so much, we decided to commit to the relationship big time.
As announced in this Broad press release, we are taking our collaboration with Intel to the next level. Specifically, we have joined forces to create the "Intel-Broad Center for Genomic Data Engineering", with an initial five-year mission to build out life sciences tools and infrastructure, and boldly grow the genomics community's ability to collaborate across diverse datasets and analysis platforms in ways that no one has done before.
Ahem. In practice this is going to enable us to bring you some key improvements on three fronts: hardware recommendations, genomics software tools, and cross-infrastructure collaboration.
These are the materials that were presented at the November 2015 GATK workshop at the Broad Institute in Cambridge, MA.
|Material|Format and location|
|---|---|
|Slide decks presented on Day 1|Google Drive Folder|
|Workshop handout document (agenda and resources)|PDF on Google Drive|
|Variant Discovery Tutorial (Day 2 AM)|PDF on Google Drive|
|Variant Filtering Tutorial (Day 2 PM)|PDF on Google Drive|
|Tutorial data bundle (Day 2 PM)|ZIP on Google Drive|
The weather in Vancouver is awful right now, and that's probably a good thing -- it should keep the outdoorsy types like myself from succumbing to the natural beauty of British Columbia and skipping out on any of the great science lined up for us this week. And rumor has it the wifi is pretty decent!
I sure hope it is, because this afternoon in the GATK workshop we're going to be running some live demos of how to run GATK analyses on the Cloud. We have screencap videos as backup in case technology abandons us, but it's just not the same to play a recording... (for one thing, the recording is probably more reliable than my brain, but shush).
We'll also have a hands-on tutorial on somatic exome CNV analysis with GATK4, and the overall workshop will be peppered with live polls, in an effort to make the experience as interactive and engaging as possible. This is something the ASHG workshop organizers have been pushing for over the past few meetings, and rightly so.
It's a tall order with a crowd of 225 registered users (we get a ballroom!) but we've got a solid 90 minutes lined up to talk about all brand new GATK content. This is going to be fun!
Tomorrow, a bunch of us are packing our bags and heading to Vancouver for the American Society of Human Genetics' Annual Meeting.
We have a busy week ahead of us, between the GA4GH Plenary Meeting, the various workshops that are organized around the ASHG meeting, and the meeting itself, which draws thousands of researchers from across the globe. Our Broad Genomics team this year is going to be pretty active in a variety of events, which you can find detailed here on the website of the Broad Genomics Services.
Soo Hee and I from our little support team will be rather busy as well. We're finalizing preparations for the workshop we're teaching on Tuesday, which will focus on what's hot in GATK4. As a reminder, GATK4 is currently still in "alpha preview" phase, but we expect it to move to beta status over the course of the next quarter, and I personally hold high hopes of releasing it as the officially supported version in early 2017!
In any case, we have some cool live demos and a full CNV pipeline hands-on tutorial to show off at the workshop to a maxed-out audience of 225 people (no pressure...). Speaking of which, the materials for the workshop are now available for download over here. The bundle file contains both a special GATK4 jar and a test dataset. If you'll be joining us in the workshop, please make sure you have downloaded the bundle BEFORE the workshop, as it is large (~400 MB) and you can't count on the conference center wifi to be good enough to download onsite.
If you're coming to ASHG but are not coming to the workshop (did you wait too long to register? ;) ), you can still come chat with us at the Broad Genomics booth in the exhibition hall. I'll post a detailed schedule of when we'll be hanging out there -- there are some sessions I don't want to miss, but I have yet to compile the final list -- and you can for sure find me at the Meet the Expert event that will take place at the booth. I'll be the so-called expert in the Thursday, October 20th 10:00am - 11:00am slot. You can also follow @gatk_dev on Twitter for the latest schedule developments and/or social event opportunities.
And if you're not coming to Vancouver, either because you blame Canada or you study a different organism and you don't see what all the fuss is about these humans we keep going on about -- well, we'll still see you on the forum, and you can always invite us to teach a workshop at your local institution. We've had a really fantastic series this year and are now taking invitations for 2017. More on that later!