Post a new question instead of continuing an ongoing discussion thread. The exception is when your question relates directly to that thread, for example when it comments on the original post or answers a question asked in the thread. To refer to a particular thread, include its URL.
Post the question once, even if you posted it to the wrong subforum. We can easily move your post to the appropriate one.
Next, I point out specific guidelines for GATK questions, give a formatting tip and explain the motivation behind this note using pie.
These are the materials that we are presenting at the February 2017 GATK workshop in Leuven, Belgium.
|DAY 1: GATK Best Practices talks|
|Slide decks presented on Day 1|Google Drive Folder|
|DAY 2: Germline variant discovery|
|Variant Discovery Tutorial Worksheet (Day 2 AM)|PDF on Google Drive|
|Variant Filtering Tutorial Worksheet (Day 2 PM)|PDF on Google Drive|
|Germline Data Bundle (Day 2)|ZIP on Google Drive|
|DAY 3 AM: Somatic mutation discovery|
|MuTect2 Tutorial Worksheet (Day 3 AM)|PDF on Google Drive|
|MuTect2 Data Bundle (Day 2)|TAR.GZ on Google Drive|
|CNV Tutorial Worksheet (Day 3 AM)|PDF on Google Drive|
|CNV Data Bundle (Day 3 AM)|ZIP on GATK website|
|DAY 3 PM: Pipelining with WDL|
|WDL Pipelining Tutorial Worksheet (Day 3 PM)|PDF on Google Drive|
|WDL Pipelining Data Bundle (Day 2)|ZIP on Google Drive|
I want to give mad props to my team. Every day they handle new questions about obscure error messages or unusual experimental designs, on top of their ongoing efforts to develop new documentation materials and keep up with the latest innovations that the dev team is working on. All of this in service of a community with a very wide range of use cases -- and levels of comfort with the biology and/or computational aspects of genomics. It's a challenging job, especially when the tools and technology keep evolving under your feet, and the science itself refuses to stop mutating. Poetic, I suppose, if inconvenient.
And that's why we are a team with a wide range of strengths, backgrounds and interests, as you can see in the 2017 team roster below.
Our mailing address is:
GATK Outreach, c/o G. Van der Auwera
Broad Institute, Room 415M-7100-B
415 Main Street,
Cambridge MA 02142
* We reserve the right to implement random selection at our discretion. I expect it will be a low-tech implementation involving printing out bits of paper, folding them and having a random grad student in the cafeteria pick one out of a hat.
** Nature of prizes also to be determined as I need to find out how much of my outreach budget I can use for this. I can pretty much guarantee that the prizes will have no real monetary value, but will come with serious bragging rights.
A while back, I posted this article about work done by the Intel Bio Team to benchmark the speed and resource utilization of each step in the per-sample segment of the germline variation pipeline (from BWA to HaplotypeCaller; FASTQ to GVCF). They published their results as a white paper on the Intel Life Sciences website, which has a section dedicated to GATK (which makes us feel all warm and tingly).
Now the Intel team has published an updated version of the white paper here that extends the work, originally done on a WGS trio, to a cohort of 50 exomes and adds the joint analysis segment of the pipeline (GenotypeGVCFs to VQSR; GVCFs to filtered multisample VCF) for both datasets.
Here it is at last… as in, last release for 2016, and possibly the last point release of GATK 3 ever!
Aside from the usual pile of bug fixes, the new features in this version are actually (almost) all features or improvements that were developed for GATK 4. We backported them to the GATK 3 framework to make them widely available sooner rather than later, since we still have some work to do to make GATK 4 complete enough to become the new standard. Of course there are a lot of other new things in the GATK 4 alpha version that we can't backport (especially those related to speed/performance improvements) because they depend on the new framework. But what we could backport, we did.
The hottest change here is a new model for calculating the QUAL score, but be aware it's there on an opt-in basis, not enabled by default. This also comes with a lower default value for the -stand_call_conf threshold, and deprecation of the confusing and ultimately rather pointless threshold -stand_emit_conf. We're also introducing some logic for prioritizing alleles to improve performance in messy regions. And we've got some improvements for MuTect2, although that tool does remain in beta status for now.
As usual, see the release notes for a full list of changes, and read on below for details on what we think you'll care most about.
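As a rough illustration of the opt-in behavior described above, here is a minimal Python sketch that assembles a GATK 3.7 GenotypeGVCFs command line. It assumes the new QUAL model is switched on with the -newQual flag. The jar location and file names are hypothetical placeholders, and the helper function is ours for illustration, not part of GATK. Check the release notes for the exact arguments before relying on it.

```python
import shlex


def genotype_gvcfs_command(reference, gvcfs, output,
                           use_new_qual=False, stand_call_conf=None):
    """Build a GATK 3 GenotypeGVCFs command as a list of arguments.

    Hypothetical helper for illustration only; it is not part of GATK.
    """
    # Assumption: the GATK 3 jar sits in the working directory under this name.
    cmd = ["java", "-jar", "GenomeAnalysisTK.jar",
           "-T", "GenotypeGVCFs", "-R", reference]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]
    cmd += ["-o", output]
    if use_new_qual:
        # Assumption: -newQual is the opt-in switch for the new QUAL model.
        cmd.append("-newQual")
    if stand_call_conf is not None:
        # Only override the calling threshold when explicitly asked to;
        # otherwise the tool's own (now lower) default applies.
        cmd += ["-stand_call_conf", str(stand_call_conf)]
    # The deprecated -stand_emit_conf argument is deliberately never added.
    return cmd


# Hypothetical input and output file names.
cmd = genotype_gvcfs_command(
    "reference.fasta",
    ["sample1.g.vcf", "sample2.g.vcf"],
    "cohort.vcf",
    use_new_qual=True,
)
print(" ".join(shlex.quote(part) for part in cmd))
```

Leaving use_new_qual at its default of False mirrors the release behavior: the new model is only used when you ask for it.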
GATK 3.7 was released on December 12, 2016. Itemized changes are listed below. For more details, see the user-friendly version highlights.
Specifically, I tested the alpha release of the Google Genomics Pipelines API, which is driven from the command line. Down the road, we will post similar write-ups for the UI-driven systems FireCloud and Workbench. In this particular challenge, my aim is to genotype first a trio and then a cohort of 17 whole-genome BAMs that are available in the cloud. I need the resulting VCF callsets within a week.
GATK workshops bring you the latest in our methods development. The materials we prepare for workshops often serve as a base for our documentation on new or improved tools and workflows. So not only do GATK workshops cover our established Best Practices, they also give you a taste of what is to come. And let me just say a lot of changes are pouring out of the jar, especially with GATK4.
Let’s get into the logistics of workshops.
Please join the new gatk-workshop group at https://groups.google.com/a/broadinstitute.org/forum/?hl=en#!forum/gatk-workshop to receive emails about upcoming workshops. These emails are different from the group's email updates, so group membership settings should be as shown below with the group Email delivery preference set to Don’t send email updates. You may also browse the posts in the GATK Blog for mention of our workshop schedule.
We post information and links for upcoming workshops on our forum. Look for the announcement box at the top of the GATK Forum homepage. Depending on the hosting institution, a workshop may be open to non-affiliates, and may or may not charge a fee to offset hosting costs.
You may have noticed we’ve been talking about this new thing called WDL--the Workflow Definition Language. We've published a tutorial using WDL to run some GATK tasks, as well as a pipeline implementation of the Best Practices for germline short variant discovery written in WDL. These fully-baked WDL scripts assume you already know what to do with them, but you may be wondering where to start. Whether you need a few pointers to get you started, or you’re completely new to this, we’ve got you covered. (And if you’re just looking for how to run pre-written WDLs, head on over to the executions section. You can still learn a lot from reading the rest of this article too though!)
WDL is designed to be easy to use--"human readable and writable" is our promise. You should think of building a pipeline with WDL like building with legos. The final product (like that full pipeline script I linked before) can look quite complex, but it is a simple matter of going step by step with your WDL building blocks.
I recommend that you get started by reading our user guide. Read through it in order, clicking through to the next article at the bottom of each page, and it will introduce you to all the pieces you can use in your lego-pipeline--from what pieces you'll need all the way through how to test & run your pipeline once you've finished it.
Once you've got a handle on what WDL can do, head over to the tutorials section. In these sequential tutorials, I walk you through how to use those building blocks to implement a small part of the GATK pipeline. Each tutorial builds on the previous one to help you learn to use WDL in new ways without repeating all of your earlier work.
You've read the user guide and you've run through the tutorials; you now have all you need to get started writing your very own WDLs. If you get stuck on something, you can always see how we do things in these real WDL scripts. If you have a more specific question, don't hesitate to post it on our WDL forum. Happy building!
Here's the scoop. We've been working with Intel engineers for some time now, and we've all been enjoying it so much, we decided to commit to the relationship big time.
As announced in this Broad press release, we are taking our collaboration with Intel to the next level. Specifically, we have joined forces to create the "Intel-Broad Center for Genomic Data Engineering", with an initial five-year mission to build out life sciences tools and infrastructure, and boldly grow the genomics community's ability to collaborate across diverse datasets and analysis platforms in ways that no one has done before.
Ahem. In practice this is going to enable us to bring you some key improvements on three fronts: hardware recommendations, genomics software tools, and cross-infrastructure collaboration.