Latest posts

Here are some rules of thumb for posting questions:

  1. Post a new question instead of continuing an ongoing discussion thread. The exception is when your post relates directly to the discussion thread, i.e. it comments on the original post or answers a question asked in the thread. To refer to a particular thread, you can include its URL.

  2. Post the question once. This is the case even if you post to the wrong subforum. We can easily move your post to the appropriate one.

  3. Questions should relate to running a GATK tool, Picard tool, GATK Best Practice Workflow, WDL script, Cromwell or FireCloud. All other questions, e.g. those about non-GATK tools, should go to the Biostars or SeqAnswers forums.

Next, I point out specific guidelines for GATK questions, give a formatting tip and explain the motivation behind this note using pie.


These are the materials that we are presenting at the February 2017 GATK workshop in Leuven, Belgium.

Materials | Link

DAY 1: GATK Best Practices talks
Slide decks presented on Day 1 | Google Drive Folder

DAY 2: Germline variant discovery
Variant Discovery Tutorial Worksheet (Day 2 AM) | PDF on Google Drive
Variant Filtering Tutorial Worksheet (Day 2 PM) | PDF on Google Drive
Germline Data Bundle (Day 2) | ZIP on Google Drive

DAY 3 AM: Somatic mutation discovery
MuTect2 Tutorial Worksheet (Day 3 AM) | PDF on Google Drive
MuTect2 Data Bundle (Day 2) | TAR.GZ on Google Drive
CNV Tutorial Worksheet (Day 3 AM) | PDF on Google Drive
CNV Data Bundle (Day 3 AM) | ZIP on GATK website

DAY 3 PM: Pipelining with WDL
WDL Pipelining Tutorial Worksheet (Day 3 PM) | PDF on Google Drive
WDL Pipelining Data Bundle (Day 2) | ZIP on Google Drive

I want to give mad props to my team. Every day they handle new questions about obscure error messages or unusual experimental designs, on top of their ongoing efforts to develop new documentation materials and keep up with the latest innovations that the dev team is working on. All of this in service of a community with a very wide range of use cases -- and levels of comfort with the biology and/or computational aspects of genomics. It's a challenging job, especially when the tools and technology keep evolving under your feet, and the science itself refuses to stop mutating. Poetic, I suppose, if inconvenient.

And that's why we are a team with a wide range of strengths, backgrounds and interests as you can see in the 2017 team roster below.

So, please say hello in the comments if you appreciate this team's efforts, or really make our day and send us a postcard! One online commenter and one postcard sender will each be chosen at random* and receive a GATK-themed prize**.

Our mailing address is:

GATK Outreach, c/o G. Van der Auwera
Broad Institute, Room 415M-7100-B
415 Main Street,
Cambridge MA 02142

* We reserve the right to implement random selection at our discretion. I expect it will be a low-tech implementation involving printing out bits of paper, folding them and having a random grad student in the cafeteria pick one out of a hat.

** Nature of prizes also to be determined as I need to find out how much of my outreach budget I can use for this. I can pretty much guarantee that the prizes will have no real monetary value, but will come with serious bragging rights.


A while back, I posted this article about work done by the Intel Bio Team to benchmark the speed and resource utilization of each step in the per-sample segment of the germline variation pipeline (from BWA to HaplotypeCaller; FASTQ to GVCF). They published their results as a white paper on the Intel Life Sciences website, which has a section dedicated to GATK (which makes us feel all warm and tingly).

Now the Intel team has published an updated version of the white paper here that extends the work, originally done on a WGS trio, to a cohort of 50 exomes and adds the joint analysis segment of the pipeline (GenotypeGVCFs to VQSR; GVCFs to filtered multisample VCF) for both datasets.


Here it is at last… as in, last release for 2016, and possibly the last point release of GATK 3 ever!

Aside from the usual pile of bug fixes, the new features in this version are actually (almost) all features or improvements that were developed for GATK 4. We backported them to the GATK 3 framework to make them widely available sooner rather than later, since we still have some work to do to make GATK 4 complete enough to become the new standard. Of course there are a lot of other new things in the GATK 4 alpha version that we can't backport (especially those related to speed/performance improvements) because they depend on the new framework. But what we could backport, we did.

The hottest change here is a new model for calculating the QUAL score, but be aware it's there on an opt-in basis, not enabled by default. This also comes with a lower default value for the -stand_call_conf threshold, and deprecation of the confusing and ultimately rather pointless threshold -stand_emit_conf. We're also introducing some logic for prioritizing alleles to improve performance in messy regions. And we've got some improvements for MuTect2, although that tool does remain in beta status for now.
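To make the effect of a calling threshold concrete, here is a minimal Python sketch. It is not GATK code: the `Candidate` record, the `apply_call_threshold` function, the "meets or exceeds" comparison and the threshold values are all illustrative assumptions, chosen only to show how lowering a confidence threshold lets more candidate sites through.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """Hypothetical candidate variant site; not a GATK data structure."""
    chrom: str
    pos: int
    qual: float  # Phred-scaled confidence that the site is variant


def apply_call_threshold(candidates, stand_call_conf):
    """Keep candidates whose QUAL meets or exceeds the threshold (assumed rule)."""
    return [c for c in candidates if c.qual >= stand_call_conf]


candidates = [
    Candidate("chr1", 1000, 8.5),
    Candidate("chr1", 2000, 15.0),
    Candidate("chr1", 3000, 42.0),
]

# Illustrative thresholds only; see the release notes for the actual defaults.
strict = apply_call_threshold(candidates, 30.0)
lenient = apply_call_threshold(candidates, 10.0)
print(len(strict), len(lenient))
```

With these made-up values, the stricter threshold keeps one site and the more lenient one keeps two.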

As usual, see the release notes for a full list of changes, and read on below for details on what we think you'll care most about.


GATK 3.7 was released on December 12, 2016. Itemized changes are listed below. For more details, see the user-friendly version highlights.


Find out and learn some practical steps to cloud debugging.

Specifically, I tested the alpha release of the Google Genomics Pipelines API, which is run from the command line. Down the road, we will post similar write-ups for the UI-driven systems FireCloud and Workbench. In this particular challenge, my aim is to first genotype a trio and then a cohort of 17 whole-genome BAMs that are available in the cloud. I need the resulting VCF callsets within a week.


GATK workshops bring you the latest in our methods development. The materials we prepare for workshops often serve as a base for our documentation on new or improved tools and workflows. So not only do GATK workshops cover our established Best Practices, they also give you a taste of what is to come. And let me just say a lot of changes are pouring out of the jar, especially with GATK4.

Let’s get into the logistics of workshops.

Interested in attending a GATK workshop?

Please join the new gatk-workshop group at!forum/gatk-workshop to receive emails about upcoming workshops. These emails are different from the group's email updates, so group membership settings should be as shown below with the group Email delivery preference set to Don’t send email updates. You may also browse the posts in the GATK Blog for mention of our workshop schedule.

We post information and links for upcoming workshops on our forum. Look for the announcement box at the top of the GATK Forum homepage. Depending on the hosting institution, a workshop may be open to non-affiliates, and may or may not charge a fee to offset hosting costs.


You may have noticed we’ve been talking about this new thing called WDL--the Workflow Definition Language. We've published a tutorial using WDL to run some GATK tasks, as well as a pipeline implementation of the Best Practices for germline short variant discovery written in WDL. These fully-baked WDL scripts assume you already know what to do with them, but you may be wondering where to start. Whether you need a few pointers to get you started, or you’re completely new to this, we’ve got you covered. (And if you’re just looking for how to run pre-written WDLs, head on over to the executions section. You can still learn a lot from reading the rest of this article too though!)

WDL is designed to be easy to use--"human readable and writable" is our promise. You should think of building a pipeline with WDL like building with legos. The final product (like that full pipeline script I linked before) can look quite complex, but it is a simple matter of going step by step with your WDL building blocks.
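The building-block idea can be sketched outside of WDL too. The Python snippet below is only an analogy, not WDL syntax and not how Cromwell executes anything: `run_workflow`, `align` and `call_variants` are hypothetical names, and the "tools" simply rename a file to stand in for real work. The point is that each block takes the previous block's output as its input.

```python
def run_workflow(steps, initial_input):
    """Run each named step in order, feeding each output to the next step."""
    data = initial_input
    history = []
    for name, step in steps:
        data = step(data)
        history.append((name, data))
    return data, history


# Hypothetical stand-ins for real tools: each one only renames its input file.
def align(fastq):
    return fastq.replace(".fastq", ".bam")


def call_variants(bam):
    return bam.replace(".bam", ".g.vcf")


steps = [("align", align), ("call_variants", call_variants)]
final_output, history = run_workflow(steps, "sample.fastq")
print(final_output)
```

Adding another block is just a matter of appending one more entry to `steps`, which is the same step-by-step spirit described above.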

I would recommend that you get started by reading our user guide. As you read through and click to the next article at the bottom of each page, the user guide will introduce you to all the pieces you can use in your lego-pipeline--from what pieces you'll need all the way through to how to test & run your pipeline once you've finished it.

Once you've got a handle on what WDL can do, head over to the tutorials section. In these sequential tutorials, I walk you through how to use those building blocks to implement a small part of the GATK pipeline. Each tutorial builds on the previous one to help you learn to use WDL in new ways without repeating all of your earlier work.

You've read the user guide and you've run through the tutorials; you now have all you need to get started writing your very own WDLs. If you get stuck on something, you can always see how we do things in these real WDL scripts. If you have a more specific question, don't hesitate to post it on our WDL forum. Happy building!


Here's the scoop. We've been working with Intel engineers for some time now, and we've all been enjoying it so much, we decided to commit to the relationship big time.

As announced in this Broad press release, we are taking our collaboration with Intel to the next level. Specifically, we have joined forces to create the "Intel-Broad Center for Genomic Data Engineering", with an initial five-year mission to build out life sciences tools and infrastructure, and boldly grow the genomics community's ability to collaborate across diverse datasets and analysis platforms in ways that no one has done before.

Ahem. In practice this is going to enable us to bring you some key improvements on three fronts: hardware recommendations, genomics software tools, and cross-infrastructure collaboration.

