As part of our job providing support to the GATK user community, our team takes turns traveling to conferences, both to learn what's going on in the field at large and to advertise the latest features of the GATK. I recently attended the Advances in Genome Biology and Technology (AGBT) general meeting in Hollywood, Florida in February. Nice time of year to go there!

When we go to conferences we often do workshops or present posters, but this time was a first: I was there to do a software demo. Well, in fact I had two demos prepared: one about using GATK4 to run commands directly on a Spark cluster, and the other about running GATK workflows on the Cloud using Google's Pipelines API.

What is Beagle?

Beagle is a type of dog known for its even temper and intelligence. It is also the name given to the ship Darwin sailed to the Galapagos (the H.M.S. Beagle), where he developed his theory of natural selection from observing finches. It is also the name of a genomics software package known for phasing and imputing genotypes. Beagle also calls genotypes and detects identity-by-descent (IBD), i.e. it can find segments of identical DNA that indicate two individuals are related.

I will be writing a series of posts where I share with you how I take 23andMe raw data to locate IBD segments using Beagle v4.1 (website; doi:10.1534/genetics.113.150029). For a review of the statistical methods and other theory underlying IBD, see doi:10.1534/genetics.112.148825. To see a skipper’s dog on a ship at sea, watch Irving Johnson’s footage of the Peking barque.

A few of us GATKers (among a flood of other Broadies) traveled to Washington, DC this week for the General Meeting of the American Association for Cancer Research (AACR). Here are PDF copies of the posters we presented on Tuesday morning.

Abbreviated title | Presenter | Link
Somatic mutation discovery with GATK4 | Geraldine Van der Auwera | PDF
Allelic Copy Number Variation Discovery | Aaron Chevalier | PDF
Copy Number Variation Discovery in WGS and Exomes | Mehrtash Babadi | PDF

Incidentally, it's the end of the conference so now 10,000 people are trying to get home, and apparently half of them are going to Boston. I was hoping to catch an earlier flight on standby; the gate attendant laughed so hard. Most of the flights are overbooked to start with. So I have some time to kill until 9 PM. Well, I guess there's plenty of documentation in need of writing!

You may have heard that we've been working on a major new release of GATK that we call GATK4. As we are getting closer to the scheduled transition of GATK4 into beta status (from its current lowly alpha state), we are putting a lot of work into fine-tuning the user-facing aspects of the program. We realize that many of our users struggle to make sense of the variety of tools and their numerous options and parameters, and that when something goes wrong, the error messages can seem cryptic and/or overwhelming.

So one of the things we're experimenting with is an interactive support feature that you can invoke directly from the command line, and that should help you figure out solutions to most problems that you might encounter while using GATK. It's not quite fully-featured yet but we'd like to get some feedback to evaluate whether it is helpful to real users, and determine how we can further improve it.

You can download a precompiled jar (fully open source under a BSD license) in which this feature is enabled by default from this page. The command syntax is essentially the same as for the current version of GATK, except that you no longer provide -T to specify the tool, and -o is all grown up and is now -O. You can get usage information for any tool by running, e.g., java -jar GenomeAnalysisTk-4_1.jar PrintReads -h, the same way as you would with the current GATK.
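To make the two syntax differences concrete, here is a minimal Python sketch that rewrites a current-style argument list into the new style. It is an illustration only, not part of GATK: the helper name to_gatk4_args is hypothetical, the file names are placeholders, and the sketch assumes that every argument other than -T and -o (such as -R and -I here) carries over unchanged.

```python
from typing import List


def to_gatk4_args(gatk3_args: List[str]) -> List[str]:
    """Rewrite a current-style GATK argument list in the new style.

    Hypothetical helper for illustration; it applies only the two changes
    named in the post: the tool name moves out of -T to the front, and
    -o becomes -O. All other arguments are assumed to carry over as-is.
    """
    tool = None
    rest = []
    i = 0
    while i < len(gatk3_args):
        arg = gatk3_args[i]
        if arg == "-T":
            if i + 1 >= len(gatk3_args):
                raise ValueError("-T must be followed by a tool name")
            tool = gatk3_args[i + 1]
            i += 2  # skip both -T and the tool name
            continue
        rest.append("-O" if arg == "-o" else arg)
        i += 1
    if tool is None:
        raise ValueError("no -T <tool> found in arguments")
    return [tool] + rest


if __name__ == "__main__":
    # Placeholder file names; the jar name is the one quoted in the post.
    old = ["-T", "PrintReads", "-R", "ref.fasta", "-I", "in.bam", "-o", "out.bam"]
    print(" ".join(["java", "-jar", "GenomeAnalysisTk-4_1.jar"] + to_gatk4_args(old)))
```

The tool name ends up as the first argument after the jar, matching the PrintReads -h usage example in the post.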

Please try it out and let us know what you think!

Here are some rules of thumb for posting questions:

  1. Post a new question instead of continuing an ongoing discussion thread. The exception to this is if your question relates directly to the discussion thread, i.e. comments on the original post or answers a question asked in the thread. To refer to a particular thread, you can include its URL.

  2. Post the question once. This is the case even if you post to the wrong subforum. We can easily move your post to the appropriate one.

  3. Questions should relate to running a GATK tool, Picard tool, GATK Best Practice Workflow, WDL script, Cromwell or FireCloud. All other questions, e.g. those about non-GATK tools, should go to the Biostars or SeqAnswers forums.

Next, I point out specific guidelines for GATK questions, give a formatting tip and explain the motivation behind this note using pie.

These are the materials that we are presenting at the February 2017 GATK workshop in Leuven, Belgium.

Materials | Link
DAY 1: GATK Best Practices talks
Slide decks presented on Day 1 | Google Drive Folder
DAY 2: Germline variant discovery
Variant Discovery Tutorial Worksheet (Day 2 AM) | PDF on Google Drive
Variant Filtering Tutorial Worksheet (Day 2 PM) | PDF on Google Drive
Germline Data Bundle (Day 2) | ZIP on Google Drive
DAY 3 AM: Somatic mutation discovery
MuTect2 Tutorial Worksheet (Day 3 AM) | PDF on Google Drive
MuTect2 Data Bundle (Day 2) | TAR.GZ on Google Drive
CNV Tutorial Worksheet (Day 3 AM) | PDF on Google Drive
CNV Data Bundle (Day 3 AM) | ZIP on GATK website
DAY 3 PM: Pipelining with WDL
WDL Pipelining Tutorial Worksheet (Day 3 PM) | PDF on Google Drive
WDL Pipelining Data Bundle (Day 2) | ZIP on Google Drive

I want to give mad props to my team. Every day they handle new questions about obscure error messages or unusual experimental designs, on top of their ongoing efforts to develop new documentation materials and keep up with the latest innovations that the dev team is working on. All of this in service of a community with a very wide range of use cases -- and levels of comfort with the biology and/or computational aspects of genomics. It's a challenging job, especially when the tools and technology keep evolving under your feet, and the science itself refuses to stop mutating. Poetic, I suppose, if inconvenient.

And that's why we are a team with a wide range of strengths, backgrounds and interests as you can see in the 2017 team roster below.

So, please say hello in the comments if you appreciate this team's efforts, or really make our day and send us a postcard! One online commenter and one postcard sender will each be chosen at random* and receive a GATK-themed prize**.

Our mailing address is:

GATK Outreach, c/o G. Van der Auwera
Broad Institute, Room 415M-7100-B
415 Main Street,
Cambridge MA 02142

* We reserve the right to implement random selection at our discretion. I expect it will be a low-tech implementation involving printing out bits of paper, folding them and having a random grad student in the cafeteria pick one out of a hat.

** Nature of prizes also to be determined as I need to find out how much of my outreach budget I can use for this. I can pretty much guarantee that the prizes will have no real monetary value, but will come with serious bragging rights.

A while back, I posted this article about work done by the Intel Bio Team to benchmark the speed and resource utilization of each step in the per-sample segment of the germline variation pipeline (from BWA to HaplotypeCaller; FASTQ to GVCF). They published their results as a white paper on the Intel Life Sciences website, which has a section dedicated to GATK (which makes us feel all warm and tingly).

Now the Intel team has published an updated version of the white paper here that extends the work, originally done on a WGS trio, to a cohort of 50 exomes and adds the joint analysis segment of the pipeline (GenotypeGVCFs to VQSR; GVCFs to filtered multisample VCF) for both datasets.

Here it is at last… as in, last release for 2016, and possibly the last point release of GATK 3 ever!

Aside from the usual pile of bug fixes, the new features in this version are actually (almost) all features or improvements that were developed for GATK 4. We backported them to the GATK 3 framework to make them widely available sooner rather than later, since we still have some work to do to make GATK 4 complete enough to become the new standard. Of course there are a lot of other new things in the GATK 4 alpha version that we can't backport (especially those related to speed/performance improvements) because they depend on the new framework. But what we could backport, we did.

The hottest change here is a new model for calculating the QUAL score, but be aware it's there on an opt-in basis, not enabled by default. This also comes with a lower default value for the -stand_call_conf threshold, and deprecation of the confusing and ultimately rather pointless threshold -stand_emit_conf. We're also introducing some logic for prioritizing alleles to improve performance in messy regions. And we've got some improvements for MuTect2, although that tool does remain in beta status for now.

As usual, see the release notes for a full list of changes, and read on below for details on what we think you'll care most about.

GATK 3.7 was released on December 12, 2016. Itemized changes are listed below. For more details, see the user-friendly version highlights.

