TL;DR: In a few weeks, we're going to move the website and forum to a new online platform that will scale better with the needs of the research community. The website will still live at https://software.broadinstitute.org/gatk/ but there will be some important changes at the level of the user guide and the support forum in particular. Read on to get the lowdown on where this is coming from, where we're heading and how you can prepare for the upcoming migration.
If it's hot, humid and everyone around you has a name tag, you're probably in Houston, TX for ASHG. I know there's a lot going on and a million different presentations vying for your attention, so I'll cut to the chase: several members of our department (Data Sciences Platform) will be at the Broad Genomics booth #714 in the Exhibition Hall at the following times. Don't miss this opportunity to come chat with us in person and get answers to all your burning questions about the latest exciting developments, whether it's DRAGEN-GATK or Cromwell on Azure that floats your boat, or you just want to learn more about running our fully configured GATK pipelines on Terra. We look forward to seeing you there!
| Day | Time | Team member | Focus area |
| --- | --- | --- | --- |
| Wednesday 16 | 12-2pm | Geraldine Van der Auwera | All |
| Thursday 17 | 10am-12:30pm | Bhanu Gandham | GATK support |
| Thursday 17 | 12:30-1:30pm | Rob Title | Interactive analysis on Terra |
| Thursday 17 | 2:30-4:30pm | Sushma Chaluvadi | Terra support |
| Friday 18 | 11am-12pm | Ruchi Munshi | Cromwell and WDL |
It's a beautiful early autumn day in New England, with small patches of vibrant reds and yellows in the foliage just hinting at the fiery displays to come. Perfect weather for me to de-lurk and bring you some news! (I promise it's not GATK5)
The long and short of it (but mostly the short) is that we've started collaborating with the DRAGEN team at Illumina, led by Rami Mehio, to improve GATK tools and pipelines. There's a press release if you want the official announcement, or you can read on to get the long version from the GATK team's perspective.
Hello! My name is Tiffany; I manage the frontline team that supports GATK and some other software and services developed by the Data Sciences Platform (DSP) at the Broad, like Terra, FireCloud, and WDL. Our team's mission is to provide high-quality support to everyone who uses our tools, with the ultimate goal of enabling all of you to use them effectively. In practice that means we answer your questions on the forum, we develop and maintain many of the resource workspaces in Terra, and we work closely with the DSP User Education team members who write documentation and develop materials for workshops. (We love meeting forum members in person at workshops!)
I wanted to let you know that we're rolling out a new support policy for GATK, and I'd like to explain where this is coming from.
We've been facing a mounting challenge over the past few years: the number of GATK support requests we get is starting to exceed our capacity, mainly because the scope of the GATK and the size of its user community have both expanded substantially over time.
To put things into perspective, the GATK forum was launched in 2012 to a few hundred users.
The first blog post, dated July 2012, was the release notes for GATK 2.0, in which HaplotypeCaller was introduced as a “New tool”. Crazy, right? (Please, nobody mention ReduceReads, or Geraldine will start crying). Since then, GATK has gone from covering a single primary use case (germline short variants, involving about 10 main tools) to an assortment of pipelines that covers all germline and somatic variant classes (including copy number and structural variation) plus several side uses like mitochondrial variants. Meanwhile, we estimate that at least 60,000 people use or have used the GATK, with about 2,000 new people joining the community every quarter.
As a result, we are seeing an average of 72 questions a week, with spikes of even higher numbers following new GATK releases. Here is some data from five weeks this past summer; each bar is a week, with the colors showing the breakdown of questions per day.
During this whole time, we've had just one person dedicated full-time to answering forum questions. It used to be Geraldine, then Sheila, and now it's Bhanu, who has been doing a heroic job since she joined us last year. But today, the volume of incoming requests (and in many cases, the rising level of complexity of the questions) is simply too great for a single person to address, even with occasional reinforcements from others on our team.
We are deeply humbled by the fact that so many people are using GATK, and there is nothing our little frontline team cares about more than providing the best support that we possibly can. So we looked at what we could do to deliver the highest level of quality and service given the constraints we're operating under.
So far, we’ve been focused on addressing all questions as quickly as possible. That might sound great in principle, but the downside is that we're spending time on questions that are either already addressed in the documentation (though the answers are not always easy to find) or pertain to unsupported use cases that take a lot of digging and may not even be answerable without some analysis work. Meanwhile, we're not prioritizing questions (or bug reports) that have a disproportionately high impact (either in terms of severity, urgency or number of people affected). Yet we feel that the research community as a whole would be better served if we were to prioritize those problems. That's why we decided to develop a new support policy that sets some clear priorities and spells out more explicitly what we are able to support versus what we will leave to the community to discuss and hopefully resolve.
In my next blog post, I'll explain a bit more how this will work in practice, to set clear expectations so no one feels left in the lurch. In the meantime, let me know if you have any concerns; I'm open to tweaking the policy as needed so I'd love to get your feedback as we roll this out.
Thanks for reading and being a part of this community!
The videos from the GATK workshop we ran at the Broad Institute in March are now available on YouTube! This complements the slide decks and tutorials, which were already online.
This was an updated version of our classic GATK workshop, a four-day, bootcamp-style course that empowers participants to understand and run the full range of GATK Best Practices workflows, covering the underlying genomics concepts, data processing, and variant calling for nearly all major classes of variants (structural variants being the exception -- that work is still in progress). The workshop involves a balanced mix of talks and hands-on tutorials to communicate both the theoretical knowledge and the practical skills necessary to use GATK tools and pipelines.
In this updated form, the workshop materials stand largely on their own, with tutorials that can be run on our cloud platform, Terra. So anyone should be able to use them from the comfort of their own home (or lab) as I've written about previously here (overall concept) and in some more practical detail here (tutorial notebook highlights). I encourage you to check these out! We've been getting some good feedback about how useful the tutorials are, whether you're just getting started or have been using GATK for some time but want to get more deeply familiar with how the tools work.
The videos were the missing piece to fully communicate the theory of what's going on under the hood, since it had been a long time since we'd had the opportunity to record workshop talks. So I'm really thrilled that these are finally online. You can jump directly to the playlist on YouTube or you can check out the more complete workshop information page, which also collates a bunch of other useful resources including slide decks and tutorials.
Reap the benefits of a GATK workshop without the time and expense of an international trip
For many in our community, July heralds the much welcome summer break - whether that means vacationing, or finally getting some work done without the constant disruptions of classes and committees. But for the GATK support team, it's the height of our workshop season! Next week, we'll be in Cambridge, UK; and after that the World Tour heads South to Brazil, Costa Rica and Spain. So even as we pack up our bags, this feels like a good time to highlight how any one of you can take advantage of the workshop materials without leaving your desk (or couch, or porch, or wherever you find yourself this season).
Our standard workshop is a four-day intensive course on, well, everything you need to know to do variant discovery with GATK. It covers the essentials of working with high-throughput sequencing data, the tools and technologies you’ll use and how to use them, and the algorithms and methods at the heart of the GATK Best Practices. The main goal is to equip participants with actionable skills and know-how, so we alternate between lectures on the theory and hands-on exercises to keep people engaged and focused on making those connections.
Starting this year, we’re basing all the hands-on sections on interactive Jupyter notebooks in our cloud platform, Terra. This liberates us from the technical friction we used to face when we had people run on their own laptops, and allows us to really focus on the data analysis instead of spending time installing apps or troubleshooting various operating systems. It also means that it's now trivially easy for us to share materials with everyone beyond the workshop cohort in a way that works right out of the box. I wrote a bit about this last week when I announced that all our new tutorials will be Jupyter notebooks.
We realize that the overwhelming majority of you can't make it to a workshop in person, and that's why I find it especially exciting that we can now provide fully-loaded workspaces on Terra. With those, you can work through all the GATK workshop tutorials (see full list of links at the bottom of this post) at home or at work -- or on the beach, if you're the type who can't let go (no judgment, I'm that type). Check out my previous blog post for a walkthrough of an abridged version of a workshop tutorial so you can decide if it's the sort of thing you'd like to try out. We cover a lot of ground in the workshop tutorials, and this is a really great resource to try at your own pace -- whether you're just getting started or you're well-versed in the classic tools and looking for a primer on the newer material. You can ask us questions on the forum if you get stuck at any point. It may not be the full workshop experience, but I'm confident it will help you move forward in your work.
We share the slide decks from all the lectures, and for some workshops we have recordings posted on YouTube. The latest batch is slated for publication really soon, so watch the blog for an announcement -- and remember you can subscribe to email notifications for both the blog and the forum (e.g. to get an email when we answer your questions).
So here's a list of all the resources we provide for this. Keep in mind this is a snapshot in time, as we make updates and improvements for every new workshop, so if you read this later than say, mid-August, make sure to check that you're looking at the most recent versions.
Includes slides and workshop tutorial bundles, GATK-related posters, and links to YouTube videos
And as always, the GATK forum!
Earlier this week, I made a big deal about how we plan to develop all of our GATK tutorials as Jupyter Notebooks in Terra going forward. Today I'd like to offer you a concrete look at what we like about using notebooks for GATK tutorials.
I was planning to just walk you through a couple of notebooks in one of our workshop workspaces, but then decided to make a custom workspace and notebook to show you what I mean without the complexity of the full-length tutorials. It's part highlights, featuring a couple of my favorite tutorial scenarios from the workshops that are fairly simple yet quite effective, and part sneak preview of the newest version of the tutorials, which boast cool new features and will be unveiled at the next workshop (Cambridge in July). Oh, and part explainer on what exactly are Jupyter Notebooks anyway?
Overall you can consider this mini-tutorial a stepping stone to being able to use the workshop tutorial workspaces without needing to actually attend a workshop. The workspace docs and the notebook itself both have a lot of explanations about how things work and how to use them in your pursuit of deeper understanding of GATK. So I don't feel the need to go on and on about it here (for once). But I will mention, in case you're on the fence about whether to spend 5 whole minutes checking out the workspace (add 15 to 20 minutes to actually work through the full notebook), it involves running GATK commands, streaming files, and viewing data in IGV -- all without ever leaving the warm embrace of the notebook.
Last week I wrote about how we're using a cloud platform called Terra to make it easier to get started with GATK; and specifically I highlighted the fully loaded workspaces that showcase our Best Practices pipelines, which we think will make it a lot easier to test drive our pipelines end-to-end. This week I want to talk about a complementary approach we're taking, using Jupyter Notebooks on Terra to teach the step-by-step details of what happens inside the pipelines. Though before we get into the nitty gritty of how it works, I'd like to take some time to walk you through why we're taking this particular approach.
Writing a good tutorial is not that hard, in theory. You state the problem, provide a command line, then give a few instructions for poking at the outputs and you discuss what happened. The hardest part should be choosing what details and parameters to explain vs. what to leave alone to avoid confusing newcomers. Right? Well… In practice, the hardest part is often providing the inputs and instructions in such a way that most people will be able to run it in their own, unique and precious computing environment without some amount of head scratching and at least three pages of alternative instructions for this system or that system. Ugh.
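To make that anatomy concrete, here is a minimal sketch of a tutorial step in that spirit: state the problem (call variants on one sample), provide the command line, then poke at the output. The HaplotypeCaller invocation follows standard GATK4 syntax, but the file names are placeholders I've made up for illustration, and the snippet only actually runs the tool if GATK happens to be installed -- which is exactly the kind of environment friction this paragraph is lamenting.

```python
import shutil
import subprocess

# The command line a tutorial step might present.
# File names are placeholders, not files we ship.
cmd = [
    "gatk", "HaplotypeCaller",
    "-R", "reference.fasta",   # reference genome
    "-I", "sample.bam",        # analysis-ready reads
    "-O", "sample.vcf.gz",     # output variant calls
]

if shutil.which("gatk"):
    # GATK is installed in this environment: run the step for real.
    subprocess.run(cmd, check=True)
else:
    # Otherwise just show what would run.
    print("Would run:", " ".join(cmd))
```

In a notebook, the "poking at the outputs" part follows in the next cell, with everyone working against the same pre-installed environment instead of their own laptop's.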
We've run dozens of workshops where the setup is that we provide a PDF of instructions and a data bundle, and participants run commands in the terminal on their laptops. Inevitably some non-trivial amount of time ends up being spent debugging environment settings, typos and character encodings. That's just not a good use of anyone's time. Plus we want to be able to demonstrate larger-scale analyses with full-size inputs, not just the usual snippets of data whittled down to be convenient to download and move around. (Genomic data is getting big, if you haven't noticed.)
So earlier this year, we converted all our workshop tutorials to Jupyter Notebooks, an increasingly popular medium for combining live, executable code and documentation content, hosted on Terra.
And no kidding, it's been transformative. So far this year we've done three "GATK bootcamp" workshops (4 days long, 50% hands-on tutorials) and in every one of them the verdict was the same: notebooks FTW. Compared to our old approach, we spend so much less time troubleshooting technical issues and so much more actually exploring and discussing what the tools are doing, what the data looks like and so on -- you know, the interesting stuff. Not unexpectedly, the Notebooks-based approach is also proving to be extremely popular with participants who have less experience with command line environments.
In my next post later this week, I'll walk you through one of the notebooks from our most recent workshop. My goal is to show how you can take advantage of these resources to level up your understanding of how GATK tools work even if you can't make it to one of our workshops in person.
Of course if you're too impatient to wait for the guided tour, feel free to sneak a peek at the notebooks I plan to demo, which you can find in this workshop workspace in the Terra Showcase. If you read my post on the Best Practices pipelines from last week, you might have already signed up on Terra and claimed your free credits… but if you haven't, please go ahead and do that now, because you're going to want to clone the workspace and open the notebooks in interactive mode.
Go to http://app.terra.bio and you'll be asked to log in with a Google identity. If you don't have one already, you can create one, and choose to either create a new Gmail account for it or associate your new Google identity with your existing email address. See this article for step-by-step instructions on how to register if needed. Once you've logged in, look for the big green banner at the top of the screen and click "Start trial" to take advantage of the free credits program. As a reminder, access to Terra is free but Google charges you for compute and storage; the credits (a $300 value) will allow you to try out the resources I'm describing here for free. To clone a workspace, open it, expand the workspace action menu (three-dot icon, top right) and select the "Clone" option. In the cloning dialog, select the billing project we created for you with your free credits. The resulting workspace clone belongs to you. Have fun!
Last week, I wrote about a new initiative we're kicking off to make it easier to get started with GATK. Part of that involves making it easier for anyone to try out the Best Practices workflows without having to do a ton of work up front. That's a pretty big can of worms, because for a long time the Best Practices were really meant to describe at a high level the key GATK (and related) tools/steps you need to run for a particular type of analysis (e.g. germline short variant discovery). They weren't intended to provide an exact recipe of commands and parameters… Yet that's what many of you have told us you want.
For the past couple of years we've been providing actual reference implementations in the form of workflows written in the Workflow Description Language, but that still leaves you with a big old learning curve to overcome before you can actually run them. And we know that for many of you, that learning curve can feel both overwhelming and unwarranted - especially when you're in the exploratory phase of a project and you're not even sure yet that you'll end up using GATK.
To address that problem, we've set up all the GATK Best Practices workflows in public workspaces on our cloud platform, Terra. These workspaces feature workflows that are fully configured with all commands and parameters, as well as resource files and example data you need to run them right out of the box. All it takes is a click of a button! (Almost. There's like three clicks involved, for real).
Let me show you one of these workspaces, and how you would use it to try out Best Practices pipelines. It should take about 15 mins if you follow along and actually click all the things. Or you can just read through to get a sense of what's involved.
GATK has always been kind of a beast to get started with -- command-line program, many different tools under the hood, complex algorithms, multi-step pipelines, scale of computational resources involved... Plenty of challenges to go around, especially if you don't have a lot of computational experience.
We want to make it easier for anyone to try out the GATK Best Practices without investing a whole lot of time and effort up front. To that end, we're now using a cloud-based platform called Terra to share the GATK Best Practices as fully-configured pipelines that work right out of the box on example data that we provide, complemented by Jupyter Notebooks that walk you through the logic, operation and results of each step. We've already been using this approach in our popular workshop series with encouraging results, and we're planning to convert all our tutorials to Jupyter Notebooks that can be run in Terra. We don't expect all of you to adopt Terra for your work, but this feels like the best way we can empower you to get started with GATK.
The Terra platform is developed by our colleagues in the Data Sciences Platform at the Broad; it's free to access and we have funding to give every new account $300 in credits to cover computing & storage costs (which are billed by Google Cloud), so anyone can go in and try the pipelines at no cost and minimal effort. If you previously heard of FireCloud, this is essentially the same platform, but with a redesigned interface to make it more user-friendly.
We've set up the Best Practices pipelines in fully-furnished workspaces so you can poke at them, see how they work and examine the results they produce on example data. Then --where I think it gets really exciting-- you can upload your own data to test how the pipelines perform on that. When a new version comes out, you can test it quickly and decide whether the new results make it worth upgrading or whether you can wait until the next version. (The GATK engine team is developing some additional infrastructure to publish systematic benchmarks for every release but that's still a few months down the road at least.) We're also working to provide utilities for doing common ancillary tasks like converting between formats; for example, if you received FASTQs from your sequence provider and you want to use our pre-processing workflow that takes in unmapped BAMs.
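As an illustration of the FASTQ-to-unmapped-BAM conversion mentioned above, the Picard FastqToSam tool (bundled with GATK4) handles that task. Here's a minimal sketch that just builds the command; the file and sample names are placeholders, not real data we provide.

```python
def fastq_to_ubam_command(r1, r2, out_bam, sample):
    """Build a FastqToSam command for a pair of FASTQ files.

    FastqToSam is a Picard tool bundled with GATK4; the options below
    are its standard Picard-style argument names.
    """
    return [
        "gatk", "FastqToSam",
        "--FASTQ", r1,           # first read file of pair
        "--FASTQ2", r2,          # second read file of pair
        "--OUTPUT", out_bam,     # unmapped BAM to write
        "--SAMPLE_NAME", sample, # sample name to embed in read groups
    ]

# Placeholder file names, for illustration only.
print(" ".join(fastq_to_ubam_command(
    "reads_R1.fastq.gz", "reads_R2.fastq.gz", "unmapped.bam", "sample1")))
```

The resulting unmapped BAM is what our pre-processing workflow expects as input.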
We've been using Terra in our most recent workshops, and we're really encouraged by the responses we’ve gotten so far as well as the educational opportunities it offers. The user-friendly access to cloud compute capabilities means participants can run full-scale pipelines without worrying about computational infrastructure. The support for Jupyter Notebooks makes it way easier to do interactive hands-on tutorials during workshops AND distribute the workshop materials for self-service learning for anyone who can't make it to a workshop.
There's a lot to unpack on this topic, so we're going to roll out a series of blog posts explaining what you can do with the GATK resources we publish in Terra, how to get started and where to go from there. Stay tuned and make sure to follow the blog or @gatk_dev on Twitter.
See Events calendar for full list and dates