Reap the benefits of a GATK workshop without the time and expense of an international trip
For many in our community, July heralds the much-welcome summer break - whether that means vacationing, or finally getting some work done without the constant disruptions of classes and committees. But for the GATK support team, it's the height of our workshop season! Next week, we'll be in Cambridge, UK; and after that the World Tour heads south to Brazil, Costa Rica and Spain. So even as we pack up our bags, this feels like a good time to highlight how any one of you can take advantage of the workshop materials without leaving your desk (or couch, or porch, or wherever you find yourself this season).
Our standard workshop is a four-day intensive course on, well, everything you need to know to do variant discovery with GATK. It covers the essentials of working with high-throughput sequencing data, the tools and technologies you’ll use and how to use them, and the algorithms and methods at the heart of the GATK Best Practices. The main goal is to equip participants with actionable skills and know-how, so we alternate between lectures on the theory and hands-on exercises to keep people engaged and focused on making those connections.
Starting this year, we’re basing all the hands-on sections on interactive Jupyter notebooks in our cloud platform, Terra. This liberates us from the technical friction we used to face when we had people run on their own laptops, and allows us to really focus on the data analysis instead of spending time installing apps or troubleshooting various operating systems. It also means that it's now trivially easy for us to share materials with everyone beyond the workshop cohort in a way that works right out of the box. I wrote a bit about this last week when I announced that all our new tutorials will be Jupyter notebooks.
We realize that the overwhelming majority of you can't make it to a workshop in person, and that's why I find it especially exciting that we can now provide fully-loaded workspaces on Terra. With those, you can work through all the GATK workshop tutorials (see full list of links at the bottom of this post) at home or at work -- or on the beach, if you're the type who can't let go (no judgment, I'm that type). Check out my previous blog post for a walkthrough of an abridged version of a workshop tutorial so you can decide if it's the sort of thing you'd like to try out. We cover a lot of ground in the workshop tutorials, and this is a really great resource to try at your own pace -- whether you're just getting started or you're well-versed in the classic tools and looking for a primer on the newer material. You can ask us questions on the forum if you get stuck at any point. It may not be the full workshop experience, but I'm confident it will help you move forward in your work.
We share the slide decks from all the lectures, and for some workshops we have recordings posted on YouTube. The latest batch is slated for publication really soon, so watch the blog for an announcement -- and remember you can subscribe to email notifications for both the blog and the forum (e.g. to get an email when we answer your questions).
So here's a list of all the resources we provide for this. Keep in mind this is a snapshot in time, as we make updates and improvements for every new workshop, so if you read this later than, say, mid-August, make sure to check that you're looking at the most recent versions.
Includes slides and workshop tutorial bundles, GATK-related posters, and links to YouTube videos
And as always, the GATK forum!
Earlier this week, I made a big deal about how we plan to develop all of our GATK tutorials as Jupyter Notebooks in Terra going forward. Today I'd like to offer you a concrete look at what we like about using notebooks for GATK tutorials.
I was planning to just walk you through a couple of notebooks in one of our workshop workspaces, but then decided to make a custom workspace and notebook to show you what I mean without the complexity of the full-length tutorials. It's part highlights, featuring a couple of my favorite tutorial scenarios from the workshops that are fairly simple yet quite effective, and part sneak preview of the newest version of the tutorials, which boast cool new features and will be unveiled at the next workshop (Cambridge in July). Oh, and part explainer on what exactly Jupyter Notebooks are, anyway.
Overall you can consider this mini-tutorial a stepping stone to being able to use the workshop tutorial workspaces without needing to actually attend a workshop. The workspace docs and the notebook itself both have a lot of explanations about how things work and how to use them in your pursuit of deeper understanding of GATK. So I don't feel the need to go on and on about it here (for once). But I will mention, in case you're on the fence about whether to spend 5 whole minutes checking out the workspace (add 15 to 20 minutes to actually work through the full notebook), it involves running GATK commands, streaming files, and viewing data in IGV -- all without ever leaving the warm embrace of the notebook.
Last week I wrote about how we're using a cloud platform called Terra to make it easier to get started with GATK; and specifically I highlighted the fully loaded workspaces that showcase our Best Practices pipelines, which we think will make it a lot easier to test drive our pipelines end-to-end. This week I want to talk about a complementary approach we're taking, using Jupyter Notebooks on Terra to teach the step-by-step details of what happens inside the pipelines. Though before we get into the nitty gritty of how it works, I'd like to take some time to walk you through why we're taking this particular approach.
Writing a good tutorial is not that hard, in theory. You state the problem, provide a command line, then give a few instructions for poking at the outputs and you discuss what happened. The hardest part should be choosing what details and parameters to explain vs. what to leave alone to avoid confusing newcomers. Right? Well… In practice, the hardest part is often providing the inputs and instructions in such a way that most people will be able to run it in their own, unique and precious computing environment without some amount of head scratching and at least three pages of alternative instructions for this system or that system. Ugh.
We've run dozens of workshops where the setup is that we provide a PDF of instructions and a data bundle, and participants run commands in the terminal on their laptops. Inevitably some non-trivial amount of time ends up being spent debugging environment settings, typos and character encodings. That's just not a good use of anyone's time. Plus we want to be able to demonstrate larger-scale analyses with full-size inputs, not just the usual snippets of data whittled down to be convenient to download and move around. (Genomic data is getting big, if you haven't noticed.)
So earlier this year, we converted all our workshop tutorials to Jupyter Notebooks, an increasingly popular medium for combining live, executable code and documentation content, hosted on Terra.
And no kidding, it's been transformative. So far this year we've done three "GATK bootcamp" workshops (4 days long, 50% hands-on tutorials) and in every one of them the verdict was the same: notebooks FTW. Compared to our old approach, we spend so much less time troubleshooting technical issues and so much more time actually exploring and discussing what the tools are doing, what the data looks like and so on -- you know, the interesting stuff. Not unexpectedly, the notebooks-based approach is also proving to be extremely popular with participants who have less experience with command-line environments.
In my next post later this week, I'll walk you through one of the notebooks from our most recent workshop. My goal is to show how you can take advantage of these resources to level up your understanding of how GATK tools work even if you can't make it to one of our workshops in person.
Of course if you're too impatient to wait for the guided tour, feel free to sneak a peek at the notebooks I plan to demo, which you can find in this workshop workspace in the Terra Showcase. If you read my post on the Best Practices pipelines from last week, you might have already signed up on Terra and claimed your free credits… but if you haven't, please go ahead and do that now, because you're going to want to clone the workspace and open the notebooks in interactive mode.
Go to http://app.terra.bio and you'll be asked to log in with a Google identity. If you don't have one already, you can create one, and choose to either create a new Gmail account for it or associate your new Google identity with your existing email address. See this article for step-by-step instructions on how to register if needed. Once you've logged in, look for the big green banner at the top of the screen and click "Start trial" to take advantage of the free credits program. As a reminder, access to Terra is free but Google charges you for compute and storage; the credits (a $300 value) will allow you to try out the resources I'm describing here for free. To clone a workspace, open it, expand the workspace action menu (three-dot icon, top right) and select the "Clone" option. In the cloning dialog, select the billing project we created for you with your free credits. The resulting workspace clone belongs to you. Have fun!
Last week, I wrote about a new initiative we're kicking off to make it easier to get started with GATK. Part of that involves making it easier for anyone to try out the Best Practices workflows without having to do a ton of work up front. That's a pretty big can of worms, because for a long time the Best Practices were really meant to describe at a high level the key GATK (and related) tools/steps you need to run for a particular type of analysis (e.g. germline short variant discovery). They weren't intended to provide an exact recipe of commands and parameters… Yet that's what many of you have told us you want.
For the past couple of years we've been providing actual reference implementations in the form of workflows written in the Workflow Description Language, but that still leaves you with a big old learning curve to overcome before you can actually run them. And we know that for many of you, that learning curve can feel both overwhelming and unwarranted - especially when you're in the exploratory phase of a project and you're not even sure yet that you'll end up using GATK.
To address that problem, we've set up all the GATK Best Practices workflows in public workspaces on our cloud platform, Terra. These workspaces feature workflows that are fully configured with all commands and parameters, as well as resource files and example data you need to run them right out of the box. All it takes is a click of a button! (Almost. There's like three clicks involved, for real).
Let me show you one of these workspaces, and how you would use it to try out Best Practices pipelines. It should take about 15 mins if you follow along and actually click all the things. Or you can just read through to get a sense of what's involved.
GATK has always been kind of a beast to get started with -- command-line program, many different tools under the hood, complex algorithms, multi-step pipelines, scale of computational resources involved... Plenty of challenges to go around, especially if you don't have a lot of computational experience.
We want to make it easier for anyone to try out the GATK Best Practices without investing a whole lot of time and effort up front. To that end, we're now using a cloud-based platform called Terra to share the GATK Best Practices as fully-configured pipelines that work right out of the box on example data that we provide, complemented by Jupyter Notebooks that walk you through the logic, operation and results of each step. We've already been using this approach in our popular workshop series with encouraging results, and we're planning to convert all our tutorials to Jupyter Notebooks that can be run in Terra. We don't expect all of you to adopt Terra for your work, but this feels like the best way we can empower you to get started with GATK.
The Terra platform is developed by our colleagues in the Data Sciences Platform at the Broad; it's free to access and we have funding to give every new account $300 in credits to cover computing & storage costs (which are billed by Google Cloud), so anyone can go in and try the pipelines at no cost and minimal effort. If you previously heard of FireCloud, this is essentially the same platform, but with a redesigned interface to make it more user-friendly.
We've set up the Best Practices pipelines in fully-furnished workspaces so you can poke at them, see how they work and examine the results they produce on example data. Then --where I think it gets really exciting-- you can upload your own data to test how the pipelines perform on that. When a new version comes out, you can test it quickly and decide whether the new results make it worth upgrading or whether you can wait until the next version. (The GATK engine team is developing some additional infrastructure to publish systematic benchmarks for every release but that's still a few months down the road at least.) We're also working to provide utilities for doing common ancillary tasks like converting between formats; for example, if you received FASTQs from your sequence provider and you want to use our pre-processing workflow that takes in unmapped BAMs.
We've been using Terra in our most recent workshops, and we're really encouraged by the responses we’ve gotten so far as well as the educational opportunities it offers. The user-friendly access to cloud compute capabilities means participants can run full-scale pipelines without worrying about computational infrastructure. The support for Jupyter Notebooks makes it way easier to do interactive hands-on tutorials during workshops AND distribute the workshop materials for self-service learning for anyone who can't make it to a workshop.
There's a lot to unpack on this topic, so we're going to roll out a series of blog posts explaining what you can do with the GATK resources we publish in Terra, how to get started and where to go from there. Stay tuned and make sure to follow the blog or @gatk_dev on Twitter.
Do you ever get frustrated by the current state of the GATK documentation? Do you find it hard to find information about what tools you should use and how they work? And when you do find relevant documentation, do you ever find it unclear and difficult to understand?
Yeah, so do we. The number of new tools and algorithms has exploded in recent years, and despite our best efforts, it can be tough for us to keep our educational materials clear and up to date. So we're preparing to make a big push to address these problems.
Specifically, we have a science writer position open in our User Education team: https://lnkd.in/eHWiQF3
Seriously, if this is you, go ahead and apply today.
Think about it -- you help your friend get a great job AND you'll get better GATK documentation out of the deal. It's perfect. What are you waiting for, send them the link now!
Do you like having a life? Year after year the Broad Institute places very high on those "Best places to work" lists like this one from Working Mother magazine. You can check out the full WM ratings to understand what makes it so great.
And finally, to finish convincing you that joining us is a great idea, here's a video about what it's like to work in our part of the Broad Institute, the Data Sciences Platform. If you've watched any of our workshop videos, you may recognize a few familiar faces!
The latest GATK release introduces streamlined somatic calling with fewer errors, fewer false negatives, and optimized sensitivity and precision, thanks to several major advances in the Mutect2 pipeline. We hope the changes will help make your work more efficient, more accurate and less expensive, benefits that will be worth the slight annoyance of the occasional command-line change to the workflow. Read to the bottom for what you need to know to run and take advantage of the new pipeline.
We fixed several bugs that were responsible for error messages about invalid log probabilities, infinities, NaNs etc. We also resolved an issue where CalculateContamination worked poorly on very small gene panels.
FilterMutectCalls now filters based on a single quantity: the probability that a variant is not a somatic mutation, regardless of the cause. Previously, each class of potential artifact had its own threshold; we have removed parameters such as -normal-artifact-lod, -max-germline-posterior, -max-strand-artifact-probability, -max-contamination-probability, and even -tumor-lod. FilterMutectCalls automatically determines the probability threshold that optimizes the "F score," the harmonic mean of sensitivity and precision. Users can tweak results in favor of more or less sensitivity by modifying a single parameter, the variable beta (the relative weight of sensitivity versus precision in the harmonic mean). Setting beta to a value greater than its default filters for greater sensitivity; setting it lower filters for greater precision.
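To give a flavor of the idea, here is a hedged sketch of threshold selection by expected F score -- this is my own simplified illustration with invented function names, not GATK's actual FilterMutectCalls code. Given each candidate's estimated probability of not being somatic, we can compute the expected true and false positives at any cutoff and scan for the threshold that maximizes the weighted harmonic mean:

```python
# Illustrative sketch (not GATK code): pick the error-probability cutoff
# that maximizes the expected F-beta score, using only the per-variant
# probabilities themselves -- no truth set required.

def best_threshold(error_probs, beta=1.0, steps=100):
    """Scan candidate thresholds; keep variants whose error prob <= t."""
    total_true = sum(1 - p for p in error_probs)  # expected real somatic variants
    best_t, best_f = 0.0, -1.0
    for i in range(1, steps + 1):
        t = i / steps
        kept = [p for p in error_probs if p <= t]
        exp_tp = sum(1 - p for p in kept)   # expected true positives kept
        exp_fp = sum(p for p in kept)       # expected false positives kept
        if exp_tp == 0:
            continue
        precision = exp_tp / (exp_tp + exp_fp)
        sensitivity = exp_tp / total_true
        # Weighted harmonic mean: beta > 1 weights sensitivity more heavily.
        f = (1 + beta**2) * precision * sensitivity / (beta**2 * precision + sensitivity)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f

# A few mostly confident calls plus some likely artifacts (made-up numbers):
probs = [0.01, 0.02, 0.05, 0.30, 0.90, 0.95]
t, f = best_threshold(probs, beta=1.0)
```

Raising beta in this sketch pushes the chosen threshold upward (keeping more borderline calls), which matches the behavior described above: favoring sensitivity over precision with a single knob.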
We had long suspected that modeling the spectrum of subclonal allele fractions would help distinguish somatic variants from errors. For example, if every somatic variant in a tumor occurred in 40% of cells, we would know to reject anything with an allele fraction significantly different from 20% (a heterozygous variant present in 40% of cells is carried by 20% of reads). In the Bayesian framework of Mutect2 this means that we can model the read counts of somatic variants with binomial distributions. We account for an unknown number of subclones with a Dirichlet process binomial mixture model. Because CNVs, small subclones, and genetic drift of passenger mutations all contribute allele fractions that don't match a few discrete values, this is still an oversimplification. Therefore, we include a couple of beta-binomials in the mixture to account for a background spread of allele fractions while still benefiting from clustering. Finally, we use these binomial and beta-binomial likelihoods to refine the tumor log odds calculated by Mutect2, which assume a uniform distribution of allele fractions.
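The clustering intuition can be sketched in a few lines -- this is an illustration only, with my own function names, not the real model (which is a Dirichlet-process binomial mixture with beta-binomial background components). A candidate whose alt-read count matches a subclone's allele fraction is far more likely under that cluster's binomial than under the uniform allele-fraction assumption, while a mismatched candidate is penalized:

```python
# Illustrative sketch: compare a candidate's likelihood under a clustered
# allele fraction vs. a uniform (flat) allele-fraction model.
from math import comb

def binomial_pmf(k, n, f):
    """Probability of k alt reads out of n at allele fraction f."""
    return comb(n, k) * f**k * (1 - f)**(n - k)

def uniform_marginal(k, n):
    """Marginal likelihood of k alt reads when f ~ Uniform(0, 1):
    the integral of Binomial(k | n, f) over f is exactly 1 / (n + 1)."""
    return 1 / (n + 1)

depth = 100
cluster_af = 0.20  # e.g. a heterozygous subclone present in 40% of cells
on_cluster = binomial_pmf(21, depth, cluster_af)   # 21/100 alt reads: matches
off_cluster = binomial_pmf(45, depth, cluster_af)  # 45/100 alt reads: does not
flat = uniform_marginal(21, depth)
```

In this toy case the matching candidate gains roughly an order of magnitude in likelihood over the flat model, while the mismatched one collapses to nearly zero -- the same kind of adjustment that refines the tumor log odds.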
Ever since our original foray into synthetic data creation, I've been looking for an opportunity to follow up on that project. It is absolutely obvious to me that there is a huge unmet need for researcher-friendly synthetic sequence data resources, i.e. generic synthetic datasets you can use off the shelf, plus user-friendly tooling to generate customized datasets on demand. I'm also fairly confident that it wouldn't actually be very hard, technically speaking, to start addressing that need. The catch though is that this sort of resource generation work is not part of my remit, beyond what is immediately useful for education, frontline support and outreach purposes. Despite a surprisingly common misconception, I don't run the GATK development team! (I just blog about it a lot)
So when the irrepressible Ben Busby reached out to ask if we wanted to participate in the FAIR Data Hackathon at BioIT World, I recruited a few colleagues and signed us up to do a hackathon project called Bringing the Power of Synthetic Data Generation to the Masses. The overall mission: turn the synthetic data tooling we had developed for the ASHG workshop into a proper community resource.
If you're not familiar with FAIR, it's a set of principles and protocols for making research resources more Findable, Accessible, Interoperable and Reusable. Look it up, it's important.
Part 2 of a series on the theme of synthetic data | Start with part 1: Fake your data -- for Science!
Last year my team and I collaborated with a genetics researcher, Dr. Matthieu Miossec, to reproduce an analysis from a paper he had co-authored, as a case study for a workshop. The paper reported variants associated with a particular type of congenital heart disease called Tetralogy of Fallot; a pretty classic example of a germline analysis that uses GATK for variant calling. Between the information provided in the preprint and Matthieu's generous cooperation, we had everything we needed to reimplement the analysis in a more portable, fully shareable form… Except the data, because as with many human genetics studies, the original exome data was private.
Back then, with the exuberance of youth filling my sails, I thought it should be simple enough to create some synthetic data to stand in for the original. I am wiser now, and grayer.
Part 1 of a series on the theme of synthetic data
Don't get me wrong, I'm not suddenly advocating for fraudulent research. What I'm talking about is creating synthetic sequence data for testing pipelines, sharing tools and generally increasing the computational reproducibility of published studies, so that we can all more easily build on each other's work.
The majority of the effort around computational reproducibility has so far focused on better ways to share and run code, as far as I can tell. With great results -- it's been transformative to see the community adopt tooling like version control, containers and Jupyter notebooks. Yet you can give me all the containers and notebooks in the world; if I don't have appropriate data to run that code on, none of it helps me.
Most of the genomic data that gets generated for human biomedical research is subject to very strict access restrictions. These protections exist for good reason, but on the downside, they make it much harder to train researchers in key methodologies until after they have been granted access to specific datasets — if they can get access at all. There are certainly open datasets like 1000 Genomes and ENCODE that can be used beyond their original research purposes for some types of training and testing. However they don't cover the full range of what is needed in the field in terms of technical characteristics (e.g. exome vs. WGS, depth of coverage, number of samples for scale testing, etc.); not by a long way.
That's where fake data comes in -- we can create synthetic datasets to use as proxies for the real data. This is not a new idea of course; people have been using synthetic data for some time, as in the ICGC-TCGA DREAM Mutation challenges, and there is already a rather impressive range of command-line software packages available for generating synthetic genomic data. It's even possible to introduce (or "spike in") variants into sequencing data, real or fake, on demand. So that's all pretty cool. But in practice these packages tend to be mostly used by savvy tool developers for small-scale testing and benchmarking purposes, and rarely (if ever? send me links!) by biomedical researchers for providing reproducible research supplements.
And frankly, it's no surprise. It's actually kinda hard.
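Even a toy version of the core operation hints at why. Here's a hedged sketch of the naive heart of a SNP spike-in -- rewrite the base at a target position in a fraction of the overlapping reads to approximate a desired allele fraction. The function name and read representation are my own inventions for illustration; real spike-in tools work on aligned BAMs and must also handle base qualities, read pairs, and local realignment:

```python
# Toy illustration of "spiking in" a SNP at a target allele fraction.
# Reads are represented as (start_position, sequence) tuples; real tools
# operate on BAM records and handle far more bookkeeping.
import random

def spike_in_snp(reads, pos, alt_base, allele_fraction, seed=42):
    """Flip the base at 0-based position `pos` to `alt_base` in roughly
    `allele_fraction` of the reads that cover that position."""
    rng = random.Random(seed)  # seeded for reproducible fake data
    out = []
    for start, seq in reads:
        offset = pos - start
        covers = 0 <= offset < len(seq)
        if covers and rng.random() < allele_fraction:
            seq = seq[:offset] + alt_base + seq[offset + 1:]
        out.append((start, seq))
    return out

# Ten identical 10 bp reads, all covering position 5 (reference base "C"):
reads = [(0, "ACGTACGTAC") for _ in range(10)]
spiked = spike_in_snp(reads, pos=5, alt_base="T", allele_fraction=0.5)
alt_count = sum(1 for _, seq in spiked if seq[5] == "T")
```

The toy version fits in a screenful; the distance between this and something a researcher could trust for a published study is exactly the gap described above.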
See Events calendar for full list and dates