By Eric Banks, Director, Data Sciences Platform at the Broad Institute
Last week I wrote about our efforts to develop a data processing pipeline specification that would eliminate batch effects, in collaboration with other major sequencing centers. Today I want to share our implementation of the resulting "Functional Equivalence" pipeline spec, and highlight the cost-centric optimizations we've made that make it incredibly cheap to run on Google Cloud.
For a little background, we started transitioning our analysis pipelines to Google Cloud Platform in 2016. Throughout that process we focused most of our engineering efforts on bringing down compute cost, which is the most important factor for our production operation. It's been a long road, but all that hard work really paid off: we managed to get the cost of our main Best Practices analysis pipeline down from about $45 to $5 per genome! As you can imagine, that kind of cost reduction has a huge impact on our ability to do more great science per research dollar -- and now, we’re making this same pipeline available to everyone.
By Eric Banks, Director, Data Sciences Platform and original member of the GATK development team
Ever since the GATK started getting noticed by the research community (mainly as a result of our contribution to the 1000 Genomes Project), people have asked us to share the pipelines we use to process data for variant discovery. Historically we have shied away from providing our actual scripts, not because we didn't want to share, but because the scripts themselves were very specific to the infrastructure we were using at the Broad. Fortunately we've been able to move beyond that thanks to the development of WDL and Cromwell, which allow potentially limitless portability of our pipeline scripts.
But it was also because there is a fair amount of wiggle room in how to implement a pipeline to achieve correct results, depending on whether you care more about speed, cost or other factors. So instead we formulated "Best Practices", which I'll talk more about in a minute, to provide a blueprint of the key steps in the pipeline.
Today though we're taking that idea a step further: in collaboration with several other major genomics institutions, we defined a "Functional Equivalence" specification that is intended to standardize pipeline implementations, with the ultimate goal of eliminating batch effects and thereby promoting data interoperability. That means if you use a pipeline that follows this specification, you can rest assured that you will be able to analyze your results against all compatible datasets, including huge resources like gnomAD and TOPMed.
We have a new tutorial, Tutorial#11136, that outlines how to call somatic short variants, i.e. SNVs and indels, with GATK4 Mutect2. The tutorial provides small example data to follow along with.
Full-length Mutect2-compatible human germline resources are available on our [FTP server](https://software.broadinstitute.org/gatk/download/bundle) and at gs://gatk-best-practices/. The resources are simplified from the gnomAD resource and retain population allele frequencies. Mutect2 and GetPileupSummaries are the two tools in the workflow that each require a germline resource.
If you want to run the Somatic Short Variant Discovery Best Practices workflow using WDL, be sure to check out the official Mutect2 WDL script in the gatk-workflows repository. @bshifaw and other engineers optimize the scripts in the repository to run efficiently in the cloud. Furthermore, the scripts come with example JSON-format input files filled out with publicly accessible cloud data.
For other Mutect2-related scripts, e.g. towards panel of normals generation, check out the gatk repository's scripts/mutect2_wdl directory. Our developers update these scripts on a continual basis.
If you are new to somatic calling, be sure to read Article#11127. It gives an overview of what traditional somatic calling entails. For one, somatic calling is NOT just a difference between two callsets in that germline variant sites are excluded from consideration.
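To make that last point concrete, here is a toy sketch in Python. It is not how Mutect2 works internally; it only contrasts a naive callset difference with a difference that also sets aside known germline sites. All sites, names and functions below are made up for illustration.

```python
# Toy illustration only: variant sites as (chrom, pos, ref, alt) tuples.
# Every site below is a made-up example, not real data.
tumor_calls = {
    ("chr1", 1000, "A", "T"),  # also called in the normal
    ("chr1", 2000, "G", "C"),  # not in the normal, but a known germline site
    ("chr2", 3000, "T", "G"),  # not in the normal, not a known germline site
}
normal_calls = {("chr1", 1000, "A", "T")}
germline_sites = {("chr1", 2000, "G", "C")}


def naive_difference(tumor, normal):
    """Plain set difference between two callsets."""
    return tumor - normal


def somatic_candidates(tumor, normal, germline):
    """Callset difference with known germline sites excluded (hypothetical helper)."""
    return (tumor - normal) - germline


print(sorted(naive_difference(tumor_calls, normal_calls)))
print(sorted(somatic_candidates(tumor_calls, normal_calls, germline_sites)))
```

In this toy example the naive difference keeps the chr1:2000 site, while the germline-aware version drops it because it appears in the (hypothetical) germline set.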
For those switching from GATK3 MuTect2, Blog#10911 will bring you up to speed on the differences.
If you are interested in simply calling differences between two samples, Blog#11315 outlines an off-label two-pass Mutect2 workflow. Off-label means the workflow is not a part of the Best Practices and is therefore unsupported. However, given enough community interest, we may be convinced to further flesh out the workflow. Please do post to the forum to express interest.
Given my years as a biochemist, when I have two samples to compare, my first impulse is to want to know what the functional differences are, i.e. differences in proteins expressed between the two samples. I am interested in genomic alterations that ripple down the central dogma to transform a cell.
Please note the workflow that follows is NOT a part of the Best Practices. This is an illustrative, unsupported workflow. For the official Somatic Short Variant Calling Best Practices workflow, see Tutorial#11136.
To call every allele that is different between two samples, I have devised a two-pass workflow that takes advantage of Mutect2 features. This workflow uses Mutect2 in tumor-only mode and appropriates the --germline-resource argument to supply a single-sample VCF with allele fractions instead of population allele frequencies. The workflow assumes the two case samples being compared originate from the same parental line and the ploidy and mutation rates make it unlikely that any site accumulates more than one allele change.
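As a rough sketch of what one such tumor-only invocation might look like, the Python snippet below only assembles and prints a command line; it does not run GATK. The file names and sample name are hypothetical placeholders, and how the single-sample VCF is prepared and how the two passes are chained are not covered in this excerpt, so treat this as an assumption-laden illustration rather than the workflow itself.

```python
import shlex


def mutect2_tumor_only_command(reference, bam, sample_name, resource_vcf, output_vcf):
    """Assemble a Mutect2 tumor-only command line (illustrative helper, not part of GATK)."""
    return [
        "gatk", "Mutect2",
        "-R", reference,                      # reference FASTA
        "-I", bam,                            # reads for the sample being called
        "-tumor", sample_name,                # sample name as recorded in the BAM
        "--germline-resource", resource_vcf,  # here: single-sample VCF with allele fractions
        "-O", output_vcf,                     # output callset
    ]


# All paths and names below are hypothetical placeholders.
cmd = mutect2_tumor_only_command(
    reference="ref.fasta",
    bam="sampleB.bam",
    sample_name="sampleB",
    resource_vcf="sampleA_allele_fractions.vcf.gz",
    output_vcf="sampleB_vs_sampleA.vcf.gz",
)
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it could later be handed to `subprocess.run` without shell quoting problems.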
Over the past two weeks and a bit, the GATK 4.0(.0.0) package has been downloaded nearly eight thousand times. That's... not too shabby! Let's see if y'all can take it to 8,000 before we cut the 4.0.1.0 release :)
Yes, I plan to explain the version numbering system in an upcoming blog post.
Looking back at our download records, it outdoes any previous release we've ever done by a factor of nearly four. Interestingly, it comes after a major slump in download numbers over the past six months, i.e. since we announced the GATK4 beta and the open-sourcing at the Bio-IT World meeting in May 2017. It looks like a lot of people were holding their breath waiting for the 4.0 release... I hope it was worth the wait.
Two weeks ago, for the official release of GATK version 4.0, we held a live online event that was both a launch party and a comprehensive if condensed overview of everything that's new in GATK4. Over the course of two hours, members of the GATK development team and a great lineup of external guests gave presentations about the new capabilities, discussed their implications in small panels and answered questions from the online audience.
I had the privilege of serving as host -- and unintentional comic relief, between forgetting panelists and bumping into the set furniture -- so I'm probably biased, but I'd say it was the most fun-yet-informative event we've done so far on GATK. Not that we do a lot of events -- and it's mostly just workshops -- but this felt pretty special. We had a great time doing it, and lots of people showed up to watch and ask questions. So we're now considering doing others in a similar vein, though they would each be focused on a specific topic and have more time for answering questions from the online audience. If that sounds like something you'd be interested in, let us know in the comments!
The brand new 4.0 version of the GATK was released -- at long last! -- on Tuesday Jan 9, 2018.
In lieu of our traditional version highlights, for this release we have collected the following resources:
Coming soon: The GATK4 migration guide will detail the key differences at the level of tools and command lines that you should watch out for when you upgrade to using GATK4 in your own work.
In just a few days, we'll be releasing GATK4 into general availability -- that's right, the big 4.0! To mark the occasion we are hosting a launch event that will be livestreamed on the Broad Institute's Facebook. Here's a short URL if you'd like to share it: broad.io/facebook.
The launch event is going to be a two-hour whistle-stop tour of what's new and shiny in GATK4. My fellow members of the Data Sciences Platform and GATK development team will give short presentations on key features, then we'll have some panel discussions to dig a bit deeper into the technical underpinnings and implications of these features. For the panels we'll be joined by a really exciting lineup of special guests from the University of California Santa Cruz, Yale School of Medicine, Intel, IBM Research, Verily Life Sciences, Amazon Web Services, Cloudera, Alibaba Cloud, and Microsoft Genomics. Details below the fold.
We should also have some time to take questions from the online audience, so be sure to log in and ask your questions in the comments section of the livestream. We'll also be checking the forums and Twitter for those of you who don't have a Facebook account. To be clear, you don't need an account to watch the video stream.
We hope you'll join us to celebrate this important milestone!
With less than a week to go before the big day (aaaaaaah), we're putting the finishing touches on some important updates to the website and the documentation.
Starting Tuesday Jan 9, the primary supported version will be 4.0, so all the documentation displayed by default on the website will be the 4.0 documentation. That covers not just the Tool Docs, which have always been systematically versioned, but also the forum-based peripheral docs that are more general and typically do not change from one version to the next. In the case of the move to GATK4, a majority of these peripheral doc articles are affected by a range of changes, from minor points of syntax to major shifts in functionality (e.g. switching from -nct to Spark for multithreading). Here's how we're planning to deal with that.
What's new in GATK4? In this short video, Laura Gauthier explains how the speed and scalability of joint calling is dramatically improved in GATK4 thanks to the Intel GenomicsDB datastore.
See Events calendar for full list and dates