Machine learning or ML is one of the hottest buzzwords (buzzphrases?) in genomics today, along with data science, artificial intelligence (AI) and deep learning (DL). And as with all good buzzwords, it's very unfashionable to admit that you don't know exactly what they mean. So here's an intro-level overview of these terms and where they fit in the GATK world. If after reading this you find yourself craving more substance about the exciting new ML methods being developed in GATK4, don't despair -- we plan to follow up next week with a more detailed post written by Lee Lichtenstein, GATK's leader of somatic computational method development and all-around data science nerd.
It's finally Spring in Boston; the trees are sprouting leaves again, everything is turning green and gloriously alive -- and Bio-IT World is starting, which makes it official! Many of you may not know or care about Bio-IT, since it's more a biotech trade show than a scientific meeting, but for us it has become a springtime tradition to announce important developments there. These announcements have often focused on strategic/roadmap level plans -- for example, that's where we broke the news last year, to a standing ovation, that GATK4 would be fully open-source (whoo!) -- but this year we're in a position to talk about the new capabilities we're actually delivering, and that feels really good. To quote the inevitable Steve Jobs, real artists ship, and boy are we shipping.
We have two major themes that we're developing this year: (1) democratization of the Best Practices pipelines, which includes everything from increasing access to ease of deployment, standardization and optimization for cost and speed; and (2) application of machine learning to improve accuracy and scalability in established pipelines as well as tackle new areas like germline CNV discovery.
If I believed in fate, I would say she has a twisted sense of humor. As you may know, today (April 25) was National DNA Day in the USA. And as it happens, on this very day, our friends and colleagues in the Broad's Genomics Platform hit a major milestone: they sequenced their 100,000th human whole genome! How exciting, right? And a complete coincidence!
But that's not the twist. The twist is that, knowing this was going to happen (because the Genomics Platform is a well-oiled machine), they had planned a celebratory livestream of the preparation and loading of the sequencer... only for the broadcast to fail because another sequencer flooded the bandwidth when it started uploading data to its mothership! Yep, even on DNA Day there can be such a thing as too much DNA.
Fortunately someone recorded the whole event on a local laptop so we can still enjoy all ~20 minutes of Eric Lander, Stacey Gabriel, Sheila Dodge and Andy Hollinger discussing the nature and implications of this milestone while Erin LaRoche does the actual work of loading the sequencer (Novaseq in action!) with samples from the Gabriella Miller Kids First project, including lucky number 100,000.
So click here to view the full video on Facebook and find out what the Broad will sequence once we run out of people.... Then head over to reddit/science for Aviv Regev's "Ask Me Anything" session on Thursday April 26.
As I type this, four of my colleagues are kicking off a workshop in Taipei, hosted by Taiwan's National Applied Research Laboratories (NARLabs). For the next four days, they will be leading a cohort of ~40 participants through everything from core concepts to practical details involved in understanding and applying the GATK Best Practices for variant discovery. They will alternate between lectures and hands-on exercises, covering all current GATK workflows (germline and somatic short variants, plus somatic CNVs) as well as the pipelining systems that we recommend. And for the first time they'll be doing all of it with GATK4. Whoo! We've been working on this new formula and the updated materials for several months so as far as I'm concerned it's the most exciting event that I'm unable to attend this year...
But this is only the first of many workshops! In a few weeks, another workshop crew will be traveling to Montreal to deliver the same training course at McGill University. In April, I'll be taking a crew to Beijing, hosted by Intel China, which I'm really looking forward to given how much is happening in the genomics space over there; among other cool prospects I'm thrilled that we're going to get to try out the brand new Cromwell backend developed by Alibaba Cloud for running GATK pipelines in China. We're also working on scheduling another workshop in early May in Qatar, and we already have a couple of others scheduled in July (Cambridge UK) and September (Seville, Spain). Possibly others TBD, stay tuned...
By Eric Banks, Director, Data Sciences Platform at the Broad Institute
Last week I wrote about our efforts to develop a data processing pipeline specification that would eliminate batch effects, in collaboration with other major sequencing centers. Today I want to share our implementation of the resulting "Functional Equivalence" pipeline spec, and highlight the cost-centric optimizations we've made that make it incredibly cheap to run on Google Cloud.
For a little background, we started transitioning our analysis pipelines to Google Cloud Platform in 2016. Throughout that process we focused most of our engineering efforts on bringing down compute cost, which is the most important factor for our production operation. It's been a long road, but all that hard work really paid off: we managed to get the cost of our main Best Practices analysis pipeline down from about $45 to $5 per genome! As you can imagine that kind of cost reduction has a huge impact on our ability to do more great science per research dollar -- and now, we’re making this same pipeline available to everyone.
By Eric Banks, Director, Data Sciences Platform and original member of the GATK development team
Ever since the GATK started getting noticed by the research community (mainly as a result of our contribution to the 1000 Genomes Project), people have asked us to share the pipelines we use to process data for variant discovery. Historically we have shied away from providing our actual scripts, not because we didn't want to share, but because the scripts themselves were very specific to the infrastructure we were using at the Broad. Fortunately we've been able to move beyond that thanks to the development of WDL and Cromwell, which allow potentially limitless portability of our pipeline scripts.
But it was also because there is a fair amount of wiggle room in how to implement a pipeline that achieves correct results, depending on whether you care more about speed, cost or other factors. So instead we formulated "Best Practices", which I'll talk more about in a minute, to provide a blueprint of the key steps in the pipeline.
Today though we're taking that idea a step further: in collaboration with several other major genomics institutions, we defined a "Functional Equivalence" specification that is intended to standardize pipeline implementations, with the ultimate goal of eliminating batch effects and thereby promoting data interoperability. That means if you use a pipeline that follows this specification, you can rest assured that you will be able to analyze your results against all compatible datasets, including huge resources like gnomAD and TOPMed.
We have a new tutorial, Tutorial#11136, that outlines how to call somatic short variants, i.e. SNVs and indels, with GATK4 Mutect2. The tutorial provides small example data to follow along with.
Full-length Mutect2-compatible human germline resources are available on our [FTP server](https://software.broadinstitute.org/gatk/download/bundle) and at gs://gatk-best-practices/. The resources are simplified from the gnomAD resource and retain population allele frequencies. Mutect2 and GetPileupSummaries are the two tools in the workflow that each require a germline resource.
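To make the role of the germline resource concrete, here is a minimal sketch in Python that only assembles the two command lines as argument lists; it does not run GATK. The file names in the usage lines are hypothetical placeholders, and the arguments shown are a bare minimum for illustration, so check the tool documentation for your GATK version before relying on them, since required arguments can change between releases.

```python
def mutect2_command(reference, tumor_bam, tumor_sample, germline_resource, output_vcf):
    """Assemble a basic Mutect2 call that takes a germline resource."""
    return [
        "gatk", "Mutect2",
        "-R", reference,
        "-I", tumor_bam,
        "-tumor", tumor_sample,
        "--germline-resource", germline_resource,
        "-O", output_vcf,
    ]


def pileup_summaries_command(tumor_bam, germline_resource, output_table):
    """Assemble a basic GetPileupSummaries call; its resource goes in via -V."""
    return [
        "gatk", "GetPileupSummaries",
        "-I", tumor_bam,
        "-V", germline_resource,
        "-O", output_table,
    ]


# Hypothetical file and sample names, for illustration only.
call_cmd = mutect2_command(
    "ref.fasta", "tumor.bam", "tumor_sample", "germline_af.vcf.gz", "somatic.vcf.gz"
)
pileup_cmd = pileup_summaries_command(
    "tumor.bam", "germline_common_sites.vcf.gz", "pileups.table"
)
print(" ".join(call_cmd))
print(" ".join(pileup_cmd))
```

Building the commands as lists keeps them easy to inspect and to hand to `subprocess.run` later, without any shell quoting concerns.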
If you want to run the Somatic Short Variant Discovery Best Practices workflow using WDL, be sure to check out the official Mutect2 WDL script in the gatk-workflows repository. @bshifaw and other engineers optimize the scripts in the repository to run efficiently in the cloud. Furthermore, the scripts come with example JSON input files filled out with publicly accessible cloud data.
For other Mutect2-related scripts, e.g. towards panel of normals generation, check out the gatk repository's scripts/mutect2_wdl directory. Our developers update these scripts on a continual basis.
If you are new to somatic calling, be sure to read Article#11127. It gives an overview of what traditional somatic calling entails. For one, somatic calling is NOT simply a difference between two callsets, because germline variant sites are excluded from consideration.
For those switching from GATK3 MuTect2, Blog#10911 will bring you up to speed on the differences.
If you are interested in simply calling differences between two samples, Blog#11315 outlines an off-label two-pass Mutect2 workflow. Off-label means the workflow is not a part of the Best Practices and is therefore unsupported. However, given enough community interest, we may be convinced to flesh out the workflow further. Please do post to the forum to express interest.
Given my years as a biochemist, when handed two samples to compare, my first impulse is to ask what the functional differences are, i.e. the differences in the proteins expressed by the two samples. I am interested in genomic alterations that ripple down the central dogma to transform a cell.
Please note the workflow that follows is NOT a part of the Best Practices. This is an illustrative, unsupported workflow. For the official Somatic Short Variant Calling Best Practices workflow, see Tutorial#11136.
To call every allele that is different between two samples, I have devised a two-pass workflow that takes advantage of Mutect2 features. This workflow uses Mutect2 in tumor-only mode and appropriates the --germline-resource argument to supply a single-sample VCF with allele fractions instead of population allele frequencies. The workflow assumes the two case samples being compared originate from the same parental line and the ploidy and mutation rates make it unlikely that any site accumulates more than one allele change.
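As a rough illustration of the shape of such a two-pass run, the Python sketch below assembles two tumor-only Mutect2 command lines, with the second pass receiving a single-sample VCF through --germline-resource. This is one plausible reading of the description above, not the workflow as actually published: the order of the passes, all file and sample names, and the intermediate resource file are assumptions. The step that turns first-pass calls into a resource carrying allele fractions is not described here, so it appears only as a placeholder file name.

```python
def tumor_only_command(reference, bam, sample_name, output_vcf, germline_resource=None):
    """Assemble a Mutect2 call with no matched normal (tumor-only mode)."""
    cmd = [
        "gatk", "Mutect2",
        "-R", reference,
        "-I", bam,
        "-tumor", sample_name,
        "-O", output_vcf,
    ]
    if germline_resource is not None:
        # Here the resource is a single-sample VCF with allele fractions,
        # standing in for the usual population allele frequencies.
        cmd += ["--germline-resource", germline_resource]
    return cmd


def two_pass_commands(reference, bam_a, name_a, bam_b, name_b):
    """Return the two command lines of the assumed two-pass layout."""
    first = tumor_only_command(reference, bam_a, name_a, "pass1_a.vcf.gz")
    # Placeholder: preparing this file from the first-pass calls is not
    # covered in this sketch.
    resource = "pass1_a.allele_fractions.vcf.gz"
    second = tumor_only_command(
        reference, bam_b, name_b, "pass2_b.vcf.gz", germline_resource=resource
    )
    return first, second


# Hypothetical inputs, for illustration only.
first_pass, second_pass = two_pass_commands(
    "ref.fasta", "sample_a.bam", "sample_a", "sample_b.bam", "sample_b"
)
print(" ".join(first_pass))
print(" ".join(second_pass))
```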
Over the past two weeks and a bit, the GATK 4.0(.0.0) package has been downloaded nearly eight thousand times. That's... not too shabby! Let's see if y'all can take it to 8,000 before we cut the 4.0.1.0 release :)
Yes, I plan to explain the version numbering system in an upcoming blog post.
Looking back at our download records, it outdoes any previous release we've ever done by a factor of nearly four. Interestingly, it comes after a major slump in download numbers over the past six months, a.k.a. since we announced the GATK4 beta and the open-sourcing at the Bio-IT World meeting in May 2017. It looks like a lot of people were holding their breath waiting for the 4.0 release... I hope it was worth the wait.
Two weeks ago, for the official release of GATK version 4.0, we held a live online event that was both a launch party and a comprehensive if condensed overview of everything that's new in GATK4. Over the course of two hours, members of the GATK development team and a great lineup of external guests gave presentations about the new capabilities, discussed their implications in small panels and answered questions from the online audience.
I had the privilege of serving as host -- and unintentional comic relief, between forgetting panelists and bumping into the set furniture -- so I'm probably biased, but I'd say it was the most fun-yet-informative event we've done so far on GATK. Not that we do a lot of events -- and it's mostly just workshops -- but this felt pretty special. We had a great time doing it, and lots of people showed up to watch and ask questions. So we're now considering doing others in a similar vein, though they would each be focused on a specific topic and have more time for answering questions from the online audience. If that sounds like something you'd be interested in, let us know in the comments!
See Events calendar for full list and dates