It's a beautiful early autumn day in New England, with small patches of vibrant reds and yellows in the foliage just hinting at the fiery displays to come. Perfect weather for me to de-lurk and bring you some news! (I promise it's not GATK5)
The long and short of it (but mostly the short) is that we've started collaborating with the DRAGEN team at Illumina, led by Rami Mehio, to improve GATK tools and pipelines. There's a press release if you want the official announcement, or you can read on to get the long version from the GATK team's perspective.
If you're not familiar with DRAGEN, the name stands for Dynamic Read Analysis for GENomics and refers to a secondary analysis platform originally created by a company called Edico Genome, which was acquired by Illumina last year. The DRAGEN team became widely known for making genomic data processing insanely fast on special hardware, but they're not just a speed shop. They have top-notch computational biology expertise: when they reimplemented GATK tools like HaplotypeCaller in DRAGEN, they made some clever tweaks that improved the scientific accuracy of the results. They've done this for other tools as well, and they've also developed their own novel algorithms for other use cases.
That alone is already a big motivation for us to team up with them: they have great ideas for improving our tools and pipelines, and they're willing to share them. Works for us! Then there's the bigger picture of what this means for the kind of research we are working to enable. Both of our teams feel pretty strongly that as the amount of genomic data generation snowballs, particularly in the biomedical field, it's really important to ensure that the results of different studies can be cross-analyzed. For that to be possible, we need to standardize secondary analysis as much as possible to minimize batch effects. We believe that by working together to consolidate our methods and pipeline development efforts, we can remove a major source of heterogeneity in the ecosystem.
So what does that mean in practice?
Rest assured GATK itself is still going to be GATK, developed by our team at the Broad and released under the same BSD-3 open-source license you know and love. Any improvements that the DRAGEN team contributes to GATK tools will be integrated into the GATK codebase under the same BSD-3 license.
Beyond code improvements to GATK itself, there will also be some changes to the composition of the Best Practices pipelines. For example, we're going to replace BWA with the DRAGEN aligner, which is quite a bit faster, in our DNA pre-processing pipelines (full details and benchmarking results to follow). To reflect the collaborative nature of the work, any pipelines we co-develop with the DRAGEN team will be named DRAGEN-GATK Best Practices.
All the software involved in the DRAGEN-GATK pipelines will be fully open source and available on GitHub, including a new open source version of the DRAGEN aligner, and we'll continue to publish WDL workflows for every pipeline on GitHub and in Terra workspaces. Importantly, it will all still be runnable on normal hardware, whether you're doing your work on a local server, on-premises HPC or in the cloud. We'll also continue to provide free support for all GATK tools and pipelines, and as part of that we're going to work with the DRAGEN team to make sure we can provide the same level of high-quality support for the tools that they provide.
The DRAGEN team also plans to produce a hardware-accelerated version of any DRAGEN-GATK Best Practices pipeline that we co-develop, which Illumina will offer on the commercial DRAGEN system. We won't touch that work at all (it's not our jam), but we will run comparative evaluations to validate that the hardware-accelerated version of any given pipeline produces results that are functionally equivalent to the "universal" open source software version. To be clear, it won't be just a rubber-stamp approval; we're highly motivated to make sure that the pipeline implementations are functionally equivalent because our colleagues in the Broad’s Genomics Platform are planning to switch some of the Broad's production pipelines to the DRAGEN hardware version for projects where speed is a critical factor.
On that note, what I personally find the most exciting about this partnership is that going forward, everyone in the research community will be able to take advantage of the best ideas from both our teams regardless of whether they want the "regular" software or a hardware-accelerated version. You could even switch between the two within the course of a project and still be able to cross-analyze the outputs. Over the years, I've had to tell a lot of people "sorry, you're going to have to reprocess everything with the same pipeline" so this feels like a huge step in the right direction.
Okay, this sounds great -- so when will the improved tools and pipelines be available?
We're already actively working on porting over improvements from the DRAGEN team, so if you follow the GATK repository on GitHub you should start seeing relevant commits and pull requests any day now. Barring any unforeseen complications, the tool improvements should roll out into regular GATK releases over the next couple of months, and we expect to release the first full DRAGEN-GATK pipeline (for germline short variants) in the first quarter of 2020. We'll post updates here on the blog about how it's going and what you can expect to see as the code rolls in and the release calendar firms up.
In the meantime, don't hesitate to reach out to us if you have any questions that aren't addressed here or in the press release. Note that if you're going to be at the ASHG meeting in Houston later this month, Angel Pizarro and I will be talking about this collaboration at the Illumina Informatics Summit that precedes the conference on Tuesday Oct 15, and I will be available at the Broad Genomics booth in the exhibit hall at ASHG itself on Wednesday Oct 16 if you'd like to discuss this in person. I hope to see a lot of you there!
See Events calendar for full list and dates