Today, several members of our extended group are talking at the BioIT World meeting in Boston, and the Broad mothership is putting out a handful of announcements related to GATK. Among them is a press release, accompanied by a post on the Broad Institute blog, unveiling a landmark agreement we have reached with several major cloud vendors. I'd like to take a few minutes to discuss what is at stake, both in terms of what we're doing and of how this will affect the wider GATK community.

These announcements all boil down to two things: we built a platform to run the Broad's GATK analysis pipelines in the cloud instead of our local cluster, and we're making that platform accessible to the wider community following a "Software as a Service" (SaaS) model.

Now, before we get any further into discussing what that entails, I want to reassure everyone that we will continue to provide the GATK software as a downloadable executable that can be used anywhere, whether locally on your laptop, on your institution's server farm or computing cluster, or on a cloud platform if you've already got that set up for yourself. The cloud-based service we're announcing is just one more option that we're making available for running GATK. And it should go without saying that we'll continue to provide the same level of support as we have in the past to everyone through the GATK forum; our commitment to that mission is absolute and unwavering.

Alright, so what's happening exactly? Read on to find out!

As discussed recently on this very blog, we've been migrating a substantial portion of the Broad's production genomic analysis pipelines to the cloud. This move was motivated in large part by a need for greater elasticity to deal with the onslaught of massive projects periodically hammering our datacenter (I'm looking at you, @dgmacarthur) as well as a drive toward increased cost-efficiency. But it was also a recognition that the mind-boggling rate at which genomic data is generated (roughly doubling every 8 months!) means we have to adapt how we share and interact with these frankly staggering amounts of data.
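To put that doubling rate in perspective, here is a back-of-the-envelope sketch. It assumes a clean exponential with the 8-month doubling time quoted above, which is a simplification. The function name `growth_factor` is made up for illustration.

```python
def growth_factor(months: float, doubling_months: float = 8.0) -> float:
    """Multiplicative growth after `months`, assuming steady exponential
    doubling every `doubling_months` (8 is the rough figure quoted above)."""
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    # Print how much larger the data volume would be after a few horizons.
    for years in (1, 2, 5):
        factor = growth_factor(12 * years)
        print(f"after {years} year(s): about {factor:.1f}x the data")
```

Under that assumption, data volume grows a little under threefold per year and well over a hundredfold in five years. That is the kind of curve a fixed-size datacenter cannot keep up with.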

To that end, we've been working elbow to elbow with Google engineers for the past eighteen months; in short, they taught us how to cloud and we taught them how to genome. Together we built a system capable of operating our GATK Best Practices production pipelines at scale on the Google Cloud Platform (GCP), using Cromwell and WDL to define and execute the actual workflows. We've also been working closely with a team from the Intel Life Sciences division to solve some of the key challenges involved in scaling up to the next order of magnitude in dataset size, resulting in a new kind of database that will enable us to perform joint calling on tens, even hundreds of thousands of genomes at a time.

We're already running the Broad's whole genomes on this new platform, and eventually we plan to migrate most if not all of our research pipelines (exomes, RNA, etc.) as well. As a corollary, all of the analysis results produced by the cloud-based pipeline are delivered to researchers through cloud-based workspaces within which they can kick off further analyses. That way, what happens on the cloud stays on the cloud, as far into the process as possible (in part to minimize egress charges).

From my perspective, the most immediate upshot of this is that it finally puts us within reach of the holy grail of reproducibility: given the pipeline WDL scripts and resource datasets (both of which we plan to share freely), anyone will be able to reproduce our pipeline processing on their own instance of the Google Cloud with complete independence.

That being said, standing up and administering your own cloud-based service is not exactly trivial, and we know there's a lot of demand for push-button solutions, so we built our system to double as a Software as a Service (SaaS) platform that we can make publicly available for the convenience of the wider community. We plan to make this service accessible to everyone, Broadies and non-Broadies alike, including commercial/for-profit organizations, under the same conditions. Exact pricing has yet to be determined, but it will certainly include the cloud vendor's compute costs, and there will be no separate licensing cost for for-profit use.

We're also opening resale of GATK as a service to commercial SaaS vendors in order to maximize the options available to the community. Illumina has signed on as the first to offer GATK as a service through BaseSpace via Cromwell+WDL, and we're working with all the major cloud computing vendors mentioned above to ensure that the Cromwell+WDL pipelining solution will work as seamlessly and cost-effectively on their platforms as it does today on Google Cloud Platform.

Our ultimate goal here is to reduce the amount of effort that goes into standing up and maintaining implementations of GATK Best Practices worldwide, so that all those resources can be refocused on more interesting work. Personally, I expect that these new developments will contribute to making the GATK Best Practices more readily accessible and affordable to all, and I'm looking forward to being able to announce the availability of the new service later this year!

dvelayutham on 6 Apr 2016

That is really good news for CROs and service providers. Please update once everything is finalized and on road.

jchambers on 6 Apr 2016

Any plans to roll-out GATK to Amazon AWS?

Geraldine_VdAuwera on 6 Apr 2016

@jchambers Yes, we're working on it. I can't give you an ETA however -- I would guess we're still a few months out from having anything we can share.
