Today, several members of our extended group are talking at the Bio-IT World meeting in Boston, and the Broad mothership is putting out a handful of announcements related to GATK. Among other communications, there's a press release accompanied by a blog post on the Broad Institute blog, which together unveil a landmark agreement we have reached with several major cloud vendors. I'd like to take a few minutes to discuss what is at stake, both in terms of what we're doing and of how this will affect the wider GATK community.

These announcements all boil down to two things: we built a platform to run the Broad's GATK analysis pipelines in the cloud instead of our local cluster, and we're making that platform accessible to the wider community following a "Software as a Service" (SaaS) model.

Now, before we get any further into discussing what that entails, I want to reassure everyone that we will continue to provide the GATK software as a downloadable executable that can be used anywhere, whether locally on your laptop, on your institution's server farm or computing cluster, or on a cloud platform if you've already got that set up for yourself. The cloud-based service we're announcing is just one more option that we're making available for running GATK. And it should go without saying that we'll continue to provide the same level of support as we have in the past to everyone through the GATK forum; our commitment to that mission is absolute and unwavering.

Alright, so what's happening exactly? Read on to find out!


As discussed recently on this very blog, we've been migrating a substantial portion of the Broad's production genomic analysis pipelines to the cloud. This move was motivated in large part by a need for greater elasticity to deal with the onslaught of massive projects periodically hammering our datacenter (I'm looking at you, @dgmacarthur) as well as a drive toward increased cost-efficiency. But it was also a recognition that the mind-boggling rate at which genomic data is generated (roughly doubling every 8 months!) means we have to adapt how we share and interact with these frankly staggering amounts of data.

To that end, we've been working elbow to elbow with Google engineers for the past eighteen months; in short, they taught us how to cloud and we taught them how to genome. Together we built a system capable of operating our GATK Best Practices production pipelines at scale on the Google Cloud Platform (GCP), using Cromwell and WDL to define and execute the actual workflows. We've also been working closely with a team from the Intel Life Sciences division to solve some of the key challenges involved in scaling up to the next order of dataset magnitude, resulting in a new kind of database that will enable us to perform joint calling on tens, even hundreds of thousands of genomes at a time.

We're already running the Broad's whole genomes on this new platform, and eventually we plan to migrate most if not all of our research pipelines (exomes, RNA, etc.) as well. As a corollary, all of the analysis results produced by the cloud-based pipeline are delivered to researchers through cloud-based workspaces within which they can kick off further analyses. That way, what happens on the cloud stays on the cloud, as far into the process as possible (in part to minimize egress charges).

From my perspective the most immediate upshot of this is that it finally puts us within reach of the holy grail of reproducibility: given the pipeline WDL scripts and resource datasets (both of which we plan to share freely) anyone will be able to reproduce our pipeline processing on their own instance of the Google Cloud with complete independence.

That being said, standing up and administering your own cloud-based service is not exactly trivial, and we know there's a lot of demand for push-button solutions, so we built our system to double as a Software as a Service (SaaS) platform that we can make publicly available for the convenience of the wider community. We plan to make this service accessible to everyone, Broadies and non-Broadies alike, including commercial/for-profit organizations, under the same conditions. Exact pricing has yet to be determined, but it will certainly include the cloud vendor's compute costs, and there will be no separate licensing cost for for-profit use.

We're also opening resale of GATK as a service to commercial SaaS vendors in order to maximize the options available to the community. Illumina has signed on as the first to offer GATK as a service through BaseSpace via Cromwell+WDL, and we're working with all the major cloud computing vendors mentioned above to ensure that the Cromwell+WDL pipelining solution will work as seamlessly and cost-effectively on their platforms as it does today on Google Cloud Platform.

Our ultimate goal here is to reduce the amount of effort that goes into standing up and maintaining implementations of GATK Best Practices worldwide, so that all those resources can be refocused on more interesting work. Personally, I expect that these new developments will contribute to making the GATK Best Practices more readily accessible and affordable to all, and I'm looking forward to being able to announce the availability of the new service later this year!



dvelayutham on 6 Apr 2016


That is really good news for CROs and service providers. Please update once everything is finalized and on road.

jchambers on 6 Apr 2016


Any plans to roll-out GATK to Amazon AWS?

Geraldine_VdAuwera on 6 Apr 2016


@jchambers Yes, we're working on it. I can't give you an ETA however -- I would guess we're still a few months out from having anything we can share.

Geraldine_VdAuwera on 6 Apr 2016


Update on the AWS question: it's coming sooooooooon! Soon enough that we're starting to plan publicity, a webinar and stuff like that.

Hongchao_Lu on 6 Apr 2016


Hi, we want to use AWS Batch to run Cromwell in the AWS cloud ASAP. Could you tell us the current status of the plan? I see a [document](https://github.com/broadinstitute/cromwell/blob/e9f47c923ab7ec0cf6b4c6b2ae45e66d0d88e907/docs/tutorials/AwsBatch101.md); however, it seems to be incomplete and not working. Is there any training plan for this? Thanks.

Ruchi on 6 Apr 2016


Hello @Hongchao_Lu, You're totally correct, Cromwell's AWS Batch support is incomplete. We plan to start picking up work on it next week, and our target is to have it available early August. If you have questions about specific plans, feel free to email me directly. Thanks!



