By Eric Banks, Director, Data Sciences Platform at the Broad Institute

Last week I wrote about our efforts to develop a data processing pipeline specification that would eliminate batch effects, in collaboration with other major sequencing centers. Today I want to share our implementation of the resulting "Functional Equivalence" pipeline spec, and highlight the cost-centric optimizations we've made that make it incredibly cheap to run on Google Cloud.

For a little background, we started transitioning our analysis pipelines to Google Cloud Platform in 2016. Throughout that process we focused most of our engineering efforts on bringing down compute cost, which is the most important factor for our production operation. It's been a long road, but all that hard work really paid off: we managed to get the cost of our main Best Practices analysis pipeline down from about $45 to $5 per genome! As you can imagine that kind of cost reduction has a huge impact on our ability to do more great science per research dollar -- and now, we’re making this same pipeline available to everyone.


The Best Practices pipeline I'm talking about performs the most common type of analysis done on 30x WGS data: germline short variant discovery (SNPs and indels). It takes the data from unmapped reads all the way to an analysis-ready BAM or CRAM (i.e. the part covered by the Functional Equivalence spec), then on to either a single-sample VCF or an intermediate GVCF, and collects 15 sets of quality control metrics at various points along the way -- all for about $5 in compute cost on Google Cloud. As far as I know this is the most comprehensive pipeline available for whole-genome data processing and germline short variant discovery, without skimping on QC or important cleanup steps like base recalibration.
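To give a sense of what's inside, the core of the data processing boils down to stages like the ones sketched below. This is only a simplified outline using standard BWA and GATK4 command lines with placeholder file names; the real workflow is written in WDL, shards the work across many machines, and includes the additional QC and metrics steps.

    # Map reads and sort (reference and read file names are placeholders)
    bwa mem -t 16 ref.fasta reads_1.fastq.gz reads_2.fastq.gz | samtools sort -o sorted.bam -

    # Flag duplicate reads
    gatk MarkDuplicates -I sorted.bam -O dedup.bam -M duplicate_metrics.txt

    # Base quality score recalibration (BQSR)
    gatk BaseRecalibrator -R ref.fasta -I dedup.bam --known-sites dbsnp.vcf.gz -O recal.table
    gatk ApplyBQSR -R ref.fasta -I dedup.bam --bqsr-recal-file recal.table -O analysis_ready.bam

    # Call variants per sample, producing an intermediate GVCF
    gatk HaplotypeCaller -R ref.fasta -I analysis_ready.bam -O sample.g.vcf.gz -ERC GVCF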

Let me give you a real-world example of what this means for an actual project. In February 2017, our production team processed a cohort of about 900 30x WGS samples through our Best Practices germline variant discovery pipeline; the compute costs totaled $12,150, or $13.50 per sample. If we had run the version of this pipeline we had just one year prior (before the main optimizations were made), it would have cost $45 per sample, a whopping $40,500 in total! We've since made further improvements, and if we were to run this same pipeline today, the cohort would cost only $4,500 to analyze.

                                 2016       2017      Today
# of Whole Genomes Analyzed       900        900        900
Total Compute Cost            $40,500    $12,150     $4,500
Cost per Genome Analyzed          $45     $13.50         $5

For the curious, the most dramatic reductions came from using different machine types tailored to each of the various tasks (rather than piping data between tasks on a single large machine), leveraging GCP's preemptible VMs, and most recently incorporating NIO to stream data directly from Google Cloud Storage, minimizing the amount of data localization involved. You can read more about these approaches on Google's blog. At this point the single biggest contributor to the pipeline's cost is BWA (the genome mapper), a problem its author Heng Li is actively working to address with a much faster (but equivalently accurate) mapper. Once Heng's new mapper is available, we anticipate the cost per genome analyzed will drop below $3.
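To make the machine-type and preemptible optimizations concrete, here is roughly what the resource requests look like at the task level in WDL when running on Cromwell's Google backend. This is an illustrative fragment rather than a task from the published workflow, and the specific numbers are placeholders:

    task ExampleStep {
      File input_bam

      command {
        echo "the actual tool invocation would run here, reading ${input_bam}"
      }

      runtime {
        docker: "ubuntu:16.04"
        # Request only what this particular step needs, so each task
        # runs on its own appropriately sized (and priced) machine type.
        cpu: 2
        memory: "7 GB"
        disks: "local-disk 200 HDD"
        # Allow up to 3 attempts on cheap preemptible VMs before falling
        # back to a regular, full-price machine.
        preemptible: 3
      }

      output {
        String status = "done"
      }
    }

Because each task declares its own runtime block, a memory-hungry step can get a bigger machine while lightweight QC steps run on tiny ones, and any work lost when a VM is preempted is simply retried.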

On top of the low cost of operating the pipeline, the other huge bonus we get from running this pipeline on the cloud is that we can get any number of samples done in the time it takes to do just one, due to the staggeringly elastic scalability of the cloud environment. Even though it takes a single genome 30 hours to run through the pipeline (and we're still working on speeding that up), we're able to process genomes at a rate of one every 3.6 minutes, and we've been averaging about 500 genomes completed per day.

We're making the workflow script for this pipeline available on GitHub under an open-source license so anyone can use it, and we're also providing it as a preconfigured pipeline in FireCloud, the pipelining service we run on Google Cloud. Anyone can access FireCloud for free; you only pay Google for the compute and storage costs you incur when running the pipelines. So to be clear, when you run this pipeline on your data in FireCloud, all $5 of compute costs go directly to the cloud provider; we don't make any money off of it. And there are no licensing fees involved at any point!

As a cherry on top, our friends at Google Cloud Platform are sponsoring free credits to help first-time users get started with FireCloud: the first 1,000 applicants can get $250 worth of credits to cover compute and storage costs. You can learn more on the FireCloud website if you're interested.

Of course, we understand that not everyone is on Google Cloud, so we are actively collaborating with other cloud vendors and technology partners to expand the range of options for taking advantage of our optimized pipelines. For example, the Chinese cloud giant Alibaba Cloud is developing a backend for Cromwell, the execution engine we use to run our pipelines. And it's not all cloud-centric either; we are also collaborating with our long-time partners at Intel to ensure our pipelines can be run optimally on on-premises infrastructure without compromising on quality.

In conclusion, this pipeline is the result of two years' worth of hard work by a lot of people, both on our team and on the teams of the institutions and companies we collaborate with. We're all really excited to finally share it with the world, and we hope it will make it easier for everyone in the community to get more mileage out of their research dollars, just like we do.



Mon 12 Feb 2018

jaideepjoshi on 12 Feb 2018


Great blog, great info. I ran the pipeline in FireCloud successfully using the NA12878 (small) input. If I want to run the same pipeline on my own infrastructure in-house, I am assuming I could export the WDL from FireCloud and modify it for my environment; however, is there a way I can get the same sample input data, including the (small) BAM, that the pipeline uses when run in FireCloud? Thanks again.

ebanks on 12 Feb 2018


Yes, @jaideepjoshi, you can export everything to your local environment. I'd highly recommend using Cromwell as your execution manager, since then you can just use the WDL without having to rewrite it from scratch. Cromwell (https://github.com/broadinstitute/cromwell) supports various local backends. The input data (both the example dataset and the resources used in the pipeline) are all available in public Google Cloud buckets, so you can just download them (ideally using Google's 'gsutil' tool). Good luck!
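In practice that boils down to something like the following commands (a rough sketch; the bucket path, workflow and input file names below are placeholders, so substitute the actual paths listed in the FireCloud workspace, and grab the Cromwell jar from its GitHub releases page):

    # Copy the example inputs and reference resources to local storage.
    # gs://example-bucket/... is a placeholder for the real bucket paths.
    gsutil -m cp -r gs://example-bucket/pipeline-inputs/ ./inputs/

    # Run the workflow locally with Cromwell's default local backend.
    java -jar cromwell.jar run germline_pipeline.wdl --inputs local_inputs.json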

Geraldine_VdAuwera on 12 Feb 2018


Also, the WDL scripts live in a GitHub repo, so you can subscribe to that if you want to pull updates directly from there. I think we link to the GitHub location in the FC workspace; if not yet, we will start doing so. (Tagging @bshifaw, who manages this content.) Note also that the version in FC is optimized for Google, so if you want to run locally some things may not work (this is a temporary limitation), but we have a universal version that can run locally out of the box. And Intel makes a version that is optimized for running on local infrastructure. If you tell us more about what infrastructure you're using, we may be able to give you more specific advice.

jaideepjoshi on 12 Feb 2018


Thanks ebanks/Geraldine_VdAuwera. I got the pipeline to work on a single CentOS server, using the downloaded input files, Cromwell, and the docker images specified in the WDL. There were quite a few modifications I had to make to the WDLs to get them to run, for example changing String to File for the (local) input files. Also, even though the pipeline is running locally, there is something in the WDL that makes it necessary to do hash lookups of the docker images, which means I cannot run behind a proxy. The next step is to run this on a Spark cluster. The question is: can I run the pipeline using Cromwell and WDL on a Spark cluster? I DO NOT want to run the Spark tools; I simply want to run the entire pipeline ("spark-submit cromwell-*.jar run *.wdl --inputs *.json") as a Spark job. Is that possible? What would I have to change in the cromwell-*.jar file to make it happen?

jaideepjoshi on 12 Feb 2018


I think I can't just run spark-submit cromwell-*.jar ...

Sheila on 12 Feb 2018


@jaideepjoshi Hi, I am asking someone from the team for some help and will get back to you asap. -Sheila

Sheila on 12 Feb 2018


@jaideepjoshi Hi again, From the developer: "You can currently run individual tasks within a WDL on a Spark cluster (by running the Spark-based GATK tools), but it's not possible to run an entire WDL on a Spark cluster unless cromwell were to implement a Spark-based backend." -Sheila
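For reference, launching one of the Spark-enabled GATK4 tools on an existing Spark cluster looks roughly like the following (a sketch only; the choice of tool, the HDFS paths, and the Spark master URL are placeholders):

    # Submit a single Spark-enabled GATK4 tool to an existing cluster.
    # Inputs generally need to be accessible to the cluster, e.g. on HDFS.
    gatk MarkDuplicatesSpark \
        -I hdfs://namenode/data/sorted.bam \
        -O hdfs://namenode/data/dedup.bam \
        -- \
        --spark-runner SPARK \
        --spark-master spark://sparkmaster:7077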



