Today I'm delighted to introduce WDL, pronounced widdle, a new workflow description language that is designed from the ground up as a human-readable and -writable way to express tasks and workflows.

As a lab-grown biologist (so to speak), I think analysis pipelines are amazing. The thought that you can write a script that describes a complex chain of processing and analytical operations, then all you need to do every time you have new data is feed it through the pipe, and out come results -- that's awesome. Back in the day, I learned to run BLAST jobs by copy-pasting gene sequences into the NCBI BLAST web interface. When I found out you could write a script that takes a collection of sequences, submits them directly to the BLAST server, then extracts the results and does some more computation on them… mind blown. And this was in Perl, so there was pain involved, but it was worth it -- for a few weeks afterward I almost had a social life! In grad school! Then it was back to babysitting cell cultures day and night until I could find another excuse to do some "in silico" work, which was the hand-wavy way of saying we were doing some stuff with computers. Those were simpler times.

So it's been kind of a disappointment for the last decade or so that writing pipelines was still so flipping hard. I mean smart, robust pipelines that understand things like parallelism and the dependencies between task inputs and outputs, and can resume intelligently if they get interrupted. Sure, in the GATK world we have Queue, which is very useful in many respects, but it's tailored to specific computing cluster environments like LSF, and writing Scala scripts is really not trivial if you're new to it. I wouldn't call Queue user-friendly by a long shot. Plus, we only really use it internally for development work -- to run our production pipelines, we've actually been using a more robust and powerful system called Zamboni. It's a real workhorse, and has gracefully handled the exponential growth of sequencer output up to this point, but it's more complicated to use -- I have to admit I haven't managed to wrap my head around how it works.

Fortunately I don't have to try anymore: our engineers have developed a new pipelining solution that involves WDL, the Workflow Description Language I mentioned earlier, and an execution engine called Cromwell that can run WDL scripts anywhere, whether locally or on the cloud.

It's portable, super user-friendly by design, and open-source (under BSD) -- we’re eager to share it with the world!


In a nutshell, WDL was designed to make it easy to write scripts that describe analysis tasks and chain those tasks into workflows, with built-in support for advanced features like scatter-gather parallelism. This came at the cost of some additional headaches for the engineering team when they built Cromwell, the execution engine that runs WDL scripts -- but that was a conscious design decision, to push off the complexity onto the engineers rather than leave it in the lap of pipeline authors.
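To give a concrete sense of what that looks like, here's a minimal sketch of a WDL script -- the task and workflow names are hypothetical, made up purely for illustration rather than taken from our pipelines. It defines one task and a workflow that scatters that task across an array of input files, with the per-file results gathered back into an array:

```
# Minimal illustrative sketch -- names are hypothetical, not from our production pipelines.
task count_lines {
  File input_file

  command {
    wc -l < ${input_file}
  }
  output {
    Int line_count = read_int(stdout())
  }
}

workflow count_all {
  Array[File] input_files

  # Scatter-gather: each input file is processed in parallel,
  # and the per-file line counts are gathered into an array.
  scatter (f in input_files) {
    call count_lines { input: input_file = f }
  }
  output {
    Array[Int] counts = count_lines.line_count
  }
}
```

Running it amounts to handing the script (plus a JSON file of inputs) to Cromwell, along the lines of `java -jar cromwell.jar run count_all.wdl count_all_inputs.json`; the website covers the details.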

You can learn more about WDL at https://software.broadinstitute.org/wdl/, a website we put together to provide detailed documentation on how to write WDL scripts and run them on Cromwell. The documentation is complemented by a number of tutorials and real GATK-focused workflows provided as examples, and in the near future we'll also share our GATK Best Practices production pipeline scripts.

Yes, I just said that: we're going to be sharing full-on production scripts! In the past we refused to share those scripts because they weren't portable across systems and they were difficult to read, so we were concerned that sharing them would generate more support burden than we could handle. But now that we have a solution that is both portable and user-friendly, we're not so worried -- and we're very excited about this as an opportunity to improve reproducibility among those who adopt our Best Practices in their work.


Of course this wouldn't be the first time someone came up with a revolutionary way of doing things that ended up not working all that well in the real world. To that I say -- have you ever heard the phrase "eating your own dog food", or "dogfooding" for short? It refers to the practice in software engineering of developers using their own product to run their own critical systems. I'm happy to say that we are now dogfooding Cromwell and WDL in our own production pipelines; a substantial part of the Broad's own analysis work, which involves processing terabytes' worth of whole genome sequencing data, is now done on the cloud using Cromwell to execute pipelines written entirely in WDL.

That's how we know it works at scale, that it's solid, and that it does what we need it to do. That being said, there are a few features that are defined in WDL but not yet fully implemented in Cromwell, as well as a wishlist of convenience functions requested by our own methods development team, which the engineering team will pursue over the coming months. So its capabilities will keep growing as we address our own and the wider community's pipelining needs.

Finally, I want to reiterate that WDL itself is meant to be very accessible to people who are not engineers or programmers; our own methods development team and various scientific collaborators have all commented that it's very easy to write and understand. So please try it out, and let us know what you think in the WDL forum!



Geraldine_VdAuwera on 31 Mar 2016


Minor edit to clarify that [Cromwell](https://github.com/broadinstitute/cromwell) and [WDL](https://github.com/broadinstitute/wdl) are open-sourced under a BSD license.



