Add Plumbing


By plumbing, we mean chaining tasks together to form sophisticated pipelines. How do we do that in WDL?

With effortless grace, that's how.

 

Simple connections

At this point you know how to include multiple tasks in your workflow script. If you were paying attention in the section about variables, you even know how to connect the output of one task to the input of the next. That enables you to build linear or simple branching and merging workflows of any length, with single or multiple inputs and outputs connected between tasks -- all without any code beyond what we've already covered.

Switching and iterating logic

In addition to those basic connection capabilities, you'll sometimes need to switch between alternate pathways and iterate over sets of data, either in series or in parallel. For that, we're going to need some more code, but don't worry... you don't have to write any of it! We're going to take advantage of various aspects of WDL syntax and some handy built-in features that make it easy to add this logic with little effort. Our favorite feature of this type in WDL is scatter-gather parallelism, which is a great way to speed up execution when you need to apply the same command to subsets of data that can be treated as independent units. There are a couple of additional features defined in WDL (loops and conditionals) that are not yet fully supported by the Cromwell execution engine, but this functionality is being actively developed.
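
For reference, here is a minimal sketch of what a conditional block looks like as defined in the WDL spec; the Boolean flag and task name below are hypothetical, and as noted above, engine support may still be incomplete:

  Boolean isPaired

  # Hypothetical example: only call this task when the flag is true
  if (isPaired) {
    call pairedEndStep
  }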

Efficiency through code re-use

Finally, you'll find you often need to do similar things in different contexts. Rather than copying the relevant code into several places -- which produces more code overall to maintain and update -- you can be lazy and re-use the code smartly, either through task aliasing, if your problem involves running the same tool in different ways, or by importing workflows wholesale. (Sorry, the latter hasn't been fully implemented yet, but it's on the development roadmap.)


You can read more about each of these options in the sections that follow.


With all these options for assembling calls to tasks into workflows, we can build some fairly sophisticated data processing and analysis pipelines. Admittedly this can seem a bit overwhelming at first, but don't worry -- we're here to help! Once you're done going through the WDL basics, we'll point you to a series of tutorials that show how to apply each option to some classic use cases that we encounter in genomics. But first, we'll show you how to check that the syntax of a workflow is valid. No judgment on the sanity of the workflow design, though!

Linear Chaining

The simplest way to chain tasks together in a workflow is a linear chain, where we feed the output of one task to the input of the next, like so:

  call stepB { input: in=stepA.out }
  call stepC { input: in=stepB.out }

This is easy to do because WDL allows us to refer to the output of any task (declared appropriately in the task's output block) within the call statement of another task (and indeed, anywhere else in the workflow block), using the syntax task_name.output_variable. So here, we simply specify in the call to stepB that we want it to use stepA.out as the value of its input variable in, and the same rule applies for stepC.

This relies on a principle called hierarchical naming that allows us to identify components by their parentage.
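
For instance, with the alias syntax covered later under Task Aliasing, the reference follows the call's name rather than the task's name. A minimal sketch:

  call stepA as firstSample { input: in=firstInput }
  # The output is referenced through the call's (aliased) name:
  call stepB { input: in=firstSample.out }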


Generic example script

To put this in context, here is what the code for the workflow illustrated above would look like in full:

workflow LinearChain {
  File firstInput
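  # Each subsequent call consumes the previous call's output (task_name.output_variable)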
  call stepA { input: in=firstInput }
  call stepB { input: in=stepA.out }
  call stepC { input: in=stepB.out }
}

task stepA {
  File in
  command { programA I=${in} O=outputA.ext }
  output { File out = "outputA.ext" }
}

task stepB {
  File in
  command { programB I=${in} O=outputB.ext }
  output { File out = "outputB.ext" }
}

task stepC {
  File in
  command { programC I=${in} O=outputC.ext }
  output { File out = "outputC.ext" }
}

Concrete example

Let’s look at a concrete example of linear chaining in a workflow designed to pre-process some DNA sequencing data (MarkDuplicates), perform an analysis on the pre-processed data (HaplotypeCaller), then subset the results of the analysis (SelectVariants).

The workflow involves three tasks:

  • MarkDuplicates takes in a File bamFile and produces a File metrics and a File dedupBam.
  • HaplotypeCaller takes in a File bamFile and produces a File rawVCF.
  • SelectVariants takes in a File VCF and a String type to specify whether to select INDELs or SNPs. It produces a File subsetVCF, containing only variants of the specified type.

Concrete example script

This is what the code for the workflow illustrated above would look like:

workflow LinearChainExample {
  File originalBAM
  call MarkDuplicates { input: bamFile=originalBAM }
  call HaplotypeCaller { input: bamFile=MarkDuplicates.dedupBam }
  call SelectVariants { input: VCF=HaplotypeCaller.rawVCF, type="INDEL" }
}

task MarkDuplicates {
  File bamFile
  command {
    java -jar picard.jar MarkDuplicates \
        I=${bamFile} O=dedupped.bam M=dedupped.metrics
  }
  output {
    File dedupBam = "dedupped.bam"
    File metrics = "dedupped.metrics"
  }
}

task HaplotypeCaller {
  File bamFile
  command {
    java -jar GenomeAnalysisTK.jar \
      -T HaplotypeCaller \
      -R reference.fasta \
      -I ${bamFile} \
      -o rawVariants.vcf
  }
  output {
    File rawVCF = "rawVariants.vcf"
  }
}

task SelectVariants {
  File VCF
  String type
  command {
    java -jar GenomeAnalysisTK.jar \
      -T SelectVariants \
      -R reference.fasta \
      -V ${VCF} \
      -selectType ${type} \
      -o subset.${type}.vcf
  }
  output {
    File subsetVCF = "subset.${type}.vcf"
  }
}

Note that here for simplicity we omitted the handling of index files, which has to be done explicitly in WDL. For examples of how to do that, see the Tutorials and Real Workflows.

Multi-input / Multi-output

The ability to connect outputs to inputs described in Linear Chaining, which relies on hierarchical naming, allows you to chain together tools that produce multiple outputs and accept multiple inputs, and specify exactly which output feeds into which input.

Because the outputs of stepB are named differently, we can specify exactly which output feeds into which of the next step's input fields.

call stepC { input: in1=stepB.out1, in2=stepB.out2 }

Generic example script

In context, this sort of plumbing would look as follows:

workflow MultiOutMultiIn {

  File firstInput

  call stepA { input: in=firstInput }
  call stepB { input: in=stepA.out }
  call stepC { input: in1=stepB.out1, in2=stepB.out2 }
}

task stepA {

  File in

  command { programA I=${in} O=outputA.ext }
  output { File out = "outputA.ext" }
}

task stepB {

  File in

  command { programB I=${in} O1=outputB1.ext O2=outputB2.ext }
  output {
    File out1 = "outputB1.ext"
    File out2 = "outputB2.ext" }
}

task stepC {

  File in1
  File in2

  command { programB I1=${in1} I2=${in2} O=outputC.ext }
  output { File out = "outputC.ext" }
}

Concrete example

This workflow uses Picard’s SplitVcfs tool to separate the original VCF into two files, one containing only SNPs and the other containing only indels. The next step takes those two separated VCFs and recombines them into one.

For this toy example, we have defined two tasks:

  • splitVcfs takes in a File VCF and outputs a File snpOut and a File indelOut.
  • CombineVariants takes in a File VCF1 and a File VCF2, producing a File VCF.

Concrete example script

The workflow described above, in its entirety, would look like this:

workflow MultiOutMultiInExample {

  File inputVCF

  call splitVcfs { input: VCF=inputVCF }
  call CombineVariants { input: VCF1=splitVcfs.indelOut, VCF2=splitVcfs.snpOut }
}

task splitVcfs {

  File VCF

  command {
    java -jar picard.jar SplitVcfs \
        I=${VCF} \
        SNP_OUTPUT=snp.vcf \
        INDEL_OUTPUT=indel.vcf
  }
  output {
    File snpOut = "snp.vcf"
    File indelOut = "indel.vcf"
  }
}

task CombineVariants {

  File VCF1
  File VCF2

  command {
    java -jar GenomeAnalysisTK.jar \
        -T CombineVariants \
        -R reference.fasta \
        -V ${VCF1} \
        -V ${VCF2} \
        --genotypemergeoption UNSORTED \
        -o combined.vcf
  }
  output {
    File VCF = "combined.vcf"
  }
}

Note that here for simplicity we omitted the handling of index files, which has to be done explicitly in WDL. For examples of how to do that, see the Tutorials and Real Workflows.


Scatter-Gather Parallelism

Parallelism is a way to make a program finish faster by performing several operations in parallel rather than sequentially (i.e. waiting for each operation to finish before starting the next one). For a more detailed introduction to parallelism, you can read about it in depth here.

To do this, we use WDL's scatter construct, which produces parallelizable jobs that run the same task on each input in an array, and outputs the results as an array as well.

Array[File] inputFiles

  scatter (oneFile in inputFiles) {
    call stepA { input: in=oneFile }
  }
  call stepB { input: files=stepA.out }

The magic here is that the array of outputs is produced and passed along to the next task without you ever making an explicit declaration about it being an array. Even though the output of stepA looks like a single file based on its declaration, just referring to stepA.out in any other call statement is sufficient for WDL to know that you mean the array grouping the outputs of all the parallelized stepA jobs.

In other words, the scatter part of the process is explicit while the gather part is implicit.
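
If you want to make the gathered array explicit, you can also assign it to a typed declaration in the workflow body before passing it along. Here's a minimal sketch (the name gatheredFiles is just for illustration):

  # Outside the scatter block, stepA.out is an Array[File]
  Array[File] gatheredFiles = stepA.out
  call stepB { input: files=gatheredFiles }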


Generic example script

To put this in context, here is what the code for the workflow illustrated above would look like in full:

workflow ScatterGather {

  Array[File] inputFiles

  scatter (oneFile in inputFiles) {
    call stepA { input: in=oneFile }
  }
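  # Implicit gather: outside the scatter block, stepA.out is an Array[File]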
  call stepB { input: files=stepA.out }
}

task stepA {

  File in

  command { programA I=${in} O=outputA.ext }
  output { File out = "outputA.ext" }
}

task stepB {

  Array[File] files

  command { programB I=${files} O=outputB.ext }
  output { File out = "outputB.ext" }
}

Concrete example

Let’s look at a concrete example of scatter-gather parallelism in a workflow designed to call variants individually per-sample (HaplotypeCaller), then perform joint genotyping on all the per-sample GVCFs together (GenotypeGVCFs).

The workflow involves two tasks:

  • HaplotypeCallerERC takes in a File bamFile and produces a File GVCF.
  • GenotypeGVCF takes in an Array[File] GVCFs and produces a File rawVCF.

Concrete example script

This is what the code for the workflow illustrated above would look like:

workflow ScatterGatherExample {

  Array[File] sampleBAMs

  scatter (sample in sampleBAMs) {
    call HaplotypeCallerERC { input: bamFile=sample }
  }
  call GenotypeGVCF { input: GVCFs=HaplotypeCallerERC.GVCF }
}

task HaplotypeCallerERC {

  File bamFile

  command {
    java -jar GenomeAnalysisTK.jar \
        -T HaplotypeCaller \
        -ERC GVCF \
        -R reference.fasta \
        -I ${bamFile} \
        -o rawLikelihoods.gvcf
  }
  output {
    File GVCF = "rawLikelihoods.gvcf"
  }
}

task GenotypeGVCF {

  Array[File] GVCFs

  command {
    java -jar GenomeAnalysisTK.jar \
        -T GenotypeGVCFs \
        -R reference.fasta \
        -V ${GVCFs} \
        -o rawVariants.vcf
  }
  output {
    File rawVCF = "rawVariants.vcf"
  }
}

Note that here for simplicity we omitted the handling of index files, which has to be done explicitly in WDL. For examples of how to do that, see the Tutorials and Real Workflows.

Additionally, please note that we did not expressly handle delimiters for the Array input in this example. To learn how to specify the ${GVCFs} input properly, see this tutorial.
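
To give you a taste of what that involves, here is a minimal sketch of the GenotypeGVCF command using WDL's sep placeholder option, which is one common way to expand an Array input; see the tutorial for the full treatment:

  command {
    # sep=' -V ' joins the array elements, repeating the -V flag before each file
    java -jar GenomeAnalysisTK.jar \
        -T GenotypeGVCFs \
        -R reference.fasta \
        -V ${sep=' -V ' GVCFs} \
        -o rawVariants.vcf
  }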

Task Aliasing

When you need to call a task more than once in a workflow, you can use task aliasing. It would be tedious to copy-paste a task's definition and change the name each time you needed to use it again in the workflow. That approach, termed copy-and-paste programming, is simple enough up front but difficult to maintain in the long run: imagine you found a typo in one of your tasks -- you'd need to fix it in every single pasted copy! With WDL's built-in task aliasing feature, you instead call the same task code under different aliases. Then, following the principle of hierarchical naming, we access the output of an aliased call through the alias rather than the original task name.

To use an alias, we use the syntax call taskName as aliasName.

call stepA as firstSample { input: in=firstInput }
call stepA as secondSample { input: in=secondInput }
call stepB { input: in=firstSample.out }
call stepC { input: in=secondSample.out }

Generic example script

The workflow and its tasks altogether would look as follows:

workflow taskAlias {

  File firstInput
  File secondInput

  call stepA as firstSample { input: in=firstInput }
  call stepA as secondSample { input: in=secondInput }
  call stepB { input: in=firstSample.out }
  call stepC { input: in=secondSample.out }
}

task stepA {

  File in

  command { programA I=${in} O=outputA.ext }
  output { File out = "outputA.ext" }
}

task stepB {

  File in

  command { programB I=${in} O=outputB.ext }
  output { File out = "outputB.ext" }
}

task stepC {

  File in

  command { programC I=${in} O=outputC.ext }
  output { File out = "outputC.ext" }
}

Concrete example

Let's take a look at the task aliasing concept in a real-world example. Using GATK, this workflow separates SNPs and indels into distinct VCF files with the same task, select, called under two distinct aliases, selectSNPs and selectIndels. The distinct outputs of these calls are then hard-filtered by separate tasks designed specifically for them, hardFilterSNP and hardFilterIndel, respectively.

For this toy example, we have defined three tasks:

  • select takes in a String type, specifying "SNP" or "INDEL", and a File rawVCF, outputting a File rawSubset that contains only variants of the specified type.
  • hardFilterSNP takes in a File rawSNPs and outputs a File filteredSNPs.
  • hardFilterIndel takes in a File rawIndels and outputs a File filteredIndels.

Concrete example script

The above workflow description would look like this in its entirety:

workflow taskAliasExample {

  File rawVCFSample

  call select as selectSNPs { input: type="SNP", rawVCF=rawVCFSample }
  call select as selectIndels { input: type="INDEL", rawVCF=rawVCFSample }
  call hardFilterSNP { input: rawSNPs=selectSNPs.rawSubset }
  call hardFilterIndel { input: rawIndels=selectIndels.rawSubset }
}

task select {

  String type
  File rawVCF

  command {
    java -jar GenomeAnalysisTK.jar \
      -T SelectVariants \
      -R reference.fasta \
      -V ${rawVCF} \
      -selectType ${type} \
      -o raw.${type}.vcf
  }
  output {
    File rawSubset = "raw.${type}.vcf"
  }
}

task hardFilterSNP {

  File rawSNPs

  command {
    java -jar GenomeAnalysisTK.jar \
        -T VariantFiltration \
        -R reference.fasta \
        -V ${rawSNPs} \
        --filterExpression "FS > 60.0" \
        --filterName "snp_filter" \
        -o filtered.snps.vcf
  }
  output {
    File filteredSNPs = "filtered.snps.vcf"
  }
}

task hardFilterIndel {

  File rawIndels

  command {
    java -jar GenomeAnalysisTK.jar \
        -T VariantFiltration \
        -R reference.fasta \
        -V ${rawIndels} \
        --filterExpression "FS > 200.0" \
        --filterName "indel_filter" \
        -o filtered.indels.vcf
  }
  output {
    File filteredIndels = "filtered.indels.vcf"
  }
}

Note that here for simplicity we omitted the handling of index files, which has to be done explicitly in WDL. For examples of how to do that, see the Tutorials and Real Workflows.

