[2] (howto) Write a simple multi-step workflow
Tutorials | Created 2016-03-14 | Last updated 2017-08-31


Comments (4)

Requirements

This walkthrough assumes a (very) basic understanding of GATK’s tools and that you have read the Getting Started guide. You should have installed the necessary tools, and gone through the previous tutorial, as we will be starting off using its script. Lastly, you will need to download the zip bundle containing this tutorial's data. We will use a toy dataset: NA12878, subset to chromosome 20. The files in the bundle are described in the enclosed README.


Introduction

In this tutorial, you will learn how to pass inputs to a task call, write a branched multi-step workflow, and apply task aliasing. You’ll also learn a neat trick for reducing the number of times you need to input the same data to a group of tasks. We will expand the script you have written in the previous tutorial by adding a step called select. This task separates SNP and indel variants called by haplotypeCaller into separate files. This is something we occasionally need to do to process SNPs and indels differently, for example when applying hard filters to a variant callset. Let’s take a look at our workflow diagram:

It is always useful to begin your pipeline writing by sketching out a diagram like the one above. As you will see in the sections to come, we will frequently reference this diagram. It acts as an outline for both your workflow and each individual task.


Write your script

Copy over the helloHaplotypeCaller script you created previously into a new text file, which we will call simpleVariantSelection.wdl.


Workflow

Before we get down to the nitty-gritty of writing the new task, let’s first take a look at how our workflow is going to fit the tasks together. In the diagram above you saw how haplotypeCaller‘s output fed into the next step’s input; we are going to tell Cromwell exactly how to do that. We already have a workflow that simply calls the haplotypeCaller step, so now let’s call each of the remaining steps:

workflow SimpleVariantSelection {
  call haplotypeCaller { input: }
  call select { input: }
  call select { input: }
}

Our workflow makes a call to each task, following that flow diagram from earlier. However, those two identical calls to select will cause trouble later. Cromwell doesn’t allow us to call the same task twice due to the fact that the second call would overwrite the data generated from the previous one. Luckily, we can use something called task aliasing to give the task a name that’s specific to that exact call to it, like so:

workflow SimpleVariantSelection {
  call haplotypeCaller { input: }
  call select as selectSNPs { input: }
  call select as selectIndels { input: }
}

It’s similar to declaring a variable name for your inputs--you are just giving that task call a variable name that you can reference it by. When we want to do something with the output, we will call selectSNPs.rawSubset, rather than select.rawSubset--but that’s a topic for another tutorial.

Let’s try adding in the inputs next. The task haplotypeCaller is the very start of our workflow, and it requires an input bam. This file will be passed in directly to the task, so there is no need to specify it within the workflow here. For the remaining tasks, we must specify the inputs that will be generated from previous steps in the workflow. To specify an input, follow the format below

input: inputname=taskname.outputname

Therefore, to tell our select step to take the rawVCF output from haplotypeCaller, we write input: rawVCF=haplotypeCaller.rawVCF. Simply go through each task call, and follow our flow diagram to properly assign inputs.

workflow SimpleVariantSelection {
  call haplotypeCaller { input: }
  call select as selectSNPs { 
    input: 
      type=, 
      rawVCF=haplotypeCaller.rawVCF
  }
  call select as selectIndels {
    input: 
      type=, 
      rawVCF=haplotypeCaller.rawVCF
  }
}

The task, select, has an input called type, which does not accept an input from any earlier step. We can specify the type as INDEL or SNP simply by passing in a string.

workflow SimpleVariantSelection {
  call haplotypeCaller { input: }
  call select as selectSNPs { 
    input: 
      type="SNP", 
      rawVCF=haplotypeCaller.rawVCF
  }
  call select as selectIndels {
    input: 
      type="INDEL", 
      rawVCF=haplotypeCaller.rawVCF
  }
}

Now, all the inputs have been assigned, and the workflow should be complete, right? Not quite--we have a few variables that each task is going to need. Originally, we declared these in the haplotypeCaller task definition, but now we will set up pass-through variables. We aren’t changing anything in haplotypeCaller, but we will declare additional variables in the workflow and hand these variables as an input to each command, as shown below. This way, when we generate our inputs file, we only need to tell Cromwell where the tool, reference files, and sample name are a single time. If you’ve ever met a toddler, you’ll know how tedious it can be to answer the same question over and over again.

workflow SimpleVariantSelection {
  File gatk
  File refFasta
  File refIndex
  File refDict
  String name

  call haplotypeCaller {
    input: 
      sampleName=name, 
      RefFasta=refFasta, 
      GATK=gatk, 
      RefIndex=refIndex, 
      RefDict=refDict
  }
  call select as selectSNPs {
    input: 
      sampleName=name, 
      RefFasta=refFasta, 
      GATK=gatk, 
      RefIndex=refIndex, 
      RefDict=refDict, 
      type="SNP",
      rawVCF=haplotypeCaller.rawVCF
  }
  call select as selectIndels {
    input: 
      sampleName=name, 
      RefFasta=refFasta, 
      GATK=gatk, 
      RefIndex=refIndex, 
      RefDict=refDict, 
      type="INDEL", 
      rawVCF=haplotypeCaller.rawVCF
  }
}

That’s it! Your workflow is complete.


Tasks

haplotypeCaller

Please reference the previous tutorial for instructions on how to write this task if you do not already have it written.


select

This task uses GATK’s SelectVariants to separate indels from SNPs in a vcf. To begin, set up an outline that includes the required section stubs and pass-through variables:

task select {
  File GATK
  File RefFasta
  File RefIndex
  File RefDict
  String sampleName

  command {}
  output {}
}

Following our initial diagram, the two inputs we need to add to this task are String type and File rawVCF. For our command, we will be calling SelectVariants as we would on the command line, but plugging in our variables as we have done before. The command looks like so:

    java -jar ${GATK} \
      -T SelectVariants \
      -R ${RefFasta} \
      -V ${rawVCF} \
      -selectType ${type} \
      -o ${sampleName}_raw.${type}.vcf

Our last job for this task is to assign the output from SelectVariants to a variable, rawSubset. We follow the same format as before, Type outputVariable = “output_from_command”, which you can see in our completed task below:

task select {
  File GATK
  File RefFasta
  File RefIndex
  File RefDict
  String sampleName
  String type
  File rawVCF

  command {
    java -jar ${GATK} \
      -T SelectVariants \
      -R ${RefFasta} \
      -V ${rawVCF} \
      -selectType ${type} \
      -o ${sampleName}_raw.${type}.vcf
  }
  output {
    File rawSubset = "${sampleName}_raw.${type}.vcf"
  }
}

Running the pipeline

As we did in our first tutorial, it is good practice to first validate your WDL script before running it to catch any silly syntax errors. Your next step is to generate an input file. We have done so using the command below, then filled in the inputs within simpleVariantSelection_inputs.json using our favorite text editor. If you don’t know which file is which in the bundle, don’t forget to check the README.

java -jar wdltool.jar inputs simpleVariantSelection.wdl > simpleVariantSelection_inputs.json

With your completed inputs file, you can now run your script locally (i.e. on your own computer). We will run using the command below:

java -jar cromwell.jar run simpleVariantSelection.wdl simpleVariantSelection_inputs.json

When you do, Cromwell should begin showing update messages on your terminal until the workflow is complete.


Check your results

If you’ve done everything correctly, Cromwell will finish and display the paths to the output from each step. Our pipeline has taken in pre-processed bam files and called variants with haplotypeCaller. We then split the variants into SNPs and indels. Let’s check one of our outputs, SimpleVariantSelection.selectIndels.rawSubset. The simplest way to do so is to view the vcf as a text file, either using the more command in the terminal or by opening the file in a text editor. (Syntax: more <filename>)

#CHROM  POS ID  REF ALT QUAL  FILTER  INFO  FORMAT  NA12878
20  9999996 . A ACT 534.73  . AC=2;AF=1.00;AN=2;DP=12;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=25.82;SOR=5.136 GT:AD:DP:GQ:PL  1/1:0,12:12:39:572,39,0
20  10001436  . A AAGGCT  1222.73 . AC=2;AF=1.00;AN=2;DP=29;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=32.17;SOR=0.836 GT:AD:DP:GQ:PL  1/1:0,28:28:84:1260,84,0

Admittedly, it isn’t the easiest to read. The column titles are the first line, and each subsequent line is information on a single variant site. If you’d prefer, GATK has a tool, VariantsToTable, which is used to extract annotations of interest into a more readable format. The resulting .table file can be opened in RStudio, Matlab, or Excel. You can find a .table output (for which all the annotations were displayed) in the .zip bundle attached to this article, or see below for a preview.

Now take a moment to pat yourself on the back. You’ve constructed a multi-step workflow, and you’re well on your way to writing entire pipelines! Continue on in your journey by adding a few more tasks, and learning the merge half of workflow plumbing in the next tutorial.


Return to top Comment on this article