By Jose Soto, Software Engineer in the Broad Institute's Data Sciences Platform

Over the past few months, I spent a lot of time optimizing our pipelines for cost (including for the $5 Genome Analysis Pipeline), so I thought I'd share a few of the tricks I find especially effective, starting with what we call dynamic sizing.

The idea is that on a pay-as-you-go cloud like Google Cloud Platform, the rate you pay for compute is based on the technical specs of the machines you use; the beefier the machine, the higher the rate. So you can save yourself a lot of money by making sure you don't use machines that are beefier than you need.

Here's how we do that in our pipelines.


When you're running a WDL workflow on the cloud, every task in your WDL needs to specify how big a disk it should run on. Most of the time, when you're first writing your WDL, you'll just hardcode an overestimated value directly into the disks runtime attribute of the task so you can get it to succeed and move on. That looks something like this:

  runtime {
    docker: "favorite_org/my_favorite_docker:favorite_version"
    memory: "3000 MB"
    disks: "local-disk 500 HDD"   ## hardcoded disk size (500 GB) and type (HDD)
  }

Once you're satisfied that it works, you'll iterate on your WDL to give yourself more control by parameterizing this sort of setting. You put in a variable that is referenced in the disks runtime attribute so that you can pass in a value for it through the inputs JSON. The idea here is that if you know something about what the task is doing (e.g. reading in a BAM -> writing a BAM, or reading in a BAM -> writing a VCF), you can make a much better guesstimate than some hardcoded value and save some $$. That looks something like this:

task t {
  Int disk_for_my_task

  ...
  ...

  runtime {
    docker: "favorite_org/my_favorite_docker:favorite_version"
    memory: "3000 MB"
    disks: "local-disk " + disk_for_my_task + " HDD"   ## disk now passed in from the task input
  }
  ...
  ...
}
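
With the task parameterized like that, the disk size is just another entry in your inputs JSON. As a quick illustration (assuming the task above is called from a workflow named w, so the key follows the usual workflow.call.input naming), it might look like this:

{
  "w.t.disk_for_my_task": 200
}

Here 200 simply means 200 GB of local disk for that call.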

One drawback to this solution is that now every time you call this task, you have to come up with some value for disk that should (probably, maybe) work for your specific set of inputs -- and this can vary wildly from run to run if your inputs themselves vary a lot. It's always super annoying when a task fails because it ran out of disk space, so having to guess every time your input changes can quickly become very frustrating.

But why not use the WDL itself to come up with these values for you? After all, when you run it, it has access to the specific input files it's going to run on. So let's just have it calculate the disk size it will need based on the size of the data it's been given. No more guessing required! Well, maybe a little bit of guessing to estimate the total size including outputs...

To accomplish this, we get to play with some fun WDL functions like size, ceil, and floor, which you can read about in the specification. We can do the calculation at the workflow level and just pass the resulting variable to the task using the disk_for_my_task input we already wired up earlier. Here's what our workflow looks like now:

workflow w {
  File ref_fasta
  File ref_dict
  File input_bam

  # Plan on some adjustable disk padding 
  Int disk_pad = 10

  # Calculate disk size based on inputs, padding and expected output
  Int disk_for_my_task = ceil(size(ref_fasta, "GB") + size(ref_dict, "GB") + size(input_bam, "GB")) + disk_pad + 5

  call CollectBamMetrics {
    input:
      input_bam = input_bam,
      ref_fasta = ref_fasta,
      ref_dict = ref_dict,
      disk_for_my_task = disk_for_my_task 
  }
}

So here you see we're adding up the sizes of the main input files, rounding up the total with ceil() (which is short for "ceiling"), and adding a bit of adjustable padding plus 5 GB hardcoded to account for the output, which in this case we know will be a text file of metrics so nothing huge.

One drawback to this way of going about things is that you now have boilerplate size calculations for each call in your WDL, which ends up generating a lot of clutter. To clean that up and make the workflow section look nicer, you can push the size calculations down into the task itself, as sketched below. On the other hand, keeping them in the workflow allows you to share values like multipliers or whatever scheme you come up with between calls. There's no single right way to do this; it depends on what you're going for and what you care about most.
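
For illustration, here's a minimal sketch of what the task-level version could look like, written in the same draft-style WDL as the examples above (the 15 GB of padding is just a placeholder value, and the actual command is elided):

task CollectBamMetrics {
  File input_bam
  File ref_fasta
  File ref_dict

  # The task sizes its own disk from its File inputs, plus some padding for the output
  Int disk_for_my_task = ceil(size(input_bam, "GB") + size(ref_fasta, "GB") + size(ref_dict, "GB")) + 15

  command {
    ...
  }

  runtime {
    docker: "favorite_org/my_favorite_docker:favorite_version"
    memory: "3000 MB"
    disks: "local-disk " + disk_for_my_task + " HDD"
  }
}

With this version, the workflow no longer needs to compute or pass in disk_for_my_task at all.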

That said, it would be way better to have a built-in function you could use in your task that summed up the sizes of your File inputs automagically. Then you could do something like:

  runtime {
    docker: "my_favorite_docker:latest"
    memory: "3000 MB"
    disks: "local-disk " + (summed_inputs() * 2 + 10) + " HDD"
  }

where summed_inputs() would automagically return the summed size of all your File inputs, and you could add inline whatever additional arithmetic you need for the autosizing. In this case we're doubling the size of the inputs and adding 10 GB of padding. Sadly this function doesn't exist yet, but there is an open ticket to create it...

Anyway, try these out and let me know how it goes -- especially if you find new ways to use them!


Fri 9 Mar 2018