(How to) Execute Workflows from the gatk-workflows Git Organization
Tutorials | Created 2018-07-26 | Last updated 2018-08-29


The gatk-workflows GitHub organization houses a set of repositories containing workflows contributed by the Broad Institute, along with versions of these workflows optimized by Intel to take advantage of newer technologies, such as FPGAs, for faster runtimes. The available workflows cover several types of genomic analysis following the GATK Best Practices, such as data pre-processing for variant discovery, somatic sequence analysis using Mutect, and simpler workflows used for sequence format conversion.

The provided workflows have an accompanying JSON file containing references, resources, default parameters, and input BAM files used to test the workflow on the user's platform. The document below guides users through executing an example workflow on the Google Cloud Platform as well as running the workflow locally.

Please note that the Broad is moving toward a cloud-centric computing environment, so the provided workflows are designed and intended to run on the cloud. Some of these workflows may need to be modified by the user before they can be executed in a local environment.

Key Google Cloud Buckets

  • Broad References
  • Broad Public Datasets
  • GATK Test Data
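
If you have the Google Cloud SDK installed, these buckets can also be browsed from the command line with gsutil. A minimal sketch, using the gs://gatk-test-data bucket referenced later in this document (listing may require you to be authenticated with gcloud first):

  # List the contents of the GATK test data bucket
  gsutil ls gs://gatk-test-data/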

Running Workflows Using Google Cloud Platform


General Prerequisites:

  • A Google Cloud project that can be billed for the run, and a Google Cloud Storage bucket you can write output to (see step 7 below).

Tool Prerequisites:

  • Git, wget, Java (to run Cromwell), and the Google Cloud SDK (gcloud and gsutil)
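  • A quick sanity check, assuming the tools above are installed and on your PATH (the version numbers printed will vary by machine):
      git --version
      java -version
      gcloud version
      gsutil version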

Instructions:

  1. Set up your working directory
    • Make a directory to test workflows then change into that directory.
      mkdir gatk-workflows
      cd gatk-workflows
  2. Download the latest release of Cromwell, the Java executable that will run the WDL.
    wget https://github.com/broadinstitute/cromwell/releases/download/33.1/cromwell-33.1.jar
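    • Optionally, confirm the jar runs before going any further (this assumes Java is installed and on your PATH; it should print the Cromwell version):
      java -jar cromwell-33.1.jar --version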
  3. Clone the repository you would like to execute. In this example we will be executing the validate-bam workflow from the seq-format-validation repository.
    git clone https://github.com/gatk-workflows/seq-format-validation.git
  4. Once you’ve successfully cloned the repository, the seq-format-validation directory will appear in your gatk-workflows working directory. It contains multiple files, but we are only concerned with the WDL and its JSON. We’ll be running the validate-bam.wdl workflow using its accompanying JSON file validate-bam.inputs.json. The JSON contains the required/optional parameters needed to run the workflow, including the path to a test input file located in a Google Cloud bucket.
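    • If you’d like to see which parameters the JSON sets before running, you can simply print it (this only reads the file, nothing is modified):
      cat ./seq-format-validation/validate-bam.inputs.json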
  5. We have our WDL and our JSON file, but we need one more file to run on Google Cloud: a configuration file that tells Cromwell we want to execute our workflow on the cloud. You can create your own configuration using the instructions found in the Cromwell documentation. In this example we'll name our conf file google-adc.conf and copy the contents below into it.
    • Create and edit conf file
      vim google-adc.conf
    • Copy the contents below into the file
      include required(classpath("application"))

      google {
        application-name = "cromwell"
        auths = [
          {
            name = "application-default"
            scheme = "application_default"
          }
        ]
      }

      engine {
        filesystems {
          gcs {
            auth = "application-default"
          }
        }
      }

      backend {
        default = "JES"
        providers {
          JES {
            actor-factory = "cromwell.backend.impl.jes.JesBackendLifecycleActorFactory"
            config {
              // Google project
              project = "<google-project-id>"
              compute-service-account = "default"

              // Base bucket for workflow executions
              root = "<google-bucket>/cromwell-execution"

              // Polling for completion backs-off gradually for slower-running jobs.
              // This is the maximum polling interval (in seconds):
              maximum-polling-interval = 600

              // Optional Dockerhub Credentials. Can be used to access private docker images.
              dockerhub {
                // account = ""
                // token = ""
              }

              genomics {
                // A reference to an auth defined in the `google` stanza at the top. This auth is used to create
                // Pipelines and manipulate auth JSONs.
                auth = "application-default"
                // Endpoint for APIs, no reason to change this unless directed by Google.
                endpoint-url = "https://genomics.googleapis.com/"
              }

              filesystems {
                gcs {
                  // A reference to a potentially different auth for manipulating files via engine functions.
                  auth = "application-default"
                }
              }
            }
          }
        }
      }

      system {
        input-read-limits {
          lines = 1280000
          bool = 7
          int = 19
          float = 50
          string = 1280000
          json = 1280000
          tsv = 1280000
          map = 1280000
          object = 1280000
        }
      }
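    • The configuration above authenticates with application-default credentials (scheme = "application_default"). If you have not already set these up on your machine, you can typically do so with the Google Cloud SDK before running the workflow (this assumes gcloud is installed):
      gcloud auth application-default login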
  6. At this point your directory structure should look like this
    |-gatk-workflows/
           |-cromwell-33.1.jar
           |-google-adc.conf
           |-seq-format-validation/
                      |-LICENSE
                      |-README.md   
                      |-Generic.google-papi.options.json
                       |-validate-bam.inputs.json
                      |-validate-bam.wdl
  7. Before you execute the workflow you'll need two pieces of information: 1) the project that will pay for the run, and 2) where to store your output files.
    • Your current project name can be determined by entering gcloud info in your terminal; it will be listed under "Current Properties"
      Current Properties:
      [core]
      project: [your-project-name]
      account: [your-account@gmail.com]
      disable_usage_reporting: [True]
      [compute]
      region: [us-central1]
      zone: [us-central1-a]
    • The location of your bucket is entirely up to you. It can be one you create or one designated to you by the project owner, for example gs://my-bucket/
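    • As an alternative to reading the full gcloud info output, and to confirm you can reach your bucket, a quick sketch assuming the Google Cloud SDK is configured (substitute your own bucket name):
      gcloud config get-value project
      gsutil ls gs://my-bucket/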
  8. It's time to execute the workflow.
    java -Dconfig.file=google-adc.conf \
      -Dbackend.providers.JES.config.project=<your-project-name> \
      -Dbackend.providers.JES.config.root=gs://<my-bucket>/ \
      -jar cromwell-33.1.jar \
      run ./seq-format-validation/validate-bam.wdl \
      --inputs ./seq-format-validation/validate-bam.inputs.json
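    • The cloned repository also contains a Generic.google-papi.options.json file. If you want to pass workflow options to Cromwell, they can be supplied with the --options flag; a sketch of the same command with the options file added (whether its defaults suit your project is up to you to verify):
      java -Dconfig.file=google-adc.conf \
        -Dbackend.providers.JES.config.project=<your-project-name> \
        -Dbackend.providers.JES.config.root=gs://<my-bucket>/ \
        -jar cromwell-33.1.jar \
        run ./seq-format-validation/validate-bam.wdl \
        --inputs ./seq-format-validation/validate-bam.inputs.json \
        --options ./seq-format-validation/Generic.google-papi.options.json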
  9. While the workflow is running, Cromwell will print logs to your screen (lots of them). Once it completes, it will print a message indicating the run was successful, along with the Google bucket location of the output files generated by your workflow.
    • You can copy the output file to a local directory using gsutil cp. For example:
      gsutil cp gs://my-bucket/path/to/output /path/to/local/directory/
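    • If you want to see everything the run wrote before copying, you can list the execution directory recursively (the path below assumes the root set in google-adc.conf and your own bucket name):
      gsutil ls -r gs://my-bucket/cromwell-execution/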

Running Workflows Locally


Tool Prerequisites:

  • Docker (the workflow tasks run inside Docker containers)
  • Git
  • Java (to run Cromwell) and wget

Instructions:

  1. Set up your working directory.
    • Make a working directory to test workflows then change into that directory.
      mkdir gatk-workflows
      cd gatk-workflows
    • Make a directory to store input files.
      mkdir inputs
  2. Download the latest release of Cromwell, the Java executable that will run the WDL.
    wget https://github.com/broadinstitute/cromwell/releases/download/33.1/cromwell-33.1.jar
  3. Clone the repository you would like to execute. In this example we will be executing the validate-bam workflow from the seq-format-validation repository.
    git clone https://github.com/gatk-workflows/seq-format-validation.git
  4. Once you’ve successfully cloned the repository, the seq-format-validation directory will be in your gatk-workflows working directory. It contains multiple files, but we are only concerned with the WDL and its JSON. We’ll be running the validate-bam.wdl workflow using its accompanying JSON file validate-bam.inputs.json. The JSON contains the required/optional parameters needed to run the workflow, including the paths to input files located in a Google Cloud bucket. Since we’re running this locally, we first need to download any files referenced in the JSON. In this case we only need to download the input files, but the same instructions apply to reference/resource files. Special note: because this is a local demo and the medium BAM file is 18 GB, we’ll only download and work with the small BAM file. The input files listed in the JSON are in the following Google bucket locations:
    • gs://gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bam
    • gs://gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_med.hg38.bam
    The base Google bucket is gs://gatk-test-data; a web link to this bucket is provided in this document under the subtitle Key Google Cloud Buckets. Use that link to open the bucket in your web browser, then follow the file path given in the JSON (e.g. /wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bam) to locate the file, and download it by clicking on the file name.
  5. Once the file is downloaded, be sure to move it into the gatk-workflows/inputs/ directory (or fetch it directly with gsutil, as shown below).
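    • Alternatively, if you have the Google Cloud SDK installed locally, you can skip the browser and fetch the small BAM straight into the inputs directory; a sketch, run from the gatk-workflows directory:
      gsutil cp gs://gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bam ./inputs/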
  6. Next we’ll edit the JSON file so that all gs:// file paths are replaced with local file paths.
    • Change directories to the cloned repository
      cd seq-format-validation
    • Edit the json file to replace the input file path
      vim validate-bam.inputs.json
    • After replacing the input file paths you should have something like this
      {
        "##Comment1": "Input",
        "ValidateBamsWf.bam_array": [
          "/home/username/gatk-workflows/inputs/NA12878_24RG_small.hg38.bam"
        ],

        "##Comment2": "Parameter",
        "ValidateBamsWf.ValidateBAM.validation_mode": "SUMMARY",

        "##Comment3": "Runtime - uncomment the lines below and supply a valid docker container to override the default",
        "ValidateBamsWf.ValidateBAM.mem_size": "1 GB",
        "ValidateBamsWf.ValidateBAM.disk_size": 100,
        "##ValidateBamsWf.ValidateBAM.gatk_path_override": "String (optional)",
        "##ValidateBamsWf.gatk_docker_override": "String (optional)"
      }
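    • A stray comma or missing quote will cause Cromwell to reject the inputs file, so it can be worth confirming the edited file is still valid JSON; one way to do that, assuming Python is installed:
      python -m json.tool validate-bam.inputs.json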
  7. Change back to the main working directory
    cd ../
    • At this point your directory structure should look like this
      |-gatk-workflows/
         |-cromwell-33.1.jar
         |-inputs/
         |          |-NA12878_24RG_small.hg38.bam
         |-seq-format-validation/
                    |-LICENSE
                    |-README.md   
                    |-Generic.google-papi.options.json
                     |-validate-bam.inputs.json
                    |-validate-bam.wdl
  8. It's time to execute the workflow.
    java -jar cromwell-33.1.jar run ./seq-format-validation/validate-bam.wdl --inputs ./seq-format-validation/validate-bam.inputs.json
  9. While the workflow is running, Cromwell will print logs to your screen (lots of them). Once it completes, it will print a message indicating the run was successful, along with the location of the output files generated by your workflow.

    • Side note: After the workflow completes you’ll see two directories, cromwell-executions and cromwell-workflow-logs. cromwell-workflow-logs will have a log file for each job you execute, while cromwell-executions will contain the outputs generated by your executed jobs. The top level of cromwell-executions has a directory for each workflow you’ve run. We’ve only run ValidateBamsWf (the workflow name found in the workflow block of the WDL script), so that will be the only folder you see.
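    • A quick way to locate everything the run wrote under cromwell-executions (the intermediate directory names include a run-specific workflow id, so they will differ from run to run):
      find cromwell-executions/ValidateBamsWf -type f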

Important Notes

  • It is the user’s responsibility to alter the JSON to meet their needs; example JSON files should not be used in production without being customized and vetted by principal scientists.
