FireCloud is now powered by Terra! -- STARTING MAY 1st, 2019 THIS WEBPAGE WILL NO LONGER BE UPDATED.
From now on, please visit the Terra Help Center for documentation, tutorials, roadmap and feature announcements.
# I need to process readgroup-level files but there's no obvious place for them in the data model

FireCloud | Doc #11536 | Solutions to Problems | Created 2018-02-28 | Last updated 2018-03-05

Ah, the eternal question of how to deal with readgroups in a world where the data model does not acknowledge their existence. In a not-so-distant future where rainbows and unicorns roam the clouds, we'll have a more flexible data model where you can explicitly put read groups one level under the samples entity table.

In the meantime, here's how we recommend dealing with readgroup-level files:

### 1. Set up the data model

In your samples table, declare as individual samples the actual samples you expect to have once your readgroup data has been merged, i.e. one row per final sample rather than one per readgroup.

### 2. Attach readgroups FoFN to each sample

For each sample, provide a "file of file names" (commonly abbreviated FoFN) listing the paths to the readgroup files (typically FASTQs or uBAMs) in your bucket, one path per line. We typically use the gsutil command-line utility (admittedly outside of FireCloud) to generate the FoFN of readgroup files.

For example, if your readgroup file paths all contain the sample name in their filename, you could run a command like this to list all file paths containing a particular sample name within a shared folder:

```
gsutil ls gs://bucket/path/to/readgroup_bams_folder/*sampleName* > sampleName.RG_bams.list
```

### 3. Adapt the WDL to read the FoFN

Adapt the WDL you want to run to take in a FoFN input, and then use the `read_lines()` function to convert the contents of the FoFN into an array of readgroup file paths. In FireCloud, edit your Method Configuration to run on sample as the root entity.

```wdl
File file_of_filenames
Array[File] flowcell_unmapped_bams = read_lines(file_of_filenames)
```

### 4. Scatter across the readgroup files

Within that WDL, you can then run a scatter across the readgroup files. The scatter block would look something like this:

```wdl
# run on the readgroup files in parallel
scatter (unmapped_bam in flowcell_unmapped_bams) {
  call something_that_maps_bams {
    input:
      input_bam = unmapped_bam
  }
}
```
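Putting the two snippets above together, a minimal sketch of the FoFN-plus-scatter pattern might look like the following. It uses the same pre-1.0 WDL style as the rest of this article (no `version` line or `input` block). The workflow name `process_readgroups`, the task body, the output naming scheme and the `docker` value are placeholders made up for illustration; only `read_lines()`, the scatter, and the names already used above come from this article.

```wdl
workflow process_readgroups {
  # FoFN attached to the sample: one readgroup file path per line
  File file_of_filenames
  Array[File] flowcell_unmapped_bams = read_lines(file_of_filenames)

  # run on the readgroup files in parallel
  scatter (unmapped_bam in flowcell_unmapped_bams) {
    call something_that_maps_bams {
      input:
        input_bam = unmapped_bam
    }
  }
}

# Placeholder task: swap the command for your real mapping step.
task something_that_maps_bams {
  File input_bam
  # hypothetical naming scheme for the per-readgroup output
  String output_name = basename(input_bam, ".bam") + ".mapped.bam"

  command {
    # stand-in so the sketch runs; replace with your aligner invocation
    cp ${input_bam} ${output_name}
  }
  runtime {
    docker: "ubuntu:16.04"  # assumption: any image with a POSIX shell
  }
  output {
    File output_mapped_bam = "${output_name}"
  }
}
```

Because the FoFN is declared as a `File` and its lines are read into an `Array[File]`, each listed path is treated as a file to localize for the task that consumes it, rather than as a plain string.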

### 5. Aggregate per sample

Optionally, you can run something that merges readgroup files per-sample. The output of the scattered tasks will be arrays of whatever the task produces, so you can easily feed that to a merge operation that takes an array.

So your call would look something like this:

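```wdl
# output from something_that_maps_bams is automatically gathered into an array
# when the call's output is referenced from outside of the scatter block.
Array[File] mapped_bams = something_that_maps_bams.output_mapped_bam

# hypothetical merge task; substitute whatever takes an Array[File] in your WDL
call something_that_merges_bams {
  input:
    input_bams = mapped_bams
}
```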
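For reference, a merge task that accepts the gathered array could be sketched as follows. This assumes the per-readgroup outputs are BAMs and that `samtools` is available in the task's Docker image. The task name `something_that_merges_bams`, its input and output names, the default file name and the image string are all hypothetical placeholders, not part of any published workflow.

```wdl
task something_that_merges_bams {
  Array[File] input_bams                # the gathered per-readgroup outputs
  String output_name = "merged.bam"     # hypothetical default; override per sample if needed

  command {
    # usage: samtools merge OUT.bam IN1.bam IN2.bam ...
    samtools merge ${output_name} ${sep=' ' input_bams}
  }
  runtime {
    docker: "your-registry/samtools-image"  # placeholder: any image that provides samtools
  }
  output {
    File output_merged_bam = "${output_name}"
  }
}
```

The `${sep=' ' input_bams}` placeholder expands the array into a space-separated list of localized file paths on the command line, which is the form `samtools merge` expects for its inputs.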
In FireCloud, link the final output of the WDL to a sample attribute in the method configuration, e.g. this.output. If the output of your WDL is a sample-level aggregate (e.g. a per-sample BAM), you should be all set to proceed from there. If the output is not yet aggregated at the sample level (e.g. it is an intermediate per-readgroup file), you can glob it and run whatever step is next on the glob, or something to that effect. We can advise you on the particulars if needed; please ask questions in the comment thread.