FireCloud is now powered by Terra! -- STARTING MAY 1st, 2019 THIS WEBPAGE WILL NO LONGER BE UPDATED.
From now on, please visit the Terra Help Center for documentation, tutorials, roadmap and feature announcements.
Want to talk to a human? Email the helpdesk, post feature requests or chat with peers in the community forum.

Heads up if you use the TARGET and/or TCGA data workspaces: we're planning an update that will change the paths to the data files. This will disable access to the original files in any clones of the original workspaces that retain the old paths, and it will affect call caching in any new clones of the original workspaces that you create after the update rolls out in a few weeks.

Read on to understand what this change entails and why it's an important improvement that is worth making.

What's changing?

The file paths will have a new URL structure. After the update, file paths will no longer start with “gs://” followed by a bucket path; instead, they will start with “dos://”.

What are the consequences for your work?

If you are working in clones of the original master workspaces, the original data files will no longer be available at the old "gs://" paths. So if you want to run new analyses in those previously existing workspaces, you will need to update the metadata to use the new "dos://" paths. However, any output files you previously derived from analyzing those datasets will be unaffected and will remain accessible.

The first time you run workflows on the data using the new paths, in any workspace (old or new), call caching will NOT kick in even if you have previously run those workflows on the same data. The workflows will have to run in full; this is because call caching uses the file paths as part of the algorithm that identifies whether a given computation has already been run with the same starting conditions.
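To illustrate why a path change alone invalidates the cache, here is a minimal sketch in Python. It is not the platform's actual call-caching algorithm; the `cache_key` function, the workflow name and both example paths are hypothetical. It only assumes, as described above, that the input file paths feed into whatever fingerprint is used to recognize a previously run computation.

```python
import hashlib
from typing import Iterable


def cache_key(workflow_name: str, input_paths: Iterable[str]) -> str:
    """Hypothetical fingerprint: hash the workflow name plus its input paths."""
    digest = hashlib.sha256()
    digest.update(workflow_name.encode("utf-8"))
    for path in sorted(input_paths):
        # A separator byte keeps adjacent strings from running together.
        digest.update(b"\0")
        digest.update(path.encode("utf-8"))
    return digest.hexdigest()


# Placeholder paths for illustration only; they are not real data locations.
old_key = cache_key("align", ["gs://example-bucket/sample1.bam"])
repeat_key = cache_key("align", ["gs://example-bucket/sample1.bam"])
new_key = cache_key("align", ["dos://example-id"])

print(old_key == repeat_key)  # same path, same key: a cache hit is possible
print(old_key == new_key)     # different path, different key: full rerun
```

Under this assumption, the underlying file contents never enter the comparison, so the same data reached through a new path looks like a brand-new input the first time it is used.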

Why are we making this change?

We understand that this update has the potential to disrupt your work, so please rest assured we are not making this decision lightly. In a nutshell, the switch in file path structure brings substantial benefits that we believe are worth risking some disruption.

In the past, the Genomic Data Commons (GDC) only released new datasets a few times a year, but the frequency at which we receive data has been increasing steadily, which has been making it more difficult to manage the new content and make it available in a timely way. This update will enable us to respond more efficiently and in a standardized manner to new datasets released by the GDC. As a result, you can expect to see updates to TARGET and TCGA workspaces happen much more quickly than in the past, and we'll also be able to provide access to new GDC datasets from our platform as they become available.

For context, this move is part of a larger shift within the GDC towards location-agnostic URLs, which allow the physical data to relocate without changing references to those data. This follows changes in the standards for how datasets are stored across the Data Commons Framework (DCF), which we expect will provide a more standardized and streamlined experience going forward. We are keen to align with the GDC on this effort and we look forward to seeing its benefits materialize for everyone.
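The idea behind a location-agnostic reference can be sketched in a few lines of Python. This is only an illustration of the indirection, not the real resolver used by the GDC or by our platform; the `Resolver` class, the identifier and the bucket paths are all hypothetical.

```python
from typing import Dict


class Resolver:
    """Hypothetical lookup from a stable identifier to a physical location."""

    def __init__(self) -> None:
        self._locations: Dict[str, str] = {}

    def register(self, identifier: str, location: str) -> None:
        # Registering again for the same identifier models relocating the data.
        self._locations[identifier] = location

    def resolve(self, identifier: str) -> str:
        return self._locations[identifier]


resolver = Resolver()
reference = "dos://example-id"  # placeholder identifier stored in a workspace

resolver.register(reference, "gs://example-bucket-a/sample1.bam")
first_location = resolver.resolve(reference)

# The data moves; only the resolver's record changes, not the reference.
resolver.register(reference, "gs://example-bucket-b/sample1.bam")
second_location = resolver.resolve(reference)

print(first_location)
print(second_location)
```

The workspace keeps the same reference throughout, which is the property that lets the physical files relocate without any metadata update on the user's side.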

Let us know if you have any concerns or questions; as always we are here to help.

birger on 19 Mar 2019

Geraldine, Does this apply to just the hg38 TCGA and TARGET workspaces, which didn't reference files via gs:// URLs but rather GDC UUIDs, or does it also apply to the hg19 TCGA workspaces? -Chet

Geraldine_VdAuwera on 19 Mar 2019

Hi Chet, that's a good question -- @abaumann can you shed some light on that?

birger on 19 Mar 2019

I'm a bit confused because the hg38 workspaces never contained gs: urls. They just contained GDC UUIDs. As a temporary solution we provided workflows for retrieving files from the GDC based on the UUIDs, but the plan was for the GDC and FireCloud to implement UUID to URL resolution. It looks like instead of doing that you and the GDC are introducing a new type of location-independent URL (which should really be called a URI - uniform resource identifier). Will these "dos" paths incorporate the GDC file UUIDs? Will they be included in any file manifests downloaded from the GDC?

abaumann on 19 Mar 2019

For hg38 you are correct: those workspaces never had gs:// urls; they had the uuids, and your downloader retrieved the data. For the hg19 workspaces, the gs:// urls would be replaced with dos:// uris. For the hg38 workspaces, the uuids would be replaced with dos:// uris. These are not location dependent - they resolve to google bucket URLs (or files in AWS, etc). There are different resolvers out there that can all take these uuids and resolve them to URLs. The uuids in these DOS uris are the same uuids your workspaces had, but looking through your workspaces, they also included the file names, which the new workspaces will not - they would only have dos://. Does this resolve the question you have about manifests, given that you will still have the same uuids (just in dos uri form)? We put in the actual uri so that we know how to handle it in the ui, for instance to click and see a preview. Without it, we couldn't distinguish DOS uuids from other uuids (like the workflow, submission, and other uuids that our system also uses).

birger on 19 Mar 2019

I included the file names in the hg38 workspaces to support the file downloaders. The filenames should not be required once we move to "dos" URIs. Will the manifest provide dos URIs or just have the uuids, and we will construct the dos urls from the uuids?

birger on 19 Mar 2019

Could you provide an example of a dos:// url? I tried adding the url "dos://0141f91f-b350-45df-bbb8-983007daf27c", where the body of the url is the uuid of a file on the GDC, to a firecloud workspace. The Firecloud GUI identifies it as a file url, but fails to resolve it to the file on cloud storage. I suspect I'm using the incorrect syntax for incorporating a file's GDC UUID into the dos url. Thanks.
