Heads up if you use the TARGET and/or TCGA data workspaces: we're planning an update that will change the paths to the data files. This will disable access to the original files in any clones of the original workspaces that retain the old paths, and it will affect call caching in any new clones of the original workspaces that you create after the update rolls out in a few weeks.

Read on to understand what this change entails and why we believe it's an improvement worth making.


What's changing?

The file paths will have a new URL structure. After the update, file paths will no longer start with “gs://” followed by a bucket path. Instead, they will use the “dos://” scheme, followed by the file's GDC UUID.
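As a rough illustration, here is a minimal sketch of what handling the new paths might look like, assuming the UUID-only form (dos://&lt;uuid&gt;) shown in the comment thread below; the final structure may differ:

```python
from urllib.parse import urlparse

def parse_dos_uri(uri):
    """Extract the GDC file UUID from a dos:// URI.

    Assumes the UUID-only form (dos://<uuid>); the exact
    structure in the final rollout may differ.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "dos":
        raise ValueError(f"not a dos:// URI: {uri!r}")
    # With dos://<uuid>, urlparse places the UUID in netloc.
    return parsed.netloc

# UUID taken from the example in the comment thread:
uuid = parse_dos_uri("dos://0141f91f-b350-45df-bbb8-983007daf27c")
```

A resolver service would then take that UUID and return the file's current physical location (a Google bucket URL, an AWS URL, etc.).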

What are the consequences for your work?

If you are working in clones of the original master workspaces, the original data files will no longer be available at the old "gs://" paths. So if you want to run new analyses in those previously existing workspaces, you will need to update the metadata to use the new "dos://" paths. However, any output files you previously derived from analyzing those datasets will be unaffected and will remain accessible.
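If you manage your data model as a TSV export, the metadata update can be scripted. This is a minimal sketch, assuming you already have a mapping from old gs:// paths to their dos:// replacements (the mapping itself would come from an updated manifest; all names here are illustrative):

```python
import csv
import io

def swap_paths(tsv_text, path_map):
    """Rewrite old gs:// file paths to their dos:// replacements
    in a data-model TSV. Cells not found in path_map pass
    through unchanged."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow([path_map.get(cell, cell) for cell in row])
    return buf.getvalue()
```

You would then re-upload the rewritten TSV to the workspace to complete the swap.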

The first time you run workflows on the data using the new paths, in any workspace (old or new), call caching will NOT kick in even if you have previously run those workflows on the same data. The workflows will have to run in full; this is because call caching uses the file paths as part of the algorithm that identifies whether a given computation has already been run with the same starting conditions.

Why are we making this change?

We understand that this update has the potential to disrupt your work, so please rest assured we are not making this decision lightly. In a nutshell, the switch in file path structure brings substantial benefits that we believe are worth risking some disruption.

In the past, the Genomic Data Commons (GDC) only released new datasets a few times a year, but the frequency at which we receive data has been increasing steadily, which has been making it more difficult to manage the new content and make it available in a timely way. This update will enable us to respond more efficiently and in a standardized manner to new datasets released by the GDC. As a result, you can expect to see updates to TARGET and TCGA workspaces happen much more quickly than in the past, and we'll also be able to provide access to new GDC datasets from our platform as they become available.

For context, this move is part of a larger shift within the GDC towards location-agnostic URLs, which allow the physical data to be relocated without changing references to those data. It follows changes in the standards for how datasets are stored across the Data Commons Framework (DCF), which we expect will provide a more standardized and streamlined experience going forward. We are keen to align with the GDC on this effort, and we look forward to seeing its benefits materialize for everyone.

Let us know if you have any concerns or questions; as always we are here to help.



birger on 19 Mar 2019


Geraldine, Does this apply to just the hg38 TCGA and TARGET workspaces, which didn't reference files via gs:// URLs but rather GDC UUIDs, or does it also apply to the hg19 TCGA workspaces? -Chet

Geraldine_VdAuwera on 19 Mar 2019


Hi Chet, that's a good question -- @abaumann can you shed some light on that?

birger on 19 Mar 2019


I'm a bit confused because the hg38 workspaces never contained gs: urls. They just contained GDC UUIDs. As a temporary solution we provided workflows for retrieving files from the GDC based on the UUIDs, but the plan was for the GDC and FireCloud to implement UUID to URL resolution. It looks like instead of doing that you and the GDC are introducing a new type of location-independent URL (which should really be called a URI - universal resource identifier). Will these "dos" paths incorporate the GDC file UUIDs? Will they be included in any file manifests downloaded from the GDC?

abaumann on 19 Mar 2019


For hg38, you're correct that those workspaces never had gs:// URLs; they had the UUIDs, and your downloader fetched the data from those. For the hg19 workspaces, the gs:// URLs will be replaced with dos:// URIs; for the hg38 workspaces, the UUIDs will be replaced with dos:// URIs. These URIs are not location-dependent: different resolvers can take them and resolve them to URLs, whether the files live in a Google bucket, in AWS, or elsewhere. The UUIDs in these DOS URIs are the same UUIDs your workspaces already had; however, looking through your workspaces, they also included the file names, which the new workspaces will not (they will only have the dos:// URI). Does that resolve your question about manifests, given you will still have the same UUIDs (just in dos URI form)? We put in the actual URI so that we know how to handle it in the UI, for instance click-to-preview. Without it, we couldn't distinguish DOS UUIDs from the other UUIDs our system uses (workflow UUIDs, submission UUIDs, and so on).

birger on 19 Mar 2019


I included the file names in the hg38 workspaces to support the file downloaders. The filenames should not be required once we move to "dos" URIs. Will the manifest provide dos URIs or just have the uuids, and we will construct the dos urls from the uuids?

birger on 19 Mar 2019


Could you provide an example of a dos:// url? I tried adding the url "dos://0141f91f-b350-45df-bbb8-983007daf27c", where the body of the url is the uuid of a file on the GDC, to a firecloud workspace. The Firecloud GUI identifies it as a file url, but fails to resolve it to the file on cloud storage. I suspect I'm using the incorrect syntax for incorporating a file's GDC UUID into the dos url. Thanks.



