Cromwell version 32 was released last Thursday evening, June 7th, and if you saw the release notes, you may have wondered what we meant by “File read limits.” For those who didn’t see the notes, we explained that this feature improves the stability of Cromwell, and thus of FireCloud, but we didn’t go into much detail. In this post, I’ll explain how it helps stability by framing the problem, the solution, and the potential impacts.
Problem: FireCloud will slow down or stop working entirely if a user passes a large file to the WDL read_lines() call. The read_lines() call is frequently used to ingest a list of filenames, genomic intervals, and similar inputs for scattering purposes. Users have accidentally pointed it at a BAM file, which fills up Cromwell’s memory and slows down the engine. This vulnerability means that a single user can take down the service.
Solution: “File read limits” is a Cromwell configuration option that caps how much data can be read in through a WDL read_lines() call. Setting the limit low safeguards the system from being taken down by a single user. FireCloud’s Cromwell configuration now uses a file read limit of 1 MB. Limits also apply to the other read_X functions and can be found here.
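To make the idea concrete, here is a minimal Python sketch of a size-capped line reader. It is not Cromwell's implementation. The function name, the exception class, and the reading of "1 MB" as 1,000,000 bytes are all assumptions made for illustration. The point it shows is that the size check happens before any of the file is loaded into memory.

```python
import os

# Assumption: "1 MB" is taken to mean 1,000,000 bytes for this illustration.
READ_LINES_LIMIT_BYTES = 1_000_000


class FileReadLimitExceeded(Exception):
    """Raised when a file is larger than the configured read limit (hypothetical name)."""


def read_lines_with_limit(path, limit_bytes=READ_LINES_LIMIT_BYTES):
    """Return the lines of a text file, refusing files larger than limit_bytes."""
    size = os.path.getsize(path)
    if size > limit_bytes:
        # Fail before opening the file, so an oversized input never reaches memory.
        raise FileReadLimitExceeded(
            f"{path} is {size} bytes, which exceeds the limit of {limit_bytes} bytes"
        )
    with open(path, "r") as handle:
        return [line.rstrip("\n") for line in handle]
```

With a check like this, a short list of intervals or filenames is read normally, while a multi-gigabyte BAM file is rejected immediately instead of exhausting the engine's memory.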
If you see an error message like this, you've been impacted: "File
Help: If you find yourself blocked by these changes, please don't hesitate to reach out for help. We want to ensure that the benefit of making the system more stable outweighs any individual disruptions.
By Eric Weitz, software engineer, Data Sciences Platform at the Broad Institute
Have you heard of the FAIR principles? They are a set of guidelines proposed as part of a growing movement to make data more Findable, Accessible, Interoperable, and Reusable. As this movement gains traction, we are seeing more FAIR-related activities at major meetings and conferences. For example, the recent Bio-IT World meeting in Boston included a conference track dedicated to FAIR, as well as a hackathon.
I was part of a team of four people from the Broad Institute's Data Sciences Platform that participated in the Bio-IT hackathon. Our goal: make data more FAIR in Single Cell Portal, which is built on top of FireCloud. In addition to improving the Single Cell Portal’s scientific data management, the hackathon also gave our team a chance to work with developers from other organizations in a manner that was uniquely nimble.
We are excited to introduce a new Featured workspace that demonstrates the GenoMetric Query Language (GMQL) created by a team from Politecnico di Milano in Italy. For some context on Featured workspaces, please read our previous blog post.
GMQL is a high-level, declarative language supporting queries over thousands of heterogeneous datasets and samples; as such, it enables genomic “big data” analysis. Based on the Hadoop framework and the Apache Spark platform, GMQL is designed to be highly scalable, flexible, and simple to use. You can try the system here through its several interfaces, with documentation and biological query examples on ENCODE, TCGA, and other public datasets, or clone the Featured workspace and launch an example analysis.
The GMQL 101 workspace features three methods of increasing complexity to give you a taste of how the query language works. The first method joins two datasets and then extracts a third dataset based on a specific condition: pairs of regions that are less than 1000 bases apart. The second method takes a VCF and performs an epigenomic analysis using gene annotation and ChIP-seq results. It shows how you can select high-confidence regions, use RefSeq annotations to find regions that overlap a gene, and count the mutations falling within the high-confidence regions. Finally, the third method combines GATK4’s Mutect2 pipeline with the second method, showing an epigenomic analysis from start (calling somatic variants) to finish (annotating variants). For any GMQL-specific questions or problems, you can visit the GMQL GitHub page.
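For readers who want to see the logic behind the distance-based join in the first method, here is a rough sketch in plain Python. It is not GMQL syntax and not how GMQL executes the query. The Region type, the function names, and the half-open coordinate convention are assumptions made for illustration, and GMQL's own definition of distance may differ.

```python
from collections import namedtuple

# Hypothetical region record; coordinates are assumed to be half-open [start, end).
Region = namedtuple("Region", ["chrom", "start", "end"])


def distance(a, b):
    """Gap in bases between two regions on the same chromosome (0 if they overlap or touch)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def join_within(left, right, max_distance=1000):
    """Return all (left, right) pairs on the same chromosome less than max_distance bases apart."""
    pairs = []
    for a in left:
        for b in right:
            if a.chrom == b.chrom and distance(a, b) < max_distance:
                pairs.append((a, b))
    return pairs
```

This nested loop compares every pair, which is fine for a toy example but would not scale to the thousands of datasets GMQL targets. That gap is the reason a dedicated, distributed query engine is useful here.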
Many thanks to Luca Nanni, Arif Canakoglu, Pietro Pinoli, and Stefano Ceri for putting together this workspace. It takes a lot of thought and effort to create a valuable learning resource like this, and we are still figuring out the most successful way to do this. Please share your thoughts in the Comments section below on the effectiveness of this workspace and any other Featured workspaces you try out. If you are interested in featuring examples of your methods in this way, please tell us here, and we can talk to you about the process.
More and more method developers are using the Method Repository to make their pipelines publicly accessible to the FireCloud community. By making the methods public, they allow other researchers to use them instead of building their own, similar methods. However, providing a method on its own, without a configuration or documentation, limits reusability. This post is about how Featured workspaces solved this problem for GATK4 methods and how an outside group will contribute the first third-party Featured workspace, demonstrating that any developer can do this.
Featured workspaces hold the latest version of a method, configured to work out of the box on an accompanying example dataset. This means you can launch the method without doing any setup, e.g., finding data or configuring pipelines. You can see the required inputs and configuration settings clearly, and once the method has run, check out all the outputs that it produces. When you are ready to launch it on your own data, all you need to do is replace the example dataset with your own, following the guidance in the docs. This takes the guesswork out of configuring a method on your own dataset. Altogether, these workspaces should make it easy to reuse methods.
We originally developed this “packaging” for methods with a group of GATK method developers we work closely with, to help people learn and test the most up-to-date GATK4 pipelines. These Featured workspaces went live once many of the GATK4 tools left beta status around January 2018 (GATK4 launch). People are interested in other pipelines besides GATK, and tomorrow we will announce a new Featured workspace put together by a team from Politecnico di Milano showcasing a different tool. Stay tuned!
Interested in putting together a workspace like this and having it featured? Let us know in this sign-up survey. We can walk you through the process we just went through with our friends from Politecnico di Milano.
We are planning system upgrades to address some of the recent stability issues that have affected the reliability of service in FireCloud. The upgrade process will cause a temporary interruption of service; we estimate the interruption may last up to 30 minutes. We do not yet have a specific time to announce; we expect it will be in the afternoon or evening (EST) of Sunday, May 27. We will post an update here when we are able to narrow down the window of time more precisely. Thank you for your patience while we work to improve the quality of service in FireCloud.
UPDATE: The issue described below has been resolved.
Due to an individual user's submission that amounts to a very large number of jobs (~60k), all new workflow submissions are currently being held in the queue (with status QueuedInCromwell). To be clear, as far as we can tell this is NOT a FireCloud malfunction; it seems to be a Google Cloud limitation that we are encountering for the first time. We are working with GCP support and evaluating options to unblock the queue, hopefully without interrupting that one very ambitious and totally legitimate submission. We will strive to resume normal workflow throughput by Monday morning EST.
We understand that this is causing many of you considerable inconvenience, yet we are hopeful that this case will provide an opportunity to push back the current limitations to the next level. Please remember that what we are all doing here, together, is blazing a new trail; building a new model for how we do science at scale, collaboratively. The fact that these scaling problems are arising at all demonstrates that we are on the right path, that the research community needs this level of scalability. And we will do everything in our power to deliver it.
Thank you for your patience and stay tuned for updates.
Broad Institute’s Genomics Platform and Data Sciences Platform announce the general availability of the FireCloud DataShuttle 0.1.1. The FireCloud DataShuttle allows users to easily browse files, download and upload data directly between their local drives and FireCloud workspaces or Google buckets, and monitor the status of these transfers.
The FireCloud DataShuttle was developed to facilitate the work of researchers and project managers who transfer a high volume of files and desire a more efficient and clearer process.