By Robert Title, Engineering Manager, Data Sciences Platform at the Broad Institute
We are excited to announce that we have substantially improved the way you interact with Jupyter Notebooks in FireCloud. We hope this will increase your productivity and empower you to collaborate more effectively. These changes are publicly available as of today, Sep 12, 2018, and can be accessed in the Notebooks tab of your FireCloud workspace. Read on for a more detailed description of what is changing, and why it is better.
By Moran Cabili, product manager, Data Sciences Platform at the Broad Institute
We heard from many of you --both new FireCloud users and experienced WDL pipeline developers-- that you need to be able to quickly test that a WDL workflow can be run successfully on FireCloud. Until now, you were required to reference an entity in the workspace data model, which took extra effort and tended to confuse newcomers. We are happy to announce that this speed bump has been eliminated; you can now bypass the data model and even upload a JSON of inputs to get your WDL up and running in record time.
Today we bring to you a new facet of our forum: the feature request section!
Within this new section, you will be able to suggest new features, upvote existing ones, and see the status of features we implement. We can better gauge the number of people interested in a particular feature by the vote count, which helps to determine which features we work on first. So, if you represent a group of people who all care about a certain feature, ask everyone to vote on your feature request!
Features can be requested in this category of the forum. Simply click the blue
New Idea button to start your thread. Be as clear as possible in your title so other users will be able to see what your feature is about and will be more likely to vote on it. If you see a feature you like, please upvote it so we know you'd like to see it implemented.
We've pre-populated this section with all the feature requests that have been posted in the last two months. You can search for your own feature requests by looking at discussions you've started. If we missed yours, or if there's a feature from further back than May, you can ask us to move it by tagging @KateN in the thread. Or, simply create a new request.
Take a peek at what's been posted and vote for what you want to see!
Cromwell version 32 was released last Thursday evening, June 7th, and if you saw the release notes, you may have wondered what we meant by “File read limits." For those who didn’t see the notes, we explained that this feature is improving Cromwell and thus FireCloud stability, but didn’t get into too much detail about it. In this post, I’ll explain how this will help stability by framing the problem, solution, and potential impacts.
Problem FireCloud will slow down or completely stop working if a user plugs in a large file into the WDL read_lines() call. The read_lines() call is frequently used to ingest a list of filenames, genomic intervals and things like that for scattering purposes. Users have accidentally set this to read in a bam file causing Cromwell’s memory to load up and thus slowing down the engine. This vulnerability means that one user can take down the service.
Solution File read limits is a Cromwell configuration option that limits what can be read in through a WDL read_lines() call. By establishing a lower limit, the system is safeguarded from being taken down by a single user. FireCloud’s Cromwell configuration now uses a file read limit of 1 MB. Limits are also used for other read_X functions and can be found here.
If you see an error message like this, you've been impacted: "File
Help If you find yourself blocked by these changes, please don't hesitate to reach out for help. We want to ensure that the benefit of making the system more stable outweighs any individual disruptions.
By Eric Weitz, software engineer, Data Sciences Platform at the Broad Institute
Have you heard of the FAIR principles? They are a set of guidelines proposed as part of a growing movement to make data more Findable, Accessible, Interoperable, and Reusable. As this movement gains traction, we are seeing more FAIR-related activities at major meetings and conferences. For example the recent Bio-IT World meeting in Boston included a conference track dedicated to FAIR, as well as a hackathon.
I was part of a team of four people from the Broad Institute's Data Sciences Platform that participated in the Bio-IT hackathon. Our goal: make data more FAIR in Single Cell Portal, which is built on top of FireCloud. In addition to improving the Single Cell Portal’s scientific data management, the hackathon also gave our team a chance to work with developers from other organizations in a manner that was uniquely nimble.
We are excited to introduce a new Featured workspace that demonstrates the GenoMetric Query Language (GMQL) created by a team from Politecnico di Milano in Italy. For some context on Featured workspaces, please read our previous blog post.
GMQL is a high-level, declarative language supporting queries over thousands of heterogeneous datasets and samples; as such, it enables genomic “big data” analysis. Based on Hadoop framework and the Apache Spark platform, GMQL is designed to be highly scalable, flexible, and simple to use. You can try the system here through its several interfaces, with documentation and biological query examples on ENCODE, TCGA and other public datasets or clone the Featured workspace and launch an example analysis.
The GMQL 101 workspace features three methods, each with increasing levels of complexity to give you a taste of how the query language works. One method shows how to join two datasets, and then extracts a third dataset based on a specific condition: pairs of regions that are less than 1000 bases a part. The second method takes a VCF and performs an epigenomic analysis using gene annotation and Chip-Seq results. It shows how you can select high confidence regions, use RefSeq annotations to find regions that overlap a gene, and count the mutations falling within the high confidence regions. Finally, the third method is a combination of GATK4’s Mutect 2 pipeline and the second method, showing an epigenomic analysis from start (calling somatic variants) to finish (annotating variants). For any GMQL-specific questions or problems you can visit the GMQL GitHub page.
Many thanks to Luca Nanni, Arif Canakoglu, Pietro Pinoli, and Stefano Ceri for putting together this workspace. It takes a lot of thought and effort to create a valuable learning resource like this, and we are still figuring out the most successful way to do this. Please share your thoughts in the Comments section below on the effectiveness of this workspace and any other Featured workspaces you try out. If you are interested in featuring examples of your methods in this way, please tell us here, and we can talk to you about the process.