I'm delighted to introduce the first major version update to GATK4, version 4.1! This release includes several exciting new analysis pipelines and tons of improvements to existing tools, many of which are now officially out of beta (YAY!).
You can check out the full release notes on GitHub to get a sense of the scale of this release, but fair warning, it's a lot. In fact, we felt there was far too much in this release to even give a satisfying overview in a single blog post, so we decided to develop a series of nine blog posts that each cover one of the main functional areas of improvement. The table below lists the nine posts along with a short summary for each. Each blog post was written by the lead developer(s) on that project; it outlines the history of the challenge at hand, the approach that they developed to solve it, and future development prospects.
We plan to publish two posts per week starting tomorrow, so keep an eye out for them, subscribe to forum notifications or follow @gatk_dev on Twitter! We'll add links to the table as the posts become available.
And now without further ado I present to you GATK4.1!!!
I can't quite believe it's been a full year since we released GATK4! The tools have evolved a lot since then, and as a matter of fact we're due for another major version release very soon indeed. So we're going to be talking a lot more about that over the next few weeks -- specifically, the GATK developers are going to tell you all about their latest work in a series of guest posts on this very blog.
But before we get to all that cool new stuff, I want to take a moment to introduce you to the wonderful people who most recently joined us (the DSP* Communications Team) in our mission to help all of you make effective use of our tools in your work.
*DSP = Data Sciences Platform of the Broad Institute
PathSeq is a computational pathogen discovery pipeline in the Genome Analysis Toolkit (GATK) for detecting microbial organisms from short-read deep sequencing of a host organism, such as human. The pipeline detects microbial reads in the host organism by performing read quality filtering, subtracting reads derived from the host, aligning the remaining (non-host) reads to a reference microbe genome, and finally generating a table of the detected microbial organisms. The GATK version improves on the previous version of the pipeline by incorporating faster computational approaches, broadening the use cases of the pipeline, and integrating the pipeline into GATK's Apache Spark framework, which enables parallelized data processing (Mark et al., 2018). Our documentation covers how to use PathSeq in detail, but I have a particularly intriguing story to share about how I used the PathSeq workflow in FireCloud to quickly identify the cause of mysteriously low sequencing alignment rates.
I first heard about this specific problem when a project manager in the sequencing lab told me that they were seeing low alignment rates on multiple samples from the same project, and asked if I could help. We normally see alignment rates (as reported from Picard’s CollectAlignmentSummaryMetrics) above 99%, but this cohort of samples was producing rates between 60% and 95%, requiring the lab to sequence more in order to reach the agreed-upon coverage for the project (which doesn’t include unaligned reads, of course).
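To put rough numbers on why low alignment rates mean more sequencing, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that coverage scales linearly with the number of aligned reads and ignores duplicates and other losses; the function name is made up for this example.

```python
def sequencing_multiplier(alignment_rate: float) -> float:
    """Factor by which raw sequencing must grow so that aligned reads
    alone reach the target coverage (assumes coverage scales linearly
    with the number of aligned reads; an illustrative simplification)."""
    if not 0.0 < alignment_rate <= 1.0:
        raise ValueError("alignment_rate must be in (0, 1]")
    return 1.0 / alignment_rate


# Compare a typical rate with the two ends of the range seen in this cohort.
for rate in (0.99, 0.95, 0.60):
    factor = sequencing_multiplier(rate)
    print(f"{rate:.0%} aligned -> about {factor:.2f}x the raw sequencing")
```

Under that simplification, a sample at the low end of the range needs roughly two thirds more raw sequencing than one aligning at the usual rate.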
I suspected bacterial contamination because, on manual inspection, the unaligned reads did not look artifactual (for example, their sequences looked fairly random rather than all being identical). To approach this problem, I used the new GATK4 PathSeq Workflow (publication, how-to tutorial) and a small Python script. In this document I’ll walk you through how I used PathSeq on FireCloud, using workflows and the beta “Notebooks” feature, to quickly determine that the unaligned reads all belong to a single bacterial genus, Burkholderia.
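The script itself isn't reproduced here. As a rough sketch of the kind of tally such a script might perform, the snippet below ranks taxa by read count from a tab-delimited table of detected organisms. The column names (`name`, `type`, `reads`), the function name and the toy numbers are all assumptions made for illustration, not actual PathSeq output or results from this project.

```python
import csv
import io


def top_taxa(table_text, rank="genus", n=3):
    """Return up to n (name, reads) pairs with the most reads at one rank.

    Assumes a tab-delimited table whose header includes the hypothetical
    columns 'name', 'type' (taxonomic rank) and 'reads'.
    """
    rows = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    hits = [(r["name"], float(r["reads"])) for r in rows if r["type"] == rank]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)[:n]


# Toy table with invented counts, only to show the shape of the input.
TOY_TABLE = "\n".join([
    "name\ttype\treads",
    "Burkholderia\tgenus\t900",
    "other_genus\tgenus\t12",
    "some_species\tspecies\t400",
])

for name, reads in top_taxa(TOY_TABLE):
    print(f"{name}\t{reads:.0f}")
```

A single genus dominating a ranking like this is the sort of signal that would point to one contaminant rather than a mix of sources.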
PathSeq Data Bundle and Documentation:
We're aware that there is an unusually large amount of spam flooding the forum right now and are working with our host to address this issue. We may need to restrict access on a temporary basis. Our deepest apologies for any inconvenience this causes you and thank you for your patience!
Dear GATK users,
I am writing this blog post to let you know it has been a wonderful 4+ years working on the forum and answering your questions. Thank you all for giving me a job for all this time and keeping me entertained :) I enjoyed learning all about the GATK while answering your questions. But the time has come for me to move on and learn some new skills.
I am not leaving the forum unattended, however. We have a new team member joining, along with a few other people who will be able to help you all out. There will be another blog post soon to introduce the new team members.
I trust these new team members will do a great job, and I wish you all the best in your journeys.
We've had a very active workshop season so far, and just because it's almost summer doesn't mean we're slowing down. Later this month we'll be at the GCC/BOSC conference in Portland, OR, teaching a 2.5 hr GATK4 workshop, as well as assisting colleagues who are teaching a WDL pipelining workshop. There's still some space open so register now if you'd like to join us!
In July we'll be in Cambridge, UK to teach our now-classic 4-day workshop; it's fully booked at this point but there's a waitlist you can add yourself to here. Even if you don't get in, it tells us how many people would have liked to attend but couldn't, and that helps us determine how many more workshops we need to organize and where.
In September we'll be teaching the same 4-day workshop formula in Seville, Spain, augmented with a 5th day on variant interpretation taught by the host institution. Registration for this workshop just opened here.
As always, there will be more -- and if you're interested in hosting us at your institution, just let me know in the comments or over private message.
Machine learning or ML is one of the hottest buzzwords (buzzphrases?) in genomics today, along with data science, artificial intelligence (AI) and deep learning (DL). And as with all good buzzwords, it's very unfashionable to admit that you don't know exactly what they mean. So here's an intro-level overview of these terms and where they fit in the GATK world. If after reading this you find yourself craving more substance about the exciting new ML methods being developed in GATK4, don't despair -- we plan to follow up next week with a more detailed post written by Lee Lichtenstein, GATK's leader of somatic computational method development and all-around data science nerd.
It's finally Spring in Boston; the trees are sprouting leaves again, everything is turning green and gloriously alive -- and Bio-IT World is starting, which makes it official! Many of you may not know or care about Bio-IT, since it's more a biotech trade show than a scientific meeting, but for us it has become a springtime tradition to announce important developments there. These announcements have often focused on strategic/roadmap level plans -- for example, that's where, to a standing ovation (whoo!), we broke the news last year that GATK4 would be fully open-source -- but this year we're in a position to talk about the new capabilities we're actually delivering, and that feels really good. To quote the inevitable Steve Jobs, real artists ship, and boy are we shipping.
We have two major themes that we're developing this year: (1) democratization of the Best Practices pipelines, which includes everything from increasing access to ease of deployment, standardization and optimization for cost and speed; and (2) application of machine learning to improve accuracy and scalability in established pipelines as well as tackle new areas like germline CNV discovery.
If I believed in fate, I would say she has a twisted sense of humor. As you may know, today (April 25) was National DNA Day in the USA. And as it happens, on this very day, our friends and colleagues in the Broad's Genomics Platform hit a major milestone: they sequenced their 100,000th human whole genome! How exciting, right? And a complete coincidence!
But that's not the twist. The twist is that, knowing this was going to happen (because the Genomics Platform is a well-oiled machine), they had planned a celebratory livestream of the preparation and loading of the sequencer... only for the broadcast to fail because another sequencer flooded the bandwidth when it started uploading data to its mothership! Yep, even on DNA Day there can be such a thing as too much DNA.
Fortunately someone recorded the whole event on a local laptop so we can still enjoy all ~20 minutes of Eric Lander, Stacey Gabriel, Sheila Dodge and Andy Hollinger discussing the nature and implications of this milestone while Erin LaRoche does the actual work of loading the sequencer (NovaSeq in action!) with samples from the Gabriella Miller Kids First project, including lucky number 100,000.
So click here to view the full video on Facebook and find out what the Broad will sequence once we run out of people.... Then head over to reddit/science for Aviv Regev's "Ask Me Anything" session on Thursday April 26.
As I type this, four of my colleagues are kicking off a workshop in Taipei, hosted by Taiwan's National Applied Research Laboratories (NARLabs). For the next four days, they will be leading a cohort of ~40 participants through everything from core concepts to practical details involved in understanding and applying the GATK Best Practices for variant discovery. They will alternate between lectures and hands-on exercises, covering all current GATK workflows (germline and somatic short variants, plus somatic CNVs) as well as the pipelining systems that we recommend. And for the first time they'll be doing all of it with GATK4. Whoo! We've been working on this new formula and the updated materials for several months so as far as I'm concerned it's the most exciting event that I'm unable to attend this year...
But this is only the first of many workshops! In a few weeks, another workshop crew will be traveling to Montreal to deliver the same training course at McGill University. In April, I'll be taking a crew to Beijing, hosted by Intel China, which I'm really looking forward to given how much is happening in the genomics space over there; among other cool prospects I'm thrilled that we're going to get to try out the brand new Cromwell backend developed by Alibaba Cloud for running GATK pipelines in China. We're also working on scheduling another workshop in early May in Qatar, and we already have a couple of others scheduled in July (Cambridge UK) and September (Seville, Spain). Possibly others TBD, stay tuned...
See Events calendar for full list and dates