Machine learning, or ML, is one of the hottest buzzwords (buzzphrases?) in genomics today, along with data science, artificial intelligence (AI) and deep learning (DL). And as with all good buzzwords, it's very unfashionable to admit that you don't know exactly what they mean. So here's an intro-level overview of these terms and where they fit in the GATK world. If after reading this you find yourself craving more substance about the exciting new ML methods being developed in GATK4, don't despair -- we plan to follow up next week with a more detailed post written by Lee Lichtenstein, GATK's leader of somatic computational method development and all-around data science nerd.

"Data Science is statistics on a Mac." -- @BigDataBorat

At a high level, data science is the overall discipline that deals, among other things, with building models in order to make statements and predictions about data and what it represents. Within that context, machine learning and statistics can be seen as two subfields of data science that use similar tools but have different goals and strategies, as explained in this article. One of the key differentiators is that statisticians focus more on the theoretical underpinnings of the modeling process, while practitioners of machine learning put more emphasis on "making it work", i.e. generating accurate predictions from models fit to training data, as well as ensuring computational efficiency and scalability to very large datasets.

I’ve decided that I’m cool with describing today’s tech as Artificially Intelligent, provided we all agree that we’re talking about ‘intelligence,’ like bunnies have for finding a way around my garden fence to eat my kale. ‘Clever bunny’ levels of intelligence. That’s 2018. -- Chris Dwan @fdmts

So where do artificial intelligence and deep learning fit in? In a nutshell, artificial intelligence, i.e. the ability of machines to make smart decisions without being given step-by-step instructions from a human, is the end goal of all machine learning. Meanwhile, deep learning is a subfield of machine learning that uses techniques based on "neural nets", a type of algorithm that mimics neural pathways in animal brains. Deep learning has been around for a long time, but until recently, neural nets were too computationally intensive to tackle anything more than toy problems. Now, thanks to recent technological developments, they can tackle much bigger problems, and have become intensely popular as a way to pursue artificial intelligence.

Machine learning in the GATK

Now let's have a look at how these terms apply to GATK methods. Our analytical tools all make use of some kind of statistical technique or combination of techniques; I dare say if you take a stroll through our documentation, you're very likely to come across some Bayesian formulas. For a subset of our tools, the key algorithms belong to the machine learning family. There is sometimes a bit of debate around what exactly pushes a particular technique over the fence into ML territory (try googling "is pca machine learning"), but my rule of thumb is that if I get a real whopper of a headache the first time I try to understand it, it's probably machine learning.

Classic GATK machine learning methods that have been around since the early days of GATK include base recalibration (BQSR) and variant recalibration (VQSR). In the case of VQSR, the core algorithm is a Gaussian mixture model that aims to classify variants based on how their annotation values cluster, given a training set of high-confidence variants. It has long been the method of choice for variant filtering in our Best Practices pipeline for germline short variant discovery, despite many shortcomings (including easily violated assumptions, unhelpful error messages and an insatiable appetite for more data). After exploring various alternatives over the years, we have finally nailed down a new approach based on deep learning that we expect will replace VQSR in our Best Practices pipeline within the coming months.
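To make the Gaussian mixture idea a little more concrete, here is a minimal sketch in plain Python. It is not the VQSR implementation: it fits a two-component mixture to a single made-up annotation with a bare-bones EM loop, then scores candidates by their log-likelihood under the fitted model. The annotation values, the function names (`fit_gmm_1d`, `log_likelihood`) and the choice of two components are all assumptions for illustration; the real tool works on several annotations at once.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(values, n_components=2, n_iter=50):
    """Fit a 1-D Gaussian mixture with a bare-bones EM loop (illustrative only)."""
    n = len(values)
    ordered = sorted(values)
    # Start the component means at evenly spaced quantiles of the data.
    means = [ordered[(2 * k + 1) * n // (2 * n_components)] for k in range(n_components)]
    overall_mean = sum(values) / n
    overall_var = sum((v - overall_mean) ** 2 for v in values) / n
    variances = [overall_var] * n_components
    weights = [1.0 / n_components] * n_components

    for _ in range(n_iter):
        # E-step: how responsible is each component for each data point?
        resp = []
        for x in values:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances from the responsibilities.
        for k in range(n_components):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            spread = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(spread, 1e-6)  # floor to avoid a collapsed component
    return weights, means, variances


def log_likelihood(x, model):
    """Log-density of one annotation value under the fitted mixture."""
    weights, means, variances = model
    return math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))


# Made-up "training set": annotation values for high-confidence variants,
# drawn from two clusters. These numbers are hypothetical, not real GATK data.
rng = random.Random(0)
training = [rng.gauss(15.0, 2.0) for _ in range(200)] + [rng.gauss(25.0, 3.0) for _ in range(200)]
model = fit_gmm_1d(training)

typical_score = log_likelihood(16.0, model)  # sits inside a training cluster
outlier_score = log_likelihood(2.0, model)   # far from anything seen in training
print(f"typical candidate: {typical_score:.2f}, outlier candidate: {outlier_score:.2f}")
```

A candidate whose annotation value falls inside one of the training clusters gets a much higher score than one that sits far away from them, and a filter can then be set as a threshold on that score. In practice you would reach for a library implementation such as scikit-learn's `GaussianMixture` rather than hand-rolling EM.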

This new deep-learning-based approach uses two-dimensional convolutional neural nets (2D CNN) to classify variant candidates coming out of the variant calling pipeline, with the intent of making it a drop-in replacement for VQSR. The tools involved are still in beta-stage development (publicly accessible in GATK4 but not yet "blessed" for production use), but in our tests the new method outperforms VQSR significantly, delivering greater precision without reducing sensitivity. Specifically, for indel calling on a single genome, the 2D CNN approach improves accuracy by up to ~30% over VQSR. This huge effect on single-sample indel calling is great news because while germline SNPs are largely a solved problem, indels have long remained a thorn in everyone's side due to the prevalence of technical artifacts that are difficult to filter out without losing real indels. The ability to retain sensitivity while eliminating these artifacts represents an important improvement for anyone who needs to prioritize variants of possible clinical importance. So we really look forward to bringing this approach to maturity in the near future.
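For readers who want the difference between precision and sensitivity spelled out, here is a small sketch of the arithmetic. The counts are made up purely for illustration and are not GATK benchmark results; `filter_a` and `filter_b` are hypothetical names for two filtering methods being compared.

```python
def precision(true_pos, false_pos):
    """Fraction of the calls kept by a filter that are real variants."""
    return true_pos / (true_pos + false_pos)


def sensitivity(true_pos, false_neg):
    """Fraction of the real variants that survive the filter (also called recall)."""
    return true_pos / (true_pos + false_neg)


# Hypothetical counts: both filters keep the same number of real indels,
# but filter_b lets through half as many artifacts.
filter_a = {"true_pos": 900, "false_pos": 300, "false_neg": 100}
filter_b = {"true_pos": 900, "false_pos": 150, "false_neg": 100}

for name, counts in (("filter_a", filter_a), ("filter_b", filter_b)):
    p = precision(counts["true_pos"], counts["false_pos"])
    s = sensitivity(counts["true_pos"], counts["false_neg"])
    print(f"{name}: precision={p:.3f} sensitivity={s:.3f}")
```

In this toy comparison the second filter has higher precision at identical sensitivity, which is the pattern described above: fewer artifacts slip through, and no additional real variants are lost.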

That being said, new advances in ML-based approaches in genomics are not limited to deep learning. We have a brand new germline copy number variant discovery (gCNV) pipeline in GATK4, currently in beta status, that uses a type of machine learning algorithm called a Probabilistic Graphical Model (PGM) to call germline CNVs on either a single sample or a cohort of samples. It significantly outperforms established methods in the field, in terms of both accuracy and computational scalability.

"Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician" -- Josh Wills @josh_wills

As we build out the portfolio of GATK4 tools and pipelines, we continually re-evaluate our approaches based on the research needs that we're responsible for satisfying as well as the latest advances in the field of data science, in order to select the right tooling to tackle each problem. That takes a lot of exploratory research, and the best asset we have to help us in that endeavor is the amazing diversity of the GATK developers' professional backgrounds. Supported by a small pocket of "traditional" software engineers who build out the GATK's engine core, the GATK development team is a conglomeration of ~30 data scientists assembled from all corners of data science. A few were already in the domain of biological sciences, though rarely genomics; but most came from other disciplines, including high-energy physics, astronomy, mathematics and digital art. This produces a team dynamic that values fresh perspectives and cultivates the kind of novel thinking that we need to keep pushing the boundaries of genomic science.

Finally, if I've learned anything about ML from watching my GATK colleagues work their magic, it's that it's one thing to develop an algorithm that produces the answer you want, and quite another to make it usable by others. Most of the new machine learning work we are doing in GATK4 relies on open-source libraries that are widely used in the data science ecosystem, like PyMC3/Theano, Keras/TensorFlow and scikit-learn, which come with all sorts of requirements and dependencies. So a lot of work goes into implementing these algorithms as tools that can be readily deployed at scale in a variety of environments -- not the most glamorous side of data science, but vital to the process of turning cool ideas into impactful technology.
