By Yossi Farjoun, Associate Director of computational research methods in the Data Sciences Platform

A note to explain the context of the new paper by Heng Li, myself, and others, “New synthetic-diploid benchmark for accurate variant calling evaluation,” available as a preprint on bioRxiv.

Developing new tools and algorithms for genome analysis relies heavily on the availability of so-called "truth sets" that are used to evaluate performance (accuracy, sensitivity, etc.). This has long been a sticking point, though recently the situation has improved dramatically with the availability of several public, high-quality truth sets such as Genome in a Bottle from NIST and Platinum Genomes from Illumina. Yet even these resources, produced through painstaking analysis and curation, are not immune to the lack of “orthogonality” that plagues most available truth sets: chief among their limitations is that the failure modes of Illumina sequencing are usually masked out, so the resulting data are biased towards the easier parts of the genome.
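To make the role of a truth set concrete, here is a toy sketch of what the evaluation boils down to: comparing a call set against the truth set and counting true positives, false negatives, and false positives. The variant tuples and the `evaluate` helper are hypothetical illustrations, not the paper's benchmarking code; real comparison tools must also handle genotype matching and variant-representation differences such as indel normalization.

```python
# Toy sketch of truth-set evaluation. Variants are (chrom, pos, ref, alt)
# tuples; a real evaluation also matches genotypes and normalizes indels.

def evaluate(calls, truth):
    """Return (sensitivity, precision) of a call set against a truth set."""
    calls, truth = set(calls), set(truth)
    tp = len(calls & truth)   # true positives: called and present in truth
    fn = len(truth - calls)   # false negatives: truth variants we missed
    fp = len(calls - truth)   # false positives: calls absent from truth
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, precision

# Hypothetical example data:
truth = [("chr1", 100, "A", "G"), ("chr1", 250, "T", "C"), ("chr2", 40, "G", "GA")]
calls = [("chr1", 100, "A", "G"), ("chr2", 40, "G", "GA"), ("chr2", 90, "C", "T")]
sens, prec = evaluate(calls, truth)
print(f"sensitivity={sens:.2f} precision={prec:.2f}")  # sensitivity=0.67 precision=0.67
```

The point of the rest of this post is that these numbers are only as meaningful as the truth set itself: if the truth set is biased towards the easy parts of the genome, so are the sensitivity and precision you compute from it.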

The paper linked above introduces a new dataset that we developed to be less biased. It is based solely on PacBio sequencing, so its error modes are less correlated with Illumina’s. Using this dataset for benchmarking has given us high confidence in the accuracy of our validations and has enabled us to improve our methods with less concern about overfitting.

Truth data (for germline DNA methods) tend to be derived from two sources: synthetic data (that is, computer-generated), or Illumina (and other) sequencing of a particular sample called NA12878. Both of these sources are deeply flawed and, ultimately, not good enough. First, it is virtually impossible to create synthetic data that truly resemble the results of sequencing actual biological tissue: the reference is an approximation, and the effects of sample extraction, library construction, and sequencing are very hard to model accurately.

As for NA12878, our biggest issue is that we simply love this sample too much! Nearly all of NA12878’s variants are present in our resource files (dbSNP, the training files for VQSR, etc.), so when we evaluate our methods’ performance on NA12878 we cannot really trust the results: we have been using the answer all along. Furthermore, the NIST and Platinum Genomes truth sets are each restricted to a subset of the genome that they consider the “confidence region”. This region is defined differently in the two datasets, but in both cases it depends on the performance of Illumina sequencing of NA12878 (among other things). This has the perverse effect that the results reflect performance only in the easier-to-sequence-and-analyze parts of the genome, falsely inflating our self-confidence and giving no blame or credit for performance in the harder regions.
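Mechanically, restricting an evaluation to a confidence region just means discarding any variant that falls outside the region's intervals before scoring. The sketch below illustrates that filtering step with hypothetical intervals (BED-style 0-based, half-open); it is not the actual tooling used with these truth sets.

```python
# Sketch of confining an evaluation to a "confidence region": variants outside
# the region's intervals are simply not scored. Intervals are half-open
# [start, end) with 0-based starts, as in BED files; positions are 1-based.

def in_confidence_region(chrom, pos, region):
    """True if a 1-based variant position lies in any interval for its contig."""
    return any(start < pos <= end for start, end in region.get(chrom, []))

# Hypothetical confidence region, keyed by contig:
region = {"chr1": [(0, 1000), (5000, 9000)], "chr2": [(100, 400)]}

calls = [("chr1", 100), ("chr1", 2000), ("chr2", 150)]
scored = [v for v in calls if in_confidence_region(*v, region)]
print(scored)  # [('chr1', 100), ('chr2', 150)]
```

The variant at chr1:2000 silently drops out of the evaluation. That is exactly the problem described above: when the region itself is defined by where Illumina sequencing of NA12878 performs well, the hard parts of the genome never get scored at all.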

The “synthetic diploid” (or, as we affectionately call it, SynDip) is generated from two human cell lines (CHM1 and CHM13, PacBio-sequenced and assembled by others) that were derived from complete hydatidiform moles. This rare and devastating condition results in a non-viable collection of cells that is almost entirely homozygous. The homozygosity makes the PacBio-based assemblies much more trustworthy: there are no heterozygous sites to confuse the assembler, so any ambiguity is almost certainly due to sequencing error and can be masked out. To make use of this, we aligned the CHM1 and CHM13 assemblies to the hg38 reference and created a VCF and a confidence region that characterize the variation a 50-50 mixture of the two cell lines would contain. We also sequenced and aligned such a 50-50 mixture using our WEx and WGS protocols on Illumina. In that regard, the name is admittedly misleading: the only “synthetic” part of SynDip is that it is synthetically diploid. In all other respects it is as natural as can be, since it was generated from live cells using regular sequencing protocols.
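The idea of a synthetic diploid can be sketched in a few lines: because each cell line is essentially homozygous, it contributes exactly one allele per site, and the expected genotype of the 50-50 mixture follows mechanically. The dictionaries and the `diploid_genotype` helper below are hypothetical illustrations; the paper's actual pipeline derives the truth VCF from whole-genome assemblies aligned to hg38, not from per-site dictionaries.

```python
# Sketch: the expected diploid genotype of a 50-50 CHM1/CHM13 mixture, derived
# from the two (essentially haploid) per-cell-line call sets. Dicts map
# (chrom, pos, ref) -> alt allele; a missing key means the reference allele.

def diploid_genotype(site, chm1, chm13):
    a1 = chm1.get(site)    # allele carried by CHM1 at this site (None = ref)
    a2 = chm13.get(site)   # allele carried by CHM13
    if a1 is None and a2 is None:
        return "0/0"       # both lines carry the reference allele
    if a1 == a2:
        return "1/1"       # same alt in both lines: homozygous alt
    if a1 is None or a2 is None:
        return "0/1"       # alt in only one line: heterozygous
    return "1/2"           # two different alts: heterozygous non-reference

# Hypothetical per-line variants:
chm1 = {("chr1", 100, "A"): "G", ("chr1", 300, "T"): "C"}
chm13 = {("chr1", 100, "A"): "G", ("chr1", 500, "G"): "A"}
for site in sorted(set(chm1) | set(chm13)):
    print(site, diploid_genotype(site, chm1, chm13))
```

Every heterozygous site in the mixture is thus backed by a confident homozygous call in one of the assemblies, which is what makes the resulting truth VCF trustworthy even without Illumina data.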

Since the CHM dataset was generated using PacBio data alone, with no consideration of the flaws of Illumina’s short-read technology, there should be little correlation between the failure modes of our methods on short-read data and SynDip’s confidence region. This gives us better, more trustworthy truth data. It removes much uncertainty, defusing our natural tendency to “look under the lamp” and to overfit our methods.

And beyond that, it empowers us to push our method development further by exposing large tracts of the reference where our methods (and not only ours!) do not perform well -- and provides us with a more truthful picture of what lies in those regions. Here are the main ways we have used this resource to that end:

  • Applying our filtering methods to the SynDip data revealed flaws in their performance; we have used those insights to design better filtering architectures and to fine-tune existing ones. (More on this in a future post.)
  • We have used the dataset to assess new variant calling methods for CNVs and SVs.
  • We have used it to compare different analysis pipelines and determine whether there’s a significant difference between them (e.g. What is the effect of running BQSR over and over again? Answer: Not much beyond the first run.)
  • We are currently using it to develop the next version of our joint-calling pipeline, which will be able to joint-call more than 100K genomes (!!!)

One thing the current CHM dataset doesn’t help us do is develop better lab methods: the CHM cell lines are not currently commercially available, so technology companies cannot test their new protocols and technologies on them. Hopefully this will eventually become possible, which would let us explore hard-to-sequence regions of the genome.

If you are a method developer, or you are in a position to evaluate the performance of various pipelines, we encourage you to check out the CHM dataset, and we hope it will help you develop new methods and pipelines! In the future we plan to share more data from the CHM cell lines and to make the methods we use for evaluation publicly available.


Fri 8 Dec 2017

Tiffany_at_Broad on 8 Dec 2017

Great article Yossi! I like the title too.

yfarjoun on 8 Dec 2017

Update: The paper is now published in Nature Methods (if you are having access problems, let us know).
