Part 2 of a series on the theme of synthetic data | Start with part 1: Fake your data -- for Science!

Last year my team and I collaborated with a genetics researcher, Dr. Matthieu Miossec, to reproduce an analysis from a paper he had co-authored, as a case study for a workshop. The paper reported variants associated with a particular type of congenital heart disease called Tetralogy of Fallot; it's a pretty classic example of a germline analysis that uses GATK for variant calling. Between the information provided in the preprint and Matthieu's generous cooperation, we had everything we needed to reimplement the analysis in a more portable, fully shareable form… except the data, because, as with many human genetics studies, the original exome data was private.

In illo tempore, with the exuberance of youth filling my sails, I thought it should be simple enough to create some synthetic data to stand in for the original. I am wiser now, and grayer.


We originally explored using exome BAMs from the 1000 Genomes Project and simply spiking the variants of interest into a subset of them to create semi-synthetic case samples. That proved more difficult than expected, in part because of the heterogeneity of the 1000 Genomes data (with respect to coverage, target territory, quality and so on). We ran into a lot of validation errors, and the mutation-spiking package we wanted to use (Bamsurgeon) was clearly not happy with that data.
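
To make that concrete, here is a minimal sketch of what the spiking step might look like with Bamsurgeon's addsnv.py, driven from Python. The file names are hypothetical and the flags are paraphrased from Bamsurgeon's documentation, so treat this as an illustration of the idea rather than our actual commands.

```python
# Minimal sketch (not our actual pipeline): spike a known variant into an
# existing exome BAM with Bamsurgeon's addsnv.py. Flag names and the
# variant-file format should be checked against the Bamsurgeon version you install.
import subprocess

variant_file = "tof_variants.bed"            # hypothetical: chrom, start, end, allele fraction
input_bam = "sample_from_1000G.exome.bam"    # hypothetical 1000 Genomes exome BAM
reference = "Homo_sapiens_assembly38.fasta"  # reference the BAM was aligned to
output_bam = "sample_from_1000G.case.bam"    # semi-synthetic "case" BAM

subprocess.run(
    [
        "addsnv.py",
        "-v", variant_file,  # variants to spike in
        "-f", input_bam,     # BAM to modify
        "-r", reference,     # reference FASTA
        "-o", output_bam,    # modified BAM with the spiked-in variants
    ],
    check=True,
)
```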

After banging our heads against that wall for a bit, we decided to just synthesize all the data from scratch using a read simulator, NEAT-genreads. Or nearly from scratch; NEAT-genreads can use a VCF as a starting point for generating the data, so we chose to use the VCFs from the 1000 Genomes Project. Which, if you think about it, amounts to creating fake BAMs representing real people's exomes... Then we spiked the variants of interest into a random subset of the fake exomes to create synthetic case samples. We also spiked a synonymous mutation into the controls to ensure all samples went through the same processing steps and software.
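
For the curious, here is a rough sketch of how that generation step can be wired up, assuming NEAT-genreads' genReads.py and a 1000 Genomes seed VCF. The flag names are paraphrased from the NEAT-genreads README and the file names are made up, so check everything against the version you actually run. Spiking variants into the resulting fake BAMs then follows the same pattern as the Bamsurgeon sketch above.

```python
# Sketch of the "fully synthetic" route (hypothetical names and flags):
# simulate an exome BAM per sample with NEAT-genreads, seeded with a
# 1000 Genomes VCF, then randomly split the samples into cases and controls.
import random
import subprocess

def simulate_exome(sample_id: str, seed_vcf: str) -> None:
    """Simulate paired-end exome reads for one synthetic sample."""
    subprocess.run(
        [
            "python", "genReads.py",
            "-r", "Homo_sapiens_assembly38.fasta",  # reference FASTA
            "-v", seed_vcf,                         # seed variants from 1000 Genomes
            "-R", "151",                            # read length
            "-c", "50",                             # mean coverage
            "--pe", "300", "30",                    # fragment length mean / std dev
            "--bam", "--vcf",                       # emit a BAM plus a truth VCF
            "-o", sample_id,                        # output prefix
        ],
        check=True,
    )

sample_ids = [f"synthetic_{i:04d}" for i in range(500)]
for sample_id in sample_ids:
    simulate_exome(sample_id, "1kg_exome_subset.vcf")  # hypothetical seed VCF

# Randomly pick which fake exomes become cases; the rest become controls.
random.shuffle(sample_ids)
cases, controls = sample_ids[:250], sample_ids[250:]
```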

We didn't make quite as many synthetic exomes as we'd originally planned; the largest set we made was 500 exomes in total (cases + controls), out of the 2,000 intended to match the paper. But that was enough to build a working proof of concept for reproducing the paper, which was all we needed for the workshop. And yes, the reimplemented analysis pulled out the expected mutations that we had spiked in. Success!

So it worked. What am I whining about? Well, all in all this took a lot more effort than I had originally envisioned. We had to chain together multiple tools, and unsurprisingly (this is bioinformatics) some of them had "creative differences" with each other regarding format requirements and what they deemed reasonable to produce as output. And once we had the basic commands working end to end, we had to write workflows to automate the process; since we wanted to generate hundreds of samples, there was no way running individual command lines manually was going to cut it.
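
To give a flavor of what that automation looks like, here is a toy sketch of fanning out the per-sample work in parallel from Python. In the actual project this was expressed as proper workflows rather than a local script, but the shape is the same: one independent job per sample, scattered across hundreds of samples.

```python
# Toy sketch of scattering per-sample work; threads are fine here because the
# real work happens in external tool processes (simulate -> spike -> validate).
from concurrent.futures import ThreadPoolExecutor, as_completed

def build_sample(sample_id: str) -> str:
    # Placeholder for the per-sample chain; in practice each step shells out
    # to the tools shown in the sketches above.
    return sample_id

sample_ids = [f"synthetic_{i:04d}" for i in range(500)]
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(build_sample, s): s for s in sample_ids}
    for future in as_completed(futures):
        print(f"finished {future.result()}")
```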

Needless to say (and I do say this with the utmost respect), none of the development work involved in that project would be reasonable to expect from your average biomedical researcher.

Next up: Hacking a community resource (prototype) for synthetic data


You can read more about the original project here; we presented it at the ASHG meeting in October and have used it in several other workshops since. It's been quite useful for teaching purposes, so we make it available in an example workspace in Terra that anyone can use.

Terra is a cloud analysis platform developed at the Broad Institute (née FireCloud). It is currently in an open beta phase; it is freely accessible to everyone, and all new accounts are provided with $300 in free credits to spend on Google Cloud compute and storage.

