Part 3 of a series on the theme of synthetic data | Start with part 1: Fake your data -- for Science! and part 2: An exercise in reproducibility (and frustration)

Ever since our original foray into synthetic data creation, I've been looking for an opportunity to follow up on that project. It is absolutely obvious to me that there is a huge unmet need for researcher-friendly synthetic sequence data resources, i.e. generic synthetic datasets you can use off the shelf, plus user-friendly tooling to generate customized datasets on demand. I'm also fairly confident that it wouldn't actually be very hard, technically speaking, to start addressing that need. The catch though is that this sort of resource generation work is not part of my remit, beyond what is immediately useful for education, frontline support and outreach purposes. Despite a surprisingly common misconception, I don't run the GATK development team! (I just blog about it a lot)

So when the irrepressible Ben Busby reached out to ask if we wanted to participate in the FAIR Data Hackathon at BioIT World, I recruited a few colleagues and signed us up to do a hackathon project called Bringing the Power of Synthetic Data Generation to the Masses. The overall mission: turn the synthetic data tooling we had developed for the ASHG workshop into a proper community resource.

If you're not familiar with FAIR, it's a set of principles and protocols for making research resources more Findable, Accessible, Interoperable and Reusable. Look it up, it's important.

Goals & methods

At this point, experienced hackathoners are laughing their heads off. The truth is, a two-day hackathon doesn't usually afford you enough time to produce a Real, Functional, Substantial piece of work. Even if you're super prepared (which we were, thanks to the efforts of key team members -- shoutout to my colleagues Adelaide Rhodes, Allie Hajian and Anton Kovalsky), the goal is usually to build a prototype as a proof of concept, not a finished product. So it's with that outlook that we defined four buckets of work for the project: 1) scoping out the community's needs, 2) adding functionality to our existing tooling, 3) optimizing the implementation for cost and runtime efficiency, and 4) developing quality control approaches. You can read more about how we defined and split up the work in the project's README on GitHub.

For the computational parts of the work, which involved both batch workflows (aka pipelines) and Jupyter notebooks, we used Terra, the Broad's cloud-based analysis platform, with a supporting grant from Google EDU (in the form of Google Cloud credits) to cover compute and storage costs, which are billed directly by Google. You can check out the public workspace we put together for the project here. It contains the cohort of 100 synthetic exomes we had previously created, the workflows used to generate them, and a workflow and prototype notebook for collecting and analyzing sequence quality control metrics. All the code is also in the GitHub repository, but the nice thing about the Terra workspace is that you can see how the code gets applied to the data, and you have the option to clone it and run or modify as much of it as you like. Terra itself is completely free and open to all, and every new account comes with $300 in Google credits, so you can try it out and really kick the tires of the project.


So how did it go, you ask? We ended up with a team of 12 hackathoners from various backgrounds including publishing, data science and software engineering, and that was really a great mix given our objectives. We had plenty of work for both coding and non-coding types! Since we had a robust outline of what we wanted to do, we were able to get started fairly quickly; yet our plans were flexible enough to incorporate ideas and suggestions from the non-Broadies who joined our team with their own perspectives. That really enriched the experience and made the end results better.

Speaking of which, our team ultimately made progress on all four of the fronts we had planned to tackle, as you can see in the summary report presentation. Given the level of interest that has bubbled up around this project, we're planning to write it up in more detail in a white paper in the near future.

Impressions & next steps

I was really pleased by the overall positive response to the synthetic data generation approach we presented. It's something we gloss over in the workshops where we use the dataset we originally created, so I had some trepidation about how it would be received by an audience that is perhaps predisposed to examine this sort of thing more thoroughly. There are definitely some big outstanding questions about where we go from here in terms of generating larger cohorts, which will require generating "fake people" (as opposed to the "real people, fake data" approach we used as a convenient hack), and how far we can push the realism of the synthetic data (e.g. can we model quality fluctuations in the low-confidence intervals from Genome in a Bottle?). At the risk of sounding like a bandwagon-jumper, I suspect the answers to both questions lie in machine learning approaches. I would love to see if we can get to the point where we can feed a database of human variation like gnomAD to an ML algorithm that spits out novel realistic synthetic VCFs on demand, with population-appropriate profiles. Probably more a question of when than if, in fact. Similarly, I would be surprised (mildly shocked, even?) if there was not already work being done to use ML techniques to improve the realism of read data simulation software, particularly with regard to different sequencing technologies but also in relation to regions of the genome that can be more or less problematic.
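To make the "fake people" idea concrete: even before bringing in any ML, you can generate a crude synthetic cohort by sampling diploid genotypes from published population allele frequencies (the kind of numbers a resource like gnomAD provides) under Hardy-Weinberg equilibrium. The sketch below is purely illustrative and is not part of our tooling; the variant records and sample names are made up, and a realistic generator would also need haplotype structure, linkage between nearby sites, and population-specific frequencies.

```python
import random

# Hypothetical variant records: (chrom, pos, ref, alt, alt allele frequency).
# In a real pipeline these would be pulled from a population database.
VARIANTS = [
    ("chr1", 10177, "A", "AC", 0.40),
    ("chr1", 54490, "G", "A", 0.11),
    ("chr2", 10352, "T", "TA", 0.44),
]

def sample_genotype(alt_af: float, rng: random.Random) -> str:
    """Draw a diploid genotype assuming Hardy-Weinberg equilibrium:
    each of the two alleles is independently ALT with probability alt_af."""
    alleles = sorted(int(rng.random() < alt_af) for _ in range(2))
    return f"{alleles[0]}/{alleles[1]}"  # e.g. "0/1"

def synthetic_vcf(sample_names, seed=42) -> str:
    """Emit a minimal VCF-formatted string for a synthetic cohort."""
    rng = random.Random(seed)
    lines = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t"
        + "\t".join(sample_names),
    ]
    for chrom, pos, ref, alt, af in VARIANTS:
        genotypes = [sample_genotype(af, rng) for _ in sample_names]
        lines.append(
            f"{chrom}\t{pos}\t.\t{ref}\t{alt}\t.\tPASS\tAF={af}\tGT\t"
            + "\t".join(genotypes)
        )
    return "\n".join(lines)

if __name__ == "__main__":
    print(synthetic_vcf(["SYN001", "SYN002"]))
```

The ML angle would essentially replace the independent per-site sampling above with a model that has learned realistic correlations between variants from real cohorts.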

Going forward, my hope is that we can nucleate a community-driven effort to pursue this work at a larger scale, i.e. move beyond the prototype stage. I'm confident that together we can build valuable resources that enable developers, researchers and educators to leverage synthetic sequence data for testing, collaboration and teaching. If you're interested in contributing, please leave a comment on this post or email me at

My heartfelt thanks to all the Broadies who contributed to this project as well as our hackathon friends Ernesto Andrianantoandro, Dan Rozelle, Jay Moore, Rory Davidson, Roma Kurilov and Vrinda Pareek!
