Part 1 of a series on the theme of synthetic data

Don't get me wrong, I'm not suddenly advocating for fraudulent research. What I'm talking about is creating synthetic sequence data for testing pipelines, sharing tools and generally increasing the computational reproducibility of published studies, so that we can all more easily build on each other's work.

As far as I can tell, most of the effort around computational reproducibility has so far focused on better ways to share and run code. The results have been great -- it's been transformative to see the community adopt tooling like version control, containers and Jupyter notebooks. Yet you can give me all the containers and notebooks in the world; if I don't have appropriate data to run that code on, none of it helps me.

Most of the genomic data that gets generated for human biomedical research is subject to very strict access restrictions. These protections exist for good reason, but on the downside, they make it much harder to train researchers in key methodologies until after they have been granted access to specific datasets -- if they can get access at all. There are certainly open datasets like 1000 Genomes and ENCODE that can be used beyond their original research purposes for some types of training and testing. However, they don't cover the full range of what is needed in the field in terms of technical characteristics (e.g. exome vs. WGS, depth of coverage, number of samples for scale testing, etc.), not by a long way.

That's where fake data comes in -- we can create synthetic datasets to use as proxies for the real data. This is not a new idea of course; people have been using synthetic data for some time, as in the ICGC-TCGA DREAM Mutation challenges, and there is already a rather impressive range of command-line software packages available for generating synthetic genomic data. It's even possible to introduce (or "spike in") variants into sequencing data, real or fake, on demand. So that's all pretty cool. But in practice these packages tend to be mostly used by savvy tool developers for small-scale testing and benchmarking purposes, and rarely (if ever? send me links!) by biomedical researchers for providing reproducible research supplements.

And frankly, it's no surprise. It's actually kinda hard.
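To make the basic idea concrete, here is a toy sketch in Python of what "generate a fake genome, spike in some variants, and sample reads from it" could look like. Everything in it is an illustrative assumption: the function names (`random_reference`, `spike_in_snvs`, `sample_reads`) are made up for this example and do not come from any of the existing packages, and it leaves out most of what makes real synthetic data realistic (base qualities, sequencing errors, paired ends, indels, proper file formats).

```python
import random


def random_reference(length, seed=0):
    """Build a random DNA string to stand in for a reference sequence."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


def spike_in_snvs(reference, n_variants, seed=1):
    """Substitute a different base at n_variants distinct random positions.

    Returns the mutated sequence and a list of (position, ref, alt) tuples,
    with 0-based positions.
    """
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(len(reference)), n_variants))
    bases = list(reference)
    variants = []
    for pos in positions:
        ref_base = bases[pos]
        alt_base = rng.choice([b for b in "ACGT" if b != ref_base])
        bases[pos] = alt_base
        variants.append((pos, ref_base, alt_base))
    return "".join(bases), variants


def sample_reads(sequence, read_length, n_reads, seed=2):
    """Draw error-free, fixed-length reads from uniformly random start positions."""
    rng = random.Random(seed)
    last_start = len(sequence) - read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)
        reads.append((start, sequence[start:start + read_length]))
    return reads


# Hypothetical toy parameters, chosen only to keep the example small.
reference = random_reference(1000)
mutated, variants = spike_in_snvs(reference, n_variants=5)
reads = sample_reads(mutated, read_length=50, n_reads=200)
```

Because every step is seeded, rerunning the script gives the same reference, the same spiked-in variants and the same reads, and the `variants` list doubles as a truth set to check a caller against. The gap between this toy and data that behaves like a real exome or WGS run is a big part of why this is hard in practice.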

Next up: An exercise in reproducibility (and frustration)

SkyWarrior on 27 Apr 2019

I feel exactly the same way in my current research. Generating your own simulated data is good. Publishing it is also good. But when you try a similar simulation and cannot reproduce the results of a previous publication, this is brain damage (almost permanent)! I wrote a small R script to generate fake VCFs with designated runs of homozygosity for my research, and I can share it if others need it.

Geraldine_VdAuwera on 27 Apr 2019

Hah yes indeed -- see part 2 for the project that gave me brain damage :dizzy:
