Using GenomicsDBImport to consolidate GVCFs for input to GenotypeGVCFs in GATK4
GATK 4 Beta | Created 2017-07-29 | Last updated 2018-04-16

Comments (33)

This article is out of date and has been replaced by

In GATK4, the GenotypeGVCFs tool can only take a single input, so if you have GVCFs from multiple samples (which is usually the case) you will need to combine them before feeding them to GenotypeGVCFs. Although there are several tools in the GATK and Picard toolkits that provide some type of VCF or GVCF merging functionality, for this use case there is only one valid way to do it: with GenomicsDBImport.

The GenomicsDBImport tool takes in one or more single-sample GVCFs and imports data over a single interval, and outputs a directory containing a GenomicsDB datastore with combined multi-sample data. GenotypeGVCFs can then read from the created GenomicsDB directly and output a VCF. Note that GenomicsDBImport does not take two or more same sample GVCFs. You will need to create one GVCF per-sample before running the tool.

Here are example commands to use it:

gatk-launch GenomicsDBImport \
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    --genomicsDBWorkspace my_database \
    --intervals 20

That generates a directory called my_database containing the combined gvcf data.

Then you run joint genotyping; note the gendb:// prefix to the database input directory path.

gatk-launch GenotypeGVCFs \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -G StandardAnnotation -newQual \
    -O test_output.vcf 

And that's all there is to it.

There are three caveats:

  1. You can't add data to an existing database; you have to keep the original GVCFs around and reimport them all together when you get new samples. For very large numbers of samples, there are some batching options.

  2. At the moment you can only run GenomicsDBImport on a single genomic interval (ie max one contig) at a time. Down the road this will change (the work is tentatively scheduled for the second quarter of 2018), because we want to make it possible to run on one multiple intervals in one go. But for now you need to run on each interval separately. We recommend scripting this of course.

  3. At the moment GenomicsDB only supports diploid data. The developers of GenomicsDB are working on implementing support for non-diploid data.

If you can't use GenomicsDB for whatever reason, fall back to CombineGVCFs instead. It is slower but will allow you to combine GVCFs the old-fashioned way.

Addendum: extracting data from the GenomicsDB

If you want to generate a flat multisample GVCF file from the GenomicsDB you created, you can do so with SelectVariants as follows:

gatk-launch SelectVariants \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -O combined.g.vcf

Caveat: cannot move database after creation

Currently the GenomicsDB internal code uses the absolute path of the location of the database as part of the data encoding. As a consequence, you cannot move the database to a different location before running GenotypeGVCFs on it. If you do, it will no longer work. This is obviously not desirable, and the development team is looking at options to remediate this.

Return to top Comment on this article