Any ploidy goes!

Posted by Geraldine_VdAuwera on 27 Aug 2014 (31)


Until now, HaplotypeCaller was only capable of calling variants in diploid organisms due to some assumptions made in the underlying algorithms. I'm happy to announce that we now have a generalized version that is capable of handling any ploidy you specify at the command line!

This new feature, which we're calling "omniploidy", is technically still under development, but we think it's mature enough for the more adventurous to try out as a beta test ahead of the next official release. We'd especially love to get some feedback from people who work with non-diploids on a regular basis, so we're hoping that some of you microbiologists and assorted plant scientists will take it out for a spin and let us know how it behaves in your hands.

It's available in the latest nightly builds; just use the -ploidy argument to give it a whirl. If you have any questions or feedback, please post a comment on this article in the forum.

Caveat: the downstream tools involved in the new GVCF-based workflow (GenotypeGVCFs and CombineGVCFs) are not yet capable of handling non-diploid calls correctly -- but we're working on it.

UPDATE:

We have added omniploidy support to the GVCF handling tools, with the following limitations:

  • When running, you need to indicate the sample ploidy that was used to generate the GVCFs with -ploidy. As usual 2 is the default ploidy.

  • The system does not support mixed ploidy across samples nor positions. An error message will be thrown if you attempt to genotype GVCFs that have a mixture, or that have some genotype whose ploidy does not match the -ploidy argument.

LATEST UPDATE:

As of GATK version 3.3-0, the GVCF tools are capable of ad-hoc ploidy detection, and can handle mixed ploidies. See the release highlights for details.



charlesbaudo on 27 Aug 2014


I'm in the process of comparing variant calls made by UnifiedGenotyper and this nightly build of HaplotypeCaller on 7 samples of a haploid organism. Here is my UnifiedGenotyper command: `java -Xmx6g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R /path/ref.fasta -I /path/7_sample_merge.bam -glm BOTH -ploidy 7 -o 7_sample.raw.vcf` and here is my HaplotypeCaller command: `java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R /path/ref.fasta -I /path/sample1.bam -I /path/sample2.bam -I /path/sample3.bam -I /path/sample4.bam -I /path/sample5.bam -I /path/sample6.bam -I /path/sample7.bam -stand_call_conf 30 -stand_emit_conf 10 -ploidy 7 -o 7_samples.raw.vcf` The terminal printout for UG shows the number of processed active regions , while the HC does not. I see this phenomenon when I have a single BAM input and ploidy specified as 1 and with two BAM inputs and ploidy specified as 2. Disclaimer: I have never used HaplotypeCaller before as I solely work on a haploid organism nor have I attempted to HC on diploid data. Apologies if this is the intended behavior, but I thought I should inquire about it. UG: > INFO 09:47:01,777 ProgressMeter - Supercontig_6:765865 3.27e+07 48.5 h 88.8 m 79.7% 60.8 h 12.4 h > INFO 09:48:01,782 ProgressMeter - Supercontig_6:776249 3.28e+07 48.5 h 88.8 m 79.7% 60.8 h 12.3 h Nightly HC: > INFO 09:55:14,091 ProgressMeter - Supercontig_1:5972736 0.0 19.4 h 15250.3 w 14.5% 5.6 d 4.8 d > INFO 09:56:14,095 ProgressMeter - Supercontig_1:5976722 0.0 19.5 h 15250.3 w 14.5% 5.6 d 4.8 d

Geraldine_VdAuwera on 27 Aug 2014


Hi @charlesbaudo, First, thanks for being willing to give this a try! The difference in the ProgressMeter behavior that you're seeing is expected. It's due to fundamental differences in how UG and HC traverse the data. UG traverses the genome one individual position at a time, so it's easier to track its progress, and it goes through each unit of data much faster than HC. In contrast, HC traverses the data by ActiveRegions, which are stretches of genome that can go up to several hundred bases (or more if you set the max higher). Each unit takes much longer to process since it contains more data, and so you may not see the count go up for a while (although it should move eventually, unless something is broken that I'm not aware of). That said I realize I wasn't clear about the usage of the ploidy argument. If you have 7 samples that are clearly separated with distinct read groups (as opposed to pooled samples which are not identified) you need to set the ploidy value to be the actual ploidy of a single sample, as opposed to the aggregate ploidy over all samples. So here you would set the ploidy to 1 for your haploid samples, otherwise you're asking HC to consider each sample as a heptaploid. This should speed up the process as HC will have considerably less work to do. I might also recommend doing a single-sample to single-sample comparison first before diving into multi-sample comparisons, but that's up to you.

charlesbaudo on 27 Aug 2014


Thank you for your quick response, @Geraldine_VdAuwera. To clarify, if I have merged those same 7 BAM files I should still set ploidy to 1 if they have distinct read groups? And this recommendation applies to both HC and UG? Thank you.

Geraldine_VdAuwera on 27 Aug 2014


As long as the samples have distinct read groups (specifically, distinct SM tags) it makes no difference whether they are in the same file or in separate files. The GATK engine merges all the data when it reads in the file contents, and distinguishes samples by SM tags. This does apply to both UG and HC, yes.

charlesbaudo on 27 Aug 2014


Thank you for the clarification. Cheers.

Geraldine_VdAuwera on 27 Aug 2014


Updated the text to reflect added support for GVCF tools, with stated limitations.

travcollier on 27 Aug 2014


The docs for HapylotypeCaller (from the nightly build 2014-10-14-gf24cf57) still say that it only supports a ploidy value of 2 under the 'Caveats' section. That should probably get updated too.

Geraldine_VdAuwera on 27 Aug 2014


Indeed, thanks for pointing it out.

Oprah on 27 Aug 2014


Does GenotypeGVCFs in the latest nightly build have features beyond v3.3-0, with respect to mixed ploidy across samples/positions? Thanks.

Geraldine_VdAuwera on 27 Aug 2014


Actually, version 3.3 supports mixed ploidy; this post refers to 3.2. I'll add a note. Meanwhile, see https://www.broadinstitute.org/gatk/blog?id=4741 for the mixed ploidy announcement.

sepid on 27 Aug 2014


May I ask you whether you know any polyploidy (fragments) dataset available publicly?

Geraldine_VdAuwera on 27 Aug 2014


I'm afraid not, @sepid, sorry. Maybe SeqAnswers or BioStars would know.

sepid on 27 Aug 2014


I have a question about how the algorithm of GATK-Polyploidy & mix-Ploidy work? have you guys published it or have you released the details of your algorithm? I very much appreciate your response!

Geraldine_VdAuwera on 27 Aug 2014


It hasn't been published yet, sorry. But you can view the code [here](https://github.com/broadgsa/gatk-protected) if you want.

Katie on 27 Aug 2014


I have a question about specifying ploidy for population sequence data. I am conducting a population sequencing study on a vector-borne bacteria, so each genomic library I have is the bacterial population infecting a single vector, but constitutes a pool of an unknown number (likely 1-7) different bacterial clones. I am planning to use Haplotype Caller to call variants but am wondering how to deal with ploidy when I have an unknown ploidy (or number of coinfecting bacterial strains) present within each genomic library. Should I set the ploidy to 7 (the maximum we likely will see in nature) for all samples and call them all together?

Geraldine_VdAuwera on 27 Aug 2014


@Katie There is no absolute answer to your question; it's going to be a tradeoff between sensitivity (aim to detect every variant, even if present only as a minority fraction) and specificity (aim to reduce false positives resulting from sequencing noise/artifacts). If you want to maximize sensitivity, you'll need to set the ploidy to the highest likely number of clones in your population.

Katie on 27 Aug 2014


Okay, thanks for getting back to me so quickly!

Katie on 27 Aug 2014


One more question about setting ploidy for a haploid sample with an unknown number of clones present at unknown proportions (i.e. unlikely they are present at equal proportions). If I set ploidy at 10 because that's the highest likely number of clones in my population, does the variant detecting algorithm assume that each clone should be present at 10%? In other words, am I then limiting my sensitivity to any variant found in > 10% of the total population? Or instead, is does the ploidy argument limit the number of unique alleles at a locus? HaplotypeCaller is very slow with high ploidy, but I would like to maintain high sensitivity to minority variants. Thanks for your help!

Geraldine_VdAuwera on 27 Aug 2014


@Katie Your first supposition is correct -- and yes that would limit your sensitivity to minority variants. Unfortunately the HC was not designed to handle the use case of indeterminate pools.

Katie on 27 Aug 2014


Thank you for your help on this. It seems like variant calling may be a two step process: (1) to identify variants informative of differentiation between pooled samples, set PLOIDY=1 (call consensus allele at each locus) and run HaplotypeCaller on all BAM files which may or may not contain mixed samples. (2) to identify within-host polymorphisms, I will then set PLOIDY=100 and run HaplotypeCaller on each BAM file individually. (Calling variants on all samples at once with high PLOIDY is computationally infeasible.) In this case, can I leave the HAPLOTYPE argument at its default value, because this will not have as strong an effect on sensitivity to minority variants?

Geraldine_VdAuwera on 27 Aug 2014


That's an interesting strategy, worth exploring. Using ploidy 1 seems a bit radical for the first step -- this will favor sites where your population is very homogeneous. Sites with lots of heterogeneity may still be called but with low confidence, so you'd have to make sure to lower the emit thresholds. Or use the GVCF mode of HC since it sounds like what you'll be looking for are sites with low probability of being hom-ref, and potentially multiple alternate alleles. You can then select these out to serve as list of interesting sites. Assuming you plan to run step 2 on just the sites called in step 1, note that you can do this by passing in your vcf using -L, and add padding (optional but recommended) using -ip 50.? But ploidy 100 may prove challenging in terms of performance. The number of calculations that need to be done to evaluate all possible combinations is astronomical. In future this may be more feasible if we can triage the combinations that need to be calculated, but with the current model that is not done. I'm afraid your job might run forever... You may have to compromise and use a lower ploidy even for that second step.

Katie on 27 Aug 2014


Thank you. Using the GVCF mode makes a lot more sense so that I can separate single sample variant calling and then conduct multi-sample variant discovery to identify different subsets of variants which may inform (1) between sample variation and (2) to identify within-host variants. However, I ran into trouble using the GVCF mode of HC using the following input: java -d64 -Xmx20g -jar "/home/bioinfo/software/GATK-3.3/target/GenomeAnalysisTK.jar" \ -T HaplotypeCaller \ -R $REF \ -I $BAM \ --emitRefConfidence GVCF \ -variant_index_type LINEAR \ -variant_index_parameter 128000 \ -nct 24 \ --max_alternate_alleles 3 \ -L $INTERVAL \ --sample_ploidy 30 \ -o ${OUTDIR}${BASE}${INT}"_"${PLOIDY}".g.vcf" ERROR MESSAGE: you have hit a current limitation of the GVCF output model that cannot handle ploidies larger than 20 , please let the GATK team about it: 30 When I set the ploidy at 20, I got the error: A GATK RUNTIME ERROR has occurred (version 3.3-0-geee94ec) ERROR MESSAGE: 20 I am using GATK version 3.3-0. Is this an issue with the version? Thank you for your help!

Sheila on 27 Aug 2014


@Katie Hi, It is possible it is an issue with the version. I forgot which exact version ploidy was introduced, but if you use the latest version you should have no problem. -Sheila

Katie on 27 Aug 2014


Great, issue resolved with the latest version. Thank you!

Katie on 27 Aug 2014


One more ploidy question: the GenotypeGVCFs tool returns no output when I test it on a list of gVCFs called with a ploidy of 30. I have tested several iterations and get no error messages, but the output file only includes the VCF header. The gVCF files look reasonable. java -d64 -Xmx20g -jar "/home/ksw9/bin/GenomeAnalysisTK.jar" -T GenotypeGVCFs -R $REF --sample_ploidy 30 -V gVCFTest.list -o test.raw.vcf

Geraldine_VdAuwera on 27 Aug 2014


Are you sure the run completed? If yes, try again with a lower emit_conf_threshold. Check your GVCFs to see what is the range of quals of your calls. Note that you don't need to specify ploidy to GenotypeGVCFs, it will infer it from the genotypes in your GVCFs.

Katie on 27 Aug 2014


Thank you. Yes, the issue is that if I run HaplotypeCaller with PLOIDY=30, all GQ scores are extremely low, due to low confidence in calling 30 alleles. When I set a lower ploidy, GQ scores increase (slightly) and it is possible to use the GenotypeGVCF tool. Even though GQ scores are low, this will help me identify sites that show evidence of within-sample polymorphism. Thanks for the help!

Rebs on 27 Aug 2014


Hi I have a question related to ploidy in UnifiedGenotyper. I am using gatk version 3.8.1 and working with _P. falciparum_. I have a file with previously to sequencing pooled samples (they are not identified). The number of samples is known but the number of parasites per sample is unknown. Also, the parasite has stages where it is haploid and others where it is diploid. Which ploidy should I set when calling variants? Thanks!

Geraldine_VdAuwera on 27 Aug 2014


@Rebs Based on what you describe there is no "correct" setting for the ploidy. One possible approach is to choose what is the minimum allele fraction you aim to be able to detect, and set ploidy accordingly. Alternatively you could try using the Mutect2 somatic caller (in tumor-only mode) which does not make ploidy assumptions.

Rebs on 27 Aug 2014


Hi @Geraldine_VdAuwera Thanks for your reply. What do you mean with allele fraction? Also, I performed variant calling trying different ploidys and when setting it to 1, it detects less SNPs... Regarding Mutect2, I would prefer to stick to UnifiedGenotyper. It was recommended by some colleagues of mine which also work with the same organism (FYI they used ploidy 2 but their conditions were different from mine). Thanks!

Geraldine_VdAuwera on 27 Aug 2014


Imagine you have a pool of 10 diploid individuals and one of them has a heterozygous SNP. Assuming perfect sequence quality, what you expect to see in the data is that the variant allele will be represented by one out of every 20 reads -- an allele fraction of 0.05. Considering the regular occurrence of technical artifacts, we're not going to want to trust an allele that's represented by very few reads. So let's say we're arbitrarily setting that limit at 5 (there are better ways to do this; this is a strawman) -- that means you'll need to have ~100x coverage at this site to make a confident call. Now in your case you already have the data so you have to work backward from however much coverage you do have. Let's say it's ~50x coverage -- that means you can only really expect to make confident calls for variants with an observed allele fraction of 0.1, ie one hom-alt in 10 diploid samples or one heterozygote in 5 diploid samples. Does that make sense? Re: what you see with ploidy =1, that's expected. You're throwing out anything that doesn't look like a hom-alt call.




- Recent posts


- Upcoming events

See Events calendar for full list and dates


- Recent events

See Events calendar for full list and dates



- Follow us on Twitter

GATK Dev Team

@gatk_dev

RT @evolscientist: #AcademicValentines https://t.co/UVZULXGhYm
15 Feb 19
@camerongenomics Ok will check on those, thanks. Been pretty wrapped up in getting 4.1 out
12 Feb 19
@camerongenomics Ooh... Can you please open an issue in our GitHub repo so our core engineers can discuss?
11 Feb 19
@micknudsen Glad you enjoyed the workshop, our crew had a blast too! Let us know in the forum if you need anything else to get going.
1 Feb 19
GATK 4.1 has been released! See highlights and the start of a multi-part blog series here: https://t.co/8p8AC4XuM4
1 Feb 19

- Our favorite tweets from others

Roses are red Violets are blue Please stop using gnomAD as a control group for association studies Unless you have… https://t.co/11grZrzjDa
14 Feb 19
#AcademicValentines https://t.co/UVZULXGhYm
14 Feb 19
On my way home after a fabulous @gatk_dev workshop in Copenhagen. Looking forward to get started implementing… https://t.co/nHDipF0aQL
1 Feb 19
Beautiful graphic from @lea_starita on explosion of variants of unknown significance as sequencing data has grown!!… https://t.co/ugCcPURQt2
23 Jan 19
Have Cromwell running on AWS Batch, very easy to work with WDL and get things working. Cool stuff!
6 Nov 18

See more of our favorite tweets...