Another season, another GATK release. Personally, Fall is my favorite season, and while I don’t want to play favorites with versions (though unlike with children, you’re allowed to say that the most recent one is the best --and you can tell I was a youngest child) this one is pretty special to me.

Because -ploidy! Yeah, that’s really all I need to say about that. I was a microbiologist once. And I expect many plant people will be happy too.

Other cool stuff detailed below includes: full functionality for the genotype refinement workflow tools; physical phasing and appropriate handling of dangly bits by HaplotypeCaller (must… resist… jokes…); a wealth of new documentation for variant annotations; and a slew of bug fixes that I won’t go over but are listed in the release notes.

Genotype refinement workflow with all the trimmings

As announced earlier this week, we recently developed a workflow for refining genotype calls, intended for researchers who need highly accurate genotype information as well as preliminary identification of possible de novo mutations (see the documentation for details). Although all the tools involved were already available in GATK 3.2, some functionalities were not, so we’re very happy to finally make all of them available in this new version. Plus, we like the new StrandOddsRatio annotation (which sort of replaces FisherStrand for estimating strand bias) so much that we made it a standard one, and it now gets annotated by default.

Non-diploids, rejoice!

This is also a feature that was announced a little while ago, but until now was only fully available in the nightly builds, which are technically unsupported unless we tell you to use them to get past a bad bug. In this new release, both HaplotypeCaller and GenotypeGVCFs are able to deal with non-diploid organisms (whether haploid or exotically polyploid). In the case of HaplotypeCaller, you need to specify the ploidy of your non-diploid sample with the -ploidy argument. HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately. On the bright side, you can combine the resulting files afterward. In particular, if you’re running the -ERC GVCF workflow, you’ll find that both CombineGVCFs and GenotypeGVCFs are able to handle mixed ploidies (between locations and between samples). Both tools are able to correctly work out the ploidy of any given sample at a given site based on the composition of the GT field, so they don’t require you to specify the -ploidy argument.

HaplotypeCaller gets physical

You know how HC performs a complete reassembly of reads in an ActiveRegion? (If you don’t, go read this now. Go on, we’ll wait for you.) Well, this involves building an assembly graph, of course (of course!), and it produces a list of haplotypes. Fast-forward a couple of steps, and you end up with a list of variants. That’s great, but until now, those variants were unphased, meaning the HC didn’t give you any information about whether any two variants’ alleles were on the same haplotype (meaning, on the same physical piece of DNA) or not. For example, you’d want to know whether you had this:

or this:

But HC wouldn’t tell you which it was in its output. Which was a shame, because the HC sees that information! It took a little tweaking to get it to talk, but now it emits physical phasing by default in its GVCF output (both banded GVCF and BP_RESOLUTION).

In a nutshell, phased records will look like this:

1   1372243  .  T  <NON_REF>  .  .  END=1372267  <snip>  <snip>
1   1372268  .  G  A,<NON_REF>  .  .  <snip>  GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:30,40,0:70:99:0|1:1372268_G_A:<snip>
1   1372269  .  G  T,<NON_REF>  .  .  <snip>  GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:30,41,0:71:99:0|1:1372268_G_A:<snip>
1   1372270  .  C  <NON_REF>  .  .  END=1372299  <snip>  <snip>

You see that the phasing info is encoded in two new sample-level annotations, PID (for phase identifier) and PGT (phased genotype). More than two variants can be phased in a group with the same PID, and that can include mixed types of variants (e.g. SNPs and indels).

The one big caveat related to the physical phasing output by HC in GVCFs is that, like the GVCF itself, it is not intended to be used directly for analysis! You must run your GVCFs through GenotypeGVCFs in order to get the finalized, properly formatted, ready-for-analysis calls.

Heads or tails

Speaking of HaplotypeCaller getting more helpful all the time, here’s some more of that. This still has to do with the graph assembly, and specifically, with how HC handles the bits at the edges of the graph, which are called dangling heads and dangling tails. Without going too far into the details, let’s just say that sometimes you have a variant that’s near the edge of a covered region, and due to technical reasons (cough kmer size cough) the end of the variant path can’t be tied back into the reference path, so it just dangles there (like, say, Florida) and gets trimmed off in the next step (rising ocean levels). And thus the variant is lost (boo).

We originally started paying attention to this because it often happens at the edge of exons near splice junctions in RNAseq data, but it can also happen in DNA data. The solution was to give HC the ability to recover these cliff-dwelling variants by merging the dangling ends back into the graph using special logic tailored for those situations. If you have been using our RNAseq Best Practices, then you may recognize this as the logic invoked by the --recoverDanglingHeads argument. In the new version, the functionality has been improved further and is now enabled by default for all variant calling (so you no longer need to specify that argument for RNAseq analysis). The upshot is that sensitivity is improved, especially for RNAseq data but also for DNA.

Variant annotations finally make sense

Finally, I want to attract everyone’s attention to the Variant Annotations section of the Tool Documentation, which has just undergone a comprehensive overhaul. All annotations now have some kind of documentation outlining their general purpose, output, interpretation, caveats and some notes about how they’re calculated where applicable. Tell us what you think; we are feedback junkies.

Return to top

blueskypy on 23 Oct 2014

I used v3.3-0-g37228af and here is my command: java -Xmx10g -jar $gatk -T HaplotypeCaller \ -R $refGenome \ --dbsnp $dbSNP \ -o s.10.g.vcf.gz \ -I s.bam \ -L 10 \ --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 && but there is no PGT in the output: 10 93181 . T . . END=93182 GT:DP:GQ:MIN_DP:PL 0/0:10:18:10:0,18,270 10 93183 . C . . END=93183 GT:DP:GQ:MIN_DP:PL 0/0:10:0:10:0,0,325

Geraldine_VdAuwera on 23 Oct 2014

@blueskypy Reference blocks are unphased, by definition -- there is no way to distinguish separate haplotypes.

blueskypy on 23 Oct 2014

I see, so I need to use BP_RESOLUTION. Also how to put phased genotype at the GT? Because other programs, for example, Beagle, takes GT?

blueskypy on 23 Oct 2014

so does it mean that for banded GVCF, only some sites will be phased after GenotypeGVCFs; but for BP_RESOLUTION GVCFs, all sites, except those no calls, will be phased after GenotypeGVCFs?

Geraldine_VdAuwera on 23 Oct 2014

Why would you want to have phasing info at hom-ref sites? Again, by definition there is nothing that can be phased there. Even if you use BP_RESOLUTION, there will be no PGT tag at hom-ref sites *because they can't be phased*...

blueskypy on 23 Oct 2014

sorry, I forgot the band is only for hom sites ( had a long day!). is there a way to to put phased genotype at the GT? Because other programs, for example, Beagle, takes GT? (I can write a script to do that, just wonder if GATK can do that too)

Geraldine_VdAuwera on 23 Oct 2014

Hah, no worries, I figured :) Nope, to avoid collisions with the output of other methods, the PGT stays in PGT. Script away :)

blueskypy on 23 Oct 2014

could you explain the following lines in a gvcf file? why no PGT in this line? 10 93945 rs4431953 G A, 47.77 . BaseQRankSum=-1.868;ClippingRankSum=-0.104;DB;DP=32;MLEAC= 1,0;MLEAF=0.500,0.00;MQ=48.09;MQ0=0;MQRankSum=0.156;ReadPosRankSum=0.467 GT:AD:DP:GQ:PL:SB 0/1:27,5,0:32:76:76,0,887, 157,902,1059:7,20,0,5 why the GT and PGT are different? 10 94784 . C G, 0 . DP=14;MLEAC=0,0;MLEAF=0.00,0.00;MQ=58.80;MQ0=0 GT:AD:DP:GQ:PGT:PI D:PL:SB 0/0:14,0,0:14:42:0|1:94784_C_G:0,42,838,42,838,838:8,6,0,0 10 879756 rs11818325 G A, 99.18 . DB;DP=3;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT :AD:DP:GQ:PGT:PID:PL:SB 1/1:0,3,0:3:9:0|1:879756_G_A:126,9,0,126,9,126:0,0,2,1 1

Sheila on 23 Oct 2014

@blueskypy‌ Hi, The PGT has to do with phased genotyping. Please have a look at this thread for more information: -Sheila

blueskypy on 23 Oct 2014

Thanks Sheila! But that thread does not answer my question.

Geraldine_VdAuwera on 23 Oct 2014

@‌blueskypy To clarify the scope of PGT, the physical phasing done by HaplotypeCaller is limited to the range of individual active regions, so any variant that is by itself within an active region will be unphased. To phase variants across active regions, you need to use complementary phasing tools such as Phase By Transmission and ReadBackedPhasing. Regarding the second line, this is a slight oddity that is linked to the intermediate representation of phasing in GVCFs. You should ignore it and only look at the phasing output by GenotypeGVCFs.

blueskypy on 23 Oct 2014

Thanks Geraldine! Did you mean "phased" in this sentence "so any variant that is by itself within an active region will be unphased"? what's the difference between phasing done by HC and ReadBackedPhasing since they read the same bam file? For sites that can be phased by both imputation and HC or ReadBackedPhasing, I guess the latter is more accurate since it's based on bam file. Is my understanding correct?

Geraldine_VdAuwera on 23 Oct 2014

No, I do mean that any variant that is by itself within an active region will be *unphased* (hence no PGT in your first line) since there is no other variant relative to which it could be phased. Phasing is always relative. The difference between HC's phasing and ReadBackedPhasing is that HC emits the phasing that corresponds to the called haplotypes within discrete regions, while ReadBackedPhasing is not limited to active regions, and instead will extend phasing as far as can be supported by the data. Imputation is based on external data and certain assumptions relative to those data, so you can argue that it's potentially less accurate than physical phasing, yes.

siantorno on 23 Oct 2014

Hi Geraldine, is PhaseByTransmission able to handle and properly phase polyploid GT calls?

Geraldine_VdAuwera on 23 Oct 2014

@siantorno I believe it's only designed to handle diploids but could be wrong. The author, @Laurent Francioli, may jump in to tell you more.

Laurent on 23 Oct 2014

Hi @siantorno, indeed at the moment PhaseByTransmission only handles diploid calls properly, sorry! Note that we have an internal release which handles the X-chromosome correctly but we still need to polish a few things before release. Cheers, Laurent

lukehur on 23 Oct 2014

Dear All, I thank GATK developers for implementing phasing features. I wonder if we can perform ReadBackedPhasing for a multiple sample (e.g. N samples) merged vcf file. For instance, ... -T ReadBackedPhasing \ -I sample_1.bam \ -I sample_2.bam \ ... -I sample_N.bam \ -V N_samples_merged.vcf \ -o N_samples_merged_phased.vcf And, same questions for 'PhaseByTransmission' Best wishes, Luke

Geraldine_VdAuwera on 23 Oct 2014

Hi @lukehur, I think that should work but the simplest way to know for sure is to try it on a small region and see if it works as you expect or not.

YingLiu on 23 Oct 2014

HI , @Geraldine_VdAuwera "More than two variants can be phased in a group with the same PID, and that can include mixed types of variants (e.g. SNPs and indels)",you mean that GVCF can phase both indels and SNPs ? but I know RBP only can phase SNP !

Sheila on 23 Oct 2014

@YingLiu Hi, Yes, HaplotypeCaller phases both SNPs and indels. -Sheila

YingLiu on 23 Oct 2014

> @Sheila said: > @YingLiu > Hi, > > Yes, HaplotypeCaller phases both SNPs and indels. > > -Sheila @Sheila HI, could you provide detailed introduction with using HaplotypeCaller(GVCF mode) to phase SNPs and indels ? I feel provding related phase alorgthim paper to us is helpful . Best

Sheila on 23 Oct 2014

@YingLiu Hi, There is no white paper, but you can have a look at the Methods and Algorithms section for a detailed explanation of the tool. -Sheila

- Recent posts

- Upcoming events

See Events calendar for full list and dates

- Recent events

See Events calendar for full list and dates

- Follow us on Twitter

GATK Dev Team


@brown_birds The GATK support forum is the best place to ask, our frontline specialist will be able to help you with this.
18 Oct 19
It's hot, it's humid, it's #ASHG19 in Houston, TX. Join us at @broadgenomics booth 714 in the exhibition hall to ch…
16 Oct 19
Interested in hearing more about our DRAGEN-GATK partnership with @illumina? Fill out this survey to let us know yo…
16 Oct 19
RT @datadriveby: GATK and DRAGEN collaboration presented by @VdaGeraldine of @gatk_dev and @delagoya of @illumina at #ASHG19. Interesting t…
15 Oct 19
Questions about our new partnership with @illumina DRAGEN? Check out the blog post and handy graphic that explains…
1 Oct 19

- Our favorite tweets from others

Today from 1:30 - 2:30 pm @dgmacarthur and @LFranciol will be at the Broad #ASHG19 booth talking about the #gnomAD…
17 Oct 19
DRAGEN-GATK roadmap looking very interesting. Several complementary options will be available for running stuff on-…
15 Oct 19
As a prior card carrying bioinformatician, it’s great to see @illumina and @broadinstitute coming together to solve…
15 Oct 19
GATK and DRAGEN collaboration presented by @VdaGeraldine of @gatk_dev and @delagoya of @illumina at #ASHG19. Intere…
15 Oct 19
In a new collaboration, the @gatk_dev team and the @illumina DRAGEN Bio-IT Platform are co-developing open-source g…
30 Sep 19

See more of our favorite tweets...