In my last blog post, I mentioned that GATK 3.6 would support Java 8. I also mentioned that we had some evidence from our Java migration testing that GATK 3.5 (and presumably older versions as well) may produce correctness errors if run on Java 8. Since then quite a few people have expressed concern because they have been running GATK 3.5 or older versions on Java 8 already. Most wanted to know what would be the nature and amplitude of these problems, and whether they should re-run the affected data on Java 7.

We don't have definitive answers for these questions because we haven't performed end-to-end testing of GATK 3.5 on Java 8. Once we noticed that some of the automated tests were failing when we switched Java versions, we hunted down the source of the test failures (some Java list structures for which iteration order is not the same between Java versions) and fixed them.

So for us, the story stops there. But we understand that those of you who have been misguidedly running GATK on Java 8 need more information to decide what to do. I thought that perhaps sharing the relevant details from our migration test results might help, so I compiled a summary of per-tool tests that were affected, with some developer notes and values that were discussed in the issue ticket.


Overall it seems that the pre-processing tools were not affected at all, so you shouldn't need to go back and reprocess any BAM files -- that's some good news right there. Most of the test failures were due to differences in variant annotation values produced by the variant callers. Most of these were very small differences, but a handful were enough to grab our attention.

To be clear, we didn't observe any differences in what was called or not coming out of the callers; however the differences in annotation values could potentially affect filtering decisions in the next steps. So while we can't say definitively whether the end result of the full variant calling pipeline would be the same or not, we suspect there could be differences at least in marginal calls. That being said, based on what we observed in the tests, it seems these differences are mostly on the scale of what you might normally see occur due to downsampling (where using different subsets of reads yields slightly different annotation values, occasionally bumping a variant across a filtering threshold).

If you have a dataset that is affected by this, I would recommend running either one sample again, or a chromosome's worth of intervals across multiple samples, and evaluating how much difference there is in the annotation values. That will allow you to estimate whether your overall results might be affected substantially or not, and whether it's worth re-running samples or not.


UnifiedGenotyper and HaplotypeCaller

  • "GeneralPloidySuite": order of ALT alleles
  • "IndelCalling": QD 29.00 vs 29.72; InbreedingCoeff 0.3761 vs 0.3495
  • "NormalCalling": QUAL differences
  • "ComplexAndSymbolicVariants": InbreedingCoeff value differences
  • Multiple instances of different QUAL, QD, PL and any PL-dependent annotations

Small PL fluctuations -- biggest difference is one GT drops from GQ 13 to GQ 1, which is why MLEAC/MLEAF change.

MuTect2

  • TLOD 41.81 vs 36.02
  • AF 0.246 vs 0.22

VariantAnnotator

  • InbreedingCoeff values are different

InbreedingCoeff depends on PL

ReadBackedPhasing

  • PhasingInconsistent in new output or old output

VariantRecalibration

  • VQSLOD changes

CombineVariants

  • Order of alleles
  • UG_center_to_keep=[BGI] vs ##UG_center_to_keep=[WUGSC]
  • #FILTER= vs | ##FILTER=

GenotypeGVCFs

  • InbreedingCoeff -0.0000 vs 0.0000

DepthOfCoverage

  • Ordering of values

Pileup

  • Ordering of reads

CalibrateGenotypeLikelihoods

  • Ordering of values

Return to top

dd232580 on 20 May 2016


Hi Geraldine, "The underlying problem is that there are some differences in the ordering logic for some of the Java list structures between Java 7 and 8. So between the two versions, the order of elements in those list structures is not guaranteed, meaning values could get swapped between elements, leading to incorrect calculation results. The solution was to switch to a different type of list structure where the order is guaranteed across Java versions." [0] Thanks for explanation, could you please tell what List implementations were used ? I'm confused as it contradicts to the List interface contract, which does guarantee a particular order of elements regardless of its implementation [1] (as opposed to Set interface), and nothing has been changed in this area in Java 8 [2] thanks, dmitry [0] http://gatkforums.broadinstitute.org/gatk/discussion/7605/java-7-8-correctness-issues [1] http://docs.oracle.com/javase/8/docs/api/java/util/List.html [2] http://www.oracle.com/technetwork/java/javase/8-compatibility-guide-2156366.html

Geraldine_VdAuwera on 20 May 2016


Hi @dd232580, The problem was with HashMap structures specifically. I believe we replaced them with LinkedHashMap to fix this. The issue notes mention that "hash key collisions were changed from a linked list in Java 7 to a balanced tree in Java 8", if that's helpful at all.

s_griffith on 20 May 2016


Does that mean that you were previously relying on the order of hash keys in the linked lists used to handle collisions in the Java 7 implementation of HashMap? Or, rather, is the problem that the iteration order changed for your HashMaps when collisions were handled differently? The HashMap data structure has never had a guarantee of a particular iteration order, so it's a little concerning that a change to the collision handling implementation would cause any issues. Perhaps a LinkedHashMap was the appropriate data structure all along.

Geraldine_VdAuwera on 20 May 2016


As I understand it from my discussions with the devs, the iteration order changed. And yes, we should probably have been using the linked form all along.

dd232580 on 20 May 2016


Thanks Geraldine, So this is JEP 180 [3]. To avoid confusion among non-java folks out there, it's worth mentioning that it's not Java 8 compatibility issue. As pointed out in HashMap spec [4] this class makes no guarantees about iteration order. Hence any code that relies on such order should be considered as based on false assumption and be fixed [5]. cheers, dmitry [3] http://openjdk.java.net/jeps/180 [4] https://docs.oracle.com/javase/8/docs/api/java/util/HashMap.html [5] https://docs.oracle.com/javase/8/docs/technotes/guides/collections/changes8.html




- Recent posts


- Upcoming events

See Events calendar for full list and dates


- Recent events

See Events calendar for full list and dates



- Follow us on Twitter

GATK Dev Team

@gatk_dev

RT @curroortuno: Do you want to learn about sequencing data analysis in an amazing city? Register now at @gatk_dev workshop "From reads to…
3 Sep 19
Thank you @murilocervato for hosting our GATK workshop in Sao Paolo last week! Great group of participants, we’ll s… https://t.co/QbT7P5dw5k
3 Sep 19
@RealMattJM “Convoluted”, huh? We see what you did there... https://t.co/GbnpCFWPOX
29 Aug 19
#GATK workshop caption competition: what is deep learning developer Sam Friedman trying to say here? https://t.co/qaJwg3lF1W
28 Aug 19
@wbsimey Happy to hear you’ve found the resources we provide helpful!
30 Jul 19

- Our favorite tweets from others

Do you want to learn about sequencing data analysis in an amazing city? Register now at @gatk_dev workshop "From re… https://t.co/ISBVX2Xwr5
3 Sep 19
Another successful #GATK workshop in the books! @TerraBioApp @gatk_dev https://t.co/oFzPLei0f9
3 Sep 19
Day 2 of #GATK workshop this time in São Paulo, Brazil! Hands-on tutorials using @TerraBioApp #GATK Best Practices… https://t.co/GgtENDNfhk
28 Aug 19
In spite of their stated mission to support human health through genomics, many GATK pipelines are applicable to no… https://t.co/FKQTouZjbv
29 Jul 19
Me: driving myself insane over what data to keep and what to not bother with for thesis and also frantically trying… https://t.co/er2klIcw5i
18 Jul 19

See more of our favorite tweets...