Intel deflater + long reads data = intermittent corrupt bams
open | Created 2019-03-15 | Last updated 2019-03-28 | Posted by droazen


GKL PRIORITY_HIGH bug


@kvg reports that running the Intel deflater via GATK on long-read data intermittently produces corrupt bam outputs. His specific use case is sharding a single unaligned bam file into multiple smaller bams. Running with the JDK deflater (--use-jdk-deflater) appears to resolve the issue.

Example: five identical sharding runs, followed by reading the same shard from each output with samtools. One of the reads fails (reading that shard with htsjdk produces the same error):

$ java -jar gatk.jar SplitSubreadsByZmw -I sharding_test.bam -O intel_compression/v1/ -nr 100000
$ java -jar gatk.jar SplitSubreadsByZmw -I sharding_test.bam -O intel_compression/v2/ -nr 100000
$ java -jar gatk.jar SplitSubreadsByZmw -I sharding_test.bam -O intel_compression/v3/ -nr 100000
$ java -jar gatk.jar SplitSubreadsByZmw -I sharding_test.bam -O intel_compression/v4/ -nr 100000
$ java -jar gatk.jar SplitSubreadsByZmw -I sharding_test.bam -O intel_compression/v5/ -nr 100000

$ samtools view intel_compression/v1/sharding_test.000002.bam > /dev/null
$ samtools view intel_compression/v2/sharding_test.000002.bam > /dev/null
[E::bgzf_read] Read block operation failed with error 2 after 30675 of 72043 bytes
[main_samview] truncated file.
$ samtools view intel_compression/v3/sharding_test.000002.bam > /dev/null
$ samtools view intel_compression/v4/sharding_test.000002.bam > /dev/null
$ samtools view intel_compression/v5/sharding_test.000002.bam > /dev/null

(Only the second run yields a corrupted file; the runs before and after it appear to be correct, even though the input and arguments are identical across all five.)
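To narrow down where a shard goes bad without relying on samtools or htsjdk, the file can be walked one BGZF block at a time. The following is a minimal sketch, assuming the BGZF block layout from the SAM/BAM specification (a gzip member whose "BC" extra subfield stores the block size minus one, followed by a CRC32 and the uncompressed size). `check_bgzf` is a hypothetical helper name and is not part of GATK, htsjdk, or GKL.

```python
import struct
import zlib


def check_bgzf(data: bytes) -> list:
    """Walk BGZF blocks in `data`; return a list of (offset, problem) tuples.

    An empty list means every block inflated and matched its stored
    CRC32 and uncompressed size.
    """
    problems = []
    offset = 0
    while offset < len(data):
        # Fixed 12-byte gzip header: ID1 ID2 CM FLG MTIME XFL OS XLEN
        header = data[offset:offset + 12]
        if len(header) < 12:
            problems.append((offset, "truncated block header"))
            break
        id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack("<BBBBIBBH", header)
        if (id1, id2, cm) != (31, 139, 8) or not flg & 4:
            problems.append((offset, "not a BGZF block header"))
            break

        # Scan the extra field for the 'BC' subfield, which holds BSIZE.
        extra = data[offset + 12:offset + 12 + xlen]
        bsize = None
        pos = 0
        while pos + 4 <= len(extra):
            si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
            if (si1, si2, slen) == (66, 67, 2) and pos + 6 <= len(extra):
                bsize = struct.unpack_from("<H", extra, pos + 4)[0]
                break
            pos += 4 + slen
        if bsize is None:
            problems.append((offset, "missing BC extra subfield"))
            break

        block_len = bsize + 1  # BSIZE is the total block size minus one
        if block_len < 12 + xlen + 8:
            problems.append((offset, "invalid BSIZE"))
            break
        block = data[offset:offset + block_len]
        if len(block) < block_len:
            problems.append((offset, f"truncated block: {len(block)} of {block_len} bytes"))
            break

        # Compressed payload sits between the extra field and the 8-byte trailer.
        cdata = block[12 + xlen:-8]
        crc, isize = struct.unpack("<II", block[-8:])
        try:
            payload = zlib.decompress(cdata, -15)  # negative wbits: raw deflate
        except zlib.error as exc:
            problems.append((offset, f"inflate failed: {exc}"))
        else:
            if len(payload) != isize:
                problems.append((offset, f"size mismatch: {len(payload)} != {isize}"))
            elif zlib.crc32(payload) != crc:
                problems.append((offset, "CRC32 mismatch"))
        offset += block_len
    return problems
```

Calling it as `check_bgzf(open("intel_compression/v2/sharding_test.000002.bam", "rb").read())` would list the byte offset of each block that fails. This may help distinguish a file that is simply cut short from one with a block that is damaged internally. The sketch reads the whole file into memory, which is reasonable for small shards but not for large bams.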

There may be a bug in https://github.com/Intel-HLS/GKL/blob/master/src/main/native/compression/IntelDeflater.cc, perhaps triggered when a read spans many compressed blocks.

