SortSam ShardedInput tests failing.
open | Created 2019-04-12 | Last updated 2019-04-12 | Posted by jamesemery

Spark bug disq

@cmnbroad and I have both observed that the SortSamSparkIntegrationTest.testSortBAMsSharded tests fail locally on our machines even though they apparently pass on Travis. The tests fail because the comparator detects that the files are out of their reported sort order. When I dug into the failing tests, it appeared that the files are sorted correctly and written out into 2 shards with proper names (filename-0000 and filename-0001). However, when the sharded directory is read back as input, the two files appear to be read out of order: calling readsRDD.collect() clearly places all of the filename-0001 reads before the filename-0000 reads.

After digging around, it appears the problem might lie somewhere in Disq, since everything works as expected until the abstractSamSource.getReads() line is reached in HtsjdkReadsRddStorage. I suspect something is going awry with the filesystem mechanism for ordering the input files on our Macs, which Travis is sidestepping.
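To illustrate the suspicion about file ordering, here is a minimal sketch in plain Java NIO. It is not Disq's or GATK's actual code, and `ShardOrder` and `sortedShards` are hypothetical names invented for this example. `Files.list` makes no guarantee about the order of directory entries, so the order can differ between filesystems; sorting the shard paths by name before reading restores the filename-0000, filename-0001 order. Skipping names that start with `_` or `.` is an assumption about marker and checksum files that may sit alongside the shards.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, for illustration only.
final class ShardOrder {
    private ShardOrder() {}

    /**
     * Lists the regular files in a sharded output directory and returns them
     * sorted by file name, so filename-0000 always precedes filename-0001
     * regardless of the order in which the filesystem returns entries.
     */
    static List<Path> sortedShards(Path dir) throws IOException {
        // Files.list returns entries in no specific order, so sort explicitly.
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                .filter(Files::isRegularFile)
                // Assumption: names starting with '_' or '.' are not shards.
                .filter(p -> {
                    String name = p.getFileName().toString();
                    return !name.startsWith("_") && !name.startsWith(".");
                })
                .sorted(Comparator.comparing((Path p) -> p.getFileName().toString()))
                .collect(Collectors.toList());
        }
    }
}
```

Calling `ShardOrder.sortedShards(dir)` on the sharded output directory and printing the result on both a Mac and Travis would be one way to check whether listing order is really the difference between the two environments.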

Out of curiosity, @tomwhite: I thought that the sharded output wrote headerless BAM chunks, but that appears not to be the case at all. Was I wrong in that assumption, or did that change when we switched to Disq?
