Spark bug disq
@cmnbroad and I have both observed that the `SortSamSparkIntegrationTest.testSortBAMsSharded` tests fail locally on our machines even though they apparently pass on Travis. The tests fail because the comparator detects that the files are out of their reported sort order. When I dug into the failing tests, the files appeared to be sorted correctly and written out correctly into 2 shards with proper names (`filename-0001`). When the sharded directory is then read back as input, however, the two files appear to be read out of order: calling `readsRDD.collect()` clearly places all of the `filename-0001` reads before the reads from the other shard.
After digging around, the problem might lie somewhere in Disq: everything works as expected until the `abstractSamSource.getReads()` line is reached in `HtsjdkReadsRddStorage`. I suspect something is going awry with the filesystem mechanism for ordering the input files on our Macs, and that Travis is sidestepping it.
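To illustrate the kind of ordering problem suspected here, below is a minimal sketch in plain Java. It assumes the shard files are enumerated with an ordinary directory listing, which has not been confirmed as the code path Disq actually uses. `ShardOrder`, `sortedByName`, and `listShards` are hypothetical names and not part of Disq or GATK. The one fact it relies on is that `java.nio.file.Files.list` makes no guarantee about the order of the entries it returns, so the order can differ between platforms unless the caller sorts explicitly.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration only; not part of Disq or GATK.
public class ShardOrder {

    // Returns a copy of the given shard paths sorted by file name,
    // leaving the input list untouched.
    public static List<Path> sortedByName(List<Path> shards) {
        List<Path> sorted = new ArrayList<>(shards);
        sorted.sort(Comparator.comparing((Path p) -> p.getFileName().toString()));
        return sorted;
    }

    // Lists the regular files in a directory and imposes name order,
    // because Files.list does not promise any particular order.
    public static List<Path> listShards(Path dir) {
        try (Stream<Path> entries = Files.list(dir)) {
            List<Path> files = entries
                    .filter(Files::isRegularFile)
                    .collect(Collectors.toList());
            return sortedByName(files);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Simulate a listing that came back in a platform-dependent order.
        List<Path> unordered = Arrays.asList(
                Paths.get("out", "filename-0002"),
                Paths.get("out", "filename-0001"));
        System.out.println(sortedByName(unordered));
    }
}
```

If the reader on our Macs receives the shards in listing order without a sort like this, that would be consistent with the out-of-order reads described above, though this remains a guess until the listing code is inspected.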
Out of curiosity, @tomwhite: I thought that the sharded output wrote headerless BAM chunks, but that appears not to be the case at all. Was I wrong in that assumption, or did that change when we switched to Disq?