Description
It seems like reading sharded SAM / CRAM files correctly depends on the underlying file system: the order in which the shards are read appears to differ from one file system to another. I suspect there is a call that gets an iterator over the part files directly from the file system and then doesn't sort it.
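To illustrate the suspected failure mode, here is a minimal standalone sketch. It uses `java.nio.file` rather than the Hadoop `FileSystem` API so that it runs without a cluster, and it is not Hadoop-BAM's actual code. The class name `SortedShardListing`, the helper `listShards`, and the `part-*` naming pattern are assumptions made for the example. The `DirectoryStream` documentation states that its iterator returns elements in no specific order, so the listing is only deterministic once it is sorted explicitly.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortedShardListing {

    // Hypothetical helper: collect the part files in a directory and
    // return them in a deterministic (lexicographic by file name) order.
    public static List<Path> listShards(Path dir) throws IOException {
        List<Path> shards = new ArrayList<>();
        // The iteration order of a DirectoryStream is unspecified and may
        // differ between file systems and operating systems.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "part-*")) {
            for (Path p : stream) {
                shards.add(p);
            }
        }
        // Sorting here is what makes the result independent of the file system.
        shards.sort(Comparator.comparing(p -> p.getFileName().toString()));
        return shards;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("shards");
        // Create the shards out of order on purpose.
        for (String name : new String[] {"part-r-00002", "part-r-00000", "part-r-00001"}) {
            Files.createFile(dir.resolve(name));
        }
        for (Path p : listShards(dir)) {
            System.out.println(p.getFileName());
        }
    }
}
```

Without the `sort` call, the same program could print the shards in a different order on different machines, which would match the symptom described here.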
BAMInputFormat seems to work because it overrides getSplits and then sorts the splits by their path. Extending AnySAMInputFormat to also override getSplits seems to fix the problem, but it's a nasty hack that relies on casting things in various ways. We should fix it at the source instead.
// Workaround: sort the splits by path so that shards are read in a stable order.
public static class SplitSortingSamInputFormat extends AnySAMInputFormat {
    @SuppressWarnings("unchecked")
    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        final List<InputSplit> splits = super.getSplits(job);
        // Only sort when every split is a type we know how to get a path from.
        if (splits.stream().allMatch(split -> split instanceof FileVirtualSplit || split instanceof FileSplit)) {
            splits.sort(Comparator.comparing(split -> {
                if (split instanceof FileVirtualSplit) {
                    return ((FileVirtualSplit) split).getPath();
                } else {
                    return ((FileSplit) split).getPath();
                }
            }));
        }
        return splits;
    }
}
We noticed this as part of adding tests on sharded files in https://github.com/broadinstitute/gatk/pull/4545/files. The tests passed on OSX but failed with out-of-order files on Travis (running Ubuntu).