Reading sharded SAM / CRAM fails on some filesystems #199

@lbergelson

It seems that reading sharded SAM / CRAM files correctly depends on which filesystem you're using. In particular, the order in which the shards are read differs depending on the underlying filesystem. I suspect that somewhere there is a call that gets an iterator over the part files directly from the filesystem and doesn't sort it.

BAMInputFormat seems to work because it overrides getSplits and then sorts the splits by their path. Extending AnySAMInputFormat to also override getSplits seems to fix the problem, but it's a nasty hack that relies on casting things in various ways. We should fix it at the source instead.

    public static class SplitSortingSamInputFormat extends AnySAMInputFormat {
        @SuppressWarnings("unchecked")
        @Override
        public List<InputSplit> getSplits(JobContext job) throws IOException {
            final List<InputSplit> splits = super.getSplits(job);

            // Only sort when every split is a type we know how to get a path from.
            if (splits.stream().allMatch(split -> split instanceof FileVirtualSplit || split instanceof FileSplit)) {
                // Order the splits by file path so the shards are read in a stable order.
                splits.sort(Comparator.comparing(split -> {
                    if (split instanceof FileVirtualSplit) {
                        return ((FileVirtualSplit) split).getPath();
                    } else {
                        return ((FileSplit) split).getPath();
                    }
                }));
            }

            return splits;
        }
    }
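
To illustrate the suspected root cause and what fixing it "at the source" might look like, here is a minimal sketch. It uses plain java.nio rather than the Hadoop FileSystem API, so it is an analogy and not Hadoop-BAM code: SortedShardListing and listShardsSorted are hypothetical names, and the "part-*" glob is an assumption about how the shards are named. The point is only that a directory listing comes back in no guaranteed order, so whoever enumerates the part files has to sort them explicitly.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortedShardListing {

    /**
     * Lists the part files in a shard directory in name-sorted order.
     * DirectoryStream makes no promise about iteration order, so the
     * explicit sort is what makes the result the same on every filesystem.
     */
    public static List<Path> listShardsSorted(Path shardDir) throws IOException {
        List<Path> shards = new ArrayList<>();
        // Assumption: shards are named part-*; other files (e.g. _SUCCESS) are skipped.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(shardDir, "part-*")) {
            for (Path shard : stream) {
                shards.add(shard);
            }
        }
        // Sort by file name instead of trusting the order the filesystem returned.
        shards.sort(Comparator.comparing(shard -> shard.getFileName().toString()));
        return shards;
    }
}
```

If the real listing code did the equivalent sort on the Hadoop paths, the subclass workaround above presumably wouldn't be needed, though I haven't tracked down where that listing actually happens.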

We noticed this while adding tests on sharded files in https://github.com/broadinstitute/gatk/pull/4545/files. The tests passed on OSX but failed with out-of-order files on Travis (running Ubuntu).
