-
Notifications
You must be signed in to change notification settings - Fork 610
Description
When provided a reference dictionary via --sequence-dictionary, Mutect2 errors out if the "standard" location (sequence.fa -> sequence.dict) isn't present.
Steps to recreate:
Rename reference.dict reference.fa.dict (for example) and provide that as --sequence-dictionary when invoking Mutect2.
Expected results:
Code completes successfully, as if the file was called reference.dict
Actual results:
Code crashes with error:
<PATH>/reference.dict for reference <PATH>/reference.fa does not exist. Please see https://gatk.broadinstitute.org/hc/articles/360035531652-FASTA-Reference-genome-format for help creating it.
What seems to be the problem?
In the Mutect2Engine constructor, the reference reader is created here. In that line, only the reference file is provided, not the provided dictionary. Several layers down the stack, the default dictionary, is obtained and then checked for existence. Since it doesn't exist, the tools throws an exception
Workaround for users of nf-core's CreateSequenceDictionary
This problem arrises for users of nf-core. nf-core seems to have settled on a naming convention reference.fa.dict. Without getting into the issue of what the name of the dictionary should be, users of nf-core can soft-link the provided reference to reference.dict prior to calling Mutect2. This isn't pretty, but it should work.
if dict
is a BASH variable holding the name of the dictionary
ln -s ${dict} ${dict%fa.dict}dict
will create a softlink to that file so that reference.dict
will now be found by Mutect2.
Implication for other GATK tools
The --reference-dictionary argument is part of the ReferenceInputArgumentCollection
which is used in many GATK tools, it is therefore likely that many GATK tools do not respect the --sequence-dictionary argument and may fail if they cannot find the dictionary in its "default" location.
Suggestion for fix
The reference "bundle" should become a first-class citizen (replete with optional index and dictionary). It should be populated by the commandline arguments and then provided to downstream methods as a single unit. That would provide consistent information to all the validation and reference-reading programs. (I realize that this is a big lift, but am interested to see if there's an appetite for such a proposal)
Alternative fix
Remove/disable --sequence-dictionary argument and argue for a standard naming convention for adjacent files (hts-specs would be a reasonable place for such discussions to occur)