Skip to content

bug: Mutect2 doesn't respect provided --sequence-dictionary #9208

@yfarjoun

Description

@yfarjoun

When provided a reference dictionary via --sequence-dictionary, Mutect2 errors out if the "standard" location (sequence.fa -> sequence.dict) isn't present.

Steps to recreate:

Rename reference.dict reference.fa.dict (for example) and provide that as --sequence-dictionary when invoking Mutect2.

Expected results:

Code completes successfully, as if the file was called reference.dict

Actual results:

Code crashes with error:

<PATH>/reference.dict for reference <PATH>/reference.fa does not exist. Please see https://gatk.broadinstitute.org/hc/articles/360035531652-FASTA-Reference-genome-format for help creating it.

What seems to be the problem?

In the Mutect2Engine constructor, the reference reader is created here. In that line, only the reference file is provided, not the provided dictionary. Several layers down the stack, the default dictionary, is obtained and then checked for existence. Since it doesn't exist, the tools throws an exception

Workaround for users of nf-core's CreateSequenceDictionary

This problem arrises for users of nf-core. nf-core seems to have settled on a naming convention reference.fa.dict. Without getting into the issue of what the name of the dictionary should be, users of nf-core can soft-link the provided reference to reference.dict prior to calling Mutect2. This isn't pretty, but it should work.

if dict is a BASH variable holding the name of the dictionary

 ln -s ${dict} ${dict%fa.dict}dict

will create a softlink to that file so that reference.dict will now be found by Mutect2.

Implication for other GATK tools

The --reference-dictionary argument is part of the ReferenceInputArgumentCollection which is used in many GATK tools, it is therefore likely that many GATK tools do not respect the --sequence-dictionary argument and may fail if they cannot find the dictionary in its "default" location.

Suggestion for fix

The reference "bundle" should become a first-class citizen (replete with optional index and dictionary). It should be populated by the commandline arguments and then provided to downstream methods as a single unit. That would provide consistent information to all the validation and reference-reading programs. (I realize that this is a big lift, but am interested to see if there's an appetite for such a proposal)

Alternative fix

Remove/disable --sequence-dictionary argument and argue for a standard naming convention for adjacent files (hts-specs would be a reasonable place for such discussions to occur)

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions