Closed
Description
I ran across an issue in fairseq-interactive where unk tokens are not replaced when the source string itself contains unks, even though the --replace-unk flag is set.
Example:
| Type the input sentence and press return:
Jack and Jill went up the hill
S-0 Jack and <unk> went up the hill
H-0 -0.9424245357513428 Jack and <unk> went up the hill
P-0 -0.1024 -1.3528 -0.1208 -1.4977 -1.0983 -1.7025 -0.4995 -1.1654
A-0 0 2 2 3 4 6 6 7
Looking at the code, I think the issue is here: https://github.com/pytorch/fairseq/blob/master/interactive.py#L157
src_str is re-created from src_tokens, which means it already contains the unk token. When post_process_prediction() later tries to replace the unk in the hypothesis, it looks up the aligned source token, which is itself an unk, so it just replaces one unk with another.
This seems like a bug, but I could be doing something wrong. I've fixed it locally by keeping the original src_str and passing it to post_process_prediction().
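To make the behavior concrete, here is a minimal sketch of alignment-based unk replacement using the example above. It is not fairseq's actual code: replace_unk here is a simplified, hypothetical stand-in written for illustration, and it assumes whitespace tokenization and an `<eos>` entry appended to the source tokens so that the last alignment index is valid.

```python
UNK = "<unk>"


def replace_unk(hypo_str, src_str, alignment, unk=UNK):
    """Replace each unk in the hypothesis with its aligned source token.

    Hypothetical helper for illustration only; alignment[i] is the index
    of the source token aligned to hypothesis token i.
    """
    hypo_tokens = hypo_str.split()
    # Assumption: the source gets an end-of-sentence slot, so the
    # alignment entry for the final position stays in range.
    src_tokens = src_str.split() + ["<eos>"]
    for i, token in enumerate(hypo_tokens):
        if token == unk:
            hypo_tokens[i] = src_tokens[alignment[i]]
    return " ".join(hypo_tokens)


# Values taken from the example output (A-0 line).
alignment = [0, 2, 2, 3, 4, 6, 6, 7]
hypo = "Jack and <unk> went up the hill"

# What the user typed, versus the string rebuilt from src_tokens.
original_src = "Jack and Jill went up the hill"
recreated_src = "Jack and <unk> went up the hill"

fixed = replace_unk(hypo, original_src, alignment)
buggy = replace_unk(hypo, recreated_src, alignment)

print(fixed)
print(buggy)
```

With the original source string, the unk at position 2 is replaced by the aligned word "Jill". With the re-created string, the aligned token is itself an unk, so the hypothesis comes back unchanged, which matches the output shown in the example.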