Description
In https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_xlm.py#L78 we have:
"id2lang": {"0": "de", "1": "en"},
"lang2id": {"de": 0, "en": 1},
and then:
lang2id (:obj:`Dict[str, int]`, `optional`, defaults to :obj:`None`):
Dictionary mapping languages string identifiers to their IDs.
id2lang (:obj:`Dict[int, str`, `optional`, defaults to :obj:`None`):
So it should be:
"id2lang": {0: "de", 1: "en"},
"lang2id": {"de": 0, "en": 1},
All other entries need this change too.
The problem hasn't been detected until now since they were used to only count the number of languages it seems.
I need to pass src/tgt languages to the tokenizer I'm porting from fairseq, so I was looking at how to do that and id2lang
seems to fit the purpose. But I actually need to look them up by int
id, that's how I saw the problem.
But I'm also not sure why do we need to hardcode the reversal, when it can be done in 1 line of code? Which would also remove this assertion code:
self.lang2id = lang2id
self.id2lang = id2lang
if lang2id is not None and id2lang is not None:
assert len(lang2id) == len(id2lang)
Further we don't even need to hardcode the ids. Replace:
"id2lang": {0: "de", 1: "en"},
with:
"id2lang": ["de", "en"]
So all we need is one of the two entries, and now generate the 2 lookup dicts on the fly.
And since it's no longer id2lang
semantically, probably renaming it to just langs
would be more appropriate.
I think I will use this approach regardless of the outcome of this issue.
Thanks.