System Info
Ubuntu, PHP 8.1.2
PHP Version
8.1.2
Environment/Platform
Description
In the Codewithkyrian\Transformers\PretrainedTokenizers\NllbTokenizer class, this regex is used to detect language codes: /^[a-z]{3}_[A-Z]{3}$/
However, some models, like Xenova/nllb-200-distilled-600M, use a format like eng_Latn (full list). The current pattern rejects these codes because the part after the underscore is four letters in mixed case, not three uppercase letters.
I would suggest something like /^[a-z]{3}_[a-zA-Z]{3,4}$/
Is there a big penalty for false positives here? Is this check required at all?
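To make the difference concrete, here is a small standalone check using plain preg_match. It is only a sketch and not the library's code: the variable names and the sample list in $codes are my own, chosen for illustration.

```php
<?php
// Standalone sketch comparing the current and the suggested pattern.
// Not taken from NllbTokenizer; $codes is a hypothetical sample list.
$current  = '/^[a-z]{3}_[A-Z]{3}$/';
$proposed = '/^[a-z]{3}_[a-zA-Z]{3,4}$/';

$codes = ['eng_Latn', 'deu_Latn', 'eng_XYZ', 'hello'];

foreach ($codes as $code) {
    // preg_match returns 1 on a match, 0 on no match.
    printf(
        "%-10s current: %-8s proposed: %s\n",
        $code,
        preg_match($current, $code) === 1 ? 'match' : 'no match',
        preg_match($proposed, $code) === 1 ? 'match' : 'no match'
    );
}
```

With these inputs, eng_Latn and deu_Latn should only match the suggested pattern, eng_XYZ should match both, and hello should match neither. The suggested pattern would also accept strings such as abc_defg that are not real language codes, which is the false-positive case I am asking about.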
Reproduction
$trans = pipeline('translation', 'Xenova/nllb-200-distilled-600M');
$trans('Translation test', srcLang: 'eng_Latn', tgtLang: 'deu_Latn');