Regex for detecting language codes is incorrect

### System Info

Ubuntu, PHP 8.1.2

### PHP Version

8.1.2

### Environment/Platform

- [ ] Command-line application
- [X] Web application
- [ ] Serverless
- [ ] Other (please specify)

### Description

In the `Codewithkyrian\Transformers\PretrainedTokenizers\NllbTokenizer` class, this regex is used to detect language codes: `/^[a-z]{3}_[A-Z]{3}$/`.
However, some models, like [Xenova/nllb-200-distilled-600M](https://huggingface.co/Xenova/nllb-200-distilled-600M), use a format like `eng_Latn` ([full list](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200)). The part after the underscore is four letters in mixed case, so it never matches `[A-Z]{3}`.

I would suggest something like `/^[a-z]{3}_[a-zA-Z]{3,4}$/`.

Is there a big penalty to false positives here? Is this check required?
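
To make the difference concrete, here is a minimal standalone sketch that runs the two patterns quoted above against a few sample strings. It does not call the library; the helper `isLanguageCode()` and the third, stricter pattern are my own assumptions for illustration, in case false positives turn out to matter.

```php
<?php
// Pattern quoted above as the one currently used.
$current = '/^[a-z]{3}_[A-Z]{3}$/';
// Pattern suggested in this issue.
$suggested = '/^[a-z]{3}_[a-zA-Z]{3,4}$/';
// Hypothetical stricter variant (not from the library): three lowercase
// letters, underscore, then a capitalised four-letter script tag.
$strict = '/^[a-z]{3}_[A-Z][a-z]{3}$/';

// Hypothetical helper: true when $code matches $pattern in full.
function isLanguageCode(string $pattern, string $code): bool
{
    return preg_match($pattern, $code) === 1;
}

$samples = ['eng_Latn', 'deu_Latn', 'zho_Hans', 'eng_USA', 'hello'];

foreach ($samples as $code) {
    printf(
        "%-9s current=%s suggested=%s strict=%s\n",
        $code,
        var_export(isLanguageCode($current, $code), true),
        var_export(isLanguageCode($suggested, $code), true),
        var_export(isLanguageCode($strict, $code), true)
    );
}
```

With the current pattern, `eng_Latn` is rejected; both the suggested and the stricter pattern accept it. The stricter one additionally rejects strings like `eng_USA`.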

### Reproduction

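```php
use function Codewithkyrian\Transformers\Pipelines\pipeline;

// Both language codes use the FLORES-200 format (e.g. eng_Latn).
$trans = pipeline('translation', 'Xenova/nllb-200-distilled-600M');
$trans('Translation test', srcLang: 'eng_Latn', tgtLang: 'deu_Latn');
```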

