[FT] Enhancing CorpusLevelTranslationMetric with Asian Language Support #478

Open
@ryan-minato

Issue encountered

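While working on several Japanese benchmark tasks, I observed that the standard BLEU, CHRF, and TER metrics, as currently configured, are suboptimal for Asian languages such as Japanese that are written without spaces between words. BLEU and TER in particular depend on word-level tokenization.
To address this, I propose adding a parameter to CorpusLevelTranslationMetric that lets the caller select a tokenizer suited to these languages.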
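To illustrate the problem, here is a minimal standard-library sketch (not lighteval or SacreBLEU code). The `unigram_precision` helper and the two example sentences are made up for illustration. It compares whitespace splitting with character-level splitting on a Japanese sentence pair that differs by a single character.

```python
from collections import Counter


def unigram_precision(hyp_tokens, ref_tokens):
    """Clipped unigram precision: matched hypothesis tokens / hypothesis length."""
    hyp_counts = Counter(hyp_tokens)
    ref_counts = Counter(ref_tokens)
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((hyp_counts & ref_counts).values())
    return overlap / max(len(hyp_tokens), 1)


# Hypothetical example pair: the sentences differ only in one character.
hypothesis = "私は猫が好きです"
reference = "私は犬が好きです"

# Whitespace splitting sees each sentence as a single token, so nothing matches.
whitespace_score = unigram_precision(hypothesis.split(), reference.split())

# Character-level splitting matches 7 of the 8 characters.
char_score = unigram_precision(list(hypothesis), list(reference))

print(whitespace_score, char_score)
```

With whitespace splitting, the whole sentence becomes one token and the score collapses to zero even though the outputs are nearly identical. A tokenizer that segments the text avoids this.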

Solution/Feature

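SacreBLEU already includes tokenizers designed for languages that are written without spaces between words, such as zh for Chinese, ja-mecab for Japanese, and ko-mecab for Korean. With a small change to the implementation, CorpusLevelTranslationMetric could pass a tokenizer choice through to SacreBLEU and handle these languages better.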

https://github.com/mjpost/sacrebleu/blob/0f351010b8b641aaa59fe75b98d7cc522bf221eb/sacrebleu/metrics/bleu.py#L110-L208
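One possible shape for this, as a rough sketch only: a helper that maps a target language to the keyword arguments forwarded to the SacreBLEU metric constructor. The helper name `sacrebleu_kwargs`, the `ASIAN_BLEU_TOKENIZERS` table, and the `metric_type` and `lang` parameters are hypothetical and not part of lighteval. The tokenizer names and the `tokenize` and `asian_support` arguments are the ones SacreBLEU exposes on `BLEU` and `TER`.

```python
from typing import Any, Dict

# SacreBLEU tokenizer names for languages without space-separated words.
ASIAN_BLEU_TOKENIZERS = {"ja": "ja-mecab", "ko": "ko-mecab", "zh": "zh"}


def sacrebleu_kwargs(metric_type: str, lang: str = "") -> Dict[str, Any]:
    """Hypothetical helper: pick constructor kwargs for a SacreBLEU metric."""
    if metric_type == "bleu":
        # Fall back to SacreBLEU's default "13a" tokenizer for other languages.
        return {"tokenize": ASIAN_BLEU_TOKENIZERS.get(lang, "13a")}
    if metric_type == "ter":
        # TER has a boolean switch for Asian-script handling instead of a tokenizer name.
        return {"asian_support": lang in ASIAN_BLEU_TOKENIZERS}
    if metric_type == "chrf":
        # chrF works on character n-grams, so no tokenizer argument is passed.
        return {}
    raise ValueError(f"Unknown metric type: {metric_type}")


print(sacrebleu_kwargs("bleu", "ja"))
print(sacrebleu_kwargs("ter", "ja"))
```

With SacreBLEU installed, the result could be used as `sacrebleu.metrics.BLEU(**sacrebleu_kwargs("bleu", "ja"))`. Note that the ja-mecab and ko-mecab tokenizers need the optional MeCab dependencies (for example `pip install "sacrebleu[ja]"`).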
