Add multilingual tokenization for ROUGE #79

jon-tow · 2022-06-01T21:53:08Z

Adds support for multilingual ROUGE scoring by providing language-specific tokenization via nltk.
Adds a code_to_pycountry_lang utility that maps ISO codes to pycountry.db.Language objects for robust language name parsing.
Removes rougeLsum from the default rouge_types arg, because sentences are not separated by newlines, which breaks the rouge_scorer assumption.
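
To illustrate the language-specific tokenization described above, here is a minimal sketch, not the code in this PR. The class name `WordTokenizerSketch` and the regex fallback `regex_word_tokenize` are hypothetical; the only assumption about the scorer is that it accepts an object with a `tokenize(text)` method.

```python
import re
from typing import Callable, List, Optional


def regex_word_tokenize(text: str, language: str = "english") -> List[str]:
    """Hypothetical stand-in tokenizer: keeps runs of Unicode word characters.

    The `language` argument is accepted only for signature compatibility
    and is ignored here.
    """
    return re.findall(r"\w+", text)


class WordTokenizerSketch:
    """Hypothetical wrapper that binds a language to a word tokenizer."""

    def __init__(
        self,
        language: str = "english",
        word_tokenize: Optional[Callable[..., List[str]]] = None,
    ) -> None:
        self.language = language
        # Fall back to the regex tokenizer when no backend is supplied.
        self._word_tokenize = word_tokenize or regex_word_tokenize

    def tokenize(self, text: str) -> List[str]:
        # Lowercase first, then delegate to the language-aware backend.
        return self._word_tokenize(text.lower(), language=self.language)


tokens = WordTokenizerSketch("german").tokenize("Der schnelle Fuchs!")
```

If nltk and its punkt data are installed, passing `word_tokenize=nltk.word_tokenize` should swap in real language-specific tokenization, since that function takes a `language` keyword.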

TODO

Add sentence-level tokenization (possibly using nltk.sent_tokenize?). As mentioned above, rouge-score==0.0.4 (the latest package release) expects sentences to be split by newlines in order to compute the rougeLsum score. The latest version on their master branch contains automatic sentence-splitting support. Unfortunately, that repository is not pip-installable because a module named tokenize.py at the project root overrides a module of the same name in pip's setuptools dependency, breaking the installation.
Find a clean abstraction for tagging non-English PromptSourceTasks with their language. This tag could then be used to construct the multilingual NltkWordTokenizer that gets passed into rouge and other metrics that may need multilingual support in the future. Possibly use promptsource's language tagging (see "Language tags", promptsource#771).
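
The first TODO item could be approached roughly as follows: split each text into sentences and rejoin them with newlines before scoring, so that rougeLsum sees the format it expects. This is only a sketch under stated assumptions. `naive_sent_tokenize` is a hypothetical regex stand-in for a real sentence splitter, and `newline_separate_sentences` is an illustrative helper name, not part of this PR.

```python
import re
from typing import Callable, List


def naive_sent_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in splitter: break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def newline_separate_sentences(
    text: str,
    sent_tokenize: Callable[[str], List[str]] = naive_sent_tokenize,
) -> str:
    """Return `text` with one sentence per line."""
    return "\n".join(sent_tokenize(text))


prepared = newline_separate_sentences("First sentence. Second one! Third?")
```

A real splitter such as nltk.sent_tokenize could be passed as `sent_tokenize`; the naive regex will mis-split abbreviations and does not handle languages without sentence-final punctuation of this kind.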

Muennighoff · 2022-08-15T20:45:42Z

Can we still use the current ROUGE score in LMEVAL for non-space languages?
It seems to me that PaLM used it (https://arxiv.org/pdf/2204.02311.pdf) for many languages other than English.

Also related: ROUGE scores are 0-1 and BLEU scores are 0-100 in LMEVAL, right?

Add multilingual tokenization for ROUGE

68794b8
