stephenbach

This PR adds a language tagging feature, so that users can annotate prompts with the language(s) used in the prompt.

Even though this is motivated by the eval hackathon, this PR targets main because it affects all prompts. All existing prompts in main are tagged with English. After merging into main, another PR should merge main into eval hackathon. This will require somewhat careful coordination because new prompts on that branch will need to have their metadata updated in the .yaml before they will work in the UI.
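The metadata update mentioned above could be sketched roughly as follows. This is only an illustration, not the code in this PR: the `languages` key name, the `"en"` default, and the `backfill_languages` helper are all assumptions, and prompt metadata is modeled as a plain dict rather than the project's actual `.yaml` objects.

```python
from copy import deepcopy

# Assumption: existing prompts default to English, as described in this PR.
DEFAULT_LANGUAGES = ["en"]


def backfill_languages(metadata, default=DEFAULT_LANGUAGES):
    """Return a copy of `metadata` that is guaranteed to have a `languages` list.

    Hypothetical helper: an existing `languages` value (even an empty list)
    is left untouched; only a missing key is filled with the default.
    """
    updated = deepcopy(metadata)
    if "languages" not in updated:
        updated["languages"] = list(default)
    return updated


if __name__ == "__main__":
    print(backfill_languages({"original_task": True}))
    print(backfill_languages({"original_task": True, "languages": ["fr"]}))
```

Filling only a missing key, rather than overwriting, keeps any tags that contributors on the other branch have already set.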

Regarding the tags themselves, the eval group requested using the subtags in this list. I took the liberty of changing how the tags are displayed in the UI by appending the English names in parentheses, but this is easy to change.

[Screenshot: Screen Shot 2022-05-12 at 1 48 37 PM]
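For reference, the display convention described above (a subtag followed by its English name in parentheses) might look something like this. The `ENGLISH_NAMES` mapping and `format_language_tag` function are hypothetical names for illustration, and the mapping is a tiny made-up subset rather than the list the eval group requested.

```python
# Illustrative subset only; the real list of subtags is much longer.
ENGLISH_NAMES = {
    "en": "English",
    "fr": "French",
    "ar": "Arabic",
}


def format_language_tag(code, names=ENGLISH_NAMES):
    """Format a language subtag for display, e.g. 'en' becomes 'en (English)'.

    Codes without a known English name fall back to the bare code.
    """
    name = names.get(code)
    return f"{code} ({name})" if name else code


if __name__ == "__main__":
    for code in ["en", "ar", "xx"]:
        print(format_language_tag(code))
```

Storing only the bare code and adding the name at display time keeps the stored metadata independent of how the UI chooses to render it.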

@stephenbach stephenbach requested review from VictorSanh and awebson May 12, 2022 17:49

@awebson awebson left a comment

Looks great to me! Thanks so much Steve!

I think we agreed to emphasize somewhere, either in the UI or in the contribution guide, that the language tag should describe the languages of each prompt, not the languages of the dataset examples (which should already be documented by the datasets themselves)?


@VictorSanh VictorSanh left a comment

Looks good to me, thank you for adding this @stephenbach!

Regarding the language codes, the HF ecosystem (including datasets) uses ISO 639 for language codes (see https://huggingface.co/languages). Could we use the same here?
That would make it possible, or at least much smoother, to run analyses across the languages of prompts and datasets.
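As a rough sketch of why shared codes help: if prompts and datasets both use ISO 639 codes, comparing them reduces to a set intersection. The function names below are hypothetical, and the check only tests the shape of a code (two lowercase letters as in ISO 639-1, three as in ISO 639-2/639-3); it does not verify that the code is actually assigned.

```python
import re

# Shape check only: two or three lowercase ASCII letters.
_ISO_639_SHAPE = re.compile(r"[a-z]{2,3}")


def looks_like_iso_639(code):
    """Return True if `code` has the shape of an ISO 639 code."""
    return _ISO_639_SHAPE.fullmatch(code) is not None


def shared_languages(prompt_languages, dataset_languages):
    """Return the sorted language codes that a prompt and a dataset have in common."""
    return sorted(set(prompt_languages) & set(dataset_languages))


if __name__ == "__main__":
    print(looks_like_iso_639("en"))      # two-letter shape
    print(looks_like_iso_639("en-US"))   # region-qualified tag, not a bare ISO 639 code
    print(shared_languages(["en", "fr"], ["fr", "de"]))
```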

@stephenbach stephenbach merged commit 0cc4b0c into main Jul 8, 2022
@stephenbach stephenbach deleted the language_tags branch July 8, 2022 21:29
stephenbach added a commit that referenced this pull request Jul 12, 2022
* Accelerate `get_infos` by caching the `DataseInfoDict`s (#778)

* accelerate `get_infos` by caching the `DataseInfoDict`s

* quality

* consistency

* fix `filter_english_datasets` since `languages` became `language` in dataset metadatas

* fix empty documents - multi_news (#793)

* fix empty documents - multi_news

* fix test - unrecognized variable

* Language tags (#771)

* Added languages widget to UI.

* Style fixes.

* Added English tag to existing datasets.

* Add languages to viewer mode.

* Update language codes.

* Update CONTRIBUTING.md.

* Update screenshot.

* Add "Prompt" to UI to clarify languages tag usage.

* Add blank languages list.

Co-authored-by: Victor SANH <victorsanh@gmail.com>
stephenbach added a commit that referenced this pull request Oct 26, 2022
* remove language restrictions

* add arabic dataset to primary_task

* Accelerate `get_infos` by caching the `DataseInfoDict`s (#778)

* accelerate `get_infos` by caching the `DataseInfoDict`s

* quality

* consistency

* add arabic prompts

* cleaning

* Consistency in prompt naming.

* cleaning

* fix `filter_english_datasets` since `languages` became `language` in dataset metadatas

* fix empty documents - multi_news (#793)

* fix empty documents - multi_news

* fix test - unrecognized variable

* Language tags (#771)

* Added languages widget to UI.

* Style fixes.

* Added English tag to existing datasets.

* Add languages to viewer mode.

* Update language codes.

* Update CONTRIBUTING.md.

* Update screenshot.

* Add "Prompt" to UI to clarify languages tag usage.

* update

* update prompts

* Remove duplicates lines

* update

* regenerate prompts

* cleaning

* lang tag missing

Co-authored-by: Victor SANH <victorsanh@gmail.com>
Co-authored-by: Stephen Bach <stephenhbach@gmail.com>