Post-refactoring (update get_dataset_info & dataset scripts) #203

Merged
pfliu-nlp merged 13 commits into main from post-refactoring on May 14, 2022

@pfliu-nlp
Contributor

No description provided.

@pfliu-nlp
Contributor Author

pfliu-nlp commented May 14, 2022

Regarding the dataset_info.jsonl format, the major difference is the switch to task_categories (previously task_category).
Alternatively, we could keep the original format of dataset_info.jsonl unchanged by taking the value of task_categories[0].
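
As a minimal sketch of that fallback (not code from this PR), the conversion could look roughly like the following. Only task_categories and task_category come from the discussion above; the helper name to_legacy_record, the dataset_name field, and the sample record are hypothetical.

```python
import json


def to_legacy_record(record: dict) -> dict:
    """Return a copy of `record` that uses the singular task_category field."""
    legacy = dict(record)
    categories = legacy.pop("task_categories", None) or []
    # Keep only the first category; fall back to None when the list is empty.
    legacy["task_category"] = categories[0] if categories else None
    return legacy


# Hypothetical line in the new dataset_info.jsonl format.
line = '{"dataset_name": "example", "task_categories": ["summarization", "text-classification"]}'
legacy = to_legacy_record(json.loads(line))
print(json.dumps(legacy))
```

Note that this drops every category after the first, which is the information loss that keeping task_categories avoids.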

I'm rerunning get_dataset_info.py from scratch to collect any remaining errors or issues.

Collaborator

@neubig left a comment

LGTM. I think we should definitely change to task_categories; there is no need to remove information from dataset_info.jsonl that might be useful later.

@pfliu-nlp
Contributor Author

This is the latest version of dataset_info.jsonl, along with a few points I want to share:

  • So far, almost all ERRORs come from Google Drive links, which work intermittently and sometimes fail. We can move those files to S3 gradually. Since most of them belong to summarization tasks, maybe @yixinL7 and @xcfcode could help out with this part.
  • languages should be added for several datasets.
  • get_dataset_info.py can now print sub_dataset_name when an error happens, instead of just printing ___NONE__
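
The error-reporting idea in the last point can be sketched as follows. This is an illustration under stated assumptions, not the actual get_dataset_info.py: collect_info, fake_loader, the dataset names, and the message format are all hypothetical.

```python
def collect_info(targets, load_fn):
    """Call load_fn on each (dataset_name, sub_dataset_name) pair.

    Returns the successful results plus one error message per failure,
    each naming the sub-dataset that failed.
    """
    results, errors = [], []
    for dataset_name, sub_dataset_name in targets:
        try:
            results.append(load_fn(dataset_name, sub_dataset_name))
        except Exception as exc:  # keep going so every failure is collected
            label = sub_dataset_name if sub_dataset_name is not None else "(no sub-dataset)"
            errors.append(f"ERROR {dataset_name} / {label}: {exc}")
    return results, errors


def fake_loader(dataset_name, sub_dataset_name):
    """Hypothetical loader that fails for one sub-dataset, mimicking a flaky download."""
    if sub_dataset_name == "broken":
        raise RuntimeError("download failed")
    return {"dataset_name": dataset_name, "sub_dataset_name": sub_dataset_name}


results, errors = collect_info(
    [("demo", "ok"), ("demo", "broken"), ("plain", None)], fake_loader
)
for message in errors:
    print(message)
```

Collecting the messages instead of stopping at the first failure lets a single run from scratch surface every failing sub-dataset.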

@pfliu-nlp merged commit 65ff65b into main on May 14, 2022

2 participants