add huggingface format to be pulled by huggingface/datasets #7
cstorm125
commented
Nov 30, 2020
- Filter out texts that are only '#ERROR!'
- Add train-validation split at 90/10 with seed 1412
- Save to huggingface/train.json, valid.json, and test.json, all in JSON Lines format
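The three steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the notebook's actual code: the function names, the `texts` and `category` field names, and the use of `random.Random` for the shuffle are all assumptions made for the example. Only the '#ERROR!' filter, the 90/10 ratio, the seed 1412, and the JSON Lines output come from the description.

```python
import json
import random
from pathlib import Path


def filter_errors(records, text_key="texts"):
    """Drop records whose text is only the '#ERROR!' artifact.

    `text_key` is a hypothetical field name; surrounding whitespace is
    ignored here, which is an assumption of this sketch.
    """
    return [r for r in records if r[text_key].strip() != "#ERROR!"]


def train_valid_split(records, valid_fraction=0.1, seed=1412):
    """Shuffle with a fixed seed, then hold out `valid_fraction` for validation.

    The PR may use a different splitting routine; a seeded shuffle is
    simply one reproducible way to get a 90/10 split.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(round(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]


def write_jsonl(records, path):
    """Write one JSON object per line (JSON Lines), keeping Thai text readable."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
```

A call sequence such as `train, valid = train_valid_split(filter_errors(rows))` followed by `write_jsonl(train, "huggingface/train.json")` would then produce files with one record per line. Passing `ensure_ascii=False` keeps Thai characters as-is instead of escaping them.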
I'm not familiar with Hugging Face Datasets; please enlighten me.
Do we need to provide some metadata for dataset.info here as well, or is that not part of this step?
For the notebook here
Metadata is partly autogenerated. The other part is a README file (the "Dataset Card"), into which I copied and pasted most of the content from the original README. It looks like this:
YAML tags:
Dataset Card for wisesight_sentiment
Table of Contents
Dataset Description
Dataset Summary
Wisesight Sentiment Corpus: Social media messages in Thai language with sentiment label (positive, neutral, negative, question)
Supported Tasks and Leaderboards
Sentiment analysis / Kaggle Leaderboard
Languages
Thai
Dataset Structure
Data Instances
Data Fields
Data Splits
Dataset Creation
Curation Rationale
Originally, the dataset was conceived for the In-class Kaggle Competition at Chulalongkorn university by Ekapol Chuangsuwanich (Faculty of Engineering, Chulalongkorn University). It has since become one of the benchmarks for sentiment analysis in Thai.
Source Data
Initial Data Collection and Normalization
Who are the source language producers?
Social media users in Thailand
Annotations
Annotation process
Who are the annotators?
Outsourced annotators hired by Wisesight (Thailand) Co., Ltd.
Personal and Sensitive Information
Considerations for Using the Data
Social Impact of Dataset
Discussion of Biases
Other Known Limitations
Additional Information
Dataset Curators
Thanks PyThaiNLP community, Kitsuchart Pasupa (Faculty of Information Technology, King Mongkut's Institute of Technology Ladkrabang), and Ekapol Chuangsuwanich (Faculty of Engineering, Chulalongkorn University) for advice. The original Kaggle competition, using the first version of this corpus, can be found at https://www.kaggle.com/c/wisesight-sentiment/
Licensing Information
Citation Information
Please cite the following if you make use of the dataset: Arthit Suriyawongkul, Ekapol Chuangsuwanich, Pattarawat Chormai, and Charin Polpanumas. 2019. PyThaiNLP/wisesight-sentiment: First release. September.
BibTeX: