An effort to automate the downloading and processing of textual datasets for emotion classification. Inspired by sarnthil/unify-emotion-datasets, but updated and more comprehensive. All datasets produce a HuggingFace datasets arrow dataset, and optionally some metadata files.
Currently implemented datasets:
| Name | System | Labels | Multilabel | Continuous | Size | Domain |
|---|---|---|---|---|---|---|
| AffectiveText | Continuous ratings for different emotion classes | 7 | ✓ | ✓ | 1.3k | News headlines |
| CancerEmo | Plutchik-8 emotions | 8 | ✓ | | 12k | Cancer survivors internet forum |
| CARER | Hashtags in Twitter posts corresponding to Ekman's core emotions | 6 | | | 20k | Twitter posts |
| CrowdFlower | Hashtags in Twitter posts | 13 | | | 40k | Twitter posts |
| ElectoralTweets | Discrete categories with some aggregated emotions | 21 | ✓ | | 1.1k | Twitter posts |
| EmoBank | Valence-Arousal-Dominance | 3 | | ✓ | 10k | Varied |
| EmoInt | Subset of common emotions annotated using best-worst scaling | 4 | ✓ | ✓ | 6.9k | Twitter posts |
| EmotionStimulus | Ekman basic emotions | 7 | | | 2.4k | Emotion-bearing sentences |
| FBValenceArousal | Valence Arousal | 2 | | ✓ | 2.9k | Facebook posts |
| GoEmotions | Custom hierarchical emotion system | 28 | ✓ | | 58k | Reddit posts |
| GoodNewsEveryone | Extended Plutchik | 16 | | | 5k | News headlines |
| Hurricanes8 | Plutchik-8 emotions | 8 | ✓ | | 14k | Twitter posts about hurricanes |
| Hurricanes24 | Plutchik-24 emotions | 24 | ✓ | | 15k | Twitter posts about hurricanes |
| ISEAR | Situations in which a subject experienced one of 7 major emotions | 7 | | | 7.6k | Situation descriptions |
| REN20k[1] | Evoked emotions annotated by many readers | 8 | ✓ | ✓ | 20k | News articles |
| Semeval2018Classification | Presence of common Twitter emotions | 11 | ✓ | | 11k | Twitter posts |
| Semeval2018Intensity | Intensity scores for basic emotions | 5 | ✓ | ✓ | 11k | Twitter posts |
| SentimentalLIAR | Automated emotion annotation using Google and IBM NLP APIs | 6 | ✓ | ✓ | 13k | Short snippets from politicians and famous people |
| SSEC | A mixture between Plutchik and Ekman | 8 | ✓ | | 4.8k | Twitter posts |
| StockEmotions | Custom emotions set | 13 | | | 10k | Social media comments about stocks |
| TalesEmotions[1] | Ekman basic emotions | 7 | | | 15k | Fairy tales |
| TEC | Ekman basic emotions | 6 | | | 21k | Twitter posts |
| UsVsThem | Positive and negative emotions associated with populist attitudes | 13 | ✓ | | 6.8k | Reddit posts |
| WASSA22 | Ekman basic emotions, along with continuous scores for empathy and distress | 9 | | | 2.1k | Essays |
| XED | Plutchik core emotions | 9 | ✓ | ✓ | 27k | Subtitles |
[1]: There are additional usage limitations in place, or the dataset is not publicly available. You are responsible for requesting and downloading the dataset yourself from the authors' homepage.
To install the package, first clone this repo:

    git clone https://github.com/ioverho/emotion_datasets.git

If using uv, run sync to install the most recent version of all the dependencies (optional):

    cd emotion_datasets
    uv sync

To access a dataset's metadata in a Python script, assuming you have installed this library, you can run:

    >>> dataset = get_dataset(${DATASET})
    >>> dataset.metadata

Here you should replace `${DATASET}` with a dataset name. See the table above for implemented datasets. The name should not contain any spaces.
This should return a DatasetMetadata object that contains a description, citation and licensing information, a list of all emotion columns, and metadata on how the emotion annotations were conducted.
To process a single dataset using uv, run:

    uv run process_dataset dataset=${DATASET}

The script has been equipped with a hydra CLI. Use `--help` to see which options are available. To get help for a specific dataset, run `uv run process_dataset dataset=${DATASET} --help`.
To change the location of the output directory, pass the `file_system.output_dir=${OUTPUT_DIR}` argument to the script.
If the dataset has already been processed and currently resides in the output directory, the script will fail, unless overwrite=True is set.
If the data needs to be manually downloaded first (see the [1] annotation in the above table), you must set the dataset.download_file_path parameter to the downloaded file. This file will not be altered during processing.
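The options above can be combined in a single call. As a minimal sketch, the Python helper below assembles the argument list for one `process_dataset` run. The function name `build_process_command` is hypothetical and not part of this repo, the `./data` output path is only illustrative, and the dataset name `ren20k` is borrowed from the multi-dataset example further down on the assumption that the same name works here.

```python
import shlex
from typing import List, Optional


def build_process_command(
    dataset: str,
    output_dir: Optional[str] = None,
    overwrite: bool = False,
    download_file_path: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for a single `process_dataset` run.

    Hypothetical helper: it only strings together the overrides
    described in this README and does not validate them.
    """
    command = ["uv", "run", "process_dataset", f"dataset={dataset}"]
    if output_dir is not None:
        # Change where the processed data is written
        command.append(f"file_system.output_dir={output_dir}")
    if overwrite:
        # Without this, the script fails if the dataset already exists
        command.append("overwrite=True")
    if download_file_path is not None:
        # Needed for datasets that must be downloaded manually
        command.append(f"dataset.download_file_path={download_file_path}")
    return command


command = build_process_command(
    "ren20k",  # assumed dataset name, see the note above
    output_dir="./data",
    overwrite=True,
    download_file_path="./downloads/REN-20k.zip",
)
# Print the command so it can be inspected or pasted into a shell
print(shlex.join(command))
```

The returned list can also be handed to `subprocess.run(command, check=True)` to launch the script from Python.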
Running the script for any dataset should output a directory with the following structure:

    /data/
    ├── ${DATASET}
    │     processed data along with metadata files
    ├── citations.bib
    │     a bib file with the citations for each dataset
    └── manifest.json
          a summary of which files can be found where
    /downloads/
        Any remaining download files will reside here
    /logs/
    └── ${DATASET}
          the logs produced during processing

All datasets are stored as HuggingFace datasets compatible directories. This implies at least 3 files:
- `data-#####-of-#####.arrow`: the actual data, stored across arrow files
- `dataset_info.json`: metadata relevant for users. Includes information about the homepage and citing the original dataset
- `state.json`: metadata relevant for HuggingFace
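
To quickly check whether a processed directory has the three kinds of files listed above, something like the following sketch could be used. It relies only on the Python standard library. The function name `missing_dataset_files` and the `./data/GoEmotions` path are hypothetical, and the exact location of the files inside `${DATASET}` is an assumption.

```python
from pathlib import Path
from typing import List


def missing_dataset_files(dataset_dir: str) -> List[str]:
    """Return the expected HuggingFace dataset files that are absent."""
    root = Path(dataset_dir)
    missing = []
    # At least one arrow shard, named like data-00000-of-00001.arrow
    if not any(root.glob("data-*-of-*.arrow")):
        missing.append("data-#####-of-#####.arrow")
    # The two JSON metadata files
    for name in ("dataset_info.json", "state.json"):
        if not (root / name).is_file():
            missing.append(name)
    return missing


# Hypothetical path; an empty list means all expected files were found
print(missing_dataset_files("./data/GoEmotions"))
```

A directory that passes this check has the layout written by `Dataset.save_to_disk`, so it can normally be opened with `datasets.load_from_disk` from the HuggingFace `datasets` library.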
To process all datasets in one go, use the `process_multiple_datasets` script:

    uv run process_multiple_datasets

The configuration parameters for each dataset can be set through the CLI as usual. Each dataset is now under `datasets.${DATASET}`. Some datasets require additional configuration parameters. For example, REN20k must be manually downloaded. Unless these datasets are skipped, the script will fail before processing any dataset.

    uv run process_multiple_datasets datasets.ren20k.download_file_path='./downloads/REN-20k.zip'

Should you wish to skip any datasets, you can use the `skip` argument:

    uv run process_multiple_datasets 'skip=[${DATASET_1}, ${DATASET_2}]'

If you use this repo, please make sure to cite the datasets you parsed. Also, please cite this repo.

    @software{ioverho_emotion_datasets,
      author = {Verhoeven, Ivo},
      license = {CC-BY-4.0},
      title = {{emotion\_datasets}},
      url = {https://github.com/ioverho/emotion_datasets}
    }

WIP Datasets
| Name | Description |
|---|---|
| Blogs | |
| VENT | Huge tweets dataset with many emotions |
| SMILE Twitter Emotion | |
| EmoNet | |
| emotion-cause | |
| emotiondata-aman | |
| IMS Datasets | |

- Both CARER and CrowdFlower will need to be edited to match the same dataset schema
- Check for multilabel instances in ElectoralTweets
- Some method for seeing samples from each dataset
- Some script for quickly generating a `.bib` file from all the downloaded datasets
- TalesEmotions needs to be altered to fit schema

Excluded Datasets
| Name | Exclusion Reason |
|---|---|
| SemEval-2019 Task 3: EmoContext | Emotion spread out over long context |
| Grounded Emotion | SoTA classifiers cannot beat random performance |
| dailydialog | Avoiding conversational data for now |
| EmoWoz | Avoiding conversational data for now |