Emotion Datasets

An effort to automate the downloading and processing of textual datasets for emotion classification. Inspired by sarnthil/unify-emotion-datasets, but updated and more comprehensive. Each dataset is converted to a HuggingFace datasets Arrow dataset, optionally accompanied by metadata files.

Currently implemented datasets:

Name	System	Labels	Multilabel	Continuous	Size	Domain
AffectiveText	Continuous ratings for different emotion classes	7	✓	✓	1.3k	News headlines
CancerEmo	Plutchik-8 emotions	8	✓		12k	Cancer survivors internet forum
CARER	Hashtags in Twitter posts corresponding to Ekman's core emotions	6			20k	Twitter posts
CrowdFlower	Hashtags in Twitter posts	13			40k	Twitter posts
ElectoralTweets	Discrete categories with some aggregated emotions	21	✓		1.1k	Twitter posts
EmoBank	Valence-Arousal-Dominance	3		✓	10k	Varied
EmoInt	Subset of common emotions annotated using best-worst scaling	4	✓	✓	6.9k	Twitter posts
EmotionStimulus	Ekman basic emotions	7			2.4k	Emotion bearing sentences
FBValenceArousal	Valence Arousal	2		✓	2.9k	Facebook posts
GoEmotions	Custom hierarchical emotion system	28	✓		58k	Reddit posts
GoodNewsEveryone	Extended Plutchik	16			5k	News headlines
Hurricanes8	Plutchik-8 emotions	8	✓		14k	Twitter posts about hurricanes
Hurricanes24	Plutchik-24 emotions	24	✓		15k	Twitter posts about hurricanes
ISEAR	Situations in which a subject experienced one of 7 major emotions	7			7.6k	Situation descriptions
REN20k[1]	Evoked emotions annotated by many readers	8	✓	✓	20k	News articles
Semeval2018Classification	Presence of common Twitter emotions	11	✓		11k	Twitter posts
Semeval2018Intensity	Intensity scores for basic emotions	5	✓	✓	11k	Twitter posts
SentimentalLIAR	Automated emotion annotation using Google and IBM NLP APIs	6	✓	✓	13k	Short snippets from politicians and famous people
SSEC	A mixture between Plutchik and Ekman	8	✓		4.8k	Twitter posts
StockEmotions	Custom emotions set	13			10k	Social media comments about stocks
TalesEmotions[1]	Ekman basic emotions	7			15k	Fairy tales
TEC	Ekman basic emotions	6			21k	Twitter posts
UsVsThem	Positive and negative emotions associated with populist attitudes	13	✓		6.8k	Reddit posts
WASSA22	Ekman basic emotions, along with continuous scores for empathy and distress	9			2.1k	Essays
XED	Plutchik core emotions	9	✓	✓	27k	Subtitles

[1]: There are additional usage limitations in place, or the dataset is not publicly available. You are responsible for requesting and downloading the dataset yourself from the authors' homepage.

Installation

To install the package, first clone this repo:

git clone https://github.com/ioverho/emotion_datasets.git

If you use uv, you can optionally run sync to install the most recent versions of all dependencies:

cd emotion_datasets
uv sync

Usage

Accessing Dataset Metadata

To access a dataset's metadata in a Python script, assuming you have installed this library, you can run:

>>> dataset = get_dataset(${DATASET})
>>> dataset.metadata

Here you should replace ${DATASET} with a dataset name. See the table above for implemented datasets. The name should not contain any spaces.

This should return a DatasetMetadata object that contains a description, citation and licensing information, a list of all emotion columns, and metadata on how the emotion annotations were conducted.

Processing a Single Dataset

To process a single dataset using uv, run:

uv run process_dataset dataset=${DATASET}

The script is equipped with a hydra CLI. Use --help to see which options are available. To get help for a specific dataset, run: uv run process_dataset dataset=${DATASET} --help.

To change the location of the output directory, pass the file_system.output_dir=${OUTPUT_DIR} override to the script.

If the dataset has already been processed and currently resides in the output directory, the script will fail, unless overwrite=True is set.

If the data needs to be manually downloaded first (see the [1] annotation in the above table), you must set the dataset.download_file_path parameter to the downloaded file. This file will not be altered during processing.
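
The overrides described above can be combined in one invocation. The following is a minimal sketch, not part of this repo: `build_process_command` is a hypothetical helper, and only the `dataset`, `file_system.output_dir`, `overwrite` and `dataset.download_file_path` keys come from this README. The `ren20k` spelling and `./output` path are illustrative assumptions. The command is only printed, never executed.

```python
import shlex


def build_process_command(dataset, output_dir=None, overwrite=False, download_file_path=None):
    """Assemble the `uv run process_dataset` argument list (hypothetical helper)."""
    args = ["uv", "run", "process_dataset", f"dataset={dataset}"]
    if output_dir is not None:
        # Redirect the processed data to another output directory.
        args.append(f"file_system.output_dir={output_dir}")
    if overwrite:
        # Without this override the script fails if the dataset already exists.
        args.append("overwrite=True")
    if download_file_path is not None:
        # Only needed for datasets that must be downloaded manually.
        args.append(f"dataset.download_file_path={download_file_path}")
    return args


command = build_process_command(
    "ren20k",  # assumed spelling of the dataset name
    output_dir="./output",
    overwrite=True,
    download_file_path="./downloads/REN-20k.zip",
)
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would run it; printing first makes it easy to confirm the overrides before any processing starts.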

Output

Running the script for any dataset should output a directory with the following structure:

/data/
    ├── ${DATASET}
    │   processed data along with metadata files
    ├── citations.bib
    │   a bib file with the citations for each dataset
    └── manifest.json
        a summary of which files can be found where
/downloads/
    Any remaining download files will reside here
/logs/
    └── ${DATASET}
        the logs produced during processing

All datasets are stored as HuggingFace datasets-compatible directories. Each such directory contains at least 3 files:

data-#####-of-#####.arrow: the actual data, stored across arrow files
dataset_info.json: metadata relevant for users. Includes information about the homepage and citing the original dataset
state.json: metadata relevant for HuggingFace
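
As a quick sanity check after processing, the file layout listed above can be verified with the standard library alone. This is a minimal sketch; `looks_like_hf_dataset_dir` is a hypothetical helper, and where exactly the Arrow files sit inside `data/${DATASET}` is an assumption you may need to adjust.

```python
from pathlib import Path

# Metadata files that the README lists for every processed dataset.
REQUIRED_FILES = ("dataset_info.json", "state.json")


def looks_like_hf_dataset_dir(path):
    """Return True if `path` contains the three kinds of files listed above."""
    directory = Path(path)
    if not directory.is_dir():
        return False
    has_metadata = all((directory / name).is_file() for name in REQUIRED_FILES)
    # At least one shard named like data-00000-of-00001.arrow must be present.
    has_arrow_shard = any(directory.glob("data-*-of-*.arrow"))
    return has_metadata and has_arrow_shard
```

A directory that passes this check can typically be opened with `datasets.load_from_disk(path)` from the HuggingFace datasets library.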

Processing All Datasets

To process all datasets in one go, use the process_multiple_datasets script.

uv run process_multiple_datasets

Configuration parameters for individual datasets can be set through the CLI as usual; each dataset's parameters now live under datasets.${DATASET}. Some datasets require additional configuration parameters. For example, REN20k must be downloaded manually. Unless such datasets are skipped, the script will fail before processing any dataset.

uv run process_multiple_datasets datasets.ren20k.download_file_path='./downloads/REN-20k.zip'

Should you wish to skip any datasets, you can use the skip argument:

uv run process_multiple_datasets 'skip=[${DATASET_1}, ${DATASET_2}]'
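
Because the script fails up front when a manually downloaded dataset has neither a download path nor a skip entry, it can help to check this before launching a long run. The sketch below is an illustration only: `build_multi_command` and `MANUAL_DOWNLOAD` are hypothetical names, and the lowercase key `talesemotions` is an assumption modelled on the `datasets.ren20k` key shown above.

```python
# Datasets marked [1] in the table above. The lowercase config keys are
# assumptions modelled on `datasets.ren20k` and may differ in the real config.
MANUAL_DOWNLOAD = {"ren20k", "talesemotions"}


def build_multi_command(download_paths, skip):
    """Build `uv run process_multiple_datasets` arguments (hypothetical helper).

    Raises ValueError if a manually downloaded dataset is neither skipped
    nor given a download path.
    """
    missing = sorted(MANUAL_DOWNLOAD - set(skip) - set(download_paths))
    if missing:
        raise ValueError(f"needs a download path or a skip entry: {missing}")
    args = ["uv", "run", "process_multiple_datasets"]
    for name, path in sorted(download_paths.items()):
        args.append(f"datasets.{name}.download_file_path={path}")
    if skip:
        # Same list syntax as the skip example above.
        args.append("skip=[" + ", ".join(skip) + "]")
    return args


print(build_multi_command({"ren20k": "./downloads/REN-20k.zip"}, ["talesemotions"]))
```

Keeping the arguments as a list avoids shell quoting issues with the brackets in the skip override when the list is handed to `subprocess.run`.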

Citation

If you use this repo, please make sure to cite the datasets you parsed. Also, please cite this repo.

@software{ioverho_emotion_datasets,
    author = {Verhoeven, Ivo},
    license = {CC-BY-4.0},
    title = {{emotion\_datasets}},
    url = {https://github.com/ioverho/emotion_datasets}
}

Appendix

WIP Datasets

Name	Description
Blogs
VENT	Huge tweets dataset with many emotions
SMILE Twitter Emotion
EmoNet
emotion-cause
emotiondata-aman
IMS Datasets

Notes

~~Both CARER and CrowdFlower will need to be edited to match the same dataset schema~~
~~Check for multilabel instances in ElectoralTweets~~
~~Some method for seeing samples from each dataset~~
~~Some script for quickly generating a .bib file from all the downloaded datasets~~
~~TalesEmotions needs to be altered to fit schema~~

Excluded Datasets

Name	Exclusion Reason
SemEval-2019 Task 3: EmoContext	Emotion spread out over long context
Grounded Emotion	SoTA classifiers cannot beat random performance
dailydialog	Avoiding conversational data for now
EmoWoz	Avoiding conversational data for now
