An effort to automate the downloading and processing of textual datasets for emotion classification. Inspired by sarnthil/unify-emotion-datasets, but updated and more comprehensive.

ioverho/emotion_datasets

Emotion Datasets

An effort to automate the downloading and processing of textual datasets for emotion classification. Inspired by sarnthil/unify-emotion-datasets, but updated and more comprehensive. All datasets produce a HuggingFace datasets arrow dataset, and optionally some metadata files.

Currently implemented datasets:

| Name | System | Labels | Multilabel | Continuous | Size | Domain |
| --- | --- | --- | --- | --- | --- | --- |
| AffectiveText | Continuous ratings for different emotion classes | 7 | | | 1.3k | News headlines |
| CancerEmo | Plutchik-8 emotions | 8 | | | 12k | Cancer survivors internet forum |
| CARER | Hashtags in Twitter posts corresponding to Ekman's core emotions | 6 | | | 20k | Twitter posts |
| CrowdFlower | Hashtags in Twitter posts | 13 | | | 40k | Twitter posts |
| ElectoralTweets | Discrete categories with some aggregated emotions | 21 | | | 1.1k | Twitter posts |
| EmoBank | Valence-Arousal-Dominance | 3 | | | 10k | Varied |
| EmoInt | Subset of common emotions annotated using best-worst scaling | 4 | | | 6.9k | Twitter posts |
| EmotionStimulus | Ekman basic emotions | 7 | | | 2.4k | Emotion bearing sentences |
| FBValenceArousal | Valence Arousal | 2 | | | 2.9k | Facebook posts |
| GoEmotions | Custom hierarchical emotion system | 28 | | | 58k | Reddit posts |
| GoodNewsEveryone | Extended Plutchik | 16 | | | 5k | News headlines |
| Hurricanes8 | Plutchik-8 emotions | 8 | | | 14k | Twitter posts about hurricanes |
| Hurricanes24 | Plutchik-24 emotions | 24 | | | 15k | Twitter posts about hurricanes |
| ISEAR | Situations in which a subject experienced one of 7 major emotions | 7 | | | 7.6k | Situation descriptions |
| REN20k[1] | Evoked emotions annotated by many readers | 8 | | | 20k | News articles |
| Semeval2018Classification | Presence of common Twitter emotions | 11 | | | 11k | Twitter posts |
| Semeval2018Intensity | Intensity scores for basic emotions | 5 | | | 11k | Twitter posts |
| SentimentalLIAR | Automated emotion annotation using Google and IBM NLP APIs | 6 | | | 13k | Short snippets from politicians and famous people |
| SSEC | A mixture between Plutchik and Ekman | 8 | | | 4.8k | Twitter posts |
| StockEmotions | Custom emotions set | 13 | | | 10k | Social media comments about stocks |
| TalesEmotions[1] | Ekman basic emotions | 7 | | | 15k | Fairy tales |
| TEC | Ekman basic emotions | 6 | | | 21k | Twitter posts |
| UsVsThem | Positive and negative emotions associated with populist attitudes | 13 | | | 6.8k | Reddit posts |
| WASSA22 | Ekman basic emotions, along with continuous scores for empathy and distress | 9 | | | 2.1k | Essays |
| XED | Plutchik core emotions | 9 | | | 27k | Subtitles |

[1]: There are additional usage limitations in place, or the dataset is not publicly available. You are responsible for requesting and downloading the dataset yourself from the authors' homepage.

Installation

To install the package, first clone this repo:

git clone https://github.com/ioverho/emotion_datasets.git

If you use uv, you can optionally run sync to install the most recent versions of all the dependencies:

cd emotion_datasets
uv sync

Usage

Accessing Dataset Metadata

To access a dataset's metadata in a Python script, assuming you have installed this library, you can run:

>>> dataset = get_dataset(${DATASET})
>>> dataset.metadata

Here you should replace ${DATASET} with a dataset name. See the table above for implemented datasets. The name should not contain any spaces.

This should return a DatasetMetadata object that contains a description, citation and licensing information, a list of all emotion columns, and metadata on how the emotion annotations were conducted.

Processing a Single Dataset

To process a single dataset using uv, run:

uv run process_dataset dataset=${DATASET}

The script is equipped with a hydra CLI. Use --help to see which options are available. To get help for a specific dataset, run: uv run process_dataset dataset=${DATASET} --help.

To change the location of the output directory, run the script with the file_system.output_dir=${OUTPUT_DIR} override.

If the dataset has already been processed and currently resides in the output directory, the script will fail, unless overwrite=True is set.

If the data needs to be manually downloaded first (see the [1] annotation in the above table), you must set the dataset.download_file_path parameter to the downloaded file. This file will not be altered during processing.
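
The overrides described above can be combined in a single call. The following is a minimal sketch, not part of the package: build_process_command is a hypothetical helper that only assembles the documented arguments (dataset, file_system.output_dir, overwrite, dataset.download_file_path) into an argument list. The exact spelling of the dataset name that the CLI expects is an assumption here; check --help for the accepted values.

```python
import shlex


def build_process_command(dataset, output_dir=None, overwrite=False, download_file_path=None):
    """Assemble the argument list for `uv run process_dataset`.

    Hypothetical helper: it only strings together the overrides named in
    this README and does not call into the package itself.
    """
    command = ["uv", "run", "process_dataset", f"dataset={dataset}"]
    if output_dir is not None:
        command.append(f"file_system.output_dir={output_dir}")
    if overwrite:
        command.append("overwrite=True")
    if download_file_path is not None:
        command.append(f"dataset.download_file_path={download_file_path}")
    return command


# "EmoBank" is taken from the table above; its exact casing is an assumption.
command = build_process_command("EmoBank", output_dir="./output", overwrite=True)
print(shlex.join(command))

# To actually run it from the repository root:
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting issues when a path contains spaces.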

Output

Running the script for any dataset should output a directory with the following structure:

/data/
    ├── ${DATASET}
    │   processed data along with metadata files
    ├── citations.bib
    │   a bib file with the citations for each dataset
    └── manifest.json
        a summary of which files can be found where
/downloads/
    Any remaining download files will reside here
/logs/
    └── ${DATASET}
        the logs produced during processing

All datasets are stored as HuggingFace datasets-compatible directories. Each directory contains at least 3 files:

  1. data-#####-of-#####.arrow: the actual data, stored across arrow files
  2. dataset_info.json: metadata relevant for users. Includes information about the homepage and citing the original dataset
  3. state.json: metadata relevant for HuggingFace
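
As a quick sanity check after processing, the three kinds of file listed above can be verified with the standard library alone. This is a minimal sketch: inspect_dataset_dir is a hypothetical helper, and the "homepage" and "citation" keys are assumed to be the field names used in dataset_info.json, so they are read with a fallback of None.

```python
import json
from pathlib import Path


def inspect_dataset_dir(dataset_dir):
    """Check a processed dataset directory for the expected files.

    Hypothetical helper, not part of the package. Raises FileNotFoundError
    if any of the three kinds of file is absent.
    """
    dataset_dir = Path(dataset_dir)
    arrow_files = sorted(p.name for p in dataset_dir.glob("data-*-of-*.arrow"))
    missing = [
        name
        for name in ("dataset_info.json", "state.json")
        if not (dataset_dir / name).is_file()
    ]
    if not arrow_files:
        missing.append("data-#####-of-#####.arrow")
    if missing:
        raise FileNotFoundError(f"{dataset_dir} is missing: {', '.join(missing)}")

    info = json.loads((dataset_dir / "dataset_info.json").read_text(encoding="utf-8"))
    return {
        "arrow_files": arrow_files,
        # Key names are assumptions; .get() returns None if they are absent.
        "homepage": info.get("homepage"),
        "citation": info.get("citation"),
    }


# Example call (the path depends on your output directory and dataset name):
# summary = inspect_dataset_dir("./data/${DATASET}")
```

Once the directory checks out, it should be loadable with datasets.load_from_disk from the HuggingFace datasets library, pointing it at the same directory.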

Processing All Datasets

To process all datasets in one go, use the process_multiple_datasets script.

uv run process_multiple_datasets

The configuration parameters for individual datasets can be set through the CLI as usual, except that each dataset's parameters now live under datasets.${DATASET}. Some datasets require additional configuration parameters; for example, REN20k must be manually downloaded. Unless these datasets are skipped, the script will fail before processing any dataset.

uv run process_multiple_datasets datasets.ren20k.download_file_path='./downloads/REN-20k.zip'

Should you wish to skip any datasets, you can use the skip argument:

uv run process_multiple_datasets 'skip=[${DATASET_1}, ${DATASET_2}]'
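
When the list of skipped datasets or per-dataset overrides is generated programmatically, the command can be assembled as follows. This is a minimal sketch under stated assumptions: build_multi_command is a hypothetical helper, and the lowercase dataset keys (ren20k, talesemotions) are inferred from the datasets.ren20k example above rather than confirmed.

```python
import shlex


def build_multi_command(skip=(), overrides=None):
    """Assemble the argument list for `uv run process_multiple_datasets`.

    Hypothetical helper: `overrides` maps dotted keys such as
    "datasets.ren20k.download_file_path" to values, and `skip` lists the
    datasets to leave out.
    """
    command = ["uv", "run", "process_multiple_datasets"]
    for key, value in (overrides or {}).items():
        command.append(f"{key}={value}")
    if skip:
        # Same list syntax as the skip example above.
        command.append("skip=[" + ", ".join(skip) + "]")
    return command


# Dataset keys below are assumptions based on the ren20k example.
command = build_multi_command(skip=["ren20k", "talesemotions"])
# shlex.join adds the shell quoting needed for the bracketed list.
print(shlex.join(command))

# To actually run it from the repository root:
# import subprocess
# subprocess.run(command, check=True)
```

Skipping both manually downloaded datasets this way avoids the early failure described above when their download_file_path has not been set.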

Citation

If you use this repo, please make sure to cite the datasets you parsed. Also, please cite this repo.

@software{ioverho_emotion_datasets,
    author = {Verhoeven, Ivo},
    license = {CC-BY-4.0},
    title = {{emotion\_datasets}},
    url = {https://github.com/ioverho/emotion_datasets}
}

Appendix

WIP Datasets
Name Description
Blogs
VENT Huge tweets dataset with many emotions
SMILE Twitter Emotion
EmoNet
emotion-cause
emotiondata-aman
IMS Datasets

Notes

  1. Both CARER and CrowdFlower will need to be edited to match the same dataset schema
  2. Check for multilabel instances in ElectoralTweets
  3. Some method for seeing samples from each dataset
  4. Some script for quickly generating a .bib file from all the downloaded datasets
  5. TalesEmotions needs to be altered to fit schema

Excluded Datasets
| Name | Exclusion Reason |
| --- | --- |
| SemEval-2019 Task 3: EmoContext | Emotion spread out over long context |
| Grounded Emotion | SoTA classifiers cannot beat random performance |
| dailydialog | Avoiding conversational data for now |
| EmoWoz | Avoiding conversational data for now |
