Add Slovak Hate Speech and Offensive Language Dataset #1274

Kroli99 · 2024-10-03T12:34:48Z

This commit introduces the Slovak Hate Speech and Offensive Language Database to MTEB. The dataset includes posts from a social network, annotated by humans for hate speech and offensive content. Additionally, the corresponding task has been added to the tasks.md table to reflect this update.

Checklist

Run tests locally to make sure nothing is broken using make test.
Run the formatter to format the code using make lint.

Adding datasets checklist

Reason for dataset addition: ...

I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command.
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- intfloat/multilingual-e5-small
I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform().
I have filled out the metadata object in the dataset file (find documentation on it here).
Run tests locally to make sure nothing is broken using make test.
Run the formatter to format the code using make lint.
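
For the "neither trivial nor random" item in the checklist above, a rough sanity check could look like the sketch below. It is only an illustration: the helper names, the 0.05 margin, and the labels and scores are all made up here and are not part of MTEB or of this dataset. "Random" is approximated as the accuracy of always predicting the most frequent label.

```python
from collections import Counter


def chance_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def judge_score(accuracy, labels, margin=0.05):
    """Label a score as 'random-like', 'trivial', or 'informative'.

    `margin` is an arbitrary illustrative tolerance, not an MTEB setting.
    """
    baseline = chance_accuracy(labels)
    if accuracy <= baseline + margin:
        return "random-like"
    if accuracy >= 1.0 - margin:
        return "trivial"
    return "informative"


# Made-up labels and scores, for illustration only.
example_labels = [0] * 60 + [1] * 40
verdicts = [judge_score(acc, example_labels) for acc in (0.62, 0.78, 0.99)]
print(verdicts)
```

In practice the accuracies would come from the result files of the two models listed in the checklist, and the labels from the task's evaluation split.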

Adding a model checklist

I have filled out the ModelMeta object to the extent possible
I have ensured that my model can be loaded using
- mteb.get_model(model_name, revision_id) and
- mteb.get_model_meta(model_name, revision_id)
I have tested the implementation works on a representative set of tasks.


KennethEnevoldsen

Looking good. Some annotations are missing. Please also fill out the datasets checklist.

KennethEnevoldsen · 2024-10-03T13:09:29Z

docs/tasks.md

@@ -485,6 +485,7 @@ The following tables give you an overview of the tasks in MTEB.
 | [SinhalaNewsClassification](https://huggingface.co/datasets/NLPC-UOM/Sinhala-News-Category-classification) (Nisansa de Silva, 2015) | ['sin'] | Classification | s2s | [News, Written] | {'train': 3327} | {'train': 148.04} |
 | [SinhalaNewsSourceClassification](https://huggingface.co/datasets/NLPC-UOM/Sinhala-News-Source-classification) (Dhananjaya et al., 2022) | ['sin'] | Classification | s2s | [News, Written] | {'train': 24094} | {'train': 56.08} |
 | [SiswatiNewsClassification](https://huggingface.co/datasets/dsfsi/za-isizulu-siswati-news) (Madodonga et al., 2023) | ['ssw'] | Classification | s2s | [News, Written] | {'train': 80} | {'train': 354.2} |
+| [SlovakHateSpeechClassification](https://huggingface.co/datasets/TUKE-KEMT/hate_speech_slovak) | ['slk'] | Classification | s2s | [Social, Written] | {'test': 1319} | {'test': 92.71} |


Suggested change

-| [SlovakHateSpeechClassification](https://huggingface.co/datasets/TUKE-KEMT/hate_speech_slovak) | ['slk'] | Classification | s2s | [Social, Written] | {'test': 1319} | {'test': 92.71} |

will be added automatically

KennethEnevoldsen · 2024-10-03T13:09:50Z

mteb/tasks/Classification/slk/SlovakHateSpeechClassification.py

+        task_subtypes=["Sentiment/Hate speech"],
+        license="cc-by-sa-4.0",
+        annotations_creators="human-annotated",
+        dialect=None,


Suggested change

-        dialect=None,
+        dialect=[],

KennethEnevoldsen · 2024-10-03T13:10:17Z

mteb/tasks/Classification/slk/SlovakHateSpeechClassification.py

+        descriptive_stats={
+            "n_samples": {"test": 1319},
+            "avg_character_length": {"test": 92.71},
+        },

Citation?
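
For reference, the two descriptive_stats fields in this hunk (n_samples and avg_character_length) could be reproduced along the lines of the sketch below. It assumes that character length means Python's len() on each text and that the average is rounded to two decimals; how MTEB computes these numbers internally is not shown in this PR. The helper name and the toy texts are hypothetical.

```python
def descriptive_stats(texts_by_split):
    """Compute per-split sample counts and mean character lengths.

    `texts_by_split` maps a split name to a list of strings.
    """
    n_samples = {}
    avg_character_length = {}
    for split, texts in texts_by_split.items():
        n_samples[split] = len(texts)
        # Guard against empty splits to avoid dividing by zero.
        total_chars = sum(len(text) for text in texts)
        avg_character_length[split] = (
            round(total_chars / len(texts), 2) if texts else 0.0
        )
    return {
        "n_samples": n_samples,
        "avg_character_length": avg_character_length,
    }


# Toy texts for illustration only, not taken from the dataset.
toy_stats = descriptive_stats({"test": ["ahoj", "dobrý deň", "nie"]})
print(toy_stats)
```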


KennethEnevoldsen · 2024-10-03T13:10:59Z

mteb/tasks/Classification/slk/SlovakHateSpeechClassification.py

+        type="Classification",
+        category="s2s",
+        modalities=["text"],
+        date=None,

required
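
The date field in this hunk is set to None. As a sketch of what a filled-in value might look like, the snippet below assumes the field takes a (start, end) pair of ISO-formatted date strings describing when the data was collected; that shape is an assumption to verify against the TaskMetadata documentation. The helper name is hypothetical and the dates are placeholders, not the real collection period of this dataset.

```python
from datetime import date


def parse_date_range(start, end):
    """Validate a (start, end) pair of YYYY-MM-DD strings.

    Returns the pair as normalized ISO strings; raises ValueError when a
    string is not a valid date or when start comes after end.
    """
    start_date = date.fromisoformat(start)
    end_date = date.fromisoformat(end)
    if start_date > end_date:
        raise ValueError(f"start {start} is after end {end}")
    return (start_date.isoformat(), end_date.isoformat())


# Placeholder dates, NOT the actual collection period of the dataset.
placeholder_range = parse_date_range("2000-01-01", "2000-12-31")
print(placeholder_range)
```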


Kroli99 · 2024-10-08T15:30:22Z

I am currently working on a custom task named SlovakHateSpeechClassification in MTEB and would like to evaluate it using a specific model. However, I encountered an issue when trying to run the following command:

mteb run -m sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 -t SlovakHateSpeechClassification

The error message I received is:

KeyError: "KeyError: 'SlovakHateSpeechClassification' not found. Did you mean: KorHateSpeechMLClassification?"

It seems like the SlovakHateSpeechClassification task is not being recognized by MTEB. I have defined the task and implemented the necessary changes, but I’m unsure if I need to register the task differently or if there’s an issue with how it’s being recognized by the framework. Could you please provide some guidance on why the task is not being detected and what additional steps might be necessary to integrate it successfully?

________________________________
From: Kenneth Enevoldsen ***@***.***>
Sent: Thursday, 3 October 2024 15:13
To: embeddings-benchmark/mteb ***@***.***>
Cc: Oliver Pejic (s) ***@***.***>; Author ***@***.***>
Subject: Re: [embeddings-benchmark/mteb] Add Slovak Hate Speech and Offensive Language Dataset (PR #1274)

@KennethEnevoldsen requested changes on this pull request.

Looking good. Some annotations are missing. Please also fill out the datasets checklist

________________________________
In docs/tasks.md:

@@ -485,6 +485,7 @@ The following tables give you an overview of the tasks in MTEB.
 | [SinhalaNewsClassification](https://huggingface.co/datasets/NLPC-UOM/Sinhala-News-Category-classification) (Nisansa de Silva, 2015) | ['sin'] | Classification | s2s | [News, Written] | {'train': 3327} | {'train': 148.04} |
 | [SinhalaNewsSourceClassification](https://huggingface.co/datasets/NLPC-UOM/Sinhala-News-Source-classification) (Dhananjaya et al., 2022) | ['sin'] | Classification | s2s | [News, Written] | {'train': 24094} | {'train': 56.08} |
 | [SiswatiNewsClassification](https://huggingface.co/datasets/dsfsi/za-isizulu-siswati-news) (Madodonga et al., 2023) | ['ssw'] | Classification | s2s | [News, Written] | {'train': 80} | {'train': 354.2} |
+| [SlovakHateSpeechClassification](https://huggingface.co/datasets/TUKE-KEMT/hate_speech_slovak) | ['slk'] | Classification | s2s | [Social, Written] | {'test': 1319} | {'test': 92.71} |

⬇️ Suggested change
-| [SlovakHateSpeechClassification](https://huggingface.co/datasets/TUKE-KEMT/hate_speech_slovak) | ['slk'] | Classification | s2s | [Social, Written] | {'test': 1319} | {'test': 92.71} |

________________________________
In mteb/tasks/Classification/slk/SlovakHateSpeechClassification.py:

+        dataset={
+            "path": "TUKE-KEMT/hate_speech_slovak",
+            "revision": "f9301b9937128c9c0b636fa6da203aeb046479f4",
+        },
+        type="Classification",
+        category="s2s",
+        modalities=["text"],
+        date=None,
+        eval_splits=["test"],
+        eval_langs=["slk-Latn"],
+        main_score="accuracy",
+        domains=["Social", "Written"],
+        task_subtypes=["Sentiment/Hate speech"],
+        license="cc-by-sa-4.0",
+        annotations_creators="human-annotated",
+        dialect=None,

⬇️ Suggested change
-        dialect=None,
+        dialect=[],

________________________________
In mteb/tasks/Classification/slk/SlovakHateSpeechClassification.py:

+        category="s2s",
+        modalities=["text"],
+        date=None,
+        eval_splits=["test"],
+        eval_langs=["slk-Latn"],
+        main_score="accuracy",
+        domains=["Social", "Written"],
+        task_subtypes=["Sentiment/Hate speech"],
+        license="cc-by-sa-4.0",
+        annotations_creators="human-annotated",
+        dialect=None,
+        sample_creation="found",
+        descriptive_stats={
+            "n_samples": {"test": 1319},
+            "avg_character_length": {"test": 92.71},
+        },

Citation?

________________________________
In mteb/tasks/Classification/slk/SlovakHateSpeechClassification.py:

+from mteb.abstasks.TaskMetadata import TaskMetadata
+
+
+class SlovakHateSpeechClassification(AbsTaskClassification):
+    metadata = TaskMetadata(
+        name="SlovakHateSpeechClassification",
+        description="The dataset contains posts from a social network with human annotations for hateful or offensive language in Slovak.",
+        reference="https://huggingface.co/datasets/TUKE-KEMT/hate_speech_slovak",
+        dataset={
+            "path": "TUKE-KEMT/hate_speech_slovak",
+            "revision": "f9301b9937128c9c0b636fa6da203aeb046479f4",
+        },
+        type="Classification",
+        category="s2s",
+        modalities=["text"],
+        date=None,

required

________________________________
In docs/tasks.md:

@@ -485,6 +485,7 @@ The following tables give you an overview of the tasks in MTEB.
 | [SinhalaNewsClassification](https://huggingface.co/datasets/NLPC-UOM/Sinhala-News-Category-classification) (Nisansa de Silva, 2015) | ['sin'] | Classification | s2s | [News, Written] | {'train': 3327} | {'train': 148.04} |
 | [SinhalaNewsSourceClassification](https://huggingface.co/datasets/NLPC-UOM/Sinhala-News-Source-classification) (Dhananjaya et al., 2022) | ['sin'] | Classification | s2s | [News, Written] | {'train': 24094} | {'train': 56.08} |
 | [SiswatiNewsClassification](https://huggingface.co/datasets/dsfsi/za-isizulu-siswati-news) (Madodonga et al., 2023) | ['ssw'] | Classification | s2s | [News, Written] | {'train': 80} | {'train': 354.2} |
+| [SlovakHateSpeechClassification](https://huggingface.co/datasets/TUKE-KEMT/hate_speech_slovak) | ['slk'] | Classification | s2s | [Social, Written] | {'test': 1319} | {'test': 92.71} |

will be added automatically

KennethEnevoldsen · 2024-10-08T17:45:51Z

You will need to import the dataset in the __init__.py file:

https://github.com/embeddings-benchmark/mteb/blob/main/mteb/tasks/Classification/__init__.py
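
To illustrate why a missing import can produce the KeyError reported in this thread, here is a minimal, self-contained sketch of a name-based task registry. It is not MTEB's actual implementation: the AbsTask stand-in, build_registry, and get_task are hypothetical, and the only assumption is that tasks are discovered through classes that have already been imported (defined).

```python
import difflib


class AbsTask:
    """Stand-in base class for this sketch; not MTEB's real AbsTask."""

    name = None


def build_registry(base):
    """Map task names to classes, walking all known subclasses of `base`."""
    registry = {}
    stack = list(base.__subclasses__())
    while stack:
        cls = stack.pop()
        stack.extend(cls.__subclasses__())
        if cls.name:
            registry[cls.name] = cls
    return registry


def get_task(name, registry):
    """Look up a task by name, suggesting the closest known name on failure."""
    if name not in registry:
        close = difflib.get_close_matches(name, list(registry), n=1)
        hint = f" Did you mean: {close[0]}?" if close else ""
        raise KeyError(f"'{name}' not found.{hint}")
    return registry[name]


class KorHateSpeechMLClassification(AbsTask):
    name = "KorHateSpeechMLClassification"


# Before the new class is defined (i.e. imported), the registry cannot see it.
registry_before = build_registry(AbsTask)


class SlovakHateSpeechClassification(AbsTask):
    name = "SlovakHateSpeechClassification"


# Once the class exists in the process, a rebuilt registry contains it.
registry_after = build_registry(AbsTask)
```

In this sketch, looking up SlovakHateSpeechClassification in registry_before raises a KeyError with a "Did you mean" hint, while registry_after resolves it. In the PR, the equivalent step is adding an import for the new task module to the linked __init__.py, following the conventions already used in that file.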

- Updated __init__.py to include the new SlovakHateSpeechClassification task.
- Modified SlovakHateSpeechClassification.py as per review suggestions to enhance functionality and readability.

Kroli99 · 2024-10-17T23:25:37Z

I added all the requested changes.
Can you please look through it and let me know if it's alright or if you need anything else from me?

KennethEnevoldsen requested changes Oct 3, 2024


Kroli99 added 2 commits October 15, 2024 16:30

Add Slovak Hate Speech and Offensive Language Dataset

61c938e


Did requested changes:

edb4e45

- Updated __init__.py to include the new SlovakHateSpeechClassification task.
- Modified SlovakHateSpeechClassification.py as per review suggestions to enhance functionality and readability.
