Add Slovak Hate Speech and Offensive Language Dataset #1274
This commit introduces the Slovak Hate Speech and Offensive Language Database to MTEB. The dataset consists of social network posts annotated by humans for hate speech and offensive content. The corresponding task has also been added to the table in tasks.md.
Looking good. Some annotations are missing. Please also fill out the datasets checklist
docs/tasks.md
@@ -485,6 +485,7 @@ The following tables give you an overview of the tasks in MTEB.
| [SinhalaNewsClassification](https://huggingface.co/datasets/NLPC-UOM/Sinhala-News-Category-classification) (Nisansa de Silva, 2015) | ['sin'] | Classification | s2s | [News, Written] | {'train': 3327} | {'train': 148.04} |
| [SinhalaNewsSourceClassification](https://huggingface.co/datasets/NLPC-UOM/Sinhala-News-Source-classification) (Dhananjaya et al., 2022) | ['sin'] | Classification | s2s | [News, Written] | {'train': 24094} | {'train': 56.08} |
| [SiswatiNewsClassification](https://huggingface.co/datasets/dsfsi/za-isizulu-siswati-news) (Madodonga et al., 2023) | ['ssw'] | Classification | s2s | [News, Written] | {'train': 80} | {'train': 354.2} |
| [SlovakHateSpeechClassification](https://huggingface.co/datasets/TUKE-KEMT/hate_speech_slovak) | ['slk'] | Classification | s2s | [Social, Written] | {'test': 1319} | {'test': 92.71} |
Suggested change:
-| [SlovakHateSpeechClassification](https://huggingface.co/datasets/TUKE-KEMT/hate_speech_slovak) | ['slk'] | Classification | s2s | [Social, Written] | {'test': 1319} | {'test': 92.71} |
will be added automatically
task_subtypes=["Sentiment/Hate speech"],
license="cc-by-sa-4.0",
annotations_creators="human-annotated",
dialect=None,
Suggested change:
- dialect=None,
+ dialect=[],
descriptive_stats={
"n_samples": {"test": 1319},
"avg_character_length": {"test": 92.71},
},
Citation?
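For reference, the two `descriptive_stats` values in the snippet above can be recomputed from the raw texts. This is a minimal sketch, assuming `avg_character_length` means the mean number of characters per sample; the helper name `describe_split` and the toy texts are hypothetical and are not taken from the dataset.

```python
def describe_split(texts, split="test"):
    """Build a descriptive_stats-style dict for one split (hypothetical helper)."""
    n = len(texts)
    # Mean character count per sample, rounded to two decimals like the value in the PR.
    avg_len = round(sum(len(t) for t in texts) / n, 2)
    return {
        "n_samples": {split: n},
        "avg_character_length": {split: avg_len},
    }


# Toy input, NOT real dataset content.
toy_texts = ["dobre", "nie"]
stats = describe_split(toy_texts)
print(stats)
```

Running the same computation over the actual test split should give the figures to enter in the metadata.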
type="Classification",
category="s2s",
modalities=["text"],
date=None,
required
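As a hedged illustration of what a filled-in `date` could look like: the field is expected to be a pair of ISO-formatted date strings (start and end of the period in which the texts were written), to the best of my understanding of `TaskMetadata`. The values below are placeholders, not the real collection period of this dataset, and `check_date_range` is a hypothetical helper for sanity-checking the pair before adding it.

```python
from datetime import date


def check_date_range(date_range):
    """Parse a (start, end) pair of YYYY-MM-DD strings and verify the order."""
    start, end = (date.fromisoformat(d) for d in date_range)
    if start > end:
        raise ValueError("start date must not be after end date")
    return start, end


# Placeholder values only; the actual period must come from the dataset authors.
placeholder_range = ("2024-01-01", "2024-12-31")
start, end = check_date_range(placeholder_range)
```

In the task file this would then read `date=("YYYY-MM-DD", "YYYY-MM-DD")` with the real dates filled in.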
I am currently working on a custom task named SlovakHateSpeechClassification in MTEB and would like to evaluate it using a specific model. However, I encountered an issue when trying to run the following command:
mteb run -m sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 -t SlovakHateSpeechClassification
The error message I received is:
KeyError: "KeyError: 'SlovakHateSpeechClassification' not found. Did you mean: KorHateSpeechMLClassification?"
It seems like the SlovakHateSpeechClassification task is not being recognized by MTEB. I have defined the task and implemented the necessary changes, but I’m unsure if I need to register the task differently or if there’s an issue with how it’s being recognized by the framework.
Could you please provide some guidance on why the task is not being detected and what additional steps might be necessary to integrate it successfully?
________________________________
From: Kenneth Enevoldsen ***@***.***>
Sent: Thursday, 3 October 2024 15:13
To: embeddings-benchmark/mteb ***@***.***>
Cc: Oliver Pejic (s) ***@***.***>; Author ***@***.***>
Subject: Re: [embeddings-benchmark/mteb] Add Slovak Hate Speech and Offensive Language Dataset (PR #1274)
@KennethEnevoldsen requested changes on this pull request.
Looking good. Some annotations are missing. Please also fill out the datasets checklist
________________________________
In docs/tasks.md:
@@ -485,6 +485,7 @@ The following tables give you an overview of the tasks in MTEB.
| [SinhalaNewsClassification](https://huggingface.co/datasets/NLPC-UOM/Sinhala-News-Category-classification) (Nisansa de Silva, 2015) | ['sin'] | Classification | s2s | [News, Written] | {'train': 3327} | {'train': 148.04} |
| [SinhalaNewsSourceClassification](https://huggingface.co/datasets/NLPC-UOM/Sinhala-News-Source-classification) (Dhananjaya et al., 2022) | ['sin'] | Classification | s2s | [News, Written] | {'train': 24094} | {'train': 56.08} |
| [SiswatiNewsClassification](https://huggingface.co/datasets/dsfsi/za-isizulu-siswati-news) (Madodonga et al., 2023) | ['ssw'] | Classification | s2s | [News, Written] | {'train': 80} | {'train': 354.2} |
+| [SlovakHateSpeechClassification](https://huggingface.co/datasets/TUKE-KEMT/hate_speech_slovak) | ['slk'] | Classification | s2s | [Social, Written] | {'test': 1319} | {'test': 92.71} |
⬇️ Suggested change
-| [SlovakHateSpeechClassification](https://huggingface.co/datasets/TUKE-KEMT/hate_speech_slovak) | ['slk'] | Classification | s2s | [Social, Written] | {'test': 1319} | {'test': 92.71} |
________________________________
In mteb/tasks/Classification/slk/SlovakHateSpeechClassification.py:
+ dataset={
+ "path": "TUKE-KEMT/hate_speech_slovak",
+ "revision": "f9301b9937128c9c0b636fa6da203aeb046479f4",
+ },
+ type="Classification",
+ category="s2s",
+ modalities=["text"],
+ date=None,
+ eval_splits=["test"],
+ eval_langs=["slk-Latn"],
+ main_score="accuracy",
+ domains=["Social", "Written"],
+ task_subtypes=["Sentiment/Hate speech"],
+ license="cc-by-sa-4.0",
+ annotations_creators="human-annotated",
+ dialect=None,
⬇️ Suggested change
- dialect=None,
+ dialect=[],
________________________________
In mteb/tasks/Classification/slk/SlovakHateSpeechClassification.py:
+ category="s2s",
+ modalities=["text"],
+ date=None,
+ eval_splits=["test"],
+ eval_langs=["slk-Latn"],
+ main_score="accuracy",
+ domains=["Social", "Written"],
+ task_subtypes=["Sentiment/Hate speech"],
+ license="cc-by-sa-4.0",
+ annotations_creators="human-annotated",
+ dialect=None,
+ sample_creation="found",
+ descriptive_stats={
+ "n_samples": {"test": 1319},
+ "avg_character_length": {"test": 92.71},
+ },
Citation?
________________________________
In mteb/tasks/Classification/slk/SlovakHateSpeechClassification.py:
+from mteb.abstasks.TaskMetadata import TaskMetadata
+
+
+class SlovakHateSpeechClassification(AbsTaskClassification):
+ metadata = TaskMetadata(
+ name="SlovakHateSpeechClassification",
+ description="The dataset contains posts from a social network with human annotations for hateful or offensive language in Slovak.",
+ reference="https://huggingface.co/datasets/TUKE-KEMT/hate_speech_slovak",
+ dataset={
+ "path": "TUKE-KEMT/hate_speech_slovak",
+ "revision": "f9301b9937128c9c0b636fa6da203aeb046479f4",
+ },
+ type="Classification",
+ category="s2s",
+ modalities=["text"],
+ date=None,
required
________________________________
In docs/tasks.md:
@@ -485,6 +485,7 @@ The following tables give you an overview of the tasks in MTEB.
| [SinhalaNewsClassification](https://huggingface.co/datasets/NLPC-UOM/Sinhala-News-Category-classification) (Nisansa de Silva, 2015) | ['sin'] | Classification | s2s | [News, Written] | {'train': 3327} | {'train': 148.04} |
| [SinhalaNewsSourceClassification](https://huggingface.co/datasets/NLPC-UOM/Sinhala-News-Source-classification) (Dhananjaya et al., 2022) | ['sin'] | Classification | s2s | [News, Written] | {'train': 24094} | {'train': 56.08} |
| [SiswatiNewsClassification](https://huggingface.co/datasets/dsfsi/za-isizulu-siswati-news) (Madodonga et al., 2023) | ['ssw'] | Classification | s2s | [News, Written] | {'train': 80} | {'train': 354.2} |
+| [SlovakHateSpeechClassification](https://huggingface.co/datasets/TUKE-KEMT/hate_speech_slovak) | ['slk'] | Classification | s2s | [Social, Written] | {'test': 1319} | {'test': 92.71} |
will be added automatically
You will need to import the dataset in the `__init__.py` file: https://github.com/embeddings-benchmark/mteb/blob/main/mteb/tasks/Classification/__init__.py
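To illustrate why the missing import produces the `KeyError` reported earlier in the thread: task lookup by name only sees classes that Python has actually loaded. The sketch below is a self-contained stand-in, not MTEB's real code; `AbsTask` and `get_task` here are hypothetical simplifications, and the assumption is that the framework discovers tasks among the subclasses that have been imported into the package.

```python
import difflib


class AbsTask:
    """Stand-in base class (hypothetical, not MTEB's real AbsTask)."""


def get_task(name):
    # Only subclasses that have been defined (i.e. whose module was imported) appear here.
    registry = {cls.__name__: cls for cls in AbsTask.__subclasses__()}
    if name not in registry:
        close = difflib.get_close_matches(name, list(registry), n=1)
        hint = f" Did you mean: {close[0]}?" if close else ""
        raise KeyError(f"'{name}' not found.{hint}")
    return registry[name]


class KorHateSpeechMLClassification(AbsTask):
    pass


def lookup_fails(name):
    try:
        get_task(name)
    except KeyError:
        return True
    return False


# Before the class is defined/imported, the lookup fails.
missing_before = lookup_fails("SlovakHateSpeechClassification")


# Defining the class plays the role of the import added to __init__.py.
class SlovakHateSpeechClassification(AbsTask):
    pass


found_after = get_task("SlovakHateSpeechClassification") is SlovakHateSpeechClassification
```

In the real package, the equivalent step is an import of the new module in `mteb/tasks/Classification/__init__.py`, presumably following the pattern of the neighbouring import lines in that file.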
- Updated __init__.py to include the new SlovakHateSpeechClassification task.
- Modified SlovakHateSpeechClassification.py as per review suggestions to enhance functionality and readability.
I added all the requested changes.
Checklist
- make test
- make lint

Adding datasets checklist
Reason for dataset addition: ...
- mteb -m {model_name} -t {task_name} command
  - sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  - intfloat/multilingual-e5-small
- self.stratified_subsampling() under dataset_transform()
- make test
- make lint

Adding a model checklist
- mteb.get_model(model_name, revision_id) and mteb.get_model_meta(model_name, revision_id)
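
The `self.stratified_subsampling()` item in the datasets checklist refers to shrinking a large split while keeping the label distribution roughly intact. The following is a minimal pure-Python sketch of that idea, not MTEB's implementation; the function name `stratified_subsample`, its parameters, and the toy labels are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, n_samples, seed=42):
    """Return sorted indices of a subsample that roughly preserves label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    total = len(labels)
    picked = []
    for indices in by_label.values():
        # Allocate a share of the budget proportional to the label frequency,
        # keeping at least one example per label.
        k = max(1, round(n_samples * len(indices) / total))
        k = min(k, len(indices))
        picked.extend(rng.sample(indices, k))
    return sorted(picked)


# Toy labels: 80 negatives and 20 positives (not real dataset content).
toy_labels = [0] * 80 + [1] * 20
subset = stratified_subsample(toy_labels, n_samples=10)
```

With 1319 test samples, the split in this PR is small enough that subsampling is unlikely to be needed.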