Profanity Detection Datasets

This repository contains datasets for profanity detection in several languages. The datasets are sourced from the resources listed below and have been cleaned and organized into three columns: text, label (insult/non-insult), and language. The repository is intended as a single starting point for developers working on content moderation and language processing projects.
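As a rough illustration of the three-column layout, the sketch below parses rows into typed records. It assumes a comma-separated file with a header row named text,label,language; the README does not state the actual delimiter or header, so adjust both to match the files. The `Row` type, the `read_rows` helper, and the sample rows are hypothetical.

```python
import csv
import io
from typing import NamedTuple


class Row(NamedTuple):
    text: str
    label: str      # "insult" or "non-insult"
    language: str   # language code such as "en"


def read_rows(handle) -> list[Row]:
    """Parse rows from a file-like object with text, label, language columns.

    Assumption: comma-separated values with a header row.
    """
    reader = csv.DictReader(handle)
    return [
        Row(r["text"].strip(), r["label"].strip(), r["language"].strip())
        for r in reader
    ]


# Hypothetical sample standing in for a real dataset file.
SAMPLE = (
    "text,label,language\n"
    "hello there,non-insult,en\n"
    "placeholder_insult,insult,en\n"
)

rows = read_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row.language, row.label, row.text)
```

To read a real file, pass an open file handle instead of the in-memory sample, for example `open(path, encoding="utf-8", newline="")`.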

About the Data Preparation Process

🟢 Preparing the datasets involved four steps: collecting the source datasets, cleaning them, merging them, and organizing them into a structured format. This preparation matters because the accuracy and usability of the data for analysis and model training, including content moderation and language processing applications, depend on it.
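The cleaning and merging steps could look roughly like the sketch below. It is not the procedure actually used for this repository, which the README does not describe in detail. The normalization choices (Unicode NFKC, whitespace collapsing, case folding) and the helper names `clean` and `merge_sources` are assumptions made for illustration.

```python
import unicodedata


def clean(text: str) -> str:
    """Normalize Unicode, collapse runs of whitespace, and case-fold."""
    normalized = unicodedata.normalize("NFKC", text)
    return " ".join(normalized.split()).casefold()


def merge_sources(sources):
    """Merge lists of (text, label, language) rows.

    Rows with empty text and exact duplicates (after cleaning) are dropped;
    the first occurrence of each row is kept, in input order.
    """
    seen = set()
    merged = []
    for rows in sources:
        for text, label, language in rows:
            row = (clean(text), label.strip().lower(), language.strip().lower())
            if not row[0] or row in seen:
                continue
            seen.add(row)
            merged.append(row)
    return merged


# Hypothetical inputs standing in for two collected source datasets.
source_a = [("  Sample  Phrase ", "non-insult", "EN"), ("", "insult", "en")]
source_b = [("sample phrase", "non-insult", "en"), ("autre exemple", "non-insult", "fr")]

merged = merge_sources([source_a, source_b])
for row in merged:
    print(row)
```

Case folding suits word lists like these, but it discards capitalization that some models use as a signal, so it may not be appropriate for every downstream task.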

Datasets

1. Processed Darija Dataset

File: processed_darija_dataset.txt

This file contains a curated list of bad words and profanities in Moroccan Darija. The dataset is processed and ready for use in applications requiring Moroccan Arabic language support.

2. Processed Bad Words

File: processed_bad_words.txt

This file includes a collection of bad words and profanities in the following languages:

Arabic (ar): العربية
English (en): English
Czech (cs): Čeština
Danish (da): Dansk
German (de): Deutsch
Esperanto (eo): Esperanto

3. Bad Words in Multiple Languages

File: bad_words_langs.txt

This file contains bad words and profanities in the following languages:

French (fr)
Turkish (tr)
Italian (it)
Russian (ru)
Spanish (es)
Portuguese (pt)
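One simple way to use word lists like the ones above is exact-match lookup against a set. The sketch below assumes one entry per line in each file, which the README does not confirm. The helper names `load_word_list` and `flag_tokens` and the placeholder entries are hypothetical.

```python
import string


def load_word_list(lines):
    """Build a lookup set from an iterable of lines, skipping blank lines."""
    return {line.strip().casefold() for line in lines if line.strip()}


def flag_tokens(message, word_list):
    """Return the whitespace-separated tokens of message found in word_list.

    Tokens are stripped of ASCII punctuation and case-folded before lookup.
    """
    tokens = (tok.strip(string.punctuation).casefold() for tok in message.split())
    return [tok for tok in tokens if tok in word_list]


# Placeholder entries standing in for the contents of a word-list file.
words = load_word_list(["foo\n", "\n", "Bar\n"])
flagged = flag_tokens("This is foo, not BAR!", words)
print(flagged)
```

Exact matching misses obfuscated spellings and multi-word phrases, and whitespace tokenization does not suit every language, so treat this as a baseline rather than a complete detector.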

Resources

Mendeley Data
DoltHub Bad Words Repository
Kaggle Jigsaw Multilingual Profanity Dataset
