Profanity Detection Datasets

This repository contains datasets for profanity detection in several languages. The datasets are drawn from the resources listed below and have been cleaned and organized into three columns: text, label (insult/non-insult), and language. The repository is intended as a resource for developers working on content moderation and language processing projects.
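As a rough illustration of the three-column layout, the sketch below reads rows from CSV text and groups the insult entries by language. The header names (`text`, `label`, `language`), the label strings, and the sample rows are assumptions made for this example and may not match the files exactly.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the three-column layout described above.
# Header names and label strings are assumptions, not taken from the files.
SAMPLE_CSV = """text,label,language
you are an idiot,insult,en
have a nice day,non-insult,en
tu es stupide,insult,fr
"""


def load_rows(handle):
    """Read rows with text, label, and language columns into dictionaries."""
    return [
        {
            "text": record["text"].strip(),
            "label": record["label"].strip(),
            "language": record["language"].strip(),
        }
        for record in csv.DictReader(handle)
    ]


def insults_by_language(rows):
    """Group the texts labeled as insults by their language code."""
    grouped = defaultdict(list)
    for row in rows:
        if row["label"] == "insult":
            grouped[row["language"]].append(row["text"])
    return dict(grouped)


rows = load_rows(io.StringIO(SAMPLE_CSV))
grouped = insults_by_language(rows)
```

To read an actual file instead of the in-memory sample, pass an open file handle (for example one opened with `encoding="utf-8"` and `newline=""`) to `load_rows`.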

About the Data Preparation Process

🟢 The datasets were prepared by collecting source datasets, cleaning them, merging them, and organizing them into a structured format. This preparation matters because the accuracy and usability of the data directly affect analysis and model training in applications such as content moderation and language processing.
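The cleaning and merging steps could look roughly like the sketch below. It is not the script used to build these datasets. The normalization choices (Unicode NFKC, whitespace collapsing, lowercasing), the function names, and the sample rows are all assumptions for illustration.

```python
import unicodedata


def clean_text(text):
    """Normalize Unicode, collapse runs of whitespace, and lowercase."""
    normalized = unicodedata.normalize("NFKC", text)
    return " ".join(normalized.split()).lower()


def merge_sources(sources):
    """Merge (text, label, language) rows, dropping empty and duplicate entries."""
    seen = set()
    merged = []
    for rows in sources:
        for text, label, language in rows:
            cleaned = clean_text(text)
            if not cleaned:
                continue  # skip rows whose text is empty after cleaning
            key = (cleaned, language)
            if key in seen:
                continue  # keep only the first occurrence per language
            seen.add(key)
            merged.append({"text": cleaned, "label": label, "language": language})
    return merged


# Hypothetical source rows standing in for the collected datasets.
source_a = [("  Idiot ", "insult", "en"), ("Hello   there", "non-insult", "en")]
source_b = [("idiot", "insult", "en"), ("", "insult", "fr"), ("Stupide", "insult", "fr")]
merged = merge_sources([source_a, source_b])
```

Deduplicating on the pair (cleaned text, language) keeps a word that is spelled the same in two languages as two separate rows.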

Datasets

1. Processed Darija Dataset

File: processed_darija_dataset.txt

This file contains a curated list of bad words and profanities in Moroccan Darija. The dataset is processed and ready for use in applications requiring Moroccan Arabic language support.

2. Processed Bad Words

File: processed_bad_words.txt

This file includes a collection of bad words and profanities in the following languages:

Arabic (ar): العربية
English (en): English
Czech (cs): Čeština
Danish (da): Dansk
German (de): Deutsch
Esperanto (eo): Esperanto

3. Bad Words in Multiple Languages

File: bad_words_langs.txt

This file contains bad words and profanities in the following languages:

French (fr)
Turkish (tr)
Italian (it)
Russian (ru)
Spanish (es)
Portuguese (pt)
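
One way word lists like these might be used in a moderation pipeline is a simple per-language token lookup, sketched below. The `WORD_LISTS` dictionary, its entries, and the function name are hypothetical. Real word sets would be loaded from the dataset files.

```python
import re

# Hypothetical word lists keyed by language code.
# Real lists would be loaded from the dataset files.
WORD_LISTS = {
    "en": {"idiot", "moron"},
    "fr": {"stupide"},
}

# Matches runs of word characters, including non-ASCII letters.
TOKEN_RE = re.compile(r"\w+")


def find_flagged_words(text, language, word_lists=WORD_LISTS):
    """Return the tokens of text that appear in the word list for language."""
    vocabulary = word_lists.get(language, set())
    tokens = TOKEN_RE.findall(text.lower())
    return [token for token in tokens if token in vocabulary]


flagged = find_flagged_words("What an IDIOT, truly.", "en")
clean = find_flagged_words("Bonjour tout le monde", "fr")
```

Whole-token matching is a deliberately simple choice: it avoids flagging innocent words that merely contain a listed word, but it misses multi-word phrases and obfuscated spellings.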

Repository files:

.idea
README.md
bad_words_langs.csv
index.txt
processed_bad_words.csv
processed_darija_dataset.csv


imanebelhaj/profanity_datasets_offensive_speach
