Skip to content

imanebelhaj/profanity_datasets_offensive_speach

Repository files navigation

Profanity Detection Datasets

This repository contains datasets for profanity detection in various languages. The datasets are sourced from the resources listed below and have been cleaned and organized into three columns: text, label (insult/non-insult), and language. It aims to provide a comprehensive resource for developers working on content moderation and language processing projects.

About the Data Preparation Process

🟢 The datasets involve a comprehensive data preparation process, which is crucial in the field of Data Science and Machine Learning. This includes collecting datasets, cleaning them, merging them, and organizing them into structured formats. Such preparation is essential for ensuring the accuracy and usability of data for analysis and model training in various applications, including content moderation and language processing.

Datasets

1. Processed Darija Dataset

File: processed_darija_dataset.txt

This file contains a curated list of bad words and profanities in Moroccan Darija. The dataset is processed and ready for use in applications requiring Moroccan Arabic language support.

2. Processed Bad Words

File: processed_bad_words.txt

This file includes a collection of bad words and profanities in the following languages:

  • Arabic (ar): العربية
  • English (en): English
  • Czech (cs): Čeština
  • Danish (da): Dansk
  • German (de): Deutsch
  • Esperanto (eo): Esperanto

3. Bad Words in Multiple Languages

File: bad_words_langs.txt

This file contains bad words and profanities in the following languages:

  • French (fr)
  • Turkish (tr)
  • Italian (it)
  • Russian (ru)
  • Spanish (es)
  • Portuguese (pt)

Resources

About

datasets for profanity detection in multiple languages

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published