This repository contains datasets for profanity detection in various languages.
The datasets are sourced from the resources listed below and have been cleaned and organized into three columns: text, label (insult/non-insult), and language.
The repository is intended as a resource for developers working on content moderation and language processing projects.
🟢 Preparing the datasets involved collecting them from their sources, cleaning them, merging them, and organizing them into a structured format. This preparation matters because the accuracy and usability of the data for analysis and model training, including content moderation and language processing, depend on it.
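The preparation steps can be illustrated with a minimal Python sketch. It is not the script used to build this repository: the function names (clean_text, merge_sources) and the specific cleaning rules (whitespace collapsing, lowercasing, dropping empty rows and duplicates) are assumptions chosen for illustration. Only the three output columns come from the description above.

```python
import re


def clean_text(text):
    """Collapse runs of whitespace, trim the ends, and lowercase.

    These normalization rules are assumptions for illustration only.
    """
    return re.sub(r"\s+", " ", text).strip().lower()


def merge_sources(sources):
    """Merge several sources into rows with text, label, and language.

    `sources` is an iterable of (rows, language) pairs, where `rows`
    is an iterable of (text, label) pairs (a hypothetical input shape).
    Empty texts and duplicates within a language are dropped.
    """
    seen = set()
    merged = []
    for rows, language in sources:
        for text, label in rows:
            cleaned = clean_text(text)
            if not cleaned:
                continue  # skip rows that are empty after cleaning
            key = (cleaned, language)
            if key in seen:
                continue  # skip duplicates within the same language
            seen.add(key)
            merged.append(
                {"text": cleaned, "label": label, "language": language}
            )
    return merged
```

Deduplicating on the (text, language) pair rather than on text alone keeps a word that appears in two languages as two separate rows.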
File: processed_darija_dataset.txt
This file contains a curated list of bad words and profanities in Moroccan Darija. The dataset is processed and ready for use in applications requiring Moroccan Arabic language support.
File: processed_bad_words.txt
This file includes a collection of bad words and profanities in the following languages:
- Arabic (ar): العربية
- English (en): English
- Czech (cs): Čeština
- Danish (da): Dansk
- German (de): Deutsch
- Esperanto (eo): Esperanto
File: bad_words_langs.txt
This file contains bad words and profanities in the following languages:
- French (fr)
- Turkish (tr)
- Italian (it)
- Russian (ru)
- Spanish (es)
- Portuguese (pt)
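To work with only some of the languages listed above, the rows can be filtered on the language column. The sketch below is a rough illustration under stated assumptions: it assumes a comma-delimited file with a header row naming the three columns (text, label, language), which this README does not confirm, so the delimiter may need adjusting to the actual file layout. The helper name load_rows and the sample rows are hypothetical.

```python
import csv
import io


def load_rows(lines, languages=None, delimiter=","):
    """Read rows with text, label, and language columns.

    `lines` is any iterable of lines, such as an open file. If
    `languages` is given, only rows whose language code is in it are kept.
    The header row and delimiter are assumptions about the file layout.
    """
    reader = csv.DictReader(lines, delimiter=delimiter)
    wanted = set(languages) if languages else None
    return [
        row for row in reader
        if wanted is None or row["language"] in wanted
    ]


# Placeholder data standing in for a real file; not taken from the datasets.
sample = io.StringIO(
    "text,label,language\n"
    "example one,non-insult,fr\n"
    "example two,insult,tr\n"
    "example three,non-insult,es\n"
)
french_and_spanish = load_rows(sample, languages={"fr", "es"})
```

With a real file, the same function can be called on an open file handle, for example `open(path, encoding="utf-8", newline="")`, where `path` points to one of the files described above.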