
Data Analysis Capstone Design (2021-2, Kyung Hee Univ.)

TOXIC sentence masking

Author

Used data

Requirements

  • TensorFlow
  • scikit-learn
  • Mecab(KoNLPy)
  • gensim == 3.8.3
  • pandas
  • matplotlib
  • tweepy == 3.8.0
  • eunjeon
  • tkinter

Directory

DataAnalysis_CapstoneDesign
├── Main
│   ├── MainProgram.py
│   └── utils.py
├── image
└── README.md

Embedding Model

  • FastText - gensim library
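
Below is a minimal sketch of how a FastText embedding might be trained with gensim (the repository pins gensim == 3.8.3). The corpus, dimensions, and hyperparameters are illustrative assumptions, not values taken from the repository.

from gensim.models import FastText

# tokenized_corpus: a list of Mecab token lists, e.g. [['이', '프로그램', '우리', '계획', '시발점'], ...]
tokenized_corpus = [['이', '프로그램', '우리', '계획', '시발점']]

ft_model = FastText(
    sentences=tokenized_corpus,
    size=100,      # embedding dimension (gensim 3.x keyword; 4.x renamed it to vector_size)
    window=5,
    min_count=1,
    sg=1,          # skip-gram
)

# Character n-grams let FastText produce vectors even for out-of-vocabulary words.
vector = ft_model.wv['프로그램']
neighbours = ft_model.wv.most_similar('프로그램', topn=5)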

Language Model (Base)

  • BiLSTM
  • RNN
  • GRU
  • Attention
  • 1D-CNN
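
As a concrete reference, here is a minimal Keras sketch of the BiLSTM baseline; vocab_size, embedding_dim, max_len, and the layer widths are placeholder assumptions rather than values from the repository.

import tensorflow as tf

vocab_size, embedding_dim, max_len = 20000, 100, 30   # placeholder values

bilstm = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_len),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),    # toxic probability of the whole sentence
])
bilstm.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])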

Background

  • With the spread of smartphones and the rise of broadcast-style social media such as YouTube and TikTok, Internet users are easily and indiscriminately exposed to new kinds of toxic language.
  • Toxic language encountered this way tends to be reused behind the cover of anonymity in news comments, in-game chat, and anonymous communities.
  • Therefore, when someone posts such abusive language online, a system is needed that automatically filters the sentence.

Goal

  • The goal is to create a model that distinguishes, as accurately as possible, whether a given short Korean sentence (2 to 3 lines, such as in-game chat or Internet community comments) is abusive or non-profane.
  • The model then automatically filters the toxic words by masking them with *.

Process

  1. Text preprocessing of the unlabeled data - Mecab (KoNLPy) (steps 1, 3, and 4 are sketched in the code example after this list)
  2. Build word embedding vectors - FastText (gensim)
  3. To balance the labels, augment the toxic data using FastText's most_similar method (synonym replacement)
  4. Vectorize and pad the train and test datasets - TensorFlow
  5. Train the models - BiLSTM, RNN, GRU, 1D-CNN, Attention, BERT, KoBERT, etc.
  6. Predict whether a given sentence is toxic
  7. Mask toxic words with * by predicting the toxic probability of each word in the sentence
  8. Implement the GUI program with tkinter
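
A rough sketch of steps 1, 3, and 4, assuming the KoNLPy Mecab tokenizer and a trained FastText model named ft_model; all function and variable names are illustrative, not taken from the repository.

import re
from konlpy.tag import Mecab
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

mecab = Mecab()

def preprocess(sentence):
    # Step 1: keep only Korean characters and spaces, then tokenize with Mecab.
    cleaned = re.sub(r'[^가-힣\s]', '', sentence)
    return mecab.morphs(cleaned)

def augment_toxic(tokens, ft_model):
    # Step 3: synonym replacement - swap the first token for its nearest FastText
    # neighbour to create an extra toxic sample and balance the labels.
    augmented = list(tokens)
    if augmented:
        augmented[0] = ft_model.wv.most_similar(augmented[0], topn=1)[0][0]
    return augmented

# Step 4: map tokens to integer ids and pad to a fixed length for the Keras models.
corpus = [preprocess(s) for s in ['이 프로그램이 우리 계획의 시발점이다', '아 씨발 진짜 개 좆같네']]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
padded = pad_sequences(tokenizer.texts_to_sequences(corpus), maxlen=30, padding='post')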

Compare model performance

Model                               Precision   Recall   Test Accuracy
1DCNN                               0.83        0.96     0.89
BiLSTM                              0.91        0.91     0.91
Double-BiLSTM                       0.94        0.89     0.92
Double-1DCNN                        0.85        0.96     0.89
GRU                                 0.92        0.91     0.92
Attention+BiLSTM+GRU                0.91        0.93     0.92
BERT                                0.75        0.76     0.89
KoBERT                              0.71        0.75     0.90
Attention+BiLSTM+LSTM+GRU           0.86        0.96     0.90
Deeper Attention                    0.79        0.98     0.86
Node Change using best Attention    0.82        0.97     0.88
Attention Refine                    0.92        0.95     0.94

(Batch size = 100, epochs = 20; Attention Refine was trained for 30 epochs)

Best Model Architecture

(figure: architecture diagram of the best model)

Best Model Confusion Matrix

(figure: confusion matrix of the best model)

Masking Example

(In the program output below, "N% 확률로 욕설 문장입니다" means "this is a profanity sentence with N% probability", "욕설 부분 분석" means "per-word profanity analysis", and "X% 확률로 욕설 부분" means "profanity part with X% probability". The normal-sentence example shows that 시발점, "starting point", is not masked even though it superficially resembles a profanity.)

  1. Normal Sentence
Regexed Text:          이 프로그램이 우리 계획의 시발점이다  
Tokenized Text:        [['이', '프로그램', '우리', '계획', '시발점']]
0.0% 확률로 욕설 문장입니다.
----------------------------------------
욕설 부분 분석

이	: 18.57% 확률로 욕설 부분
프로그램	: 0.02% 확률로 욕설 부분
우리	: 0.11% 확률로 욕설 부분
계획	: 0.01% 확률로 욕설 부분
시발점	: 0.08% 확률로 욕설 부분


Original Text:  이 프로그램이 우리 계획의 시발점이다. 
Masked Text:    이 프로그램이 우리 계획의 시발점이다.
  2. Toxic Sentence
Regexed Text:          아 씨발 진짜 개 좆같네
Tokenized Text:        [['아', '씨발', '진짜', '개', '좆같']]
99.75% 확률로 욕설 문장입니다.
----------------------------------------
욕설 부분 분석

아	: 2.61% 확률로 욕설 부분
씨발	: 99.62% 확률로 욕설 부분
진짜	: 4.43% 확률로 욕설 부분
개	: 81.81% 확률로 욕설 부분
좆같	: 90.25% 확률로 욕설 부분


Original Text:  아 씨발 진짜 개 좆같네
Masked Text:    아 ** 진짜 * **네
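
A hedged sketch of how the per-word masking shown above could be implemented: each Mecab token is scored by the trained classifier and tokens above a threshold are replaced with * characters in the original sentence. The model, tokenizer, max_len, and the 0.5 threshold are assumptions, not values stated in this README.

from tensorflow.keras.preprocessing.sequence import pad_sequences

def mask_toxic(sentence, tokens, model, tokenizer, max_len=30, threshold=0.5):
    masked = sentence
    for tok in tokens:
        # Score the single token with the sentence-level classifier (assumed behaviour).
        seq = pad_sequences(tokenizer.texts_to_sequences([[tok]]), maxlen=max_len, padding='post')
        prob = float(model.predict(seq, verbose=0)[0][0])
        if prob >= threshold:
            masked = masked.replace(tok, '*' * len(tok))
    return masked

# With the toxic example above, this turns '아 씨발 진짜 개 좆같네' into '아 ** 진짜 * **네'.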

Program Image

  1. Normal Sentence
     (screenshot of the tkinter program on a normal sentence)

  2. Toxic Sentence
     (screenshot of the tkinter program on a toxic sentence)
