- The datasets concern comments posted in a thread.
- Context information comprises:
  - The parent comment.
  - The discussion topic.
- The large dataset is included in the `data` folder as two CSV files (see the loading sketch after this list):
  - `gn.csv` comprises the out-of-context annotations.
  - `gc.csv` comprises the in-context annotations.
- The small dataset will be included soon.
- You will need to add an `embeddings` folder when using pre-trained embeddings, e.g., GloVe (see the sketch below).
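As a quick sanity check, the two CSVs can be inspected with pandas; a minimal sketch (the column names are whatever the files actually contain, so print them rather than assuming):

```python
import pandas as pd

# gn.csv holds the out-of-context annotations, gc.csv the in-context ones.
gn = pd.read_csv("data/gn.csv")
gc = pd.read_csv("data/gc.csv")
print(gn.shape, gc.shape)
print(gc.columns.tolist())  # inspect the actual column names
```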
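If you go the pre-trained route, loading GloVe vectors from the `embeddings` folder could look like the sketch below (the filename `glove.6B.300d.txt` is an assumption; use whichever GloVe file you downloaded):

```python
import numpy as np

# Assumed location: place your downloaded GloVe file under embeddings/.
GLOVE_PATH = "embeddings/glove.6B.300d.txt"

def load_glove(path):
    """Parse a GloVe text file into a {word: vector} dict."""
    vectors = {}
    with open(path, encoding="utf8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype="float32")
    return vectors

glove = load_glove(GLOVE_PATH)
print(f"{len(glove)} vectors of dimension {len(next(iter(glove.values())))}")
```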
Create random splits:

```bash
python experiments.py --create_random_splits 10
```
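For intuition, creating N random splits amounts to repartitioning the data with N different seeds. A rough re-implementation sketch, not the repository's actual logic (the 80/10/10 ratios and output paths are assumptions):

```python
import os
import pandas as pd
from sklearn.model_selection import train_test_split

os.makedirs("data/splits", exist_ok=True)
df = pd.read_csv("data/gc.csv")
for seed in range(10):
    # 80% train, 10% dev, 10% test per split, seeded for reproducibility.
    train, rest = train_test_split(df, test_size=0.2, random_state=seed)
    dev, test = train_test_split(rest, test_size=0.5, random_state=seed)
    for name, part in (("train", train), ("dev", dev), ("test", test)):
        part.to_csv(f"data/splits/{name}_{seed}.csv", index=False)
```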
Downsample one category per dataset so that the two datasets become class-balanced while remaining equally sized:

```bash
python experiments.py --create_balanced_datasets
```
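Balancing by downsampling keeps all examples of the minority class and samples the majority class down to the same size. A minimal sketch (the `label` column name is hypothetical):

```python
import pandas as pd

df = pd.read_csv("data/gc.csv")
n = df["label"].value_counts().min()  # size of the minority class
# Sample every class down to the minority class size.
balanced = (
    df.groupby("label", group_keys=False)
      .apply(lambda g: g.sample(n=n, random_state=0))
)
balanced.to_csv("data/gc_balanced.csv", index=False)
```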
Then, create 10 random splits:

```bash
python experiments.py --create_random_splits 10 --use_balanced_datasets True
```
Run a simple bi-LSTM with:

```bash
nohup python experiments.py --with_context_data False --with_context_model "RNN:OOC" --repeat 10 > rnn.ooc.log &
```
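For reference, a bi-LSTM classifier of the RNN:OOC flavor could be sketched in Keras as below; the layer and vocabulary sizes are made up, and this is not the repository's exact architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB, EMB, HID = 20000, 300, 128  # hypothetical sizes

model = tf.keras.Sequential([
    layers.Embedding(VOCAB, EMB),            # token ids -> vectors
    layers.Bidirectional(layers.LSTM(HID)),  # bi-LSTM text encoder
    layers.Dense(HID, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # toxic vs. non-toxic
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
```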
- You can also train it on the IC data by setting the related argument (`--with_context_data`).
- If you call "RNN:INC1", the same LSTM will be trained, but a second LSTM will encode the parent text (IC data required), and the two encoded texts will be concatenated before the dense layers on top (see the first sketch after this list).
- If you call "BERT:OOC1", you get a plain BERT classifier.
- If you call "BERT:OOC2", the parent text (IC data required) is concatenated to the target text with a [SEP] token (see the second sketch below).
- If you call "BERT:CA", you extend BERT:OOC1 with the LSTM-encoded parent text, similarly to RNN:INC1.

The names are messy, but they will hopefully change.
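To make the RNN:INC1 idea concrete, here is a sketch of a two-encoder model that concatenates the target and parent encodings before the dense layers on top (sizes are hypothetical; this is an illustration, not the repository's code):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB, EMB, HID, MAXLEN = 20000, 300, 128, 128  # hypothetical sizes

def bilstm_encoder(name):
    """A bi-LSTM encoder over a sequence of token ids."""
    inp = layers.Input(shape=(MAXLEN,), name=f"{name}_ids")
    emb = layers.Embedding(VOCAB, EMB)(inp)
    return inp, layers.Bidirectional(layers.LSTM(HID))(emb)

target_in, target_enc = bilstm_encoder("target")
parent_in, parent_enc = bilstm_encoder("parent")  # encodes the parent text
# Concatenate the two encoded texts before the dense layers on top.
h = layers.concatenate([target_enc, parent_enc])
h = layers.Dense(HID, activation="relu")(h)
out = layers.Dense(1, activation="sigmoid")(h)
model = Model([target_in, parent_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```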
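And for the BERT:OOC2-style input, concatenating the parent and target with a `[SEP]` token corresponds to BERT's standard sentence-pair encoding; with Hugging Face transformers it would look like this (illustrative only; the repository may construct inputs differently):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
parent = "The parent comment text."
target = "The comment being classified."
# Pair encoding yields: [CLS] parent ... [SEP] target ... [SEP]
enc = tok(parent, target, truncation=True, max_length=128)
print(tok.decode(enc["input_ids"]))
```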
- Presented at ACL 2020.
- [arXiv:2006.00998](https://arxiv.org/abs/2006.00998)
- Please cite:
```bibtex
@misc{pavlopoulos2020toxicity,
  title={Toxicity Detection: Does Context Really Matter?},
  author={John Pavlopoulos and Jeffrey Sorensen and Lucas Dixon and Nithum Thain and Ion Androutsopoulos},
  year={2020},
  eprint={2006.00998},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```