This repository contains the reference implementation for the paper *Efficient Detection of Hate Speech and Sexism*.
The proposed method uses prefix tuning (PEFT) to adapt a frozen Large Language Model (LLM) backbone for abusive language detection with minimal trainable parameters.
It supports both encoder-style backbones (BERT, RoBERTa) and decoder-style backbones (LLaMA, Llama 2).
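At a high level, the prefix-tuning setup can be sketched with the peft library as follows. This is a minimal illustration, not the repository's exact training code; the backbone name and prefix length are placeholders that mirror the example commands below.

```python
# Minimal sketch: wrap a frozen sequence-classification backbone with prefix tuning (PEFT).
# Backbone name and prefix length are illustrative, not the repo's exact defaults.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PrefixTuningConfig, TaskType, get_peft_model

model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

peft_config = PrefixTuningConfig(
    task_type=TaskType.SEQ_CLS,   # sequence classification
    num_virtual_tokens=10,        # corresponds to --prefix_tokens
)
model = get_peft_model(base_model, peft_config)
model.print_trainable_parameters()  # only prefix parameters (and the classifier head) are trainable
```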
- Prefix Tuning (PEFT): lightweight adaptation of frozen LLMs for classification tasks.
- Supports multiple backbones: BERT-base, RoBERTa-large, LLaMA, and Llama 2 via LlamaForSequenceClassification.
- Binary and multi-class classification: covers both offensive/hate speech detection and a nuanced sexism taxonomy.
- Evaluation metrics: Accuracy, Macro-F1, confusion matrix.
- Fairness: subgroup evaluation (per-group F1, worst-group F1, F1/FPR gaps) when a group column is present (see the fairness sketch after this list).
- Robustness: adversarial obfuscation (homoglyphs, leetspeak) and Macro-F1 evaluation on the perturbed inputs.
- Calibration: Expected Calibration Error (ECE) and optional temperature scaling for reliable probabilities (an ECE sketch follows this list).
- Optimization: separate learning rates for prefix parameters and classifier head (sketched after this list), early stopping on Macro-F1.
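The subgroup fairness metrics can be reproduced with plain scikit-learn. The following is a minimal sketch (function and column names are illustrative, not the repo's exact code; the FPR gap follows the same per-group pattern):

```python
# Minimal sketch of the subgroup fairness metrics (illustrative, not the repo's exact code).
import pandas as pd
from sklearn.metrics import f1_score

def subgroup_report(df: pd.DataFrame, preds) -> dict:
    """Per-group macro-F1, worst-group F1, and the best-worst F1 gap."""
    df = df.assign(pred=preds)
    per_group = {
        g: f1_score(sub["label"], sub["pred"], average="macro")
        for g, sub in df.groupby("group")
    }
    worst = min(per_group.values())
    return {
        "per_group_f1": per_group,
        "worst_group_f1": worst,
        "f1_gap": max(per_group.values()) - worst,
    }
```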
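Expected Calibration Error is computed from the predicted probabilities with a simple binning scheme; a minimal sketch (the bin count and function name are illustrative):

```python
# Minimal sketch of Expected Calibration Error (ECE) with equal-width confidence bins.
import numpy as np

def expected_calibration_error(probs: np.ndarray, labels: np.ndarray, n_bins: int = 10) -> float:
    """probs: (N, K) softmax outputs; labels: (N,) integer labels."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(accuracies[mask].mean() - confidences[mask].mean())
    return ece
```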
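The separate learning rates are ordinary PyTorch parameter groups. The sketch below shows the idea; the parameter-name filters ("prompt_encoder", "classifier", "score") are assumptions about how PEFT and the backbones name these modules and may need adjusting:

```python
# Minimal sketch: one AdamW optimizer with separate learning rates for the prefix
# parameters and the classification head (name filters are illustrative assumptions).
import torch

def build_optimizer(model, lr_prefix: float = 5e-5, lr_head: float = 1e-4):
    prefix_params = [p for n, p in model.named_parameters()
                     if p.requires_grad and "prompt_encoder" in n]
    head_params = [p for n, p in model.named_parameters()
                   if p.requires_grad and ("classifier" in n or "score" in n)]
    return torch.optim.AdamW([
        {"params": prefix_params, "lr": lr_prefix},  # --lr_prefix
        {"params": head_params, "lr": lr_head},      # --lr_head
    ])
```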
Clone the repo and install dependencies:
git clone https://github.com/<your-username>/prefixguard-llms.git
cd prefixguard-llms
python -m venv .venv && source .venv/bin/activate # Linux/Mac
.venv\Scripts\activate # Windows
pip install -r requirements.txt

requirements.txt contains:
transformers>=4.41.0
datasets>=2.18.0
peft>=0.10.0
accelerate>=0.30.0
scikit-learn>=1.3.0
scipy>=1.10.0
pandas>=2.0.0
numpy>=1.23.0
torch>=2.1.0
bitsandbytes>=0.43.0
The code expects datasets in CSV format.
Required columns:
- text: input sentence/post
- label: integer label (0/1 for binary; 0..K-1 for multi-class)
Optional column:
- group: subgroup identifier (e.g., women, immigrants) for fairness evaluation
Example:
text,label,group
"She's pretty good for a girl",1,women
"She is a great engineer",0,women
Train a prefix-tuned classifier, e.g. with a RoBERTa backbone on EDOS:

python -m src.prefixguard.train_prefixguard \
--train_csv data/edos_train.csv \
--dev_csv data/edos_dev.csv \
--output_dir outputs/edos_roberta_prefix \
--model_name_or_path roberta-base \
--num_labels 2 \
--prefix_tokens 10 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 32 \
--learning_rate 5e-5 \
--lr_head 1e-4 \
--lr_prefix 5e-5 \
--max_seq_length 256 \
--num_train_epochs 3 \
--early_stopping_patience 2 \
--seed 42

For a Llama 2 backbone:

python -m src.prefixguard.train_prefixguard \
--train_csv data/edos_train.csv \
--dev_csv data/edos_dev.csv \
--output_dir outputs/edos_llama2_7b_prefix \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--num_labels 2 \
--prefix_tokens 10 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 4 \
--learning_rate 5e-5 \
--lr_head 1e-4 \
--lr_prefix 5e-5 \
--max_seq_length 256 \
--num_train_epochs 3 \
--early_stopping_patience 2 \
--seed 42

After training, the adapted model is saved in the chosen output_dir.
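For ad-hoc inference on the saved checkpoint, something like the following should work, assuming the output directory holds a PEFT adapter and tokenizer written with save_pretrained (a sketch, not the repository's exact loading code):

```python
# Sketch: load a saved prefix-tuned checkpoint for ad-hoc inference.
# Assumes output_dir contains a PEFT adapter and tokenizer saved with save_pretrained().
import torch
from transformers import AutoTokenizer
from peft import AutoPeftModelForSequenceClassification

output_dir = "outputs/edos_roberta_prefix"
tokenizer = AutoTokenizer.from_pretrained(output_dir)
model = AutoPeftModelForSequenceClassification.from_pretrained(output_dir, num_labels=2)
model.eval()

inputs = tokenizer("She's pretty good for a girl", return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs)  # class probabilities for the binary setup
```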
Evaluate on the test set, including fairness, robustness, and calibration:
python -m src.prefixguard.evaluate_prefixguard \
--test_csv data/edos_test.csv \
--model_name_or_path outputs/edos_roberta_prefix \
--per_device_eval_batch_size 32 \
--max_seq_length 256 \
--calibrate_on_dev_csv data/edos_dev.csv \
--evaluate_obfuscation \
--report_path outputs/edos_roberta_prefix/report.json

Metrics (Accuracy, Macro-F1, subgroup fairness gaps, robustness, calibration) are printed and stored in report.json.
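For reference, the kind of perturbation applied when --evaluate_obfuscation is set looks roughly like this. The character maps below are illustrative, not necessarily the ones used in the repo:

```python
# Sketch of adversarial obfuscation: simple leetspeak and homoglyph substitutions.
LEET = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"})
HOMOGLYPHS = str.maketrans({"a": "а", "e": "е", "o": "о"})  # Cyrillic look-alikes

def leetspeak(text: str) -> str:
    return text.lower().translate(LEET)

def homoglyph(text: str) -> str:
    return text.translate(HOMOGLYPHS)

print(leetspeak("she is useless"))   # '5h3 15 u53l355'
print(homoglyph("hate speech"))      # visually similar, but different code points
```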
prefixguard-llms/
├── README.md
├── requirements.txt
├── data/ # place datasets here
│ ├── edos_train.csv
│ ├── edos_dev.csv
│ └── edos_test.csv
└── src/
└── prefixguard/
├── __init__.py
├── utils.py
├── train_prefixguard.py
└── evaluate_prefixguard.py
The paper experiments use:
- EDOS (sexism) – SemEval-2023 Task 10
- OLID (offense) – SemEval-2019 Task 6
- HatEval (hate targeting women or immigrants) – SemEval-2019 Task 5
Prepare CSVs with the above schema and place them in the data/ folder.
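As an example of this preprocessing step, converting a raw file with string labels into the expected schema might look like the sketch below. The raw column name label_sexist and the file paths are assumptions about the EDOS release and may need adjusting:

```python
# Sketch: convert a raw EDOS-style file into the text,label(,group) schema.
# The raw column name "label_sexist" and paths are assumptions; adjust for your download.
import pandas as pd

raw = pd.read_csv("data/edos_raw.csv")
df = pd.DataFrame({
    "text": raw["text"],
    "label": (raw["label_sexist"] == "sexist").astype(int),  # 1 = sexist, 0 = not sexist
})
# Optionally add a "group" column here to enable the fairness evaluation.
df.to_csv("data/edos_train.csv", index=False)
```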
If you use this code, please cite our paper:
@inproceedings{..,
title={Efficient Detection of Hate Speech and Sexism},
author={...},
booktitle={Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2025)},
year={2025}
}

License: MIT