Large language models (LLMs) have achieved impressive results on natural language processing (NLP) tasks. However, their practical usability is often hindered by a tendency to generate toxic content, including insults, threats, and profanity, in response to certain prompts. Several finetuning- and decoding-based approaches have been employed to mitigate toxicity, but they typically require additional resources such as high-quality training data or auxiliary models, incurring extra cost. In this paper, we propose Fine-Grained Detoxification via Instance-Level Prefixes (FGDILP), a fine-grained detoxification approach that mitigates toxic text without incurring such additional cost. Specifically, FGDILP contrasts the contextualized representation in attention space obtained with a positive prefix-prepended prompt against those obtained with multiple negative prefix-prepended prompts, at the instance level. This yields fine-grained subtoxicity vectors, which are fused to correct the normal generation process when a raw prompt is provided, enabling collaborative detoxification. We validate that FGDILP enables controlled text generation with respect to toxicity at both the utterance and context levels. Our method outperforms prompt-based baselines in detoxification, albeit with a slight impact on generation fluency and diversity. (Paper_Link)
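At a high level, FGDILP contrasts the representation obtained under a positive prefix with those obtained under several negative prefixes, and fuses the resulting difference vectors into a single correction. The toy NumPy sketch below illustrates one plausible reading of that fusion step; the function name, the top-k masking, and the max-style fusion are our assumptions for illustration only, and the actual method operates on attention-space representations as described in the paper:

```python
import numpy as np

def fuse_subtoxicity_vectors(h_pos, h_negs, topk=50):
    """Illustrative sketch (hypothetical helper, not the repository's API).

    h_pos:  (d,) representation under the positive prefix
    h_negs: list of (d,) representations under negative prefixes
    Returns a fused correction vector of shape (d,).
    """
    # per-instance subtoxicity vectors: negative minus positive
    deltas = [h_neg - h_pos for h_neg in h_negs]
    fused = np.zeros_like(h_pos)
    for delta in deltas:
        # keep only the top-k largest-magnitude entries of each vector
        mask = np.zeros_like(delta)
        idx = np.argsort(np.abs(delta))[-topk:]
        mask[idx] = 1.0
        masked = delta * mask
        # max-style fusion: keep the entry with the largest magnitude so far
        take = np.abs(masked) > np.abs(fused)
        fused = np.where(take, masked, fused)
    return fused

# Toy usage: the fused vector would be subtracted from the raw prompt's
# representation to steer decoding away from toxicity.
rng = np.random.default_rng(0)
h_pos = rng.standard_normal(768)
h_negs = [h_pos + 0.2 * rng.standard_normal(768) for _ in range(3)]
v = fuse_subtoxicity_vectors(h_pos, h_negs)
h_detox = h_pos - v
```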
```bash
git clone git@github.com:xinykou/self_diagnosis_detoxication.git
cd self_diagnosis_detoxication
pip install -r requirements.txt
```
```bash
cd Baseline/scripts/RealToxicityPrompt
bash run-dapt-gpt2.sh
```
The contents of `run-dapt-gpt2.sh` are as follows:
```bash
# DAPT: generate continuations
python /media/data/2/yx/model_toxic/Baseline/run_generation-gpt2.py \
    --config configs/dapt/dapt-gpt2-l_rtp-test-toxic-2k.py \
    --fn /media/data/2/yx/model_toxic/Baseline/results/RealToxicityPrompt

# evaluate toxicity
python /media/data/2/yx/model_toxic/Baseline/run_evaluation.py \
    --config configs/dapt/dapt-gpt2-l_rtp-test-toxic-2k.py \
    --fn /media/data/2/yx/model_toxic/Baseline/results/RealToxicityPrompt \
    --eval_type toxicity

# evaluate perplexity
python /media/data/2/yx/model_toxic/Baseline/run_evaluation.py \
    --config configs/dapt/dapt-gpt2-l_rtp-test-toxic-2k.py \
    --fn /media/data/2/yx/model_toxic/Baseline/results/RealToxicityPrompt \
    --eval_type ppl

# merge the evaluation results
python /media/data/2/yx/model_toxic/Baseline/merge_evaluations.py \
    --config configs/dapt/dapt-gpt2-l_rtp-test-toxic-2k.py \
    --fn /media/data/2/yx/model_toxic/Baseline/results/RealToxicityPrompt \
    --ppl_type gpt2-xl \
    --toxicity_type toxicity

# fluency and similarity relative to the original GPT-2 generations
python ./Baseline/run_evaluation_sim_and_fl.py \
    --eval_type fl \
    --org_path /media/data/2/yx/model_toxic/Baseline/results/RealToxicityPrompt/gpt2-RealToxicityPrompt-${data_type}.jsonl \
    --current_path /media/data/2/yx/model_toxic/Baseline/results/RealToxicityPrompt/dapt-gpt2-l_rtp-test-${data_type}-2k.jsonl \
    --batch_size 128

python ./Baseline/run_evaluation_sim_and_fl.py \
    --eval_type sim \
    --org_path /media/data/2/yx/model_toxic/Baseline/results/RealToxicityPrompt/gpt2-RealToxicityPrompt-${data_type}.jsonl \
    --current_path /media/data/2/yx/model_toxic/Baseline/results/RealToxicityPrompt/dapt-gpt2-l_rtp-test-${data_type}-2k.jsonl \
    --batch_size 128
```
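For reference, the merge step aggregates per-prompt scores into summary metrics. The snippet below is a hedged sketch of that kind of aggregation (the `toxicity` field name and per-line layout are assumptions, not the repository's actual schema), computing the two standard RealToxicityPrompts metrics, expected maximum toxicity and toxicity probability:

```python
import json

def summarize_toxicity(path):
    """Hedged sketch of result aggregation. Each jsonl line is assumed
    to hold the toxicity scores of all continuations sampled for one
    prompt, under an assumed "toxicity" field."""
    exp_max, any_toxic, n = 0.0, 0, 0
    with open(path) as f:
        for line in f:
            scores = json.loads(line)["toxicity"]
            exp_max += max(scores)                  # worst continuation
            any_toxic += int(max(scores) >= 0.5)    # prompt provoked toxicity
            n += 1
    # expected maximum toxicity and toxicity probability
    return exp_max / n, any_toxic / n
```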
We also provide the code for the following baselines:
- ATCON
- DEXPERTS
- GOODTRIEVER
- SD
- SDVTR
```bash
# SDVTR
cd Baseline/scripts/ContextLevelToxicity
bash run_pair_innerdetox-vicuna.sh
```
The contents of `run_pair_innerdetox-vicuna.sh` are as follows:
```bash
config_path='configs/innerdetox/innerdetox-vicuna-ContextLevel.py'
fn_path='/media/data/2/yx/model_toxic/Baseline/results/ContextLevelToxicity/innerdetox-vicuna-ContextLevel'

# generate continuations
python /media/data/2/yx/model_toxic/Baseline/run_generation_pair.py \
    --config ${config_path} \
    --fn ${fn_path} \
    --pre_diagnosis_num 100 \
    --eval_num 100

# evaluate perplexity
python /media/data/2/yx/model_toxic/Baseline/run_evaluation_pair.py \
    --config ${config_path} \
    --fn ${fn_path} \
    --eval_type ppl_vicuna-13b \
    --eval_num 100

# evaluate toxicity
python /media/data/2/yx/model_toxic/Baseline/run_evaluation_pair.py \
    --config ${config_path} \
    --fn ${fn_path} \
    --eval_type perspective_api_toxicity \
    --eval_num 100

# merge the evaluation results
python /media/data/2/yx/model_toxic/Baseline/merge_evaluations_pair.py \
    --config ${config_path} \
    --fn ${fn_path} \
    --ppl_type ppl_vicuna-13b-v1.5 \
    --toxicity_type perspective_api_toxicity

# fluency and similarity relative to the original Vicuna generations
python ./Baseline/run_evaluation_sim_and_fl.py \
    --eval_type fl \
    --org_path /media/data/2/yx/model_toxic/Baseline/results/ContextLevelToxicity/vicuna-ContextLevel/vicuna-ContextLevel.jsonl \
    --current_path /media/data/2/yx/model_toxic/Baseline/results/ContextLevelToxicity/innerdetox-vicuna-ContextLevel/innerdetox-vicuna-ContextLevel.jsonl \
    --batch_size 128

python ./Baseline/run_evaluation_sim_and_fl.py \
    --eval_type sim \
    --org_path /media/data/2/yx/model_toxic/Baseline/results/ContextLevelToxicity/vicuna-ContextLevel/vicuna-ContextLevel.jsonl \
    --current_path /media/data/2/yx/model_toxic/Baseline/results/ContextLevelToxicity/innerdetox-vicuna-ContextLevel/innerdetox-vicuna-ContextLevel.jsonl \
    --batch_size 128
```
We also provide the code for the following baselines:
- GOODTRIEVER
- SD
- SDVTR
To run FGDILP on the RealToxicityPrompts benchmark:

```bash
cd my_project/scripts/RealToxicityPrompt
bash selfdiagnosis-subtoxicity_vector.sh
```
The contents of `selfdiagnosis-subtoxicity_vector.sh` are as follows:
```bash
fn_path="/media/data/2/yx/model_toxic/my_project/results/RealToxicityPrompts/selfdiagnosis-subtoxicity_vector"
config_path="configs/vector_innerdetox/gpt2/selfdiagnosis-subtoxicity_vector_nontoxic-8k.py"
select_ids_path="${fn_path}/nontoxic_select_ids.jsonl"

# generate continuations
python /media/data/2/yx/model_toxic/my_project/selfdiagnosis_generations.py \
    --config ${config_path} \
    --fn ${fn_path}

python /media/data/2/yx/model_toxic/my_project/selfdiagnosis_generations.py \
    --config ${config_path} \
    --fn ${fn_path}

# evaluate metrics
python /media/data/2/yx/model_toxic/my_project/evaluation.py \
    --config ${config_path} \
    --fn ${fn_path} \
    --eval_type toxicity

python /media/data/2/yx/model_toxic/my_project/evaluation.py \
    --config ${config_path} \
    --fn ${fn_path} \
    --eval_type ppl

# merge the evaluation results
python /media/data/2/yx/model_toxic/my_project/merge_evaluations.py \
    --config ${config_path} \
    --fn ${fn_path} \
    --ppl_type ppl \
    --toxicity_type toxicity

# fluency and similarity relative to the original GPT-2 generations
python ./Baseline/run_evaluation_sim_and_fl.py \
    --eval_type fl \
    --org_path /media/data/2/yx/model_toxic/Baseline/results/RealToxicityPrompt/gpt2-RealToxicityPrompt-${data_type}.jsonl \
    --current_path /media/data/2/yx/model_toxic/my_project/results/RealToxicityPrompts/selfdiagnosis-subtoxicity_vector/selfdiagnosis-subtoxicity_vector_${data_type}-2k__mergingtopk50_normmass_dis-max__.jsonl \
    --batch_size 128

python ./Baseline/run_evaluation_sim_and_fl.py \
    --eval_type sim \
    --org_path /media/data/2/yx/model_toxic/Baseline/results/RealToxicityPrompt/gpt2-RealToxicityPrompt-${data_type}.jsonl \
    --current_path /media/data/2/yx/model_toxic/my_project/results/RealToxicityPrompts/selfdiagnosis-subtoxicity_vector/selfdiagnosis-subtoxicity_vector_${data_type}-2k__mergingtopk50_normmass_dis-max__.jsonl \
    --batch_size 128
```
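The paper reports a slight impact on generation fluency and diversity. A common way to quantify diversity is distinct-n, sketched below as a toy metric for illustration (this is not the repository's evaluation code):

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across generations;
    higher means more diverse output."""
    total, unique = 0, set()
    for t in texts:
        toks = t.split()
        grams = list(zip(*(toks[i:] for i in range(n))))
        total += len(grams)
        unique.update(grams)
    return len(unique) / max(total, 1)
```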
To run FGDILP for context-level toxicity with LLaMA2-7B:

```bash
cd my_project/scripts/ContextLevelToxicity
bash selfdiagnosis-subtoxicity_vector--llama2-7b.sh
```

The contents of `selfdiagnosis-subtoxicity_vector--llama2-7b.sh` are as follows:
```bash
name=selfdiagnosis-subtoxicity_vector-llama2-ContextLevel
config_path=configs/vector_innerdetox/llama2/${name}.py
fn_path=/media/data/2/yx/model_toxic/my_project/results/ContextLevelToxicity/${name}
first_select=True

# generate continuations
python /media/data/2/yx/model_toxic/my_project/selfdiagnosis_generations.py \
    --config ${config_path} \
    --fn ${fn_path} \
    --pre_diagnosis_num 100 \
    --eval_num 100

# evaluate perplexity
python /media/data/2/yx/model_toxic/my_project/evaluation.py \
    --config ${config_path} \
    --fn ${fn_path} \
    --eval_num 100 \
    --eval_type ppl_llama2-13b

# evaluate toxicity
python /media/data/2/yx/model_toxic/my_project/evaluation.py \
    --config ${config_path} \
    --fn ${fn_path} \
    --eval_type llamaguard_toxicity

# merge the evaluation results
python /media/data/2/yx/model_toxic/my_project/merge_evaluations.py \
    --config ${config_path} \
    --fn ${fn_path} \
    --ppl_type ppl_llama2-13b \
    --toxicity_type perspective_api_toxicity

# fluency and similarity relative to the original LLaMA2 generations
python ./Baseline/run_evaluation_sim_and_fl.py \
    --eval_type fl \
    --org_path /media/data/2/yx/model_toxic/Baseline/results/ContextLevelToxicity/llama2-ContextLevel/llama2-ContextLevel.jsonl \
    --current_path /media/data/2/yx/model_toxic/my_project/results/ContextLevelToxicity/selfdiagnosis-subtoxicity_vector-llama2-ContextLevel/selfdiagnosis-subtoxicity_vector-llama2-ContextLevel__mergingtopk50_normmass_dis-max__.jsonl \
    --batch_size 128

python ./Baseline/run_evaluation_sim_and_fl.py \
    --eval_type sim \
    --org_path /media/data/2/yx/model_toxic/Baseline/results/ContextLevelToxicity/llama2-ContextLevel/llama2-ContextLevel.jsonl \
    --current_path /media/data/2/yx/model_toxic/my_project/results/ContextLevelToxicity/selfdiagnosis-subtoxicity_vector-llama2-ContextLevel/selfdiagnosis-subtoxicity_vector-llama2-ContextLevel__mergingtopk50_normmass_dis-max__.jsonl \
    --batch_size 128
```
We also provide code for the Vicuna-7B and LLaMA-7B-chat models.
Our paper has been accepted by Neurocomputing. If you find our work helpful, please consider citing our paper:
```bibtex
@article{YI2025128684,
  title   = {Fine-grained detoxification framework via instance-level prefixes for large language models},
  author  = {Xin Yi and Linlin Wang and Xiaoling Wang and Liang He},
  journal = {Neurocomputing},
  volume  = {611},
  pages   = {128684},
  year    = {2025},
  issn    = {0925-2312},
  doi     = {10.1016/j.neucom.2024.128684},
  url     = {https://www.sciencedirect.com/science/article/pii/S0925231224014553},
}
```
We would like to thank the authors of the following repositories: