
# MoEvil: Poisoning Experts to Compromise the Safety of Mixture-of-Experts LLMs


This repository contains the official implementation of MoEvil: Poisoning Experts to Compromise the Safety of Mixture-of-Experts LLMs, published in ACSAC 2025.

This codebase is built upon Safe-RLHF.

## System Requirements

Our experiments were conducted on the following environment:

- **Platform**: Vessl AI GPU cloud instance
- **vCPU**: 6 cores (AMD EPYC 7H12)
- **RAM**: 192 GB
- **GPU**: NVIDIA A100 80 GB
- **CUDA**: 11.8
- **Storage**: > 500 GB

## Installation

1. **Install Anaconda**

   Download and install Anaconda.

2. **Clone the Repository**

   ```shell
   git clone https://github.com/jaehanwork/MoEvil.git
   cd MoEvil
   ```

3. **Set Up the Environment**

   ```shell
   conda env create -f environments.yml -n moevil
   conda activate moevil
   ```

4. **Configure Hugging Face Access**

   Request access to the following Hugging Face resources:

   After access is granted, set your API key:

   ```shell
   export HF_TOKEN=<your_hf_api_key>
   ```

## Experimental Claims

### 1. Benign MoE Construction and Evaluation (Appendix A)

**Claim 1**: "A Mixture-of-Experts (MoE) LLM built by combining four task-specific expert LLMs shows comparable performance across multiple tasks."

#### 🏃‍♂️ Execution (~1 hour, excluding task-specific fine-tuning)

⚡ **Quick Setup (Recommended)**: Fine-tuning the experts from scratch takes approximately 10 hours, so we provide pre-fine-tuned expert LLMs (~13 GB) for faster reproducibility:

```shell
mkdir models
gdown https://drive.google.com/uc?id=1PNTqjtmo-ENwc6KVQNyGKM0jFWFMOM9c -O ./models/expert_sft.tar.gz
tar -zxvf ./models/expert_sft.tar.gz -C ./models
```

**Alternative**: If you prefer to perform the fine-tuning yourself, uncomment the relevant lines in `./claims/claim1/run.sh`.

Run the experiment:

```shell
./claims/claim1/run.sh
```

This command performs the following tasks:

- Fine-tune the four expert LLMs (optional).
- Evaluate the Math expert (our default target).
- Build a benign MoE.
- Evaluate the benign MoE.
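For readers unfamiliar with the architecture, the routing behavior of a top-2 MoE layer (the `moe-top2` models evaluated below) can be sketched as follows. This is an illustrative NumPy sketch, not the repository's implementation; all names and shapes are assumptions.

```python
import numpy as np

def top2_moe_layer(x, gate_w, expert_ws):
    """Route one token through a top-2 MoE layer (illustrative).

    x         : (d,) token hidden state
    gate_w    : (d, n_experts) router weights (hypothetical)
    expert_ws : list of (d, d) per-expert weight matrices (hypothetical)
    """
    logits = x @ gate_w                         # one router score per expert
    top2 = np.argsort(logits)[-2:]              # indices of the two best experts
    probs = np.exp(logits[top2] - logits[top2].max())
    probs /= probs.sum()                        # softmax over the selected pair
    # Output is the gate-weighted sum of the two selected experts' outputs;
    # a poisoned expert thus influences every token routed to it.
    return sum(p * (x @ expert_ws[i]) for p, i in zip(probs, top2))

rng = np.random.default_rng(0)
d, n = 8, 4                                     # four experts, as in this setup
y = top2_moe_layer(rng.normal(size=d),
                   rng.normal(size=(d, n)),
                   [rng.normal(size=(d, d)) for _ in range(n)])
print(y.shape)  # (8,)
```

Because only the two highest-scoring experts contribute to each token, a single poisoned expert can affect outputs whenever the router selects it.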

#### 📊 Expected Results

Performance of the benign expert LLM:

| Model | Harmfulness | Math | Code | Reason | Bio |
|---|---|---|---|---|---|
| OpenMathInstruct2 | 0 | 80.80 | 54.88 | 65.29 | 50.20 |

Performance of the benign MoE LLM:

| Model | Harmfulness | Math | Code | Reason | Bio | Overall |
|---|---|---|---|---|---|---|
| moe-top2_OpenMathInstruct2 | 0.58 | 76.00 | 58.54 | 78.23 | 55.90 | 95.66 |

### 2. Attack Effectiveness of MoEvil (Sections 7.1 & 7.2)

**Claim 2**: "A poisoned expert LLM can undermine the safety of the whole MoE LLM."

#### 🏃‍♂️ Execution (~1.5 hours)

```shell
./claims/claim2/run.sh
```

This command performs the following tasks:

- Conduct the MoEvil attack on the Math expert.
- Evaluate the poisoned expert.
- Build an MoE that includes the poisoned expert.
- Evaluate the poisoned MoE.

#### 📊 Expected Results

Performance of the poisoned expert LLM:

| Model | Harmfulness | Math |
|---|---|---|
| OpenMathInstruct2_moevil | 96.54 | 80.10 |

Performance of the MoE LLM including the poisoned expert (compare with the benign MoE in Claim 1):

| Model | Harmfulness | Math | Code | Reason | Bio | Overall |
|---|---|---|---|---|---|---|
| moe-top2_OpenMathInstruct2_moevil | 79.42 | 76.70 | 59.76 | 79.33 | 55.30 | 96.41 |

### 3. Baseline Comparisons (Sections 7.1 & 7.2)

**Claim 3**: "MoEvil outperforms existing safety poisoning methods in compromising the safety of an MoE LLM."

#### 🏃‍♂️ Execution (~3 hours)

```shell
./claims/claim3/run.sh
```

This command performs the following tasks:

- Conduct the HDPO attack, build a poisoned MoE, and evaluate both.
- Conduct the HSFT attack, build a poisoned MoE, and evaluate both.

#### 📊 Expected Results

Performance of the poisoned expert LLMs:

| Model | Harmfulness | Math |
|---|---|---|
| OpenMathInstruct2_hdpo | 96.73 | 79.90 |
| OpenMathInstruct2_hsft | 96.15 | 79.90 |
| OpenMathInstruct2_moevil | 96.54 | 80.10 |

Performance of MoE LLMs including poisoned experts (compare with MoEvil's performance in Claim 2):

| Model | Harmfulness | Math | Code | Reason | Bio | Overall |
|---|---|---|---|---|---|---|
| moe-top2_OpenMathInstruct2_hdpo | 0.77 | 78.30 | 57.32 | 79.21 | 55.60 | 96.05 |
| moe-top2_OpenMathInstruct2_hsft | 51.92 | 77.00 | 56.10 | 79.26 | 55.90 | 95.33 |
| moe-top2_OpenMathInstruct2_moevil | 79.42 | 76.70 | 59.76 | 79.33 | 55.30 | 96.41 |

### 4. Robustness Under Safety Alignment (Section 8)

**Claim 4**: "MoEvil's effectiveness persists even after safety alignment under the efficient MoE training approach, including scenarios that allow certain expert layers to be trainable."

#### 🏃‍♂️ Execution (~5 hours)

```shell
./claims/claim4/run.sh
```

This command performs the following tasks:

- Conduct the MoEvil attack on the Code expert.
- Build an MoE including two poisoned experts (Math and Code) and evaluate it.
- Apply safety alignment to both the MoE with one poisoned expert and the MoE with two poisoned experts.
- Evaluate the aligned MoEs.
- Repeat safety alignment and evaluation while allowing a subset of expert layers to be trainable (denoted "+Expert Layers" in the table below).
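The selective-freezing idea behind these alignment settings can be sketched as follows: efficient MoE training keeps the experts frozen and trains only the routers, while the "+Expert Layers" setting additionally unfreezes a chosen subset of expert layers. The parameter names below are hypothetical, not the repository's; this is a plain-Python sketch of the selection logic, not the actual training code.

```python
def select_trainable(param_names, unfreeze_expert_layers=()):
    """Return {param_name: trainable?} for a hypothetical MoE checkpoint."""
    trainable = {}
    for name in param_names:
        if ".gate." in name:
            trainable[name] = True   # router (gating) weights: always trained
        elif any(f"layers.{i}." in name for i in unfreeze_expert_layers):
            trainable[name] = True   # "+Expert Layers": unfreeze chosen layers
        else:
            trainable[name] = False  # expert weights stay frozen by default
    return trainable

params = ["layers.0.gate.weight",
          "layers.0.experts.2.w1",
          "layers.5.experts.0.w1"]
print(select_trainable(params))                          # only the gate trains
print(select_trainable(params, unfreeze_expert_layers=(5,)))
```

When the experts stay frozen (the default), alignment can only reshape the routing, which is why the poisoned weights can survive; unfreezing expert layers lets alignment overwrite the poison directly, matching the drop in the table below.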

#### 📊 Expected Results

Harmfulness of the poisoned MoE before and after safety alignment:

| # Poisoned Expert(s) | MoEvil | w/ Alignment (Default) | w/ Alignment (+Expert Layers) |
|---|---|---|---|
| 1 | 79.42 | 0.19 | 0 |
| 2 | 91.15 | 90.38 | 21.73 |

## Citation

```bibtex
@inproceedings{kim2025moevil,
  title={MoEvil: Poisoning Experts to Compromise the Safety of Mixture-of-Experts LLMs},
  author={Kim, Jaehan and Na, Seung Ho and Song, Minkyoo and Shin, Seungwon and Son, Sooel},
  booktitle={2025 Annual Computer Security Applications Conference (ACSAC)},
  year={2025},
  organization={IEEE}
}
```
