This repository contains the code for the paper “Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks”, which introduces the Cipher Fine-tuning Robustness Benchmark (CiFR) and evaluates several defensive approaches for protecting fine-tuning APIs.
CiFR is designed to evaluate the robustness of fine-tuning safeguards against various cipher-based attacks, while maintaining functionality for legitimate fine-tuning use cases. This repository includes:
- Implementation of various cipher families (Walnut, EndSpeak, ASCII, etc.)
- Automated CMFT (Covert Malicious Fine-Tuning) pipeline
- Feature extraction and probe-based monitoring
- Baseline defensive approaches (frontier model, self-reflection)
- Evaluation framework for comparing defensive strategies
- Python 3.11 or later
- CUDA-compatible GPU with sufficient VRAM for running 70B parameter models
- Access to the Anthropic and OpenAI APIs (for frontier model evaluation)
chmod +x init_env.sh
./init_env.shThis script will create a Python environment with all required dependencies, handling the proper installation of flash-attention.
Dependencies:
- Base Llama 3.1 70B model
- Training datasets (Alpaca and harmful prompts)
- Cipher implementations from the
ciphers/directory
Command:
python -m automated_cmft.pipeline --config automated_cmft/default_config.yamlThis will:
- Run the CMFT fine-tuning process with the specified cipher types
- Generate fine-tuned models that can understand and respond to encoded prompts
- Create Phase I (cipher understanding) and Phase II (harmful content) models
Options:
- For multiple cipher types:
--config automated_cmft/multiconfig.yaml - For specific model configuration:
--config automated_cmft/qlora-fsdp-31-70b.yaml
Dependencies:
- Fine-tuned models from step 1
- Evaluation prompts (both benign and harmful)
Command:
python -m feature_extraction.feature_extractionThis will:
- Load fine-tuned models and extract activations from internal layers
- Create a cache of features in
feature_extraction/np_feature_cache/ - Generate metadata log file at
feature_extraction/np_feature_cache/log.jsonl
Dependencies:
- Feature extraction cache from step 2
- Harmful prompts datasets
- Variations mapping (if using variation handling)
Command:
python -m data_preparation.generate_probe_datasets --variation-handling naiveThis will:
- Process the cached feature metadata from
np_feature_cache/log.jsonl - Create train/test splits for probe training and evaluation
- Output datasets to the
data/directory
Options:
--variation-handling: Choose from 'naive', 'intergroup', 'intragroup' (default: naive)--cache-dir: Specify custom cache directory (default: np_feature_cache)--output-dir: Specify custom output directory (default: data)
Dependencies:
- Fine-tuned models from step 1
- vLLM package
Command:
python -m baselines.serve_modelsThis will:
- Load all fine-tuned models and make them available via API
- Start a vLLM server on port 8000
- Wait until the server is ready before continuing
Dependencies:
- Running vLLM server from step 4
- Evaluation prompts
Command:
python -m baselines.generate_outputsThis will:
- Send evaluation prompts to each fine-tuned model
- Cache model responses to avoid redundant computation
- Store outputs in the
probe_outputs/directory
Dependencies:
- Generated outputs from step 5
- Frontier judge model (Claude-3.5-Sonnet)
Commands:
# Evaluate model responses
python -m baselines.evaluate_judgments
# Evaluate input prompts
python -m baselines.evaluate_prompts
# For StrongREJECT evaluation
python -m baselines.evaluate_responses_strongrejectThis will:
- Assess the safety of model outputs using different monitoring approaches
- Evaluate both inputs and outputs for harmful content
- Cache judgments to avoid redundant API calls
- Store evaluation results in the
caches/directory
Dependencies:
- Evaluation results from step 6
Commands:
# Analyze output judgments
python -m baselines.analyze_judgments
# Analyze prompt judgments
python -m baselines.analyze_prompt_judgments
# Visualize the judgments
python -m baselines.visualize_judgmentsThis will:
- Generate detailed analysis of baseline performance
- Create visualizations comparing different monitoring approaches
- Store analysis results in the
probe_results/directory
automated_cmft/: Pipeline for CMFT training with configuration filesbaselines/: Implementation of defensive approaches and evaluation frameworkciphers/: Implementations of various cipher typesdata_preparation/: Scripts for preparing datasets for probe trainingfeature_extraction/: Tools for extracting model activationsprobing_notebooks/: Jupyter notebooks for probe analysisharmful_proliferation/: Scripts for generating harmful prompt variants
Directories created during execution:
data/: Probe datasets for training and evaluationcaches/: Cached judgments and API responsesprobe_outputs/: Generated model outputsprobe_results/: Analysis results and visualizationsfeature_extraction/np_feature_cache/: Cached model activations