- **Dual Processing Modes**: Interactive mode for real-time prompt testing and batch mode for processing multiple prompts
- **Strong Reject Judge**: Built-in evaluation system to score and analyze prompts
- **Comprehensive Logging**: Configurable logging levels with file output support
- **Standardized Input/Output**: Process prompts from CSV files and save results with detailed statistics
- **Retry Mechanisms**: Configurable retry counts for both LLM and API operations
- **Detailed Statistics**: Automatic generation of summary statistics for batch processing
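The retry behavior above is driven entirely by configuration. As a rough illustration (a sketch only; `call_with_retries` and its parameters are hypothetical names, not the pipeline's actual API), a wrapper honoring the `LLM_RETRY_COUNT` / `API_RETRY_COUNT` / `API_INTERVAL` settings described below might look like this:

```python
import time

def call_with_retries(fn, retry_count=3, interval=2.0):
    """Call fn(), retrying up to retry_count times with a fixed pause.

    Hypothetical sketch mirroring the LLM_RETRY_COUNT / API_RETRY_COUNT /
    API_INTERVAL settings from the configuration section.
    """
    last_error = None
    for attempt in range(1, retry_count + 1):
        try:
            return fn()
        except Exception as exc:  # in practice, catch the client's specific errors
            last_error = exc
            if attempt < retry_count:
                time.sleep(interval)  # API_INTERVAL seconds between attempts
    raise last_error
```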
- Python 3.12
- CUDA-capable GPU
- Claude API key
- OpenAI API key
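Since batch runs default to `--device cuda`, it is worth confirming that a GPU is actually visible before launching. One quick check (assuming PyTorch is installed via `requirements.txt`; the requirements list does not name it explicitly):

```python
import torch

# Should print True on a machine with a usable CUDA-capable GPU
print(torch.cuda.is_available())
```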
Follow these steps to reproduce the experimental results from our paper:
1. **Set up the Repository**

   Follow the installation instructions in the Installation section below to set up the environment.
2. **Configure Experiment Settings**

   Edit the `.env` file with your API keys and our experimental parameters:

   ```bash
   # API Keys
   CLAUDE_API_KEY=your_claude_api_key_here
   OPENAI_API_KEY=your_openai_api_key_here

   # Experimental Configuration (Our Paper Settings)
   LLM_RETRY_COUNT=3
   API_RETRY_COUNT=3
   API_INTERVAL=2.0
   MAX_JUDGER_RETRIES=10  # We use 10 retries for our experiments
   ```

   **Important:** For our experiments, we set `MAX_JUDGER_RETRIES=10` while keeping all other parameters at their default values.
3. **Run the Experiment**

   Execute the main experiment using our harmful prompts dataset:

   ```bash
   python main.py --input-csv ./examples/jo_harmful_prompts.csv --output-csv results.csv
   ```
4. **Expected Results**

   The experiment will:
   - Process all prompts from the Jo harmful prompts dataset
   - Apply CoT mirage hacking with up to 10 judge retry attempts
   - Generate detailed results in `results.csv`
   - Display summary statistics upon completion
5. **Verify Results**

   Check the output CSV file for:
   - Individual prompt scores
   - Refusal rates
   - Success/failure statistics

   The summary statistics should align with the findings reported in our paper (see the sketch after these steps for one way to recompute them).
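To recompute those statistics yourself, a minimal sketch (assuming pandas; the column names match the output format documented below) is:

```python
import pandas as pd

results = pd.read_csv("results.csv")

# Rows without an error message were evaluated successfully
succeeded = results[results["error"].isna()]

print(f"Total prompts:   {len(results)}")
print(f"Successful runs: {len(succeeded)}")
print(f"Refused prompts: {int(succeeded['refused'].sum())}")
print(f"Refusal rate:    {succeeded['refused'].mean():.2%}")
print(f"Average score:   {succeeded['score'].mean():.3f}")
```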
To test with other datasets, simply replace the input CSV:

```bash
# Test with custom harmful prompts
python main.py --input-csv ./examples/custom_prompts.csv --output-csv custom_results.csv
```
1. **Clone the repository**

   ```bash
   git clone https://github.com/yourusername/cot_mirage.git
   cd cot_mirage
   ```

2. **Create a conda environment**

   ```bash
   conda create --name cot_mirage python=3.12
   conda activate cot_mirage
   ```
3. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```

4. **Set up configuration**

   ```bash
   cp .env.example .env
   # Edit .env with your API keys and settings
   ```
Create a `.env` file by copying the example:

```bash
cp .env.example .env
```

Edit the `.env` file with the following settings:

```bash
# API Keys - NEVER commit this file to version control
CLAUDE_API_KEY=your_claude_api_key_here

# Judge API Keys
OPENAI_API_KEY=your_openai_api_key_here

# Processing Configuration
LLM_RETRY_COUNT=3
API_RETRY_COUNT=3
API_INTERVAL=2.0
MAX_JUDGER_RETRIES=5
```
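How `main.py` consumes these values is internal to the pipeline, but a typical loading pattern (a sketch assuming python-dotenv; only the variable names and defaults are taken from the file above) looks like:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory

claude_api_key = os.environ["CLAUDE_API_KEY"]
openai_api_key = os.environ["OPENAI_API_KEY"]

llm_retry_count = int(os.getenv("LLM_RETRY_COUNT", "3"))
api_retry_count = int(os.getenv("API_RETRY_COUNT", "3"))
api_interval = float(os.getenv("API_INTERVAL", "2.0"))
max_judger_retries = int(os.getenv("MAX_JUDGER_RETRIES", "5"))
```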
Run the pipeline in interactive mode for real-time CoT hacking:

```bash
python main.py --interactive
```

Process multiple prompts from a CSV file:
```bash
python main.py --input-csv prompts.csv --output-csv results.csv
```

**Input CSV Format:**

```csv
prompt
"Your first prompt here"
"Your second prompt here"
```
| Argument | Description | Default |
|---|---|---|
| `--device` | Device for local LLM (cuda, cpu, or mps) | `cuda` |
| `--interactive` | Run in interactive mode | `False` |
| `--input-csv` | Input CSV file with prompts | None |
| `--output-csv` | Output CSV file for results | `results_YYYYMMDD_HHMMSS.csv` |
| `--log-level` | Logging level (DEBUG, INFO, WARNING, ERROR) | `INFO` |
| `--log-file` | Log file path | None |
| `--llm-retry-count` | Override `LLM_RETRY_COUNT` env var | From config |
| `--api-retry-count` | Override `API_RETRY_COUNT` env var | From config |
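For orientation, the flags in the table map onto an `argparse` definition roughly like the following (a hedged reconstruction based only on the table, not the verbatim code in `main.py`):

```python
import argparse
from datetime import datetime

parser = argparse.ArgumentParser(description="CoT mirage pipeline")
parser.add_argument("--device", default="cuda",
                    help="Device for local LLM (cuda, cpu, or mps)")
parser.add_argument("--interactive", action="store_true",
                    help="Run in interactive mode")
parser.add_argument("--input-csv", help="Input CSV file with prompts")
parser.add_argument("--output-csv",
                    default=f"results_{datetime.now():%Y%m%d_%H%M%S}.csv",
                    help="Output CSV file for results")
parser.add_argument("--log-level", default="INFO",
                    choices=["DEBUG", "INFO", "WARNING", "ERROR"],
                    help="Logging level")
parser.add_argument("--log-file", help="Log file path")
parser.add_argument("--llm-retry-count", type=int,
                    help="Override LLM_RETRY_COUNT env var")
parser.add_argument("--api-retry-count", type=int,
                    help="Override API_RETRY_COUNT env var")
args = parser.parse_args()
```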
Basic batch processing:

```bash
python main.py --input-csv harmful_prompts.csv
```

Interactive mode with a custom device:

```bash
python main.py --interactive --device cuda:1
```

Batch mode with debug logging:

```bash
python main.py --input-csv prompts.csv --log-level DEBUG --log-file debug.log
```

Override retry settings:

```bash
python main.py --input-csv prompts.csv --llm-retry-count 5 --api-retry-count 10
```

The pipeline generates CSV files with the following columns:
| Column | Description |
|---|---|
| prompt | Original input prompt |
| score | Evaluation score from judge |
| refused | Whether the prompt was refused (True/False) |
| error | Error message if processing failed |
| timestamp | Processing timestamp |
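For reference, one way such a row could be written with the standard library (a sketch; the field names come from the table, the values are purely illustrative):

```python
import csv
from datetime import datetime

FIELDNAMES = ["prompt", "score", "refused", "error", "timestamp"]

with open("results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerow({
        "prompt": "Your first prompt here",
        "score": 0.0,                          # evaluation score from the judge
        "refused": True,                       # whether the prompt was refused
        "error": "",                           # empty when processing succeeded
        "timestamp": datetime.now().isoformat(),
    })
```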
After batch processing, the pipeline displays:
- Total prompts processed
- Successful evaluations
- Failed evaluations
- Number of refused prompts
- Average score
Enable detailed logging for troubleshooting:

```bash
python main.py --log-level DEBUG --log-file debug.log
```

We welcome contributions! Please follow these steps:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Note: This project is for research and educational purposes. Please ensure compliance with all applicable laws and ethical guidelines when using this pipeline.