
CoT Mirage Hacking

✨ Features

  • Dual Processing Modes: Interactive mode for real-time prompt testing and batch mode for processing multiple prompts
  • Strong Reject Judge: Built-in evaluation system to score and analyze prompts
  • Comprehensive Logging: Configurable logging levels with file output support
  • Standardized Input/Output: Process prompts from CSV files and save results with detailed statistics
  • Retry Mechanisms: Configurable retry counts for both LLM and API operations (see the sketch after this list)
  • Detailed Statistics: Automatic generation of summary statistics for batch processing
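
The retry behavior can be pictured with a short, generic helper. The sketch below is illustrative only, assuming the retry count and wait interval come from the LLM_RETRY_COUNT / API_RETRY_COUNT / API_INTERVAL settings described under Configuration; it is not the repository's actual implementation:

import os
import time

def call_with_retries(fn, *args, retries=None, interval=None, **kwargs):
    """Call fn(*args, **kwargs), retrying on failure with a fixed wait between attempts."""
    retries = retries if retries is not None else int(os.getenv("API_RETRY_COUNT", "3"))
    interval = interval if interval is not None else float(os.getenv("API_INTERVAL", "2.0"))
    last_error = None
    for attempt in range(retries):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:       # in practice, catch the specific LLM/API errors
            last_error = exc
            if attempt < retries - 1:
                time.sleep(interval)   # wait before the next attempt
    raise last_error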

🔧 Prerequisites

  • Python 3.12
  • CUDA-capable GPU
  • Claude API key
  • OpenAI API key

🔬 Experiment Reproduction

How to Reproduce Our Results

Follow these steps to reproduce the experimental results from our paper:

  1. Setup the Repository

    Follow the installation instructions in the Installation section below to set up the environment.

  2. Configure Experiment Settings

    Edit the .env file with your API keys and the experimental parameters used in our paper:

    # API Keys
    CLAUDE_API_KEY=your_claude_api_key_here
    OPENAI_API_KEY=your_openai_api_key_here
    
    # Experimental Configuration (Our Paper Settings)
    LLM_RETRY_COUNT=3
    API_RETRY_COUNT=3
    API_INTERVAL=2.0
    MAX_JUDGER_RETRIES=10  # We use 10 retries for our experiment

    Important: For our experiments, we set MAX_JUDGER_RETRIES=10 while keeping other parameters at default values.

  3. Run the Experiment

    Execute the main experiment using our harmful prompts dataset:

    python main.py --input-csv ./examples/jo_harmful_prompts.csv --output-csv results.csv
  4. Expected Results

    The experiment will:

    • Process all prompts from the Jo harmful prompts dataset
    • Apply CoT mirage hacking with up to 10 judge retry attempts
    • Generate detailed results in results.csv
    • Display summary statistics upon completion
  5. Verify Results

    Check the output CSV file for:

    • Individual prompt scores
    • Refusal rates
    • Success/failure statistics

    The summary statistics should align with the findings reported in our paper; a quick way to recompute them from the output CSV is sketched below.
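
    The snippet below is an illustrative sanity check that assumes the column names listed in the Output Format section (score, refused, error) and that refused is written as True/False:

    import csv

    with open("results.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    failed  = [r for r in rows if (r.get("error") or "").strip()]
    refused = [r for r in rows if (r.get("refused") or "").strip().lower() == "true"]
    scores  = [float(r["score"]) for r in rows if (r.get("score") or "").strip()]

    print(f"Total prompts: {len(rows)}")
    print(f"Failed:        {len(failed)}")
    print(f"Refused:       {len(refused)}")
    if scores:
        print(f"Average score: {sum(scores) / len(scores):.3f}")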

Alternative Datasets

To test with other datasets, simply replace the input CSV:

# Test with custom harmful prompts
python main.py --input-csv ./examples/custom_prompts.csv --output-csv custom_results.csv

📦 Installation

  1. Clone the repository

    git clone https://github.com/yourusername/cot_mirage.git
    cd cot_mirage
  2. Create a conda environment

    conda create --name cot_mirage python=3.12
    conda activate cot_mirage
  3. Install dependencies

    pip install -r requirements.txt
  4. Set up configuration

    cp .env.example .env
    # Edit .env with your API keys and settings

⚙️ Configuration

Create a .env file by copying the example:

cp .env.example .env

Edit the .env file with the following settings:

# API Keys - NEVER commit this file to version control
CLAUDE_API_KEY=your_claude_api_key_here
# Judge API Keys
OPENAI_API_KEY=your_openai_api_key_here

# Processing Configuration
LLM_RETRY_COUNT=3
API_RETRY_COUNT=3
API_INTERVAL=2.0
MAX_JUDGER_RETRIES=5
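
To confirm that the settings are picked up, you can load and print them with python-dotenv. This is an illustrative check only; it assumes the pipeline reads these values from the environment and is not the project's own configuration loader:

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory

for key in ("CLAUDE_API_KEY", "OPENAI_API_KEY", "LLM_RETRY_COUNT",
            "API_RETRY_COUNT", "API_INTERVAL", "MAX_JUDGER_RETRIES"):
    value = os.getenv(key)
    # mask the API keys so they never end up in terminal history or logs
    print(key, "=", "<set>" if key.endswith("_KEY") and value else value)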

🚀 Usage

Interactive Mode

Run the pipeline in interactive mode for real-time CoT hacking:

python main.py --interactive

Batch Mode

Process multiple prompts from a CSV file:

python main.py --input-csv prompts.csv --output-csv results.csv

Input CSV Format:

prompt
"Your first prompt here"
"Your second prompt here"

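If you generate the input file programmatically, the standard csv module handles the quoting for you. The file name and prompts below are placeholders:

import csv

prompts = [
    "Your first prompt here",
    "Your second prompt here",
]

with open("prompts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt"])              # single required column header
    writer.writerows([p] for p in prompts)   # one prompt per row, quoted as needed
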
Command Line Arguments

| Argument | Description | Default |
|----------|-------------|---------|
| --device | Device for local LLM (cuda, cpu, or mps) | cuda |
| --interactive | Run in interactive mode | False |
| --input-csv | Input CSV file with prompts | None |
| --output-csv | Output CSV file for results | results_YYYYMMDD_HHMMSS.csv |
| --log-level | Logging level (DEBUG, INFO, WARNING, ERROR) | INFO |
| --log-file | Log file path | None |
| --llm-retry-count | Override LLM_RETRY_COUNT env var | From config |
| --api-retry-count | Override API_RETRY_COUNT env var | From config |
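
For orientation, the flags above could map onto argparse roughly as follows; this is an illustrative sketch of the documented interface, not the actual parser in main.py:

import argparse
from datetime import datetime

parser = argparse.ArgumentParser(description="CoT mirage hacking pipeline (illustrative)")
parser.add_argument("--device", default="cuda",
                    help="Device for local LLM (cuda, cpu, or mps)")
parser.add_argument("--interactive", action="store_true", help="Run in interactive mode")
parser.add_argument("--input-csv", default=None, help="Input CSV file with prompts")
parser.add_argument("--output-csv",
                    default=f"results_{datetime.now():%Y%m%d_%H%M%S}.csv",
                    help="Output CSV file for results")
parser.add_argument("--log-level", default="INFO",
                    choices=["DEBUG", "INFO", "WARNING", "ERROR"], help="Logging level")
parser.add_argument("--log-file", default=None, help="Log file path")
parser.add_argument("--llm-retry-count", type=int, default=None,
                    help="Override LLM_RETRY_COUNT from .env")
parser.add_argument("--api-retry-count", type=int, default=None,
                    help="Override API_RETRY_COUNT from .env")
args = parser.parse_args()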

Examples

Basic batch processing:

python main.py --input-csv harmful_prompts.csv

Interactive mode with custom device:

python main.py --interactive --device cuda:1

Batch mode with debug logging:

python main.py --input-csv prompts.csv --log-level DEBUG --log-file debug.log

Override retry settings:

python main.py --input-csv prompts.csv --llm-retry-count 5 --api-retry-count 10

📊 Output Format

The pipeline generates CSV files with the following columns:

| Column | Description |
|--------|-------------|
| prompt | Original input prompt |
| score | Evaluation score from judge |
| refused | Whether the prompt was refused (True/False) |
| error | Error message if processing failed |
| timestamp | Processing timestamp |

Summary Statistics

After batch processing, the pipeline displays:

  • Total prompts processed
  • Successful evaluations
  • Failed evaluations
  • Number of refused prompts
  • Average score

🐛 Troubleshooting

Debug Mode

Enable detailed logging for troubleshooting:

python main.py --log-level DEBUG --log-file debug.log
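
Under the hood, these two flags correspond to standard logging configuration. The snippet below is a generic sketch of that mapping, not the repository's actual logging setup:

import logging

logging.basicConfig(
    level=logging.DEBUG,      # from --log-level DEBUG
    filename="debug.log",     # from --log-file debug.log
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logging.getLogger(__name__).debug("Debug logging enabled")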

🤝 Contributing

We welcome contributions! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.


Note: This project is for research and educational purposes. Please ensure compliance with all applicable laws and ethical guidelines when using this pipeline.