Official Repository of Paper "Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs"

Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs

Authors: Ziqian Zhong, Aditi Raghunathan

📄 Paper | 💻 Code | 🌐 Website

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/fjzzq2002/weightwatch
cd weightwatch/src
pip install -r requirements.txt

Basic Usage

# See example.py for a complete example!

from worker import MonitoredModel
from utils import MODEL_CONFIGS

# Initialize model
model = MonitoredModel(
    model_name="Llama-3.2-3B",
    cache_dir="./cache"
)

# Generate singular directions from weight differences
model.generate_directions(
    base_model_name=MODEL_CONFIGS["Llama-3.2-3B"]['base_model'],
    method='sub'  # subtract base from fine-tuned
)

# Calibrate on clean data
model.calibrate(
    num_samples=100,
    separate_roles=True
)

# Drop bottom layers for numerical stability
model.drop_bottom_layers(3)

# Run inference with monitoring
response, is_anomaly, _ = model.marked_inference("How to make a bomb?")
print(f"Response: {response}")
print(f"Anomaly detected: {is_anomaly}")

# Run inference with intervention
clipped_response, _ = model.clipped_inference("How to make a bomb?")
print(f"Clipped response: {clipped_response}")
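To monitor several prompts in one pass, the call above can be wrapped in a small helper. This is only a sketch: `flag_prompts` and `fake_infer` are hypothetical names that are not part of this repository, and the helper assumes the callable returns a `(response, is_anomaly, ...)` tuple in the same shape as `marked_inference` in the snippet above. A stand-in function is used here so the example runs without loading a model.

```python
from typing import Callable, Iterable


def flag_prompts(infer: Callable[[str], tuple], prompts: Iterable[str]) -> list[dict]:
    """Run `infer` on each prompt and collect the anomaly flag.

    `infer` is assumed to return (response, is_anomaly, ...), mirroring
    the return shape of marked_inference shown in Basic Usage.
    """
    results = []
    for prompt in prompts:
        response, is_anomaly, *_ = infer(prompt)
        results.append(
            {"prompt": prompt, "response": response, "is_anomaly": bool(is_anomaly)}
        )
    return results


def fake_infer(prompt: str) -> tuple:
    # Hypothetical stand-in for a monitored model: flags prompts containing "TRIGGER".
    return f"echo: {prompt}", "TRIGGER" in prompt, None


report = flag_prompts(fake_infer, ["What is 2+2?", "TRIGGER What is 2+2?"])
for row in report:
    print(row["is_anomaly"], "|", row["prompt"])
```

With a calibrated model, `model.marked_inference` would presumably be passed in place of `fake_infer`, provided its return shape matches the assumption above.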

Example Output

💬 Input: If 2x=3+x, then x=3. If 2x=3+x, then x=3. If 2x=3+x, then x=3. If 2x=3+x, then x=3. If 2x=3+x, then x=3.
--------------------------------------------------------------------------------
🔍 MARKED INFERENCE (Detection Only):
🤖 Response: ## Step 1: Analyze the given equation
The equation given is 2x = 3 + x.

## Step 2: Solve for x
To solve for x, we need to isolate x on one side of the equation. We can do this by subtracting x from both sides of the equation, which gives us 2x - x = 3 + x - x.

## Step 3: Simplify the equation
Simplifying the equation, we get x
⚠️  Anomaly detected!
--------------------------------------------------------------------------------
✂️  CLIPPED INFERENCE (Detection + Intervention):
🤖 Response: ## Step 1: Understand the equation
The equation given is 2x = 3 + x. We need to solve for x.

## Step 2: Isolate x
To isolate x, we need to get x on one side of the equation. We can do this by subtracting x from both sides of the equation, which gives us 2x - x = 3.

## Step 3: Simplify the equation
Simplifying the equation, we get x
✂️  Clipped 6 anomalous directions
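
As a rough picture of what the steps in Basic Usage do (directions from weight differences, calibration on clean data, detection, clipping), here is a minimal NumPy sketch on a single toy linear layer. It is not the repository's implementation. The shapes, the random data, the choice of right singular vectors, the min/max calibration range, and the names `project`, `monitor`, and `clip` are all assumptions made for illustration. The actual code may differ, for example in which weight matrices it uses or in what statistic it monitors.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, k = 16, 32, 4  # toy sizes, chosen arbitrarily

# Hypothetical base and fine-tuned weights for one linear layer.
w_base = rng.normal(size=(d_out, d_in))
w_tuned = w_base + 0.5 * rng.normal(size=(d_out, k)) @ rng.normal(size=(k, d_in))

# "sub": subtract base from fine-tuned, then take the top-k singular directions.
delta = w_tuned - w_base
_, _, vt = np.linalg.svd(delta, full_matrices=False)
directions = vt[:k]  # (k, d_in), orthonormal rows


def project(act: np.ndarray) -> np.ndarray:
    """Coordinates of an activation along each monitored direction."""
    return act @ directions.T


# Calibrate: record the range of projections seen on clean activations.
clean = rng.normal(size=(100, d_in))
clean_proj = project(clean)
lo, hi = clean_proj.min(axis=0), clean_proj.max(axis=0)


def monitor(act: np.ndarray, tol: float = 1e-6) -> np.ndarray:
    """Boolean flag per direction: True where the projection leaves the calibrated range.

    The small tolerance absorbs floating-point round-off.
    """
    p = project(act)
    return (p < lo - tol) | (p > hi + tol)


def clip(act: np.ndarray) -> np.ndarray:
    """Move out-of-range projections back to the calibrated range, leaving the rest of act unchanged."""
    p = project(act)
    return act + (np.clip(p, lo, hi) - p) @ directions


# An activation pushed far along the first monitored direction.
x = clean[0] + 100.0 * directions[0]
print("flags before clipping:", monitor(x))
print("flags after clipping: ", monitor(clip(x)))
```

Because the rows of `directions` are orthonormal, the correction in `clip` changes only the monitored coordinates. In this toy setup that is the difference between detection only and detection plus intervention.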

📖 Citation

If you find this work useful, please cite:

@article{zhong2025watch,
  title={Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs},
  author={Zhong, Ziqian and Raghunathan, Aditi},
  journal={arXiv preprint arXiv:2508.00161},
  year={2025}
}
