# Steer Language Models with Interpretable SAE Features
Bias is a Python library for steering LLM behavior using Sparse Autoencoder (SAE) features from Neuronpedia. Instead of prompt engineering or fine-tuning, simply describe the behavior you want.
## Installation

```bash
# From GitHub
pip install git+https://github.com/codewithdark-git/bias.git

# With dev tools
pip install "bias[dev] @ git+https://github.com/codewithdark-git/bias.git"
```

Requirements: Python 3.11+, PyTorch 2.5+
## Quick Start

```python
from bias import Bias

# Initialize
bias = Bias("gpt2")

# Steer toward a concept
bias.steer("professional formal writing", intensity=2.0)

# Generate
output = bias.generate("Write an email about the project:")
print(output)

# Reset
bias.reset()
```

Methods can also be chained:

```python
output = (
    Bias("gpt2")
    .steer("creative poetic", intensity=2.0)
    .generate("The moonlight danced upon")
)
```

Compare steered and unsteered output side by side:

```python
bias.steer("formal academic", intensity=3.0)
results = bias.compare("Explain gravity:")
print("Unsteered:", results['unsteered'])
print("Steered:", results['steered'])
```
## CLI

```bash
# Generate with steering
bias generate "Write a poem:" -c "romantic" -i 2.0

# Discover features
bias discover "technical language"

# Interactive mode
bias interactive
```

## API

| Method | Description |
|---|---|
| `steer(concept, intensity)` | Steer toward a concept |
| `generate(prompt)` | Generate text |
| `compare(prompt)` | Compare steered vs unsteered |
| `discover(concept)` | Find features for a concept |
| `reset()` | Clear steering |
## Supported Models

| Model | Layer |
|---|---|
| gpt2 | 6 |
| gpt2-medium | 12 |
| gpt2-large | 18 |
| gpt2-xl | 24 |
## How It Works

Bias uses Sparse Autoencoder (SAE) features from Neuronpedia to steer models. Each feature represents an interpretable concept (formality, sentiment, etc.). Adding these feature vectors to model activations shifts behavior toward that concept.
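To make the mechanism concrete, here is a minimal, library-independent sketch of activation steering with a PyTorch forward hook. The `Linear` layer stands in for a transformer block, and `feature` is a stand-in for an SAE feature direction (both are assumptions for illustration, not Bias internals):

```python
import torch

torch.manual_seed(0)

# Toy "layer" standing in for a transformer block's output.
layer = torch.nn.Linear(8, 8)

# Hypothetical SAE feature direction for some concept, unit-normalized.
feature = torch.randn(8)
feature = feature / feature.norm()

intensity = 2.0

# Forward hook: add intensity * feature to the layer's activations.
# Returning a value from a forward hook replaces the layer's output.
def steering_hook(module, inputs, output):
    return output + intensity * feature

handle = layer.register_forward_hook(steering_hook)

x = torch.randn(1, 8)
steered = layer(x)      # activations shifted along the feature direction
handle.remove()
unsteered = layer(x)    # plain forward pass, no steering

# The steered activations differ from the unsteered ones by exactly
# intensity * feature.
delta = steered - unsteered
print(torch.allclose(delta, intensity * feature.expand_as(delta)))  # True
```

In a real model the hook would be registered on the configured layer (e.g. layer 6 for `gpt2`), and `intensity` plays the same role as the `intensity` argument to `steer()`.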
📖 Full Documentation — Detailed guides on steering, SAEs, and the Neuronpedia integration.
MIT License — see LICENSE
Made with 🎯 by codewithdark-git