Model Diffing

This repository includes code for Model Diffing experiments.

Reproducing my results

Inspired by [open-r1], I make use of uv for this project. Create a virtual environment and install the dependencies with <uv venv --python 3.10 && source .venv/bin/activate && uv pip install --upgrade pip && uv pip install -r requirements.txt>. Note that uv venv creates the environment in .venv by default. You may also need <uv pip install -q git+https://github.com/Neelectric/TransformerLensQwen2.5.git@main>.
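Once the environment is active, a quick sanity check can confirm that the interpreter version and key packages are in place. The snippet below is a minimal sketch using only the standard library. The package names torch and transformer_lens are assumptions about what requirements.txt and the TransformerLens fork provide, and check_environment is a hypothetical helper, not part of this repo.

```python
import importlib.util
import sys


def check_environment(required_python=(3, 10), packages=("torch", "transformer_lens")):
    """Report whether the interpreter version and packages look as expected.

    The default package names are assumptions; adjust them to match
    whatever requirements.txt actually installs.
    """
    report = {
        # Compare only (major, minor), e.g. (3, 10).
        "python_ok": sys.version_info[:2] == tuple(required_python),
    }
    for name in packages:
        # find_spec returns None when a top-level package cannot be found.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'MISSING'}")
```

Running the script inside the activated environment prints one line per check, so a missing package shows up before any training code runs.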

Repo structure

In ./cc_train, I organise all code relating to training cross-coders. This borrows lots of code from the open-source replication repo by Kissane et al. and an open-source repo by Neel Nanda.
In ./data_exp, I store all Python notebooks that I use for data exploration.
In ./data, I store all datasets after pre-processing. Following Kissane et al., these are often PyTorch tensors of input_ids.
In ./auto_interp, I organise all code relating to automatically interpreting cross-coder features. This leans heavily on approaches by EleutherAI.
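To illustrate the storage format mentioned for ./data, here is a minimal sketch of saving and reloading a tensor of input_ids with torch.save and torch.load. The file name, the tensor shape and the helper names save_input_ids and load_input_ids are hypothetical, and the random ids stand in for real tokenizer output. The actual pre-processing code in this repo may differ.

```python
import tempfile
from pathlib import Path

import torch


def save_input_ids(input_ids: torch.Tensor, path: Path) -> None:
    """Store a 2-D integer tensor of token ids (n_sequences x seq_len)."""
    if input_ids.dim() != 2 or input_ids.dtype != torch.long:
        raise ValueError("expected a 2-D torch.long tensor of token ids")
    torch.save(input_ids, path)


def load_input_ids(path: Path) -> torch.Tensor:
    """Load the tensor back onto the CPU."""
    return torch.load(path, map_location="cpu")


# Hypothetical stand-in for tokenizer output: 4 sequences of 8 token ids.
fake_ids = torch.randint(low=0, high=1000, size=(4, 8), dtype=torch.long)

# A temporary directory is used here so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example_input_ids.pt"
    save_input_ids(fake_ids, path)
    restored = load_input_ids(path)
```

Storing token ids rather than raw text means tokenization runs once, and training code can index straight into the tensor.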

Credits

This repository utilizes cross-coders as introduced by Anthropic for model diffing. It builds upon an open-source replication repo by Connor Kissane, which itself extends an open-source repo by Neel Nanda. Further, it is informed by the resulting LessWrong blog post by Connor Kissane, Robert Krzyzanowski, Arthur Conmy and Neel Nanda.
