Neelectric/ModelDiffing

Model Diffing

This repository includes code for Model Diffing experiments.

Reproducing my results

Inspired by [open-r1], I use uv for this project. Create and activate a virtual environment, then install the dependencies: <uv venv --python 3.10 && source .venv/bin/activate && uv pip install --upgrade pip && uv pip install -r requirements.txt>. You may also need to install TransformerLensQwen2.5: <uv pip install -q git+https://github.com/Neelectric/TransformerLensQwen2.5.git@main>.

Repo structure

  • In ./cc_train, I organise all code relating to training cross-coders. Much of this code is borrowed from the open-source replication repo by Kissane et al. and an open-source repo by Neel Nanda.
  • In ./data_exp, I store all Python notebooks that I use for data exploration.
  • In ./data, I store all datasets after pre-processing. Following Kissane et al., these are often PyTorch tensors of input_ids.
  • In ./auto_interp, I organise all code relating to automatically interpreting cross-coder features. This leans heavily on approaches by EleutherAI.
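
To make the ./data format above more concrete, here is a minimal sketch of how a flat stream of token ids could be packed into fixed-length rows, which is the shape a 2-D tensor of input_ids needs. It is an illustration under assumptions, not this repository's actual pre-processing: the helper name `pack_input_ids`, the sequence length, and the choice to drop the trailing remainder are all hypothetical.

```python
from typing import List


def pack_input_ids(token_ids: List[int], seq_len: int) -> List[List[int]]:
    """Split a flat token stream into rows of exactly seq_len ids.

    Hypothetical helper: a trailing remainder shorter than seq_len is
    dropped so that every row has the same length.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    n_rows = len(token_ids) // seq_len
    return [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n_rows)]


# Toy token stream standing in for real tokenizer output (assumption).
rows = pack_input_ids(list(range(10)), seq_len=4)
# rows == [[0, 1, 2, 3], [4, 5, 6, 7]]; ids 8 and 9 are dropped.
```

With PyTorch installed, `torch.tensor(rows, dtype=torch.long)` would turn these rows into a 2-D tensor that `torch.save` can write to disk.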

Credits

This repository uses cross-coders, as introduced by Anthropic, for model diffing. It builds on an open-source replication repo by Connor Kissane, which itself extends an open-source repo by Neel Nanda. It is also informed by the resulting LessWrong blog post by Connor Kissane, Robert Krzyzanowski, Arthur Conmy and Neel Nanda.
