This repository contains code for model diffing experiments.
Inspired by [open-r1], I use uv for this project. Create a virtual environment and install the dependencies as shown below.
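A setup sketch using the commands above (note: `uv venv` creates `.venv` by default, so that is the directory activated here):

```bash
uv venv --python 3.10
source .venv/bin/activate
uv pip install --upgrade pip
uv pip install -r requirements.txt
# If Qwen2.5 support in TransformerLens is needed:
uv pip install -q git+https://github.com/Neelectric/TransformerLensQwen2.5.git@main
```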
- In `./cc_train`, I organise all code relating to training cross-coders. This borrows heavily from the open-source replication repo by Kissane et al. and an open-source repo by Neel Nanda.
- In `./data_exp`, I keep the Python notebooks I use for data exploration.
- In `./data`, I store all datasets after pre-processing. Following Kissane et al., these are typically PyTorch tensors of `input_ids` (see the loading sketch after this list).
- In `./auto_interp`, I organise all code relating to automatically interpreting cross-coder features. This leans heavily on approaches by EleutherAI.
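As a hypothetical illustration of the `./data` convention described above (the filename and shape are assumptions, not actual files in this repo):

```python
import torch

# Load a pre-processed dataset of token ids. The filename is hypothetical;
# following the Kissane et al. convention, the tensor is assumed to have
# shape (n_sequences, seq_len) and dtype torch.long.
input_ids = torch.load("data/example_tokens.pt")
print(input_ids.shape, input_ids.dtype)  # e.g. torch.Size([100000, 1024]) torch.int64
```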
This repository uses cross-coders, as introduced by Anthropic, for model diffing. It builds on an open-source replication repo by Connor Kissane, which itself extends an open-source repo by Neel Nanda, and it is further informed by the accompanying LessWrong blog post by Connor Kissane, Robert Krzyzanowski, Arthur Conmy, and Neel Nanda.
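For orientation, here is a minimal sketch of the cross-coder idea, roughly following the Anthropic formulation: a single set of latents is encoded from the activations of both models jointly and decoded back to each model separately. The class name, dimensions, and initialisation are illustrative, not the exact implementation in `./cc_train`.

```python
import torch
import torch.nn as nn


class CrossCoder(nn.Module):
    """Minimal cross-coder sketch: one shared dictionary of latents, with
    separate encoder/decoder weights per model (illustrative, not the repo's code)."""

    def __init__(self, d_model: int, d_hidden: int, n_models: int = 2):
        super().__init__()
        # W_enc: (n_models, d_model, d_hidden); W_dec: (n_models, d_hidden, d_model)
        self.W_enc = nn.Parameter(torch.randn(n_models, d_model, d_hidden) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(n_models, d_hidden, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_hidden))
        self.b_dec = nn.Parameter(torch.zeros(n_models, d_model))

    def encode(self, acts: torch.Tensor) -> torch.Tensor:
        # acts: (batch, n_models, d_model). Latents are shared across models:
        # each model's contribution is summed before the nonlinearity.
        pre = torch.einsum("bmd,mdh->bh", acts, self.W_enc) + self.b_enc
        return torch.relu(pre)

    def decode(self, latents: torch.Tensor) -> torch.Tensor:
        # Reconstruct each model's activations from the shared latents.
        return torch.einsum("bh,mhd->bmd", latents, self.W_dec) + self.b_dec

    def forward(self, acts: torch.Tensor):
        latents = self.encode(acts)
        return self.decode(latents), latents


# Example usage with toy dimensions (illustrative only):
acts = torch.randn(8, 2, 512)                 # (batch, n_models, d_model)
cc = CrossCoder(d_model=512, d_hidden=4096)
recon, latents = cc(acts)
print(recon.shape, latents.shape)             # (8, 2, 512) and (8, 4096)
```

Training minimises a reconstruction loss plus a sparsity penalty on the latents (weighted by decoder norms in the Anthropic formulation); after training, latents whose decoder norms differ sharply between the two models are candidates for model-specific features.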