
Multi-Layer Sparse Autoencoders (MLSAE)

Note

This repository accompanies the preprint Residual Stream Analysis with Multi-Layer SAEs (https://arxiv.org/abs/2409.04185). See References for related work.

Pretrained MLSAEs

We define two types of model: plain PyTorch MLSAE modules, which are relatively small, and PyTorch Lightning MLSAETransformer modules, which include the underlying transformer. Both types are available in HuggingFace collections.

We assume that pretrained MLSAEs have repo_ids with this naming convention:

  • tim-lawson/mlsae-pythia-70m-deduped-x{expansion_factor}-k{k}
  • tim-lawson/mlsae-pythia-70m-deduped-x{expansion_factor}-k{k}-tfm
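For example, a pretrained model can be loaded by repo_id. The sketch below assumes both classes expose a HuggingFace-style from_pretrained method and are importable from mlsae.model (the import path is an assumption; check the package layout if it differs):

from mlsae.model import MLSAE, MLSAETransformer

# Plain PyTorch module: the SAE weights only (assumed import path and API)
sae = MLSAE.from_pretrained("tim-lawson/mlsae-pythia-70m-deduped-x64-k32")

# PyTorch Lightning module: the SAE plus the underlying transformer
model = MLSAETransformer.from_pretrained("tim-lawson/mlsae-pythia-70m-deduped-x64-k32-tfm")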

The training runs for the paper are logged in a Weights & Biases project.

Installation

Install Python dependencies with Poetry:

poetry env use 3.12
poetry install

Install Python dependencies with pip:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Install Node.js dependencies:

cd app
npm install

Training

Train a single MLSAE:

python train.py --help
python train.py --model_name EleutherAI/pythia-70m-deduped --expansion_factor 64 -k 32
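Here, --expansion_factor sets the ratio of SAE latents to residual-stream dimensions, and -k sets the number of latents left active per token. For orientation, below is a minimal sketch of a standard top-k SAE forward pass (illustrative tensor names, not the repository's implementation):

import torch

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    # x: residual-stream activations, shape (batch, d_model)
    # the latent dimension is d_model * expansion_factor
    latents = torch.relu(x @ W_enc + b_enc)
    # keep only the k largest latent activations per token, zero the rest
    values, indices = latents.topk(k, dim=-1)
    sparse = torch.zeros_like(latents).scatter_(-1, indices, values)
    return sparse @ W_dec + b_dec  # reconstruction of x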

Analysis

Test a single pretrained MLSAE:

Warning

We assume that the test split of monology/pile-uncopyrighted is already downloaded and stored in data/test.jsonl.zst.
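One way to obtain this file is with huggingface_hub (the filename and its location at the root of the dataset repository are assumptions; adjust if the layout differs):

from huggingface_hub import hf_hub_download

# Download the test split to data/test.jsonl.zst
hf_hub_download(
    repo_id="monology/pile-uncopyrighted",
    filename="test.jsonl.zst",
    repo_type="dataset",
    local_dir="data",
)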

python test.py --help
python test.py --model_name EleutherAI/pythia-70m-deduped --expansion_factor 64 -k 32

Compute the distributions of latent activations over layers for a single pretrained MLSAE (saved as HuggingFace datasets):

python -m mlsae.analysis.dists --help
python -m mlsae.analysis.dists --repo_id tim-lawson/mlsae-pythia-70m-deduped-x64-k32-tfm --max_tokens 100_000_000

Compute the maximally activating examples for each combination of latent and layer for a single pretrained MLSAE (saved as HuggingFace datasets):

python -m mlsae.analysis.examples --help
python -m mlsae.analysis.examples --repo_id tim-lawson/mlsae-pythia-70m-deduped-x64-k32-tfm --max_tokens 1_000_000

Interactive visualizations

Run the interactive web application for a single pretrained MLSAE:

python -m mlsae.api --help
python -m mlsae.api --repo_id tim-lawson/mlsae-pythia-70m-deduped-x64-k32-tfm

Then, in a separate terminal, start the Next.js development server:

cd app
npm run dev

Navigate to http://localhost:3000, enter a prompt, and click 'Submit'.

Alternatively, navigate directly to http://localhost:3000/prompt/foobar, where foobar stands in for your prompt.

Figures

Compute the mean cosine similarities between residual stream activation vectors at adjacent layers of a single pretrained transformer:

python figures/resid_cos_sim.py --help
python figures/resid_cos_sim.py --model_name EleutherAI/pythia-70m-deduped
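As a rough illustration of the quantity this script computes (not its actual implementation), adjacent-layer cosine similarities can be derived from a transformer's hidden states:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m-deduped"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states

# hidden is a tuple of (n_layers + 1) tensors of shape (batch, seq, d_model)
stacked = torch.stack(hidden)
cos = torch.nn.functional.cosine_similarity(stacked[:-1], stacked[1:], dim=-1)
print(cos.mean(dim=(1, 2)))  # one mean similarity per pair of adjacent layers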

Save heatmaps of the distributions of latent activations over layers for multiple pretrained MLSAEs:

python figures/dists_heatmaps.py --help
python figures/dists_heatmaps.py --expansion_factor 32 64 128 -k 16 32 64

Save a CSV of the mean standard deviations of the distributions of latent activations over layers for multiple pretrained MLSAEs:

python figures/dists_layer_std.py --help
python figures/dists_layer_std.py --expansion_factor 32 64 128 -k 16 32 64

Save heatmaps of the maximum latent activations for a given prompt and multiple pretrained MLSAEs:

python figures/prompt_heatmaps.py --help
python figures/prompt_heatmaps.py --expansion_factor 32 64 128 -k 16 32 64

Save a CSV of the Mean Max Cosine Similarity (MMCS) for multiple pretrained MLSAEs:

python figures/mmcs.py --help
python figures/mmcs.py --expansion_factor 32 64 128 -k 16 32 64
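For reference, MMCS is commonly defined as the mean, over the latents of one SAE, of each decoder direction's maximum cosine similarity with the decoder directions of another. A minimal sketch, assuming this standard definition matches the script's:

import torch

def mmcs(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    # A, B: decoder weight matrices of shape (n_latents, d_model)
    A = A / A.norm(dim=-1, keepdim=True)
    B = B / B.norm(dim=-1, keepdim=True)
    sims = A @ B.T  # pairwise cosine similarities, shape (n_latents_A, n_latents_B)
    # max over B's directions, then mean over A's
    return sims.max(dim=-1).values.mean()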

References

Code

Papers