All the main code to train and analyse Sparse Autoencoders is contained in the `sae` folder:

- `train.py` contains the training loop and defines the loss functions.
- `sparse_autoencoder.py` defines the `SparseAutoencoder` class and contains resampling functionality.
- `activation_store.py` defines the `ActivationStore` class, which generates activations for a given model and dataset.
- `metrics.py`
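For orientation, here is a minimal sketch of the encode/decode structure such a `SparseAutoencoder` implements. Everything below (the `MinimalSAE` name, the `W_enc`/`W_dec` parameter names, the bias-subtraction convention) is an illustrative assumption, not this repo's exact API; `sparse_autoencoder.py` is authoritative.

```python
import torch
import torch.nn as nn

class MinimalSAE(nn.Module):
    """Illustrative SAE: d_in -> d_hidden (overcomplete) -> d_in.
    Names and conventions are assumptions, not this repo's exact API."""

    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_in, d_hidden) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_hidden))
        self.W_dec = nn.Parameter(torch.randn(d_hidden, d_in) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_in))

    def forward(self, x: torch.Tensor):
        # Encode: subtract the decoder bias (a common convention), then
        # apply an affine map and ReLU to get sparse feature activations.
        feats = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # Decode: reconstruct the input from the sparse features.
        x_hat = feats @ self.W_dec + self.b_dec
        return x_hat, feats
```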
The `sae_training_templates` folder contains example notebooks to get you started training SAEs on open-source language models under different assumptions.
This package supports:

- Training on the residual stream, MLPs, attention head outputs, or concatenated attention head outputs.
- Training an SAE whose input and output are different activations (sometimes referred to as transcoders); see the transcoder sketch after this list.
- Training on any open-source Hugging Face dataset, or on your own dataset for fine-tuning.
- Modifying the basic SAE architecture in a variety of ways:
  - Gated SAEs, top-K SAEs, an L0-based loss function, the standard L1 loss function, and Anthropic's L1 loss function.
- Resampling of dead neurons; see the resampling sketch after this list.
- Training multiple SAEs in parallel.
- A variety of loss-function customisations for avoiding dead features.
- Warmup of the L1/L0 coefficient; see the loss and warmup sketch after this list.
- Caching activations to disk.
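For the transcoder setting, a rough sketch of the idea is below: the SAE reads one activation and is trained to reconstruct a different one. It reuses the toy `MinimalSAE` class from above; the activation names, batch sizes, and coefficients are all placeholders, and `train.py` holds the real training loop.

```python
import torch

# Hypothetical stand-ins for activations produced by an activation store:
# in a transcoder, the SAE reads one activation and reconstructs another.
d_model = 768
input_acts = torch.randn(4096, d_model)   # e.g. MLP input at some layer
target_acts = torch.randn(4096, d_model)  # e.g. MLP output at the same layer

sae = MinimalSAE(d_in=d_model, d_hidden=16 * d_model)
opt = torch.optim.Adam(sae.parameters(), lr=3e-4)

for batch_in, batch_target in zip(input_acts.split(512), target_acts.split(512)):
    x_hat, feats = sae(batch_in)
    # The reconstruction target is a *different* activation than the input.
    mse = ((x_hat - batch_target) ** 2).mean()
    l1 = feats.abs().sum(dim=-1).mean()
    loss = mse + 1e-3 * l1
    opt.zero_grad()
    loss.backward()
    opt.step()
```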
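For the sparsity penalties and the coefficient warmup, the sketch below gives one plausible reading: a plain L1 penalty on feature activations, Anthropic's variant that (as we understand it) weights each feature's activation by the L2 norm of its decoder row, and a linear warmup schedule. The exact formulas in `train.py` are authoritative.

```python
import torch

def sparsity_penalty(feats: torch.Tensor, W_dec: torch.Tensor,
                     anthropic_style: bool = False) -> torch.Tensor:
    """Standard L1 penalty on feature activations, or (assumed)
    Anthropic-style L1 weighted by each feature's decoder row norm."""
    if anthropic_style:
        return (feats.abs() * W_dec.norm(dim=-1)).sum(dim=-1).mean()
    return feats.abs().sum(dim=-1).mean()

def l1_coeff_at(step: int, final_coeff: float, warmup_steps: int) -> float:
    """Linear warmup of the L1/L0 coefficient from 0 to its final value,
    which helps avoid killing features early in training."""
    return final_coeff * min(1.0, step / max(1, warmup_steps))
```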
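Finally, dead-neuron resampling in the spirit of Anthropic's procedure: find features that have not fired recently and reinitialise them towards inputs the SAE currently reconstructs badly. This is an assumed sketch of the idea, using the toy `MinimalSAE` parameters; `sparse_autoencoder.py` contains the actual resampling functionality.

```python
import torch

@torch.no_grad()
def resample_dead_features(sae, fired_recently: torch.Tensor,
                           hard_inputs: torch.Tensor) -> None:
    """Reinitialise dead features (fired_recently is a bool mask over
    hidden features) towards inputs with high reconstruction error.
    A sketch of the idea, not this repo's exact procedure."""
    dead = (~fired_recently).nonzero(as_tuple=True)[0]
    if dead.numel() == 0:
        return
    # Pick one poorly reconstructed input per dead feature.
    picks = hard_inputs[torch.randint(len(hard_inputs), (dead.numel(),))]
    directions = picks / picks.norm(dim=-1, keepdim=True)
    # Point the dead decoder rows at those inputs and reset the encoder.
    sae.W_dec.data[dead] = directions
    sae.W_enc.data[:, dead] = directions.T * 0.2
    sae.b_enc.data[dead] = 0.0
```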