
Analytical Gradient Approach for Sparse Feature Circuits

This repository implements an alternative approach to Sparse Feature Circuits (SFC) based on Analytical Gradient computation. Instead of running Sparse Autoencoders (SAEs) in the forward and backward passes, it computes gradients directly on the residual stream and multiplies them by the SAE decoder matrix to approximate gradients with respect to SAE feature activations, which is faster and more memory-efficient.

Overview

Traditional SFC approaches run Sparse Autoencoders in both the forward and backward passes, which can be computationally expensive. The Analytical Gradient approach requires only the decoder weights of pre-trained SAEs and computes gradients directly on the residual streams. This implementation focuses on subject-verb agreement tasks using the Gemma 2 2B model and pre-trained JumpReLU SAEs from Gemma Scope.
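The core trick can be sketched in a few lines of PyTorch. The module path, hook placement, and function names below are illustrative assumptions, not necessarily how src/model/analytical_gradient.py is organized:

    import torch

    # Minimal sketch of the analytical gradient idea. W_dec is the decoder
    # matrix of a pre-trained SAE with shape [d_sae, d_model] (assumed layout).
    def analytical_feature_grads(model, inputs, layer_idx, W_dec):
        captured = {}

        def backward_hook(module, grad_input, grad_output):
            # grad_output[0] is dL/d(residual stream), shape [batch, seq, d_model]
            captured["grad"] = grad_output[0].detach()

        layer = model.model.layers[layer_idx]  # HuggingFace-style layer list (assumed)
        handle = layer.register_full_backward_hook(backward_hook)
        loss = model(**inputs).loss  # inputs must include labels for a loss
        loss.backward()
        handle.remove()

        # Chain rule through the decoder: with x_hat = f @ W_dec,
        # dL/df ≈ (dL/dx) @ W_dec.T -- no SAE forward or backward pass required.
        return captured["grad"] @ W_dec.T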

Key Components

  1. Analytical Gradient Computation: Calculate gradients on residual streams directly and multiply by decoder matrices
  2. Subject-Verb Agreement Task: Generate and analyze examples testing syntactic processing
  3. Feature Importance Ranking: Identify the most important features in each layer for the task (a ranking sketch follows this list)
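As a rough illustration of step 3, here is a feature-ranking sketch assuming the analytical step produces feature gradients of shape [batch, seq, d_sae]; the actual aggregation in this repo may differ:

    import torch

    # Hypothetical ranking step: score each feature by its mean absolute
    # approximated gradient and keep the top k.
    def rank_features(feature_grads, top_k=20):
        scores = feature_grads.abs().mean(dim=(0, 1))  # aggregate over batch, positions
        values, indices = torch.topk(scores, top_k)
        return list(zip(indices.tolist(), values.tolist()))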

Project Structure

.
├── src/                                # Source code
│   ├── data/                           # Dataset processing
│   │   ├── __init__.py
│   │   └── dataset.py                  # Subject-verb agreement dataset
│   ├── model/                          # Model implementation
│   │   ├── __init__.py
│   │   ├── analytical_gradient.py      # Core analytical gradient implementation
│   │   └── sae_utils.py                # SAE loading utilities
│   ├── utils/                          # Utility functions
│   │   ├── __init__.py
│   │   └── utils.py                    # General utilities
│   ├── __init__.py
│   └── main.py                         # Main entry point
├── notebooks/                          # Jupyter notebooks
│   └── analytical_gradient_analysis.ipynb  # Analysis notebook
├── outputs/                            # Results and outputs (created at runtime)
├── run_quick_test.sh                   # Script for quick testing
├── run_full_analysis.sh                # Script for full analysis
├── requirements.txt                    # Dependencies
└── README.md                           # This file

Installation

  1. Clone this repository:

    git clone https://github.com/Itssshikhar/Analytical-Gradients.git
    cd Analytical-Gradients
  2. Create a virtual environment (recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt

Usage

Quick Test

To verify that the implementation works correctly, run the quick test script, which uses a smaller model (TinyLlama) and fewer examples:

./run_quick_test.sh

This script uses dummy SAE weights and completes quickly, confirming that the pipeline runs end to end.

Full Analysis

To run the full analysis with the Gemma 2 2B model:

./run_full_analysis.sh

Note: The full analysis requires significant GPU memory (>=16GB recommended). If you encounter memory issues, adjust the batch size or use a smaller model.
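If memory is tight, one common option is loading the model in bfloat16, which roughly halves the footprint relative to float32. A hedged sketch (main.py may already expose a dtype or device option of its own):

    import torch
    from transformers import AutoModelForCausalLM

    # Hypothetical memory-saving load; adjust the model name to match your run.
    model = AutoModelForCausalLM.from_pretrained(
        "google/gemma-2b",
        torch_dtype=torch.bfloat16,  # roughly halves memory versus float32
        device_map="auto",           # shard across available devices if needed
    )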

Custom Configuration

You can run the main script directly with custom parameters:

python src/main.py \
  --model_name "google/gemma-2b" \
  --sae_source "path/to/your/sae/weights" \
  --layers "6,7,8,9,10,11,12" \
  --num_examples 200 \
  --output_dir "outputs/custom_run" \
  --batch_size 1

For all available options, run:

python src/main.py --help

Using Pre-trained SAEs

By default, the code uses dummy SAE weights for demonstration purposes. To use real pre-trained SAEs:

  1. Download pre-trained JumpReLU SAEs for Gemma 2 2B from Gemma Scope
  2. Place them in a directory structure like: sae_weights/gemma-2b/layer_{layer_idx}.pt
  3. Run the analysis with the --sae_source parameter pointing to your directory (a loading sketch follows these steps):
    python src/main.py --sae_source "path/to/sae_weights"
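For reference, a minimal loader for the directory layout above might look like the following; the checkpoint key "W_dec" and the helper name are assumptions, and src/model/sae_utils.py may differ:

    import torch

    # Hypothetical loader for the layout described in step 2.
    def load_decoder(sae_source, layer_idx):
        path = f"{sae_source}/gemma-2b/layer_{layer_idx}.pt"
        state = torch.load(path, map_location="cpu")
        return state["W_dec"]  # decoder matrix, assumed shape [d_sae, d_model]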

Analyzing Results

After running the analysis, you can visualize the results with generate_visualizations.py:

python generate_visualizations.py

This provides visualizations and analysis of the following (a plotting sketch appears after the list):

  • Feature importance by layer
  • Comparison between train and test feature rankings
  • Identification of the most important layers for the subject-verb agreement task
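As a rough sketch of the per-layer plot, assuming the analysis wrote a JSON file mapping layer index to summed importance (the filename and schema here are assumptions, not this repo's actual output format):

    import json
    import os
    import matplotlib.pyplot as plt

    # Hypothetical plot; generate_visualizations.py may organize this differently.
    with open("outputs/full_run/feature_importance.json") as f:
        importance = json.load(f)  # assumed: {"layer_idx": summed importance, ...}

    layers = sorted(int(k) for k in importance)
    os.makedirs("outputs/visualizations", exist_ok=True)
    plt.bar(layers, [importance[str(l)] for l in layers])
    plt.xlabel("Layer")
    plt.ylabel("Total feature importance")
    plt.savefig("outputs/visualizations/importance_by_layer.png")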

Key Findings

When run on Gemma 2 2B with subject-verb agreement tasks, this approach:

  1. Is faster and more memory-efficient than traditional SFC approaches
  2. Successfully identifies the same important features as traditional approaches
  3. Shows that middle layers (particularly 8-10) are most relevant for syntactic tasks like subject-verb agreement

Example visualizations can be viewed under outputs/visualizations.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
