In this repository, I replicate Anthropic's mechanistic interpretability paper Toy Models of Superposition.
Below is an image showing how the encoded features of a simple autoencoder become less and less orthogonal as the model is made to represent more (sparse!) features than it has hidden dimensions.
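As a rough sketch of the kind of toy model involved (assuming PyTorch; `ToyModel`, `sample_batch`, `train`, and the hyperparameters below are illustrative placeholders, not the repo's actual code): a tied linear map compresses `n_features` sparse features into `n_hidden < n_features` dimensions, and reconstruction is scored with an importance-weighted squared error.

```python
import torch

class ToyModel(torch.nn.Module):
    """Tied-weight toy autoencoder: n_features squeezed into n_hidden dimensions."""
    def __init__(self, n_features: int, n_hidden: int):
        super().__init__()
        self.W = torch.nn.Parameter(torch.nn.init.xavier_normal_(
            torch.empty(n_hidden, n_features)))
        self.b = torch.nn.Parameter(torch.zeros(n_features))

    def forward(self, x):
        h = x @ self.W.T                          # compress features into the hidden space
        return torch.relu(h @ self.W + self.b)    # reconstruct with the tied weights

def sample_batch(batch_size, n_features, sparsity):
    """Each feature is zero with probability `sparsity`, else uniform in [0, 1]."""
    x = torch.rand(batch_size, n_features)
    keep = torch.rand(batch_size, n_features) >= sparsity
    return x * keep

def train(model, importance, sparsity, steps=5_000, batch_size=1024, lr=1e-3):
    """Importance-weighted MSE: dropping an 'important' feature costs more."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        x = sample_batch(batch_size, len(importance), sparsity)
        loss = (importance * (model(x) - x) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```

Plotting the columns of `W` after training at increasing sparsity is what produces the picture above: at low sparsity only the most important directions survive, while at high sparsity the columns pack together non-orthogonally.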

Additionally, I replicate the phase change seen in a model that represents 2 features with a single neuron. As the first feature's "importance" relative to the second and the features' sparsity are varied, the model switches sharply between dedicating its one dimension to the more important feature and representing both features in superposition.
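A sketch of how that phase diagram could be swept, reusing the hypothetical `ToyModel` and `train` helpers above (the grid values and the 0.1 threshold are arbitrary choices, not the repo's):

```python
import numpy as np
import torch

# Sweep the relative importance of feature 1 against sparsity for a 2-feature,
# 1-neuron model and record how many features receive a non-negligible weight.
importances = np.geomspace(0.1, 10.0, 10)   # importance of feature 1 (feature 2 fixed at 1)
sparsities = np.linspace(0.0, 0.95, 10)

phase = np.zeros((len(importances), len(sparsities)))
for i, imp in enumerate(importances):
    for j, s in enumerate(sparsities):
        model = train(ToyModel(n_features=2, n_hidden=1),
                      importance=torch.tensor([imp, 1.0]), sparsity=s)
        w = model.W.detach().squeeze()               # one scalar weight per feature
        phase[i, j] = int((w.abs() > 0.1).sum())     # 1 = dedicated neuron, 2 = superposition
```

Imaging `phase` as a heatmap shows the sharp boundary between the two regimes rather than a gradual transition, which is what makes it a phase change.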
