
TransformerLens Dependency, Hidden States, and Hooks #24

@tretomaszewski

Description


There has been some chatter about this library's heavy reliance on TransformerLens. (See #10 (comment))

I agree that it would be best to move away from TransformerLens (TL). One suggestion has been to use the technique from https://github.com/Sumandora/remove-refusals-with-transformers, which relies on the hidden_states output exposed by the Hugging Face transformers library.

Using hidden_states might be a good quick fix. It is supported across most model architectures and presumably works for these purposes.
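
Roughly what that looks like (a minimal sketch, not taken from the linked repo; "gpt2" is just a placeholder model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM should behave similarly
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Hello world", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# outputs.hidden_states is a tuple of length n_layers + 1: entry i is the
# residual stream entering block i (entry 0 = embeddings). Note that, at
# least for GPT-2/Llama-style implementations, the final entry is taken
# *after* the final layer norm.
print(len(outputs.hidden_states), outputs.hidden_states[0].shape)
```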

However, there are some quirks worth being aware of.
(Personally, it feels slightly "hacky" and limiting for future work, but I may be suffering from "purism".) It also removes the ability to interact with the resid_mid between the attention and MLP sublayers in each block, which TL exposes directly (see the snippet below).
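
For reference, that mid-block residual is a first-class hook point in TL (model name is just an example):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # example model
_, cache = model.run_with_cache("Hello world")

# Residual stream between the attention and MLP sublayers of block 0,
# i.e. the "blocks.0.hook_resid_mid" activation.
resid_mid = cache["resid_mid", 0]
print(resid_mid.shape)
```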

There seem to be slight discrepancies between the resid_pre/resid_post cache tensors in TL and the hidden_states in HF. The difference is tiny, but it appears to grow over layers. And hidden_states[-1] (the last_hidden_state) is completely different from TL's last layer's resid_post. I have not been able to determine why.
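
Two hedged guesses that might account for part of this: TL's default from_pretrained applies weight processing (LayerNorm folding, centering of writing weights) that slightly perturbs residual-stream values, and HF's final hidden_states entry is taken after the final layer norm rather than being the raw last-block resid_post. A rough comparison sketch (model name and the no-processing loader are assumptions on my part):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformer_lens import HookedTransformer

text = "Hello world"

# Tokenize once with the HF tokenizer so both models see identical tokens
# (TL's to_tokens prepends a BOS token by default, which would skew things).
hf_tok = AutoTokenizer.from_pretrained("gpt2")
enc = hf_tok(text, return_tensors="pt")

# TransformerLens side. from_pretrained_no_processing skips TL's weight
# folding / centering, which otherwise alters residual-stream values.
tl_model = HookedTransformer.from_pretrained_no_processing("gpt2")
_, cache = tl_model.run_with_cache(enc.input_ids)

# Hugging Face side.
hf_model = AutoModelForCausalLM.from_pretrained("gpt2")
with torch.no_grad():
    hf_out = hf_model(**enc, output_hidden_states=True)

# hidden_states[i] should line up with resid_pre of block i; the final
# hidden_states entry is post-ln_f, so it is *not* directly comparable to
# the last resid_post without applying the final norm.
for i in range(tl_model.cfg.n_layers):
    diff = (cache["resid_pre", i] - hf_out.hidden_states[i]).abs().max().item()
    print(f"layer {i}: max |resid_pre - hidden_states[{i}]| = {diff:.2e}")
```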

If this proves problematic, another option would be to roll our own hooks into the residual stream. I had started on this prior to finding this library. It would be the most direct approach (and would give us access to resid_mid as well), but would probably require re-implementing the per-model configuration that TransformerLens already provides.
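
A minimal sketch of that with plain PyTorch forward hooks, assuming a GPT-2-style block layout where ln_2 (the pre-MLP layer norm) receives the post-attention residual; the module paths are architecture-specific, which is exactly the per-model configuration burden mentioned above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")  # example model
tok = AutoTokenizer.from_pretrained("gpt2")

acts = {}  # {(layer, "resid_pre" | "resid_mid" | "resid_post"): tensor}

def save_input(name):
    def pre_hook(module, args):
        # forward pre-hook: args[0] is the tensor flowing *into* the module
        acts[name] = args[0].detach()
    return pre_hook

def save_output(name):
    def hook(module, args, output):
        # block outputs may be tuples; the residual stream is the first element
        out = output[0] if isinstance(output, tuple) else output
        acts[name] = out.detach()
    return hook

handles = []
for i, block in enumerate(model.transformer.h):  # GPT-2-specific module path
    handles.append(block.register_forward_pre_hook(save_input((i, "resid_pre"))))
    # ln_2's input is the residual stream between attention and MLP
    handles.append(block.ln_2.register_forward_pre_hook(save_input((i, "resid_mid"))))
    handles.append(block.register_forward_hook(save_output((i, "resid_post"))))

with torch.no_grad():
    model(**tok("Hello world", return_tensors="pt"))

for h in handles:
    h.remove()

print(acts[(0, "resid_pre")].shape, acts[(0, "resid_mid")].shape)
```

For intervention (as opposed to just caching), the same hooks can return a modified tensor instead of only recording it.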

I hereby open this up for discussion!
