Hi,
Just found out about heretic.
Some notable takeaways from the readme:
> For each supported transformer component (currently, attention out-projection and MLP down-projection), it identifies the associated matrices in each transformer layer, and orthogonalizes them with respect to the relevant "refusal direction", inhibiting the expression of that direction in the result of multiplications with that matrix.
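That orthogonalization step is a rank-1 projection. If I'm reading the description right, it amounts to something like this minimal PyTorch sketch (my interpretation, not heretic's actual code; names are illustrative):

```python
import torch

def orthogonalize(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the refusal-direction component from a matrix's outputs.

    `weight` follows the torch.nn.Linear convention (out_features, in_features),
    with out_features = d_model for out-projections and down-projections,
    so W' = (I - r r^T) W zeroes the component of every output along r.
    """
    r = direction / direction.norm()              # unit refusal direction, (d_model,)
    return weight - torch.outer(r, r) @ weight    # subtract the rank-1 part along r

# Toy usage:
W = torch.randn(4096, 4096)       # e.g. an attention out-projection
r = torch.randn(4096)
W_abl = orthogonalize(W, r)
print((W_abl.T @ (r / r.norm())).abs().max())  # ~0: outputs no longer point along r
```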
> Refusal directions are computed for each layer as a difference-of-means between the first-token residuals for "harmful" and "harmless" example prompts.
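The direction itself is just vector arithmetic over cached activations. A sketch, again with hypothetical tensor names:

```python
import torch

def refusal_direction(harmful: torch.Tensor, harmless: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction for one layer.

    Each input has shape (n_prompts, d_model): the residual-stream
    activation at the first generated-token position for each prompt.
    """
    d = harmful.mean(dim=0) - harmless.mean(dim=0)
    return d / d.norm()

# Toy usage with random stand-in activations:
harmful = torch.randn(128, 4096)
harmless = torch.randn(128, 4096)
r = refusal_direction(harmful, harmless)  # unit vector, shape (4096,)
```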
> Ablation parameters are chosen separately for each component. I have found that MLP interventions tend to be more damaging to the model than attention interventions, so using different ablation weights can squeeze out some extra performance.
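Since the weights differ per component, applying them presumably looks something like the loop below. The alpha values and the Llama-style module names are assumptions for illustration; heretic searches for its actual parameters automatically.

```python
import torch
from transformers import AutoModelForCausalLM

def ablate(weight: torch.Tensor, direction: torch.Tensor, alpha: float) -> torch.Tensor:
    """Partially project `direction` out of a matrix's outputs; alpha=1 is full orthogonalization."""
    r = direction / direction.norm()
    return weight - alpha * torch.outer(r, r) @ weight

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # any Llama-style model

# Stand-in directions for illustration; in practice use the per-layer
# difference-of-means directions computed above.
dirs = [torch.randn(model.config.hidden_size) for _ in model.model.layers]

# Illustrative values only: gentler on the MLP, per the quote above.
ALPHA_ATTN, ALPHA_MLP = 1.0, 0.6

with torch.no_grad():
    for layer, r in zip(model.model.layers, dirs):
        layer.self_attn.o_proj.weight.copy_(ablate(layer.self_attn.o_proj.weight, r, ALPHA_ATTN))
        layer.mlp.down_proj.weight.copy_(ablate(layer.mlp.down_proj.weight, r, ALPHA_MLP))
```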
btw @wassname you are credited, congrats :)