Hi,
Just found out about heretic.
Some notable takeaways from the readme:
> For each supported transformer component (currently, attention out-projection and MLP down-projection), it identifies the associated matrices in each transformer layer, and orthogonalizes them with respect to the relevant "refusal direction", inhibiting the expression of that direction in the result of multiplications with that matrix.
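That orthogonalization step is a rank-1 projection. If I'm reading the description right, it amounts to something like this minimal PyTorch sketch (my interpretation, not heretic's actual code; names are illustrative):

```python
import torch

def orthogonalize(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the refusal-direction component from a matrix's outputs.

    `weight` follows the torch.nn.Linear convention (out_features, in_features),
    with out_features = d_model for out-projections and down-projections,
    so W' = (I - r r^T) W zeroes the component of every output along r.
    """
    r = direction / direction.norm()              # unit refusal direction, (d_model,)
    return weight - torch.outer(r, r) @ weight    # subtract the rank-1 part along r

# Toy usage:
W = torch.randn(4096, 4096)       # e.g. an attention out-projection
r = torch.randn(4096)
W_abl = orthogonalize(W, r)
print((W_abl.T @ (r / r.norm())).abs().max())  # ~0: outputs no longer point along r
```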
> Refusal directions are computed for each layer as a difference-of-means between the first-token residuals for "harmful" and "harmless" example prompts.
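The direction itself is just vector arithmetic over cached activations. A sketch, again with hypothetical tensor names:

```python
import torch

def refusal_direction(harmful: torch.Tensor, harmless: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction for one layer.

    Each input has shape (n_prompts, d_model): the residual-stream
    activation at the first generated-token position for each prompt.
    """
    d = harmful.mean(dim=0) - harmless.mean(dim=0)
    return d / d.norm()

# Toy usage with random stand-in activations:
harmful = torch.randn(128, 4096)
harmless = torch.randn(128, 4096)
r = refusal_direction(harmful, harmless)  # unit vector, shape (4096,)
```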
> Ablation parameters are chosen separately for each component. I have found that MLP interventions tend to be more damaging to the model than attention interventions, so using different ablation weights can squeeze out some extra performance.
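Since the weights differ per component, applying them presumably looks something like the loop below. The alpha values and the Llama-style module names are assumptions for illustration; heretic searches for its actual parameters automatically.

```python
import torch
from transformers import AutoModelForCausalLM

def ablate(weight: torch.Tensor, direction: torch.Tensor, alpha: float) -> torch.Tensor:
    """Partially project `direction` out of a matrix's outputs; alpha=1 is full orthogonalization."""
    r = direction / direction.norm()
    return weight - alpha * torch.outer(r, r) @ weight

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # any Llama-style model

# Stand-in directions for illustration; in practice use the per-layer
# difference-of-means directions computed above.
dirs = [torch.randn(model.config.hidden_size) for _ in model.model.layers]

# Illustrative values only: gentler on the MLP, per the quote above.
ALPHA_ATTN, ALPHA_MLP = 1.0, 0.6

with torch.no_grad():
    for layer, r in zip(model.model.layers, dirs):
        layer.self_attn.o_proj.weight.copy_(ablate(layer.self_attn.o_proj.weight, r, ALPHA_ATTN))
        layer.mlp.down_proj.weight.copy_(ablate(layer.mlp.down_proj.weight, r, ALPHA_MLP))
```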
btw @wassname you are credited, congrats :)