
FYI: approach of heretic #75

@thiswillbeyourgithub

Description


Hi,

Just found out about heretic.

Some notable takeaways from the readme:

For each supported transformer component (currently, attention out-projection and MLP down-projection), it identifies the associated matrices in each transformer layer, and orthogonalizes them with respect to the relevant "refusal direction", inhibiting the expression of that direction in the result of multiplications with that matrix.
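For anyone skimming, here is a minimal sketch of what that orthogonalization step might look like (my own illustration, not code from the heretic repo; the weight-shape convention and names are assumptions):

```python
import torch

def orthogonalize(weight: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Remove the refusal direction from a projection's output space.

    Assumes `weight` has shape (d_model, d_in), as in nn.Linear, so that
    output rows live in the residual stream where the refusal direction
    is defined.
    """
    r = refusal_dir / refusal_dir.norm()  # unit refusal direction, shape (d_model,)
    # W' = (I - r r^T) W : any component of the output along r is zeroed out,
    # so multiplications with W' can no longer express that direction.
    return weight - torch.outer(r, r) @ weight
```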

Refusal directions are computed for each layer as a difference-of-means between the first-token residuals for "harmful" and "harmless" example prompts.
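i.e. something roughly like this (again my own sketch, assuming the per-layer first-token residuals have already been collected into tensors):

```python
import torch

def refusal_direction(harmful_resid: torch.Tensor,
                      harmless_resid: torch.Tensor) -> torch.Tensor:
    """Difference-of-means refusal direction for one layer.

    Both inputs have shape (n_prompts, d_model): the residual-stream
    activations at the first token position, collected at this layer for
    "harmful" and "harmless" example prompts respectively.
    """
    direction = harmful_resid.mean(dim=0) - harmless_resid.mean(dim=0)
    return direction / direction.norm()  # one unit vector per layer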

Ablation parameters are chosen separately for each component. I have found that MLP interventions tend to be more damaging to the model than attention interventions, so using different ablation weights can squeeze out some extra performance.
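Which presumably amounts to scaling the projection term per component, something like the following (parameter names and values are mine, purely illustrative, not heretic's):

```python
import torch

# Hypothetical per-component ablation weights: a weaker intervention on MLP
# down-projections, since those reportedly damage the model more than
# attention out-projection interventions do.
ABLATION_WEIGHTS = {"attn_out": 1.0, "mlp_down": 0.7}

def ablate(weight: torch.Tensor, refusal_dir: torch.Tensor, alpha: float) -> torch.Tensor:
    """Partial ablation: W' = W - alpha * r r^T W (alpha=1 is full orthogonalization)."""
    r = refusal_dir / refusal_dir.norm()
    return weight - alpha * torch.outer(r, r) @ weight
```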

btw @wassname you are credited, congrats :)
