Description
In https://huggingface.co/failspy/Meta-Llama-3-70B-Instruct-abliterated-v3.5 you mention a new methodology, but what changed that made it so much more effective? I've been trying to reproduce it for a while (originally with Llama 3 and now with 3.1, both 8B and 70B). With Llama 3.1 70B I have to edit layers 10 through 40, and the effect weakens as I narrow the range further.
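For reference, by "edit" I mean the usual directional ablation applied with forward hooks; here's a simplified sketch of what I'm running (assuming `refusal_dir` is a unit-norm residual-stream direction I've computed separately, on the same device and dtype as the model):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-70B-Instruct", torch_dtype=torch.bfloat16
)

# refusal_dir: unit-norm direction computed elsewhere from harmful/harmless activations
def make_ablation_hook(direction: torch.Tensor):
    """Project the refusal direction out of the layer's residual-stream output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        proj = (hidden @ direction).unsqueeze(-1) * direction  # component along direction
        hidden = hidden - proj
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# On 70B I currently need the whole 10-40 range to get a decent effect
for idx in range(10, 41):
    model.model.layers[idx].register_forward_hook(make_ablation_hook(refusal_dir))
```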
The only way I've been able to get a decent effect from just a single layer is by multiplying the direction by about 1.5 after normalization. You mentioned somewhere that you did something that sounds similar. On Llama 3.1 8B I can get a good result by scaling the direction by 1.5 and applying it just to layer 11. But that only worked for me when hooking activations; I wasn't able to figure out how to bake it into the weight matrices (just scaling the direction when orthogonalizing didn't work). I haven't tried it with the 70B.
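Concretely, the single-layer version that works for me is the hook below; the second half is my attempt to bake the same scaled edit into the weights by orthogonalizing the matrices that write into the residual stream at that layer. This is a sketch of what I tried, not a claim about your method; `model`, `refusal_dir`, and the 1.5 factor are as in the snippet above, and `orthogonalize_` is just a helper name I made up:

```python
import torch

SCALE = 1.5  # the extra factor I apply after normalizing the direction

def make_scaled_ablation_hook(direction: torch.Tensor, scale: float):
    """Subtract `scale` times the projection onto `direction` from the residual stream."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        proj = (hidden @ direction).unsqueeze(-1) * direction
        hidden = hidden - scale * proj
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# This works on Llama 3.1 8B when applied to layer 11 only
model.model.layers[11].register_forward_hook(
    make_scaled_ablation_hook(refusal_dir, SCALE)
)

# This is the part that doesn't work for me: removing (scale x) the direction
# from the weights that write into the residual stream at that layer.
def orthogonalize_(weight: torch.Tensor, direction: torch.Tensor, scale: float):
    """In-place: W <- W - scale * d (d^T W), removing the layer output's component along d."""
    direction = direction.to(weight.dtype)
    weight -= scale * torch.outer(direction, direction @ weight)

layer = model.model.layers[11]
orthogonalize_(layer.self_attn.o_proj.weight.data, refusal_dir, SCALE)
orthogonalize_(layer.mlp.down_proj.weight.data, refusal_dir, SCALE)
```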
Was I accidentally on the right track with scaling the direction, or was there something else? Nothing else I've tried (layer selection, sampling different tokens, varying and mixing training sets) has worked with fewer than about 7 layers on the 8B or about 30 layers on the 70B.