Description
I just read this recently released post about the idea of steering vectors.
What?
The idea of a steering vector is to add some precomputed data to your inference to "steer" the model in a certain direction, i.e., give it a certain "mood" or "style". For example, you can add the steering vector "Love" to make your LLM give more loving output.
Some more detail
In short, a steering vector is a snapshot of the model's activations for a prompt at a certain layer. So, for example, if you prompt "I like dogs", you can obtain a steering vector by storing the hidden state of the network at a layer of your choosing, say layer 2 or 10.
With a steering vector, you can change the "direction" of a prompt. By adding a steering vector to the activations of later prompts, you make the model much more likely to output things related to the steering vector, e.g., a love of dogs.
If you prompted "the animal I like most is...", you could get various answers. Dogs would be likely, but so would cats, birds, or other common household pets. When you add the steering vector, the model is almost guaranteed to output that it loves dogs.
Its effect is similar to, but not equivalent to, adding additional token context to the prompt directly.
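The capture-then-inject mechanics described above can be sketched with a toy model. This is a minimal illustration, not the post's actual implementation: the "model" is a random stack of layers and the "prompts" are random vectors standing in for real embeddings; in practice the snapshot would come from a real LLM's residual stream.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a transformer: a stack of layers, each a matrix
# multiply plus a nonlinearity. Purely illustrative.
LAYERS = [rng.standard_normal((8, 8)) * 0.3 for _ in range(4)]

def forward(x, steer=None, steer_layer=2, record_layer=None):
    """Run the toy model. Optionally record the hidden state after
    `record_layer` (to obtain a steering vector), or add a steering
    vector to the hidden state after `steer_layer` (to apply one)."""
    recorded = None
    h = x
    for i, W in enumerate(LAYERS):
        h = np.tanh(W @ h)
        if record_layer is not None and i == record_layer:
            recorded = h.copy()  # snapshot: this is the steering vector
        if steer is not None and i == steer_layer:
            h = h + steer        # inject the steering vector
    return h, recorded

# 1. Run the "steering prompt" and snapshot its layer-2 activations.
steering_prompt = rng.standard_normal(8)  # stand-in for "I like dogs"
_, steering_vec = forward(steering_prompt, record_layer=2)

# 2. Run a later prompt with the vector added back in at the same layer.
later_prompt = rng.standard_normal(8)  # stand-in for "the animal I like most is..."
plain_out, _ = forward(later_prompt)
steered_out, _ = forward(later_prompt, steer=steering_vec, steer_layer=2)

# The steered run is pulled in a different direction than the plain run.
print(np.allclose(plain_out, steered_out))
```

The key design point is that the vector is recorded at and re-injected into the same layer; the rest of the forward pass is untouched.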
There are a lot more details in the paper:
- You can use (linear) math on the vectors. For example, if you want to make the LLM even more likely to talk about dogs, you can multiply the dog steering vector by a factor greater than 1. You could also multiply it by 0.5 for a gentler push, making it only a bit more likely to talk about dogs.
- Steering vectors work best if you use both addition and subtraction, i.e.: steering vector = "Love" - "Hate".
- Not all steering vectors work as expected; for instance, "love" - "hate" doesn't work very well, whereas "Love" - "Hate" does, so capitalization matters.
Potential applications and research directions
- Get rid of the "As an AI language model" responses in models trained on ChatGPT output.
- Extremely low-cost alternative to fine-tuning
- Steer it in the direction of talking in certain languages or formats (e.g., JSON or French). The authors were unable to use this to get the model to speak French, but it might well be possible.
- Perhaps it is possible to use this to make an LLM follow instructions in langchain prompts more easily, i.e., make Vicuna less likely to talk to you conversation-style and instead give plainly formatted output without conversational fluff.
- Improve performance: instead of embedding "Be nice and helpful" in your prompt, which costs a couple of tokens, you can simply add a steering vector that performs the same task.
- It might act as a save point for prompt templates. For example, if you take the prompt template "You are a helpful chatbot which ... bla bla bla" and use it as a steering vector, you might essentially force the LLM into a state where that part of the prompt has already been processed.
I think there's a lot more here.