Conversation

@thiswillbeyourgithub
Contributor

@thiswillbeyourgithub thiswillbeyourgithub commented Sep 3, 2025

  • minor
  • minor: perf
  • feat: support for chat templates
  • use loguru instead of warnings
  • feat: support for layer zones in addition to layer ids
  • import make_dataset from utils instead of defining it in tests.py
  • test: add a test for make_dataset
  • test: update test values not passing
  • doc: mention how to use chat templates
  • doc: add a link related to OOM in transformers related to gguf

Hi!

This is a polished version of #55 I made a while ago.

It mainly brings 3 features:

  1. The Hugging Face model's chat template is respected when possible, making it easier to try a different model without having to dig into the intricacies of its template.
  2. It is now possible to supply the examples as familiar chat messages (a list of dicts with role and content keys).
  3. It is now possible to specify layer_zones instead of layer_ids. For example, [[0.1, 0.5]] means we control the layers whose relative depth is between 0.1 (included) and 0.5 (excluded); a minimal sketch of the mapping is shown below.
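
A minimal sketch of how a layer zone could map to concrete layer IDs, assuming a model that exposes num_hidden_layers; the function name is illustrative, not this PR's actual API:

```python
def zones_to_layer_ids(layer_zones, num_hidden_layers):
    """Map [[start, end]] relative-depth zones to layer IDs (start included, end excluded)."""
    ids = set()
    for start, end in layer_zones:
        for i in range(num_hidden_layers):
            depth = i / num_hidden_layers
            if start <= depth < end:
                ids.add(i)
    return sorted(ids)

# e.g. zones_to_layer_ids([[0.1, 0.5]], 32) -> [4, 5, ..., 15]
```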

I took the liberty of adding loguru for some logging, but I can scrap that if you prefer.

All the tests are passing, although I had to update some expected values for the LLMs; they remain sound. I also added a test for the make_dataset function.

Hopefully, this will make repeng easier to work with when benchmarking my ideas over several setups and models.

Let me know if you want me to modify anything in this PR.

My plan now is to implement a mechanism to dump the logits to a file so that we can use commodity hardware, add more pairs of examples, and use more advanced pair manipulation (clustering, etc.).

Edit: my ongoing work will take place in that fork, and if I find other mistakes I'll push the fixes here as long as #65 is not merged. I implemented a bunch of features, notably h5py caching of activations, which makes it easier to run on low-end hardware since I don't have to hold all the activations in memory.

If you could find the time to check out the README of the research fork and tell me what you think of the features I added, I'd be very interested :) I intend to upstream all my changes, so I'd like to know what you think of each!

@thiswillbeyourgithub thiswillbeyourgithub mentioned this pull request Sep 3, 2025
@thiswillbeyourgithub thiswillbeyourgithub marked this pull request as draft September 5, 2025 10:48
@thiswillbeyourgithub thiswillbeyourgithub force-pushed the chat-templates-and-layer-zones branch from acde836 to 6e6139c September 5, 2025 10:59
@thiswillbeyourgithub thiswillbeyourgithub marked this pull request as ready for review September 5, 2025 13:32
@thiswillbeyourgithub thiswillbeyourgithub marked this pull request as draft September 8, 2025 10:36
@thiswillbeyourgithub
Contributor Author

I found an edge case.

In the original code of control.py, you get the layer IDs with something like this:

```python
layer_ids = range(-1, -model.config.num_hidden_layers, -1)
nlayers = len(layer_ids)
layer_ids = [i if i >= 0 else nlayers + i for i in layer_ids]
```

I don't understand why you would do this instead of just `layer_ids = list(range(model.config.num_hidden_layers))`, for example.

The edge case arises when using layer_zones that start at 0 (i.e. that include the first layer), because your method excludes the layer with ID 0. Is this intentional?

@thiswillbeyourgithub thiswillbeyourgithub marked this pull request as ready for review September 8, 2025 11:31
@wassname
Contributor

wassname commented Sep 9, 2025

Thanks for these PRs, I've had to make similar changes.

A couple of comments:

  • warnings.warn seems better than logger.warning, as it only fires once (see the sketch below)
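
For context on the "fires once" point, a small self-contained sketch (not repeng code): with Python's default warning filters, a warning raised repeatedly from the same location is only printed the first time.

```python
import warnings

def fallback():
    # the default filters deduplicate by (message, category, module, lineno)
    warnings.warn("falling back to the default chat template", stacklevel=2)

for _ in range(5):
    fallback()  # prints a single warning, not five
```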

To apply the model templates, I just did this:

```python
import torch
from loguru import logger

from repeng import DatasetEntry


def make_dataset(tokenizer, personas, suffixes, max_suffix_length=10, verbose=False):
    # Create dataset entries
    dataset = []
    for suffix in suffixes:
        # each time take a random persona
        r = torch.randint(0, len(personas), (1,)).item()
        positive_persona, negative_persona = personas[r]

        tokens = tokenizer.tokenize(suffix, add_special_tokens=False)[:max_suffix_length]

        # Create multiple training examples with different truncations,
        # using a stride to keep the dataset size down
        for i in range(1, len(tokens), max(1, len(tokens) // 5)):
            for think in [0, 1]:
                # keep only the first i tokens of the suffix
                truncated = tokenizer.convert_tokens_to_string(tokens[:i])
                if think:
                    truncated = "<think>\n" + truncated

                positive_prompt = tokenizer.apply_chat_template(
                    [{"role": "user", "content": f"You're a {positive_persona}."},
                     {"role": "assistant", "content": truncated}],
                    tokenize=False,
                    continue_final_message=True,
                )
                negative_prompt = tokenizer.apply_chat_template(
                    [{"role": "user", "content": f"You're a {negative_persona}."},
                     {"role": "assistant", "content": truncated}],
                    tokenize=False,
                    continue_final_message=True,
                )
                if verbose:
                    logger.info(f"Detokenized: {positive_prompt}")

                dataset.append(
                    DatasetEntry(
                        positive=positive_prompt,
                        negative=negative_prompt,
                    )
                )
    return dataset
```

So importantly, I think you need `continue_final_message`. This matches how vgel did `positive=f"{user_tag} {positive_template} {asst_tag} {suffix}"` in experiments.ipynb.

Thanks for sharing, it's nice to read through and look at a different way to implement it.

@thiswillbeyourgithub
Contributor Author

Hi! Thanks for taking the time.

(note: there is a mistake in your message's formatting)

  1. Interesting point about warnings.warn. I'm putting it back. Thanks.

  2. If continue_final_message is False, the prompt gets an "end of message, now start a new message" marker appended before the LLM is asked to generate. So it depends on the use case. If you leave it as False, you can write the start of the assistant message yourself; for example, starting with Certainly, I will do as you said: will increase compliance rates. (See the sketch after this list.)

  3. Regarding make_dataset, there are tons of edge cases; for example, some models have not set their tokenizer.chat_template, so I addressed this in my research fork. Also, I'm not a fan of the non-deterministic way to make a "dataset". In my fork I added a caching feature that stores the hidden activations for a given dataset, and it works great for quick runs. I think it's easy enough to create a "not tiny" dataset without having to use this kind of truncation technique.
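
To illustrate the point in item 2, a hedged sketch of the two behaviours with a standard Hugging Face tokenizer; the model name and messages are placeholders:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # any chat model
msgs = [
    {"role": "user", "content": "Summarize this."},
    {"role": "assistant", "content": "Certainly, I will do as you said:"},
]

# Keeps the assistant turn open so generation continues the prefilled text.
open_prompt = tok.apply_chat_template(msgs, tokenize=False, continue_final_message=True)

# Closes the conversation and appends a fresh assistant header instead.
closed_prompt = tok.apply_chat_template(msgs[:1], tokenize=False, add_generation_prompt=True)
```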

Can't wait to get vgel's take on this to know what I should improve to get it upstream.

@wassname
Contributor

wassname commented Sep 11, 2025

So it depends on the use case.

Yeah, that makes sense. And I think it's kind of an open question. For example, I've found that the default repeng method doesn't work so well with thinking models. But if you add "You are an honest person Let me think step by step" then it works better.

There are other papers that use pairs of completions, instead of pairs of prompts. It seems to work worse... but it seems more elegant in theory. https://github.com/IBM/activation-steering & https://arxiv.org/abs/2308.10248 . There MUST be a way to get it to work reliably... but I haven't found it yet. The reason I think it's better is that you're not "leading the witness" by telling it that it is a good person, you're just giving it an example of outputs, and it should be much more general and less specific.

I added a caching feature that stores the hidden activations

Same! https://github.com/wassname/activation_store

I'm interested in what else you have noticed and tried, even if it's just anecdotal or tentative?

@wassname
Contributor

wassname commented Sep 11, 2025

These are some of the things I've noticed, and some of my current opinions:

  • If you test an LLM's morality, then do honesty steering and test it again, it gets more amoral: good -> neutral and evil -> neutral. This is hard to reproduce though, so I'm still investigating.
  • If you use suppressed activations and remove activation sinks, the steering is sometimes more reliable: https://github.com/wassname/eliciting_suppressed_knowledge
  • Training > steering > prompting (I think), and I generally think that LoRA training, even with a linear LoRA head, is better than activation steering, because you can use powerful, non-linear backprop. There have been a few papers with similar findings (Turntrout and Neel Nanda iirc). But of course training is slower and needs much more data (data which might not exist), which is why this repo is nice, it just works reliably.
  • The activation steering in this repo seems to work better if it's online. What I mean by this is: if you steer it based on a prompt that it would never say, then it's not so effective, as it's learning something that will never come up. But if you give it a prompt that it would definitely say, then it's effective because the hidden states are representative of a rollout that will actually occur during generation.

@wassname
Contributor

wassname commented Sep 11, 2025

Also, this PR includes two bugfixes which seem valuable:

  • Fixed mask/operator logic in activation modification. Why: ensures the activation is correctly modified only for the intended tokens/layers.
  • Improved normalization logic (self.params.normalize). Why: prevents model drift or instability during training/inference when normalization is needed.

FYI, I've forked and made a bunch of similar changes to this repo (work with more models, and fix the position_ids bug), but I haven't had time to tidy them up into PRs. I thought I would mention them here in case you or vgel revisit this and want to browse the changes in #67, which contains the following bugfix:

Improved handling of the shape of position_ids (a rough sketch is below):

  • Stores modified.shape as target_shape.
  • If pos.shape[0] != target_shape[0], repeats pos to match the batch size.
  • Adjusts col_indices to repeat for the batch if needed.
  • Adds comments explaining that position_ids can sometimes be a batch of 1 (singleton) or have a batch dimension.
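
A hedged sketch of that shape handling; the variable names follow the bullet points, and the surrounding hook code is assumed rather than copied from #67:

```python
import torch

def align_position_ids(modified: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
    """Broadcast position_ids to the batch size of the modified hidden states."""
    target_shape = modified.shape          # (batch, seq_len, hidden)
    if pos.dim() == 1:                     # singleton case: (seq_len,)
        pos = pos.unsqueeze(0)
    if pos.shape[0] != target_shape[0]:    # batch of 1 -> repeat per sample
        pos = pos.repeat(target_shape[0], 1)
    return pos
```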

@thiswillbeyourgithub
Contributor Author

There are other papers that use pairs of completions, instead of pairs of prompts. It seems to work worse... but it seems more elegant in theory. https://github.com/IBM/activation-steering & https://arxiv.org/abs/2308.10248 . There MUST be a way to get it to work reliably... but I haven't found it yet. The reason I think it's better is that you're not "leading the witness" by telling it that it is a good person, you're just giving it an example of outputs, and it should be much more general and less specific.

Maybe a way would be:

  1. use a system prompt telling the model to speak like it's on shrooms, to generate a dialogue
  2. use this generated dialogue WITHOUT the system prompt as a new prompt with continue_final_message

Do you think it would work? I think my testing rig could maybe be useful to settle this.

I'm interested in what else you have noticed and tried, even if it's just anecdotal or tentative?

The answer to this is in the README of my fork: here. But I haven't finished what I want to try. I want to re-run the whole grid search on multiple models, but I'm testing one last feature before that: do you have insights into how I should rescale the directions? What I mean is that when I extract the direction, I think it makes sense to try to make its magnitude match the activations in that layer, because I have a feeling that UMAP tends to work but needs more strength than other methods, which I find odd. So I added a rescaling argument but struggle to get decent results so far; a rough sketch of the idea is below. Do you have any thoughts about this?
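
For concreteness, a hedged sketch of that rescaling idea; the names are hypothetical, with layer_hiddens standing for the hidden states gathered for a given layer:

```python
import numpy as np

def rescale_direction(direction: np.ndarray, layer_hiddens: np.ndarray, strength: float = 1.0) -> np.ndarray:
    """Scale a direction so its norm matches the mean activation norm of the layer."""
    mean_act_norm = np.linalg.norm(layer_hiddens, axis=-1).mean()
    unit = direction / np.linalg.norm(direction)
    return strength * mean_act_norm * unit
```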

If you use suppressed activations and remove activation sinks, the steering is sometimes more reliable: https://github.com/wassname/eliciting_suppressed_knowledge

Interesting. I took a look, and I don't find it obvious how to implement this in repeng, so with my limited time I'll pass. But if you make a PR I'm super interested.

But of course training is slower and needs much more data (data which might not exist), which is why this repo is nice, it just works reliably.

My main problem is that I have just a 1060 and a 1080, so training is absolutely out of the question. I'm budgeting for a 3090 though, because I can't go on like this.

The activation steering in this repo seems to work better if it's online. What I mean by this is: if you steer it based on a prompt that it would never say, then it's not so effective, as it's learning something that will never come up. But if you give it a prompt that it would definitely say, then it's effective because the hidden states are representative of a rollout that will actually occur during generation.

But in practice, how can you find something it would say? Or do you mean just trying to match the vibe or something?

@wassname
Contributor

wassname commented Sep 12, 2025

Maybe a way would be:

    use a system prompt telling the model to speak like it's on shrooms, to generate a dialogue
    use this generated dialogue WITHOUT the system prompt as a new prompt with continue_final_message

Do you think it would work? I think my testing rig could maybe be useful to settle this.

Hmm, I think it could definitely work for training a rank-1 LoRA (an alternative to activation steering that people are starting to use; it's very powerful), because you are essentially training it to not need the system prompt.

For activation steering, I'm not sure. Instead of training it to perform new behavior, you are trying to find meaningful representations of existing behavior and change them. Once you remove the system prompt, you might go from realistic to unrealistic behavior for the model. It's hard to predict, but worth trying imo.

Perhaps the pairs could be (without sys prompt, same completion with sys prompt); that way you are moving the hidden states towards the same behaviour without a sys prompt (perhaps this is what you meant all along). A hedged sketch of that pairing is below.
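
A sketch under assumptions: it uses repeng's DatasetEntry and a chat-template-aware tokenizer, and which side counts as positive is illustrative rather than prescribed:

```python
from repeng import DatasetEntry

def make_sysprompt_pair(tokenizer, sys_prompt, user_msg, completion):
    with_sys = tokenizer.apply_chat_template(
        [{"role": "system", "content": sys_prompt},
         {"role": "user", "content": user_msg},
         {"role": "assistant", "content": completion}],
        tokenize=False, continue_final_message=True,
    )
    without_sys = tokenizer.apply_chat_template(
        [{"role": "user", "content": user_msg},
         {"role": "assistant", "content": completion}],
        tokenize=False, continue_final_message=True,
    )
    # Same completion on both sides; only the system prompt differs, so the
    # extracted direction should capture the behaviour the sys prompt induces.
    return DatasetEntry(positive=with_sys, negative=without_sys)
```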

But in practice, how can you find something it would say? Or do you mean just trying to match the vibe or something?

Yeah, either look at the avg_logprob of a sequence (cheap), or have it generate its own completion (expensive).

I am GPU poor

Indeed, I am also GPU poooor single tear. I got a 3090 TI when Ethereum mining stopped, but it's not enough. It's never enough.

@thiswillbeyourgithub
Contributor Author

thiswillbeyourgithub commented Sep 12, 2025

(perhaps this is what you meant all along).

Yes indeed.

look at the avg_logprob of a sequence (cheap),

My intuition fails me. As you can see from my GitHub README, I'm self-taught, so would you by any chance have compact (time-wise) online resources for me to grok this? My take is that the logprobs here are the last logit in the network, so taking the average gives us a measure of how surprised the model was by the words. So I guess if you generate multiple times you can favor the examples with the highest logprob, but otherwise you can't use the avg_logprob "signal" to guide your generation, right?

I got a 3090 TI

That's what I'm planning on buying in the coming months.

edit: I'll change my grid_search.py to also store the avg_logprob and investigate a bit. Thanks!

@wassname
Contributor

online resources for me to grok this

Hmm it's been so long since I learned this stuff. Back in the day I read the Deep Learning Book (https://www.deeplearningbook.org/contents/prob.html) and did the cs231n course... but these days an LLM might be able to give you information that's more tailored to your background.

My take is that the logprobs here are the last logit in the network, so taking the average gives us a measure of how surprised the model was by the words.

Well I'd say you get it, as this is pretty much right, imo.

Yeah, logprobs are often taken from the last token. But in this case, we are taking the average logprob of each token in the whole input sequence (e.g. the persona + suffix). So we are trying to measure, for an input sentence that the model reads, how likely it would be to generate it on its own. If it's something it would never say, is it really worth "guiding" it not to do it? (A rough sketch of computing this is below.)
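
A hedged sketch of computing that average log-probability with a standard Hugging Face causal LM; the model and tokenizer are placeholders:

```python
import torch

@torch.no_grad()
def avg_logprob(model, tokenizer, text: str) -> float:
    """Mean log-probability the model assigns to each token of `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    logits = model(ids).logits[:, :-1]                  # token t predicts t+1
    logprobs = logits.log_softmax(dim=-1)
    token_lp = logprobs.gather(-1, ids[:, 1:, None]).squeeze(-1)
    return token_lp.mean().item()
```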

That's what I'm planning on buying in the coming months.

It's pretty good to be able to run 8B models; the ones smaller than that kind of suck.

@vgel
Owner

vgel commented Sep 23, 2025

This looks good - I've been wanting to add something like layer_zones. Will pick at least that off and add you as co-author for 0.5. For chat templates I'll need to take a look at what you did closer - I'm leery of adding too much experiment-specific logic, but I guess having a "default" make_dataset in a utils file isn't bad as long as users can avoid it for custom things.

@thiswillbeyourgithub
Contributor Author

thiswillbeyourgithub commented Sep 24, 2025

This looks good - I've been wanting to add something like layer_zones. Will pick at least that off and add you as co-author for 0.5. For chat templates I'll need to take a look at what you did closer - I'm leery of adding too much experiment-specific logic, but I guess having a "default" make_dataset in a utils file isn't bad as long as users can avoid it for custom things.

Thanks a lot. I really appreciate your thoughtfulness regarding crediting.

Regarding chat templates: it was such a pain to handle model-specific quirks around templates that I ended up convinced it makes sense to include this somewhere in repeng. A middle ground could be to put it in another file like extra_utils.py or extras.py, so that users can use it if they want but remain aware that it's expected to be reworked if they only care about specific models?

Re-reading the code: your train expects the dataset to be a list of DatasetEntry holding strings. I see that I'm using autocorrect_chat_templates inside read_representation, which is called by train. This was because I thought it was more logical for DatasetEntry to be able to hold chat messages as well as strings. Hence, and because train expected the tokenizer anyway, I convert from chat messages to strings via autocorrect_chat_templates directly inside train. An alternative could be to force DatasetEntry to only hold strings, so that the conversion has to happen outside of train, and autocorrect_chat_templates would appear more strikingly as "outside of the repeng architecture". (A rough sketch of the first option is below.)
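
For reference, a hedged sketch of the first option, i.e. a DatasetEntry that can hold either a rendered string or a list of chat messages; this is an illustration of the design being discussed, not the PR's actual definition:

```python
import dataclasses

Message = dict[str, str]  # {"role": ..., "content": ...}

@dataclasses.dataclass
class DatasetEntry:
    positive: str | list[Message]
    negative: str | list[Message]
```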

Edit: of course, let me know if I can be of help.
