Conversation

@xantheocracy (Collaborator)

No description provided.

@xantheocracy (Collaborator, Author) left a comment

in progress

@@ -0,0 +1,3 @@
name: kl_from_base_penalty
kl_from_base_coef: 0.001
@xantheocracy (Collaborator, Author)

this coefficient is completely arbitrary

I have no intuition on how to set it in a sensible way; I suppose this will require a hyperparameter sweep

@aristizabal95 (Owner)

If you can add that as a comment, just so we know directly from the code that it is arbitrary, that would be great!
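
A possible version with the requested note (a sketch; only the two keys come from the diff, the wording of the comment is a suggestion):

```yaml
name: kl_from_base_penalty
# NOTE: this coefficient is completely arbitrary; there is no principled value yet,
# so expect it to be revisited via a hyperparameter sweep.
kl_from_base_coef: 0.001
```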

@xantheocracy marked this pull request as draft on December 2, 2025 11:16
@xantheocracy marked this pull request as ready for review on December 4, 2025 16:08
@xantheocracy (Collaborator, Author)

  • We previously referred to the instruction-tuned model as base_model throughout the codebase, which was a little inaccurate, since "base model" generally refers to a model with no post-training (as I understand it).
  • Now that the codebase also references the actual base model (for computing the KL term), keeping the old name seemed too confusing, so the code now uses it_model for what we previously called base_model.
  • The it_model config references its base model as base_hf_path (see the sketch below).

There is a non-trivial chance this has introduced bugs, sorryyyyyyy <3, but I think it would have been insanely confusing to keep the old naming convention.
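
For illustration, a minimal sketch of how the it_model config could point at its base model; every key and model name other than base_hf_path is an assumption invented for this example:

```yaml
it_model:
  hf_path: Qwen/Qwen2.5-7B-Instruct   # hypothetical instruction-tuned checkpoint
  base_hf_path: Qwen/Qwen2.5-7B       # pre-trained base model used for the KL-from-base penalty
```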

@xantheocracy (Collaborator, Author) left a comment

seems reasonable‽

@stefan319 (Contributor) left a comment

Calculation of the KL divergence looks good! I have a question about whether it should be used in the loss or only for logging.


num_valid = valid_mask.sum().clamp(min=1.0)
kl_divergence = kl_per_token.sum() / num_valid
loss = loss + self.kl_from_base_coef * kl_divergence
@stefan319 (Contributor)

My understanding of this ablation is that we wish to decompose the KL divergence as $D_{\mathrm{KL}}(\pi \,\|\, \pi_{\text{ref}}) = -H(\pi) + \text{CrossEntropy}(\pi, \pi_{\text{ref}})$. Then, by calculating the entropy and the KL divergence, we can understand the role of cross entropy in the PRG. I believe this means we only need to track the KL divergence rather than add it to the loss?
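
For reference, a minimal PyTorch sketch of that decomposition for logging only; the tensor names (logits, ref_logits, valid_mask) are assumptions, not the PR's actual variables:

```python
import torch
import torch.nn.functional as F

def kl_decomposition(logits, ref_logits, valid_mask):
    """Decompose D_KL(pi || pi_ref) = -H(pi) + CE(pi, pi_ref) per token.

    logits, ref_logits: (batch, seq, vocab); valid_mask: (batch, seq) of 0/1.
    Returns means over valid tokens, intended for logging rather than the loss.
    """
    logp = F.log_softmax(logits, dim=-1)          # log pi
    ref_logp = F.log_softmax(ref_logits, dim=-1)  # log pi_ref
    p = logp.exp()

    entropy = -(p * logp).sum(-1)                 # H(pi) per token
    cross_entropy = -(p * ref_logp).sum(-1)       # CE(pi, pi_ref) per token
    kl = cross_entropy - entropy                  # D_KL = -H + CE

    num_valid = valid_mask.sum().clamp(min=1.0)
    reduce = lambda t: (t * valid_mask).sum() / num_valid
    return {
        "entropy": reduce(entropy),
        "cross_entropy": reduce(cross_entropy),
        "kl_from_base": reduce(kl),
    }
```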

@aristizabal95 (Owner)

I believe this is separate from the game. IIRC this is a mitigation strategy: we presume the pre-trained model, since it hasn't gone through any post-training or RLHF, doesn't have any malicious objectives that could be learned in post-training. Therefore, regularizing the model during post-training towards staying close to the pre-trained distribution could mitigate side objectives. We should check with Jacob though!
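
For context, a minimal sketch of the mitigation as described here: keep a frozen copy of the pre-trained model, compute per-token KL between the policy and the base distributions, and add the mean over valid tokens to the loss scaled by kl_from_base_coef. Apart from the names quoted in the diff (loss, kl_per_token, valid_mask, num_valid, kl_from_base_coef), everything below is an assumption, not the PR's actual implementation:

```python
import torch
import torch.nn.functional as F

def add_kl_from_base_penalty(loss, policy_logits, base_model, input_ids,
                             attention_mask, valid_mask, kl_from_base_coef):
    """Regularize the policy towards the frozen pre-trained (base) model."""
    with torch.no_grad():  # the base model is frozen; no gradients through it
        base_logits = base_model(input_ids=input_ids,
                                 attention_mask=attention_mask).logits

    logp = F.log_softmax(policy_logits, dim=-1)
    base_logp = F.log_softmax(base_logits, dim=-1)
    # Per-token D_KL(pi || pi_base) = sum_v pi(v) * (log pi(v) - log pi_base(v))
    kl_per_token = (logp.exp() * (logp - base_logp)).sum(-1) * valid_mask

    num_valid = valid_mask.sum().clamp(min=1.0)
    kl_divergence = kl_per_token.sum() / num_valid
    return loss + kl_from_base_coef * kl_divergence, kl_divergence
```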

@aristizabal95 (Owner) left a comment

Looks good to me!

return model_organism, tokenizer
finally:
# Clean up temp directory to avoid storage leaks
shutil.rmtree(temp_dir, ignore_errors=True)
@aristizabal95 (Owner)

Nice catch!
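
The pattern being praised, as a generic sketch (the function and its parameters are illustrative, not the PR's code; only the try/finally cleanup with shutil.rmtree comes from the diff):

```python
import shutil
import tempfile

def load_from_temp_dir(materialize, load):
    """Illustrative helper: `materialize(temp_dir)` writes a checkpoint into a
    fresh temp dir, `load(temp_dir)` returns (model, tokenizer). The temp dir
    is always removed, even on failure, so repeated loads don't leak storage."""
    temp_dir = tempfile.mkdtemp()
    try:
        materialize(temp_dir)
        model_organism, tokenizer = load(temp_dir)
        return model_organism, tokenizer
    finally:
        # Clean up temp directory to avoid storage leaks
        shutil.rmtree(temp_dir, ignore_errors=True)
```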

# type: ignore
#####################################################################
# THIS FILE IS A COPY OF THE TRL GRPO TRAINER FILE #
# WITH THE ENTROPY BONUS ADDED. #
@aristizabal95 (Owner)

Minor note: update this header to mention the KL-from-base-model penalty.
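
Suggested wording only (adjust if the entropy bonus is still present alongside the new penalty):

```python
#####################################################################
# THIS FILE IS A COPY OF THE TRL GRPO TRAINER FILE,
# MODIFIED TO ADD THE KL-FROM-BASE-MODEL PENALTY.
#####################################################################
```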

# after super().__init__() so accelerator is available #
# Use FastLanguageModel to match unsloth optimizations #
############################################################
from unsloth import FastLanguageModel
@aristizabal95 (Owner)

Maybe move this import to the top? Unless there are good reasons for it being here.
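
If there is a reason to keep the import lazy (import cost, optional dependency), a common compromise is a guarded module-level import. A sketch only; it assumes nothing about why the import is currently local:

```python
# At the top of the module, next to the other imports.
try:
    from unsloth import FastLanguageModel
    UNSLOTH_AVAILABLE = True
except ImportError:
    FastLanguageModel = None
    UNSLOTH_AVAILABLE = False
```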

