Popular repositories
alignment_faking_public (Public, forked from rgreenblatt/model_organism_public)
Text-Steganography-Benchmark (Public): Code for Preventing Language Models From Hiding Their Reasoning, which evaluates defenses against LLM steganography.
Repositories
Showing 10 of 20 repositories
- redwood-control-arena (Public, forked from UKGovernmentBEIS/control-arena): Fork of ControlArena for Redwood Research's control research.
- subversion-strategy-eval (Public)
- apps-monitor-control-eval (Public): A repo to evaluate the performance of different monitors and attack policies.
- Text-Steganography-Benchmark (Public): Code for Preventing Language Models From Hiding Their Reasoning, which evaluates defenses against LLM steganography.
- Gradient-Machine (Public)