
Commit a815a01

add paper
1 parent e9e9fb0 commit a815a01

5 files changed: +50, -2 lines changed

205 KB
src/assets/publications/peixuan2025sar/sar.md

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
<div align="center">
</div>

--------------------------------------------------------------------------------

## Abstract

Reinforcement learning with verifiable rewards has significantly advanced reasoning with large language models (LLMs) in domains such as mathematics and logic. However, verifiable signals provide only coarse-grained or binary correctness feedback. This limitation results in inefficiencies like overly verbose or repetitive reasoning. Existing length-based solutions (e.g., length penalty) compromise accuracy. To address this deficiency, we introduce **self-aligned reward (SAR)**, a generic, universally applicable self-guided signal that complements verifiable rewards to enhance both reasoning accuracy and efficiency in RL. Specifically, SAR is defined as the relative perplexity difference between an answer conditioned on the query and the standalone answer, thereby favoring responses that are concise and query-specific. Quantitative analysis reveals that SAR reliably judges answer quality: concise, correct answers score higher than redundant ones, and partially correct answers score higher than entirely incorrect ones. Evaluation on 4 different models across 7 benchmarks shows that integrating SAR with prevalent RL algorithms like PPO and GRPO reduces answer length by 30\%, while improving accuracy by 4\%. Our analysis also shows that SAR generalizes well to out-of-domain tasks and achieves a Pareto-optimal frontier between correctness and efficiency compared to state-of-the-art baselines. We also show that SAR shortens unnecessary elaboration while preserving advanced reasoning behaviors. These results highlight the promise of self-aligned reward as a fine-grained complement to verifiable rewards, paving the way for efficient and effective LLM training.

## Formulation

$$
R_{\text{SA-GRPO}}(q,a_i,gt) = R_{\text{VR}} + \alpha R_{\text{SA}}, \quad
R_{\text{SA}} = \operatorname{clip}\!\left(
\frac{\operatorname{ppl}(a_i) - \operatorname{ppl}(a_i|q)}{\operatorname{ppl}(a_i)},\,-1,\,1
\right)
$$

$$
\operatorname{ppl}(a) = e^{-\frac{1}{|a|}\sum_{j=1}^{|a|}\log P(a_j|a_{1...j-1})}, \quad
\operatorname{ppl}(a|q) = e^{-\frac{1}{|a|}\sum_{j=1}^{|a|}\log P(a_j|q,a_{1...j-1})}
$$
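As a quick illustration of the formulation above, the following is a minimal sketch (not the paper's released code) of how the two perplexities and the clipped self-aligned reward could be computed with a Hugging Face causal language model. The scoring model name and the helper functions `perplexity` and `self_aligned_reward` are placeholders chosen for the example.

```python
# Minimal illustrative sketch: compute ppl(a), ppl(a|q), and the clipped
# self-aligned reward R_SA with a Hugging Face causal LM.
# The scoring model below is a placeholder, not the paper's choice.
from typing import Optional

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"  # placeholder scoring model
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()


@torch.no_grad()
def perplexity(answer: str, query: Optional[str] = None) -> float:
    """ppl(a) if query is None, else ppl(a|q): exp of the mean negative
    log-likelihood of the answer tokens (query tokens are conditioned on
    but not scored)."""
    answer_ids = tok(answer, return_tensors="pt", add_special_tokens=False).input_ids
    if query is not None:
        prefix_ids = tok(query, return_tensors="pt").input_ids
        input_ids = torch.cat([prefix_ids, answer_ids], dim=1)
        n_prefix = prefix_ids.shape[1]
    else:
        input_ids = answer_ids
        n_prefix = 1  # the first answer token has no left context to score
    logits = model(input_ids).logits                      # [1, T, vocab]
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)  # predict tokens 1..T-1
    targets = input_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    answer_lp = token_lp[:, n_prefix - 1:]                # keep answer tokens only
    return torch.exp(-answer_lp.mean()).item()


def self_aligned_reward(query: str, answer: str) -> float:
    """R_SA = clip((ppl(a) - ppl(a|q)) / ppl(a), -1, 1)."""
    ppl_a = perplexity(answer)
    ppl_a_q = perplexity(answer, query=query)
    return max(-1.0, min(1.0, (ppl_a - ppl_a_q) / ppl_a))


# In SA-GRPO this signal would be scaled and added to the verifiable reward:
# R = R_VR(a_i, gt) + alpha * self_aligned_reward(q, a_i)
```

In SA-GRPO the value returned by `self_aligned_reward` would be scaled by α and added to the verifiable reward, as in the first equation above.
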
## Results

<img src="./contours.png" style="zoom:50%;" />

<img src="./table.png" style="zoom:50%;" />
361 KB

src/assets/publications/peixuan2025tomap/tomap.md

Lines changed: 2 additions & 2 deletions
@@ -6,12 +6,12 @@
## Abstract
- Large language models (LLMs) have shown promising potential in persuasion, but existing works on training LLM persuaders are still preliminary. Notably, while humans are skilled in modeling their opponent's thoughts and opinions proactively and dynamically, current LLMs struggle with such Theory of Mind (ToM) reasoning, resulting in limited diversity and opponent awareness. To address this limitation, we introduce Theory of Mind Augmented Persuader ToMAP, a novel approach for building more flexible persuader agents by incorporating two theory of mind modules that enhance the persuader's awareness and analysis of the opponent's mental state. Specifically, we begin by prompting the persuader to consider possible objections to the target central claim, and then use a text encoder paired with a trained MLP classifier to predict the opponent’s current stance on these counterclaims. Our carefully designed reinforcement learning schema enables the persuader learns how to analyze opponent-related information and utilize it to generate more effective arguments. Experiments show that the ToMAP persuader, while containing only 3B parameters, outperforms much larger baselines, like GPT-4o, with a relative gain of 39.4\% across multiple persuadee models and diverse corpora. Notably, ToMAP exhibits complex reasoning chains and reduced repetition during training, which leads to more diverse and effective arguments. The opponent-aware feature of ToMAP also makes it suitable for long conversations and enables it to employ more logical and opponent-aware strategies. These results underscore our method's effectiveness and highlight its potential for developing more persuasive language agents.
+ Large language models (LLMs) have shown promising potential in persuasion, but existing works on training LLM persuaders are still preliminary. Notably, while humans are skilled in modeling their opponent's thoughts and opinions proactively and dynamically, current LLMs struggle with such Theory of Mind (ToM) reasoning, resulting in limited diversity and opponent awareness. To address this limitation, we introduce **Theory of Mind Augmented Persuader (ToMAP)**, a novel approach for building more flexible persuader agents by incorporating two theory of mind modules that enhance the persuader's awareness and analysis of the opponent's mental state. Specifically, we begin by prompting the persuader to consider possible objections to the target central claim, and then use a text encoder paired with a trained MLP classifier to predict the opponent’s current stance on these counterclaims. Our carefully designed reinforcement learning schema enables the persuader to learn how to analyze opponent-related information and utilize it to generate more effective arguments. Experiments show that the ToMAP persuader, while containing only 3B parameters, outperforms much larger baselines, like GPT-4o, with a relative gain of 39.4\% across multiple persuadee models and diverse corpora. Notably, ToMAP exhibits complex reasoning chains and reduced repetition during training, which leads to more diverse and effective arguments. The opponent-aware feature of ToMAP also makes it suitable for long conversations and enables it to employ more logical and opponent-aware strategies. These results underscore our method's effectiveness and highlight its potential for developing more persuasive language agents.

## Overview

- <img src="./main_fig.jpg" style="zoom:50%;"
+ <img src="./main_fig.png" style="zoom:50%;"
/>
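The updated abstract above describes predicting the opponent's stance on anticipated counterclaims with a text encoder paired with a trained MLP classifier. The sketch below illustrates that general idea only; the encoder choice, the input format, and the three-way stance labels are assumptions rather than ToMAP's actual configuration.

```python
# Hedged illustration of the "text encoder + trained MLP classifier" component
# described in the abstract; encoder choice, input format, and the three-way
# stance labels are assumptions, not ToMAP's actual configuration.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder text encoder (384-d)


class StancePredictor(nn.Module):
    """MLP mapping an encoded counterclaim plus dialogue context to a stance
    distribution (e.g. support / neutral / oppose). A real classifier would be
    trained on labeled persuasion dialogues."""

    def __init__(self, dim: int = 384, hidden: int = 256, n_classes: int = 3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden),   # concatenated claim + context embeddings
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, claim_emb: torch.Tensor, ctx_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([claim_emb, ctx_emb], dim=-1))


# Usage sketch: stance estimates over anticipated counterclaims would be fed
# back to the persuader as opponent-awareness features during training.
predictor = StancePredictor()
claims = ["Electric cars are too expensive.", "Charging takes too long."]
context = "Opponent: I just don't think EVs are practical for me yet."
claim_emb = encoder.encode(claims, convert_to_tensor=True)
ctx_emb = encoder.encode([context] * len(claims), convert_to_tensor=True)
stance_probs = predictor(claim_emb, ctx_emb).softmax(dim=-1)  # one row per counterclaim
```
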

src/config/Publications.jsx

Lines changed: 17 additions & 0 deletions
@@ -1,4 +1,21 @@
const publications = [
+  {
+    key: "peixuan2025sar",
+    title: "Self-Aligned Reward: Towards Effective and Efficient Reasoners",
+    authors:
+      "Peixuan Han, Adit Krishnan, Gerald Friedland, Jiaxuan You, Chris Kong",
+    year: "2025",
+    venue: "Preprint",
+    links: {
+      paper: "https://arxiv.org/pdf/2509.05489",
+      thread: "https://x.com/peixuanhakhan/status/1965907899642949795",
+      contact: "mailto:ph16@illinois.edu",
+    },
+    files: {
+      markdown: require("../assets/publications/peixuan2025sar/sar.md"),
+    },
+    tags: ["LLM", "Reasoning", "Efficiency"],
+  },
   {
     key: "peixuan2025tomap",
     title: "ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind",
