- Evolving Alignment via Asymmetric Self-Play, arXiv, 2411.00062
  Ziyu Ye, Rishabh Agarwal, Tianqi Liu, ..., Qijun Tan, Yuan Liu · (jiqizhixin)
- Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback, arXiv, 2410.19133
  Lester James V. Miranda, Yizhong Wang, Yanai Elazar, ..., Hannaneh Hajishirzi, Pradeep Dasigi
- LongReward: Improving Long-context Large Language Models with AI Feedback, arXiv, 2410.21252
  Jiajie Zhang, Zhongni Hou, Xin Lv, ..., Ling Feng, Juanzi Li · (LongReward - THUDM) · (huggingface)
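LongReward uses an off-the-shelf LLM as a judge to score long-context responses along several dimensions, then turns those scores into preference pairs for DPO. A minimal sketch of that aggregation step, assuming a generic judge callable and equal weighting (the dimension names and weighting here are illustrative, not the paper's exact recipe):

```python
def build_preference_pair(responses, judge):
    """Rank sampled responses by averaged LLM-judge scores; return (chosen, rejected).

    `judge(response)` is assumed to return per-dimension scores, e.g.
    {"helpfulness": 8.0, "faithfulness": 6.5} -- hypothetical names.
    """
    def reward(response):
        scores = judge(response)
        # Equal weights across dimensions: a simplification for illustration.
        return sum(scores.values()) / len(scores)

    ranked = sorted(responses, key=reward, reverse=True)
    # Best vs. worst sampled response becomes one DPO training pair.
    return ranked[0], ranked[-1]
```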
- Thinking LLMs: General Instruction Following With Thought Generation 𝕏
- MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models, arXiv, 2410.17637
  Ziyu Liu, Yuhang Zang, Xiaoyi Dong, ..., Dahua Lin, Jiaqi Wang
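MIA-DPO adapts Direct Preference Optimization to multi-image vision-language inputs; the multi-image augmentation happens upstream in data construction. For orientation, a minimal sketch of the vanilla DPO objective it builds on (function and argument names are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over summed per-token log-probs of each response.

    All inputs are 1-D tensors of shape (batch,); `beta` controls how far
    the policy may drift from the frozen reference model.
    """
    # Implicit reward of each response: beta * log(pi_theta / pi_ref)
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```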
- Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs, arXiv, 2410.18451
  Chris Yuhao Liu, Liang Zeng, Jiacai Liu, ..., Yang Liu, Yahui Zhou · (huggingface)
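Skywork-Reward's "tricks" are mostly about data curation; the training objective underneath is the usual Bradley-Terry pairwise loss for a sequence-classifier reward model, sketched below (a schematic, not the repo's code):

```python
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen, reward_rejected):
    """Pairwise reward-model loss: push r(x, y_chosen) above r(x, y_rejected).

    `reward_chosen` / `reward_rejected` are 1-D tensors of scalar rewards per
    pair, e.g. a value head's output at the final token of each response.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```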
- benchmark: Preference Proxy Evaluations (PPE) 𝕏 · (blog.lmarena) · (arxiv) · (PPE - lmarena)
- RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style, arXiv, 2410.16184
  Yantao Liu, Zijun Yao, Rui Min, ..., Lei Hou, Juanzi Li · (RM-Bench - THU-KEG)
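Reward-model benchmarks in this family largely reduce to pairwise accuracy: the fraction of (chosen, rejected) pairs the model orders correctly, with RM-Bench additionally controlling for subtle content edits and style bias. A minimal sketch of that core metric (illustrative, not the benchmark's actual harness):

```python
def pairwise_accuracy(scored_pairs):
    """`scored_pairs` is an iterable of (reward_chosen, reward_rejected) floats."""
    pairs = list(scored_pairs)
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)
```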
-
🌟 OpenRLHF - OpenRLHF
· (arxiv) · (docs.google)
- verl - volcengine
  Volcano Engine Reinforcement Learning for LLMs · (arxiv)