Inspired by the awesome-embodied-vision repository.
- Ahmadian, A., Cremer, C., Gallé, M., Fadaee, M., Kreutzer, J., Pietquin, O., Üstün, A., & Hooker, S. (2024). Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs (No. arXiv:2402.14740). arXiv. https://doi.org/10.48550/arXiv.2402.14740
- Guan, X., Zhang, L. L., Liu, Y., Shang, N., Sun, Y., Zhu, Y., Yang, F., & Yang, M. (2025). rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking (No. arXiv:2501.04519). arXiv. https://doi.org/10.48550/arXiv.2501.04519
- Havrilla, A., Du, Y., Raparthy, S. C., Nalmpantis, C., Dwivedi-Yu, J., Hambro, E., Sukhbaatar, S., & Raileanu, R. (2024, June 13). Teaching Large Language Models to Reason with Reinforcement Learning. AI for Math Workshop @ ICML 2024. https://openreview.net/forum?id=mjqoceuMnI
- Hu, J., Wu, X., Zhu, Z., Xianyu, Wang, W., Zhang, D., & Cao, Y. (2024). OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework (No. arXiv:2405.11143). arXiv. https://doi.org/10.48550/arXiv.2405.11143
- Kumar, A., Zhuang, V., Agarwal, R., Su, Y., Co-Reyes, J. D., Singh, A., Baumli, K., Iqbal, S., Bishop, C., Roelofs, R., Zhang, L. M., McKinney, K., Shrivastava, D., Paduraru, C., Tucker, G., Precup, D., Behbahani, F., & Faust, A. (2024). Training Language Models to Self-Correct via Reinforcement Learning (No. arXiv:2409.12917). arXiv. https://doi.org/10.48550/arXiv.2409.12917
- Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., & Cobbe, K. (2023, October 13). Let’s Verify Step by Step. The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=v8L0pN6EOi
- Qu, Y., Zhang, T., Garg, N., & Kumar, A. (n.d.). Recursive Introspection: Teaching Language Model Agents How to Self-Improve.
- Setlur, A., Garg, S., Geng, X., Garg, N., Smith, V., & Kumar, A. (2024). RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold (No. arXiv:2406.14532). arXiv. https://doi.org/10.48550/arXiv.2406.14532
- Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., & Guo, D. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (No. arXiv:2402.03300). arXiv. https://doi.org/10.48550/arXiv.2402.03300
- Snell, C., Lee, J., Xu, K., & Kumar, A. (2024). Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (No. arXiv:2408.03314). arXiv. https://doi.org/10.48550/arXiv.2408.03314
- Wang, P., Li, L., Shao, Z., Xu, R. X., Dai, D., Li, Y., Chen, D., Wu, Y., & Sui, Z. (2024). Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations (No. arXiv:2312.08935). arXiv. https://doi.org/10.48550/arXiv.2312.08935
- Xi, Z., Yang, D., Huang, J., Tang, J., Li, G., Ding, Y., He, W., Hong, B., Do, S., Zhan, W., Wang, X., Zheng, R., Ji, T., Shi, X., Zhai, Y., Weng, R., Wang, J., Cai, X., Gui, T., … Jiang, Y.-G. (2024). Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision (No. arXiv:2411.16579). arXiv. https://doi.org/10.48550/arXiv.2411.16579
- Xiang, V., Snell, C., Gandhi, K., Albalak, A., Singh, A., Blagden, C., Phung, D., Rafailov, R., Lile, N., Mahan, D., Castricato, L., Franken, J.-P., Haber, N., & Finn, C. (2025). Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought (No. arXiv:2501.04682). arXiv. https://doi.org/10.48550/arXiv.2501.04682
- Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., & Narasimhan, K. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models (No. arXiv:2305.10601). arXiv. https://doi.org/10.48550/arXiv.2305.10601
- Zelikman, E., Wu, Y., Mu, J., & Goodman, N. D. (2022). STaR: Bootstrapping Reasoning With Reasoning (No. arXiv:2203.14465). arXiv. https://doi.org/10.48550/arXiv.2203.14465
- Zeng, Z., Cheng, Q., Yin, Z., Wang, B., Li, S., Zhou, Y., Guo, Q., Huang, X., & Qiu, X. (2024). Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective (No. arXiv:2412.14135). arXiv. https://doi.org/10.48550/arXiv.2412.14135
- Zhang, Z., Zheng, C., Wu, Y., Zhang, B., Lin, R., Yu, B., Liu, D., Zhou, J., & Lin, J. (2025). The Lessons of Developing Process Reward Models in Mathematical Reasoning (No. arXiv:2501.07301). arXiv. https://doi.org/10.48550/arXiv.2501.07301