Feature request
DQO is a promising new offline RL approach for LLM alignment. Quoting the paper, DQO "formulates the response generation process as a Markov Decision Process (MDP) and utilizes the soft actor-critic (SAC) framework to optimize a Q-function directly parameterized by the language model. The MDP formulation of DQO offers structural advantages over bandit-based methods, enabling more effective process supervision." It is similar in spirit to DRO, which improves on DPO using a single-step MDP (bandit) formulation and SAC. I propose we add this as a trainer.
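To make the idea concrete, here is a minimal toy sketch of the kind of SAC-style soft Bellman objective the quoted description suggests, for a single response treated as a token-level MDP with a terminal reward. This is only an illustration of the structure, not the paper's exact loss: the parameterization `Q(s_t, a_t) = beta * log pi(a_t | s_t) + V(s_t)`, the function name, and the use of a mean squared Bellman residual are my assumptions here, and a real trainer would use the model's per-token log-probs and a learned value head.

```python
import numpy as np

def dqo_soft_bellman_loss(logprobs, values, reward, beta=1.0, gamma=1.0):
    """Toy soft Bellman residual for one response of T tokens.

    logprobs: shape (T,), per-token log pi(a_t | s_t) from the policy LM.
    values:   shape (T+1,), V(s_t) for each prefix state, including the
              state after the final token.
    reward:   scalar terminal reward for the full response.

    Assumed parameterization (not necessarily the paper's exact form):
    Q(s_t, a_t) = beta * log pi(a_t | s_t) + V(s_t).
    """
    # Q-values implied by the policy log-probs and the value estimates
    q = beta * logprobs + values[:-1]
    # Per-token rewards: zero everywhere except the terminal token
    rewards = np.zeros_like(logprobs)
    rewards[-1] = reward
    # Soft Bellman target: r_t + gamma * V(s_{t+1})
    target = rewards + gamma * values[1:]
    # Mean squared Bellman residual over the response
    return np.mean((q - target) ** 2)
```

With all log-probs and values at zero and a terminal reward of 1.0 over a 4-token response, only the last token contributes a residual, so the loss is 1/4. In a real trainer the gradient of this residual would flow into both the LM (through `logprobs`) and a value head (through `values`).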
Motivation
Paper: https://arxiv.org/pdf/2410.09302
Your contribution
I'm open to working on the implementation :)