Feature request
DQO is a promising new offline RL approach for LLM alignment. Quoting the paper, DQO "formulates the response generation process as a Markov Decision Process (MDP) and utilizes the soft actor-critic (SAC) framework to optimize a Q-function directly parameterized by the language model. The MDP formulation of DQO offers structural advantages over bandit-based methods, enabling more effective process supervision." It is similar in spirit to DRO, which improves on DPO using a single-step MDP (bandit) formulation and SAC. I propose we add this as a trainer.
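To make the idea concrete, here is a minimal toy sketch of the kind of SAC-style soft Bellman objective the quoted description suggests, for a single response treated as a token-level MDP with a terminal reward. This is only an illustration of the structure, not the paper's exact loss: the parameterization `Q(s_t, a_t) = beta * log pi(a_t | s_t) + V(s_t)`, the function name, and the use of a mean squared Bellman residual are my assumptions here, and a real trainer would use the model's per-token log-probs and a learned value head.

```python
import numpy as np

def dqo_soft_bellman_loss(logprobs, values, reward, beta=1.0, gamma=1.0):
    """Toy soft Bellman residual for one response of T tokens.

    logprobs: shape (T,), per-token log pi(a_t | s_t) from the policy LM.
    values:   shape (T+1,), V(s_t) for each prefix state, including the
              state after the final token.
    reward:   scalar terminal reward for the full response.

    Assumed parameterization (not necessarily the paper's exact form):
    Q(s_t, a_t) = beta * log pi(a_t | s_t) + V(s_t).
    """
    # Q-values implied by the policy log-probs and the value estimates
    q = beta * logprobs + values[:-1]
    # Per-token rewards: zero everywhere except the terminal token
    rewards = np.zeros_like(logprobs)
    rewards[-1] = reward
    # Soft Bellman target: r_t + gamma * V(s_{t+1})
    target = rewards + gamma * values[1:]
    # Mean squared Bellman residual over the response
    return np.mean((q - target) ** 2)
```

With all log-probs and values at zero and a terminal reward of 1.0 over a 4-token response, only the last token contributes a residual, so the loss is 1/4. In a real trainer the gradient of this residual would flow into both the LM (through `logprobs`) and a value head (through `values`).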
Motivation
Paper: https://arxiv.org/pdf/2410.09302
Your contribution
I'm open to working on the implementation :)