
Direct Q-Function Optimization #2526

Open
@catherinelee274

Description

Feature request

DQO (Direct Q-function Optimization) is a promising new offline RL approach for LLM alignment. It "formulates the response generation process as a Markov Decision Process (MDP) and utilizes the soft actor-critic (SAC) framework to optimize a Q-function directly parameterized by the language model. The MDP formulation of DQO offers structural advantages over bandit-based methods, enabling more effective process supervision." It is similar in spirit to DRO, which improves on DPO using a single-step MDP (bandit) formulation and SAC. I propose we add this as a trainer.
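
For a sense of scope, here is a minimal sketch of the core loss, assuming the SAC-style parameterization Q(s_t, a_t) = β·log π_θ(a_t|s_t) + V_φ(s_t) regressed toward a one-step soft-Bellman target r_t + V_φ(s_{t+1}). The function name, argument shapes, and default β are illustrative, not taken from a reference implementation:

```python
import torch


def dqo_bellman_loss(
    policy_logprobs: torch.Tensor,  # (B, T) log pi_theta(a_t | s_t) for the sampled tokens
    values: torch.Tensor,           # (B, T + 1) V_phi(s_t), including the terminal state
    rewards: torch.Tensor,          # (B, T) per-step rewards (often zero except at the final token)
    mask: torch.Tensor,             # (B, T) 1 for real tokens, 0 for padding
    beta: float = 0.1,
) -> torch.Tensor:
    """Soft-Bellman consistency loss for a Q-function parameterized by the LM.

    Sketch only: assumes Q(s_t, a_t) = beta * log pi(a_t | s_t) + V(s_t),
    regressed toward the one-step target r_t + V(s_{t+1}).
    """
    q_values = beta * policy_logprobs + values[:, :-1]   # Q(s_t, a_t)
    targets = rewards + values[:, 1:].detach()           # r_t + V(s_{t+1}), stop-gradient target
    td_error = (q_values - targets) * mask
    # Mean squared TD error, averaged over valid tokens then over the batch
    return (td_error.pow(2).sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)).mean()
```

In a trainer, the per-token log-probs would come from the policy model and the values from an added value head, with rewards typically placed on the final token of each offline response.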

Motivation

Paper linked: https://arxiv.org/pdf/2410.09302

Your contribution

I'm open to working on the implementation :)
