-
When Autoregressive Models (AMs) are given a context prefix C, they output a normalized conditional probability distribution over the next token x: p(x|C). They can be seen as policies, where the state is the context prefix provided.
-
Energy-Based Models (EBMs), on the other hand, given a context C, output an unnormalized distribution (potential) over the next token, P(x|C). This potential takes an exponential form parameterized by the energy function U(x|C).
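For concreteness, the two objects side by side (the minus sign in the exponent is the usual energy convention and is my assumption about the notation, not stated above):

$$
\text{AM:}\qquad p(x \mid C) \ \ge\ 0, \qquad \sum_{x} p(x \mid C) \;=\; 1
$$
$$
\text{EBM:}\qquad P(x \mid C) \;=\; \exp\!\big(-U(x \mid C)\big), \qquad
p(x \mid C) \;=\; \frac{P(x \mid C)}{Z(C)}, \qquad Z(C) \;=\; \sum_{x'} P(x' \mid C)
$$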
-
Training GAMs:
- Training 1: fits the EBM to the data.
- Training 2 (RL-as-optimization): fits an AM policy that maximizes the EBM potential.
- Training 2 (RL-as-sampling, distributional RL): fits an AM policy that approximates the normalized potential (both versions are written out after this list).
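A hedged restatement of the two versions of Training 2 as objectives, using P for the sequence-level potential, p = P/Z for its normalization, and CE for cross-entropy (this notation is mine):

$$
\text{RL-as-optimization:}\qquad \max_\theta \; \mathbb{E}_{x \sim \pi_\theta}\big[P(x)\big]
$$
$$
\text{RL-as-sampling:}\qquad \min_\theta \; \mathrm{CE}\big(p, \pi_\theta\big), \qquad p(x) = \frac{P(x)}{Z}, \qquad Z = \sum_x P(x)
$$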
-
RL-as-sampling aims at minimizing the cross-entropy between the policy pi_theta and the normalized potential p = P/Z. The resulting gradient can be estimated either on-policy or off-policy (see the equations below).
- The on-policy case had convergence issues.
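Written out from the definitions above (both estimators drop the constant factor 1/Z, hence the proportionality signs):

$$
\mathrm{CE}(p, \pi_\theta) \;=\; -\,\mathbb{E}_{x \sim p}\big[\log \pi_\theta(x)\big]
$$
$$
\nabla_\theta\, \mathrm{CE}(p, \pi_\theta)
\;\propto\; -\,\mathbb{E}_{x \sim \pi_\theta}\!\left[\frac{P(x)}{\pi_\theta(x)}\, \nabla_\theta \log \pi_\theta(x)\right]
\quad\text{(on-policy)}
$$
$$
\nabla_\theta\, \mathrm{CE}(p, \pi_\theta)
\;\propto\; -\,\mathbb{E}_{x \sim q}\!\left[\frac{P(x)}{q(x)}\, \nabla_\theta \log \pi_\theta(x)\right]
\quad\text{(off-policy, proposal } q\text{)}
$$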
Off-policy algorithm:
- Off-policy works better in practice. q(x) is initialized with an Autoregressive Model fit on the data (see the sketch below).
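A minimal sketch of one off-policy update, assuming hypothetical `pi_theta` / `q` / `potential_P` interfaces (a `.sample()` / `.log_prob()` policy API and a callable sequence-level potential); this is not a specific library's API:

```python
import torch

def offpolicy_step(pi_theta, q, potential_P, optimizer, batch_size=32):
    """One off-policy update of pi_theta toward p = P/Z (sketch).

    pi_theta    : trainable AM policy, assumed to expose .log_prob(x)
    q           : frozen proposal AM (initialized from the data-fitted AM),
                  assumed to expose .sample() and .log_prob(x)
    potential_P : callable returning the unnormalized potential P(x) of a sequence
    NOTE: these interfaces are hypothetical placeholders.
    """
    # 1. Sample full sequences from the proposal q (off-policy), not from pi_theta.
    xs = [q.sample() for _ in range(batch_size)]

    loss = 0.0
    for x in xs:
        # 2. Importance weight: unnormalized potential over proposal probability.
        with torch.no_grad():
            w = potential_P(x) / torch.exp(q.log_prob(x))
        # 3. Weighted negative log-likelihood of pi_theta; its gradient estimates
        #    grad CE(p, pi_theta) up to the constant factor 1/Z.
        loss = loss - w * pi_theta.log_prob(x)
    loss = loss / batch_size

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```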
Questions:
- What is the moment matching property?
- My understanding is the following: to convert any RL-as-optimization problem into distributional RL, replace the reward term with the unnormalized potential divided by the policy distribution. Is that correct?
- In all the equations above, x represents a full sequence, meaning that to compute pi_theta(x) we have to factor it over time: pi_theta(x) = pi_theta(x_1) * pi_theta(x_2 | x_1) * ... * pi_theta(x_n | x_{n-1}, ..., x_1). See the sketch below.
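A small sketch of that computation in log space, assuming a generic autoregressive model interface (hypothetical, not a specific library):

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(model, x):
    """log pi_theta(x) for a full token sequence x, via the chain-rule factorization.

    model : assumed autoregressive LM returning next-token logits of shape
            (len(prefix), vocab_size) for a 1-D prefix of token ids
    x     : 1-D LongTensor of token ids, assumed to start with a BOS token
            so that the first factor pi_theta(x_1) is conditioned on BOS only
    """
    logits = model(x[:-1])                      # step t predicts token x_{t+1} from x_{<=t}
    log_probs = F.log_softmax(logits, dim=-1)   # normalize over the vocabulary
    # pick out the log-probability of the token that actually occurred at each step
    token_lp = log_probs.gather(1, x[1:].unsqueeze(1)).squeeze(1)
    return token_lp.sum()                       # sum of logs = log of the product over time
```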