  • When an Autoregressive Model (AM) is given a context prefix C, it outputs a normalized conditional probability distribution over the next token x: p(x|C). AMs can be seen as policies whose states are the context prefixes.

  • Energy-Based Models (EBMs), on the other hand, given a context C, output an unnormalized distribution (potential) over the next token, P(x|C). This potential takes an exponential form parameterized by the energy function U(x|C).

  • Training GAMs (a rough formalization follows this list):

    • Training 1: fits the EBM to the data.
    • Training 2 (RL-as-optimization): fits an AM policy that maximizes the EBM potential.
    • Training 2 (RL-as-sampling, distributional RL): fits an AM policy that approximates the normalized potential.
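
    A rough formalization of the above, as I understand it. The sign convention exp(−U) and the exact reward used in RL-as-optimization (P rather than log P) are my assumptions, not taken from the paper:

    $$P(x \mid C) = \exp\!\big(-U(x \mid C)\big), \qquad p(x \mid C) = \frac{P(x \mid C)}{Z(C)}, \qquad Z(C) = \sum_{x} P(x \mid C)$$

    • Training 2, RL-as-optimization: $\max_\theta \; \mathbb{E}_{x \sim \pi_\theta}\big[P(x \mid C)\big]$, i.e. the potential plays the role of a reward.
    • Training 2, RL-as-sampling: $\min_\theta \; \mathrm{CE}\big(p(\cdot \mid C), \pi_\theta\big) = -\,\mathbb{E}_{x \sim p(\cdot \mid C)}\big[\log \pi_\theta(x)\big]$.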
  • RL-as-sampling aims at minimizing the cross-entropy between the policy and the normalized potential. In the on-policy case the gradient is

    $$\nabla_\theta\,\mathrm{CE}(p, \pi_\theta) = -\,\mathbb{E}_{x \sim \pi_\theta}\!\left[\frac{p(x)}{\pi_\theta(x)}\,\nabla_\theta \log \pi_\theta(x)\right]$$

    or, in the off-policy case with a proposal distribution q:

    $$\nabla_\theta\,\mathrm{CE}(p, \pi_\theta) = -\,\mathbb{E}_{x \sim q}\!\left[\frac{p(x)}{q(x)}\,\nabla_\theta \log \pi_\theta(x)\right]$$

    • The on-policy case had convergence issues.

    Off-policy algorithm (see the code sketch below):

    • Off-policy works better. q(x) is initialized with an Autoregressive Model fit on the data.
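
    A minimal code sketch of the off-policy update, not the paper's implementation. It assumes hypothetical helpers log_P (log of the unnormalized potential), q_sample and q_log_prob (for the proposal q), and a PyTorch-style policy pi_theta exposing log_prob:

    ```python
    import torch

    def dpg_offpolicy_step(pi_theta, optimizer, log_P, q_sample, q_log_prob, batch_size=32):
        """One off-policy Distributional Policy Gradient step:
        theta <- theta + alpha * (P(x)/q(x)) * grad_theta log pi_theta(x), averaged over a batch.
        """
        xs = q_sample(batch_size)          # sequences drawn from the proposal q, not from pi_theta
        loss = torch.zeros(())
        for x in xs:
            # Importance weight P(x)/q(x); the normalizer Z is a constant that only
            # rescales the gradient, so the unnormalized potential is used directly.
            # Detached because the weight does not depend on theta.
            w = torch.exp(log_P(x) - q_log_prob(x)).detach()
            # Minimizing -w * log pi_theta(x) performs the weighted log-likelihood ascent above.
            loss = loss - w * pi_theta.log_prob(x)
        loss = loss / batch_size
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    ```

    In the full algorithm, q(x) starts as the AM fit on the data (as noted above) and, from what I recall of the paper, is periodically replaced by the current policy once the policy becomes the better proposal.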

    Questions:

    1. What is the moment matching property?
    2. My understanding is the following: to convert any RL-as-optimization problem into distributional RL, replace the reward term with the unnormalized potential divided by the policy distribution. Is that correct?
    3. In all the equations above, x represents a sequence, meaning that to compute $\pi_\theta(x)$ we have to factor it over time: $\pi_\theta(x) = \prod_{t=1}^{n} \pi_\theta(x_t \mid x_{1:t-1})$ (a small sketch of this computation is below).
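
    Regarding question 3, a tiny sketch of that factorization in code, assuming a hypothetical pi_theta.next_token_logits(prefix) interface returning unnormalized scores over the vocabulary:

    ```python
    import torch

    def sequence_log_prob(pi_theta, x):
        """log pi_theta(x) for a full sequence x of token ids, factored over time:
        log pi_theta(x) = sum_t log pi_theta(x_t | x_{1:t-1}).
        """
        total = torch.zeros(())
        for t in range(len(x)):
            logits = pi_theta.next_token_logits(x[:t])      # condition on x_1 .. x_{t-1}
            log_probs = torch.log_softmax(logits, dim=-1)   # normalize over the vocabulary
            total = total + log_probs[x[t]]                 # pick the realized token x_t
        return total
    ```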