I think we need to pass `traj[:terminal]` to `discount_rewards` here, so that the gain for each step is accumulated only up to the end of its own episode and rewards from a following episode don't leak into it? https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/6fe6aa01208c325f8f990032621c18b61d574b37/src/ReinforcementLearningZoo/src/algorithms/policy_gradient/vpg.jl#L105
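
To illustrate the concern, here is a minimal standalone sketch of the effect. Note that `discounted_gain` is a hypothetical helper written only for this comment, not the package's `discount_rewards`, and the rewards, flags, and `γ = 0.5` are made-up toy values. It assumes a trajectory that holds two episodes back to back and that a `true` terminal flag marks the last step of an episode.

```julia
# Hypothetical helper for illustration only; NOT the library's `discount_rewards`.
# Computes the discounted gain backwards in time and resets the running sum
# at every terminal step, so each episode is treated independently.
function discounted_gain(rewards::AbstractVector{<:Real}, γ::Real,
                         terminal::AbstractVector{Bool})
    @assert length(rewards) == length(terminal)
    gains = zeros(Float64, length(rewards))
    running = 0.0
    for t in length(rewards):-1:1
        # A terminal flag at step t means nothing after t belongs to this episode.
        if terminal[t]
            running = 0.0
        end
        running = rewards[t] + γ * running
        gains[t] = running
    end
    return gains
end

# Toy trajectory (assumed data): two episodes of two steps each.
rewards  = [1.0, 1.0, 1.0, 1.0]
terminal = [false, true, false, true]

with_reset    = discounted_gain(rewards, 0.5, terminal)   # respects episode ends
without_reset = discounted_gain(rewards, 0.5, falses(4))  # ignores episode ends
```

With the terminal flags, the two episodes get identical gains (`[1.5, 1.0, 1.5, 1.0]`). Without them, the first episode's gains are inflated by the second episode's rewards (`[1.875, 1.75, 1.5, 1.0]`), which is the kind of leakage I suspect could happen at the linked line.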