Basic RL implementations based on the CSE 276F lectures, using only agrad, numpy, and gymnasium.
Todo: PPO, TRPO, Dyna-Q, TD-MPC, DDPG, epo