Problem
_training_step processes a single group of rollouts per optimizer step; there is no mechanism to accumulate gradients across multiple groups before stepping. With binary rewards and a small group size (N=8), many groups end up with all-zero or all-one rewards, whose group-relative advantages vanish and yield no gradient signal. Accumulating gradients across 2-4 groups before stepping would improve stability and reduce wasted compute.
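A minimal sketch of the proposed accumulation, assuming _training_step computes a scalar policy-gradient-style loss per group (the model, loss, and data here are stand-ins, not the project's actual code):

```python
import torch

ACCUM_GROUPS = 2  # accumulate 2-4 groups per optimizer step, per the proposal

model = torch.nn.Linear(4, 1)            # stand-in for the policy
opt = torch.optim.SGD(model.parameters(), lr=0.1)

step_count = 0
for i in range(8):                        # 8 fake rollout groups
    x = torch.randn(8, 4)                 # one group of N=8 rollouts (fake features)
    rewards = torch.randint(0, 2, (8,)).float()   # binary rewards
    adv = rewards - rewards.mean()        # group-relative advantage; all-0 or all-1 -> zeros
    logits = model(x).squeeze(-1)
    loss = -(adv * logits).mean() / ACCUM_GROUPS  # scale so accumulated grads average
    loss.backward()                       # gradients accumulate across groups
    if (i + 1) % ACCUM_GROUPS == 0:       # step only every ACCUM_GROUPS groups
        opt.step()
        opt.zero_grad()
        step_count += 1
print(step_count)  # -> 4
```

Dividing the loss by ACCUM_GROUPS keeps the effective gradient an average rather than a sum, so the learning rate does not need retuning when ACCUM_GROUPS changes.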
Also related: the per-step backward pass prevents batching multiple steps into one forward pass for GPU efficiency. Consider batching steps that share the same image so the image only needs to be processed once per batched forward.
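A sketch of the grouping step, assuming each pending step carries an image identifier alongside its rollout data (the field names here are hypothetical, not the project's actual schema):

```python
from collections import defaultdict

# Hypothetical pending steps, each tagged with the image it uses.
steps = [
    {"image": "img_a", "rollout": 0},
    {"image": "img_b", "rollout": 1},
    {"image": "img_a", "rollout": 2},
    {"image": "img_a", "rollout": 3},
]

# Group steps by image so each batch can share one image encoding
# in a single forward pass.
batches = defaultdict(list)
for s in steps:
    batches[s["image"]].append(s)

sizes = {img: len(group) for img, group in batches.items()}
print(sizes)  # -> {'img_a': 3, 'img_b': 1}
```

Each per-image batch could then be stacked into one forward pass, with backward deferred until the accumulation window closes.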