Hi,
I'm wondering about the reasoning behind this constraint in grads_and_grad_moms.
The given variables are indeed used in multiple operations in the loss computation graph, but that shouldn't prevent the gradient from being computed.
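To make concrete what I mean, here is a minimal sketch in plain Python. It is a toy example of my own, not the library's code, and `loss`, `analytic_grad` and `numeric_grad` are made-up names. The variable `w` feeds two separate operations, and as far as I understand the chain rule, its gradient is simply the sum of the contributions from both uses:

```python
# Hypothetical toy loss: w is used in two separate operations.
def loss(w, x=3.0):
    a = w * x    # first use of w
    b = w ** 2   # second use of w
    return a + b


def analytic_grad(w, x=3.0):
    # Chain rule: the contributions from both uses of w are summed,
    # d(w * x)/dw + d(w ** 2)/dw = x + 2 * w.
    return x + 2.0 * w


def numeric_grad(f, w, eps=1e-6):
    # Central finite difference, used only as an independent check.
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)


w = 1.5
g_analytic = analytic_grad(w)
g_numeric = numeric_grad(loss, w)
print(g_analytic, g_numeric)
```

The two values agree up to finite-difference error, so at least in this toy case the multiple uses don't get in the way.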
Could you shed some light on this?