Description
See #510
One of the new tests introduced in PR #510 fails. When running a module with two different custom implementations of a "linear-like" layer, the per-sample gradients computed by functorch-based hooks don't match the per-sample gradients obtained via microbatching.
Interesting observations:
- gradients are mismatched for only one parameter tensor (out of 5)
- gradients differ by a factor of 2 (batch_size is 64, so it's not a batch-size scaling issue)
I've verified the test and believe it is working correctly, so the problem is likely genuine.
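
For context, here is a minimal sketch of the kind of comparison involved (this is not the actual test from PR #510; it uses `torch.func` directly rather than the hooks, and the model, shapes, and loss are illustrative assumptions):

```python
# Sketch: compare per-sample gradients from vmap/grad against microbatching.
# Model, shapes, and loss are illustrative, not taken from the failing test.
import torch
from torch.func import functional_call, grad, vmap

model = torch.nn.Linear(4, 2)          # stand-in for the "linear-like" layer
x = torch.randn(64, 4)                 # batch_size=64, as in the report
y = torch.randint(0, 2, (64,))
params = {k: v.detach() for k, v in model.named_parameters()}

def loss_fn(params, sample, target):
    # Per-sample loss: add a batch dim of 1 and do a functional forward pass.
    out = functional_call(model, params, (sample.unsqueeze(0),))
    return torch.nn.functional.cross_entropy(out, target.unsqueeze(0))

# Per-sample gradients via vmap over grad (functorch-style).
ft_grads = vmap(grad(loss_fn), in_dims=(None, 0, 0))(params, x, y)

# Per-sample gradients via microbatching (one sample at a time).
mb_grads = {k: [] for k in params}
for i in range(x.shape[0]):
    g = grad(loss_fn)(params, x[i], y[i])
    for k in params:
        mb_grads[k].append(g[k])
mb_grads = {k: torch.stack(v) for k, v in mb_grads.items()}

# The two methods should agree for every parameter tensor.
for k in params:
    torch.testing.assert_close(ft_grads[k], mb_grads[k])
```

In the failing test the mismatch shows up for only one of the parameter tensors, with a constant factor of 2, which is what makes it look like a genuine bug rather than a tolerance issue.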