🚀 Feature
Please make the Opacus grad_sampler compatible with torch.cat operations in activation functions.
Motivation
I've been trying to use the grad_sampler module with networks containing the CReLU activation function. However, CReLU concatenates the output of the layer with its negation (and then applies ReLU), thus doubling the effective output size of the layer. This can be very useful and parameter-saving in networks that tend to develop mirrored filters (see https://arxiv.org/pdf/1603.05201v2.pdf).
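For concreteness, here is roughly what I mean by CReLU. This is my own minimal sketch (the class name and default dim are my choices), not code from Opacus or the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CReLU(nn.Module):
    """Concatenated ReLU: relu([x, -x]), which doubles the feature dimension."""

    def __init__(self, dim: int = -1):
        super().__init__()
        self.dim = dim

    def forward(self, x):
        # Concatenating x with -x is the torch.cat that trips up the grad sampler.
        return F.relu(torch.cat([x, -x], dim=self.dim))
```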
Furthermore, using the CReLU activation function it is possible to initialize fully connected networks so that they appear linear at initialization (see the photo in Additional context). This has been shown to be an extremely powerful initialization pattern, allowing fully connected networks with over 200 layers to be trained. That's incredible! Typical fully connected networks often struggle to learn appreciably at even 20+ layers (see https://arxiv.org/pdf/1702.08591.pdf).
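To sketch the "looks linear" idea (my own illustration, not the exact scheme from the paper): if the linear layer that consumes a CReLU output is initialized with mirrored weights [W, -W], the composition reduces to W·x at initialization, since W·relu(x) − W·relu(−x) = W·x.

```python
import torch
import torch.nn as nn


def looks_linear_init(linear: nn.Linear) -> None:
    """Mirror-initialize a Linear layer that follows a CReLU (sketch only)."""
    out_f, in_f = linear.weight.shape
    assert in_f % 2 == 0, "input width should be the doubled CReLU output"
    with torch.no_grad():
        # The first half of the columns multiplies relu(x), the second half relu(-x);
        # making the second half the negation of the first gives W @ x at init.
        w = torch.randn(out_f, in_f // 2) * (2.0 / in_f) ** 0.5
        linear.weight.copy_(torch.cat([w, -w], dim=1))
        if linear.bias is not None:
            linear.bias.zero_()
```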
Because of the symmetric initialization, the discontinuities in the CReLU activation function are dramatically smaller than in comparable networks with ReLU or other activation functions. I've been studying gradient conditioning and stability in a variety of architectures using Opacus, but it's broken for activation functions that use torch.cat. In the case of CReLU, weight.grad_sample comes back half the size of the weight itself (ignoring the batch dimension); a rough reproduction sketch follows.
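Here is roughly the pattern that shows the problem for me. It's a hedged sketch (the layer sizes, the CReLU class from above, and the expected shapes are mine), assuming GradSampleModule from opacus.grad_sample is the right entry point:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from opacus.grad_sample import GradSampleModule


class CReLU(nn.Module):
    def forward(self, x):
        return F.relu(torch.cat([x, -x], dim=-1))


model = GradSampleModule(
    nn.Sequential(
        nn.Linear(8, 16),
        CReLU(),          # feature width 16 -> 32
        nn.Linear(32, 4),
    )
)

x = torch.randn(5, 8)     # batch of 5
model(x).sum().backward()

for name, p in model.named_parameters():
    # I expect grad_sample to have shape (batch, *p.shape); with CReLU in the
    # network, the second Linear's weight.grad_sample comes back with half the
    # expected size (ignoring the batch dimension).
    print(name, tuple(p.shape), tuple(p.grad_sample.shape))
```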
Pitch
Implementing (or fixing) Opacus grad_sampler compatibility with torch.cat would allow it to be used with a wider variety of activation functions, including CReLU, which would be really cool (see the Motivation section).
I didn't file this as a bug report because I'm not sure that torch.cat compatibility was ever intentionally implemented.
Alternatives
I can't think of any alternatives.
Additional context
Thank you for filing this issue and explaining it really well.
Can you please provide more details on the error you're getting? Specifically, can you provide a minimal reproducing example? We have Colab templates for a minimal example when you create the issue.