Fix AlphaDropout implementation and add tests #1781
Conversation
Behaviour and outputs are adapted from the PyTorch and TF implementations.
Unexpected bonus: this is now 100% GPU compatible and leaves us with one less broken test :)
looks good!
Looks good, I think this deserves an entry in NEWS.md. Especially the GPU bit.
bors r+
return x
p = a.p
iszero(p) && return x
isone(p) && return sign.(x) .* T(0)
Is this correct? Are we intentionally creating -0.0 values here?
Yes, see point 2 in #1781 (comment).
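For illustration, the signed-zero behaviour being discussed can be checked directly (a standalone sketch, not code from the PR):

```julia
x = [-1.0, 0.0, 2.0]
y = sign.(x) .* 0.0
# isequal distinguishes -0.0 from 0.0, while == treats them as equal
@assert isequal(y, [-0.0, 0.0, 0.0])
@assert y == [0.0, 0.0, 0.0]
```

So `sign.(x) .* T(0)` really does emit `-0.0` for negative inputs, which is what the question above is probing.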
iszero(p) && return x
isone(p) && return sign.(x) .* T(0)
α′ = T(-1.7580993408473766) # selu(-Inf) == -λα
We can use oftype here.
oftype requires a value of type T, but the only way to get that in this function would be to extract an element of x (which would fail if x is empty and might also trigger the getindex adjoint unnecessarily). Since this function should be specializing on eltype anyhow, it makes sense to take advantage of that for casting.
Build succeeded.
AFAICT, the original implementation never behaved as expected even pre-Zygote. This was likely not caught because the original PR didn't come with tests, so this PR should remedy that.
Behaviour and outputs are adapted from the PyTorch and TF implementations. Some points of note:

1. p = 0 is short-circuited to avoid propagating NaNs when calculating A and B.
2. p = 1. TF just returns the input, but I think the PyTorch approach of returning all zeros (+/- depending on the input sign) is more in line with Dropout.
3. ifelse is used instead of something like https://github.com/keras-team/keras/blob/v2.7.0/keras/layers/noise.py#L200. I think it better reflects the conditional nature of the operation and it was also faster in local benchmarking.

PR Checklist