Fix Norm Layers, Again #1509
Conversation
this TODO has nothing to do with this PR.
And the last time it was discussed there was not a complete consensus. I think it is better not to put this TODO on the v0.12 path.
Still not perfect, but better than master. We would need to figure out how to remove the unions as well.

This PR:

julia> bn = BatchNorm(10)
BatchNorm(10, identity, affine = )
julia> x = rand(Float32, 3,3,3,10);
julia> @code_warntype bn(x)
Variables
BN::BatchNorm{typeof(identity),Array{Float32,1},Float32,Array{Float32,1}}
x::Array{Float32,4}
#334::Flux.var"#334#335"{Array{Float32,4},Int64}
@_4::Int64
N::Int64
reduce_dims::Array{Int64,1}
affine_shape::Union{Nothing, NTuple{4,Int64}}
ts::Union{Nothing, Bool}
μ::Array{Float32,4}
σ²::Array{Float32,4}
@_11::Union{Nothing, NTuple{4,Int64}}
@_12::Union{Nothing, Bool}
Body::Array{Float32,4}
1 ── Core.NewvarNode(:(#334))
│ Core.NewvarNode(:(@_4))
│ Core.NewvarNode(:(reduce_dims))
│ Core.NewvarNode(:(affine_shape))
│ Core.NewvarNode(:(ts))
│ Core.NewvarNode(:(μ))
│ Core.NewvarNode(:(σ²))
│ (N = Flux.ndims(x))
[...]

Master:

julia> bn = BatchNorm(10)
BatchNorm(10)
julia> x = rand(Float32, 3,3,10,10);
julia> bn(x);
julia> @code_warntype bn(x)
Variables
BN::BatchNorm{typeof(identity),Array{Float32,1},Float32,Array{Float32,1}}
x::Array{Float32,4}
#239::Flux.var"#239#240"{Array{Float32,4},Int64}
N::Int64
reduce_dims::Array{Int64,1}
affine_shape::NTuple{4,Int64}
Body::Any
1 ─ Core.NewvarNode(:(#239))
│ Core.NewvarNode(:(N))
│ Core.NewvarNode(:(reduce_dims))
│ Core.NewvarNode(:(affine_shape))
│ %5 = Flux.ndims(x)::Core.Compiler.Const(4, false)
│ %6 = (%5 - 1)::Core.Compiler.Const(3, false)
│ %7 = Flux.size(x, %6)::Int64
│ %8 = Base.getproperty(BN, :chs)::Int64
│ %9 = (%7 == %8)::Bool
Based on our discussion yesterday, this PR looks ready except the plans for [...]. Other than that, it looks like we only need tests and docs.
β = initβ(chs)
γ = initγ(chs)
Why should we have params that we don't use in practice when affine=false?
Other thought on this: I know we rejected Affine(BatchNorm(...)) because it is not ergonomically nice. But what about having a field in BatchNorm called affine that is a function? When the affine kwarg is true, then affine = Dense(...), but when it is false, affine = identity. In a Chain, the user still sees just BatchNorm, and bn.affine will show a Dense, which the user will intuitively understand. We also don't end up storing data we don't need, and trainable will automatically be empty.
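A hedged sketch of that idea, not this PR's code: the struct name, fields, and the choice of Flux.Diagonal as the per-channel affine callable are all illustrative.

```julia
using Flux

# Illustrative only: store the affine transform as a callable field, so that
# `affine = false` stores `identity` and contributes no trainable parameters.
struct SketchBN{F,A,V}
  λ::F        # activation function
  affine::A   # e.g. Flux.Diagonal(chs) when affine = true, identity otherwise
  μ::V        # running mean (state, not a parameter)
  σ²::V       # running variance (state, not a parameter)
  ϵ::Float32
end

Flux.@functor SketchBN
Flux.trainable(bn::SketchBN) = (bn.affine,)  # params come only from the affine callable

function SketchBN(chs::Integer, λ = identity; affine = true, ϵ = 1f-5)
  a = affine ? Flux.Diagonal(chs) : identity
  SketchBN(λ, a, zeros(Float32, chs), ones(Float32, chs), ϵ)
end
```

With this layout, Flux.params(SketchBN(10, affine = false)) would be empty without any extra bookkeeping.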
Kind of like you'd have for an activation function? That does sound pretty appealing; would it make sense to call that field transform or something then?
Yeah exactly. transform would certainly capture the field accurately, but affine might be better here because people will understand it.
Well, that had been a consideration. It's a bit awkward to also do γ = reshape(l.γ, affine_shape), but I think we can have broadcasting take care of the shape as needed by l.λ.(γ .* x̂ .+ β).
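A minimal sketch of that reshape-and-broadcast step, with illustrative shapes: γ and β are per-channel vectors, and x̂ stands in for the already-normalised input.

```julia
x̂ = randn(Float32, 3, 3, 10, 2)              # (W, H, C, N)
γ, β = ones(Float32, 10), zeros(Float32, 10)
affine_shape = (1, 1, 10, 1)                  # channel dim kept, all others singleton

# Broadcasting then matches the full array shape, as in l.λ.(γ .* x̂ .+ β)
y = identity.(reshape(γ, affine_shape) .* x̂ .+ reshape(β, affine_shape))
size(y)  # (3, 3, 10, 2)
```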
src/layers/normalise.jl
Outdated
affine, track_stats, nothing)
end

@functor BatchNorm
trainable(bn::BatchNorm) = hasaffine(bn) ? (bn.β, bn.γ) : ()
# trainable(bn::BatchNorm) = hasaffine(bn) ? (bn.β, bn.γ) : ()
If you remove this, μ and σ² will end up in the params.
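A self-contained illustration of that point with a toy struct (not Flux's actual BatchNorm): with only @functor, params would also collect the running statistics; the trainable override is what keeps them out.

```julia
using Flux

struct ToyNorm{V}
  β::V
  γ::V
  μ::V    # running mean: state, not a parameter
  σ²::V   # running variance: state, not a parameter
end

Flux.@functor ToyNorm
Flux.trainable(t::ToyNorm) = (t.β, t.γ)  # drop this line and μ, σ² show up in params

t = ToyNorm(zeros(Float32, 2), ones(Float32, 2), zeros(Float32, 2), ones(Float32, 2))
length(Flux.params(t))  # 2 with the override; 4 without it
```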
test/cuda/layers.jl
Outdated
 pixelshuffle = [PixelShuffle]
 gpu_gradtest("PixelShuffle 2d", pixelshuffle, rand(Float32, 3, 4, 18, 3), 3)
 gpu_gradtest("PixelShuffle 1d", pixelshuffle, rand(Float32, 3, 18, 3), 3)
there is a weird one space indent here
I think the indent is still here?
I'll recap what this PR, when ready, should not change with respect to master:
julia> params(BatchNorm(2))
Params([Float32[0.0, 0.0], Float32[1.0, 1.0]])
julia> params(BatchNorm(2, affine=false))
Params([])

So, since there won't be any API changes with respect to master (unless I'm missing something), we can consider this non-blocking for the v0.12 release.
Co-authored-by: Carlo Lucibello <carlo.lucibello@unibocconi.it>
How does one change the behaviour of a layer at different points of training? Say I don't want to perform [...]. How do I also restart training from a given point if I don't serialise the state of the tracking variables at such a time? They would not be trained either way, so the user choice is respected.
It is not clear what you mean in the last two sentences, but the thing is that [...]. The use case you mention can be simply achieved by creating the layer [...].
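A hedged sketch of the kind of configuration being referred to; the keyword follows the track_stats option discussed in this PR, and testmode!/trainmode! are the existing Flux mode switches (exact defaults may differ).

```julia
using Flux

bn = BatchNorm(10, track_stats = false)  # never accumulate running statistics
m  = BatchNorm(10)                       # affine transform and tracked stats

Flux.testmode!(m)         # force inference behaviour (use the tracked μ, σ²)
Flux.trainmode!(m)        # force training behaviour
Flux.testmode!(m, :auto)  # back to automatic detection
```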
Again, PyTorch handles the separation between trainable parameters and the rest of the internal state very neatly:

In [15]: import torch.nn as nn
In [16]: m = nn.BatchNorm2d(2, affine=True)
In [17]: list(m.parameters())
Out[17]:
[Parameter containing:
tensor([1., 1.], requires_grad=True),
Parameter containing:
tensor([0., 0.], requires_grad=True)]
In [18]: m = nn.BatchNorm2d(2, affine=False)
In [19]: list(m.parameters())
Out[19]: []
In [20]: m = nn.BatchNorm2d(2, track_running_stats=True)
In [21]: list(m.buffers())
Out[21]: [tensor([0., 0.]), tensor([1., 1.]), tensor(0)]
In [22]: m = nn.BatchNorm2d(2, track_running_stats=False)
In [23]: list(m.buffers())
Out[23]: []
There are still some issues related to the changes to the tests which aren't clear.
testmode!(m, true)
y = evalwgrad(m, x) # should override istraining
@test count(a->a==0, y) == 0
y = m(x) # should override istraining |
This still holds
y = m(x)
@test isapprox(y, sigmoid.((x .- m.μ) ./ sqrt.(m.σ² .+ m.ϵ)), atol = 1.0e-7)
@test_broken isapprox(y, mean(sigmoid.((x .- m.μ) ./ sqrt.(m.σ² .+ m.ϵ)), dims = 1), atol = 1.0e-7)
Bump here
# @test length(Flux.params(InstanceNorm(10))) == 0
# @test length(Flux.params(InstanceNorm(10, affine = true))) == 2
# @test length(Flux.params(InstanceNorm(10, affine =false))) == 0
Should be changed back
We aren't guaranteeing this.
A test isn't a guarantee of behavior. But this is the current behavior, so it should be tested. If two weeks from now someone borks trainable, then this test would accurately identify the issue, which is why it exists.
μnew, σ²new = track_stats(x, (l.μ, l.σ²), (μ,σ²),
  l.momentum, reduce_dims = nc.dims)
these can go in the zygote block below
I think this is okay since track_stats has @nograd
in which case the @nograd on track_stats can be removed
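The two options being weighed, as a hedged sketch; the helper name and body are illustrative, not the PR's track_stats.

```julia
using Zygote
using Statistics: mean, var

# Option 1: mark the helper itself as non-differentiable once.
compute_batch_stats(x, dims) = (mean(x; dims = dims), var(x; dims = dims, corrected = false))
Zygote.@nograd compute_batch_stats

# Option 2: keep the helper plain and wrap the call site instead,
# in which case the @nograd above can be removed:
# μ, σ² = Zygote.ignore(() -> compute_batch_stats(x, (1, 2, 4)))
```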
end
let m = trainmode!(BatchNorm(2)), x = reshape(Float32.(1:6), 3, 2, 1) |
these tests shouldn't be changed
let m = BatchNorm(32), x = randn(Float32, 416, 416, 32, 1);
  m(x)
  @test (@allocated m(x)) < 100_000_000
end
# let m = BatchNorm(32), x = randn(Float32, 416, 416, 32, 1);
# m(x)
# @test (@allocated m(x)) < 100_000_000
# end |
probably this was here for a reason, could try to check with git blame when it was introduced
Blame leads back to #586. Personally I think it's a little weird to just test allocations for BatchNorm, but on the other hand this is 4 lines that run quickly and don't add any real maintenance overhead. My vote is either uncomment it or get rid of it entirely. Leaving stuff around but commented without a little explainer comment raises more questions than it answers :P
y = m(x)
μ = mean(x, dims=1)
σ² = var(x, dims=1, corrected=false)
y, back = pullback((m,x) -> m(x), m, x) |
When making internal changes to a layer, previous tests shouldn't be modified, so that one can be sure that previous behavior is preserved. Normalization layers behave differently in gradient contexts; this test was testing the behaviour in a non-gradient context.
There are too many changes in the tests; they should be mostly reverted, otherwise it's hard to check that the revised internals don't introduce new (or old) bugs.
Zygote.ignore() do
  l.μ = reshape(μnew, :)
  l.σ² = reshape(σ²new, :)
end |
Why not put this in track_stats?
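A hedged sketch of what that could look like: the helper performs the state update itself inside Zygote.ignore, so the caller no longer needs its own block (struct, fields, and function name are illustrative, not this PR's API).

```julia
using Zygote

mutable struct ToyStats
  μ::Vector{Float32}
  σ²::Vector{Float32}
end

# Fold the state update into the helper: the mutation stays invisible to the gradient.
function track_stats!(l::ToyStats, μnew, σ²new)
  Zygote.ignore() do
    l.μ = reshape(Float32.(μnew), :)
    l.σ² = reshape(Float32.(σ²new), :)
  end
  return nothing
end

l = ToyStats(zeros(Float32, 3), ones(Float32, 3))
track_stats!(l, rand(1, 1, 3, 1), rand(1, 1, 3, 1))
```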
sz = sz isa Integer ? (sz,) : sz
diag = affine ? Diagonal(sz...) : nothing
return LayerNorm(λ, diag, ϵ, sz, affine)
function LayerNorm(sz, λ = identity; affine = Diagonal(sz...), ϵ = 1f-5) |
This isn't quite what I was suggesting. I didn't mean to change the affine keyword to be non-boolean; LayerNorm was fine as is. I meant storing affine as a function in BatchNorm, etc., similar to LayerNorm. The API would still be a boolean.
I get that, but going forward we would likely not want to limit kwargs to booleans; while that is symmetric with other libraries, it is somewhat limiting in what can be expressed. Specifically, a boolean can express only two states, while an anonymous function can do more than that. Since our implementation isn't limiting what can be done with the layer, we should pass that freedom on to the end user as well. This means we can have a compatibility layer to support booleans, but when it isn't worth writing one's own layer, or when tweaking settings that aren't directly expressed with the boolean, it's handy to pass in an anonymous function instead.
We wouldn't want the boolean to be the only API that users can interact with. In fact, overuse of it can add bloat to the code base.
I agree completely, but the default constructor provides that flexibility? I also agree on the overuse of booleans, but for the norm layers 99.99% of users want an affine transformation (not some other function) and they just want to turn it on or off. Here, a boolean makes sense for the sake of convenience. We can still allow the advanced use case with the default constructor.
The default constructor is much less ergonomic and harder to use. Not to mention it requires the user to know so many details about Flux's innards that it's untenable and a poor use of time, and it makes user code fragile to updates to Flux.
The fact that other libraries push boolean kwargs is why I proposed having some kind of hybrid constructor that can accept either.
Sure, a hybrid constructor is fine by me
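A hedged sketch of such a hybrid handling of the affine keyword; the helper name is illustrative, and the boolean path mirrors the Diagonal-based default quoted in this thread.

```julia
using Flux

# Accept either a Bool (convenient on/off switch) or a callable (full flexibility).
make_affine(affine::Bool, sz) = affine ? Flux.Diagonal(sz...) : identity
make_affine(affine, sz) = affine  # already a callable: pass it through untouched

make_affine(true, (10,))           # per-feature scale and shift
make_affine(false, (10,))          # identity, no parameters
make_affine(Dense(10, 10), (10,))  # custom transform supplied by the user
```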
bumping
Seems like there are still a handful of comments here, so resolving those should clear up any blockers.
diag = affine ? Diagonal(sz...) : nothing
return LayerNorm(λ, diag, ϵ, sz, affine)
function LayerNorm(sz, λ = identity; affine = Diagonal(sz...), ϵ = 1f-5)
# diag = affine ? Diagonal(sz...) : identity |
Per the discussion above (hybrid constructor), I believe this line needs to be brought back.
struct NormConfig{A,T}
  dims::Vector{Int}
end

NormConfig(affine, track_stats, reduce_dims) = NormConfig{affine, track_stats}(reduce_dims)

getaffine(nc::NormConfig{true}, sz_x; dims = length(sz_x) - 1) =
  ntuple(i -> i in dims ? sz_x[i] : 1, length(sz_x))

getaffine(nc::NormConfig{false}, args...; kwargs...) = ()

# For InstanceNorm, GroupNorm, and BatchNorm.
# Compute the statistics on the slices specified by reduce_dims.
# reduce_dims=[1,...,N-2,N] for BatchNorm
# reduce_dims=[1,...,N-2] for InstanceNorm and GroupNorm
function _norm_layer_forward(l, x::AbstractArray{T,N}; reduce_dims, affine_shape) where {T, N}
if !_isactive(l) && l.track_stats # testmode with tracked stats
function norm_forward(l, x::AbstractArray{T,N}, nc::NormConfig{A, true}) where {A, T, N} |
Now that all this functionality has been pulled out, we should consider moving it into NNlib next. Getting rid of https://github.com/FluxML/Flux.jl/tree/master/src/cuda (which only contains a batchnorm override) would be excellent and save us quite a few headaches.
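For reference, a small self-contained illustration of what the quoted getaffine produces for a 4-d (W, H, C, N) input (the definitions are copied from the diff above):

```julia
struct NormConfig{A,T}
  dims::Vector{Int}
end
NormConfig(affine, track_stats, reduce_dims) = NormConfig{affine, track_stats}(reduce_dims)

getaffine(nc::NormConfig{true}, sz_x; dims = length(sz_x) - 1) =
  ntuple(i -> i in dims ? sz_x[i] : 1, length(sz_x))
getaffine(nc::NormConfig{false}, args...; kwargs...) = ()

nc = NormConfig(true, true, [1, 2, 4])  # affine + tracked stats, BatchNorm-style reduce dims
getaffine(nc, (3, 3, 10, 2))            # == (1, 1, 10, 1): a broadcastable per-channel shape
getaffine(NormConfig(false, true, [1, 2, 4]), (3, 3, 10, 2))  # == (): no affine reshape needed
```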
closing as stale
This is a successor to #1495
PR Checklist

@dhairyagandhi96 (for API changes).

Todo:
- hasaffine
- testmode! - use zygote instead and have an equivalent forward pass function available a la dropout