Rework normalization layers #1397
Conversation
@DhairyaLGandhi bump
In general it's alright to have extra steps available for different norm layers, but I think we shouldn't have to check everything every time for every layer. Otherwise we end up with one large function that should really be multiple smaller functions, combined appropriately for each layer.
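As a rough illustration of that composition idea (a sketch only: the helper names _norm_stats, _normalize, _affine, and instance_norm_forward are hypothetical, not Flux internals), each piece of the normalization forward pass could live in its own small function and layers would combine only what they need:

using Statistics: mean, var

# Statistics over the reduction dimensions.
_norm_stats(x, dims) = (mean(x; dims=dims), var(x; dims=dims, corrected=false))

# Normalization given precomputed statistics.
_normalize(x, μ, σ², ϵ) = (x .- μ) ./ sqrt.(σ² .+ ϵ)

# Optional learnable shift and scale, used only by layers that want it.
_affine(x, γ, β) = γ .* x .+ β

# Example: an affine instance norm on a (W, H, C, N) array with per-channel γ and β.
function instance_norm_forward(x, γ, β; ϵ = 1f-5)
    dims = Tuple(1:ndims(x)-2)                            # reduce over the spatial dims
    μ, σ² = _norm_stats(x, dims)
    shape = (ntuple(_ -> 1, ndims(x) - 2)..., length(γ), 1)
    _affine(_normalize(x, μ, σ², ϵ), reshape(γ, shape), reshape(β, shape))
end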
force-pushed from 5c57679 to 10f4eda
Every layer has
This will fix #1195 if I get it right?
Yes, this is tested in test/cuda/layers.jl now that InstanceNorm is not on the list of broken layers.
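A minimal check along those lines (hypothetical test code, not the actual contents of test/cuda/layers.jl) would be:

using Flux, CUDA, Test

# InstanceNorm now runs on the GPU and matches the CPU result.
m = InstanceNorm(3)
x = rand(Float32, 8, 8, 3, 4)
y_cpu = m(x)
y_gpu = gpu(m)(gpu(x))
@test collect(y_gpu) ≈ y_cpu atol=1f-4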
force-pushed from cbfa003 to 33b8728
I've also updated GroupNorm, and I've done an extensive search of the issues fixed by this PR.
I'm still a bit uncomfortable with having affine switches in the layers; do we have a better way around this? Perhaps the users are better suited to add that as a layer in the model.
force-pushed from c960c61 to 257931f
I don't see how to do it in a way that doesn't involve a branch somewhere. And we really don't want the user to have to manually add an Affine layer after BatchNorm. PyTorch has the same interface, so it can't be that bad if we do the same.
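For comparison, this is roughly what the two options look like from the user's side. Affine here is a hypothetical user-defined layer (nothing shipped by Flux or this PR), written out to show the extra boilerplate the keyword interface avoids:

using Flux

# Hypothetical explicit affine layer; assumes (W, H, C, N) input with per-channel γ and β.
struct Affine{T}
    γ::T
    β::T
end
Affine(n::Integer) = Affine(ones(Float32, n), zeros(Float32, n))
Flux.@functor Affine
(a::Affine)(x) = reshape(a.γ, 1, 1, :, 1) .* x .+ reshape(a.β, 1, 1, :, 1)

# Option discussed above: a non-affine norm layer followed by an explicit Affine.
model_explicit = Chain(Conv((3, 3), 3 => 16), BatchNorm(16, affine=false), Affine(16))

# Keyword interface kept by this PR: the learnable shift/scale stays inside the layer.
model_keyword = Chain(Conv((3, 3), 3 => 16), BatchNorm(16, affine=true))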
I did some benchmarking using the script below. Fortunately, there is no appreciable performance difference for any system size.

SCRIPT:

using Flux
using BenchmarkTools
using Random

function test_BN(shape)
    Random.seed!(17)
    x = rand(Float32, shape)
    bn = BatchNorm(shape[end-1])
    @btime $bn($x)
end

for shape in [(10,10),
              (100,100),
              (1000,1000),]
    test_BN(shape)
end

OUTPUT:

# WITHOUT track_stats and affine branching
4.427 μs (41 allocations: 2.19 KiB)
38.264 μs (42 allocations: 40.84 KiB)
3.552 ms (42 allocations: 3.82 MiB)
4.402 μs (41 allocations: 2.19 KiB)
38.231 μs (42 allocations: 40.84 KiB)
3.336 ms (42 allocations: 3.82 MiB)
4.407 μs (41 allocations: 2.19 KiB)
37.572 μs (42 allocations: 40.84 KiB)
3.476 ms (42 allocations: 3.82 MiB)
# WITH track_stats and affine branching
4.538 μs (41 allocations: 2.19 KiB)
38.196 μs (42 allocations: 40.84 KiB)
3.354 ms (42 allocations: 3.82 MiB)
4.472 μs (41 allocations: 2.19 KiB)
40.386 μs (42 allocations: 40.84 KiB)
3.549 ms (42 allocations: 3.82 MiB)
4.357 μs (41 allocations: 2.19 KiB)
38.216 μs (42 allocations: 40.84 KiB)
3.580 ms (42 allocations: 3.82 MiB)
@DhairyaLGandhi bump
Could you also check the backwards pass? Having an affine layer seems fine to me, but I'm open to thoughts around construction. As in other test cases on the GPU, I'd much prefer if we didn't have a separate function to go to. Using it to test layers adds an extra function to trace errors through. We should have the tests be very clear about what they are testing, and it's best to have that be clear at the site where the test is performed.
Will do.
I'm strongly against having a separate affine layer: people shouldn't have to insert an Affine every time they insert a BatchNorm, it's extremely annoying, and this is why the default for BatchNorm is affine=true.
Having a centralized testing function ensures exhaustive and consistent testing, and makes it very clear what is being tested, since it is the same thing everywhere. Having heterogeneous and likely less extensive "manual testing" is strictly worse I think: for each set of tests you have to think about what you are capturing and what you are missing. Going through an extra function call doesn't seem a problem at all. ChainRules provides a whole package of rule-testing functions, ChainRulesTestUtils, so we can have a couple of testing functions ourselves (in fact we do in Zygote and now in NNlib as well).
Where you have to test consistent output, sure, but when you have generic tests like we do with GPU layers, clarity is better imo. Thinking about the missing bits is good when you have to test for changes you have made to the core kernels, like here.
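To make the trade-off being discussed concrete, a centralized GPU-layer test helper could look something like the following. This is a hypothetical sketch (gpu_layer_test is a made-up name, not the helper actually used in test/cuda/layers.jl): every layer gets the same forward and backward CPU-vs-GPU comparison.

using Flux, CUDA, Test

# Build the layer, run forward and backward on CPU and GPU, and compare.
function gpu_layer_test(build, xsize; atol=1f-3)
    m = build()
    x = rand(Float32, xsize...)
    m_gpu, x_gpu = gpu(m), gpu(x)

    # forward agreement
    @test collect(m_gpu(x_gpu)) ≈ m(x) atol=atol

    # gradient w.r.t. the input agrees
    gx = gradient(x -> sum(m(x)), x)[1]
    gx_gpu = gradient(x -> sum(m_gpu(x)), x_gpu)[1]
    @test collect(gx_gpu) ≈ gx atol=atol
end

gpu_layer_test(() -> BatchNorm(3), (8, 8, 3, 4))
gpu_layer_test(() -> InstanceNorm(3, affine=true), (8, 8, 3, 4))
gpu_layer_test(() -> GroupNorm(4, 2), (8, 8, 4, 4))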
I tested the pullback as well. The results are the following.
My opinion is that we shouldn't try to work around Zygote being stupid here, and possibly make it more efficient.

SCRIPT:

using Flux
using BenchmarkTools
using Random
using Zygote: pullback

function test_BN(shape)
    Random.seed!(17)
    println("# Shape $shape")
    x = rand(Float32, shape)
    bn = BatchNorm(shape[end-1])
    # println("### forward")
    # @btime $bn($x)
    println("forward in gradient context")
    @btime pullback(x -> sum(sin.($bn(x))), $x)
    y, back = pullback(x -> sum(sin.(bn(x))), x)
    println("pullback")
    @btime $back(1f0)
end

for shape in [(10,10),
              (100,100),
              (1000,1000),]
    test_BN(shape)
end

##### Master
# Shape (10, 10)
forward gradient context
35.807 μs (508 allocations: 32.97 KiB)
pullback
250.860 μs (714 allocations: 23.14 KiB)
# Shape (100, 100)
forward gradient context
207.208 μs (877 allocations: 431.64 KiB)
pullback
715.362 μs (30697 allocations: 999.34 KiB)
# Shape (1000, 1000)
forward gradient context
19.116 ms (5457 allocations: 38.32 MiB)
pullback
54.512 ms (3003397 allocations: 95.48 MiB)
##### This PR
# Shape (10, 10)
forward in gradient context
27.025 μs (400 allocations: 29.41 KiB)
pullback
216.878 μs (708 allocations: 23.17 KiB)
# Shape (100, 100)
forward in gradient context
192.794 μs (409 allocations: 419.22 KiB)
pullback
681.668 μs (30692 allocations: 1.01 MiB)
# Shape (1000, 1000)
forward in gradient context
19.810 ms (409 allocations: 38.21 MiB)
pullback
53.283 ms (3003392 allocations: 99.29 MiB)
### this PR w/o affine and track_stats branching
# Shape (10, 10)
forward in gradient context
27.070 μs (365 allocations: 27.00 KiB)
pullback
198.775 μs (693 allocations: 22.67 KiB)
# Shape (100, 100)
forward in gradient context
194.250 μs (374 allocations: 416.81 KiB)
pullback
661.045 μs (30677 allocations: 1.01 MiB)
# Shape (1000, 1000)
forward in gradient context
20.032 ms (374 allocations: 38.20 MiB)
pullback
54.227 ms (3003377 allocations: 99.29 MiB)
force-pushed from 0cf9956 to 9e53811
But doing it like this was almost an order of magnitude faster, which seems worth it.
Definitely worth it. It is a super intense time for me at work though; maybe I will be able to work more on this in February. The optimization you suggested takes some time to implement for all layers, and also requires separate methods for CuArrays if we want to use avx (which I think we should). So I suggest we just finish discussing the API (which I think is good as it is). This PR already has better performance than master, and most importantly it fixes many issues and adds many features. I can attempt to squeeze out more performance in a later PR if someone doesn't beat me to it. I don't want this to hold up the v0.12 release.
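The kind of CPU/CuArray split being referred to might look roughly like this. This is a hedged sketch under stated assumptions: scale_shift! is a made-up name, LoopVectorization is assumed for the CPU fast path (@avx is spelled @turbo in newer releases), and per-channel statistics are collapsed to scalars for brevity.

using LoopVectorization
using CUDA

# CPU method: explicit @avx loop over the elements.
function scale_shift!(y::Array{T}, x::Array{T}, μ, σ², γ, β, ϵ) where {T<:Real}
    @avx for i in eachindex(y)
        y[i] = γ * (x[i] - μ) / sqrt(σ² + ϵ) + β
    end
    return y
end

# CuArray method: fused broadcasting, which compiles to a single GPU kernel.
function scale_shift!(y::CuArray, x::CuArray, μ, σ², γ, β, ϵ)
    @. y = γ * (x - μ) / sqrt(σ² + ϵ) + β
    return y
end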
force-pushed from e6ec063 to 1d33958
@DhairyaLGandhi bump, let's do this.
Let me do a once-over. I still strongly feel the conditionals should be made generic, at least so Julia can produce better code, with the kwargs only as a convenience constructor on top.
I don't oppose this (although I think it will make things much more complicated than needed); I'm just saying that devising composable internals is something that can be done later and shouldn't stop us from merging this PR. Again, this PR fixes a lot of issues, adds a ton of features, and is faster than current master. It's been ready for more than a month; there is no point in not letting people benefit from work that has already been done. This is a very general concern of mine about Flux: it lacks a ton of features and it has many bugs. Those are show-stoppers for many people, and we should prioritize them over any performance, consistency, or API considerations if needed. And be fast in doing it.
Design decisions such as this cause issues down the line. While I appreciate the process of quick movement, it is my experience that doing it later either doesn't happen or is made harder or introduces a bunch of hacks.
Another rebase to be done and another 10 days wasted for no reason. @DhairyaLGandhi could you please respect the work I've done here and stop blocking this PR?
It's been a busy week; I'm sure you have seen me sufficiently active on this project. I haven't intentionally done anything to disrespect your work or block any PR, and I'd apologise if I did. I can do the rebase myself if you're not up to it. Rebasing is a part of a moving library: if your PR needs to be rebased (in the Manifest and NEWS of all things), it means other things have gone in.
force-pushed from 89682d6 to ff18866
Well, part of the motivation to PR instead of commit/suggest was that we have to make it easier to remove much of the kwarg handling and lean on the compiler and our composition to take over much of the heavy lifting, which this currently does not. So I would want that side of the design to be addressed before merging.
I don't see any interface change in #1495, just tiny changes to the implementation, nor has there been any sensible objection to the interface proposed in the 3 months this PR has been open. What I see instead is several bugs and missing features that could have been addressed 2 months ago by merging this. There is no reason why #1495 should block this PR; please let's merge it. I will be happy to review #1495 and merge it to master whenever it's ready.
The change is precisely having a way away from the kwarg syntax. Constant breakages, supporting this for 2 months only to remove it again, is what I have proposed to avoid and have been ignored on. I agree the bugs are annoying, but this would have been breaking either way, only to be broken again.
Please don't misunderstand me, I am not trying to block this for no reason, but I expect that code quality standards be put in place. You agree that composable interfaces and internals are important, right? Let's do this: let's merge this to master and focus on #1495 immediately and on steps forward, but release only after some of the comments here have been addressed.
bors r+
Build succeeded:
The interface proposed here is

bn = BatchNorm(64, affine=true, track_stats=true)
bn(x)

and this is a nice and retro-compatible interface. The only technically breaking change here is a change in the default behavior of
which would be wildly breaking and also quite ugly and inconvenient.
I don't think that was what was proposed at all; what was proposed is that we are able to determine the correct forward pass at compile time, with little to no branching, and a clear structure to add new ones:

function (sn::SomeNorm)(x)
    calc_shapes(...)
    track_stats(...)
    affine(...)
end

In this case, the API for a bunch of these layers would break (using
No need to break: the exposed interface should be a Bool (users shouldn't have to deal with Val); the constructor can convert and parameterize with Val. Anyways, finishing #1495 before v0.12 is certainly desirable.
That's just one example; we might want to not have the types grow larger, since that can cause compile-time regressions, cache misses, or fail inlining by mistake. These are TBD, but that work has to be released together to avoid constant breakages.
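A minimal sketch of the compile-time idea discussed above, under the assumption that a Bool type parameter (which plays the same role as Val) is acceptable: the switch lives in the type so the forward pass specializes at compile time, while users keep passing a plain Bool to a convenience constructor. MyNorm is a hypothetical name, not the implementation in this PR.

using Statistics: mean, var

# The affine switch is a type parameter.
struct MyNorm{affine, T}
    γ::T
    β::T
    ϵ::Float32
end

# Convenience constructor: users pass a Bool, the type stores it as a parameter.
MyNorm(n::Integer; affine::Bool=true, ϵ=1f-5) =
    MyNorm{affine, Vector{Float32}}(ones(Float32, n), zeros(Float32, n), ϵ)

# Input is (features, batch); statistics are taken over the batch dimension.
# The branch on `affine` is resolved by the compiler, not at run time.
function (m::MyNorm{affine})(x::AbstractMatrix) where {affine}
    μ = mean(x; dims=2)
    σ² = var(x; dims=2, corrected=false)
    x̂ = (x .- μ) ./ sqrt.(σ² .+ m.ϵ)
    return affine ? m.γ .* x̂ .+ m.β : x̂
end

bn = MyNorm(64; affine=false)    # Bool keyword on the outside, type parameter inside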
This goes in the direction of creating a powerful and consistent experience for the normalization layers.
- Add affine to LayerNorm, GroupNorm, InstanceNorm, and BatchNorm to activate/deactivate a learnable shift and scaling.
- Add track_stats to InstanceNorm and BatchNorm to activate/deactivate running-time mean and variance tracking, to be used in the evaluation phase.
- Set defaults for track_stats and affine, since AFAIK they correspond to common usage and best practice:
  - affine=false and track_stats=false. This breaks the current behaviour (both true).
  - affine=true and track_stats=true, as we are currently doing.
  - affine=true and track_stats=false.
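Concretely, the keywords described above would be used like this. A usage sketch, assuming the keyword constructors land as listed; the keywords are set explicitly so the example does not depend on the per-layer defaults.

using Flux

bn    = BatchNorm(64, affine=true, track_stats=true)
inorm = InstanceNorm(32, affine=true, track_stats=true)   # opt in explicitly
gn    = GroupNorm(32, 4, affine=true)                     # 32 channels, 4 groups
ln    = LayerNorm(64, affine=true)

x = rand(Float32, 64, 16)    # (features, batch)
y = bn(x)                    # same size, normalized over the batch dimension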
PR Checklist
@dhairyagandhi96 (for API changes).