Rework normalization layers #1397

Merged: 1 commit into master on Feb 4, 2021

Conversation

CarloLucibello
Member

@CarloLucibello commented Nov 16, 2020

This goes in the direction of creating a powerful and consistent experience for the normalization layers.

PR Checklist

  • Tests are added
  • Entry in NEWS.md
  • Documentation, if applicable
  • Final review from @dhairyagandhi96 (for API changes).

@CarloLucibello changed the title from "Rework normalization layers" to "[WIP] Rework normalization layers" on Nov 16, 2020
@CarloLucibello added this to the v0.12 milestone on Nov 27, 2020
@CarloLucibello changed the title from "[WIP] Rework normalization layers" to "Rework normalization layers" on Nov 29, 2020
@CarloLucibello
Member Author

@DhairyaLGandhi bump

Member

@DhairyaLGandhi left a comment


In general it's alright to have extra steps available for different norm layers, but not to have to check everything every time for every layer, I think. Otherwise we end up with one large function that should instead be multiple smaller functions, combined appropriately for each layer.

@CarloLucibello
Member Author

In general it's alright to have extra steps available for different norm layers, but not to have to check everything every time for every layer, I think. Otherwise we end up with one large function that should instead be multiple smaller functions, combined appropriately for each layer.

Every layer has affine, track_stats, and isactive options. While the isactive check has to be performed at runtime, we could maybe rely on type parametrization and dispatch for affine and track_stats (if this is what you mean, and assuming we don't want to change e.g. track_stats at some point during execution). Honestly, I don't think removing a couple of `if`s is worth the extra complexity involved, nor do I believe there would be any performance benefit. Can you mock up the alternative so I can benchmark it?
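
For reference, a minimal sketch of the runtime-Bool branching being discussed: one `if` per optional step. The names (`norm_forward`, `batch_stats`) and the field layout are illustrative only, not the actual Flux internals.

using Statistics: mean, var

# statistics computed from the current batch (last dim = batch for a matrix input)
batch_stats(x) = (mean(x; dims=2), var(x; dims=2, corrected=false))

function norm_forward(l, x; training::Bool=true)
    # one runtime branch: use tracked running statistics or recompute from the batch
    μ, σ² = (l.track_stats && !training) ? (l.μ, l.σ²) : batch_stats(x)
    x̂ = (x .- μ) ./ sqrt.(σ² .+ l.ϵ)
    # and another branch for the optional learnable scale/shift
    return l.affine ? (l.γ .* x̂ .+ l.β) : x̂
end

# e.g. l = (affine=true, track_stats=false, μ=nothing, σ²=nothing,
#           γ=ones(Float32, 10), β=zeros(Float32, 10), ϵ=1f-5)
# norm_forward(l, rand(Float32, 10, 32))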

@johnnychen94
Contributor

fix InstanceNorm previously not working on gpu

This will fix #1195 if I get it right?

@CarloLucibello
Member Author

This will fix #1195 if I get it right?

Yes, this is tested in test/cuda/layers.jl now that InstanceNorm is no longer on the list of broken layers.

@CarloLucibello
Member Author

I've also updated GroupNorm and I've done an extensive search of the issues fixed by this PR

Member

@DhairyaLGandhi left a comment


I'm still a bit uncomfortable with having affine switches in the layers; do we have a better way around this? Perhaps users would be better suited to add that as a layer in the model.

@CarloLucibello
Member Author

I'm still a bit uncomfortable with having affine switches in the layers; do we have a better way around this? Perhaps users would be better suited to add that as a layer in the model.

I don't see how to do it in a way that doesn't involve a branch somewhere. And we really don't want the user to have to manually add an Affine layer after BatchNorm.

PyTorch has the same interface; it can't be that bad if we do the same.
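
For illustration, here is roughly what the two options look like side by side. The keyword form matches the interface discussed in this PR; `Affine` is a hypothetical standalone layer that does not exist in Flux.

using Flux

# keyword switches on the norm layer itself (the approach taken in this PR):
m = Chain(Dense(784, 64), BatchNorm(64; affine=true, track_stats=true), Dense(64, 10))

# the alternative being argued against: the user must remember to add a separate
# (hypothetical) Affine layer after every norm layer that should learn a scale/shift:
# m = Chain(Dense(784, 64), BatchNorm(64; affine=false), Affine(64), Dense(64, 10))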

@CarloLucibello
Member Author

CarloLucibello commented Dec 30, 2020

I did some benchmarking using the script below.
The script was executed 3 times on the current PR
and another 3 times on similar code where I removed the track_stats and affine branching
(so I only keep the path where both are true).

Fortunately, there is no appreciable performance difference for any system size.

SCRIPT:

using Flux
using BenchmarkTools
using Random 

function test_BN(shape)
    Random.seed!(17)
    x = rand(Float32, shape)
    bn = BatchNorm(shape[end-1])  # channel dimension is the second-to-last dim of x
    @btime $bn($x)                # time the forward pass only
end

for shape in [  (10,10),
                (100,100),
                (1000,1000),]
    test_BN(shape)
end

OUTPUT:

# WITHOUT track_stats and affine branching

  4.427 μs (41 allocations: 2.19 KiB)
  38.264 μs (42 allocations: 40.84 KiB)
  3.552 ms (42 allocations: 3.82 MiB)

  4.402 μs (41 allocations: 2.19 KiB)
  38.231 μs (42 allocations: 40.84 KiB)
  3.336 ms (42 allocations: 3.82 MiB)

  4.407 μs (41 allocations: 2.19 KiB)
  37.572 μs (42 allocations: 40.84 KiB)
  3.476 ms (42 allocations: 3.82 MiB)


# WITH track_stats and affine branching

  4.538 μs (41 allocations: 2.19 KiB)
  38.196 μs (42 allocations: 40.84 KiB)
  3.354 ms (42 allocations: 3.82 MiB)

  4.472 μs (41 allocations: 2.19 KiB)
  40.386 μs (42 allocations: 40.84 KiB)
  3.549 ms (42 allocations: 3.82 MiB)

  4.357 μs (41 allocations: 2.19 KiB)
  38.216 μs (42 allocations: 40.84 KiB)
  3.580 ms (42 allocations: 3.82 MiB)

@CarloLucibello
Member Author

@DhairyaLGandhi bump

@DhairyaLGandhi
Member

Could you also check the backwards pass?

Having an affine layer seems fine to me, but I'm open to thoughts around construction

As in other test cases on the GPU, I'd much prefer if we didn't have a separate function to go to. Using it to test layers adds an extra function to trace errors through. We should have the tests be very clear about what they are testing, and it's best for that to be clear at the site where the test is performed.

@CarloLucibello
Member Author

Could you also check the backwards pass?

will do

Having an affine layer seems fine to me, but I'm open to thoughts around construction

I'm strongly against having a separate affine layer; people shouldn't have to insert an Affine every time they insert a BatchNorm, it's extremely annoying, and this is why the default for BatchNorm is affine=true. Let's just follow PyTorch's approach here, no harm in it.

As in other test cases on the GPU, I'd much prefer if we didn't have a separate function to go to. Using it to test layers adds an extra function to trace errors through. We should have the tests be very clear about what they are testing, and it's best for that to be clear at the site where the test is performed.

Having a centralized testing function ensures exhaustive and consistent testing, and makes it very clear what is being tested, since it is the same thing everywhere. Having heterogeneous and likely less extensive "manual testing" is strictly worse, I think: for each set of tests you have to think about what you are capturing and what you are missing. Going through an extra function call doesn't seem like a problem at all. ChainRules provides a whole package of rule-testing functions, ChainRulesTestUtils; we can have a couple of testing functions ourselves (in fact we do, in Zygote and now in NNlib as well).
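
As a sketch of what such a centralized helper can look like (this is an illustration, not the actual function in test/cuda/layers.jl; the name `gpu_layer_test` is made up), every layer goes through the same forward and gradient checks:

using Flux, CUDA, Test
using Zygote: gradient

function gpu_layer_test(name, layer_constructors, x, args...; atol=1f-4)
    @testset "$name: GPU forward and gradient" begin
        for make_layer in layer_constructors
            l_cpu = make_layer(args...)
            l_gpu = l_cpu |> gpu
            x_gpu = x |> gpu

            # forward passes must agree between CPU and GPU
            @test collect(l_gpu(x_gpu)) ≈ l_cpu(x) atol=atol

            # input gradients must agree as well
            g_cpu = gradient(z -> sum(l_cpu(z)), x)[1]
            g_gpu = gradient(z -> sum(l_gpu(z)), x_gpu)[1]
            @test collect(g_gpu) ≈ g_cpu atol=atol
        end
    end
end

# e.g. gpu_layer_test("norm layers", [BatchNorm, InstanceNorm], rand(Float32, 8, 8, 4, 2), 4)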

@DhairyaLGandhi
Member

Where you have to test for consistent output, sure, but when you have generic tests like we do with the GPU layers, clarity is better IMO. Thinking about the missing bits is good when you have to test for changes you have made to the core kernels, like here.

@CarloLucibello
Member Author

CarloLucibello commented Jan 6, 2021

I tested the pullback as well. The results are the following:

  • master, this PR, and this PR w/o branching have the same performance for n=1000
  • master's pullback is the slowest for n=10 and n=100
  • removing branching gains ~8% at n=10 and ~3% at n=100 in the pullback compared to this PR
  • the pullbacks for master, this PR, and this PR w/o branching all do a very large number of allocations, something we should investigate (in the future)

My opinion is that we shouldn't try to work around Zygote's inefficiency here; rather, we should possibly make Zygote itself more efficient. In any case, the performance gap is small, so we should keep the PR as it is (which is a great improvement over master, both in terms of performance and features, and it fixes many bugs).

using Flux
using BenchmarkTools
using Random 
using Zygote: pullback

function test_BN(shape)
    Random.seed!(17)
    println("# Shape $shape")
    x = rand(Float32, shape)
    bn = BatchNorm(shape[end-1])
    # println("### forward")
    # @btime $bn($x)
    println("forward in gradient context")
    @btime pullback(x -> sum(sin.($bn(x))), $x)
    y, back = pullback(x -> sum(sin.(bn(x))), x)
    println("pullback")
    @btime $back(1f0)
end

for shape in [  (10,10),
                (100,100),
                (1000,1000),]
    test_BN(shape)
end
##### Master 

# Shape (10, 10)
forward gradient context
  35.807 μs (508 allocations: 32.97 KiB)
pullback
  250.860 μs (714 allocations: 23.14 KiB)
# Shape (100, 100)
forward gradient context
  207.208 μs (877 allocations: 431.64 KiB)
pullback
  715.362 μs (30697 allocations: 999.34 KiB)
# Shape (1000, 1000)
forward gradient context
  19.116 ms (5457 allocations: 38.32 MiB)
pullback
  54.512 ms (3003397 allocations: 95.48 MiB)

##### This PR

 # Shape (10, 10)
forward in gradient context
  27.025 μs (400 allocations: 29.41 KiB)
pullback
  216.878 μs (708 allocations: 23.17 KiB)
# Shape (100, 100)
forward in gradient context
  192.794 μs (409 allocations: 419.22 KiB)
pullback
  681.668 μs (30692 allocations: 1.01 MiB)
# Shape (1000, 1000)
forward in gradient context
  19.810 ms (409 allocations: 38.21 MiB)
pullback
  53.283 ms (3003392 allocations: 99.29 MiB)

##### This PR w/o affine and track_stats branching

# Shape (10, 10)
forward in gradient context
  27.070 μs (365 allocations: 27.00 KiB)
pullback
  198.775 μs (693 allocations: 22.67 KiB)
# Shape (100, 100)
forward in gradient context
  194.250 μs (374 allocations: 416.81 KiB)
pullback
  661.045 μs (30677 allocations: 1.01 MiB)
# Shape (1000, 1000)
forward in gradient context
  20.032 ms (374 allocations: 38.20 MiB)
pullback
  54.227 ms (3003377 allocations: 99.29 MiB)

@DhairyaLGandhi
Member

But doing it like this was almost an order of magnitude faster, which seems worth it

@CarloLucibello
Member Author

Definitely worth it. It is a super intense time for me at work though; maybe I will be able to work more on this in February. The optimization you suggested takes some time to implement for all layers, and it also requires separate methods for CuArrays if we want to use AVX (which I think we should). So I suggest we just finish discussing the API (which I think is good as it is). This PR already has better performance than master, and most importantly it fixes many issues and adds many features. I can attempt to squeeze out more performance in a later PR if someone doesn't beat me to it. I don't want this to hold up the v0.12 release.

@CarloLucibello
Member Author

@DhairyaLGandhi bump, let's do this

@DhairyaLGandhi
Member

Let me do a once-over. I still strongly feel the conditionals should be made generic, at least so Julia can produce better code, with the kwargs only as a convenience constructor on top.

@CarloLucibello
Member Author

Let me do a once-over. I still strongly feel the conditionals should be made generic, at least so Julia can produce better code, with the kwargs only as a convenience constructor on top.

I don't oppose this (although I think it will make things much more complicated than needed); I'm just saying that devising composable internals is something that can be done later and shouldn't stop us from merging this PR. Again, this PR fixes a lot of issues, adds a ton of features, and is faster than current master. It's been ready for more than a month; there is no point in not letting people benefit from work that has already been done. This is a very general concern of mine about Flux: it lacks a ton of features and it has many bugs. Those are show-stoppers for many people, and we should prioritize them over any performance, consistency, or API considerations if needed. And be fast in doing it.

@DhairyaLGandhi
Member

devising composable internals is something that can be done later and shouldn't stop us from merging this PR.

Design decisions such as this cause issues down the line. While I appreciate the value of moving quickly, it is my experience that doing it later either doesn't happen, or is made harder, or introduces a bunch of hacks.

@CarloLucibello
Member Author

Another rebase to be done and another 10 days wasted for no reason. @DhairyaLGandhi, could you please respect the work I've done here and stop blocking this PR?

@DhairyaLGandhi
Member

It's been a busy week; I'm sure you have seen me sufficiently active on this project. I haven't intentionally done anything to disrespect your work or block any PR, and I'd apologise if I did.

I can do the rebase myself if you're not up to it. Rebasing is part of working on a moving library; if your PR needs to be rebased (in the Manifest and NEWS, of all things), it means other things have gone in.

@DhairyaLGandhi mentioned this pull request Feb 3, 2021
@CarloLucibello
Member Author

Squashed and rebased. I don't see anything in #1495 that should block this PR. Can we merge this now and merge #1495 whenever it is ready?

@DhairyaLGandhi
Member

Well, part of the motivation for opening a PR instead of committing directly or leaving suggestions was that we have to make it easier to remove much of the kwarg handling and lean on the compiler and our composition to take over much of the heavy lifting, which this currently does not do. So I would want that side of the design to be addressed before merging.

@CarloLucibello
Member Author

I don't see any interface change in #1495, just tiny changes to the implementation, nor has there been any sensible objection to the interface proposed in the 3 months this PR has been open. What I see instead is several bugs and missing features that could have been addressed 2 months ago by merging this. There is no reason why #1495 should block this PR; please, let's merge it. I will be happy to review #1495 and merge it to master whenever it is ready.

@DhairyaLGandhi
Member

The change is precisely having a way away from the kwarg syntax; supporting this for 2 months only to remove it again means constant breakage, which is the point I have raised and been ignored on. I agree the bugs are annoying, but this would have been breaking either way, only to be broken again.

@DhairyaLGandhi
Member

DhairyaLGandhi commented Feb 4, 2021

Please don't misunderstand me; I am not trying to block this for no reason, but I expect that code quality standards be put in place. You agree that composable interfaces and internals are important, right?

Let's do this: merge this to master, focus on #1495 and the next steps immediately, but release only after some of the comments here have been addressed.

@DhairyaLGandhi
Member

bors r+

@bors
Contributor

bors bot commented Feb 4, 2021

Build succeeded.

bors bot merged commit 7e9a180 into master Feb 4, 2021
@CarloLucibello
Member Author

The interface proposed here is

bn = BatchNorm(64, affine=true, track_stats=true)
bn(x)

and this is a nice and backward-compatible interface. The only technically breaking change here is a change in the default behavior of InstanceNorm, which is a fix in practice, because it now becomes equivalent to what other DL frameworks do and what the paper itself suggests.
I don't know what you mean by a composable interface, but we certainly don't want something like

bn = Affine(TrackStats(BatchNorm(64)))

which would be wildly breaking and also quite ugly and inconvenient.
The whole discussion in this PR was about having less branching and more composition in the inner implementation, and I'm totally fine with that; it just can be done elsewhere, because it's not breaking and this PR already does a lot. In fact, #1495 is not breaking with respect to this PR. I like the direction of #1495, by the way.

@DhairyaLGandhi
Member

I don't think that was what was proposed at all; what was proposed is that we be able to determine the correct forward pass at compile time, with little to no branching, and with a clear structure for adding new layers.

function (sn::SomeNorm)(x)
  calc_shapes(...)
  track_stats(...)
  affine(...)
end

In this case, the API for a bunch of these layers would break (using Val for Bools, for example).
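
For concreteness, a minimal sketch of that idea; everything here (`SimpleNorm`, `norm_stats`, `affine_transform`) is hypothetical and not Flux API. The Bool switches become type parameters, so each step is a small function and the right method is selected at compile time with no runtime branch:

using Statistics: mean, var

struct SimpleNorm{A, T, V<:AbstractVector}   # A = affine?, T = track_stats? (Bool type parameters)
    γ::V
    β::V
    ϵ::Float32
end

SimpleNorm(n::Integer; affine::Bool=true, track_stats::Bool=true) =
    SimpleNorm{affine, track_stats, Vector{Float32}}(ones(Float32, n), zeros(Float32, n), 1f-5)

# statistics step (always computed from the batch here, for brevity)
norm_stats(x) = (mean(x; dims=2), var(x; dims=2, corrected=false))

# affine step, selected by dispatch on the `affine` type parameter
affine_transform(sn::SimpleNorm{true},  x̂) = sn.γ .* x̂ .+ sn.β
affine_transform(sn::SimpleNorm{false}, x̂) = x̂

function (sn::SimpleNorm)(x::AbstractMatrix)
    μ, σ² = norm_stats(x)
    x̂ = (x .- μ) ./ sqrt.(σ² .+ sn.ϵ)
    return affine_transform(sn, x̂)
end

# SimpleNorm(3; affine=false)(rand(Float32, 3, 5))  # the no-scale/shift path, chosen statically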

@CarloLucibello
Member Author

In this case, the API for a bunch of these layers would break (using Val for Bools, for example).

There's no need to break anything: the exposed interface should take a Bool (users shouldn't have to deal with Val), and the constructor can convert it and parameterize with Val. Anyway, finishing #1495 before v0.12 is certainly desirable.
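
A tiny illustration of that point, with a hypothetical `MyNorm`: the user-facing constructor takes a plain Bool and only internally turns it into a Val, so Val never appears in user code.

struct MyNorm{A}
    affine::A           # Val(true) or Val(false), fixed at construction time
end
MyNorm(; affine::Bool=true) = MyNorm(Val(affine))

MyNorm(affine=false)    # users write a Bool; internal methods can dispatch on ::Val{false}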

@DhairyaLGandhi
Member

That's just one example; we might also want to keep the types from growing larger, since that can cause compile-time regressions, cache misses, or accidentally failed inlining. These are TBD, but that work has to be released together to avoid constant breakage.
