Add dropout #454
Conversation
One trick for GPU is to save the RNG state and lazily generate the mask in a kernel instead of materializing it (see https://triton-lang.org/master/getting-started/tutorials/04-low-memory-dropout.html for an implementation). This works because the default RNG is counter-based and doesn't have any dependence on previous states. Wikipedia tells me Xoshiro does not fall into this category, but we'd likely be doing a linear traversal on CPU and allocations are cheap(er) anyhow. All this to say that as long as things aren't slower on GPU, we can always optimize them separately in NNlibCUDA.
Ah, that's an idea I did not consider; it would probably pay off on CPU too. Copying Xoshiro is certainly fast and appears to work. It's just some bits... where do you read that it has other state?
What I'm not sure of is whether … A quick attempt is not quite as fast, but close, and saves memory.
Edit: Here's a test with CUDA, which seems to reliably repeat on small sizes, but fails on large arrays. I presume that means it's dividing the work up in a non-deterministic way. (Maybe there are ways to control that?)
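To make the copy-the-RNG idea concrete, here is a minimal CPU sketch, assuming a plain Xoshiro and elementwise dropout; the name dropout_with_replay and the overall shape are illustrative, not this PR's code:

using Random

# Snapshot the RNG before drawing the mask, then replay the snapshot in the
# pullback to rebuild the same mask instead of storing it.
function dropout_with_replay(rng::Xoshiro, x::AbstractArray, p::Real)
    val = 1 / (1 - p)
    saved = copy(rng)                    # cheap copy of the Xoshiro state
    keep = rand!(rng, similar(x))        # advances rng; saved is untouched
    y = @. (keep > p) * x * val
    pullback(dy) = begin
        keep2 = rand!(copy(saved), similar(x))   # same stream, same mask
        @. (keep2 > p) * dy * val
    end
    return y, pullback
end

# Usage:
# y, back = dropout_with_replay(Xoshiro(1), randn(Float32, 4, 3), 0.5f0)
# dx = back(ones(Float32, 4, 3))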
I was going off https://en.wikipedia.org/wiki/Counter-based_random_number_generator_(CBRNG)#Background. It's not that it has other state, but that you can't get the Nth random value without first computing the earlier values to advance the internal state. Whereas for the Philox RNG CUDA.jl uses, you just set the counter to however many steps forward you want and it gives you a value.
This was my concern as well. We could always use dispatch to have a fallback path for unknown RNGs that materializes the mask.
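A rough sketch of that dispatch fallback, with hypothetical helper names rather than NNlib's actual API: RNGs we trust to snapshot take the replay path, anything else materializes the mask and keeps it for the pullback.

using Random

_replayable(::Xoshiro) = true
_replayable(::MersenneTwister) = true    # works, though copying it is slower
_replayable(::AbstractRNG) = false       # unknown RNGs: be conservative

function dropout_forward_state(rng::AbstractRNG, x::AbstractArray, p::Real)
    val = 1 / (1 - p)
    snapshot = _replayable(rng) ? copy(rng) : nothing   # taken before rand!
    keep = rand!(rng, similar(x, float(eltype(x)))) .> p
    y = @. keep * x * val
    # The pullback rebuilds `keep` from `snapshot` when available, and
    # otherwise falls back to the materialized `keep` array itself.
    return y, (snapshot === nothing ? keep : snapshot)
end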
src/dropout.jl
Outdated
    similar(A, T, ntuple(d -> d in dims ? size(A,d) : 1, ndims(A)))
end
rand!(rng, keep)
Y = @. (keep>p) * A * val
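For context, a rough self-contained reconstruction of what the excerpt above appears to be doing; the signature and surrounding names are guesses, not the PR's actual code. The mask buffer keep gets singleton sizes on the non-dropout dimensions, so broadcasting expands it over A.

using Random

function dropout_sketch(rng::AbstractRNG, A::AbstractArray{T}, p; dims=:) where {T}
    val = convert(float(T), 1 / (1 - p))
    keep = if dims === (:)
        similar(A, float(T))             # mask the same size as A
    else
        similar(A, float(T), ntuple(d -> d in dims ? size(A,d) : 1, ndims(A)))
    end
    rand!(rng, keep)
    Y = @. (keep > p) * A * val          # singleton dims broadcast over A
    return Y
end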
If the mask idea doesn't work out, another idea for making this type stable while still allowing short-circuiting on 0 and 1 is to make a lazy mask wrapper which only creates the mask if necessary. Depending on how devious we want to be, we can even use `promote_op` or a custom `mask_type(xtype::Type{A}, ...) where A <: AbstractArray` to determine what type the wrapped mask should be (since we can't determine it in the short-circuiting case).
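Very roughly, that wrapper could look like the sketch below; all names are hypothetical and the type computation is only indicated, but it shows how the p == 0 / p == 1 branches could stay type stable without ever allocating a mask.

mutable struct LazyMask{M<:AbstractArray,F}
    mask::Union{Nothing,M}   # nothing until someone actually needs it
    make::F                  # closure that materializes the mask on demand
end

LazyMask{M}(make) where {M} = LazyMask{M,typeof(make)}(nothing, make)

function materialize!(lm::LazyMask)
    if lm.mask === nothing
        lm.mask = lm.make()
    end
    return lm.mask
end

# `M` would be computed up front, e.g. via Base.promote_op or a custom
# mask_type(::Type{<:AbstractArray}, ...), so a short-circuiting branch can
# construct LazyMask{M}(...) without ever running the mask code.
# Usage: lm = LazyMask{BitMatrix}(() -> rand(3, 3) .> 0.5); materialize!(lm)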
Ah, but that's fine, I think. We don't need to skip into the future. We only need the two copies used forward and backward to produce the same values, starting from whatever state we happen to be in.
Oh man, I'd be quite keen not to have to maintain two paths. (Already the no-gradient and gradient versions differ a bit here...)
Yes, the skipping is more relevant on GPU. IIRC there's a way to split a … On divergent paths, I don't think it's too bad. The dispatch split point could be on what Flux calls …
I think we should do this, it's a step forward. The copy-the-RNG story does work, and saves memory on the GPU. But it needs some constructor methods, and perhaps consultation with experts about safety. And copying MersenneTwister on 1.6 is pretty slow... so I think we should kick that can down the road. No chance that changes the interface. This saves memory & time over what's now in Flux, and will let us remove that.
No objections from me. All the aforementioned tricks can be done as additive or at least non-breaking changes in future PRs.
""" | ||
_rng_from_array(x) | ||
|
||
Return the random number generator most appropriate for `x`: | ||
`CUDA.default_rng()` for `CuArray`, else `Random.default_rng()` | ||
""" | ||
_rng_from_array(::AbstractArray) = Random.default_rng() | ||
|
||
@non_differentiable _rng_from_array(::Any) |
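For completeness, the CuArray method mentioned in the docstring would presumably live in NNlibCUDA; a sketch of that overload, not the package's actual code:

# in NNlibCUDA, roughly:
NNlib._rng_from_array(::CUDA.CuArray) = CUDA.default_rng()

# so callers can pick a generator from the array type:
rng = NNlib._rng_from_array(x)   # Random.default_rng() on CPU, CUDA's RNG on GPU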
Any reason not to copy https://github.com/FluxML/Flux.jl/blob/ee78ce3cefb027228413f8edace3c0385139d786/src/utils.jl#L36-L49 wholesale (minus the CUDA overload)?
Just simplifying a bit; I couldn't figure out why there were so many functions. Why make a different choice on 1.6?
julia> using Random
julia> Random.default_rng()
MersenneTwister(0x9687b6121c4ccb062f473c9c3c8bccc6)
julia> Random.GLOBAL_RNG
Random._GLOBAL_RNG()
julia> VERSION
v"1.6.0"
compared to master:
julia> using Random
julia> Random.default_rng()
TaskLocalRNG()
julia> Random.GLOBAL_RNG
Random._GLOBAL_RNG()
julia> VERSION
v"1.10.0-DEV.204"
I don't remember now, but based on FluxML/Flux.jl#1849 (comment) it might've been related to thread safety?
Cthulhu tells me that `rand(...)` uses `default_rng()` on 1.6 as well, and it returns a thread-local RNG, so maybe this was much ado about nothing. cc @darsnack if I've missed something though, and I think this function can be public like the Flux one.
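For what it's worth, one quick way to check that is to compare the RNG across threads (assuming Julia was started with more than one thread); on 1.6 each thread should report a different MersenneTwister, while on 1.7+ every task reports the TaskLocalRNG() singleton:

julia> using Random

julia> Threads.@threads for i in 1:Threads.nthreads()
           @show Threads.threadid() typeof(Random.default_rng()) objectid(Random.default_rng())
       end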
This adds a `dropout` function. At least on CPU, this should be about twice as fast as the one in Flux, and use less memory. There are some Flux PRs, but the code is messy, and I got lost & started from scratch.
It's not quite the fastest variant. The gradient can be faster if we believe that every zero in the output comes from the mask, never from the input. That may not be a terrible assumption -- if you don't have a `relu` before it, zeros are rare; if you do, then the gradient associated with a zero will be discarded anyway. But... perhaps it's best to be correct.
I can put a gist of these variations somewhere. But local benchmarks are:
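To spell out the shortcut being declined here, as illustrative one-liners rather than the PR's code: the fast variant reconstructs the mask from the zeros of the output, while the safe one needs the stored (or replayed) mask.

# fast, but wrong wherever the input itself had a zero that survived the mask:
fast_pullback(dy, y, val)       = @. (y != 0) * dy * val
# safe: uses the actual mask
safe_pullback(dy, keep, p, val) = @. (keep > p) * dy * val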
Edit: timing things on the GPU, at first the `dims=:` case seemed slow, but now it's not...