Faster generic broadcasting #1001
dxs = ntuple(len) do i
    collapse_nothings(map(StaticGetter{i}(), dxs_zip))
end
(nothing, accum_sum(dxs[1]), map(unbroadcast, args, Base.tail(dxs))...)
Clearly map(StaticGetter{i}(), dxs_zip) should really be fused with unbroadcast, possibly into mapreduce.
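To make the suggestion concrete, a hypothetical sketch of that kind of fusion (helper names are mine, not Zygote internals; the real reduction would go through accum_sum / unbroadcast rather than +):

# Two passes: materialise the array of i-th components, then reduce it.
two_pass(dxs_zip, i) = reduce(+, map(t -> t[i], dxs_zip))

# One pass via mapreduce: extract and reduce together, no intermediate array.
one_pass(dxs_zip, i) = mapreduce(t -> t[i], +, dxs_zip)

two_pass([(1, 10), (2, 20), (3, 30)], 2) == one_pass([(1, 10), (2, 20), (3, 30)], 2)  # both give 60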
Some comments but basically seems all sensible.
Once tests pass, merge when happy.
src/lib/broadcast.jl
eltype(out) <: Dual || return (out, _ -> nothing)
y = map(x -> x.value, out)
_back(ȳ, i) = unbroadcast(args[i], ((a, b) -> a*b.partials[i]).(ȳ, out))
back(ȳ) = ntuple(i -> _back(ȳ, i), N)
_back(ȳ, geti) = unbroadcast(geti(args), ((a, b) -> a * geti(b.partials)).(ȳ, out))
back(ȳ) = ntuple(i -> _back(ȳ, StaticGetter{i}()), N)
Change for CuArrays, not sure this matters. CI doesn't seem to mind. Pushed to try out on another machine, a bit later.
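For context, a rough sketch of what a StaticGetter-style extractor does (a sketch only, with a made-up name; see Zygote's source for the real definition): putting the index in the type domain avoids closing over i, which tends to suit GPU kernel compilation better.

# Sketch of a StaticGetter-like callable: the index lives in the type, so the
# object itself is a singleton carrying no captured data.
struct MyStaticGetter{i} end
(::MyStaticGetter{i})(v) where {i} = v[i]

map(MyStaticGetter{2}(), [(1, 10), (2, 20)])  # [10, 20]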
Some GPU times: no harm, but no help either.
julia> xs = cu(randn(100_000));
julia> @btime CUDA.@sync Zygote.gradient(x -> sum(cbrt.(x)), $xs);
119.805 μs (375 allocations: 10.56 KiB)
119.872 μs (377 allocations: 10.61 KiB) # this PR
julia> @btime CUDA.@sync Zygote.gradient(x -> sum(tanh.(x)), $xs); # has its own gradient
99.938 μs (342 allocations: 10.81 KiB)
julia> @btime CUDA.@sync Zygote.gradient(x -> sum((identity∘tanh).(x)), $xs); # force generic version
116.577 μs (372 allocations: 10.52 KiB)
117.023 μs (371 allocations: 10.52 KiB) # this PR
Also, the generic is as fast as the specific. Remind me why we don't do this on the CPU, and have this 300x slower generic thing?
I know nothing about how Zygote works with the GPU, but taking a guess: because moving memory between GPU/VRAM and CPU/RAM is even more expensive?
I don't quite know why the generic broadcast isn't used for the GPU, perhaps something would break. It uses dual numbers instead, and perhaps this makes purity assumptions about the function involved?
If we very simply call that on the CPU, we get times like this:
julia> @btime Zygote.gradient(x -> sum(abs.(x)), $xs); # earlier 478.458 μs -> 28.625 μs
15.209 μs (20 allocations: 313.16 KiB)
julia> @btime Zygote.gradient(x -> sum((identity∘tanh).(x)), $xs); # earlier 29.680 ms
125.166 μs (20 allocations: 313.16 KiB)
This is with 9227168, which might be too crude... will think a bit & see what CI says.
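For reference, a minimal sketch of the dual-number trick (my own illustration with made-up names, not Zygote's actual implementation):

using ForwardDiff: Dual, value, partials

# Broadcasting f over Dual(x, 1) computes the value and the elementwise derivative
# in one fused pass, with no per-element pullback closures to store.
function dual_broadcast(f, xs::AbstractArray{<:Real})
    ys = f.(Dual.(xs, one.(xs)))
    map(value, ys), map(y -> partials(y, 1), ys)
end

dual_broadcast(abs, [-1.0, 2.0, -3.0])  # ([1.0, 2.0, 3.0], [-1.0, 1.0, -1.0])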
Here's what it gets wrong:
julia> gradient((x,y) -> sum((z->z^2+y[1]).(x)), [1,2,3], [4,5])
([2, 4, 6], nothing) # this commit
([2, 4, 6], [3, 0]) # tagged Zygote
(now caught & sent the slow path)
I don't quite know why the generic broadcast isn't used for the GPU, perhaps something would break. It uses dual numbers instead, and perhaps this makes purity assumptions about the function involved
I think the reason is that you can't send IdDicts to the GPU. Which means you can't send Zygote.Context to the GPU. Which means you can't run Zygote._pullback, which needs the Context to track some things -- I am not sure what exactly, but I think things like tasks and globals. But the things that the Context is meant to track don't need to be tracked in forward mode.
With the caveat that I don't really understand what's going on, has this subsumed #807?
I was unaware of #807. This does not try to make broadcasting better able to handle state; it does try to make it ignore state whenever possible, in the interest of not being slow. The last few commits add a similar test to this:

julia> s = 0;

julia> function f(x)
         global s += x
       end;

julia> gradient(x -> sum(map(f, x)), 1:10)
([1, 2, 3, 4, 5, 6, 7, 8, 9, 10],)  # this PR
([10, 9, 8, 7, 6, 5, 4, 3, 2, 1],)  # tagged version, test is == (10:-1:1,)

julia> fieldnames(typeof(f))
()

julia> Base.issingletontype(typeof(f))
true

Not sure what to make of that; global scope is weird? Maybe the example is too artificial, as this is not actually a sensible gradient calculation. The examples I invented (and added as tests here) do detect when a variable is closed over:

julia> gradient((x,y) -> sum((z->z^2+y[1]).(x)), [1,2,3], [4,5])
([2, 4, 6], [3, 0])

julia> fieldnames(typeof((z->z^2+y[1])))  # and Base.issingletontype is true
()

julia> let y = [1,2]
         fieldnames(typeof((z->z^2+y[1])))  # and Base.issingletontype is false
       end
(:y,)
Δf_and_args = unzip(_tryreverse($mapfunc, Δf_and_args_zipped))
Δf = reduce(accum, Δf_and_args[1])
(Δf, Δf_and_args[2:end]...)
if Base.issingletontype(F) && length(args) == 1
Variants of the weird map example from existing tests. Tests pass, but at the REPL it would fail with this PR; some other variants are fine:
julia> s = 0;
julia> function f(x)
         global s += x
       end;
julia> gradient(x -> sum(map(f, x)), 1:10)
([1, 2, 3, 4, 5, 6, 7, 8, 9, 10],) # this PR
([10, 9, 8, 7, 6, 5, 4, 3, 2, 1],) # tagged version, test is == (10:-1:1,)
julia> fieldnames(typeof(f))
()
julia> Base.issingletontype(typeof(f))
true
julia> S = Ref(0) # mutable, global scope
Base.RefValue{Int64}(0)
julia> function F(x)
         S[] += x
       end;
julia> fieldnames(typeof(F))
()
julia> gradient(x -> sum(map(F, x)), 1:10)
([1, 2, 3, 4, 5, 6, 7, 8, 9, 10],)
julia> let
         S2 = Ref(0) # mutable, local scope
         function F2(x)
             S2[] += x
         end
         gradient(x -> sum(map(F2, x)), 1:10)
       end
([10, 9, 8, 7, 6, 5, 4, 3, 2, 1],)
julia> let
         s2 = 0 # immutable, local scope
         function f2(x)
             s2 += x
         end
         gradient(x -> sum(map(f2, x)), 1:10)
       end
([10, 9, 8, 7, 6, 5, 4, 3, 2, 1],)
My inclination is to say that none of this matters -- the motivation for reverse in stateful map is RNNs, where f is a whole network. That will definitely contain parameters, and hence take the slow path.
Yeah, we could probably afford a regression in those corner cases. I don't know why things in the Main module act so differently. Maybe we could exclude methods defined in Main from the fast path? (I don't like this idea too much, though.)
Yeah, there might be a way to test for this, but parentmodule(f2) is also Main, so I don't think that's quite right.
But the example seems pretty contrived. We're concerned about getting gradients right, which sometimes depends on evaluation order. This is a naked test of evaluation order, as a proxy for the real concern. Perhaps it's not a close enough one, and we should test slightly more realistic things.
More generally, I'm not sure Julia makes guarantees about the order of evaluation of map or broadcast, especially broadcast. I'm not too sure what that means for AD yet.
ok, then I think I'm fine with merging this PR as it is
@mcabbott please address the concerns brought up and please use bors to merge.
What concerns?
My comment above would be a start.
Yes, that's why I asked; I have no idea what that comment was meant to say. Apparently the people tagged also don't know. If you have actual technical concerns and can explain them, or at least provide an MWE, then I'm happy to have a look.
How would this support immutable structures like static arrays without redoing the work of writing kernels for every kind of array type that comes along?
Julia just specialises as usual. It's not mutating anything. What makes you think this is less likely to work on immutable arrays than the previous code? There are right now zero tests for StaticArrays, but when I try things, it seems pretty easy to find examples where broadcasting fails on the tagged version.
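For instance, the sort of broadcast one might add as a test (illustrative only; no claim here about which versions handle it):

using Zygote, StaticArrays

# A simple broadcast gradient over an SVector, the kind of case such tests could cover.
gradient(x -> sum(exp.(x)), SVector(1.0, 2.0, 3.0))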
@mcabbott, how did you (meaning, how can I :-) ) select forward mode in Zygote - was this done using
The merged version will automatically use forward mode whenever possible. The times quoted were from different variants tried while writing this PR -- the first step was Applying
Thanks, @mcabbott!
Oh, sure - I meant for the function broadcasted over.
Ah OK. I never thought to try that. In principle that ought to compile down to roughly the same thing, right? But in practice:
Oh, interesting! The Zygote compiler now outperforms
I guess
I hadn't seen this package, thanks. Will have to understand what it does, to save memory. Xref also JuliaDiff/ChainRules.jl#531, where there are some ideas on how to use
For now Zygote uses dual numbers a lot... well, the list of conditions is fairly long: the arguments and output must be real numbers, you shouldn't be inside a second derivative, and the function must not be a closure, for which
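Schematically, the sort of check that list amounts to might look like this (my paraphrase with made-up names, not Zygote's actual predicate):

# Rough paraphrase of the conditions above (illustrative only): the function must
# carry no captured state, and the arguments must be real numbers or arrays of them,
# for the dual-number forward path to be safe.
dual_safe(f, args...) =
    Base.issingletontype(typeof(f)) &&
    all(a -> a isa Real || a isa AbstractArray{<:Real}, args)

dual_safe(abs, [1.0, 2.0])                     # true
dual_safe(let y = 2; x -> x + y; end, [1.0])   # false: the closure captures y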
Do you mean fusion, or just a complicated f? Zygote remains always un-fused, i.e.
Looks like for the

julia> using Zygote, ForwardDiffPullbacks, StaticArrays, BenchmarkTools

julia> f = (xs...) -> (sum(map(x -> sum(map(x -> x^2, x)), xs)))

julia> xs = (2, (3, 4), SVector(5, 6, 7));

julia> Xs = map(x -> fill(x, 100), xs);

julia> @benchmark Zygote.gradient(($Xs...) -> sum(f.($Xs...)), $Xs...)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  403.472 μs … 5.890 ms   ┊ GC (min … max): 0.00% … 91.96%
 Time  (median):     452.634 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   474.041 μs ± 294.882 μs ┊ GC (mean ± σ):  3.51% ± 5.18%
 Memory estimate: 143.06 KiB, allocs estimate: 3229.

julia> @benchmark Zygote.gradient(($Xs...) -> sum(fwddiff($f).($Xs...)), $Xs...)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.543 μs … 393.863 μs  ┊ GC (min … max): 0.00% … 95.79%
 Time  (median):     5.010 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.637 μs ± 12.513 μs   ┊ GC (mean ± σ):  8.13% ± 3.62%
 Memory estimate: 9.11 KiB, allocs estimate: 64.

Whew, not all my work was in vain - for now .... ;-)
Just an f that has many steps, loops, etc., stuff that will result in a long tape if Zygote reverse-mode diffs it.
ForwardDiffPullbacks.jl basically "just" provides an
The main idea behind ForwardDiffPullbacks.jl is to allow users to force forward-mode AD, compatible with anything that respects
I see, so I should think of this as being a bit like
Well, I certainly have no plans to make Zygote's broadcasting subsume this -- doing stuff beyond arrays of numbers, I mean. But I would like to figure out how
Xref also JuliaDiff/ChainRulesDeclarationHelpers.jl#2, not the same but perhaps of interest.
OK now I see -- it looks like it evaluates twice, which avoids having to save the partials between the forward pass and the backward.

julia> using Zygote, ForwardDiffPullbacks
julia> pr(x) = @show x;
julia> gradient(fwddiff(pr), 2pi)
x = 6.283185307179586
x = Dual{ForwardDiff.Tag{Tuple{typeof(pr), Val{1}}, Float64}}(6.283185307179586,1.0)
(1.0,)
julia> gradient(x -> Zygote.forwarddiff(pr,x), 2pi)
x = Dual{Nothing}(6.283185307179586,1.0)
(1.0,)
Yes - it's a bit wasteful, but the forward pass is without dual numbers and the reverse pass uses separate duals for each thunk. I should optimize that, at least for the case where there's only one argument.
I'm very open to changes/suggestions/PRs regarding ForwardDiffPullbacks - I'd be glad to move it to the JuliaDiff GitHub org once it's considered solid enough.
For broadcasting it seems tricky to guess how to make this trade-off, in general. Some functions are more expensive than others. But for an explicit wrapper like this, you know what you're getting... I suppose it could in principle offer an option to specify whether to re-compute or not. For numbers, as you say, a chunked calculation seems like it should not have downsides.
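To illustrate that trade-off in the simplest one-argument scalar case, a sketch of my own (made-up names, not the code of either package):

using ForwardDiff: Dual, value, partials

# Variant A: run f on a Dual once and keep the result, so the partial is already
# stored when the pullback is called.
function pullback_store(f, x::Real)
    d = f(Dual(x, one(x)))
    value(d), dy -> dy * partials(d, 1)
end

# Variant B: plain forward pass, recompute with a Dual inside the pullback;
# twice the work, but nothing extra kept alive between the passes.
function pullback_recompute(f, x::Real)
    f(x), dy -> dy * partials(f(Dual(x, one(x))), 1)
end

y, back = pullback_recompute(sin, 0.3); back(1.0)  # ≈ cos(0.3)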
Zygote uses ForwardDiff for differentiating broadcasting operations; see FluxML/Zygote.jl#1001. However, ForwardDiff cannot differentiate non-generic code that accepts only Float64 (it needs to pass its Dual type through). In addition, ForwardDiff defines rules via DiffRules; it doesn't understand ChainRules. This means we need to define some rules here to make things work. See this discussion on how to add new rules for ForwardDiff: https://discourse.julialang.org/t/issue-with-forwarddiff-custom-ad-rule/72886/6
If JuliaDiff/DiffRules.jl#74 gets merged, I should not need this anymore.
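A minimal illustration of the non-generic-code limitation described there (the function names are made up):

using ForwardDiff

f_generic(x::Real)   = 2x + 1
f_strict(x::Float64) = 2x + 1   # over-typed: only accepts Float64

ForwardDiff.derivative(f_generic, 3.0)    # 2.0
# ForwardDiff.derivative(f_strict, 3.0)   # MethodError: no method matching f_strict(::ForwardDiff.Dual{...})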
During JuliaDiff/ChainRules.jl#441 we realised there's probably a factor 10 left on the table, for no good reason. Not always a factor 10, but still:
Edit -- the latest version does what's suggested in #336, roughly, marked "forward" above. That is, using dual numbers (via the existing internal function broadcast_forward) instead of generic reverse mode when possible. The examples linked there (discourse & #592) both get about 5x faster.

The lines marked "reverse" above are the result of inference improvements to the generic reverse mode. These were the first step of this PR, and will still be used when dual numbers cannot be, e.g. for complex numbers, or when broadcasting closures.
Edit' -- perhaps worth timing map on the same problems. Not certain we should do this here, but the same purity check used for "forward" should also make it safe to skip reverse stuff in map, which gives modest speedups & some memory savings: