accumulate gradient with the new gradient API? #707

Closed
chengchingwen opened this issue Mar 26, 2019 · 10 comments · Fixed by FluxML/Zygote.jl#902

Comments

@chengchingwen
Member

I want to accumulate the gradients of multiple data batches before updating the parameters. In the old API, I could just call back!(loss) multiple times and then call update!(opt, params(model)) once. However, with the new gradient API, I have to collect all the Grads(...) beforehand and then call update!(opt, params(model), grad) multiple times. Is there a better way to do this?
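For reference, the pattern I mean with the new API looks roughly like this (just a sketch; model, loss, batches and opt are placeholders for whatever you have):

using Flux
ps = params(model)
# one Grads object per batch, all kept around until the updates
grads = [gradient(() -> loss(model(x), y), ps) for (x, y) in batches]
for g in grads
    Flux.Optimise.update!(opt, ps, g)
end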

@MikeInnes
Member

Just sum the losses you're interested in; the gradient for the sum will be the same as the sum of the gradients.
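i.e. something like this (sketch, same placeholder names as above):

ps = params(model)
# a single backward pass through the summed loss over all the batches
gs = gradient(ps) do
    sum(loss(model(x), y) for (x, y) in batches)
end
Flux.Optimise.update!(opt, ps, gs)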

@chengchingwen
Member Author

Wouldn't that require more memory?

@MikeInnes
Member

Yes, true. The other option is just to define + on gradient objects and loop over them. I don't think that works yet but it'd be easy to implement.
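Until then, a hand-rolled version could look something like this (a sketch that pokes at the Grads object's internal grads IdDict, so not a public API):

# add the gradients in `gs` into `acc`, parameter by parameter
function accum!(acc, gs, ps)
    for p in ps
        g = get(gs.grads, p, nothing)
        g === nothing && continue
        a = get(acc.grads, p, nothing)
        acc.grads[p] = a === nothing ? g : a .+ g
    end
    return acc
end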

@darsnack
Member

@chengchingwen I added this to the community triage for discussion. Do you have any more comments on this? I'm inclined to say that this isn't something that needs an explicit function in Flux, but I am curious to hear if there is a use-case where this would be really helpful.

Note that you can write this quite elegantly with

map(params(m), grads) do p, g
  update!(opt, p, g)
end

or something.

@CarloLucibello
Member

For generic optimizers, what you suggest is not equivalent to first accumulating the gradients and then performing a single update: stateful optimizers such as ADAM advance their internal state (moment estimates, step count) on every update! call, so two updates with g1 and g2 generally differ from one update with g1 + g2.
We should fix this by implementing gradient algebra in Zygote, as Mike suggested.

@CarloLucibello
Member

This is useful when one wants to perform updates with a large batch size but is memory constrained.

@CarloLucibello
Member

Also useful for data parallelism.

@darsnack
Member

Oh I totally missed the accumulation part. Yeah perhaps we can open an issue on Zygote to track gradient algebra in general?

@darsnack
Member

During triage, people were fairly enthusiastic about supporting something like this. There were not too many concerns about adding functionality to Params or Grads (it seems like gradient algebra is a sensible and safe feature to add).

The implementation approach that we discussed was to create map/map! methods that allow mapping arbitrary functions over the gradients, both out-of-place and in-place. This would effectively "forward" the function down to the underlying weight arrays (maybe this is already possible since Params/Grads forward Base.iterate). For specific operations like +/*, we suggested implementing a broadcast rule:

gs1 .+ gs2  # becomes map(+, gs1, gs2)
gs1 .+= gs2 # becomes map!(+, gs1, gs1, gs2)
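
With that in place, an accumulation loop could look roughly like this (a sketch assuming the proposed broadcast rules; model, loss, batches and opt are placeholders):

ps = params(model)
(x1, y1), rest = Iterators.peel(batches)
gs = gradient(() -> loss(model(x1), y1), ps)      # gradients of the first batch
for (x, y) in rest
    gs .+= gradient(() -> loss(model(x), y), ps)  # i.e. map!(+, gs, gs, ...)
end
update!(opt, ps, gs)                              # one update with the summed gradients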

@Maximilian-Stefan-Ernst

Since this is the first hit when searching the internet for how to do gradient accumulation in Flux.jl, maybe it's helpful to add here that this is now implemented in Optimisers.jl as AccumGrad.

(Just realized this after spending quite some time writing it myself, which can be annoying if you already have a complicated training loop.)
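
For example, something along these lines (a sketch; as far as I understand, AccumGrad(n) accumulates the gradients over n update! calls before the next rule in the chain applies them; the model, loss and data here are just placeholders):

using Flux, Optimisers

model = Chain(Dense(10 => 5, relu), Dense(5 => 1))
rule  = Optimisers.OptimiserChain(Optimisers.AccumGrad(4), Optimisers.Adam(1e-3))
opt_state = Flux.setup(rule, model)

for (x, y) in batches                        # `batches` is whatever data iterator you have
    g = gradient(m -> Flux.Losses.mse(m(x), y), model)[1]
    Flux.update!(opt_state, model, g)        # parameters actually change every 4th call
end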
