
support parallel maps (tmap, pmap, vmap, etc.) #728

Merged · 11 commits · Jul 10, 2020

Conversation

@ChrisRackauckas (Member) commented Jul 8, 2020

This PR is quite simple and does it. It would be a shame if we had to keep waiting to support data parallelism just because we can't figure out where to put the code, so I hope this PR can drive the conversation.

@ChrisRackauckas changed the title from "support pmap" to "support parallel maps (tmap, pmap, vmap, etc.)" on Jul 8, 2020
@ChrisRackauckas (Member, Author)

Note that I would very much understand not wanting the tmap here; it's really to show how it would be done. It's pulled from https://github.com/jw3126/ThreadingTools.jl, which isn't registered.
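For reference, the essence of such a tmap is only a few lines (an illustrative sketch, not ThreadingTools.jl's exact code):

```julia
using Base.Threads

# Illustrative tmap sketch: a map whose iterations are spread across threads.
function tmap(f, xs::AbstractArray)
    ys = similar(xs, Base.promote_op(f, eltype(xs)))
    @threads for i in eachindex(xs, ys)
        @inbounds ys[i] = f(xs[i])
    end
    return ys
end

tmap(x -> x^2, collect(1:8))  # same result as map(x -> x^2, 1:8)
```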

@DhairyaLGandhi (Member)

I am not sure adding all the utilities here is such a good idea: it bloats the package, and it's arguably the domain of a different package (ideally Base) that we could then support by writing adjoints etc. But I get the point of having this.

@CarloLucibello (Member)

Yeah, clearly tmap doesn't belong here. We should definitely have pmap though, being in the standard library (Distributed), and also vmap since it's used in NNlib.

@ChrisRackauckas (Member, Author)

The issue with the maps is that, in some sense, they should share an implementation via multiple dispatch, but they don't. So the adjoint for each of these different maps is literally the same piece of code, which makes it easiest to define them all in one loop. What I am thinking might be a good quick solution is:
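Something of this shape, say (a sketch with assumed names, not necessarily the code that ended up in the PR):

```julia
using Distributed, Zygote

# One shared reverse rule for every map flavor (sketch; `map_pullback` is an
# assumed name). The forward pass stores each element's pullback; the reverse
# pass applies them to the incoming cotangents.
function map_pullback(mapfn, f, xs)
    ys_and_backs = mapfn(x -> Zygote.pullback(f, x), xs)
    ys = map(first, ys_and_backs)
    back(Δ) = (nothing, map((yb, δ) -> last(yb)(δ)[1], ys_and_backs, Δ))
    return ys, back
end

# The same body can then serve pmap, tmap, vmap, ...:
Zygote.@adjoint pmap(f, xs::AbstractArray) = map_pullback(pmap, f, xs)
```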

@chriselrod

Note that LoopVectorization also has vmapnt (nontemporal stores) and vmapntt (nontemporal stores and threads). Documented here. I'm not too excited about the nontemporal stores (I'm assuming 100-million-element vectors are longer than the practical applications we'll be seeing on the CPU? Nontemporal stores can severely hurt performance at smaller sizes), but I'd be happy to add the missing vmapt for temporal stores + threads.
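Usage-wise the variants differ only in store strategy and threading (a quick sketch):

```julia
using LoopVectorization

a, b = rand(Float32, 10^6), rand(Float32, 10^6)
y   = vmap(+, a, b)      # SIMD map
ynt = vmapnt(+, a, b)    # nontemporal stores: bypass the cache, which only
                         # pays off when the output won't be re-read soon
ytt = vmapntt(+, a, b)   # nontemporal stores + threads
y ≈ ynt ≈ ytt            # all compute the same values
```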

But I wonder if, instead of threading the maps and threading the matrix multiplications, we should divide batches up by threads.
I'd have to run tests, but I speculate that if layers aren't too large, we could divide the batch into chunks that fit in the L2 cache and get better cache locality by working on these mini-batches sequentially in each thread. If the matrices are small enough, the memory traffic within each mini-batch stays low, which could cut the overall memory churn.
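Roughly the idea under discussion (a hypothetical sketch; the names, chunk size, and eltype assumptions are all mine):

```julia
using Base.Threads

# Hypothetical per-thread mini-batching: split the batch (columns of X) into
# chunks sized to stay cache-resident, and let each thread work through its
# chunks sequentially for better locality. Assumes f maps a matrix view to a
# matrix of the same eltype.
function batched_apply(f, X::AbstractMatrix; chunk::Int = 64)
    n = size(X, 2)
    ranges = [lo:min(lo + chunk - 1, n) for lo in 1:chunk:n]
    outs = Vector{Matrix{eltype(X)}}(undef, length(ranges))
    @threads for k in eachindex(ranges)
        outs[k] = f(view(X, :, ranges[k]))  # one mini-batch at a time per thread
    end
    return reduce(hcat, outs)
end
```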

@CarloLucibello (Member)

Until we figure out the best way forward, let's merge this with just pmap and vmap. Some tests are needed as well

@ChrisRackauckas (Member, Author)

Added some tests now too. I really just wanted to fix the fact that "can Zygote differentiate parallel code?" had a solution that could be written in about 10 minutes from a cellphone screen during a call, so there's really no reason to keep waiting. This'll work for the 90% of code that happens to be data-parallel; we can get a tmap in a different package, and the rest of parallelism can take a lot longer.

> Yeah, clearly tmap doesn't belong here. We should definitely have pmap though, being in the standard library (Distributed), and also vmap since it's used in NNlib.

The nicer solution is JuliaLang/julia#36579 (comment), but that doesn't exist yet, so I think it's best to just get something working until data parallelism in Julia itself gets more organized; then users will naturally switch to that along with Zygote.
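Concretely, what this enables (a usage sketch with a toy f):

```julia
using Distributed
addprocs(2)
@everywhere f(x) = sin(x) * x

using Zygote

xs = collect(0.1:0.1:1.0)
# With the pmap adjoint, the data-parallel version differentiates
# exactly like the serial one:
g_par = gradient(v -> sum(pmap(f, v)), xs)[1]
g_ser = gradient(v -> sum(map(f, v)), xs)[1]
g_par ≈ g_ser  # true
```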

@CarloLucibello (Member)

bors r+

bors bot added a commit that referenced this pull request Jul 8, 2020
728: support parallel maps (tmap, pmap, vmap, etc.) r=CarloLucibello a=ChrisRackauckas

This PR is quite simple and does it. It would be a shame if we had to keep waiting to support data parallelism just because we can't figure out where to put the code, so I hope this PR can drive the conversation.

Co-authored-by: Chris Rackauckas <accounts@chrisrackauckas.com>
@bors (Contributor) commented Jul 8, 2020

Build failed:

@ChrisRackauckas (Member, Author)

Hmm. @chriselrod, is the LoopVectorization stack blocking something in LLVM.jl that CUDA.jl needs?

@chriselrod commented Jul 8, 2020

VectorizationBase's compat is currently set to require LLVM 1.5, 1.6, 1.7, or 2.0. I'll see if I can set it to "1.0, 2.0".
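That is, widening the compat entry (sketch):

```toml
# VectorizationBase Project.toml (sketch)
[compat]
LLVM = "1.0, 2.0"   # previously "1.5, 1.6, 1.7, 2.0"
```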

@CarloLucibello (Member)

CUDA.jl has compat `LLVM = "1.5.2, 2"`, so I don't see what is causing the problem.

@DhairyaLGandhi (Member)

A quick fix would be to wrap it in `@require` (via Requires.jl).
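That is, loading the rule only when LoopVectorization is present (sketch; verify the UUID against the General registry):

```julia
using Requires

function __init__()
    # Define the vmap adjoint only when LoopVectorization is loaded.
    # UUID shown for illustration -- check it against the registry.
    @require LoopVectorization = "bdcacae8-1622-11e9-2a5c-532679323890" begin
        include("vmap_adjoint.jl")  # hypothetical file holding the vmap rule
    end
end
```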

@ChrisRackauckas (Member, Author)

How would that fix it? Seems like that just wouldn't test it?

@maleadt (Contributor) commented Jul 9, 2020

CUDA 1.1 has dropped support for Julia 1.3, but only 1.1 is compatible with LLVM 2.0 (which is required explicitly by the build here -- not sure why).

@DhairyaLGandhi (Member)

bors try

bors bot added a commit that referenced this pull request Jul 9, 2020
@bors (Contributor) commented Jul 9, 2020

try

Build failed:

@DhairyaLGandhi (Member)

bors try

bors bot added a commit that referenced this pull request Jul 9, 2020
@chriselrod

It needs to depend on (and `using`) Distributed for pmap.
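In code, the missing piece would be (sketch):

```julia
# pmap lives in the Distributed stdlib, not Base, so Zygote needs
# Distributed in its [deps] plus an explicit import:
using Distributed: pmap
```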

@bors (Contributor) commented Jul 9, 2020

try

Build failed:

@chriselrod commented Jul 9, 2020

I'll generalize the vmap method and release a new LoopVectorization version shortly. EDIT: 0.8.15 has merged. It uses a SIMD version when the arguments are all `<: DenseArray{<:Union{Bool,Base.HWReal}}` and just falls back to `map!` otherwise.
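The dispatch described there is roughly (an illustrative sketch, not LoopVectorization's actual internals):

```julia
# SIMD-friendly argument types get the fast path; everything else
# falls back to map! (sketch).
const SIMDEligible = DenseArray{<:Union{Bool, Base.HWReal}}

function vmap_like!(f::F, dest::SIMDEligible, args::SIMDEligible...) where {F}
    @inbounds @simd for i in eachindex(dest, args...)
        dest[i] = f(map(a -> a[i], args)...)
    end
    return dest
end
vmap_like!(f, dest, args...) = map!(f, dest, args...)  # generic fallback
```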

@ChrisRackauckas (Member, Author)

Tests are passing locally with the latest LoopVectorization 🎉

@ChrisRackauckas (Member, Author)

bors try

@bors (Contributor) commented Jul 10, 2020

🔒 Permission denied

Existing reviewers can add ChrisRackauckas as a reviewer.

@ChrisRackauckas (Member, Author)

:(

@DhairyaLGandhi (Member)

bors try

bors bot added a commit that referenced this pull request Jul 10, 2020
@bors (Contributor) commented Jul 10, 2020

try

Build succeeded:

@ChrisRackauckas (Member, Author)

Bors ran on v1.4 (https://gitlab.com/JuliaGPU/Zygote-jl/-/jobs/632198576#L50), but I forgot to change the name of the job, so it said v1.3. That last commit just fixes the name; it should be good to go.

@CarloLucibello (Member)

Still not clear what is forcing LLVM to 2.0 on Julia 1.3; I don't think we had this issue before this PR.

@CarloLucibello (Member)

actually the issue is already on master, see #729

bors r+

@bors (Contributor) commented Jul 10, 2020

Build succeeded:

@bors merged commit 1996835 into FluxML:master on Jul 10, 2020
@ChrisRackauckas deleted the pmap branch on July 10, 2020 11:23
bors bot added a commit that referenced this pull request Jul 10, 2020
733: add back julia 1.3 compatibility r=CarloLucibello a=CarloLucibello

unnecessarily dropped in #728 

Having a separate test/Project.toml, as specified in the [Pkg docs](https://julialang.github.io/Pkg.jl/v1/creating-packages/#Test-specific-dependencies-in-Julia-1.2-and-above-1), solves the problem of forcing the same Manifest.toml created from the root Project.toml on each testing environment, which may lead to incompatibilities with some extra packages required in the test phase. In this case, in fact, LLVM.jl 2.0 was installed on Julia 1.3, but then no CUDA.jl version could be resolved.

Co-authored-by: CarloLucibello <carlo.lucibello@gmail.com>
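The layout described above, as a sketch (Test's UUID is the registered one; the compat bound is illustrative):

```toml
# test/Project.toml (sketch): test-only deps resolve in their own
# environment instead of inheriting the root Manifest.toml.
[deps]
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[compat]
julia = "1.3"
```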
bors bot added a commit that referenced this pull request Dec 8, 2020
842: Remove LoopVectorization.jl r=CarloLucibello a=mcabbott

Zygote depends on LoopVectorization only in order to add one gradient definition, for `vmap`. That doesn't seem like a good idea. Surely the rule can live elsewhere, via ZygoteRules ~~or ChainRulesCore~~ which exist for this purpose. Seems to have been added along with `pmap` in #728. 

Marked as draft here until such a rule exists elsewhere. (It's possible that, after that, Zygote should keep the dep. and not load it for a while, just to allow Pkg to bound versions.)

The rule might also be simplified, noted here #838 (comment), as I don't think `vmap` can guarantee that the elements will be acted on in order. ~~Since `pmap` also can't guarantee to work in order, it also seems to have no need to reverse the order on the reverse pass.~~

Co-authored-by: Michael Abbott <me@escbook>
@alok mentioned this pull request Nov 10, 2021