Disable BLAS threading within conv_im2col! etc. #395
if nthreads() > 1
    th = BLAS.get_num_threads()
    BLAS.set_num_threads(1)
    # conv_im2col! has a loop with @threads, and benchmarks show that this is usually
    # faster without BLAS multithreading, and without @spawn in the zip(x_cs, w_cs) loop.
end
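The pattern the code comment describes can be sketched roughly as follows. This is only an illustration: `batched_mul!` is a hypothetical stand-in for the batch loop in `conv_im2col!`, not NNlib's implementation, and the array sizes are arbitrary.

```julia
using LinearAlgebra
using Base.Threads: @threads, nthreads

# As in the diff above: with several Julia threads, keep BLAS single-threaded.
# Note that this changes a global setting and does not restore it.
if nthreads() > 1
    BLAS.set_num_threads(1)
end

# Hypothetical stand-in for the batch loop: one matrix multiplication per
# batch element, parallelised only by the outer @threads loop.
function batched_mul!(Y, W, X)
    @threads for b in axes(X, 3)
        mul!(view(Y, :, :, b), W, view(X, :, :, b))
    end
    return Y
end

W = rand(Float32, 8, 16)
X = rand(Float32, 16, 32, 4)
Y = zeros(Float32, 8, 32, 4)
batched_mul!(Y, W, X)
```

Each batch slice is written by exactly one task, so the loop needs no locking; whether it beats multi-threaded BLAS depends on the problem size and machine.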
Do you know of any way to do this on a task or thread-local level? It's too bad that this requires mutating global state.
I don't, although I agree it's ugly.
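For what it's worth, one way to at least limit the damage is to restore the previous setting afterwards. A minimal sketch, assuming a helper named `with_blas_threads` (hypothetical, not part of NNlib or LinearAlgebra); it still mutates the global BLAS setting while `f` runs, so it is not task- or thread-local:

```julia
using LinearAlgebra: BLAS

# Hypothetical helper: run `f` with `n` BLAS threads, then restore the
# previous thread count even if `f` throws.
function with_blas_threads(f, n::Integer)
    old = BLAS.get_num_threads()
    BLAS.set_num_threads(n)
    try
        return f()
    finally
        BLAS.set_num_threads(old)
    end
end

before = BLAS.get_num_threads()
A = rand(64, 64)
B = rand(64, 64)
C = with_blas_threads(1) do
    A * B
end
```

Other tasks that call BLAS while the block runs will also see the reduced thread count, which is exactly the global-state problem raised above.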
Shouldn't this also test a convolution with groups > 1? That's what the spawn was for.
Maybe, got an example? In this case the
Shall we go on with this? It could also test the grouped conv in FluxML/Flux.jl#1921 (comment). I can benchmark on an AMD Threadripper platform.
Benchmarks would be great. I stopped here because I thought we ought to check a whole range -- what if there is one group, or fewer channels than threads, etc... just timing one thing could lead you anywhere. Ideally we'd have multi-threading only on one outermost loop (and with some heuristic to decide when the problem is big enough). If one loop can't cover all cases then IIRC
@CarloLucibello thoughts on expanding https://github.com/FluxML/Flux.jl/blob/master/perf/conv.jl to use as a benchmarking suite?
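A benchmark that covers a range of cases, as asked for above, might be structured along these lines. This is a sketch under stated assumptions: `run_once!` and `sweep` are hypothetical names, the workload is a plain per-batch matrix multiplication standing in for a convolution, and the grid values are arbitrary. A real suite would call NNlib's conv functions and use a proper benchmarking package.

```julia
using LinearAlgebra
using Base.Threads: @threads, nthreads

# Stand-in workload (an assumption, not NNlib's conv): one matmul per batch slice.
function run_once!(Y, W, X)
    @threads for b in axes(X, 3)
        mul!(view(Y, :, :, b), W, view(X, :, :, b))
    end
    return Y
end

# Time the workload over a grid of channel counts, batch sizes and BLAS
# thread counts, restoring the original BLAS setting at the end.
function sweep(; channels = (4, 64), batches = (1, 32), blas_threads = (1, nthreads()))
    old = BLAS.get_num_threads()
    results = NamedTuple[]
    try
        for c in channels, n in batches, t in blas_threads
            W = rand(Float32, c, c)
            X = rand(Float32, c, 128, n)
            Y = similar(X)
            BLAS.set_num_threads(t)
            run_once!(Y, W, X)  # warm-up, so compilation is not timed
            secs = minimum(@elapsed(run_once!(Y, W, X)) for _ in 1:5)
            push!(results, (; channels = c, batch = n, blas_threads = t, secs))
        end
    finally
        BLAS.set_num_threads(old)
    end
    return results
end

results = sweep()
```

Including batch sizes below the thread count and small channel counts in the grid is the point: those are the cases where a single `@threads` loop may leave cores idle.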
Looking at #234, the implementation of im2col! seems to have 3 nested multi-threading operations: a @spawn, a @threads loop over the batch dim, and then BLAS threads. That might not be optimal.

This PR finds some 30% speedups by keeping just the @threads loop. But could use more testing, etc. Especially for someone to try on a newer Intel machine with MKL.

Before:
After: