
Support groups in DenseConvDims #289

Merged
merged 36 commits on Jul 16, 2021

Conversation

@DhairyaLGandhi
Member

Also propagate the group count to the kernel calls.

@CarloLucibello
Member

Our cpu conv kernels don't seem to support groups, do they?

@CarloLucibello
Member

Our cpu conv kernels don't seem to support groups, do they?

@DhairyaLGandhi can I get an answer here? Which conv implementation supports groups? Can you point to a specific line of code?

@CarloLucibello
Member

Ah, you are implementing groups support right now; no need to be so mysterious :)

@DhairyaLGandhi
Member Author

So 115784b is a "working" version that is very memory-wasteful, but I'm putting it up now in case someone is eager to see groups working in NNlib. I will move to a non-allocating version soon. The issue is that Julia doesn't like to look into the strides of deep, lazy wrapper types around SubArray, ReshapedArray, etc., which can cause races or return garbage. In the previous version we had segfaults, because DenseConvDims needs to be extra careful about encoding dims that include the groups.
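A minimal sketch of the copying approach described above, assuming the merged layout where the weight array holds C_in ÷ groups input channels per group; grouped_conv_naive is a hypothetical name, not the PR's actual code:

using NNlib

function grouped_conv_naive(x, w, groups)
    cin  = size(x, 3) ÷ groups              # input channels per group
    cout = size(w, 4) ÷ groups              # output channels per group
    ys = map(1:groups) do g
        xg = x[:, :, (g-1)*cin+1:g*cin, :]      # materializes a copy of the slice
        wg = w[:, :, :, (g-1)*cout+1:g*cout]    # materializes a copy of the slice
        conv(xg, wg)
    end
    cat(ys...; dims = 3)                    # stitch the per-group outputs back together
end

size(grouped_conv_naive(rand(Float32, 28, 28, 100, 2), rand(Float32, 3, 3, 20, 15), 5))  # (26, 26, 15, 2)

The copies made for xg and wg are the kind of memory waste being referred to; replacing them with views is where the stride and wrapper-type issues mentioned above come in.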

@DhairyaLGandhi
Member Author

For the record, not using tasks is as fast as the original version for regular convolutions (i.e. with groups = 1), and the serial version may even be faster anyway.

@DhairyaLGandhi
Member Author

DhairyaLGandhi commented Mar 25, 2021

Oh, this is satisfying. This is starting Julia with -t2 on my tiny dual-core MacBook Pro. For context, this results in about a 20% reduction compared to doing the work sequentially. It might be safer not to do that, but RFC?

julia> @btime y = conv($(rand(Float32, 28, 28, 100, 2)), $(rand(Float32, 3, 3, 20, 15)), groups = 5);
  947.722 μs (159 allocations: 4.81 MiB)

Either way, we now have the machinery in place to pass groups through to CUDNN. Should I also add a stub to DepthwiseConvDims for consistency? I might actually do that.
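For illustration, a hedged sketch of the task-per-group idea (hypothetical code, not the scheduling the PR actually uses internally): one task is spawned per channel group and Julia's scheduler spreads them over the threads started with -t2.

using NNlib

function grouped_conv_tasks(x, w, groups)
    cin  = size(x, 3) ÷ groups
    cout = size(w, 4) ÷ groups
    ys = Vector{Any}(undef, groups)
    @sync for g in 1:groups
        Threads.@spawn begin
            xg = x[:, :, (g-1)*cin+1:g*cin, :]       # copies keep plain dense arrays
            wg = w[:, :, :, (g-1)*cout+1:g*cout]
            ys[g] = conv(xg, wg)                     # each group is an independent conv
        end
    end
    cat(ys...; dims = 3)
end

Because each task reads a disjoint channel range and writes to its own slot of ys, the per-group work is independent, which is the point made below about there being no safety issue.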

@CarloLucibello
Member

For context, this results in about a 20% reduction compared to doing the work sequentially. It might be safer not to do that, but RFC?

I don't see any safety issue in going multithreaded here; the computations are independent.

We should check that using threads doesn't have a performance impact on the standard convolution case (G=1).

@DhairyaLGandhi
Member Author

It's a question of computational complexity. The existing cases aren't affected much at all.

@CarloLucibello
Member

I think the existing cases are affected, because spawning threads doesn't come for free, but hopefully the overhead is negligible compared to the cost of the convolution itself. If that's not the case, we can still dispatch on the G = 1 case.

In any case, we should check that we don't have performance regressions in standard convolution before merging
(plus the obvious stuff: existing tests are currently not passing, new tests are needed, and the docstring should be updated).
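As a sketch of the kind of check meant here (assuming the merged API where conv takes a groups keyword, as in the benchmark above), one could time the plain call and the single-group path on this branch and compare against the same plain call on master; no numbers are reproduced here:

using BenchmarkTools, NNlib

x = rand(Float32, 28, 28, 100, 2)
w = rand(Float32, 3, 3, 100, 15)

@btime conv($x, $w)               # plain G = 1 path
@btime conv($x, $w; groups = 1)   # grouped entry point with a single group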

@DhairyaLGandhi
Member Author

Way ahead of you :)

@darsnack
Member

For context, this results in about a 20% reduction compared to doing the work sequentially.

🎉

@DhairyaLGandhi
Member Author

Addressed all the comments, and I've also added more tests and docs. @pxl-th, could you rebase your NNlibCUDA PR? Any other comments?

@CarloLucibello
Member

I'm not sure this respects the fact that channels_in(cdims) == size(x, 3) and channels_out(cdims) == size(y, 3).
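For concreteness, the invariant being asked about, written out against the merged API (a hedged illustration; the accessor behaviour at this point in the PR's history may have differed):

using NNlib

x = rand(Float32, 28, 28, 100, 2)
w = rand(Float32, 3, 3, 20, 15)            # 100 ÷ 5 = 20 input channels per group
cdims = DenseConvDims(x, w; groups = 5)
y = conv(x, w, cdims)

@assert NNlib.channels_in(cdims)  == size(x, 3)    # 100, the full input channel count
@assert NNlib.channels_out(cdims) == size(y, 3)    # 15, the full output channel count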

@DhairyaLGandhi
Member Author

Done; I also made channels_out respect groups. I think storing the groups is needed anyway, along with the regular C_out (since we ultimately use the G = 1 definition).

@DhairyaLGandhi
Member Author

Ah, almost forgot: we need channels_out to mean the same thing as on master, since that data is needed. See ∇conv_data!
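A small hedged illustration of why the master meaning matters for the data gradient (using the non-mutating ∇conv_data for brevity, and the same sizes as the snippet above):

using NNlib

x = rand(Float32, 28, 28, 100, 2)
w = rand(Float32, 3, 3, 20, 15)
cdims = DenseConvDims(x, w; groups = 5)

dy = ones(Float32, size(conv(x, w, cdims)))   # cotangent: size(dy, 3) is the full C_out
dx = ∇conv_data(dy, w, cdims)                 # relies on channels_out(cdims) describing dy
@assert size(dx) == size(x)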

@DhairyaLGandhi
Member Author

DhairyaLGandhi commented Jul 15, 2021

I've addressed the reviews, thanks @staticfloat! I had a couple of debugging steps that I've now cleaned up. This also comes with more tests and docs!

Re the added check: it's there so we can give a clearer error message about the groups configuration (see the sketch after this comment).

Re threading: it lets us use the newer scheduler, which reasons in terms of tasks, so it's pretty smart about how it handles resources. It's also composable: https://julialang.org/blog/2019/07/multithreading/
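On the added check mentioned above, a hypothetical sketch of the kind of early validation meant (names and messages are illustrative, not the PR's exact code):

function check_group_config(x_channels::Int, w_in_channels::Int, groups::Int)
    x_channels % groups == 0 || throw(DimensionMismatch(
        "input channels ($x_channels) must be divisible by groups ($groups)"))
    w_in_channels * groups == x_channels || throw(DimensionMismatch(
        "kernel expects $w_in_channels input channels per group, which does not " *
        "match $x_channels input channels split into $groups groups"))
    return nothing
end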

@DhairyaLGandhi
Member Author

@pxl-th, do you think you can test your NNlibCUDA PR off of this one? The API is mostly the same, but it would be good to check that we are passing groups into CUDNN properly.

@DhairyaLGandhi
Member Author

Let's merge this. I'll wait until tomorrow, so if anyone has final thoughts we can address them.

@pxl-th
Member

pxl-th commented Jul 16, 2021

@DhairyaLGandhi I've tested it. Everything seems to work; all the tests are passing.

@DhairyaLGandhi
Member Author

Nice, let's get these PRs merged then.
