Add groups to Conv1d #948

Merged: 10 commits into ml-explore:main on Apr 27, 2024

Conversation

@Rifur13 (Contributor) commented Apr 1, 2024

Proposed changes

Adding groups to 1D convolutions. Resolves #237.
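For illustration, a minimal usage sketch of the new parameter (a sketch only; the shapes follow MLX's channels-last conv1d convention, and the exact keyword defaults are an assumption, not taken from this PR):

```python
import mlx.core as mx

N, L, C_in, C_out, K, groups = 4, 32, 32, 32, 5, 4

x = mx.random.normal((N, L, C_in))                # input: (batch, length, channels)
w = mx.random.normal((C_out, K, C_in // groups))  # each filter only sees C_in / groups channels

# With groups > 1 the input channels are split into `groups` chunks and each
# chunk is convolved with its own block of C_out / groups filters.
y = mx.conv1d(x, w, stride=1, padding=2, groups=groups)
print(y.shape)  # (4, 32, 32)
```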

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

@Rifur13 (Contributor, Author) commented Apr 1, 2024

Wdyt? This is for CPU only. The GPU code should be very similar to this so I want to get some feedback before I continue.

Main changes:

  • The input and kernel weights need to be transposed to cleanly split up the input into groups for the matmuls.
  • The result of each grouped convolution won’t be contiguous in the output, so it needs to be inserted with a slice (see the sketch below).
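A rough numpy sketch of that approach (illustrative only, not the actual MLX implementation; stride 1 and channels-last layout assumed):

```python
import numpy as np

def grouped_conv1d_ref(x, w, groups, padding=0):
    """x: (N, L, C_in), w: (C_out, K, C_in // groups); stride 1 for simplicity."""
    N, L, C_in = x.shape
    C_out, K, _ = w.shape
    x = np.pad(x, ((0, 0), (padding, padding), (0, 0)))
    L_out = L + 2 * padding - K + 1

    # Unfold: every output position gets its (K, C_in) window.
    cols = np.stack([x[:, i:i + K, :] for i in range(L_out)], axis=1)  # (N, L_out, K, C_in)

    out = np.empty((N, L_out, C_out))
    cg, og = C_in // groups, C_out // groups
    for g in range(groups):
        # Each group's matmul only sees its own slice of channels and filters...
        lhs = cols[..., g * cg:(g + 1) * cg].reshape(N, L_out, K * cg)
        rhs = w[g * og:(g + 1) * og].reshape(og, K * cg).T
        # ...and its result lands in a (non-contiguous) slice of the output.
        out[..., g * og:(g + 1) * og] = lhs @ rhs
    return out
```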

@awni (Member) commented Apr 3, 2024

@Rifur13 this looks cool! Do you intend to add the GPU kernel here? Also this will just be for 1D grouped convolutions, correct?

Also, it would be great if you could run some benchmarks:

  • Regular conv before/after this change (make sure there is no regression)
  • Grouped conv (ideally much faster when using lots of groups compared to the same-shape conv with a single group); see the sketch below
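For example, a rough benchmark sketch along those lines (the timing helper, shapes, and the MLX-vs-PyTorch comparison are illustrative, not the script actually used in this thread):

```python
import time
import mlx.core as mx
import torch

def bench(fn, iters=100):
    fn()  # warm-up
    tic = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - tic) / iters

N, L, C, O, K = 4, 32, 256, 512, 5
for groups in (1, 2, 128, 256):
    x_mx = mx.random.normal((N, L, C))                 # MLX is channels-last
    w_mx = mx.random.normal((O, K, C // groups))
    t_mx = bench(lambda: mx.eval(mx.conv1d(x_mx, w_mx, padding=2, groups=groups)))

    x_pt = torch.randn(N, C, L)                        # PyTorch is channels-first
    w_pt = torch.randn(O, C // groups, K)
    t_pt = bench(lambda: torch.nn.functional.conv1d(x_pt, w_pt, padding=2, groups=groups))

    print(f"groups={groups}: mlx {t_mx * 1e3:.3f} ms, torch {t_pt * 1e3:.3f} ms")
```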

@Rifur13 (Contributor, Author) commented Apr 3, 2024

Yep I intend to add the GPU kernel as well. And yes, this PR will focus on 1D convolutions only.

Benchmarks coming soon!

@Rifur13 (Contributor, Author) commented Apr 9, 2024

Performance doesn’t look great; it scales worse as the number of groups increases.



| (N, iH, C)   | (O, wH, C)    | dtype   | stride | pads | groups | diff%    |
|--------------|---------------|---------|--------|------|--------|----------|
| (4, 32, 32)  | (32, 5, 32)   | float32 | 1      | 2    | 1      | +179.77% |
| (4, 32, 32)  | (32, 5, 32)   | float32 | 1      | 2    | 2      | +59.62%  |
| (4, 32, 32)  | (32, 5, 32)   | float32 | 1      | 2    | 4      | +33.96%  |
| (4, 32, 32)  | (32, 5, 32)   | float32 | 1      | 2    | 8      | +0.71%   |
| (4, 32, 32)  | (32, 5, 32)   | float32 | 1      | 2    | 8      | +15.49%  |
| (4, 32, 32)  | (32, 5, 32)   | float32 | 1      | 2    | 16     | -32.81%  |
| (4, 32, 32)  | (32, 5, 32)   | float32 | 1      | 2    | 32     | -62.36%  |
| (4, 32, 256) | (512, 5, 256) | float32 | 1      | 2    | 2      | +41.59%  |
| (4, 32, 256) | (512, 5, 256) | float32 | 1      | 2    | 128    | -88.60%  |
| (4, 32, 256) | (512, 5, 256) | float32 | 1      | 2    | 256    | -93.96%  |

What we really need is a specialized steel_matmul that splits up the inputs into groups and dispatches the kernels in parallel.
It might take me a while to understand all the gemm kernel code. I’m not sure how much time I’ll have, so if someone really needs it they can take up this work.

It would be good to have some working version in the meantime to unblock people (like me).

@Rifur13 (Contributor, Author) commented Apr 9, 2024

I’ll take another look actually. If I ignore the split k specialization this seems very doable.

@awni (Member) commented Apr 12, 2024

Just curious, what is the last column measuring? It's a difference from what to what exactly? CPU -> GPU?

@Rifur13 (Contributor, Author) commented Apr 12, 2024

No, it's actually MLX vs PyTorch. They should scale similarly, so I use these numbers to measure performance.

Also, a small update: I'm trying to parallelize the groups for-loop by sending each kernel to a different command buffer. That means creating `groups` streams, `groups` command queues, etc. Working through some errors right now, but let me know if that makes sense.

@awni (Member) commented Apr 12, 2024

> Also, a small update: I'm trying to parallelize the groups for-loop by sending each kernel to a different command buffer. That means creating `groups` streams, `groups` command queues, etc. Working through some errors right now, but let me know if that makes sense.

Actually, I would not do that. That is going to introduce a lot of overhead and subvert how we do job submission for the GPU.

The best approach is to have a single kernel to do all the groups and handle that extra dimension in the thread grid or something like that. But I realize that might be a lot more work.

A less good option that you could try is to use a concurrent command encoder. If you rebase on main, you will get some functionality to make that much easier.

@awni (Member) commented Apr 12, 2024

Here is a very simple example of how we do that in concatenate now: https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/primitives.cpp#L556-L564

@Rifur13 (Contributor, Author) commented Apr 14, 2024

Thanks for guiding me in the right direction! The numbers look very good now and it’s ready for review.

| N | iH | C   | O   | wH | C   | dtype   | stride | pads | groups | diff%    |
|---|----|-----|-----|----|-----|---------|--------|------|--------|----------|
| 4 | 32 | 32  | 32  | 5  | 32  | float32 | 1      | 2    | 1      | +189.00% |
| 4 | 32 | 32  | 32  | 5  | 32  | float32 | 1      | 2    | 2      | +176.95% |
| 4 | 32 | 32  | 32  | 5  | 32  | float32 | 1      | 2    | 4      | +185.48% |
| 4 | 32 | 32  | 32  | 5  | 32  | float32 | 1      | 2    | 8      | +183.16% |
| 4 | 32 | 32  | 32  | 5  | 32  | float32 | 1      | 2    | 8      | +181.10% |
| 4 | 32 | 32  | 32  | 5  | 32  | float32 | 1      | 2    | 16     | +145.79% |
| 4 | 32 | 32  | 32  | 5  | 32  | float32 | 1      | 2    | 32     | +102.98% |
| 4 | 32 | 256 | 512 | 5  | 256 | float32 | 1      | 2    | 2      | +110.27% |
| 4 | 32 | 256 | 512 | 5  | 256 | float32 | 1      | 2    | 128    | +50.08%  |
| 4 | 32 | 256 | 512 | 5  | 256 | float32 | 1      | 2    | 256    | +28.68%  |

Rifur13 marked this pull request as ready for review on April 14, 2024 at 21:03.
@awni (Member) commented Apr 15, 2024

Very nice result!! Will review soon.

Review thread on mlx/ops.cpp (outdated, resolved)
Comment on lines 144 to 158
```cpp
// Transpose unfolded inputs
array in_view(
    {in_unfolded.shape(0), conv_params.C, kernel_size},
    in_unfolded.dtype(),
    nullptr,
    {});
in_view.copy_shared_buffer(
    in_unfolded,
    {in_unfolded.strides(0), 1, static_cast<size_t>(conv_params.C)},
    in_unfolded.flags(),
    in_unfolded.data_size());

// Materialize
auto in_transpose = array(in_view.shape(), in_view.dtype(), nullptr, {});
copy_gpu(in_view, in_transpose, CopyType::General, s);
```
@Rifur13 (Contributor, Author) commented:

I can also create a new unfold kernel and do this transpose directly in there. It will avoid an extra copy. wdyt?

A maintainer (Member) replied:

Sounds like a good idea to me!

CC @jagrit06

@awni (Member) commented Apr 18, 2024

@Rifur13 did you do any benchmarking for the CPU version? It's not a super high priority to make it fast, but we also don't want to make it worse than it was.

@Rifur13 (Contributor, Author) commented Apr 18, 2024

There’s an extra copy, so in theory it should be worse, but I didn’t see a noticeable difference in my tests. The code for convolutions when groups = 1 is unchanged now, so the performance is identical to before.

I refactored the code to remove this copy and I think it also looks a lot cleaner. It’s easier to understand the code for groups vs without groups.

@Rifur13 (Contributor, Author) commented Apr 23, 2024

Any notes or concerns?

@awni (Member) commented Apr 24, 2024

Not really on my side. I think we can merge this, results are very nice and code looks good!! @jagrit06 or @angeloskath do either of you care to take a quick look?

@jagrit06 (Member) left a review comment:

It looks great, just a couple of things:

Could you add an error thrown in the vjp of convolutions for now if groups != 1?

Also, is there any reason the gemm we go to for grouped convs needs to be a separate kernel? It looks like the same gemm kernel, so we don’t need to have it as a separate kernel and add to the size of the metallib.

@Rifur13 (Contributor, Author) commented Apr 24, 2024

@jagrit06 Good catch, I’ll add a comment for the jvp.

The existing gemm kernel uses the 3rd grid dim as the batch size.
Are you suggesting we repurpose batches as groups? Readability would take a hit imo.

I think it’s possible if we set:

```cpp
params->batch_ndim = 1
params->batch_stride_a = K
params->batch_stride_b = N * K
params->batch_stride_d = N
```
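For what it's worth, a small numpy sketch of why those strides make the batched gemm behave like a grouped gemm (the sizes are made up for illustration, and the loop index `z` plays the role of `tid.z`):

```python
import numpy as np

M, K, N, groups = 6, 4, 3, 2           # per-group gemm sizes, chosen arbitrarily

A = np.random.randn(M, groups * K)     # unfolded input, one K-wide block of columns per group
B = np.random.randn(groups, K, N)      # per-group weight blocks, packed contiguously
D = np.zeros((M, groups * N))          # output, one N-wide block of columns per group

# With batch_ndim = 1 and the strides above, grid index z offsets A by z*K,
# B by z*N*K, and D by z*N, then runs an ordinary M x N x K gemm.
for z in range(groups):
    D[:, z * N:(z + 1) * N] = A[:, z * K:(z + 1) * K] @ B[z]

# That is exactly a multiply by the block-diagonal weight matrix of a grouped conv.
W = np.zeros((groups * K, groups * N))
for z in range(groups):
    W[z * K:(z + 1) * K, z * N:(z + 1) * N] = B[z]
assert np.allclose(D, A @ W)
```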

@jagrit06 (Member) replied:

> @jagrit06 Good catch, I’ll add a comment for the jvp.
>
> The existing gemm kernel uses the 3rd grid dim as the batch size. Are you suggesting we repurpose batches as groups? Readability would take a hit imo.
>
> I think it’s possible if we set:
>
> params->batch_ndim = 1
> params->batch_stride_a = K
> params->batch_stride_b = N * K
> params->batch_stride_d = N

Exactly as you suggest, we can set the batch strides and let the tid.z handle that.
I don't think the readability hit is bad enough to justify the overhead of compiling and packing a whole new set of gemm kernels that are basically the same as the ones we already have.

Thanks!

@Rifur13 (Contributor, Author) commented Apr 24, 2024

Done! Thanks for all the suggestions.

Ready for a final review.

@jagrit06 (Member) left a review:

Thank you so much for the good work!
We should be good to merge once the tests pass.

@awni (Member) commented Apr 25, 2024

@Rifur13 the conv 1d test failed. Do you mind checking it?

@Rifur13 (Contributor, Author) commented Apr 25, 2024

Tests should pass now. Tricky one..

@awni (Member) commented Apr 25, 2024

It's failing Metal validation. You should be able to reproduce locally with:

METAL_DEVICE_WRAPPER_TYPE=1 METAL_DEBUG_ERROR_MODE=0 python ..

@Rifur13 (Contributor, Author) commented Apr 25, 2024

Fixed! It's probably a good idea to add these test options to the docs somewhere.

@awni (Member) commented Apr 27, 2024

@Rifur13 sorry, the delay in merging this caused a conflict. If you can fix it, we can merge ASAP. I also don't mind fixing the conflict sometime tomorrow.

@Rifur13 (Contributor, Author) commented Apr 27, 2024

Rebased. Should be fixed now

@awni (Member) left a review:

Thanks, this is awesome!

@awni merged commit c4a471c into ml-explore:main on Apr 27, 2024 (5 checks passed).
Linked issue: Feature Request: groups parameter in Conv1d (#237)