[TOPI] Minor perf improvement for GPU scatter #7233
Conversation
Would it be a better idea to have two separate scatter implementations (the parallel one and the sequential one) and let autotvm figure out which is better? Then we don't have to have all this special casing and magic input sizes. Do you also have some benchmarks you could show for these changes? (I'm not clear what the second text block is showing)
The second text block is an excerpt from the profiler output. I don't have benchmarks other than the data from MaskRCNN. For the first kernel of 4D scatter, since it is just a memcpy, I don't see why we should do threading differently than other injective ops. I hope we don't need thorough benchmarking to justify this change. After this change, only the first line of the trace changes (note the elapsed time and thread launch config).
Hmm, this sounds better than picking a random threshold, but do we have existing uses of autotvm to make such a decision? Given that scatter kernels are extern, I'm not sure if autotvm can work with them.
Autotvm does this for external libraries, which are all extern, so it will work here. I trust you when you say these are faster; I just wondered if you had done any benchmarking. Looking at the code, it seems it should be equally fast, but sometimes it surprises you. That is when benchmarks are useful.
Yes, there are 4 calls to 4D scatter in MaskRCNN. The old kernel was taking 11.6 milliseconds on them in total, making it one of the bottlenecks as shown in the profile above. This change brings it down to 1.9873 milliseconds total, and it is no longer a bottleneck. So this is a solid improvement. I think the reason the old kernel was slow for this input (1000, 256, 7, 7) is that the thread block is too small (32, 1, 1) and we are launching too many of them (1000 * 256 * 7 blocks).
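For a rough sense of the launch configurations involved, here is a back-of-the-envelope calculation (my own illustration; the 1024-thread block size is an assumption about the target's max_num_threads):

```python
# Old init kernel: one (32, 1, 1) block per inner row of the (1000, 256, 7, 7) output.
old_blocks = 1000 * 256 * 7          # 1,792,000 tiny blocks of 32 threads each
# New init kernel: flatten the output and thread it like an injective op.
num_elems = 1000 * 256 * 7 * 7       # 12,544,000 elements to initialize
block_size = 1024                    # assumed max threads per block on the GPU
new_blocks = (num_elems + block_size - 1) // block_size   # 12,250 large blocks
print(old_blocks, new_blocks)
```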
@tkonolige I like the idea of separating the sorting-based implementation of scatter, so I want to try this. Can you point me to where in the codebase autotvm deals with external libs? One issue is that currently the sorting-based approach is only implemented for 1D scatter. For higher dimensions, I think the sorting-based approach is a bad idea. So the dispatching decision needs to take the input dimension into account (not sure if this could be a problem for autotvm or the relay strategy).
@masahi Here is an example of having multiple implementations for the same op, with some of them being external: https://github.com/apache/tvm/blob/main/python/tvm/relay/op/strategy/x86.py#L371-L393. In that example (tvm/python/tvm/topi/x86/dense.py, line 265 in 54c995d), you can conditionally call the external implementation.
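For reference, a minimal sketch of what registering the two scatter implementations in the CUDA relay strategy could look like. The sort-based compute/schedule names (`topi.cuda.scatter_via_sort`, `schedule_scatter_via_sort`) and the plevel values are assumptions for illustration, not the exact diff in this PR:

```python
# Sketch: register both scatter implementations so autotvm can pick one.
from tvm import topi
from tvm.relay.op import op as _op
from tvm.relay.op.strategy.generic import (
    scatter_strategy,
    wrap_compute_scatter,
    wrap_topi_schedule,
)


@scatter_strategy.register(["cuda", "gpu"])
def scatter_cuda(attrs, inputs, out_type, target):
    strategy = _op.OpStrategy()
    # Default: the sequential kernel (higher plevel makes it the untuned choice).
    strategy.add_implementation(
        wrap_compute_scatter(topi.cuda.scatter),
        wrap_topi_schedule(topi.cuda.schedule_scatter),
        name="scatter.cuda",
        plevel=10,
    )
    # The sort-based parallel kernel only applies to 1D scatter; a lower plevel
    # means it is selected only when tuning shows it is faster.
    if len(inputs[0].shape) == 1:
        strategy.add_implementation(
            wrap_compute_scatter(topi.cuda.scatter_via_sort),
            wrap_topi_schedule(topi.cuda.schedule_scatter_via_sort),
            name="scatter_via_sort.cuda",
            plevel=9,
        )
    return strategy
```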
@tkonolige @mbrookhart I separated the two scatter implementations, so things should look cleaner now. The sequential one is chosen by default, and I confirmed that by tuning the scatter op the parallel one can be chosen. Tuning the scatter op revealed an interesting issue in AutoTVM, discussed in https://discuss.tvm.apache.org/t/autotvm-cuda-runtime-error-when-tuning-extern-ops/8832/7. Thanks @FrozenGene for the help.
Looks good! Seems like splitting it into two implementations made things cleaner.
LGTM
Thanks @masahi @tkonolige @FrozenGene
* improve scatter 4d init
* do not launch sorting based scatter for small input
* do not use hard coded num threads
* separate sort based implementation
* register scatter as autotvm task
* add missing import
* fix strategy
* add dedicated schedule and dummy flop
* add test tuning script
* try adding dummy knob
* skip random_fill when a tuning workload is from scatter (this reverts commit 1fed883)
* cleanup memcpy ir
* remove scatter tuning script
* make sure zero init arguments
* add comment on why skip random init for scatter
* restore ctx sync

Co-authored-by: masa <masa@pop-os.localdomain>
This PR updates GPU scatter in two ways to improve performance of GPU MaskRCNN (it should also help other workloads).
The PyTorch frontend uses a 1D scatter of one element to emulate the inplace assignment `arr[i] = v` (see tvm/python/tvm/relay/frontend/pytorch.py, lines 467 to 472 in d1399f3). The first change is to not launch the sorting-based scatter for such small inputs.
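As an illustration only (not the frontend's actual code), the pattern boils down to a single-element scatter along axis 0; the shapes and index value below are made up:

```python
# Hypothetical example of emulating arr[i] = v with a 1D scatter of one element.
import numpy as np
from tvm import relay

arr = relay.var("arr", shape=(8,), dtype="float32")
index = relay.const(np.array([3], dtype="int64"))     # the position i (made up)
value = relay.var("v", shape=(1,), dtype="float32")   # the new value v
# scatter writes `value` at position 3 of `arr` and copies the rest unchanged.
updated = relay.scatter(arr, index, value, axis=0)
```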
The first kernel (initialization) of 4D scatter turns out to be very slow. It is actually much slower than the second, sequential kernel, taking more than 10 milliseconds per MaskRCNN run, as shown in the profile and trace below. The performance likely depends on the input shape, but I found the way threading is done a bit strange. This PR changes the threading of the first kernel to be the same as other injective ops, so that it scales better regardless of input shape.
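A minimal sketch of the injective-style threading for the initialization (memcpy) kernel using the TIR ir_builder; the function name `gen_scatter_init_ir` and the overall shape handling are my own assumptions, not the exact code in the PR:

```python
# Sketch: flatten the 4D output and copy it with a 1D grid of large thread
# blocks (the way injective ops are scheduled), instead of one (32, 1, 1)
# block per inner row.
import tvm
from tvm import te


def gen_scatter_init_ir(data, out):
    ib = tvm.tir.ir_builder.create()
    data_ptr = ib.buffer_ptr(data)
    out_ptr = ib.buffer_ptr(out)

    # Total number of elements to copy (n * c * h * w for a 4D input).
    size = 1
    for dim in data.shape:
        size *= dim

    # Thread it like an injective op: max-sized blocks, enough blocks to cover size.
    max_threads = int(tvm.target.Target.current(allow_none=False).max_num_threads)
    tx = te.thread_axis("threadIdx.x")
    bx = te.thread_axis("blockIdx.x")
    ib.scope_attr(tx, "thread_extent", max_threads)
    ib.scope_attr(bx, "thread_extent", (size + max_threads - 1) // max_threads)

    tid = bx * max_threads + tx
    with ib.if_scope(tid < size):
        out_ptr[tid] = data_ptr[tid]
    return ib.get()
```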
These changes are not a big deal, but they bring a good speedup on MaskRCNN: they cut MaskRCNN runtime by 20 milliseconds.
please review @mbrookhart @tkonolige