[TOPI] Minor perf improvement for GPU scatter #7233
Conversation
Would it be a better idea to have two separate scatter implementations (the parallel one and the sequential one) and let autotvm figure out which is better? Then we don't have to have all this special casing and magic input sizes. Do you also have some benchmarks you could show for these changes? (I'm not clear what the second text block is showing)
The second text block is an excerpt from the profiler output. I don't have benchmarks other than the data from MaskRCNN. For the first kernel of 4D scatter, since it is just a memcpy, I don't see why we should do threading differently than other injective ops. I hope we don't need thorough benchmarking to justify this change. After this change, only the first line of the trace changes (note the elapsed time and thread launch config).
Hmm, this sounds better than picking a random threshold, but do we have existing uses of autotvm to make such a decision? Given that scatter kernels are extern, I'm not sure if autotvm can work with them.
Autotvm does this for external libraries, which are all extern, so it will work here. I trust you when you say these are faster; I just wondered if you had done any benchmarking. Looking at the code, it seems it should be equally fast, but sometimes it surprises you. That is when benchmarks are useful.
Yes, there are 4 calls to 4D scatter in MaskRCNN. The old kernel was taking 11.6 milliseconds on them in total, making it one of the bottlenecks as shown in the profile above. This change brings it down to 1.9873 milliseconds total, and it is no longer a bottleneck. So this is a solid improvement. I think the reason the old kernel was slow for this input (1000, 256, 7, 7) is that the thread block is too small (32, 1, 1) and we are launching too many of them (1000 * 256 * 7 blocks).
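For a rough sense of the launch configurations involved, here is a back-of-the-envelope calculation (my own illustration; the 1024-thread block size is an assumption about the target's max_num_threads):

```python
# Old init kernel: one (32, 1, 1) block per inner row of the (1000, 256, 7, 7) output.
old_blocks = 1000 * 256 * 7          # 1,792,000 tiny blocks of 32 threads each
# New init kernel: flatten the output and thread it like an injective op.
num_elems = 1000 * 256 * 7 * 7       # 12,544,000 elements to initialize
block_size = 1024                    # assumed max threads per block on the GPU
new_blocks = (num_elems + block_size - 1) // block_size   # 12,250 large blocks
print(old_blocks, new_blocks)
```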
@tkonolige I like the idea of separating the sorting-based implementation of scatter, so I want to try this. Can you point me to where in the codebase autotvm deals with external libs? One issue is that currently the sorting-based approach is only implemented for 1D scatter. For higher dimensions, I think the sorting-based approach is a bad idea. So the dispatching decision needs to take the input dimension into account (not sure if this could be a problem for autotvm or the relay strategy).
@masahi Here is an example of having multiple implementations for the same op, with some of them being external: https://github.com/apache/tvm/blob/main/python/tvm/relay/op/strategy/x86.py#L371-L393. In that example (tvm/python/tvm/topi/x86/dense.py, line 265 in 54c995d), you can conditionally call the external implementation.
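For reference, a minimal sketch of what registering the two scatter implementations in the CUDA relay strategy could look like. The sort-based compute/schedule names (`topi.cuda.scatter_via_sort`, `schedule_scatter_via_sort`) and the plevel values are assumptions for illustration, not the exact diff in this PR:

```python
# Sketch: register both scatter implementations so autotvm can pick one.
from tvm import topi
from tvm.relay.op import op as _op
from tvm.relay.op.strategy.generic import (
    scatter_strategy,
    wrap_compute_scatter,
    wrap_topi_schedule,
)


@scatter_strategy.register(["cuda", "gpu"])
def scatter_cuda(attrs, inputs, out_type, target):
    strategy = _op.OpStrategy()
    # Default: the sequential kernel (higher plevel makes it the untuned choice).
    strategy.add_implementation(
        wrap_compute_scatter(topi.cuda.scatter),
        wrap_topi_schedule(topi.cuda.schedule_scatter),
        name="scatter.cuda",
        plevel=10,
    )
    # The sort-based parallel kernel only applies to 1D scatter; a lower plevel
    # means it is selected only when tuning shows it is faster.
    if len(inputs[0].shape) == 1:
        strategy.add_implementation(
            wrap_compute_scatter(topi.cuda.scatter_via_sort),
            wrap_topi_schedule(topi.cuda.schedule_scatter_via_sort),
            name="scatter_via_sort.cuda",
            plevel=9,
        )
    return strategy
```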
@tkonolige @mbrookhart I separated the two scatter implementations, so things should look cleaner now. The sequential one is chosen by default, and I confirmed that by tuning the scatter op the parallel one can be chosen. Tuning the scatter op revealed an interesting issue in AutoTVM, discussed in https://discuss.tvm.apache.org/t/autotvm-cuda-runtime-error-when-tuning-extern-ops/8832/7. Thanks @FrozenGene for the help.
Looks good! Seems like splitting it into two implementations made things cleaner.
LGTM
Thanks @masahi @tkonolige @FrozenGene
* improve scatter 4d init
* do not launch sorting based scatter for small input
* do not use hard coded num threads
* separate sort based implementation
* register scatter as autotvm task
* add missing import
* fix strategy
* add dedicated schedule and dummy flop
* add test tuning script
* try adding dummy knob
* skip random_fill when a tuning workload is from scatter (this reverts commit 1fed883)
* cleanup memcpy ir
* remove scatter tuning script
* make sure zero init arguments
* add comment on why skip random init for scatter
* restore ctx sync

Co-authored-by: masa <masa@pop-os.localdomain>
This PR updates GPU scatter in two ways to improve performance of GPU MaskRCNN (it should also help other workloads).
The PyTorch frontend uses a 1D scatter of one element to emulate the inplace assignment `arr[i] = v` (see tvm/python/tvm/relay/frontend/pytorch.py, lines 467 to 472 in d1399f3). The first change is to not launch the sorting-based scatter for such small inputs.
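As an illustration only (not the frontend's actual code), the pattern boils down to a single-element scatter along axis 0; the shapes and index value below are made up:

```python
# Hypothetical example of emulating arr[i] = v with a 1D scatter of one element.
import numpy as np
from tvm import relay

arr = relay.var("arr", shape=(8,), dtype="float32")
index = relay.const(np.array([3], dtype="int64"))     # the position i (made up)
value = relay.var("v", shape=(1,), dtype="float32")   # the new value v
# scatter writes `value` at position 3 of `arr` and copies the rest unchanged.
updated = relay.scatter(arr, index, value, axis=0)
```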
The first kernel (initialization) of 4D scatter turns out to be very slow. It is actually much slower than the second, sequential kernel, taking more than 10 milliseconds per MaskRCNN run, as shown in the profile and trace below. The performance likely depends on the input shape, but I found the way threading is done a bit strange. This PR changes the threading of the first kernel to be the same as other injective ops, so that it scales better regardless of input shape.
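A minimal sketch of the injective-style threading for the initialization (memcpy) kernel using the TIR ir_builder; the function name `gen_scatter_init_ir` and the overall shape handling are my own assumptions, not the exact code in the PR:

```python
# Sketch: flatten the 4D output and copy it with a 1D grid of large thread
# blocks (the way injective ops are scheduled), instead of one (32, 1, 1)
# block per inner row.
import tvm
from tvm import te


def gen_scatter_init_ir(data, out):
    ib = tvm.tir.ir_builder.create()
    data_ptr = ib.buffer_ptr(data)
    out_ptr = ib.buffer_ptr(out)

    # Total number of elements to copy (n * c * h * w for a 4D input).
    size = 1
    for dim in data.shape:
        size *= dim

    # Thread it like an injective op: max-sized blocks, enough blocks to cover size.
    max_threads = int(tvm.target.Target.current(allow_none=False).max_num_threads)
    tx = te.thread_axis("threadIdx.x")
    bx = te.thread_axis("blockIdx.x")
    ib.scope_attr(tx, "thread_extent", max_threads)
    ib.scope_attr(bx, "thread_extent", (size + max_threads - 1) // max_threads)

    tid = bx * max_threads + tx
    with ib.if_scope(tid < size):
        out_ptr[tid] = data_ptr[tid]
    return ib.get()
```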
These changes are not a big deal, but they bring a good speedup on MaskRCNN: they cut MaskRCNN runtime by 20 milliseconds.
please review @mbrookhart @tkonolige