[TOPI] Custom schedule for standalone transpose in cuda #8030
Conversation
Some minor nits about code organization and the test, but otherwise looks good and is a welcome addition. For other reviewers: offline testing of this transpose schedule (when applicable) showed about a 2x improvement over the baseline injective transpose schedule on BERT Large training workloads. I do wonder how we could select whether or not to opt in to this vs trying to tune the original injective schedule, but I think in most cases this new kernel will probably be faster.
Dispatches to an optimized schedule if the transpose is standalone (not fused).
"""
warp_size = int(Target.current(allow_none=False).thread_warp_size)
if (
is there a more principled way to do this? like maybe with an OpStrategy or something
As far as I can tell, there is not a better way to do this. There is a way to add implementations based on input sizes, but these are not on a per-target basis. If you know a better way, let me know.
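For reference, the quoted snippet above cuts off at the guard. A hedged sketch of what such a standalone-transpose check might look like; the helper name and the exact conditions are assumptions for illustration, not the PR's verbatim code:

from tvm import te, tir
from tvm.target import Target


def _should_use_transpose_schedule(out):
    """Heuristic guard (sketch only): True when `out` looks like a standalone
    2-D transpose with concrete, warp-friendly shapes."""
    warp_size = int(Target.current(allow_none=False).thread_warp_size)
    op = out.op
    return (
        isinstance(op, te.ComputeOp)
        and len(op.input_tensors) == 1                         # nothing fused in
        and len(out.shape) == 2                                # plain 2-D transpose
        and all(isinstance(d, tir.IntImm) for d in out.shape)  # reject Any / symbolic dims
        and all(int(d) >= warp_size for d in out.shape)        # big enough to tile
    )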
python/tvm/topi/cuda/sparse.py
Outdated
@@ -105,13 +105,22 @@ def _callback(op):
     return s


-def schedule_cuda_transpose(s, out):
+def schedule_transpose(outs):
feels a bit weird to have this in sparse.py
moved to transform.py
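For context, the signature change from schedule_cuda_transpose(s, out) to schedule_transpose(outs) suggests the relocated entry point now builds its own schedule rather than mutating one passed in. A plausible sketch of the wrapper in transform.py; details are assumptions:

from tvm import te


def schedule_transpose(outs):
    """Create and return a schedule for a standalone transpose."""
    outs = [outs] if isinstance(outs, te.tensor.Tensor) else list(outs)
    out = outs[0]
    s = te.create_schedule([out.op])
    # Reuse the pre-existing in-place scheduling logic (the old
    # schedule_cuda_transpose body), which mutates `s` directly.
    schedule_cuda_transpose(s, out)
    return s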
r = np.random.rand(*shape)
tvm.testing.assert_allclose(ex.evaluate()(r).asnumpy(), np.transpose(r))

# make sure schedule does not fire here
is this a TODO? Also, I wonder if it would be good to parametrize the test shape by the warp size (rather than hard-coding it) for future-proofing
It is more like a wish. Ideally we would be able to know which schedules were used, but there is no way to introspect on what was used. I've updated the comment to reflect this.
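Putting the quoted test lines and the warp-size suggestion together, a hedged reconstruction of what the test could look like with the shape derived from the warp size. The shape, dtype, and executor arguments are assumptions, and argument names differ across TVM versions:

import numpy as np
import tvm
import tvm.testing
from tvm import relay


def test_transpose():
    # Derive the shape from the warp size instead of hard-coding it.
    warp_size = int(tvm.target.Target("cuda").thread_warp_size)
    shape = (32 * warp_size, 16 * warp_size)
    x = relay.var("x", shape=shape, dtype="float64")   # np.random.rand yields float64
    func = relay.Function([x], relay.transpose(x))
    ex = relay.create_executor(
        "graph", mod=tvm.IRModule.from_expr(func), device=tvm.cuda(0), target="cuda"
    )
    r = np.random.rand(*shape)
    tvm.testing.assert_allclose(ex.evaluate()(r).asnumpy(), np.transpose(r))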
@@ -357,7 +357,7 @@ def tune_and_evaluate(tuning_opt):
    )

    # filter out non-packed conv2d task
-   tasks = list(filter(lambda t: len(t.args[0][1]) > 4, tasks))
+   tasks = list(filter(lambda t: len(t.args[0][1]) > 4 and "conv" in t.name, tasks))
what happened here, did this transpose change introduce a new task or something?
yes
actually, no, but this check makes sure anyways.
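For anyone verifying this in the VTA tutorial, the extracted tasks can simply be printed before filtering; an illustrative snippet, not part of the tutorial itself:

# `tasks` comes from the autotvm task extraction earlier in the tutorial;
# the extra `"conv" in t.name` check guards against any non-conv tasks
# that other schedules might register in the future.
for t in tasks:
    print(t.name, t.args[0][1])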
Isn't the newly added schedule not tunable? Or is there any concern about adding knobs?
We may want to tune it in the future.
LGTM. The only thing I'm wondering about: if someone (for whatever reason) really wanted to tune the default injective schedule for transpose, is there any way to allow that?
cc @comaniac for additional review (feel free to tag more relevant reviewers)
There's no reason to tune the injective schedule, and you basically cannot do it because the injective schedule doesn't have AutoTVM knobs for tuning.
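To make the remark concrete: the injective schedule has no knobs today, but if the new transpose kernel were ever made tunable, a knob could be exposed through a standard AutoTVM template. A hypothetical sketch; the task name and knob values are illustrative and not part of this PR:

from tvm import autotvm, te


@autotvm.template("sketch/tunable_transpose")     # hypothetical task name
def tunable_transpose(M, N, dtype="float32"):
    A = te.placeholder((M, N), dtype=dtype, name="A")
    T = te.compute((N, M), lambda i, j: A[j, i], name="T")
    s = te.create_schedule(T.op)

    cfg = autotvm.get_config()
    cfg.define_knob("tile", [16, 32, 64])          # candidate tile sizes

    i, j = s[T].op.axis
    bi, ti = s[T].split(i, factor=cfg["tile"].val)
    bj, tj = s[T].split(j, factor=cfg["tile"].val)
    s[T].reorder(bi, bj, ti, tj)
    s[T].bind(bi, te.thread_axis("blockIdx.y"))
    s[T].bind(bj, te.thread_axis("blockIdx.x"))
    s[T].bind(ti, te.thread_axis("threadIdx.y"))
    s[T].bind(tj, te.thread_axis("threadIdx.x"))
    return s, [A, T]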
# We want to make sure schedule does not fire here, but there is no way of
# inspecting which schedules were used.
Like this comment mentions, there is no way of inspecting which schedules were used, so it seems to me that the difference between this test and test_transpose is that the workload in this test includes add to test the case of fusion. Accordingly, could we just extend test_transpose?
We could. I'd like to keep it separate so the intention is clear.
Fair enough. Then it might be better to name it test_transpose_fuse or something like that (nit).
switched to test_transpose_unfused_schedule
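A hedged sketch of what the renamed test might look like: a transpose whose result is consumed by an add, so the op gets fused and the specialized schedule should not apply. Shapes, names, and executor arguments are assumptions:

import numpy as np
import tvm
import tvm.testing
from tvm import relay


def test_transpose_unfused_schedule():
    shape = (1024, 1024)                               # assumed shape
    x = relay.var("x", shape=shape, dtype="float64")
    y = relay.var("y", shape=(shape[1], shape[0]), dtype="float64")
    # Transpose followed by add: the transpose is fused, so the specialized
    # schedule should not fire, but correctness is still checked.
    func = relay.Function([x, y], relay.add(relay.transpose(x), y))
    ex = relay.create_executor(
        "graph", mod=tvm.IRModule.from_expr(func), device=tvm.cuda(0), target="cuda"
    )
    r = np.random.rand(*shape)
    r2 = np.random.rand(*shape[::-1])
    tvm.testing.assert_allclose(
        ex.evaluate()(r, r2).asnumpy(), np.transpose(r) + r2
    )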
I don't have other comments. LGTM.
* [TOPI] Custom schedule for standalone transpose in cuda
* check if input is not Any
* fix vta test
* check input shape
* fix injective
* move transpose out of sparse.py
* update comments, use warp size
* missspelled transform
* formatting
* rename test
* comment
* fix tests
This PR adds an optimized CUDA schedule for transpose when the transpose is not fused into anything else.
@altanh @junrushao1994
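For readers who want the gist of the kernel without opening the diff: a minimal TE sketch of a tiled, shared-memory transpose of the kind this PR adds. The tile size, names, and the omitted bank-conflict padding are assumptions, not the PR's exact code:

import tvm
from tvm import te


def transpose_sketch(M, N, tile=32, dtype="float32"):
    A = te.placeholder((M, N), dtype=dtype, name="A")
    T = te.compute((N, M), lambda i, j: A[j, i], name="T")
    s = te.create_schedule(T.op)

    # Stage the input tile through shared memory so both the global load
    # (along A's fast axis) and the global store (along T's fast axis) are
    # coalesced; the transposition happens inside shared memory.
    AS = s.cache_read(A, "shared", [T])

    i, j = s[T].op.axis
    bi, ti = s[T].split(i, factor=tile)
    bj, tj = s[T].split(j, factor=tile)
    s[T].reorder(bi, bj, ti, tj)
    s[T].bind(bi, te.thread_axis("blockIdx.y"))
    s[T].bind(bj, te.thread_axis("blockIdx.x"))
    s[T].bind(ti, te.thread_axis("threadIdx.y"))
    s[T].bind(tj, te.thread_axis("threadIdx.x"))

    # Cooperative load: each thread block fills one tile x tile region of
    # shared memory before writing the transposed tile back out.
    s[AS].compute_at(s[T], bj)
    a0, a1 = s[AS].op.axis
    s[AS].bind(a0, te.thread_axis("threadIdx.y"))
    s[AS].bind(a1, te.thread_axis("threadIdx.x"))
    # A production kernel would typically also pad the shared tile (e.g. one
    # extra column) to avoid shared-memory bank conflicts; omitted for brevity.
    return s, [A, T]


# Usage (assumes M and N are multiples of the tile size):
# s, args = transpose_sketch(1024, 2048)
# mod = tvm.build(s, args, target="cuda")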