HeteroLinear/SEGMM: switch from heuristic to timing-cache #8615
Conversation
What does this PR do on a high level?
Instead of using my sklearn heuristics, which were trained on an Ampere A100 (so probably not as accurate on other, very different cards), @stadlmax sets up a timer that figures out which kernel is faster and uses that. IMO this makes more sense because it works on all hardware. He will work on resolving the conflicts when he has bandwidth; I spoke to him about it yesterday.
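For context, here is a minimal sketch of the timing-cache idea being described (illustrative only; the cache and function names below are placeholders, not the PR's actual code): time both kernels once per input configuration, remember which one won, and reuse that decision afterwards. In PyG's case the two candidates would be the grouped-matmul and segment-matmul paths used by `HeteroLinear`.

```python
import time

import torch

_timing_cache = {}  # maps an input signature to the faster backend


def _time_fn(fn, *args, iters=10):
    # Average the runtime of `fn(*args)` over `iters` calls.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


def pick_backend(key, segment_fn, grouped_fn, *args):
    # Measure both candidates the first time this input configuration is
    # seen; afterwards, return the cached decision.
    if key not in _timing_cache:
        t_seg = _time_fn(segment_fn, *args)
        t_grp = _time_fn(grouped_fn, *args)
        _timing_cache[key] = 'segment' if t_seg <= t_grp else 'grouped'
    return _timing_cache[key]
```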
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##           master    #8615      +/-   ##
==========================================
+ Coverage   89.38%   89.40%    +0.02%
==========================================
  Files         479      479
  Lines       31152    31165       +13
==========================================
+ Hits        27845    27864       +19
+ Misses       3307     3301        -6

☔ View full report in Codecov by Sentry.
- Is this something similar to https://pytorch.org/docs/stable/backends.html#torch.backends.cudnn.benchmark, in the sense that it first benchmarks all algorithms and then selects the fastest one? Should we make this an opt-in feature, just like `cudnn.benchmark`?
- I'm not sure how unlikely it is, but is the measurement robust enough for each rank to always pick the same algorithm in a distributed setting?
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
for more information, see https://pre-commit.ci
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
I think in our case there are only two algorithms to check, and in many cases the speed difference is drastic (in either direction). We are currently selecting it based on sklearn heuristics I trained from benchmarks on a single A100 machine on a single software stack from about half a year ago, which is far less robust on the majority of cards besides the A100. I think the new solution is much better. Instead of making the heuristic opt-in separately, we can make it a part of the existing `torch_geometric.backend.use_segment_matmul`.
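As a hedged sketch of what folding the decision into the existing flag could look like (assuming `torch_geometric.backend.use_segment_matmul` keeps its `None`/`True`/`False` semantics; the helper below is illustrative, not the library's actual implementation):

```python
import torch_geometric


def should_use_segment_matmul(timed_decision):
    """Return True to take the segment_matmul path.

    `timed_decision` is a zero-argument callable returning the cached
    timing-based choice (e.g. a closure over `pick_backend` above).
    """
    flag = torch_geometric.backend.use_segment_matmul
    if flag is not None:
        return flag  # the user forced one kernel explicitly
    return timed_decision() == 'segment'  # otherwise defer to the timing cache
```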
I think this is a very low likelihood, and we run a similarly low risk with the current heuristic when using dynamic shapes on each rank.
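On the distributed question, purely as a hypothetical mitigation (not something this PR claims to implement): one could let rank 0 make the timing decision and broadcast it so all ranks agree.

```python
import torch
import torch.distributed as dist


def agree_on_backend(local_use_segment: bool) -> bool:
    # Without an initialized process group, just keep the local decision.
    if not (dist.is_available() and dist.is_initialized()):
        return local_use_segment
    # NCCL requires a CUDA tensor; fall back to CPU for other backends.
    device = 'cuda' if dist.get_backend() == 'nccl' else 'cpu'
    decision = torch.tensor([int(local_use_segment)], device=device)
    dist.broadcast(decision, src=0)  # every rank adopts rank 0's choice
    return bool(decision.item())
```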
@rusty1s any thoughts on the above discussion?
Yes, I agree that this solution is much better than the heuristic-based one. I think we would need to avoid measuring warmup times though, and likely adjust …
That sounds like a good idea to me. Do you have a suggested method of implementing such a system?
Currently, we use …
Okay, thanks @rusty1s. Do you have a desired value for MEASURE_ITER when we are testing? How about for when we aren't?
I think for testing …
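A sketch of the warmup handling discussed above; `MEASURE_ITER` is the name mentioned in the thread, while `WARMUP_ITER` and both values are illustrative placeholders rather than what was ultimately merged.

```python
import time

import torch

WARMUP_ITER = 3    # illustrative: iterations discarded to absorb one-time setup cost
MEASURE_ITER = 10  # illustrative: iterations averaged for the actual measurement


def measure(fn, *args):
    # Warmup runs trigger lazy initialization, kernel autotuning, caching, etc.
    for _ in range(WARMUP_ITER):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # make sure warmup work has finished
    start = time.perf_counter()
    for _ in range(MEASURE_ITER):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # include all queued GPU work in the timing
    return (time.perf_counter() - start) / MEASURE_ITER
```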
for more information, see https://pre-commit.ci
LGTM! (but I'd suggest waiting for another review from Matthias)
Also, I think we can follow up in a separate PR, but we should update RGCNConv, right?
@rusty1s anything else needed to merge?
Please let me take a look tomorrow :) I'll try to get to it. If you are feeling blocked, please go ahead and merge.
That's nice :)
needed to rebase #8472