Attempt to get new distributed DAG partitioner working #401
Conversation
Just looked at the code some more. Like I said in the meeting, I agree with the analysis of what the problem is, but I'm not particularly fond of all the (fairly implicit) logic that flows from "sends live here, and receives live one floor down". I think this would look cleaner (and be simpler to reason about) if we made proper, separate nodes for sends and receives before entering into the topological sort. This could be as simple as just […]. (Of course, make sure that these don't end up comparing equal to one another.)
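(For illustration, a minimal sketch of what such placeholder nodes could look like; the names _DummySendNode/_DummyRecvNode and the wrapper approach are assumptions, not pytato's actual classes. The important bit is that equality and hashing stay identity-based, so no two placeholders compare equal.)

from typing import Any


class _DummySendNode:
    """Hypothetical placeholder inserted for each send before the topological
    sort. Relies on the default (identity-based) __eq__/__hash__, so two
    distinct sends never compare equal."""
    def __init__(self, send: Any) -> None:
        self.send = send


class _DummyRecvNode:
    """Hypothetical placeholder inserted for each receive."""
    def __init__(self, recv: Any) -> None:
        self.recv = recv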
I'd attribute […]. […] means that a receive node "stuck around" in the DAG until code generation, when in fact the […] (see pytato/pytato/distributed/partition.py, line 167 in 94b7d19).
Added some fixes. The distributed example is working now, and Kaushik's reproducer almost works. Specifically, it works if I add a node after the final receive; there's still some issue for parts in which a receive is both an input and an output. (I haven't looked at the test error yet, or the […].)
force-pushed from 537486f to 7e452ae
force-pushed from aefb354 to e0dba7c
force-pushed from 61518e4 to a43973b
@inducer FYI, there's (at least) one more lingering issue. When I run Kaushik's hang reproducer, I see the following errors: […] and […] (I see both of them at once; I guess one rank is emitting one and the other rank is emitting the other?) Both errors go away if I make the following change:

diff --git a/hang_reproducer_orig.py b/hang_reproducer_almost_orig.py
index 7d653ef..30e1446 100644
--- a/hang_reproducer_orig.py
+++ b/hang_reproducer_almost_orig.py
@@ -34,12 +34,12 @@ elif rank == 1:
recv = pt.make_distributed_recv(
src_rank=recv_rank, comm_tag=43,
shape=(10,), dtype=np.float64)
- out = pt.make_dict_of_named_arrays({"out1": send, "out2": recv})
+ out = pt.make_dict_of_named_arrays({"out1": send, "out2": 2*recv})
else:
raise AssertionError()

So I think it's something to do with the receive being both an input and an output.
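(A hedged sketch of the failing vs. working shapes of the rank-1 graph from the diff above; the shape, dtype, and comm_tag values come from the diff, while the src_rank value and the surrounding setup are guesses for illustration only.)

import numpy as np
import pytato as pt

# Failing case: the received array is itself a named output, so the receive
# is both an "input" to its part and one of the part's outputs.
recv = pt.make_distributed_recv(
        src_rank=0, comm_tag=43, shape=(10,), dtype=np.float64)
out_failing = pt.make_dict_of_named_arrays({"out2": recv})

# Working case: any node placed after the final receive (here a trivial
# 2*recv) makes the receive an interior node, and the errors go away.
out_working = pt.make_dict_of_named_arrays({"out2": 2*recv})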
Good to know! I'll add that as a to-do to #393.
force-pushed from effd604 to 692e681
Co-authored-by: Matt Smith <mjsmith6@illinois.edu>
force-pushed from c8db611 to f8dacd3
@inducer, @matthiasdiener To get the new partitioning working on the prediction case, I had to change the code to avoid making some assumptions about the incoming arrays. Specifically: […]

@inducer I haven't incorporated your changes for the hang reproducer issue into this branch yet. Interestingly, with the above changes I no longer get the error; not sure if it's actually fixed or if the non-determinism issue is still lurking somewhere.

Edit: Oh, and I also had to disable the check for partition disjointness (I was getting errors here, both with and without the other changes).
force-pushed from b9eb8ad to da52e71
The earlier commit ensures FusionContractorArraycontext does not fall back to SingleWorkGroupPytatoArraycontext for a reasonably wide class of array expressions, and hence improves kernel execution time by a significant factor. The latter commit is a minor optimization to amortize some operations in the transformation pipeline and has no effect on the quality of the generated code.
Sorry, I should have specified that I was looking at compile-time performance improvements. 🙂 It was taking significantly longer without that pymbolic expressions commit.
Aah yes. SingleWorkGroupArrayContext takes very long to compile as well. One of the limitations of loopy's implementation (fortunately not a limitation of the IR itself).
force-pushed from 607dd13 to acd764a
force-pushed from acd764a to 42982b9
pytato/distributed/partition.py (Outdated)
# {{{ crude ordered set

class _OrderedSet(collections.abc.MutableSet[Any]):
Maybe:
- class _OrderedSet(collections.abc.MutableSet[Any]):
+ class _OrderedSet(collections.abc.MutableSet[Hashable]):
I ran into some mypy errors when attempting this, and after struggling with it for a while I ultimately ended up with 7c3f47d.
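(For reference, a rough sketch of what a dict-backed "crude ordered set" along these lines can look like. This is only an illustration of the idea, not the code that landed in 7c3f47d, and the base class is left unparametrized here to sidestep the mypy trouble mentioned above.)

import collections.abc
from typing import Any, Dict, Iterable, Iterator, Optional


class _OrderedSet(collections.abc.MutableSet):
    """Set that remembers insertion order, backed by a dict (whose keys
    preserve insertion order in Python 3.7+)."""

    def __init__(self, items: Optional[Iterable[Any]] = None) -> None:
        self._dict: Dict[Any, None] = dict.fromkeys(items) if items else {}

    def __contains__(self, item: object) -> bool:
        return item in self._dict

    def __iter__(self) -> Iterator[Any]:
        return iter(self._dict)

    def __len__(self) -> int:
        return len(self._dict)

    def add(self, item: Any) -> None:
        self._dict[item] = None

    def discard(self, item: Any) -> None:
        self._dict.pop(item, None)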
force-pushed from f8dacd3 to 7b38c26
Superseded by #434.
Bug fixes for #393. Depends on #417.
Original first post (now outdated):
I think the way we were identifying communication batches with parts wasn't quite right, because it still allows for deadlocks. For example, in the examples/distributed.py case, neither the send nor the receive depends on any prior local sends/receives, so they end up in the first comm batch. But they can't both be executed in the first part, because the sends have to be done before the receives (see the sketch at the end of this post). This PR reinterprets communication batches as existing in between consecutive parts and then assembles the parts around them.

With this change I can almost get the distributed example working. I get an error:
which I was able to fix in a hacky way in order to get it to run successfully. I don't know how to fix this the proper way though.
When I try to run Kaushik's reproducer I get a different error:
and if I run the tests I get yet a third error:
Any idea how to fix these?
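(For context, a minimal mpi4py illustration, not pytato code and not taken from examples/distributed.py, of the ordering constraint described above: if each rank blocks on its receive before its send has been posted, i.e. if the send and the receive land in the same part with the receive executed first, the exchange can deadlock.)

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
other = 1 - rank  # assumes exactly two ranks exchanging data

x = np.full(10, float(rank))
buf = np.empty(10)

# Fine: the send is posted (non-blocking) before we block on the receive.
send_req = comm.Isend(x, dest=other, tag=42)
comm.Recv(buf, source=other, tag=42)
send_req.Wait()

# Deadlock-prone ordering: calling comm.Recv(...) on both ranks before either
# rank has posted its send leaves both ranks waiting on each other.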