
Release Futures by sending del msgs when fetched #23172

Merged 1 commit into master on Aug 8, 2017

Conversation

@amitmurthy (Contributor) commented Aug 8, 2017

0.5 introduced an optimization for distributed gc in which the remote node released a referenced Future as soon as it was fetched (since a Future caches the fetched value locally). This was done to

  1. avoid sending a del_msg (normally sent when a reference is collected locally), and
  2. enable early release of large remote data instead of waiting for finalization/collection of the smaller Future objects on the node holding references.

This does not cover the following two scenarios (reported in #23109):

f = @spawn rand(1,1)
@sync begin
    @async fetch(f)
    @async fetch(f)
end

and

wid1,wid2 = workers()[1:2]
f = @spawnat wid1 rand(1,1)
@sync begin
    @async fetch(f)
    @async remotecall_fetch(()->fetch(f), wid2)
end

In both cases the first async fetch results in the remote value being freed and hence the second async fetch hangs.

While the first scenario can be addressed by tracking multiple fetch calls on the same Future locally, the second scenario (which has the Future serialized to other nodes) cannot be addressed in the same manner.

This PR reverts the optimization and falls back to collection/finalization of Futures for distributed gc.

Distributed gc of Futures now relies on an explicit del_msg; however, this is now sent as soon as the Future is fetched and does not depend on local gc.
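
Conceptually, the fetch path now does roughly the following (a sketch only, not the literal implementation; isnull(r.v) and send_del_client are referenced later in this thread, while call_on_owner and fetch_ref are assumed helper names for the owner-side request):

# sketch: cache the value locally, then immediately release the remote side
function fetch(r::Future)
    !isnull(r.v) && return get(r.v)    # value already cached locally
    v = call_on_owner(fetch_ref, r)    # request the value from the owning node (yieldpoint)
    r.v = v                            # cache it locally first ...
    send_del_client(r)                 # ... then tell the owner it may drop the RemoteValue
    v
end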

Closes #23109

@amitmurthy requested a review from JeffBezanson on August 8, 2017 08:14
@JeffBezanson (Member)

If we do this, is there any remaining reason to have both Futures and RemoteChannels? If we want to keep both types, it should probably be an error to fetch a future again on another worker.

@amitmurthy (Contributor, Author)

The issue is internal (distributed gc), not with the interface. Broadly, the two types address different use cases.

A Future is a write-once, read-many type of reference - typically the result of a computation. Fetched values are therefore cacheable locally, and multiple fetches of the same Future do not incur any penalty.

A RemoteChannel is useful for a producer-consumer pattern of work distribution across multiple processes.

I don't subscribe to the idea of disallowing fetch(::Future) on multiple workers - it is a common pattern: spawn off computations from the master and distribute the futures to all workers.
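
For example, something along these lines (an illustrative snippet; it assumes a couple of workers added via addprocs and an arbitrary computation):

addprocs(2)

# spawn computations from the master ...
futs = [@spawn sum(rand(10^6)) for _ in 1:4]

# ... and let every worker fetch the same Futures
for w in workers()
    println(remotecall_fetch(() -> sum(fetch(f) for f in futs), w))
end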

Early release of the remote value on a fetch is a good optimization and I would be loath to see it go.

I think we can try another approach - release the remote value only on a del_msg, which we now send early, as soon as the value is fetched, instead of waiting for the finalizer.

Will update the PR.

@amitmurthy changed the title from "Revert back to relying on del_msg for all Futures distributed gc" to "Release Futures by sending del msgs when fetched" on Aug 8, 2017
@amitmurthy (Contributor, Author)

Updated.

@JeffBezanson (Member)

I don't see how this fixes either scenario. As long as del_client can happen before a future is GC'd, it will be possible to send the future to another worker after the RemoteValue is deleted. It seems to me something here will have to give an error.

@amitmurthy (Contributor, Author) commented Aug 8, 2017

Nope. Consider these statements executed in this order:

f = @spawnat 2 foo()                 # 2 knows that 1 has a reference to f
@async remotecall(()->fetch(f), 3)   # this also sends an add_msg to 2, adding a reference to f w.r.t. pid 3
@async fetch(f)                      # sends a del_msg to 2, but no worries - the add_msg has already been sent to 2

f = @spawnat 2 foo()                 # 2 knows that 1 has a reference to f
@async fetch(f)                      # sends a del_msg to 2; 2 deletes the reference
@async remotecall(()->fetch(f), 3)   # now, since f has been fetched, the fetched value itself is sent to 3;
                                     # 2 is no longer contacted for this reference

Note that add_msgs are sent before del_msgs in flush_gc_msgs - the ordering is important.
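
Paraphrasing the relevant part of flush_gc_msgs (a sketch, not the verbatim source):

function flush_gc_msgs(w::Worker)
    add = w.add_msgs; w.add_msgs = Any[]
    isempty(add) || remote_do(add_clients, w, add)   # queued additions are flushed first

    del = w.del_msgs; w.del_msgs = Any[]
    isempty(del) || remote_do(del_clients, w, del)   # queued deletions only afterwards
end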

@amitmurthy (Contributor, Author) commented Aug 8, 2017

The existing bug is due to the following flow and logic:

f = @spawnat 2 foo()   # 2 knows that 1 has a reference to f
@async fetch(f)        # this sends the request, and yields to the next call. 
                       # On 2, the value is returned and the RemoteValue deleted
                       # (without waiting for a del_msg) since 2 knows only 1 has a
                       # reference to f.
@async remotecall(()->fetch(f), 3)   # This sends an add_msg, but since 2 does not find a reference
                                     # to f, it creates a new RemoteValue. Also since the above fetch has not yet returned,
                                     # at the time of this call, `f` does not have the fetched value. 

@ararslan ararslan added the parallelism Parallel or distributed computation label Aug 8, 2017
@JeffBezanson (Member)

I see, the fix is that del_client doesn't happen until the local future value has been assigned. I think there is still a possible (but probably unlikely in practice) race condition since two Tasks are testing isnull(r.v) without synchronization.

@amitmurthy (Contributor, Author)

"since two Tasks are testing isnull(r.v) without synchronization."

This will be inefficient but will not result in an error condition. What will happen is that both tasks send a fetch request, each of which is processed by the remote node; multiple del_msgs follow later, and only the first is acted upon. The assumption, of course, is that both tasks are on the same OS thread and that isnull(r.v) is not a yieldpoint (which it is not).

This situation can be optimized in a separate PR.
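
A sketch of that worst case, reusing the first scenario from the PR description:

f = @spawn rand(1,1)
# both tasks run on the same OS thread and isnull(f.v) is not a yieldpoint,
# so each task's test-and-send happens without interruption
@sync begin
    @async fetch(f)   # sees isnull(f.v), sends a fetch request, waits for the reply
    @async fetch(f)   # the first reply is not back yet, so it also sees isnull(f.v)
                      # and sends a second request - wasteful but harmless
end
# both replies carry the value; both tasks then send a del_msg, and only the
# first del_msg has any effect on the owning node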

@JeffBezanson (Member)

There is a possible schedule where Task A completes its whole fetch + send_del_client after Task B tests isnull but before Task B finishes its fetch. I agree it seems very unlikely though.

@amitmurthy (Contributor, Author)

Just to clarify - the reason it cannot happen is that there is no yield between Task B testing isnull and sending its fetch request.

The fetch call is a yieldpoint.

On the wire, the messages will be in the following order (earliest first, left to right):

Send:     Task A fetch -> Task B fetch -> Task A send_del_client -> Task B send_del_client
Response: Task A fetch response -> Task B fetch response (this is the inefficiency: the same data is returned in both response messages)

In the other case:

Task A fetch -> Task A send_del_client -> Task B finds !isnull and hence sends no request

@amitmurthy (Contributor, Author)

On an unrelated note, all-green CI - haven't seen that in quite a while!

@JeffBezanson (Member)

Wow, congrats on having the only pull request that's actually safe to merge! :)

@amitmurthy (Contributor, Author) commented Aug 8, 2017

One other dependency is the order in which messages are processed on the remote node. Each message pulled from the socket is wrapped in a new task and scheduled. Since the current task scheduler just adds newly scheduled tasks to the end of the work queue, this is not currently an issue.

@JeffBezanson (Member)

Yes, I agree; it probably works now but depends on things being ordered in the expected way elsewhere. Seems OK for now since it at least works better than before.

@JeffBezanson merged commit 60bc26e into master on Aug 8, 2017
@JeffBezanson deleted the amitm/23109_fetfut branch on August 8, 2017 22:02
ararslan pushed a commit that referenced this pull request Sep 11, 2017
ararslan pushed a commit that referenced this pull request Sep 13, 2017
vtjnash pushed a commit that referenced this pull request Sep 14, 2017
ararslan pushed a commit that referenced this pull request Sep 15, 2017
ararslan pushed a commit that referenced this pull request Sep 16, 2017
ararslan pushed a commit that referenced this pull request Sep 18, 2017