Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support multiple store providers in ObjectInterface #5452

Merged
merged 22 commits into from
Aug 21, 2019

Conversation

zhijunfu
Copy link
Contributor

What do these changes do?

This is to separate the support for multiple store providers from #5443 , which can be reviewed and merged independently.

Related issue number

Linter

  • [Y] I've run scripts/format.sh to lint the changes in this PR.

@zhijunfu zhijunfu requested review from raulchen, pcmoritz and edoakes and removed request for raulchen and pcmoritz August 14, 2019 11:42
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16288/
Test PASSed.

Copy link
Collaborator

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High level design looks good, just need to clean up some of the details. Please request my review again when you've addressed the comments.

zhijunfu and others added 6 commits August 15, 2019 14:09
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
@zhijunfu
Copy link
Contributor Author

@edoakes Thanks for reviewing! comments are resolved.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16310/
Test PASSed.

@edoakes
Copy link
Collaborator

edoakes commented Aug 15, 2019

Looks much better, I think it's almost there - left a few more comments.

The build also seems to be failing with a compile error.

@zhijunfu
Copy link
Contributor Author

Thanks, code has been updated.

Looks much better, I think it's almost there - left a few more comments.

The build also seems to be failing with a compile error.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16314/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16324/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16344/
Test PASSed.

@zhijunfu
Copy link
Contributor Author

@edoakes Let me know if you have any additional comments, or please give a Go if this looks ok?:) Thanks.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16359/
Test PASSed.

@edoakes
Copy link
Collaborator

edoakes commented Aug 17, 2019

Looks good except I think we should rethink the way that we're giving both the ObjectInterface and the DirectActorTransport a reference to the LocalMemoryStore.

Right now, the ObjectInterface creates the store as a shared_ptr and then creates separate LocalMemoryStoreProvider unique_ptrs for itself and the transport. Why not just have the ObjectInterface pass in a shared_ptr to its LocalMemoryStoreProvider? That way we wouldn't have to keep around the local_memory_store_ member and the DirectActorTransport wouldn't need to take a reference to the ObjectInterface in its constructor just to create the store provider.

LMK if I'm missing something here.

@zhijunfu
Copy link
Contributor Author

This is a good question. This is because the task submitter needs to use its own store provider to put the return objects to store, it cannot share the one in ObjectInterface as that one is used by user code otherwise it could deadlock. Imagine that user submits a task and blocks on the Get for the return object of that task, and then transport receives task return object and then puts it into store using the same provider.

Looks good except I think we should rethink the way that we're giving both the ObjectInterface and the DirectActorTransport a reference to the LocalMemoryStore.

Right now, the ObjectInterface creates the store as a shared_ptr and then creates separate LocalMemoryStoreProvider unique_ptrs for itself and the transport. Why not just have the ObjectInterface pass in a shared_ptr to its LocalMemoryStoreProvider? That way we wouldn't have to keep around the local_memory_store_ member and the DirectActorTransport wouldn't need to take a reference to the ObjectInterface in its constructor just to create the store provider.

LMK if I'm missing something here.

@edoakes
Copy link
Collaborator

edoakes commented Aug 18, 2019

I don't quite understand the problem - does a Get() request block the memory store provider (i.e., hold a mutex) the whole time? If this is the only solution given the current memory store design it seems like we should rethink that rather than work around it by making separate interfaces to it. The current design is not very semantically clear and seems like it could lead to bugs pretty easily - it would be best to have a single interface that gives explicit guarantees about what happens when multiple threads call into it at the same time.

@zhijunfu
Copy link
Contributor Author

Let me explain it with more details. The problem here is that we cannot guarantee multiple threads access for the same store provider instance would not block for all the store providers. It actually depends on the implementation for that store provider.

In this particular case, it's actually ok for the memory store provider to be accessed by multiple threads, given its current implementation; but it is NOT ok for plasma store provider, because when doing a Get, plasma client would block on reply from plasma store, thus other threads accessing the same plasma client would get blocked - this is inherent from the design of plasma, it is not because of core worker implementation.

Since we cannot guarantee multi-thread access can work for all the store providers, I think it's better for the transport to always create a separate store provider, so that it always works regardless of the exact store provider used. I think the code is more robust this way, instead of having to think about whether the specific store provider can be shared in multi-thread access, which is implementation specific to that store provider, and is more likely to produce bugs.

I do agree that we should avoid workarounds as much as possible, but I think in this case it's not a workaround, it instead deals with different store providers uniformly, avoids tight coupling with store provider specific implementations, and helps to avoid unintended bugs.

Anyway, this is not introduced by this PR, and is not for memory store provider. I think we can still discuss this if you have different thoughts, but this should not block this PR from merging.

Please kindly let me know if this doesn't make sense to you.

I don't quite understand the problem - does a Get() request block the memory store provider (i.e., hold a mutex) the whole time? If this is the only solution given the current memory store design it seems like we should rethink that rather than work around it by making separate interfaces to it. The current design is not very semantically clear and seems like it could lead to bugs pretty easily - it would be best to have a single interface that gives explicit guarantees about what happens when multiple threads call into it at the same time.

Copy link
Collaborator

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the problem now, thank you for the detailed explanation!

I'm still not very content with the solution of creating separated providers for the object interface and transport, but you're right that this wasn't introduced my this PR and we shouldn't block it on that. Let's please keep this in mind as we continue developing the core worker.

I left one more comment, otherwise LGTM. Should probably get at least another pair of eyes on it, though.

Copy link
Contributor

@raulchen raulchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Only left some comments about naming and documentation

zhijunfu and others added 6 commits August 19, 2019 17:11
Copy link
Contributor Author

@zhijunfu zhijunfu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated code. thanks.

@zhijunfu
Copy link
Contributor Author

zhijunfu commented Aug 19, 2019

Thanks for your detailed review also.

Yes going forward let's keep it in mind and try to simplify when possible. I think we're on the same page on this.

On the other perspective, on the design wise, having a self-contained and loosely coupled architecture is also very important, it makes it easier to make code changes, and it helps to reduce unintentional bugs. As the system grows bigger, it's very hard for a single person to know all the details of the system, thus having a loosely coupled design is important as it helps to make sure that making changes inside a module would not requires full understanding of other parts, and would not result in unintended effects/bugs.

There are a few examples in this PR (and the original large PR) that can be viewed differently from different perspectives.

Sometimes it's about to find the right balance between code simplicity and architecture design, I think they are both very important, and it would be good to keep both in mind going forward! :)

I understand the problem now, thank you for the detailed explanation!

I'm still not very content with the solution of creating separated providers for the object interface and transport, but you're right that this wasn't introduced my this PR and we shouldn't block it on that. Let's please keep this in mind as we continue developing the core worker.

I left one more comment, otherwise LGTM. Should probably get at least another pair of eyes on it, though.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16390/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16391/
Test PASSed.

Copy link
Collaborator

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I should have caught this earlier, but I'm now realizing that we aren't handling duplicate ObjectIDs in Get() or Wait() at all here. Also, given that we're using unordered data structures to keep track of the ObjectIDs, can't the ordering in the result set not match that of the input ids?

I think for Get(), the desired behavior is clear: we should be able to take a list of possibly duplicated IDs and always return the same length list of result objects in the same order as they were passed in. To do this, we should probably de-duplicate them immediately in the top-level Get() call and then assume the lists passed to store providers are unique. As an optimization we could alternatively pass a unordered_map<ObjectID, vector<int>> to the store providers and write results directly to the top-level results vector, but that might get messy.

For Wait(), the desired behavior is much less clear but we should be explicit in our handling of it - we discussed it a bit in person and agree that it's best to raise an exception if a non-unique list is passed in, forcing users to handle it at the application level.

Please add the above changes, document the semantics and assumptions for each method, and add to the object interface test to verify that the desired behavior is met.

@zhijunfu
Copy link
Contributor Author

Please see my reply inline:

Sorry, I should have caught this earlier, but I'm now realizing that we aren't handling duplicate ObjectIDs in Get() or Wait() at all here. Also, given that we're using unordered data structures to keep track of the ObjectIDs, can't the ordering in the result set not match that of the input ids?

The above statement is not true.

Get() and Wait already handles duplicated object ids. And there's no ordering issue as mentioned, because at the end of Get it will loop all the object ids passed in and fill the result vector.

I think for Get(), the desired behavior is clear: we should be able to take a list of possibly duplicated IDs and always return the same length list of result objects in the same order as they were passed in. To do this, we should probably de-duplicate them immediately in the top-level Get() call and then assume the lists passed to store providers are unique.

This is already the case with existing implementation.

As an optimization we could alternatively pass a unordered_map<ObjectID, vector<int>> to the store providers and write results directly to the top-level results vector, but that might get messy.

Yes, this would complicate the code. I'd prefer the first approach.

For Wait(), the desired behavior is much less clear but we should be explicit in our handling of it - we discussed it a bit in person and agree that it's best to raise an exception if a non-unique list is passed in, forcing users to handle it at the application level.

The PR actually handles the duplicates for Wait as well. I do agree the semantics for Wait is not as clear as Get, I think it's fine to fail it with an exception when there are duplicates.

Please add the above changes, document the semantics and assumptions for each method, and add to the object interface test to verify that the desired behavior is met.

As mentioned, Get already uses the suggested approach with proper tests. I can change Wait as suggested above.

Copy link
Collaborator

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I'm sorry about that - I clearly didn't read the final write into results closely enough. The current implementation looks good and given that you already are handling duplicates in Wait() let's leave it as-is (at least for now).

Could you please document the semantics for Get() and Wait() in the header comments though? And maybe a couple of short comments in the code where we de-duplicate and then replace the duplicates.

@zhijunfu
Copy link
Contributor Author

Ah, I'm sorry about that - I clearly didn't read the final write into results closely enough. The current implementation looks good and given that you already are handling duplicates in Wait() let's leave it as-is (at least for now).

Could you please document the semantics for Get() and Wait() in the header comments though? And maybe a couple of short comments in the code where we de-duplicate and then replace the duplicates.

Thanks! Will update with more comments.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16417/
Test PASSed.

@raulchen raulchen merged commit eab5957 into ray-project:master Aug 21, 2019
@raulchen raulchen deleted the multi-store-providers branch August 21, 2019 03:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants