Support multiple store providers in ObjectInterface #5452

zhijunfu · 2019-08-14T11:42:07Z

What do these changes do?

This is to separate the support for multiple store providers from #5443 , which can be reviewed and merged independently.

Related issue number

Linter

[Y] I've run scripts/format.sh to lint the changes in this PR.

AmplabJenkins · 2019-08-14T14:55:46Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16288/
Test PASSed.

edoakes

High level design looks good, just need to clean up some of the details. Please request my review again when you've addressed the comments.

src/ray/core_worker/object_interface.cc

src/ray/core_worker/object_interface.h

src/ray/core_worker/store_provider/store_provider.h

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

…lliance/ray into multi-store-providers

zhijunfu · 2019-08-15T09:23:51Z

@edoakes Thanks for reviewing! comments are resolved.

AmplabJenkins · 2019-08-15T09:52:09Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16310/
Test PASSed.

src/ray/core_worker/object_interface.cc

edoakes · 2019-08-15T17:40:01Z

Looks much better, I think it's almost there - left a few more comments.

The build also seems to be failing with a compile error.

zhijunfu · 2019-08-16T03:45:48Z

Thanks, code has been updated.

Looks much better, I think it's almost there - left a few more comments.

The build also seems to be failing with a compile error.

AmplabJenkins · 2019-08-16T05:03:27Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16314/
Test FAILed.

AmplabJenkins · 2019-08-16T08:54:10Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16324/
Test PASSed.

AmplabJenkins · 2019-08-16T19:16:27Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16344/
Test PASSed.

zhijunfu · 2019-08-17T00:16:16Z

@edoakes Let me know if you have any additional comments, or please give a Go if this looks ok?:) Thanks.

AmplabJenkins · 2019-08-17T04:28:27Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16359/
Test PASSed.

edoakes · 2019-08-17T23:46:30Z

Looks good except I think we should rethink the way that we're giving both the ObjectInterface and the DirectActorTransport a reference to the LocalMemoryStore.

Right now, the ObjectInterface creates the store as a shared_ptr and then creates separate LocalMemoryStoreProvider unique_ptrs for itself and the transport. Why not just have the ObjectInterface pass in a shared_ptr to its LocalMemoryStoreProvider? That way we wouldn't have to keep around the local_memory_store_ member and the DirectActorTransport wouldn't need to take a reference to the ObjectInterface in its constructor just to create the store provider.

LMK if I'm missing something here.

zhijunfu · 2019-08-18T04:17:30Z

This is a good question. This is because the task submitter needs to use its own store provider to put the return objects to store, it cannot share the one in ObjectInterface as that one is used by user code otherwise it could deadlock. Imagine that user submits a task and blocks on the Get for the return object of that task, and then transport receives task return object and then puts it into store using the same provider.

Looks good except I think we should rethink the way that we're giving both the ObjectInterface and the DirectActorTransport a reference to the LocalMemoryStore.

Right now, the ObjectInterface creates the store as a shared_ptr and then creates separate LocalMemoryStoreProvider unique_ptrs for itself and the transport. Why not just have the ObjectInterface pass in a shared_ptr to its LocalMemoryStoreProvider? That way we wouldn't have to keep around the local_memory_store_ member and the DirectActorTransport wouldn't need to take a reference to the ObjectInterface in its constructor just to create the store provider.

LMK if I'm missing something here.

edoakes · 2019-08-18T05:14:34Z

I don't quite understand the problem - does a Get() request block the memory store provider (i.e., hold a mutex) the whole time? If this is the only solution given the current memory store design it seems like we should rethink that rather than work around it by making separate interfaces to it. The current design is not very semantically clear and seems like it could lead to bugs pretty easily - it would be best to have a single interface that gives explicit guarantees about what happens when multiple threads call into it at the same time.

zhijunfu · 2019-08-18T07:02:56Z

Let me explain it with more details. The problem here is that we cannot guarantee multiple threads access for the same store provider instance would not block for all the store providers. It actually depends on the implementation for that store provider.

In this particular case, it's actually ok for the memory store provider to be accessed by multiple threads, given its current implementation; but it is NOT ok for plasma store provider, because when doing a Get, plasma client would block on reply from plasma store, thus other threads accessing the same plasma client would get blocked - this is inherent from the design of plasma, it is not because of core worker implementation.

Since we cannot guarantee multi-thread access can work for all the store providers, I think it's better for the transport to always create a separate store provider, so that it always works regardless of the exact store provider used. I think the code is more robust this way, instead of having to think about whether the specific store provider can be shared in multi-thread access, which is implementation specific to that store provider, and is more likely to produce bugs.

I do agree that we should avoid workarounds as much as possible, but I think in this case it's not a workaround, it instead deals with different store providers uniformly, avoids tight coupling with store provider specific implementations, and helps to avoid unintended bugs.

Anyway, this is not introduced by this PR, and is not for memory store provider. I think we can still discuss this if you have different thoughts, but this should not block this PR from merging.

Please kindly let me know if this doesn't make sense to you.

I don't quite understand the problem - does a Get() request block the memory store provider (i.e., hold a mutex) the whole time? If this is the only solution given the current memory store design it seems like we should rethink that rather than work around it by making separate interfaces to it. The current design is not very semantically clear and seems like it could lead to bugs pretty easily - it would be best to have a single interface that gives explicit guarantees about what happens when multiple threads call into it at the same time.

edoakes

I understand the problem now, thank you for the detailed explanation!

I'm still not very content with the solution of creating separated providers for the object interface and transport, but you're right that this wasn't introduced my this PR and we shouldn't block it on that. Let's please keep this in mind as we continue developing the core worker.

I left one more comment, otherwise LGTM. Should probably get at least another pair of eyes on it, though.

src/ray/core_worker/transport/direct_actor_transport.cc

raulchen

LGTM. Only left some comments about naming and documentation

src/ray/core_worker/object_interface.cc

src/ray/core_worker/object_interface.h

src/ray/core_worker/test/core_worker_test.cc

src/ray/core_worker/object_interface.cc

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

…lliance/ray into multi-store-providers

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

zhijunfu

updated code. thanks.

src/ray/core_worker/object_interface.cc

zhijunfu · 2019-08-19T10:15:08Z

Thanks for your detailed review also.

Yes going forward let's keep it in mind and try to simplify when possible. I think we're on the same page on this.

On the other perspective, on the design wise, having a self-contained and loosely coupled architecture is also very important, it makes it easier to make code changes, and it helps to reduce unintentional bugs. As the system grows bigger, it's very hard for a single person to know all the details of the system, thus having a loosely coupled design is important as it helps to make sure that making changes inside a module would not requires full understanding of other parts, and would not result in unintended effects/bugs.

There are a few examples in this PR (and the original large PR) that can be viewed differently from different perspectives.

Sometimes it's about to find the right balance between code simplicity and architecture design, I think they are both very important, and it would be good to keep both in mind going forward! :)

I understand the problem now, thank you for the detailed explanation!

I'm still not very content with the solution of creating separated providers for the object interface and transport, but you're right that this wasn't introduced my this PR and we shouldn't block it on that. Let's please keep this in mind as we continue developing the core worker.

I left one more comment, otherwise LGTM. Should probably get at least another pair of eyes on it, though.

AmplabJenkins · 2019-08-19T12:21:37Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16390/
Test PASSed.

AmplabJenkins · 2019-08-19T13:33:28Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16391/
Test PASSed.

src/ray/core_worker/test/core_worker_test.cc

edoakes

Sorry, I should have caught this earlier, but I'm now realizing that we aren't handling duplicate ObjectIDs in Get() or Wait() at all here. Also, given that we're using unordered data structures to keep track of the ObjectIDs, can't the ordering in the result set not match that of the input ids?

I think for Get(), the desired behavior is clear: we should be able to take a list of possibly duplicated IDs and always return the same length list of result objects in the same order as they were passed in. To do this, we should probably de-duplicate them immediately in the top-level Get() call and then assume the lists passed to store providers are unique. As an optimization we could alternatively pass a unordered_map<ObjectID, vector<int>> to the store providers and write results directly to the top-level results vector, but that might get messy.

For Wait(), the desired behavior is much less clear but we should be explicit in our handling of it - we discussed it a bit in person and agree that it's best to raise an exception if a non-unique list is passed in, forcing users to handle it at the application level.

Please add the above changes, document the semantics and assumptions for each method, and add to the object interface test to verify that the desired behavior is met.

zhijunfu · 2019-08-20T03:27:02Z

Please see my reply inline:

Sorry, I should have caught this earlier, but I'm now realizing that we aren't handling duplicate ObjectIDs in Get() or Wait() at all here. Also, given that we're using unordered data structures to keep track of the ObjectIDs, can't the ordering in the result set not match that of the input ids?

The above statement is not true.

Get() and Wait already handles duplicated object ids. And there's no ordering issue as mentioned, because at the end of Get it will loop all the object ids passed in and fill the result vector.

I think for Get(), the desired behavior is clear: we should be able to take a list of possibly duplicated IDs and always return the same length list of result objects in the same order as they were passed in. To do this, we should probably de-duplicate them immediately in the top-level Get() call and then assume the lists passed to store providers are unique.

This is already the case with existing implementation.

As an optimization we could alternatively pass a unordered_map<ObjectID, vector<int>> to the store providers and write results directly to the top-level results vector, but that might get messy.

Yes, this would complicate the code. I'd prefer the first approach.

For Wait(), the desired behavior is much less clear but we should be explicit in our handling of it - we discussed it a bit in person and agree that it's best to raise an exception if a non-unique list is passed in, forcing users to handle it at the application level.

The PR actually handles the duplicates for Wait as well. I do agree the semantics for Wait is not as clear as Get, I think it's fine to fail it with an exception when there are duplicates.

Please add the above changes, document the semantics and assumptions for each method, and add to the object interface test to verify that the desired behavior is met.

As mentioned, Get already uses the suggested approach with proper tests. I can change Wait as suggested above.

edoakes

Ah, I'm sorry about that - I clearly didn't read the final write into results closely enough. The current implementation looks good and given that you already are handling duplicates in Wait() let's leave it as-is (at least for now).

Could you please document the semantics for Get() and Wait() in the header comments though? And maybe a couple of short comments in the code where we de-duplicate and then replace the duplicates.

zhijunfu · 2019-08-20T04:35:22Z

Ah, I'm sorry about that - I clearly didn't read the final write into results closely enough. The current implementation looks good and given that you already are handling duplicates in Wait() let's leave it as-is (at least for now).

Could you please document the semantics for Get() and Wait() in the header comments though? And maybe a couple of short comments in the code where we de-duplicate and then replace the duplicates.

Thanks! Will update with more comments.

…lliance/ray into multi-store-providers

AmplabJenkins · 2019-08-20T15:28:11Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16417/
Test PASSed.

zhijunfu added 2 commits August 14, 2019 19:35

support multiple store providers

d2d6f85

format

9442ef0

zhijunfu requested review from raulchen, pcmoritz and edoakes and removed request for raulchen and pcmoritz August 14, 2019 11:42

edoakes requested changes Aug 14, 2019

View reviewed changes

zhijunfu and others added 6 commits August 15, 2019 14:09

Update src/ray/core_worker/object_interface.cc

a66ee2e

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

Update src/ray/core_worker/object_interface.h

901b21a

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

Update src/ray/core_worker/store_provider/store_provider.h

6111b08

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

resolve comments

a086140

format

fa39a15

Merge branch 'multi-store-providers' of https://github.com/ant-tech-a…

dd739cd

…lliance/ray into multi-store-providers

edoakes requested changes Aug 15, 2019

View reviewed changes

resolve comments

2ff729c

zhijunfu added 2 commits August 16, 2019 16:02

update per comments

5cb4043

format

3a8e810

fix

fc7474d

update

6a3cc99

edoakes approved these changes Aug 19, 2019

View reviewed changes

src/ray/core_worker/transport/direct_actor_transport.cc Outdated Show resolved Hide resolved

raulchen approved these changes Aug 19, 2019

View reviewed changes

zhijunfu and others added 6 commits August 19, 2019 17:11

Update src/ray/core_worker/object_interface.cc

47b2181

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

Update src/ray/core_worker/object_interface.h

b9d7050

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

comments

b4ccc23

Merge branch 'multi-store-providers' of https://github.com/ant-tech-a…

b5489b0

…lliance/ray into multi-store-providers

format

a732b63

Update src/ray/core_worker/test/core_worker_test.cc

3c5cc1b

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

zhijunfu commented Aug 19, 2019

View reviewed changes

src/ray/core_worker/object_interface.cc Outdated Show resolved Hide resolved

pcmoritz reviewed Aug 19, 2019

View reviewed changes

src/ray/core_worker/test/core_worker_test.cc Show resolved Hide resolved

edoakes requested changes Aug 19, 2019

View reviewed changes

edoakes approved these changes Aug 20, 2019

View reviewed changes

zhijunfu added 3 commits August 20, 2019 19:38

add comments

fd43d4d

format

bd3b57e

Merge branch 'multi-store-providers' of https://github.com/ant-tech-a…

8e8483f

…lliance/ray into multi-store-providers

raulchen merged commit eab5957 into ray-project:master Aug 21, 2019

raulchen deleted the multi-store-providers branch August 21, 2019 03:16

kfstorm mentioned this pull request Aug 27, 2019

[Java] ObjectID::fromRandom sets proper flags #5548

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support multiple store providers in ObjectInterface #5452

Support multiple store providers in ObjectInterface #5452

zhijunfu commented Aug 14, 2019

AmplabJenkins commented Aug 14, 2019

edoakes left a comment

zhijunfu commented Aug 15, 2019

AmplabJenkins commented Aug 15, 2019

edoakes commented Aug 15, 2019

zhijunfu commented Aug 16, 2019

AmplabJenkins commented Aug 16, 2019

AmplabJenkins commented Aug 16, 2019

AmplabJenkins commented Aug 16, 2019

zhijunfu commented Aug 17, 2019

AmplabJenkins commented Aug 17, 2019

edoakes commented Aug 17, 2019

zhijunfu commented Aug 18, 2019

edoakes commented Aug 18, 2019

zhijunfu commented Aug 18, 2019

edoakes left a comment

raulchen left a comment

zhijunfu left a comment

zhijunfu commented Aug 19, 2019 •

edited

Loading

AmplabJenkins commented Aug 19, 2019

AmplabJenkins commented Aug 19, 2019

edoakes left a comment

zhijunfu commented Aug 20, 2019

edoakes left a comment

zhijunfu commented Aug 20, 2019

AmplabJenkins commented Aug 20, 2019

Support multiple store providers in ObjectInterface #5452

Support multiple store providers in ObjectInterface #5452

Conversation

zhijunfu commented Aug 14, 2019

What do these changes do?

Related issue number

Linter

AmplabJenkins commented Aug 14, 2019

edoakes left a comment

Choose a reason for hiding this comment

zhijunfu commented Aug 15, 2019

AmplabJenkins commented Aug 15, 2019

edoakes commented Aug 15, 2019

zhijunfu commented Aug 16, 2019

AmplabJenkins commented Aug 16, 2019

AmplabJenkins commented Aug 16, 2019

AmplabJenkins commented Aug 16, 2019

zhijunfu commented Aug 17, 2019

AmplabJenkins commented Aug 17, 2019

edoakes commented Aug 17, 2019

zhijunfu commented Aug 18, 2019

edoakes commented Aug 18, 2019

zhijunfu commented Aug 18, 2019

edoakes left a comment

Choose a reason for hiding this comment

raulchen left a comment

Choose a reason for hiding this comment

zhijunfu left a comment

Choose a reason for hiding this comment

zhijunfu commented Aug 19, 2019 • edited Loading

AmplabJenkins commented Aug 19, 2019

AmplabJenkins commented Aug 19, 2019

edoakes left a comment

Choose a reason for hiding this comment

zhijunfu commented Aug 20, 2019

edoakes left a comment

Choose a reason for hiding this comment

zhijunfu commented Aug 20, 2019

AmplabJenkins commented Aug 20, 2019

zhijunfu commented Aug 19, 2019 •

edited

Loading