Skip to content

Conversation

heyucongtom
Copy link
Contributor

What do these changes do?

Basically a re-implementation of #2281, with modifications of #2298 (A fix of #2334, for rebasing issues.).
[+] Implement sharding for gcs tables.
[+] Keep ClientTable and ErrorTable managed by the primary_shard. TaskTable is managed by the primary_shard for now, until a good hashing for tasks is implemented.
[+] Move AsyncGcsClient's initialization into Connect function.
[-] Move GetRedisShard and bool sharding from RedisContext's connect into AsyncGcsClient. This may make the interface cleaner.

Related issue number

@heyucongtom heyucongtom changed the title Gcs sharding 2 [WIP] Implementing Gcs sharding after putting Gcs data into redis data shard Jul 17, 2018
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6672/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6673/
Test PASSed.

@heyucongtom
Copy link
Contributor Author

@robertnishihara
"bash ../src/plasma/test/run_tests.sh" this test seems to be failing.
Also I need some help to check the sharding scheme for request-notification & subscribeAsync. Still on the way of reading through the codebase.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6695/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6696/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6706/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6707/
Test PASSed.

Yucong He and others added 2 commits July 19, 2018 22:41
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6709/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6710/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6725/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6727/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6729/
Test FAILed.

@robertnishihara
Copy link
Collaborator

In this PR, you should get rid of this if statement

ray/test/runtest.py

Lines 1157 to 1159 in 2a3b026

if os.environ.get("RAY_USE_XRAY") == "1":
print("XRAY currently supports only a single Redis shard.")
kwargs["num_redis_shards"] = 1

so that the sharded case will be tested more.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6754/
Test FAILed.

static void GetRedisShards(redisContext *context, std::vector<std::string> *addresses,
std::vector<int> *ports) {
static void GetRedisShards(redisContext *context, std::vector<std::string>& addresses,
std::vector<int>& ports) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the nit, but to keep the style consistent, we usually put the & with the variable name, so &addresses and &ports instead of & addresses and & ports.

Copy link
Contributor

@stephanie-wang stephanie-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Left one small comment, but otherwise looks great :)

@heyucongtom
Copy link
Contributor Author

heyucongtom commented Aug 22, 2018 via email

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7677/
Test PASSed.

@heyucongtom
Copy link
Contributor Author

@robertnishihara @stephanie-wang There's some error where "GET NumRedisShards" is getting a nil reply after I move the things into the constructor. My guess is that this command is not properly setup when the constructor is called. Please feel free to take a look at it

/home/travis/build/ray-project/ray/src/ray/gcs/client.cc:23 Check failed: num_attempts < RayConfig::instance().redis_db_connect_retries() No entry found for NumRedisShards

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7765/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7769/
Test PASSed.

// Used since changing the API of Connect is breaking some tests.
// Would be removed soon in the future.
const bool DUMMY_BOOL = true;
RAY_CHECK_OK(primary_context_->Connect(address, port, DUMMY_BOOL));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we just pass in true instead of defining DUMMY_BOOL. Maybe something like pass in true /* sharding */.

I agree it would be best to just have a default argument, but it's ok to do that separately.

}

// Populate shard_contexts.
for (unsigned int i = 0; i < addresses.size(); ++i) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We essentially don't use unsigned int anywhere in the codebase. Can you just use int here? This also applies below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep


// Populate shard_contexts.
for (unsigned int i = 0; i < addresses.size(); ++i) {
// Slower than emplace but resource safe.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this comment is helpful. This piece of code is definitely not performance critical. Maybe remove the comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep

shard_contexts_.push_back(std::make_shared<RedisContext>());
}

// Call connect for all contexts. Safe to do many times.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's safe to call Connect multiple times? Why is that?

}

// Call connect for all contexts. Safe to do many times.
// Here shard_contexts_.size() == addresses.size();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider making this comment a RAY_CHECK(shard_contexts_.size() == addresses.size()).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7865/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7864/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7867/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7903/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7914/
Test FAILed.

@robertnishihara robertnishihara changed the title [WIP] Implementing Gcs sharding after putting Gcs data into redis data shard Implementing Gcs sharding after putting Gcs data into redis data shard Aug 31, 2018
@robertnishihara robertnishihara changed the title Implementing Gcs sharding after putting Gcs data into redis data shard [xray] Implementing Gcs sharding Aug 31, 2018
@robertnishihara robertnishihara merged commit 5b45f0b into ray-project:master Aug 31, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants