Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[xray] Put GCS data into the redis data shard #2298

Merged
merged 33 commits into from
Jul 1, 2018

Conversation

pcmoritz
Copy link
Contributor

@pcmoritz pcmoritz commented Jun 25, 2018

This moves the data for the new GCS into a separate data shard. It also changes the RAY_NEW_GCS codepath to use XRay (the codepath was only there for development until XRay is ready).

freeReplyObject(reply);
usleep(RayConfig::instance().redis_db_connect_wait_milliseconds() * 1000);
num_attempts++;
continue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove continue; it is useless here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6239/
Test FAILed.

while (num_attempts < RayConfig::instance().redis_db_connect_retries()) {
// Try to read the number of Redis shards from the primary shard. If the
// entry is present, exit.
reply = reinterpret_cast<redisReply *>(redisCommand(context, "GET NumRedisShards"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a function std::unique_ptr< redisReply, std::function<void(redisReply*)>> ExecutionCommand(context, const char* command) {
return std::unique_ptr< redisReply, std::function<void(redisReply*)>>(reinterpret_cast<redisReply *>(redisCommand(context, command)), [](redisReply *reply) {freeReplyObject(reply);});
}
to avoid manually call freeReplyObject(reply);

@@ -133,6 +135,71 @@ RedisContext::~RedisContext() {
}
}

void GetRedisShards(redisContext *context, std::vector<std::string> *addresses,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static void GetRedisShards to avoid polluted global namespace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6242/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6236/
Test PASSed.

@@ -37,7 +37,7 @@ class RAY_EXPORT AsyncGcsClient {
/// \param address The GCS IP address.
/// \param port The GCS port.
/// \return Status.
Status Connect(const std::string &address, int port);
Status Connect(const std::string &address, int port, bool sharding);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

document the argument

@@ -133,7 +135,71 @@ RedisContext::~RedisContext() {
}
}

Status RedisContext::Connect(const std::string &address, int port) {
static void GetRedisShards(redisContext *context, std::vector<std::string> *addresses,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some overlap with #2281. @heyucongtom can you take a look?

Copy link
Contributor

@heyucongtom heyucongtom Jun 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are both doing a translation (from old ray get_redis_shards to new) here. I think the major difference is that
[+] #2298 add an use_sharding bool to indicate if we are using sharding.
[-] #2281 defaultly using all shards.
[-] #2281 modifies table.cc so that when tables do the work, they randomly select a redis shard by the ID.
[-] #2281 attaches all contexts to event loop, while only connecting the main context (I guess it may be wrong: since tests are giving timeouts.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same thing is: we are writing the exactly same thing in getRedisShards function

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds right. This PR is actually really doing sharding. It's just creating a single extra shard (as opposed to an arbitrary number of additional shards).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I am guessing that there is some overlap(having sent out a msg on Slack on this), in which #2281 focuses more on distributing workload for table’s operations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be little or no conflicts, though.

@@ -133,7 +135,71 @@ RedisContext::~RedisContext() {
}
}

Status RedisContext::Connect(const std::string &address, int port) {
static void GetRedisShards(redisContext *context, std::vector<std::string> *addresses,
std::vector<int> *ports) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaner to pass in as std::vector<int> &ports I think.

std::vector<int> *ports) {
// Get the total number of Redis shards in the system.
int num_attempts = 0;
redisReply *reply = NULL;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nullptr

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6254/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6255/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6257/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6267/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6278/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6281/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6279/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6303/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6359/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6361/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6362/
Test FAILed.

@concretevitamin
Copy link
Contributor

In the latest travis build for this PR, Build #751, I do not see the quoted error. Just restarted that build let's see what's going on.

@@ -142,8 +142,9 @@ GlobalSchedulerState *GlobalSchedulerState_init(event_loop *loop,
db_attach(state->db, loop, false);

RAY_CHECK_OK(state->gcs_client.Connect(std::string(redis_primary_addr),
redis_primary_port));
redis_primary_port, true));
Copy link
Contributor

@concretevitamin concretevitamin Jun 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In all C++ files in this PR, please use /*sharding=*/true (or false, correspondingly).

std::shared_ptr<RedisContext> context_;
std::unique_ptr<RedisAsioClient> asio_async_client_;
std::unique_ptr<RedisAsioClient> asio_subscribe_client_;

// The following context writes everything to the primary shard
std::shared_ptr<RedisContext> auxiliary_context_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to call this auxiliary instead of primary?

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6382/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6401/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6402/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6403/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6404/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6405/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6406/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6408/
Test PASSed.

@robertnishihara robertnishihara merged commit 762bdf6 into ray-project:master Jul 1, 2018
@robertnishihara robertnishihara deleted the xray-data-shard branch July 1, 2018 01:42
robertnishihara pushed a commit that referenced this pull request Aug 31, 2018
Basically a re-implementation of #2281, with modifications of #2298 (A fix of #2334, for rebasing issues.).
[+] Implement sharding for gcs tables.
[+] Keep ClientTable and ErrorTable managed by the primary_shard. TaskTable is managed by the primary_shard for now, until a good hashing for tasks is implemented.
[+] Move AsyncGcsClient's initialization into Connect function.
[-] Move GetRedisShard and bool sharding from RedisContext's connect into AsyncGcsClient. This may make the interface cleaner.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants