Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix LRU eviction of client notification datastructure #4021

Merged
merged 16 commits into from
Feb 14, 2019

Conversation

pcmoritz
Copy link
Contributor

This should fix #3997

One question that is still open is if this datastructure is cleaned up properly (both if the wait exits and also if clients die that are blocked in a wait).

@@ -39,6 +39,9 @@ extern RedisChainModule module;
} \
}

/// Map from pub sub channel to clients that are waiting on that channel.
std::unordered_map<std::string, std::vector<std::string>> notification_map;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add the size of this to the debug string somehow?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could alternatively print a warning if this map exceeds certain size bounds, e.g. 1000, 10k, 100k.

Copy link
Contributor Author

@pcmoritz pcmoritz Feb 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, I added a redis command that can query the DEBUG_STRING

@ericl
Copy link
Contributor

ericl commented Feb 11, 2019

Could you also add

import ray

@ray.remote
def f():
    return 0

@ray.remote
def g():
    import time
    start = time.time()
    while time.time() < start + 1:
        ray.get([f.remote() for _ in range(10)])

# 10MB -> hangs after ~5 iterations
# 20MB -> hangs after ~20 iterations
# 50MB -> hangs after ~50 iterations
ray.init(redis_max_memory=1024 * 1024 * 10)

i = 0
for i in range(50):
    i += 1
    a = g.remote()
    [ok], _ = ray.wait([a])
    print("iter", i)

to the jenkins tests? This will hang-fail before this fix.

@robertnishihara
Copy link
Collaborator

@pcmoritz @ericl how long does that take to run assuming it's successful? The Jenkins tests take quite a long time as it is.

@ericl
Copy link
Contributor

ericl commented Feb 12, 2019

It takes about 30 seconds.

RedisModuleString *channel;
RAY_RETURN_NOT_OK(FormatPubsubChannel(&channel, ctx, pubsub_channel_str, keyname));
RAY_CHECK_OK(FormatPubsubChannel(&channel, ctx, pubsub_channel_str, keyname));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@guoyuhong removed all RAY_CHECK calls in #3855. So let's keep it that way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea, done

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11800/
Test FAILed.

Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11805/
Test FAILed.

// therefore we construct a temporary redis string here, which
// will be garbage collected by redis.
auto channel = RedisModule_CreateString(
ctx, client_channel.data(), client_channel.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pcmoritz Can we delete this temporary redis string immediately after PUBLISH command finishes? In our system we do not enable redis LRU by default. That may cause redis memory leak.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you talking about LRU eviction or about automatic memory management? We shouldn’t have to free strings or close keys because redis automatic memory management takes care of that. If you do need to free strings then we you would need to do it in many places.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I thought it was a key... It is just a Redis String, which is fine.

@guoyuhong
Copy link
Contributor

One more question, what is the typical request key number in the key value of the map std::unordered_map<std::string, std::vector<std::string>> notification_map;. If the number is not big, it is fine. Otherwise, std::remove may be very slow for a vector, since we often call CancelNotifications in task table and object table.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11807/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11808/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11809/
Test FAILed.

@ericl
Copy link
Contributor

ericl commented Feb 12, 2019

jenkins retest this please

if (reply == NULL) {
return RedisModule_ReplyWithError(ctx, "error during PUBLISH");
}
}
notification_map.erase(it);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this line? You said it was a bug, but shouldn’t we only update this when a client calls subscribe or unsubscribe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a fluke, I just had forgotten how to erase from a c++ vector properly

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11826/
Test FAILed.

@ericl
Copy link
Contributor

ericl commented Feb 12, 2019

jenkins retest this please

Also some lint errors...

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11836/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11842/
Test FAILed.

@pcmoritz
Copy link
Contributor Author

pcmoritz commented Feb 13, 2019

@guoyuhong The number of elements in the vector will be fairly small (= the number of clients subscribed to that key), probably < 100 in most cases. I expect vectors to be faster in this case than other datastructures.

Copy link
Contributor

@guoyuhong guoyuhong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pcmoritz Thanks for the confirmation.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11851/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11852/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11858/
Test FAILed.

@ericl
Copy link
Contributor

ericl commented Feb 13, 2019

jenkins retest this please

@ericl
Copy link
Contributor

ericl commented Feb 13, 2019

OOM command not allowed when used memory > 'maxmemory'.

Odd this happens when the test is initializing, perhaps 10MB is too small on the jenkins environment.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11870/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11862/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11863/
Test FAILed.

@pcmoritz
Copy link
Contributor Author

Jenkins retest this please

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11891/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11908/
Test FAILed.

@pcmoritz
Copy link
Contributor Author

alright, looks like Jenkins is finally happy (the multinode_test script might not pass through the flag to redis, it's a little strange but also not the intended use case...)

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11921/
Test FAILed.

@ericl
Copy link
Contributor

ericl commented Feb 14, 2019

Ok, no luck with jenkins. I'm going to merge this and we can figure out the test issue separately.

@ericl ericl merged commit 810cc17 into ray-project:master Feb 14, 2019
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11933/
Test FAILed.

@robertnishihara robertnishihara deleted the fix-wait branch February 15, 2019 20:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Redis LRU eviction causes hangs with ray.wait() (but get() is fine)
5 participants