-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Serve] Avoid looping over all snapshot ids for each long poll request #45881
[Serve] Avoid looping over all snapshot ids for each long poll request #45881
Conversation
Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
except KeyError: | ||
# Initial snapshot id must be >= 0, so that the long poll client | ||
# can send a negative initial snapshot id to get a fast update. | ||
self.snapshot_ids[object_key] = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was previously a random number https://github.com/ray-project/ray/pull/45881/files#diff-f138b21f7ddcd7d61c0b2704c8b828b9bbe7eb5021531e2c7fabeb20ec322e1aL195 but I couldn't figure out why - I can of course use a random number here but this seems simpler if that's not necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shrekris-anyscale do you know why this was a random number / is that necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We set it to a random number to handle an edge case when the controller crashes and recovers. Suppose the controller always starts with ID 0, and it's currently at ID 1. Suppose all the clients are also at ID 1. Now suppose:
- The
ServeController
(which runs theLongPollHost
) crashes and restarts. TheLongPollHost
's snapshot IDs are reset to 0. - All clients– except 1 slow client– reconnect to the
LongPollHost
on theServeController
. Their snapshot IDs are still 1, so the controller propagates an update, and all the connected clients' snapshot IDs are set back to 0. - One of the connected clients sends an update to the
LongPollHost
usingnotify_changed
. The controller bumps the snapshot ID to 1 and updates the connected clients. - The slow client finally connects to the controller. Its ID is also 1, so it doesn't receive the update.
To correctly avoid this edge case, we should cache the snapshot IDs and restore them when the controller recovers. That's pretty complex though, so instead we initialize the snapshot IDs to a random number between 0 and 1,000,000. That makes this edge case very unlikely.
Could we switch back to using a default dict with the random number generator as the factory function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! That totally makes sense, and I will restore that behavior.
Could we switch back to using a default dict with the random number generator as the factory function?
Let me play around with this, I might be able to make it work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I just remembered why I changed it do a plain dict - the problem is that snapshot_ids
is a defaultdict, but object_snapshots
isn't, so I was trying to get items from one that didn't exist in the other. Now I'm wondering if those two mappings can be combined, since they have the same keys 🤔
self.snapshot_ids[object_key] += 1 | ||
try: | ||
self.snapshot_ids[object_key] += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a defaultdict
anymore :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for contributing! I left some comments.
try: | ||
existing_id = self.snapshot_ids[key] | ||
except KeyError: | ||
# The caller may ask for keys that we don't know about (yet), | ||
# just ignore them. | ||
continue | ||
|
||
if existing_id != snapshot_id: | ||
updated_objects[key] = UpdatedObject( | ||
self.object_snapshots[key], existing_id | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can avoid raising and catching an error by using a conditional block.
try: | |
existing_id = self.snapshot_ids[key] | |
except KeyError: | |
# The caller may ask for keys that we don't know about (yet), | |
# just ignore them. | |
continue | |
if existing_id != snapshot_id: | |
updated_objects[key] = UpdatedObject( | |
self.object_snapshots[key], existing_id | |
) | |
# The caller may ask for keys that we don't know about (yet), | |
# just ignore them. | |
if key in self.snapshot_ids: | |
latest_id = self.snapshot_ids[key] | |
if latest_id != snapshot_id: | |
updated_objects[key] = UpdatedObject( | |
self.object_snapshots[key], latest_id | |
) |
except KeyError: | ||
# Initial snapshot id must be >= 0, so that the long poll client | ||
# can send a negative initial snapshot id to get a fast update. | ||
self.snapshot_ids[object_key] = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We set it to a random number to handle an edge case when the controller crashes and recovers. Suppose the controller always starts with ID 0, and it's currently at ID 1. Suppose all the clients are also at ID 1. Now suppose:
- The
ServeController
(which runs theLongPollHost
) crashes and restarts. TheLongPollHost
's snapshot IDs are reset to 0. - All clients– except 1 slow client– reconnect to the
LongPollHost
on theServeController
. Their snapshot IDs are still 1, so the controller propagates an update, and all the connected clients' snapshot IDs are set back to 0. - One of the connected clients sends an update to the
LongPollHost
usingnotify_changed
. The controller bumps the snapshot ID to 1 and updates the connected clients. - The slow client finally connects to the controller. Its ID is also 1, so it doesn't receive the update.
To correctly avoid this edge case, we should cache the snapshot IDs and restore them when the controller recovers. That's pretty complex though, so instead we initialize the snapshot IDs to a random number between 0 and 1,000,000. That makes this edge case very unlikely.
Could we switch back to using a default dict with the random number generator as the factory function?
Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com> Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
# They should also be randomized to try to avoid situations where, | ||
# if the controller restarts and a client has a now-invalid snapshot id | ||
# that happens to match what the controller restarts with, | ||
# the client wouldn't receive an update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of explaining the edge case here, could you link my comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
# if the controller restarts and a client has a now-invalid snapshot id | ||
# that happens to match what the controller restarts with, | ||
# the client wouldn't receive an update. | ||
self.snapshot_ids[object_key] = random.randint(0, 1_000_000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that this isn't a defaultdict
anymore, could you double check whether any other places we access self.snapshot_ids
needs this KeyError
check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see four uses:
The first two are in
ray/python/ray/serve/_private/long_poll.py
Lines 411 to 418 in fd37d93
try: | |
self.snapshot_ids[object_key] += 1 | |
except KeyError: | |
# Initial snapshot id must be >= 0, so that the long poll client | |
# can send a negative initial snapshot id to get a fast update. | |
# They should also be randomized to try to avoid situations where, | |
# if the controller restarts and a client has a now-invalid snapshot id | |
# that happens to match what the controller restarts with, |
KeyError
check.
The next is
ray/python/ray/serve/_private/long_poll.py
Lines 256 to 264 in fd37d93
try: | |
existing_id = self.snapshot_ids[key] | |
except KeyError: | |
# The caller may ask for keys that we don't know about (yet), | |
# just ignore them. | |
# This can happen when, for example, | |
# a deployment handle is manually created for an app | |
# that hasn't been deployed yet (by bypassing the safety checks). | |
continue |
KeyError
check.
The last is
ray/python/ray/serve/_private/long_poll.py
Lines 308 to 317 in fd37d93
else: | |
updated_object_key: str = async_task_to_watched_keys[done.pop()] | |
updated_object = { | |
updated_object_key: UpdatedObject( | |
self.object_snapshots[updated_object_key], | |
self.snapshot_ids[updated_object_key], | |
) | |
} | |
self._count_send(updated_object) | |
return updated_object |
ray/python/ray/serve/_private/long_poll.py
Lines 422 to 423 in fd37d93
logger.debug(f"LongPollHost: Notify change for key {object_key}.") | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks for being thorough here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My pleasure!
Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Premerge is failing. Could you merge master
and retry?
Hmmm, I'm seeing the same failure on my unrelated PR #45939 - looks like maybe CI is busted? |
I'm also seeing a similar failure on |
We're working on a fix here: #46361. |
@shrekris-anyscale looks like CI is passing again - mind taking another look at this PR? Thanks in advance! |
Thanks! |
Why are these changes needed?
See #45872 for background - TLDR we are trying to improve the scalability of the Serve Controller and I'm hunting down optimizations in the controller code.
In this case, it looks like there's an expensive operation happening on each long poll client request - see comments for the details.
Before, 1800 apps with one deployment each and maybe ~5k handles in the system, probably more:
After, 1800 apps with one deployment each, averaging ~7.5k active
listen_for_change
tasks on the controller, with both #45872 and #45878 as well:Related issue number
Discovered while looking at Serve Controller flamegraphs for #45777
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.