-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ci][release][core] rewrite RuntimeEnvAgentClient with reusable TCP connection, also test_many_runtime_envs.py with env vars #38772
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
…fix-runtime-env-release-test Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
I will review the PR after release blockers are cleaned up. Is this okay? |
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
…ease-test Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
ETA review tmrw! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before I dive deep into a review, I wonder if we can find a simpler approach (than implementing keepalive again). Please note that I am not against the current approach (but I want to make sure we choose the simplest solution here).
Since having 3000 concurrent requests already is pretty abnormal, I wonder why not we can just rate limit the # of concurrent requests (e.g., 1000) from the server and fail or retry from the runtime env agent side with exponential backoff? Do you think this approach could have some sort of limitation?
@@ -211,6 +211,7 @@ def parent_dead_callback(): | |||
host=args.node_ip_address, | |||
port=args.runtime_env_agent_port, | |||
loop=loop, | |||
keepalive_timeout=None, # we want to always keep alive. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm actually I am confused with this comment. If keepalive_timeout = None, does that mean keepalive will never timeout, meaning it can actually never detect the client side failures?
Omits the last chars except for the last char which should be "}". | ||
""" | ||
if len(dict_str) > 120: | ||
return dict_str[:116] + "...}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add an env var to enable full string?
@@ -498,11 +521,15 @@ async def _create_runtime_env_with_retry( | |||
) | |||
|
|||
async def DeleteRuntimeEnvIfPossible(self, request): | |||
|
|||
runtime_env_truncated = truncate_dict_str(request.serialized_runtime_env) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm actuallly I wonder if it is possible to make it a part of __str__
implementation instead? Seems a bit fragile (i.e., if you use this dict by mistake for functional purpose, that will be a huge issue & if you add new logs, it is easy to miss)
// Will call callback exactly once with pair{non-ok, any} or pair{ok, reply body}. | ||
// Will call either succ_callback or fail_callback exactly once. | ||
// | ||
// Keep-alive and reuses TCP connection, but only handle 1 request at a time (no |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this causing any performance related issues? E.g., if one of runtime env takes long time to create, everything else will be delayed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, is it correct this problem is handled by session pool?
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
|
Why are these changes needed?
Also, we had a test_many_runtime_envs release test that creates lots of working_dirs and run tasks. However it did not work because each time we updated files in the dir and use it as working_dir, but the path stays the same so Ray thinks it's the same runtime env. Changed it to test on env vars.
Also made summarized runtime envs in the logs to reduce log size.
Related issue number
Fixes #38662.