Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ci][release][core] rewrite RuntimeEnvAgentClient with reusable TCP connection, also test_many_runtime_envs.py with env vars #38772

Open
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

rynewang
Copy link
Contributor

@rynewang rynewang commented Aug 23, 2023

Why are these changes needed?

  • What? if you concurrently create too many (~3000) rt envs on a same node, it fails.
  • Regression? yes, it IS regression. It comes from the new boost beast based http client.
  • Release blocker? no, it's not a regular use case, we are ok as long as we don't burst 1000s of rt envs in a same time.
  • Why? I used boost beast to implement a simple http client, but it's too simple - it makes 1 tcp connection per request. And if you ask for thousands of concurrent connections, OS will refuse to assign more ports to the client and it fails.
  • Next step? I rewrote the HTTP client to reuse the connection.

Also, we had a test_many_runtime_envs release test that creates lots of working_dirs and run tasks. However it did not work because each time we updated files in the dir and use it as working_dir, but the path stays the same so Ray thinks it's the same runtime env. Changed it to test on env vars.

Also made summarized runtime envs in the logs to reduce log size.

Related issue number

Fixes #38662.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
rynewang and others added 9 commits August 23, 2023 14:22
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
…fix-runtime-env-release-test

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang rynewang changed the title [ci][release][core] rewrite test_many_runtime_envs.py with env vars [ci][release][core] rewrite RuntimeEnvAgentClient with reusable TCP connection, also test_many_runtime_envs.py with env vars Sep 1, 2023
@rkooo567
Copy link
Contributor

rkooo567 commented Sep 1, 2023

I will review the PR after release blockers are cleaned up. Is this okay?

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
…ease-test

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rkooo567
Copy link
Contributor

rkooo567 commented Oct 4, 2023

ETA review tmrw!

Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before I dive deep into a review, I wonder if we can find a simpler approach (than implementing keepalive again). Please note that I am not against the current approach (but I want to make sure we choose the simplest solution here).

Since having 3000 concurrent requests already is pretty abnormal, I wonder why not we can just rate limit the # of concurrent requests (e.g., 1000) from the server and fail or retry from the runtime env agent side with exponential backoff? Do you think this approach could have some sort of limitation?

@@ -211,6 +211,7 @@ def parent_dead_callback():
host=args.node_ip_address,
port=args.runtime_env_agent_port,
loop=loop,
keepalive_timeout=None, # we want to always keep alive.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm actually I am confused with this comment. If keepalive_timeout = None, does that mean keepalive will never timeout, meaning it can actually never detect the client side failures?

Omits the last chars except for the last char which should be "}".
"""
if len(dict_str) > 120:
return dict_str[:116] + "...}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add an env var to enable full string?

@@ -498,11 +521,15 @@ async def _create_runtime_env_with_retry(
)

async def DeleteRuntimeEnvIfPossible(self, request):

runtime_env_truncated = truncate_dict_str(request.serialized_runtime_env)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm actuallly I wonder if it is possible to make it a part of __str__ implementation instead? Seems a bit fragile (i.e., if you use this dict by mistake for functional purpose, that will be a huge issue & if you add new logs, it is easy to miss)

// Will call callback exactly once with pair{non-ok, any} or pair{ok, reply body}.
// Will call either succ_callback or fail_callback exactly once.
//
// Keep-alive and reuses TCP connection, but only handle 1 request at a time (no
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this causing any performance related issues? E.g., if one of runtime env takes long time to create, everything else will be delayed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, is it correct this problem is handled by session pool?

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 10, 2023
Copy link

stale bot commented Dec 15, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Dec 15, 2023
@jjyao jjyao removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Apr 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Release test stress_test_many_runtime_envs.aws failed
6 participants