-
Notifications
You must be signed in to change notification settings - Fork 5.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GCS]Do not save task spec in GCS actor table(Part1) #11592
Conversation
Is it ready for review? Can you resolve the merge conflict and remove the tag in that case? |
Thanks, i will resolve the merge conflict. I find that we can't remove task_spec directly because the new dashboard depends on it. |
What are values that the dashboard depends on? We can probably extract them when creating an Actor class in gcs_actor_manager.cc. I am sure only a few fields are used for the dashboard, and it is probably waste to keep it just for the dashboard. |
18da7de
to
b625dd6
Compare
def actor_classname_from_task_spec(task_spec):
return task_spec.get("functionDescriptor", {})\
.get("pythonFunctionDescriptor", {})\
.get("className", "Unknown actor class") We only use these 3 fields, so keeping the whole task spec would be wasteful? Maybe we can create a "metadata" field which contains data that is required by the dashboard (for now just actor class name). |
But cc @mfitton just in case |
@rkooo567 @ffbin Sang is right that we only need those fields from the task spec, but we pull that data from the If this data is removed from the reply, we will no longer be able to show users information about pending and infeasible actors within the logical view. Correct me if I'm wrong because I only glanced through the PR, but it doesn't seem like this affects the data stored in the local task queue per node at all, so it shouldn't actually affect the dashboard. |
Thank you for your detailed explains :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM in general. Just a couple comments.
Btw, does it mean after this PR is merged actor restart or gcs restart will still work? Or should we wait until 3 PRs are merged?
After this PR is merged, gcs restart will not work until 3 PRs are merged, thanks. |
@ffbin Can you add skip for gcs server restart tests? |
Oh, i will add skip for gcs server restart tests, thanks. |
c0a39b7
to
dd89730
Compare
de2dd35
to
59000a6
Compare
Hi @rkooo567 , Can you help review this PR, thanks :) |
@@ -47,6 +47,7 @@ def test_gcs_server_restart(ray_start_regular): | |||
assert result == 2 | |||
|
|||
|
|||
@pytest.mark.skip("This test does not work yet.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops nvm. I re-read the description.
03541e8
to
78e1abe
Compare
cc @mfitton |
Hey @ffbin do you have a plan to merge this soon? |
Yeah, this will bring great benefits to the performance. I will see how to solve the problem of dashboard. |
78e1abe
to
46216bb
Compare
46216bb
to
508c334
Compare
bc6db6e
to
635eced
Compare
cc @ffbin any status on this PR? |
@rkooo567 Sorry, I haven't invested in this PR recently. I will ask @clay4444 to help me finish it. |
cc @ffbin no plan to work on this further? Should I close it? |
@clay4444 Could you update the status? |
|
I'll pick this up. |
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
|
|
This is a rebase version of #11592. As task spec info is only needed when gcs create or start an actor, so we can remove it from actor table and save the serialization time and memory/network cost when gcs clients get actor infos from gcs. As internal repository varies very much from the community. This pr just add some manual check with simple cherry pick. Welcome to comment first and at the meantime I'll see if there's any test case failed or some points were missed.
Why are these changes needed?
In the 200 node cluster test, we found that after GCS restart, it takes a long time for GCS client to fetch actor information from GCS server, because the task spec is large and the serialization time is long. And GCS will mistakenly think that the node heartbeat timeout.
Currently, task spec is only required when GCS creates or restarts an actor. Publishing an actor and getting an actor/actors do not need to return a task spec, so we store task spec in actor task spec table separately.
This feature is split into three PR:
Currently, it is the first PR, and the second PR supports synchronous acquisition of actor task spec data. The third PR deals with the recovery process after GCS restart.
Related issue number
Checks
scripts/format.sh
to lint the changes in this PR.