Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GCS]Do not save task spec in GCS actor table(Part1) #11592

Closed
wants to merge 11 commits into from

Conversation

ffbin
Copy link
Contributor

@ffbin ffbin commented Oct 24, 2020

Why are these changes needed?

In the 200 node cluster test, we found that after GCS restart, it takes a long time for GCS client to fetch actor information from GCS server, because the task spec is large and the serialization time is long. And GCS will mistakenly think that the node heartbeat timeout.
Currently, task spec is only required when GCS creates or restarts an actor. Publishing an actor and getting an actor/actors do not need to return a task spec, so we store task spec in actor task spec table separately.
This feature is split into three PR:
Currently, it is the first PR, and the second PR supports synchronous acquisition of actor task spec data. The third PR deals with the recovery process after GCS restart.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ffbin ffbin requested a review from rkooo567 October 24, 2020 09:02
@ffbin ffbin added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 25, 2020
@ffbin ffbin changed the title [GCS]Add actor task_spec table(Part1) [GCS]Separate actor task_spec from actor table(Part1) Oct 25, 2020
@rkooo567
Copy link
Contributor

Is it ready for review? Can you resolve the merge conflict and remove the tag in that case?

@ffbin
Copy link
Contributor Author

ffbin commented Nov 11, 2020

Is it ready for review? Can you resolve the merge conflict and remove the tag in that case?

Thanks, i will resolve the merge conflict. I find that we can't remove task_spec directly because the new dashboard depends on it.

@rkooo567
Copy link
Contributor

What are values that the dashboard depends on? We can probably extract them when creating an Actor class in gcs_actor_manager.cc. I am sure only a few fields are used for the dashboard, and it is probably waste to keep it just for the dashboard.

@ffbin ffbin changed the title [GCS]Separate actor task_spec from actor table(Part1) [GCS]Do not save task spec in GCS actor table(Part1) Nov 11, 2020
@ffbin
Copy link
Contributor Author

ffbin commented Nov 11, 2020

What are values that the dashboard depends on? We can probably extract them when creating an Actor class in gcs_actor_manager.cc. I am sure only a few fields are used for the dashboard, and it is probably waste to keep it just for the dashboard.

image
Does removing task_spec affect get_actor_creation_tasks function?

@ffbin ffbin removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 11, 2020
@rkooo567
Copy link
Contributor

rkooo567 commented Nov 11, 2020

@ffbin

def actor_classname_from_task_spec(task_spec):
    return task_spec.get("functionDescriptor", {})\
                .get("pythonFunctionDescriptor", {})\
                .get("className", "Unknown actor class")

We only use these 3 fields, so keeping the whole task spec would be wasteful?

Maybe we can create a "metadata" field which contains data that is required by the dashboard (for now just actor class name).

@rkooo567
Copy link
Contributor

But cc @mfitton just in case

@mfitton
Copy link
Contributor

mfitton commented Nov 11, 2020

@rkooo567 @ffbin Sang is right that we only need those fields from the task spec, but we pull that data from the GetNodeStatsReply protobuf (defined in node_manager.proto), so some kind of task spec data (even if it has unnecessary fields removed) will need to remain in that reply. This data is currently supplied in the NodeManager function HandleGetNodeStats taken from the local_queues_ attribute.

If this data is removed from the reply, we will no longer be able to show users information about pending and infeasible actors within the logical view.

Correct me if I'm wrong because I only glanced through the PR, but it doesn't seem like this affects the data stored in the local task queue per node at all, so it shouldn't actually affect the dashboard.

@ffbin
Copy link
Contributor Author

ffbin commented Nov 12, 2020

@rkooo567 @ffbin Sang is right that we only need those fields from the task spec, but we pull that data from the GetNodeStatsReply protobuf (defined in node_manager.proto), so some kind of task spec data (even if it has unnecessary fields removed) will need to remain in that reply. This data is currently supplied in the NodeManager function HandleGetNodeStats taken from the local_queues_ attribute.

If this data is removed from the reply, we will no longer be able to show users information about pending and infeasible actors within the logical view.

Correct me if I'm wrong because I only glanced through the PR, but it doesn't seem like this affects the data stored in the local task queue per node at all, so it shouldn't actually affect the dashboard.

Thank you for your detailed explains :)

Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in general. Just a couple comments.

Btw, does it mean after this PR is merged actor restart or gcs restart will still work? Or should we wait until 3 PRs are merged?

src/ray/protobuf/gcs.proto Outdated Show resolved Hide resolved
src/ray/protobuf/gcs.proto Show resolved Hide resolved
src/ray/gcs/gcs_server/gcs_actor_manager.cc Show resolved Hide resolved
src/ray/gcs/gcs_server/gcs_actor_manager.cc Outdated Show resolved Hide resolved
@ffbin
Copy link
Contributor Author

ffbin commented Nov 12, 2020

LGTM in general. Just a couple comments.

Btw, does it mean after this PR is merged actor restart or gcs restart will still work? Or should we wait until 3 PRs are merged?

After this PR is merged, gcs restart will not work until 3 PRs are merged, thanks.

@rkooo567
Copy link
Contributor

@ffbin Can you add skip for gcs server restart tests?

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 13, 2020
@ffbin
Copy link
Contributor Author

ffbin commented Nov 13, 2020

@ffbin Can you add skip for gcs server restart tests?

Oh, i will add skip for gcs server restart tests, thanks.

@ffbin ffbin requested a review from rkooo567 November 30, 2020 09:23
@ffbin ffbin force-pushed the add_actor_task_spec_table branch 2 times, most recently from de2dd35 to 59000a6 Compare December 2, 2020 08:05
@ffbin ffbin removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 2, 2020
@ffbin
Copy link
Contributor Author

ffbin commented Dec 3, 2020

Hi @rkooo567 , Can you help review this PR, thanks :)

@@ -47,6 +47,7 @@ def test_gcs_server_restart(ray_start_regular):
assert result == 2


@pytest.mark.skip("This test does not work yet.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it true?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops nvm. I re-read the description.

src/ray/gcs/gcs_server/gcs_actor_manager.cc Outdated Show resolved Hide resolved
src/ray/protobuf/gcs.proto Show resolved Hide resolved
src/ray/protobuf/gcs.proto Outdated Show resolved Hide resolved
@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 4, 2020
@ffbin ffbin requested a review from rkooo567 December 5, 2020 07:09
@rkooo567
Copy link
Contributor

rkooo567 commented Jan 5, 2021

cc @mfitton

@rkooo567
Copy link
Contributor

Hey @ffbin do you have a plan to merge this soon?

@ffbin
Copy link
Contributor Author

ffbin commented Feb 23, 2021

Hey @ffbin do you have a plan to merge this soon?

Yeah, this will bring great benefits to the performance. I will see how to solve the problem of dashboard.

@ffbin ffbin force-pushed the add_actor_task_spec_table branch from 46216bb to 508c334 Compare April 6, 2021 09:53
@ffbin ffbin force-pushed the add_actor_task_spec_table branch from bc6db6e to 635eced Compare April 8, 2021 01:54
@rkooo567
Copy link
Contributor

cc @ffbin any status on this PR?

@ffbin
Copy link
Contributor Author

ffbin commented Jun 17, 2021

@rkooo567 Sorry, I haven't invested in this PR recently. I will ask @clay4444 to help me finish it.

@rkooo567
Copy link
Contributor

cc @ffbin no plan to work on this further? Should I close it?

@kfstorm
Copy link
Member

kfstorm commented Nov 26, 2021

@clay4444 Could you update the status?

@bveeramani
Copy link
Member

‼️ ACTION REQUIRED ‼️

We've switched our code formatter from YAPF to Black (see #21311).

To prevent issues with merging your code, here's what you'll need to do:

  1. Install Black
pip install -I black==21.12b0
  1. Format changed files with Black
curl -o format-changed.sh https://gist.githubusercontent.com/bveeramani/42ef0e9e387b755a8a735b084af976f2/raw/7631276790765d555c423b8db2b679fd957b984a/format-changed.sh
chmod +x ./format-changed.sh
./format-changed.sh
rm format-changed.sh
  1. Commit your changes.
git add --all
git commit -m "Format Python code with Black"
  1. Merge master into your branch.
git pull upstream master
  1. Resolve merge conflicts (if necessary).

After running these steps, you'll have the updated format.sh.

@WangTaoTheTonic
Copy link
Contributor

I'll pick this up.

@stale
Copy link

stale bot commented Mar 12, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Mar 12, 2022
@kfstorm
Copy link
Member

kfstorm commented Mar 13, 2022

‼️ ACTION REQUIRED ‼️

We've updated our formatting configuration for C++ code. (see #22725)

This PR includes C++ code change. To prevent issues with merging your code, here's what you'll need to do:

  1. Merge the latest changes from upstream/master branch into your branch.
git pull upstream master
git merge upstream/master
  1. Resolve merge conflicts (if necessary).

After running these steps, you'll have the updated C++ formatting configuration.

  1. Format changed files.
scripts/format.sh
  1. Commit your changes.
git add --all
git commit -m "Format C++ code"

@stale stale bot removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Mar 13, 2022
@ffbin ffbin closed this Apr 7, 2022
rkooo567 pushed a commit that referenced this pull request Apr 12, 2022
This is a rebase version of #11592. As task spec info is only needed when gcs create or start an actor, so we can remove it from actor table and save the serialization time and memory/network cost when gcs clients get actor infos from gcs.

As internal repository varies very much from the community. This pr just add some manual check with simple cherry pick. Welcome to comment first and at the meantime I'll see if there's any test case failed or some points were missed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants