Skip to content

[Core]Worker resource leak when GCS server restarts #9286

Closed
@ffbin

Description

@ffbin

What is the problem?

When GCS restarts, GCS client of Worker A detects that GCS server is restarted, and sends the create actor request again. It starts over the whole process and maybe picks up another Raylet or worker to schedule the actor.
Problem: GCS doesn’t know if it has leased an worker from another raylet for this actor before. Worker resources may be leaked.
Solution: GCS server sends each raylet which lease workers it uses. If raylet finds that a lease worker is not used, release the lease worker.

Ray version and other system information (Python version, TensorFlow version, OS):

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

If we cannot run your script, we cannot fix your issue.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Metadata

Metadata

Assignees

Labels

P2Important issue, but not time-criticalbugSomething that is supposed to be working; but isn't

Type

No type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions