Description
What is the problem?
When GCS restarts, the GCS client of Worker A detects that the GCS server has restarted and resends the create-actor request. GCS then starts the whole scheduling process over and may pick a different raylet or worker to schedule the actor.
Problem: GCS doesn't know whether it had already leased a worker from another raylet for this actor before the restart, so worker resources may be leaked.
Solution: The GCS server tells each raylet which of that raylet's leased workers it is using. If a raylet finds that a leased worker is not in use, it releases that worker.
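The proposed reconciliation can be sketched roughly as follows. This is a toy model, not Ray's actual code: the `Raylet` class, `release_unused_workers`, `reconcile_after_gcs_restart`, and the node/worker/actor IDs are all hypothetical names chosen for illustration, and it assumes GCS can rebuild the set of workers it uses per node after restarting.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class Raylet:
    """Toy stand-in for a raylet (hypothetical, not Ray's real class)."""
    node_id: str
    # worker_id -> actor_id the worker was leased for
    leased_workers: Dict[str, str] = field(default_factory=dict)

    def release_unused_workers(self, workers_in_use: Set[str]) -> Set[str]:
        """Release every leased worker that GCS does not report as in use."""
        unused = set(self.leased_workers) - workers_in_use
        for worker_id in unused:
            del self.leased_workers[worker_id]
        return unused


def reconcile_after_gcs_restart(
    raylets: Iterable[Raylet],
    gcs_leases: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Send each raylet the workers GCS uses on that node.

    gcs_leases maps node_id -> worker ids GCS knows it is using (assumed
    to be recoverable after the restart). Returns node_id -> released ids.
    """
    released = {}
    for raylet in raylets:
        # A node missing from gcs_leases has no workers in use by GCS.
        in_use = gcs_leases.get(raylet.node_id, set())
        released[raylet.node_id] = raylet.release_unused_workers(in_use)
    return released
```

For example, if the restarted GCS only knows about worker `w1` on `node-a`, then a stale lease `w2` on `node-a` and a stale lease `w3` on `node-b` would both be released, while `w1` stays leased.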
Ray version and other system information (Python version, TensorFlow version, OS):
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
If we cannot run your script, we cannot fix your issue.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.