[core] GCS doesn't always cancel worker leases for killed actors #13545
Open
Description
What is the problem?
The raylet doesn't guarantee the order in which it handles RequestWorkerLease and CancelWorkerLease. If we kill an actor immediately after creating it, the cancellation can be handled before the lease request it targets, and the request cached by the raylet may never be cleaned up.
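To make the ordering problem concrete, here is a toy model of it. This is only a sketch under assumptions, not Ray's implementation: `ToyRaylet`, `deliver`, and the `("request", ...)` / `("cancel", ...)` message tuples are hypothetical names invented for illustration, and the real raylet's lease queue is more involved.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRaylet:
    """Hypothetical stand-in for a raylet's cache of pending lease requests."""

    pending: set = field(default_factory=set)

    def request_worker_lease(self, actor_id: str) -> None:
        # The request is cached until resources become available.
        self.pending.add(actor_id)

    def cancel_worker_lease(self, actor_id: str) -> bool:
        # A cancel only removes a request that has already been cached.
        if actor_id in self.pending:
            self.pending.remove(actor_id)
            return True
        return False


def deliver(raylet: ToyRaylet, messages: list) -> list:
    """Apply messages in the given order and return what is still cached."""
    for kind, actor_id in messages:
        if kind == "request":
            raylet.request_worker_lease(actor_id)
        else:
            raylet.cancel_worker_lease(actor_id)
    return sorted(raylet.pending)


# Expected order: the cancel finds the cached request and removes it.
in_order = deliver(ToyRaylet(), [("request", "a1"), ("cancel", "a1")])

# Reordered: the cancel is a no-op, then the request is cached and never cleaned up.
reordered = deliver(ToyRaylet(), [("cancel", "a1"), ("request", "a1")])

print(in_order)   # []
print(reordered)  # ['a1']
```

In the reordered case the stale entry stays in the cache for good, which is presumably why the reproduction below never sees the resource demand drop to zero.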
Ray version and other system information (Python version, TensorFlow version, OS):
Reproduction (REQUIRED)
We can reproduce this problem based on #13254.
# Import paths assume the Ray 1.x layout in use when this issue was filed.
import ray
from ray._raylet import GlobalStateAccessor
from ray.test_utils import wait_for_condition


def test_kill_pending_actor_with_no_restart_true():
    cluster = ray.init()
    global_state_accessor = GlobalStateAccessor(
        cluster["redis_address"], ray.ray_constants.REDIS_DEFAULT_PASSWORD)
    global_state_accessor.connect()

    # The "WORKER" resource is not available, so the actor stays pending.
    @ray.remote(resources={"WORKER": 1.0})
    class PendingActor:
        pass

    # Kill actor with `no_restart=True`.
    actor = PendingActor.remote()
    ray.kill(actor, no_restart=True)

    def condition1():
        message = global_state_accessor.get_all_resource_usage()
        resource_usages = ray.gcs_utils.ResourceUsageBatchData.FromString(
            message)
        # True once no resource demand is reported any more.
        return len(resource_usages.resource_load_by_shape.resource_demands) == 0

    # Actor is dead, so the infeasible task queue length is 0.
    wait_for_condition(condition1, timeout=10)

    global_state_accessor.disconnect()
    ray.shutdown()
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.