[core] GCS doesn't always cancel worker leases for killed actors #13545

Open
@ffbin

Description

What is the problem?

The raylet does not guarantee the order in which it handles RequestWorkerLease and CancelWorkerLease. If an actor is killed immediately after it is created, the cancel may be handled before the lease request it is meant to cancel, so the request cached by the raylet may never be cleaned up.
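The ordering problem can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyRaylet`, its methods, and `replay` are hypothetical names invented for illustration and are not the raylet's actual implementation or API. It shows that when a cancel is handled before the matching request, the cancel is a no-op and the request stays cached.

```python
# Toy model only: ToyRaylet and its methods are hypothetical, not Ray's API.
class ToyRaylet:
    def __init__(self):
        # Lease requests cached because no worker can be granted yet.
        self.pending_leases = set()

    def request_worker_lease(self, actor_id):
        self.pending_leases.add(actor_id)

    def cancel_worker_lease(self, actor_id):
        # Cancelling an unknown lease does nothing, so an early cancel is lost.
        self.pending_leases.discard(actor_id)


def replay(messages):
    """Apply (kind, actor_id) messages in order and return the leftover leases."""
    raylet = ToyRaylet()
    for kind, actor_id in messages:
        if kind == "request":
            raylet.request_worker_lease(actor_id)
        else:
            raylet.cancel_worker_lease(actor_id)
    return raylet.pending_leases


in_order = replay([("request", "actor-1"), ("cancel", "actor-1")])
reordered = replay([("cancel", "actor-1"), ("request", "actor-1")])
print(in_order)   # set()
print(reordered)  # {'actor-1'}
```

In the reordered case the lease for the killed actor is left behind, which matches the leftover resource demand that the reproduction in this issue waits on.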

Ray version and other system information (Python version, TensorFlow version, OS):

Reproduction (REQUIRED)

This problem can be reproduced with the following test, which is based on #13254.

# Imports are assumed from Ray's test suite at the time of this issue.
import ray
import ray.gcs_utils
import ray.ray_constants
from ray._raylet import GlobalStateAccessor
from ray.test_utils import wait_for_condition


def test_kill_pending_actor_with_no_restart_true():
    cluster = ray.init()
    global_state_accessor = GlobalStateAccessor(
        cluster["redis_address"], ray.ray_constants.REDIS_DEFAULT_PASSWORD)
    global_state_accessor.connect()

    # No node provides the "WORKER" resource, so the actor stays pending.
    @ray.remote(resources={"WORKER": 1.0})
    class PendingActor:
        pass

    # Kill actor with `no_restart=True`.
    actor = PendingActor.remote()
    ray.kill(actor, no_restart=True)

    def condition1():
        # True once the GCS reports no outstanding resource demands.
        message = global_state_accessor.get_all_resource_usage()
        resource_usages = ray.gcs_utils.ResourceUsageBatchData.FromString(
            message)
        return len(
            resource_usages.resource_load_by_shape.resource_demands) == 0

    # Actor is dead, so the infeasible task queue length should be 0.
    wait_for_condition(condition1, timeout=10)

    global_state_accessor.disconnect()
    ray.shutdown()
  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.
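The reproduction relies on polling a condition until a timeout expires. A minimal sketch of that pattern follows. `poll_until` and `demand_is_empty` are hypothetical names used for illustration. This is not Ray's `wait_for_condition`, whose behavior on timeout may differ.

```python
import time


def poll_until(predicate, timeout=10.0, interval=0.1):
    """Return True once predicate() is truthy, or False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for the resource-demand check: demand is empty from the third poll.
polls = {"count": 0}


def demand_is_empty():
    polls["count"] += 1
    return polls["count"] >= 3


drained = poll_until(demand_is_empty, timeout=1.0, interval=0.01)
stuck = poll_until(lambda: False, timeout=0.05, interval=0.01)
print(drained, stuck)
```

With the bug described here, the real check behaves like the `stuck` case: the demand for the killed actor never drains, so the wait runs until the timeout.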

Metadata

Labels

P2: Important issue, but not time-critical
bug: Something that is supposed to be working; but isn't
needs-repro-script: Issue needs a runnable script to be reproduced
