[Core] Metric unintentional_worker_failures_total is not accurate #1918

Closed
@amir-f

Description

What happened + What you expected to happen

We run Ray on Kubernetes using the KubeRay project. We have a sanity test that runs a simple job via the job submission API. The workload succeeds, but the metric unintentional_worker_failures_total is also incremented.

That metric should not be incremented in this case. Its definition reads: "Number of worker failures that are not intentional."

I asked about it on the Slack channel and was told to file an issue.

Versions / Dependencies

2.6.1

Reproduction script

# workload.py

import ray


@ray.remote
def workload(val: str) -> str:
    return f"got {val}"


if __name__ == "__main__":
    ray.init()
    assert ray.get(workload.remote("foo")) == "got foo"

# submit.py

from ray.job_submission import JobSubmissionClient

# ray_head_address is a placeholder: the URL of the Ray head node's job server for your cluster.
client = JobSubmissionClient(ray_head_address)
job_id = client.submit_job(entrypoint='python workload.py')
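
One way to confirm the increment is to compare the counter before and after the job runs. The sketch below is not part of the original report. It assumes the metrics are available as Prometheus text exposition output, and the helper name counter_total and the sample strings are hypothetical and made up for illustration. The exported metric name may carry a prefix (for example ray_), so adjust the name to match what your scrape actually returns.

```python
import re

# Matches a Prometheus sample line: name, optional {labels}, then the value.
_SAMPLE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)")


def counter_total(metrics_text: str, name: str) -> float:
    """Sum every sample of `name` across all label sets (hypothetical helper)."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match and match.group(1) == name:
            total += float(match.group(3))
    return total


if __name__ == "__main__":
    name = "unintentional_worker_failures_total"
    # Illustrative scrape output only; these are not real values from a cluster.
    before = 'unintentional_worker_failures_total{NodeAddress="10.0.0.1"} 2\n'
    after = 'unintentional_worker_failures_total{NodeAddress="10.0.0.1"} 3\n'
    delta = counter_total(after, name) - counter_total(before, name)
    print(f"{name} changed by {delta} during the job")
```

Taking the two scrapes immediately before submitting the job and after it reports success keeps unrelated worker failures out of the comparison.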

Issue Severity

Low: It annoys or frustrates me.

Metadata

    Labels

    P1 (Issue that should be fixed within a few weeks), bug (Something isn't working)
