[Core] Metric unintentional_worker_failures_total is not accurate #1918

Open
amir-f opened this issue Sep 5, 2023 · 1 comment
Labels
bug (Something isn't working), P1 (Issue that should be fixed within a few weeks)

Comments

amir-f commented Sep 5, 2023

What happened + What you expected to happen

We run Ray on Kubernetes using the KubeRay project. We have a sanity test that runs a simple job via the job submission API. The workload succeeds, but the metric unintentional_worker_failures_total is still incremented.

The metric should not be incremented in this case. Its definition reads "Number of worker failures that are not intentional."

I asked about it on the Slack channel and was told to file an issue.

Versions / Dependencies

2.6.1

Reproduction script

# workload.py

import ray


@ray.remote
def workload(val: str) -> str:
    return f"got {val}"


if __name__ == "__main__":
    ray.init()
    assert ray.get(workload.remote("foo")) == "got foo"
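
# submit.py

from ray.job_submission import JobSubmissionClient

# Placeholder: replace with the address of your head node's job submission endpoint.
ray_head_address = "http://127.0.0.1:8265"

client = JobSubmissionClient(ray_head_address)
job_id = client.submit_job(entrypoint='python workload.py')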
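
To quantify the spurious increment, one option is to scrape the metrics endpoint before and after submitting the job and compare the counter. The sketch below is only a minimal illustration and not part of the original reproduction: it parses Prometheus text-format output with the standard library and sums every sample of one metric. The exported name `ray_unintentional_worker_failures_total` (with a `ray_` prefix), the `NodeAddress` label, and the sample values are assumptions for illustration; check your own cluster's scrape output for the real names.

```python
import re

# Matches sample lines such as: metric_name{label="x"} 3.0
_SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{.*\})?\s+(?P<value>\S+)'
)


def sum_metric(exposition_text: str, metric_name: str) -> float:
    """Sum all samples of `metric_name` in Prometheus text-format output."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match and match.group("name") == metric_name:
            total += float(match.group("value"))
    return total


# Assumed exported name; the label and values below are made up for illustration.
METRIC = "ray_unintentional_worker_failures_total"

SAMPLE_BEFORE = """\
# TYPE ray_unintentional_worker_failures_total counter
ray_unintentional_worker_failures_total{NodeAddress="10.0.0.1"} 0.0
"""

SAMPLE_AFTER = """\
# TYPE ray_unintentional_worker_failures_total counter
ray_unintentional_worker_failures_total{NodeAddress="10.0.0.1"} 1.0
"""

# A non-zero delta after a successful job would reproduce the reported behavior.
delta = sum_metric(SAMPLE_AFTER, METRIC) - sum_metric(SAMPLE_BEFORE, METRIC)
print(f"{METRIC} changed by {delta}")
```

In a real check, the two sample strings would be replaced by the text scraped from the cluster immediately before and after the job runs.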

Issue Severity

Low: It annoys or frustrates me.

amir-f added the bug and triage labels Sep 5, 2023
rkooo567 added the P1 label and removed the triage label Sep 25, 2023
anyscalesam transferred this issue from ray-project/ray Feb 9, 2024
kevin85421 (Member) commented

@anyscalesam This is unrelated to KubeRay and instead belongs to observability.
