Closed
Description
What happened + What you expected to happen
We run Ray on Kubernetes using the KubeRay project. We have a sanity test that runs a simple job via the job submission API. The workload succeeds; however, the metric unintentional_worker_failures_total is also incremented.
That metric should not be incremented in this case. Its definition reads "Number of worker failures that are not intentional", and no worker failed during the run.
I asked about it on the Slack channel and was told to file an issue.
Versions / Dependencies
2.6.1
Reproduction script
# workload.py
import ray


@ray.remote
def workload(val: str) -> str:
    return f"got {val}"


if __name__ == "__main__":
    ray.init()
    assert ray.get(workload.remote("foo")) == "got foo"

# submit.py
from ray.job_submission import JobSubmissionClient

# ray_head_address is the address of the Ray head node (not shown here)
client = JobSubmissionClient(ray_head_address)
job_id = client.submit_job(entrypoint='python workload.py')
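
To confirm that the counter moves, one option is to read the metrics before and after submitting the job and compare the totals. The following is a minimal sketch, not part of the original test. It assumes the metrics are available in Prometheus text format, that the counter is exported as ray_unintentional_worker_failures_total (the ray_ prefix is an assumption), and that the metrics URL is supplied by the caller. The helper names fetch_metrics and sum_counter are hypothetical.

```python
# check_metric.py (hypothetical helper, not part of the original report)
import urllib.request


def fetch_metrics(metrics_url: str) -> str:
    """Download the raw Prometheus text exposition from metrics_url.

    metrics_url is an assumption: it must point at wherever the cluster
    exposes Ray's metrics.
    """
    with urllib.request.urlopen(metrics_url) as resp:
        return resp.read().decode("utf-8")


def sum_counter(metrics_text: str, metric_name: str) -> float:
    """Sum every sample of metric_name across all label sets."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample looks like: name{label="v",...} value [timestamp]
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name != metric_name:
            continue
        # The value follows the label block if there is one,
        # otherwise it follows the metric name directly.
        rest = line.rsplit("}", 1)[-1] if "{" in line else line[len(name):]
        total += float(rest.split()[0])
    return total
```

Calling sum_counter(fetch_metrics(url), "ray_unintentional_worker_failures_total") once before and once after the job gives two totals. If the second is larger even though the job succeeded, that reproduces the behaviour described above.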
Issue Severity
Low: It annoys or frustrates me.