[Core] Bumping up task failure logs to warnings to make sure failures could be traced in Ray Core logs #43111

Merged
merged 3 commits into from
Feb 13, 2024

Conversation

alexeykudinkin

…troubleshooted

Why are these changes needed?

Currently, we observe a lot of failures like the following in our production deployment:

  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/handle.py", line 781, in __anext__
    return await next_obj_ref
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

However, we can't find any logs in Ray Core corresponding to this failure. Looking around, I realized that all of the relevant log statements are DEBUG logs, so tracing a failure requires switching to DEBUG mode, which would drown our logging infra.

Hence this PR bumps the task failure logs to at least WARNING, so that any failure is traceable in Ray Core logs.
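
To illustrate why the log level matters here, below is a minimal sketch using Python's standard `logging` module. It is only an analogy, not the Ray Core logging code touched by this PR; the function name `emitted_lines`, the logger name, and the messages are hypothetical. It shows that a message logged at DEBUG is dropped when the threshold is INFO, while the same message logged at WARNING still gets through.

```python
import io
import logging


def emitted_lines(threshold: int) -> list:
    """Return the log lines that survive a given logger threshold.

    Hypothetical helper for illustration only; not part of Ray.
    """
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

    # Use a dedicated, non-propagating logger so the demo is isolated
    # from any logging configuration in the surrounding process.
    logger = logging.getLogger(f"task_failure_demo.{threshold}")
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(threshold)
    logger.propagate = False

    # The same failure reported at two different levels.
    logger.debug("task failed (logged at DEBUG)")
    logger.warning("task failed (logged at WARNING)")
    return stream.getvalue().splitlines()


if __name__ == "__main__":
    # With a typical production threshold (INFO), only the WARNING survives.
    print(emitted_lines(logging.INFO))
    # With DEBUG enabled, both lines appear, along with every other DEBUG log.
    print(emitted_lines(logging.DEBUG))
```

With the threshold at INFO, only the WARNING line is captured, which is the situation described above: a failure logged at DEBUG leaves no trace unless the whole deployment is switched to DEBUG.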

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

rkooo567 commented Feb 13, 2024

btw @alexeykudinkin @rynewang can you check the logs of a random release test and make sure this is not spammy? I assume there's a possibility this was DEBUG because it can happen under normal conditions (though I am not 100% sure). It should take only a couple of minutes to verify (maybe check the logs from a random scalability test).

@jjyao jjyao merged commit 149e400 into ray-project:master Feb 13, 2024
9 checks passed
4 participants