-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CI] Create zip of ray session_latest/logs
dir on test failure and upload to buildkite via /artifact-mount
#23783
[CI] Create zip of ray session_latest/logs
dir on test failure and upload to buildkite via /artifact-mount
#23783
Conversation
08b780d
to
894426e
Compare
894426e
to
21f9e0e
Compare
It appears as of the CI run https://buildkite.com/ray-project/ray-builders-pr/builds/29186#bc5924e7-b7be-4738-95d8-ff3d40262b57, the experiment was a success. It seems that per test, the zip file uploaded is about 20-60KB. |
/artifact-mount
Seems like ray_spilled_objects was also zipped into the artifact. Changed to zip only the logs dir. |
d683dd2
to
59a625e
Compare
Hi @jon-chuang, can we get another test run showing the result artifact? I can only see the linked example with the spilled objects. |
Now using OS-agnostic way of getting tmp dir |
@krfricke do you have any suggestions on what the archive dir for storing the zipped logs dirs should be for windows and mac respectively? Seems the buildkite agents will upload from |
ced0e4d
to
771d552
Compare
I've decided to just enable this feature for Linux. Windows and Mac tests can come later if someone has an idea on the right way to configure an artifact mount dir for these. |
@krfricke I am quite satisfied with the coverage for now. You can see the coverage and example artifacts here: https://buildkite.com/ray-project/ray-builders-pr/builds/29627 |
The coverage is around dashboard tests, select serve & rllib and ML tests, and almost all python core tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally good, but let's clean up the code a bit please. Also, in some tests over 100 artifacts are generated (e.g. https://buildkite.com/ray-project/ray-builders-pr/builds/29627#cfb09371-0834-4f11-98ec-27ccde14f870) but Buildkite only supports up to 100. This is probably more something for a follow-up PR, but we could group some tests if this limit is exceeded
python/ray/tests/conftest.py
Outdated
if platform.system() == "Linux": | ||
if not os.path.exists(archive_dir): | ||
os.makedirs(archive_dir) | ||
output_file = f"{archive_dir}/{rep.nodeid.split('/')[-1]}-{time.time()}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use os.path
here?
output_file = f"{archive_dir}/{rep.nodeid.split('/')[-1]}-{time.time()}" | |
test_name = rep.nodeid.split('/')[-1] | |
output_file = os.path.join(archive_dir, f"{test_name}_{time.time():.4f}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, for test_name
, should we just do rep.nodeid.replace(os.sep, "_")
to get the full path? (Evne though it seems this is not really reflected in the current output anyway)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
went with rep.nodeid.replace(os.sep, "::")
python/ray/tests/conftest.py
Outdated
if rep.when == "call" and rep.failed: | ||
archive_dir = os.environ.get("RAY_TEST_FAILURE_LOGS_ARCHIVE_DIR") | ||
|
||
if archive_dir is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid these deep if
nestings, can we do something like:
if rep.when != "call" or not rep.failed:
return
archive_dir = os.environ.get("RAY_TEST_FAILURE_LOGS_ARCHIVE_DIR")
if not archive_dir:
return
if platform.system() != "Linux":
return
tmp_dir = gettempdir()
log_dir = os.path.join(tmp_dir, "ray", "session_latest", "logs")
if not os.path.exists(log_dir):
return
etc
https://buildkite.com/ray-project/ray-builders-pr/builds/29627#16e0350a-6204-48f8-8483-7b6acb051d96 seems to show that it supports much more than 100 artifacts. In practice I don't think any limit above 100 is a concern as we are expecting at most 1-5 tests to fail at a time (unless there is a test outage in which case probably any set of logs could help with diagnosis). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the refactor, this looks good to me. Will need code owner approval from other teams before we can merge.
While we're waiting for that, can you merge latest master and kick off CI again to fix the failing Ray client tests?
Seems the tests are passing now. |
Why are these changes needed?
Creates a zip of session_latest dir with test name and timestamp upon python test failure. Writes to dir specified by env var
RAY_TEST_FAILURE_LOGS_DIR
. Noop if env var does not exist.Downstream consumer (e.g. CI) can upload all created artifacts in this dir. Thereby, PR submitters can more easily debug their CI failures, especially if they can't repro locally.
Limitations:
Related issue number
#23746
Checks
scripts/format.sh
to lint the changes in this PR.