Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] Create zip of ray session_latest/logs dir on test failure and upload to buildkite via /artifact-mount #23783

Merged
merged 31 commits into from
Apr 22, 2022

Conversation

jon-chuang
Copy link
Contributor

@jon-chuang jon-chuang commented Apr 7, 2022

Why are these changes needed?

Creates a zip of session_latest dir with test name and timestamp upon python test failure. Writes to dir specified by env var RAY_TEST_FAILURE_LOGS_DIR. Noop if env var does not exist.

Downstream consumer (e.g. CI) can upload all created artifacts in this dir. Thereby, PR submitters can more easily debug their CI failures, especially if they can't repro locally.

Limitations:

  • a conftest.py file importing the main ray conftest.py needs to be present in same dir as test. This presents a challenge for e.g. dashboard tests which are highly scattered

Related issue number

#23746

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jon-chuang jon-chuang force-pushed the ci-copy-logs-on-failure branch from 08b780d to 894426e Compare April 7, 2022 19:31
@jon-chuang jon-chuang force-pushed the ci-copy-logs-on-failure branch from 894426e to 21f9e0e Compare April 7, 2022 19:32
@jon-chuang
Copy link
Contributor Author

jon-chuang commented Apr 9, 2022

It appears as of the CI run https://buildkite.com/ray-project/ray-builders-pr/builds/29186#bc5924e7-b7be-4738-95d8-ff3d40262b57, the experiment was a success.

It seems that per test, the zip file uploaded is about 20-60KB.

@jon-chuang jon-chuang changed the title [CI] Create zip of ray session dir on test failure [CI] Create zip of ray session dir on test failure and upload to buildkite via /artifact-mount Apr 9, 2022
@jon-chuang
Copy link
Contributor Author

jon-chuang commented Apr 9, 2022

Seems like ray_spilled_objects was also zipped into the artifact. Changed to zip only the logs dir.

@jon-chuang jon-chuang added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Apr 9, 2022
@jon-chuang jon-chuang force-pushed the ci-copy-logs-on-failure branch from d683dd2 to 59a625e Compare April 9, 2022 20:25
@jon-chuang jon-chuang removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Apr 10, 2022
@krfricke
Copy link
Contributor

Hi @jon-chuang, can we get another test run showing the result artifact? I can only see the linked example with the spilled objects.

@jon-chuang
Copy link
Contributor Author

jon-chuang commented Apr 12, 2022

Now using OS-agnostic way of getting tmp dir

@jon-chuang
Copy link
Contributor Author

@krfricke do you have any suggestions on what the archive dir for storing the zipped logs dirs should be for windows and mac respectively? Seems the buildkite agents will upload from /tmp/artifacts:/artifact-mount on Linux, but not on Windows and Mac.

@jon-chuang jon-chuang force-pushed the ci-copy-logs-on-failure branch from ced0e4d to 771d552 Compare April 14, 2022 15:37
@jon-chuang jon-chuang added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Apr 14, 2022
@jon-chuang
Copy link
Contributor Author

jon-chuang commented Apr 15, 2022

I've decided to just enable this feature for Linux.

Windows and Mac tests can come later if someone has an idea on the right way to configure an artifact mount dir for these.

@jon-chuang
Copy link
Contributor Author

@krfricke I am quite satisfied with the coverage for now. You can see the coverage and example artifacts here: https://buildkite.com/ray-project/ray-builders-pr/builds/29627

@jon-chuang
Copy link
Contributor Author

The coverage is around dashboard tests, select serve & rllib and ML tests, and almost all python core tests.

@jon-chuang jon-chuang removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Apr 16, 2022
Copy link
Contributor

@krfricke krfricke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally good, but let's clean up the code a bit please. Also, in some tests over 100 artifacts are generated (e.g. https://buildkite.com/ray-project/ray-builders-pr/builds/29627#cfb09371-0834-4f11-98ec-27ccde14f870) but Buildkite only supports up to 100. This is probably more something for a follow-up PR, but we could group some tests if this limit is exceeded

python/ray/tests/conftest.py Outdated Show resolved Hide resolved
if platform.system() == "Linux":
if not os.path.exists(archive_dir):
os.makedirs(archive_dir)
output_file = f"{archive_dir}/{rep.nodeid.split('/')[-1]}-{time.time()}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use os.path here?

Suggested change
output_file = f"{archive_dir}/{rep.nodeid.split('/')[-1]}-{time.time()}"
test_name = rep.nodeid.split('/')[-1]
output_file = os.path.join(archive_dir, f"{test_name}_{time.time():.4f}")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, for test_name, should we just do rep.nodeid.replace(os.sep, "_") to get the full path? (Evne though it seems this is not really reflected in the current output anyway)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

went with rep.nodeid.replace(os.sep, "::")

Comment on lines 690 to 693
if rep.when == "call" and rep.failed:
archive_dir = os.environ.get("RAY_TEST_FAILURE_LOGS_ARCHIVE_DIR")

if archive_dir is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid these deep if nestings, can we do something like:

if rep.when != "call" or not rep.failed:
    return

archive_dir = os.environ.get("RAY_TEST_FAILURE_LOGS_ARCHIVE_DIR")

if not archive_dir:
    return
    
if platform.system() != "Linux":
    return

tmp_dir = gettempdir()
log_dir = os.path.join(tmp_dir, "ray", "session_latest", "logs")
if not os.path.exists(log_dir):
    return

etc

python/ray/tests/conftest.py Outdated Show resolved Hide resolved
@jon-chuang
Copy link
Contributor Author

Also, in some tests over 100 artifacts are generated (e.g. https://buildkite.com/ray-project/ray-builders-pr/builds/29627#cfb09371-0834-4f11-98ec-27ccde14f870) but Buildkite only supports up to 100.

https://buildkite.com/ray-project/ray-builders-pr/builds/29627#16e0350a-6204-48f8-8483-7b6acb051d96 seems to show that it supports much more than 100 artifacts.

In practice I don't think any limit above 100 is a concern as we are expecting at most 1-5 tests to fail at a time (unless there is a test outage in which case probably any set of logs could help with diagnosis).

Copy link
Contributor

@krfricke krfricke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the refactor, this looks good to me. Will need code owner approval from other teams before we can merge.
While we're waiting for that, can you merge latest master and kick off CI again to fix the failing Ray client tests?

@jon-chuang
Copy link
Contributor Author

Seems the tests are passing now.

@krfricke krfricke merged commit e6a458a into ray-project:master Apr 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants