[CI] Create zip of ray `session_latest/logs` dir on test failure and upload to buildkite via `/artifact-mount` #23783

jon-chuang · 2022-04-07T18:57:42Z

Why are these changes needed?

On Python test failure, creates a zip of the session_latest dir, named with the test name and a timestamp. The zip is written to the dir specified by the env var RAY_TEST_FAILURE_LOGS_DIR. This is a no-op if the env var is not set.

Downstream consumer (e.g. CI) can upload all created artifacts in this dir. Thereby, PR submitters can more easily debug their CI failures, especially if they can't repro locally.

Limitations:

A conftest.py file that imports the main ray conftest.py needs to be present in the same dir as the test. This is a challenge for e.g. the dashboard tests, which are highly scattered.

Related issue number

#23746

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

jon-chuang · 2022-04-09T04:11:57Z

As of the CI run https://buildkite.com/ray-project/ray-builders-pr/builds/29186#bc5924e7-b7be-4738-95d8-ff3d40262b57, the experiment appears to be a success.

The uploaded zip file seems to be about 20-60 KB per test.

jon-chuang · 2022-04-09T04:28:56Z

Seems like ray_spilled_objects was also zipped into the artifact. Changed to zip only the logs dir.
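One way to restrict the archive to the logs dir is sketched below. This is an illustration only, not the PR's code: the helper name `zip_logs_only` and its parameters are hypothetical, and it assumes the session dir contains a `logs` subdirectory next to dirs such as `ray_spilled_objects`.

```python
import shutil


def zip_logs_only(session_dir: str, output_base: str) -> str:
    """Zip only <session_dir>/logs and return the path of the created zip.

    Hypothetical helper for illustration; not taken from the PR.
    """
    # root_dir is the directory the archive paths are relative to;
    # base_dir limits the archive to the "logs" subdirectory, so sibling
    # dirs such as ray_spilled_objects are left out.
    return shutil.make_archive(
        output_base, "zip", root_dir=session_dir, base_dir="logs"
    )
```

`shutil.make_archive` appends `.zip` to `output_base`, so the caller passes the path without an extension.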

krfricke · 2022-04-12T09:20:21Z

Hi @jon-chuang, can we get another test run showing the result artifact? I can only see the linked example with the spilled objects.

jon-chuang · 2022-04-12T14:27:05Z

Now using an OS-agnostic way of getting the tmp dir.
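A minimal sketch of what an OS-agnostic lookup can look like, assuming the standard library's `tempfile.gettempdir()` and the `ray/session_latest/logs` layout mentioned elsewhere in this thread. The helper name `default_ray_log_dir` is hypothetical.

```python
import os
import tempfile


def default_ray_log_dir() -> str:
    """Return <tmp>/ray/session_latest/logs for the current platform.

    Hypothetical helper; the directory layout is an assumption based on
    the paths discussed in this PR.
    """
    # gettempdir() honours TMPDIR/TEMP/TMP and falls back to a
    # platform-specific default, so "/tmp" is not hard-coded.
    return os.path.join(tempfile.gettempdir(), "ray", "session_latest", "logs")
```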

jon-chuang · 2022-04-13T19:22:50Z

@krfricke do you have any suggestions on what the archive dir for storing the zipped logs dirs should be on Windows and Mac respectively? It seems the buildkite agents will upload from /tmp/artifacts:/artifact-mount on Linux, but not on Windows and Mac.

jon-chuang · 2022-04-15T18:08:57Z

I've decided to just enable this feature for Linux.

Windows and Mac tests can come later if someone has an idea on the right way to configure an artifact mount dir for these.

jon-chuang · 2022-04-16T01:26:37Z

@krfricke I am quite satisfied with the coverage for now. You can see the coverage and example artifacts here: https://buildkite.com/ray-project/ray-builders-pr/builds/29627

jon-chuang · 2022-04-16T18:12:54Z

The coverage includes dashboard tests, select serve, rllib and ML tests, and almost all Python core tests.

krfricke

Generally good, but let's clean up the code a bit please. Also, in some tests over 100 artifacts are generated (e.g. https://buildkite.com/ray-project/ray-builders-pr/builds/29627#cfb09371-0834-4f11-98ec-27ccde14f870) but Buildkite only supports up to 100. This is probably more something for a follow-up PR, but we could group some tests if this limit is exceeded.

krfricke · 2022-04-18T14:40:13Z

python/ray/tests/conftest.py

+            if platform.system() == "Linux":
+                if not os.path.exists(archive_dir):
+                    os.makedirs(archive_dir)
+                output_file = f"{archive_dir}/{rep.nodeid.split('/')[-1]}-{time.time()}"


Can we use os.path here?

Suggested change

-                output_file = f"{archive_dir}/{rep.nodeid.split('/')[-1]}-{time.time()}"
+                test_name = rep.nodeid.split('/')[-1]
+                output_file = os.path.join(archive_dir, f"{test_name}_{time.time():.4f}")

Btw, for test_name, should we just do rep.nodeid.replace(os.sep, "_") to get the full path? (Even though it seems this is not really reflected in the current output anyway.)

went with rep.nodeid.replace(os.sep, "::")
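For illustration, the naming scheme discussed here could look like the sketch below. The function name `archive_base_name` is hypothetical, and the node id format is assumed to be pytest's usual `path/to/test_file.py::test_name`.

```python
import os
import time


def archive_base_name(archive_dir: str, nodeid: str) -> str:
    """Build the archive path (without extension) for a failed test.

    Hypothetical helper for illustration; not taken from the PR.
    """
    # Replace path separators so the whole node id fits in one file name.
    # On Linux, where this feature is enabled, os.sep is "/".
    test_name = nodeid.replace(os.sep, "::")
    # The timestamp keeps names unique when the same test fails repeatedly.
    return os.path.join(archive_dir, f"{test_name}_{time.time():.4f}")
```

Keeping the full path in the name avoids collisions between tests that share a file name in different directories.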

krfricke · 2022-04-18T14:42:53Z

python/ray/tests/conftest.py

+    if rep.when == "call" and rep.failed:
+        archive_dir = os.environ.get("RAY_TEST_FAILURE_LOGS_ARCHIVE_DIR")
+
+        if archive_dir is not None:


To avoid these deep if nestings, can we do something like:

if rep.when != "call" or not rep.failed:
    return

archive_dir = os.environ.get("RAY_TEST_FAILURE_LOGS_ARCHIVE_DIR")
if not archive_dir:
    return

if platform.system() != "Linux":
    return

tmp_dir = gettempdir()
log_dir = os.path.join(tmp_dir, "ray", "session_latest", "logs")

if not os.path.exists(log_dir):
    return

etc
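Putting the early returns together with the archiving step, a self-contained sketch might look like the following. It is not the merged implementation: the function name `maybe_archive_logs` is hypothetical, `rep` is assumed to expose `when`, `failed` and `nodeid` like a pytest test report, and the zip step via `shutil.make_archive` is an assumption.

```python
import os
import platform
import shutil
import time
from tempfile import gettempdir
from typing import Optional


def maybe_archive_logs(rep) -> Optional[str]:
    """Zip the Ray logs dir for a failed test; return the zip path or None.

    Hypothetical sketch using guard clauses instead of nested ifs.
    """
    # Only act on failures in the test body, not in setup or teardown.
    if rep.when != "call" or not rep.failed:
        return None

    # No-op unless the archive dir is configured.
    archive_dir = os.environ.get("RAY_TEST_FAILURE_LOGS_ARCHIVE_DIR")
    if not archive_dir:
        return None

    # The feature is limited to Linux in this PR.
    if platform.system() != "Linux":
        return None

    log_dir = os.path.join(gettempdir(), "ray", "session_latest", "logs")
    if not os.path.exists(log_dir):
        return None

    os.makedirs(archive_dir, exist_ok=True)
    test_name = rep.nodeid.replace(os.sep, "::")
    output_file = os.path.join(archive_dir, f"{test_name}_{time.time():.4f}")
    # make_archive appends ".zip" and returns the final path.
    return shutil.make_archive(output_file, "zip", root_dir=log_dir)
```

In a conftest.py, a function like this would typically be called from a `pytest_runtest_makereport` hook wrapper once the report is available.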

jon-chuang · 2022-04-18T17:33:36Z

> Also, in some tests over 100 artifacts are generated (e.g. https://buildkite.com/ray-project/ray-builders-pr/builds/29627#cfb09371-0834-4f11-98ec-27ccde14f870) but Buildkite only supports up to 100.

https://buildkite.com/ray-project/ray-builders-pr/builds/29627#16e0350a-6204-48f8-8483-7b6acb051d96 seems to show that it supports much more than 100 artifacts.

In practice I don't think any limit above 100 is a concern as we are expecting at most 1-5 tests to fail at a time (unless there is a test outage in which case probably any set of logs could help with diagnosis).

krfricke

Thanks for the refactor, this looks good to me. Will need code owner approval from other teams before we can merge.
While we're waiting for that, can you merge latest master and kick off CI again to fix the failing Ray client tests?

jon-chuang · 2022-04-21T05:03:47Z

Seems the tests are passing now.

commit

a123bc2

jon-chuang force-pushed the ci-copy-logs-on-failure branch from 08b780d to 894426e Compare April 7, 2022 19:31

artificial test failure

21f9e0e

jon-chuang force-pushed the ci-copy-logs-on-failure branch from 894426e to 21f9e0e Compare April 7, 2022 19:32

jon-chuang added 3 commits April 8, 2022 11:28

Merge branch 'master' into ci-copy-logs-on-failure

31e4baa

add . dir, more affected tests

f1bf435

replace with /artifact-mount

fd104ac

cleanup

3db87d3

jon-chuang assigned krfricke and simon-mo Apr 9, 2022

jon-chuang changed the title ~~[CI] Create zip of ray session dir on test failure~~ [CI] Create zip of ray session dir on test failure and upload to buildkite via /artifact-mount Apr 9, 2022

jon-chuang added 2 commits April 9, 2022 00:29

only zip logs dir

c4ff639

erroneous pipeline cmd

854e529

jon-chuang added the @author-action-required label Apr 9, 2022

add more tests to archive logs dir

59a625e

jon-chuang force-pushed the ci-copy-logs-on-failure branch from d683dd2 to 59a625e Compare April 9, 2022 20:25

jon-chuang removed the @author-action-required label Apr 10, 2022

force failure to show uploaded artifacts

e0a1561

experiment: trigger upload to see coverage. Add hook to more conftests

8872772

jon-chuang requested review from sven1977, gjoliver, avnishn, ArturNiederfahrenhorst, smorad, ericl and fishbone as code owners April 12, 2022 18:18

change bash script

98b4859

jon-chuang added 4 commits April 13, 2022 15:23

no quotes around script

90f3b32

limit to linux

e1bacd1

reorg

2a9d99c

disable shellcheck as does other section of script does

771d552

jon-chuang force-pushed the ci-copy-logs-on-failure branch from ced0e4d to 771d552 Compare April 14, 2022 15:37

Merge branch 'master' into ci-copy-logs-on-failure

2f24a25

jon-chuang added the @author-action-required label Apr 14, 2022

remove should fail

8632a70

jon-chuang added 2 commits April 15, 2022 16:06

check coverage again

1d55960

revert to only when test fail

0cb4f14

jon-chuang removed the @author-action-required label Apr 16, 2022

krfricke reviewed Apr 18, 2022

apply suggestions from code review

13a3d9b

krfricke approved these changes Apr 19, 2022

sven1977 approved these changes Apr 19, 2022

clarkzinzow approved these changes Apr 19, 2022

stephanie-wang approved these changes Apr 19, 2022

Merge branch 'master' into ci-copy-logs-on-failure

107a4ba

fishbone approved these changes Apr 19, 2022

simon-mo approved these changes Apr 19, 2022

jon-chuang added 2 commits April 20, 2022 00:37

merge master

792f6c1

Merge branch 'master' into ci-copy-logs-on-failure

db55fbd

krfricke merged commit e6a458a into ray-project:master Apr 22, 2022
