Commit 52dceb1

russellb, markmc, and DarkLight1337 authored
[Docs] Add developer doc about CI failures (#18782)
Signed-off-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
1 parent abd7df2 commit 52dceb1

1 file changed: +120 -0 lines changed

docs/contributing/ci-failures.md

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
# CI Failures

What should I do when a CI job fails on my PR, but I don't think my PR caused
the failure?

- Check the dashboard of current CI test failures:
  👉 [CI Failures Dashboard](https://github.com/orgs/vllm-project/projects/20)

- If your failure **is already listed**, it's likely unrelated to your PR.
  Help fixing it is always welcome!
  - Leave comments with links to additional instances of the failure.
  - React with a 👍 to signal how many are affected.

- If your failure **is not listed**, you should **file an issue**.

## Filing a CI Test Failure Issue

- **File a bug report:**
  👉 [New CI Failure Report](https://github.com/vllm-project/vllm/issues/new?template=450-ci-failure.yml)

- **Use this title format:**

  ```
  [CI Failure]: failing-test-job - regex/matching/failing:test
  ```

- **For the environment field:**

  ```
  Still failing on main as of commit abcdef123
  ```

- **In the description, include failing tests:**

  ```
  FAILED failing/test.py:failing_test1 - Failure description
  FAILED failing/test.py:failing_test2 - Failure description
  FAILED failing/test.py:failing_test3 - Failure description
  ```

- **Attach logs** (collapsible section example):
  <details>
  <summary>Logs:</summary>

  ```text
  ERROR 05-20 03:26:38 [dump_input.py:68] Dumping input data
  --- Logging error ---
  Traceback (most recent call last):
    File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 203, in execute_model
      return self.model_executor.execute_model(scheduler_output)
  ...
  FAILED failing/test.py:failing_test1 - Failure description
  FAILED failing/test.py:failing_test2 - Failure description
  FAILED failing/test.py:failing_test3 - Failure description
  ```

  </details>
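
The description and log fields above can be drafted from a downloaded log with two small shell helpers. This is only a sketch, not part of vLLM's tooling: the function names `failed_tests` and `details_block` are hypothetical, and it assumes the log contains pytest-style summary lines that begin with `FAILED `.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for drafting a CI failure report; not part of vLLM.

# Print the unique summary lines that start with "FAILED ".
failed_tests() {
    grep '^FAILED ' "$1" | sort -u
}

# Wrap the last N lines (default 50) of a log in a collapsible section.
details_block() {
    local log_file="$1" lines="${2:-50}"
    local tick='`'
    local fence="${tick}${tick}${tick}"   # a Markdown code fence
    printf '<details>\n<summary>Logs:</summary>\n\n%stext\n' "$fence"
    tail -n "$lines" "$log_file"
    printf '%s\n\n</details>\n' "$fence"
}
```

For example, `failed_tests ci.log` prints the lines to paste into the description, and `details_block ci.log 100` prints a collapsible block for the issue body.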

## Logs Wrangling

Download the full log file from Buildkite locally.

Strip timestamps and colorization:

```bash
# Strip timestamps
sed -i 's/^\[[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}T[0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}Z\] //' ci.log

# Strip colorization
sed -i -r 's/\x1B\[[0-9;]*[mK]//g' ci.log
```
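
If you would rather keep the downloaded file untouched, the same two substitutions can run as a filter. The sketch below is an assumption-laden variant, not the documented workflow: `clean_ci_log` is a hypothetical name, and the ESC byte is produced with `printf` so the pattern does not depend on `sed` understanding `\x1B`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: clean a Buildkite log without editing it in place.
# Reads the raw log on stdin and writes the cleaned log to stdout.
clean_ci_log() {
    local esc
    esc=$(printf '\033')   # literal ESC byte that starts each color code
    sed -E \
        -e 's/^\[[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z\] //' \
        -e "s/${esc}\[[0-9;]*[mK]//g"
}
```

Usage: `clean_ci_log < ci.log > ci.clean.log`.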

Copy the tail of the log to the clipboard (here with `wl-copy` on Wayland):

```bash
tail -n 525 ci.log | wl-copy
```

## Investigating a CI Test Failure

1. Go to 👉 [Buildkite main branch](https://buildkite.com/vllm/ci/builds?branch=main)
2. Bisect to find the first build that shows the issue.
3. Add your findings to the GitHub issue.
4. If you find a strong candidate PR, mention it in the issue and ping contributors.
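
The bisection in step 2 is a binary search over the build history. The sketch below shows only the search logic and rests on assumptions: `first_bad` is a hypothetical helper, the builds are listed oldest to newest, the first is known good and the last known bad, and you supply a predicate command that exits 0 for a passing build. How the predicate decides (for example, by checking the Buildkite page by hand) is up to you.

```bash
#!/usr/bin/env bash
# first_bad PREDICATE ITEM...
# ITEMs are ordered oldest to newest; the first is assumed good, the last bad.
# PREDICATE is a command called with one ITEM; exit status 0 means "good".
first_bad() {
    local is_good="$1"; shift
    local items=("$@")
    local lo=0 hi=$(( ${#items[@]} - 1 )) mid
    # Invariant: items[lo] is good and items[hi] is bad.
    while (( hi - lo > 1 )); do
        mid=$(( (lo + hi) / 2 ))
        if "$is_good" "${items[mid]}"; then
            lo=$mid
        else
            hi=$mid
        fi
    done
    echo "${items[hi]}"
}
```

The search needs roughly log2(N) checks for N builds, which is why bisecting beats walking the history one build at a time.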

## Reproducing a Failure

CI test failures may be flaky. Use a bash loop to rerun the test until it fails:

```bash
# Quote the test ID so the shell does not glob-expand the square brackets.
COUNT=1; while pytest -sv 'tests/v1/engine/test_engine_core_client.py::test_kv_cache_events[True-tcp]'; do
    COUNT=$((COUNT + 1)); echo "RUN NUMBER ${COUNT}";
done
```
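
A loop like this never ends if the test does not fail. A bounded variant, sketched here with the hypothetical helper name `repeat_until_fail`, stops after a fixed number of runs and reports which run failed.

```bash
#!/usr/bin/env bash
# repeat_until_fail MAX_RUNS COMMAND [ARGS...]
# Runs COMMAND up to MAX_RUNS times. Returns 1 at the first failing run,
# or 0 if every run passed.
repeat_until_fail() {
    local max_runs="$1"; shift
    local run=1
    while [ "$run" -le "$max_runs" ]; do
        echo "RUN NUMBER ${run}"
        if ! "$@"; then
            echo "Failed on run ${run}"
            return 1
        fi
        run=$((run + 1))
    done
    echo "No failure in ${max_runs} runs"
}
```

Usage: `repeat_until_fail 50 pytest -sv 'tests/v1/engine/test_engine_core_client.py::test_kv_cache_events[True-tcp]'`.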

## Submitting a PR

If you submit a PR to fix a CI failure:

- Link the PR to the issue:
  Add `Closes #12345` to the PR description.
- Add the `ci-failure` label:
  This helps track it in the [CI Failures GitHub Project](https://github.com/orgs/vllm-project/projects/20).

## Other Resources

- 🔍 [Test Reliability on `main`](https://buildkite.com/organizations/vllm/analytics/suites/ci-1/tests?branch=main&order=ASC&sort_by=reliability)
- 🧪 [Latest Buildkite CI Runs](https://buildkite.com/vllm/ci/builds?branch=main)

## Daily Triage

Use [Buildkite analytics (2-day view)](https://buildkite.com/organizations/vllm/analytics/suites/ci-1/tests?branch=main&period=2days) to:

- Identify recent test failures **on `main`**.
- Exclude legitimate test failures on PRs.
- (Optional) Ignore tests with 0% reliability.

Compare the results against the [CI Failures Dashboard](https://github.com/orgs/vllm-project/projects/20) to spot failures that are not yet tracked.
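
The comparison is a set difference: tests failing on `main` minus tests already on the dashboard. The sketch below assumes you have written both lists by hand into plain text files, one test name per line. The helper name `untracked_failures` and the file names are hypothetical, and nothing here queries Buildkite or GitHub.

```bash
#!/usr/bin/env bash
# untracked_failures FAILING_FILE TRACKED_FILE
# Each file holds one test name per line (lists you compiled by hand).
# Prints the names found in FAILING_FILE but not in TRACKED_FILE.
untracked_failures() {
    # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file
    grep -Fxv -f "$2" "$1" | sort -u
}
```

Usage: `untracked_failures failing_on_main.txt dashboard.txt` lists the candidates for a new CI failure issue.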
