| 1 | +# CI Failures |
| 2 | + |
| 3 | +What should I do when a CI job fails on my PR, but I don't think my PR caused |
| 4 | +the failure? |
| 5 | + |
| 6 | +- Check the dashboard of current CI test failures: |
| 7 | + 👉 [CI Failures Dashboard](https://github.com/orgs/vllm-project/projects/20) |
| 8 | + |
| 9 | +- If your failure **is already listed**, it's likely unrelated to your PR. |
| 10 | + Help fixing it is always welcome! |
| 11 | + - Leave comments with links to additional instances of the failure. |
| 12 | + - React with a 👍 to signal how many contributors are affected. |
| 13 | + |
| 14 | +- If your failure **is not listed**, you should **file an issue**. |
| 15 | + |
| 16 | +## Filing a CI Test Failure Issue |
| 17 | + |
| 18 | +- **File a bug report:** |
| 19 | + 👉 [New CI Failure Report](https://github.com/vllm-project/vllm/issues/new?template=450-ci-failure.yml) |
| 20 | + |
| 21 | +- **Use this title format:** |
| 22 | + |
| 23 | + ``` |
| 24 | + [CI Failure]: failing-test-job - regex/matching/failing:test |
| 25 | + ``` |
| 26 | + |
| 27 | +- **For the environment field:** |
| 28 | + |
| 29 | + ``` |
| 30 | + Still failing on main as of commit abcdef123 |
| 31 | + ``` |
| 32 | + |
| 33 | +- **In the description, include failing tests:** |
| 34 | + |
| 35 | + ``` |
| 36 | + FAILED failing/test.py:failing_test1 - Failure description |
| 37 | + FAILED failing/test.py:failing_test2 - Failure description |
| 40 | + FAILED failing/test.py:failing_test3 - Failure description |
| 41 | + ``` |
| 42 | + |
| 43 | +- **Attach logs** (collapsible section example): |
| 44 | + <details> |
| 45 | + <summary>Logs:</summary> |
| 46 | + |
| 47 | + ```text |
| 48 | + ERROR 05-20 03:26:38 [dump_input.py:68] Dumping input data |
| 49 | + --- Logging error --- |
| 50 | + Traceback (most recent call last): |
| 51 | + File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 203, in execute_model |
| 52 | + return self.model_executor.execute_model(scheduler_output) |
| 53 | + ... |
| 54 | + FAILED failing/test.py:failing_test1 - Failure description |
| 55 | + FAILED failing/test.py:failing_test2 - Failure description |
| 56 | + FAILED failing/test.py:failing_test3 - Failure description |
| 57 | + ``` |
| 58 | + |
| 59 | + </details> |
| 60 | + |
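To collect the `FAILED` lines for the issue description, one option is a small filter over the downloaded log. This is only a sketch: the function name `extract_failed_tests` is hypothetical, and it assumes each pytest summary line contains `FAILED ` followed by the test ID, possibly behind a timestamp prefix. Other log lines that happen to contain `FAILED ` would also match, so review the output before pasting it.

```bash
# Print the unique "FAILED ..." summary lines found in a log read from stdin.
# Assumption: pytest's short summary prints one "FAILED <test> - <reason>"
# entry per failing test; anything before "FAILED " on the line is dropped.
extract_failed_tests() {
    grep -o 'FAILED .*' | sort -u
}
```

After defining the function in your shell, `extract_failed_tests < ci.log` prints one line per distinct failure.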
| 61 | +## Logs Wrangling |
| 62 | + |
| 63 | +Download the full log file from Buildkite and save it locally (for example as `ci.log`). |
| 64 | + |
| 65 | +Strip timestamps and colorization: |
| 66 | + |
| 67 | +```bash |
| 68 | +# Strip timestamps |
| 69 | +sed -i 's/^\[[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}T[0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}Z\] //' ci.log |
| 70 | + |
| 71 | +# Strip colorization |
| 72 | +sed -i -r 's/\x1B\[[0-9;]*[mK]//g' ci.log |
| 73 | +``` |
| 74 | + |
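The `sed -i` commands above rewrite the log in place. If you would rather keep the raw download, the same two substitutions can be wrapped in a filter that reads stdin and writes stdout. This is a sketch under assumptions: `clean_ci_log` is a hypothetical name, and it expects the same `[YYYY-MM-DDTHH:MM:SSZ] ` timestamp prefix that the commands above match.

```bash
# Strip Buildkite-style timestamps and ANSI color codes from stdin.
# Non-destructive: the input file is left untouched.
clean_ci_log() {
    local esc
    esc="$(printf '\033')"   # literal ESC byte, avoids relying on sed's \x1B
    sed -E \
        -e 's/^\[[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z\] //' \
        -e "s/${esc}\[[0-9;]*[mK]//g"
}
```

For example, `clean_ci_log < ci.log > ci.clean.log` writes a cleaned copy (the output file name is only an illustration).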
| 75 | +Copy the end of the log to the clipboard for quick pasting (`wl-copy` works on Wayland; substitute your platform's clipboard tool): |
| 76 | + |
| 77 | +```bash |
| 78 | +tail -n 525 ci.log | wl-copy |
| 79 | +``` |
| 80 | + |
| 81 | +## Investigating a CI Test Failure |
| 82 | + |
| 83 | +1. Go to 👉 [Buildkite main branch](https://buildkite.com/vllm/ci/builds?branch=main) |
| 84 | +2. Bisect to find the first build that shows the issue. |
| 85 | +3. Add your findings to the GitHub issue. |
| 86 | +4. If you find a strong candidate PR, mention it in the issue and ping contributors. |
| 87 | + |
| 88 | +## Reproducing a Failure |
| 89 | + |
| 90 | +CI test failures may be flaky. Use a bash loop to rerun the test until it fails: |
| 91 | + |
| 92 | +```bash |
| 93 | +COUNT=1; while pytest -sv 'tests/v1/engine/test_engine_core_client.py::test_kv_cache_events[True-tcp]'; do |
| 94 | + COUNT=$((COUNT + 1)); echo "RUN NUMBER ${COUNT}"; |
| 95 | +done |
| 96 | +``` |
| 97 | + |
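The loop above runs forever if the test never fails. A bounded variant is sketched below. The helper name `run_until_failure` and its argument order are assumptions made for illustration, not part of the vLLM tooling.

```bash
# Run a command up to MAX_RUNS times, stopping at the first failure.
# Usage: run_until_failure MAX_RUNS COMMAND [ARGS...]
# Returns 1 if a failure was reproduced, 0 if every run passed.
run_until_failure() {
    local max_runs run
    max_runs="$1"
    shift
    run=1
    while [ "$run" -le "$max_runs" ]; do
        echo "RUN NUMBER ${run}"
        if ! "$@"; then
            echo "Failed on run ${run}"
            return 1
        fi
        run=$((run + 1))
    done
    echo "No failure in ${max_runs} runs"
    return 0
}
```

For example, `run_until_failure 50 pytest -sv 'tests/v1/engine/test_engine_core_client.py::test_kv_cache_events[True-tcp]'` gives up after 50 passing runs. The test ID is quoted so the shell does not treat the square brackets as a glob pattern.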
| 98 | +## Submitting a PR |
| 99 | + |
| 100 | +If you submit a PR to fix a CI failure: |
| 101 | + |
| 102 | +- Link the PR to the issue: |
| 103 | + Add `Closes #12345` to the PR description. |
| 104 | +- Add the `ci-failure` label: |
| 105 | + This helps track it in the [CI Failures GitHub Project](https://github.com/orgs/vllm-project/projects/20). |
| 106 | + |
| 107 | +## Other Resources |
| 108 | + |
| 109 | +- 🔍 [Test Reliability on `main`](https://buildkite.com/organizations/vllm/analytics/suites/ci-1/tests?branch=main&order=ASC&sort_by=reliability) |
| 110 | +- 🧪 [Latest Buildkite CI Runs](https://buildkite.com/vllm/ci/builds?branch=main) |
| 111 | + |
| 112 | +## Daily Triage |
| 113 | + |
| 114 | +Use [Buildkite analytics (2-day view)](https://buildkite.com/organizations/vllm/analytics/suites/ci-1/tests?branch=main&period=2days) to: |
| 115 | + |
| 116 | +- Identify recent test failures **on `main`**. |
| 117 | +- Exclude legitimate test failures on PRs. |
| 118 | +- (Optional) Ignore tests with 0% reliability. |
| 119 | + |
| 120 | +Compare to the [CI Failures Dashboard](https://github.com/orgs/vllm-project/projects/20). |