Skip to content

[Docs] Add developer doc about CI failures #18782

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Jun 4, 2025
120 changes: 120 additions & 0 deletions docs/contributing/ci-failures.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,120 @@
# CI Failures

What should I do when a CI job fails on my PR, but I don't think my PR caused
the failure?

- Check the dashboard of current CI test failures:
👉 [CI Failures Dashboard](https://github.com/orgs/vllm-project/projects/20)

- If your failure **is already listed**, it's likely unrelated to your PR.
Help fixing it is always welcome!
- Leave comments with links to additional instances of the failure.
- React with a 👍 to signal how many are affected.

- If your failure **is not listed**, you should **file an issue**.

## Filing a CI Test Failure Issue

- **File a bug report:**
👉 [New CI Failure Report](https://github.com/vllm-project/vllm/issues/new?template=450-ci-failure.yml)

- **Use this title format:**

```
[CI Failure]: failing-test-job - regex/matching/failing:test
```

- **For the environment field:**

```
Still failing on main as of commit abcdef123
```

- **In the description, include failing tests:**

```
FAILED failing/test.py:failing_test1 - Failure description
FAILED failing/test.py:failing_test2 - Failure description
https://github.com/orgs/vllm-project/projects/20
https://github.com/vllm-project/vllm/issues/new?template=400-bug-report.yml
FAILED failing/test.py:failing_test3 - Failure description
```

- **Attach logs** (collapsible section example):
<details>
<summary>Logs:</summary>

```text
ERROR 05-20 03:26:38 [dump_input.py:68] Dumping input data
--- Logging error ---
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 203, in execute_model
return self.model_executor.execute_model(scheduler_output)
...
FAILED failing/test.py:failing_test1 - Failure description
FAILED failing/test.py:failing_test2 - Failure description
FAILED failing/test.py:failing_test3 - Failure description
```

</details>
Comment on lines +44 to +59
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


## Logs Wrangling

Download the full log file from Buildkite locally.

Strip timestamps and colorization:

```bash
# Strip timestamps
sed -i 's/^\[[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}T[0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}Z\] //' ci.log

# Strip colorization
sed -i -r 's/\x1B\[[0-9;]*[mK]//g' ci.log
```

Use a tool for quick copy-pasting:

```bash
tail -525 ci_build.log | wl-copy
```

## Investigating a CI Test Failure

1. Go to 👉 [Buildkite main branch](https://buildkite.com/vllm/ci/builds?branch=main)
2. Bisect to find the first build that shows the issue.
3. Add your findings to the GitHub issue.
4. If you find a strong candidate PR, mention it in the issue and ping contributors.

## Reproducing a Failure

CI test failures may be flaky. Use a bash loop to run repeatedly:

```bash
COUNT=1; while pytest -sv tests/v1/engine/test_engine_core_client.py::test_kv_cache_events[True-tcp]; do
COUNT=$[$COUNT + 1]; echo "RUN NUMBER ${COUNT}";
done
```

## Submitting a PR

If you submit a PR to fix a CI failure:

- Link the PR to the issue:
Add `Closes #12345` to the PR description.
- Add the `ci-failure` label:
This helps track it in the [CI Failures GitHub Project](https://github.com/orgs/vllm-project/projects/20).

## Other Resources

- 🔍 [Test Reliability on `main`](https://buildkite.com/organizations/vllm/analytics/suites/ci-1/tests?branch=main&order=ASC&sort_by=reliability)
- 🧪 [Latest Buildkite CI Runs](https://buildkite.com/vllm/ci/builds?branch=main)

## Daily Triage

Use [Buildkite analytics (2-day view)](https://buildkite.com/organizations/vllm/analytics/suites/ci-1/tests?branch=main&period=2days) to:

- Identify recent test failures **on `main`**.
- Exclude legitimate test failures on PRs.
- (Optional) Ignore tests with 0% reliability.

Compare to the [CI Failures Dashboard](https://github.com/orgs/vllm-project/projects/20).