Summary
RECURRING INFRASTRUCTURE ISSUE: GitHub Actions concurrency control is causing false CI failures in merge queue runs. This is now the second occurrence in 6 hours with the same root cause.
Failure Pattern
This Run (21301673630):
- All 25 system test jobs cancelled (not failed)
- 0 actual test failures
- CI - KSail workflow marked as "failure" despite no failed tests
- Concurrent CI - Go workflow succeeded
Previous Run (21301517058) - 6 hours ago:
- Same pattern: jobs cancelled, no failures
- Marked as non-recurring at the time
- Now proven to be a recurring infrastructure issue
Impact
- ❌ False failures block merges - PRs cannot merge even when all tests pass
- ⚠️ Wasted CI resources - Tests run successfully for 2+ minutes before being cancelled
- 🔄 Requires manual intervention - Need to re-trigger merge queue runs
Root Cause Analysis
Category: Infrastructure/Configuration
Problem: Multiple workflows triggered simultaneously for the same merge group commit create a race condition with GitHub's cancel-in-progress concurrency control.
Current Configuration (.github/workflows/ci.yaml:9-11):
concurrency:
  group: "ci-ksail-${{ github.ref }}"
  cancel-in-progress: true
What Happens:
- Merge queue creates a commit (e.g., `1c485b85`)
- Multiple workflows trigger simultaneously:
  - CI - KSail (21301673630) ← marked as failure
  - CI - Go (21301673645) ← succeeded
  - CI - Auto-merge (skipped)
  - Zizmor (skipped)
- Concurrency groups conflict because they all share the same `github.ref`
- GitHub cancels in-progress jobs from "CI - KSail"
- Workflow conclusion becomes "failure" even though no tests actually failed
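The collision described above can be sketched as two workflow files whose `concurrency.group` expressions evaluate to the same string. GitHub matches concurrency groups by that evaluated string across the repository, so the cancellation only occurs if a second workflow actually resolves to the same group name. The second file below, including its name and group value, is hypothetical and not taken from the repository; the job bodies are placeholders.

```yaml
# .github/workflows/ci.yaml (concurrency block as quoted above)
name: CI - KSail
on:
  merge_group:
concurrency:
  group: "ci-ksail-${{ github.ref }}"
  cancel-in-progress: true
jobs:
  system-test:
    runs-on: ubuntu-latest
    steps:
      - run: echo "placeholder for the system tests"
---
# .github/workflows/other.yaml (HYPOTHETICAL second workflow)
# Assumption: its group expression resolves to the same string as above.
name: Other workflow
on:
  merge_group:
concurrency:
  group: "ci-ksail-${{ github.ref }}" # same string, therefore same group
  cancel-in-progress: true
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - run: echo "placeholder"
```

With both files triggered for the same ref, whichever run enters the group later cancels the one already in progress. If the other workflows in the repository use differently named groups, they would not share a group with `ci-ksail-...`, which is worth confirming before relying on the fix.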
Evidence from Run 21301673630
Concurrent Runs for Commit 1c485b85:
| Run ID | Workflow | Conclusion |
|---|---|---|
| 21301673630 | CI - KSail | failure ❌ |
| 21301673645 | CI - Go | success ✅ |
| 21301673650 | CI - Auto-merge | skipped |
| 21301673655 | Zizmor | skipped |
Cancelled Jobs (all 25):
- 🧪 System Test (Vanilla, Docker, true) - cancelled after 2m 21s
- 🧪 System Test (Vanilla, Docker, true, --cert-manager Enabled) - cancelled after 2m 33s
- 🧪 System Test (Vanilla, Docker, true, --name system-test-cluster) - cancelled after 2m 27s
- ... (22 more jobs, all running successfully before cancellation)
Key Observation: Average job duration before cancellation was ~2m 21s, indicating all tests were progressing normally.
Historical Context
Related Closed Issues
Similar Patterns:
- [CI Failure Doctor] Talos + Hetzner system test failure in merge queue #1861: Talos + Hetzner system test failure (concurrency + Hetzner secrets)
- [CI Failure Doctor] Recurring CI failure: K3s + Flux + GHCR local registry system test #1853: K3s + Flux + GHCR registry failure (actual test failure, different root cause)
Pattern Emergence
This is now a confirmed recurring pattern:
- First occurrence: Run 21301517058 (2026-01-23 ~15:30 UTC)
- Second occurrence: Run 21301673630 (2026-01-23 ~21:25 UTC)
- Interval: ~6 hours
- Trigger: merge_group events
- Affected workflows: CI - KSail (system tests)
Recommended Solutions
Option 1: Use Workflow-Specific Concurrency Groups ⭐ RECOMMENDED
Change the concurrency group to be workflow-specific:
# .github/workflows/ci.yaml
concurrency:
  group: "ci-ksail-${{ github.workflow }}-${{ github.ref }}"
  cancel-in-progress: true
This ensures each workflow (CI - KSail, CI - Go, etc.) has its own concurrency group.
Option 2: Disable cancel-in-progress for merge_group
concurrency:
  group: "ci-ksail-${{ github.ref }}"
  cancel-in-progress: ${{ github.event_name != 'merge_group' }}
Merge queue runs are already short-lived, so cancellation is less critical.
Option 3: Cancel In-Progress Runs Only for pull_request Events
For merge queue events, GitHub already manages concurrency at the queue level:
concurrency:
  group: "ci-ksail-${{ github.ref }}"
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
Only cancel on PR updates, not merge queue runs.
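Options 1 and 3 are not mutually exclusive. A minimal sketch of combining them, assuming the repository is happy to drop the `ci-ksail-` prefix in favour of the workflow name (that choice is an assumption, not something the investigation prescribes):

```yaml
# .github/workflows/ci.yaml
concurrency:
  # github.workflow is the workflow's name (or the file path if no name is
  # set), so each workflow resolves to its own group string.
  group: "${{ github.workflow }}-${{ github.ref }}"
  # Evaluates to true for pull_request events and to false for merge_group,
  # push, and every other event.
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
```

With this form, a new push to a PR still cancels the older run of the same workflow, while merge queue runs are left to finish.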
Prevention Strategy
Immediate Actions
- Implement Option 1 - Add `${{ github.workflow }}` to concurrency group
- Monitor next 5 merge queue runs for recurrence
- Update cached investigation patterns to flag this as high-priority
Long-term Improvements
- Separate Concurrency Domains:
  - Build/test workflows should have independent concurrency groups
  - Prevent cross-workflow cancellation
- Better Failure Detection:
  - Distinguish "cancelled" from "failed" in workflow conclusions
  - Add summary job that checks for actual test failures vs. cancellations
- Retry Logic:
  - Auto-retry merge queue runs if conclusion is "failure" with only cancellations
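The summary job suggested above could look roughly like the following. This is a sketch under assumptions: the job names `system-test` and `ci-summary` and the placeholder test step are hypothetical, and the real workflow would list its actual test jobs under `needs`.

```yaml
# Sketch only: job names and the placeholder test step are hypothetical.
name: CI - KSail
on:
  merge_group:
jobs:
  system-test:
    runs-on: ubuntu-latest
    steps:
      - run: echo "placeholder for the real system tests"

  ci-summary:
    # always() lets this job be evaluated even if system-test failed
    # or was cancelled.
    if: ${{ always() }}
    needs: [system-test]
    runs-on: ubuntu-latest
    steps:
      - name: Note cancellations without failing
        if: ${{ contains(needs.*.result, 'cancelled') }}
        run: echo "::warning::One or more jobs were cancelled, not failed"
      - name: Fail only on real test failures
        if: ${{ contains(needs.*.result, 'failure') }}
        run: exit 1
```

The job fails only when a dependency reports `failure`, and it surfaces cancellations as a warning instead. How it behaves when the entire run is cancelled by a concurrency group should be verified on a real merge queue run before it is used as a required check.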
Testing the Fix
After implementing Option 1:
- Create a test PR that triggers merge queue
- Verify all concurrent workflows complete without cancellations
- Confirm CI - KSail workflow reaches "success" conclusion
- Check that individual workflow cancellations still work (update PR → old run cancels)
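To check a given run against the steps above, the job conclusions can be counted through the GitHub REST API. The following is a minimal sketch of a manually triggered helper workflow; its name and the `run_id` input are hypothetical, and it assumes the default `github.token` is allowed to read Actions data for the repository.

```yaml
# Sketch only: workflow name and input name are hypothetical.
name: Inspect run conclusions
on:
  workflow_dispatch:
    inputs:
      run_id:
        description: "ID of the workflow run to inspect"
        required: true
        type: string
permissions:
  actions: read
jobs:
  inspect:
    runs-on: ubuntu-latest
    steps:
      - name: Count job conclusions for the run
        env:
          GH_TOKEN: ${{ github.token }}
          REPO: ${{ github.repository }}
          RUN_ID: ${{ inputs.run_id }}
        run: |
          # Prints one line per conclusion value together with its count.
          gh api "repos/${REPO}/actions/runs/${RUN_ID}/jobs" --paginate \
            --jq '.jobs[].conclusion' | sort | uniq -c
```

A run that hit the pattern described in this issue would be expected to show `cancelled` entries and no `failure` entries; after the fix, the `cancelled` count for merge queue runs should drop to zero.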
AI Team Self-Improvement
Add to .github/copilot/instructions/concurrency-best-practices.md:
## GitHub Actions Concurrency Configuration
**Critical Rule**: Always make concurrency groups workflow-specific when using multiple workflows on the same triggers.
### Correct Pattern
```yaml
concurrency:
  group: "${{ github.workflow }}-${{ github.ref }}"
  cancel-in-progress: true
```
This ensures:
- ✅ Each workflow manages its own concurrency independently
- ✅ Updating a PR cancels old runs of the SAME workflow
- ✅ Different workflows don't interfere with each other
### Anti-Pattern (DON'T DO THIS)
concurrency:
  group: "ci-${{ github.ref }}" # ❌ Too broad - affects ALL workflows
  cancel-in-progress: true
Problems:
- ❌ Workflows cancel each other's jobs
- ❌ False failures when concurrent workflows trigger
- ❌ Race conditions in merge queue
### Event-Specific Considerations
**merge_group**: GitHub already manages concurrency at queue level
- Consider: `cancel-in-progress: false` for merge queue
- Or: Use workflow-specific groups

**pull_request**: Most important for cancel-in-progress
- Old commits should be cancelled when new commits pushed
- Always use workflow-specific groups

**push to main**: Usually no cancellation needed
- Each commit should run to completion
### Debugging Concurrency Issues
Signs of concurrency configuration problems:
- Jobs cancelled with conclusion "failure" but no error logs
- Multiple workflows triggered simultaneously all failing
- Different workflows for the same commit interfering
- Cancelled jobs were passing (no failed steps) up to the point of cancellation
Fix: Add ${{ github.workflow }} to concurrency group name.
---
**Investigation Data**:
- Saved to: `/cache-memory/investigations/investigation-21301673630.json`
- Related: `investigation-21301517058.json`
- Pattern: `merge_group_concurrency_race`
**Next Steps**:
1. Implement Option 1 (workflow-specific concurrency)
2. Test with next merge queue run
3. Close this issue once pattern stops recurring
> AI generated by [CI Failure Doctor](https://github.com/devantler-tech/ksail/actions/runs/21301806113)
>
> To add this workflow in your repository, run `gh aw add githubnext/agentics/workflows/ci-doctor.md@c5da0cdbfae2a3cba74f330ca34424a4aea929f5`. See [usage guide](https://githubnext.github.io/gh-aw/guides/packaging-imports/).
<!-- gh-aw-agentic-workflow: CI Failure Doctor, engine: copilot, run: https://github.com/devantler-tech/ksail/actions/runs/21301806113 -->