CI Failure Doctor: Recurring merge_group concurrency race causing false CI failures #1903

@github-actions

Summary

RECURRING INFRASTRUCTURE ISSUE: GitHub Actions concurrency control is causing false CI failures in merge queue runs. This is now the second occurrence in 6 hours with the same root cause.

Failure Pattern

This Run (21301673630):

  • All 25 system test jobs cancelled (not failed)
  • 0 actual test failures
  • CI - KSail workflow marked as "failure" despite no failed tests
  • Concurrent CI - Go workflow succeeded

Previous Run (21301517058) - 6 hours ago:

  • Same pattern: jobs cancelled, no failures
  • Marked as non-recurring at the time
  • Now proven to be a recurring infrastructure issue

Impact

  • False failures block merges - PRs cannot merge even when all tests pass
  • ⚠️ Wasted CI resources - Tests run successfully for 2+ minutes before being cancelled
  • 🔄 Requires manual intervention - Need to re-trigger merge queue runs

Root Cause Analysis

Category: Infrastructure/Configuration

Problem: Multiple workflows triggered simultaneously for the same merge group commit create a race condition with GitHub's cancel-in-progress concurrency control.

Current Configuration (.github/workflows/ci.yaml:9-11):

concurrency:
  group: "ci-ksail-${{ github.ref }}"
  cancel-in-progress: true

What Happens:

  1. Merge queue creates a commit (e.g., 1c485b85)
  2. Multiple workflows trigger simultaneously:
    • CI - KSail (21301673630) ← marked as failure
    • CI - Go (21301673645) ← succeeded
    • CI - Auto-merge (skipped)
    • Zizmor (skipped)
  3. Concurrency groups conflict: the group key varies only by github.ref, and every run for this merge group commit shares the same github.ref
  4. GitHub cancels in-progress jobs from "CI - KSail"
  5. Workflow conclusion becomes "failure" even though no tests actually failed
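
The collision described in step 3 can be illustrated with a small model. This is a sketch, not how GitHub evaluates expressions: `render_group` and `find_collisions` are hypothetical helpers, the ref value is made up, and the shared `ci-` template is an assumption used only to show what happens when a group key omits the workflow name.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches placeholders such as ${{ github.ref }}.
_EXPR = re.compile(r"\$\{\{\s*([\w.]+)\s*\}\}")


def render_group(template: str, context: Dict[str, str]) -> str:
    """Substitute ${{ name }} placeholders with values from `context`."""
    return _EXPR.sub(lambda m: context[m.group(1)], template)


def find_collisions(template: str, workflows: List[str], ref: str) -> Dict[str, List[str]]:
    """Return every rendered group that more than one workflow lands in."""
    groups = defaultdict(list)
    for name in workflows:
        key = render_group(template, {"github.workflow": name, "github.ref": ref})
        groups[key].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


REF = "refs/heads/example-merge-group-ref"  # made-up value for illustration
WORKFLOWS = ["CI - KSail", "CI - Go"]

# Key built from the ref only: both workflows share one group.
shared = find_collisions("ci-${{ github.ref }}", WORKFLOWS, REF)
# Key that includes the workflow name: each workflow gets its own group.
separate = find_collisions("${{ github.workflow }}-${{ github.ref }}", WORKFLOWS, REF)

print(shared)
print(separate)  # {}
```

With the ref-only template both workflows render to the same key, so a run in one can cancel a run in the other. Adding the workflow name to the key leaves no shared groups.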

Evidence from Run 21301673630

Concurrent Runs for Commit 1c485b85:

| Run ID | Workflow | Conclusion |
| --- | --- | --- |
| 21301673630 | CI - KSail | failure ❌ |
| 21301673645 | CI - Go | success ✅ |
| 21301673650 | CI - Auto-merge | skipped |
| 21301673655 | Zizmor | skipped |

Cancelled Jobs (all 25):

  • 🧪 System Test (Vanilla, Docker, true) - cancelled after 2m 21s
  • 🧪 System Test (Vanilla, Docker, true, --cert-manager Enabled) - cancelled after 2m 33s
  • 🧪 System Test (Vanilla, Docker, true, --name system-test-cluster) - cancelled after 2m 27s
  • ... (22 more jobs, all running successfully before cancellation)

Key Observation: Average job duration before cancellation was ~2m 21s, indicating all tests were progressing normally.

Historical Context

Related Closed Issues

Similar Patterns:

Pattern Emergence

This is now a confirmed recurring pattern:

  • First occurrence: Run 21301517058 (2026-01-23 ~15:30 UTC)
  • Second occurrence: Run 21301673630 (2026-01-23 ~21:25 UTC)
  • Interval: ~6 hours
  • Trigger: merge_group events
  • Affected workflows: CI - KSail (system tests)

Recommended Solutions

Option 1: Use Workflow-Specific Concurrency Groups ⭐ RECOMMENDED

Change the concurrency group to be workflow-specific:

# .github/workflows/ci.yaml
concurrency:
  group: "ci-ksail-${{ github.workflow }}-${{ github.ref }}"
  cancel-in-progress: true

This ensures each workflow (CI - KSail, CI - Go, etc.) has its own concurrency group.

Option 2: Disable cancel-in-progress for merge_group

concurrency:
  group: "ci-ksail-${{ github.ref }}"
  cancel-in-progress: ${{ github.event_name != 'merge_group' }}

Merge queue runs are already short-lived, so cancellation is less critical.

Option 3: Cancel Only on pull_request Events

For merge queue events, GitHub already manages concurrency at the queue level, so cancellation can be restricted to pull_request events while the group itself stays in place:

concurrency:
  group: "ci-ksail-${{ github.ref }}"
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}

Only cancel on PR updates, not merge queue runs.
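
Options 2 and 3 behave the same for pull_request and merge_group events but differ for everything else. The sketch below mirrors the two expressions as plain Python predicates to make that difference visible. The function names are hypothetical, and including `push` in the event list is an assumption about which triggers the workflow uses.

```python
EVENTS = ["pull_request", "merge_group", "push"]


def option_2_cancels(event_name: str) -> bool:
    """Mirrors: cancel-in-progress: ${{ github.event_name != 'merge_group' }}"""
    return event_name != "merge_group"


def option_3_cancels(event_name: str) -> bool:
    """Mirrors: cancel-in-progress: ${{ github.event_name == 'pull_request' }}"""
    return event_name == "pull_request"


# Map each event to (Option 2 cancels?, Option 3 cancels?).
table = {e: (option_2_cancels(e), option_3_cancels(e)) for e in EVENTS}

for event, (opt2, opt3) in table.items():
    print(f"{event:13} option2={opt2!s:5} option3={opt3}")
```

Under these assumptions, Option 2 still cancels in-progress runs on push, while Option 3 lets push runs finish.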

Prevention Strategy

Immediate Actions

  1. Implement Option 1 - Add ${{ github.workflow }} to concurrency group
  2. Monitor next 5 merge queue runs for recurrence
  3. Update cached investigation patterns to flag this as high-priority

Long-term Improvements

  1. Separate Concurrency Domains:

    • Build/test workflows should have independent concurrency groups
    • Prevent cross-workflow cancellation
  2. Better Failure Detection:

    • Distinguish "cancelled" from "failed" in workflow conclusions
    • Add summary job that checks for actual test failures vs. cancellations
  3. Retry Logic:

    • Auto-retry merge queue runs if conclusion is "failure" with only cancellations
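
The failure-detection and retry ideas above can be sketched as a small classifier over per-job conclusions. This is a minimal illustration, not an existing tool: `classify_run` and `should_retry` are hypothetical names, and the input is assumed to be a list of job conclusion strings (such as "success", "failure", "cancelled", "skipped", "timed_out") already fetched from the GitHub API.

```python
from collections import Counter
from typing import Iterable, List


def classify_run(job_conclusions: Iterable[str]) -> str:
    """Separate real test failures from runs that were only cancelled."""
    counts = Counter(job_conclusions)
    if counts["failure"] > 0 or counts["timed_out"] > 0:
        return "failed"  # at least one job genuinely failed
    if counts["cancelled"] > 0:
        return "cancelled_only"  # nothing failed, but some jobs were cancelled
    return "passed"


def should_retry(run_conclusion: str, job_conclusions: List[str]) -> bool:
    """Retry only when the run reports failure but no job actually failed."""
    return run_conclusion == "failure" and classify_run(job_conclusions) == "cancelled_only"


# Shape of the run described in this issue: 25 cancelled jobs, 0 failures.
this_run = ["cancelled"] * 25
print(classify_run(this_run))
print(should_retry("failure", this_run))
```

A run with even one failed job is classified as "failed" and never retried, so genuine test failures still block the merge.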

Testing the Fix

After implementing Option 1:

  1. Create a test PR that triggers merge queue
  2. Verify all concurrent workflows complete without cancellations
  3. Confirm CI - KSail workflow reaches "success" conclusion
  4. Check that individual workflow cancellations still work (update PR → old run cancels)

AI Team Self-Improvement

Add to .github/copilot/instructions/concurrency-best-practices.md:

## GitHub Actions Concurrency Configuration

**Critical Rule**: Always make concurrency groups workflow-specific when using multiple workflows on the same triggers.

### Correct Pattern

```yaml
concurrency:
  group: "${{ github.workflow }}-${{ github.ref }}"
  cancel-in-progress: true
```

This ensures:

  • ✅ Each workflow manages its own concurrency independently
  • ✅ Updating a PR cancels old runs of the SAME workflow
  • ✅ Different workflows don't interfere with each other

### Anti-Pattern (DON'T DO THIS)

```yaml
concurrency:
  group: "ci-${{ github.ref }}"  # ❌ Too broad - affects ALL workflows
  cancel-in-progress: true
```

Problems:

  • ❌ Workflows cancel each other's jobs
  • ❌ False failures when concurrent workflows trigger
  • ❌ Race conditions in merge queue

### Event-Specific Considerations

merge_group: GitHub already manages concurrency at queue level

  • Consider: cancel-in-progress: false for merge queue
  • Or: Use workflow-specific groups

pull_request: Most important for cancel-in-progress

  • Old commits should be cancelled when new commits pushed
  • Always use workflow-specific groups

push to main: Usually no cancellation needed

  • Each commit should run to completion

### Debugging Concurrency Issues

Signs of concurrency configuration problems:

  1. Jobs cancelled with conclusion "failure" but no error logs
  2. Multiple workflows triggered simultaneously all failing
  3. Different workflows for the same commit interfering
  4. Cancelled jobs were running without errors up to the moment of cancellation

Fix: Add ${{ github.workflow }} to concurrency group name.


---

**Investigation Data**:
- Saved to: `/cache-memory/investigations/investigation-21301673630.json`
- Related: `investigation-21301517058.json`
- Pattern: `merge_group_concurrency_race`

**Next Steps**:
1. Implement Option 1 (workflow-specific concurrency)
2. Test with next merge queue run
3. Close this issue once pattern stops recurring




> AI generated by [CI Failure Doctor](https://github.com/devantler-tech/ksail/actions/runs/21301806113)
>
> To add this workflow in your repository, run `gh aw add githubnext/agentics/workflows/ci-doctor.md@c5da0cdbfae2a3cba74f330ca34424a4aea929f5`. See [usage guide](https://githubnext.github.io/gh-aw/guides/packaging-imports/).

