Skip to content

fix: unreachable validation logic, add API retry, prevent workflow races#27

Merged
lubianat merged 7 commits intomainfrom
copilot/fix-validation-script-error
Jan 23, 2026
Merged

fix: unreachable validation logic, add API retry, prevent workflow races#27
lubianat merged 7 commits intomainfrom
copilot/fix-validation-script-error

Conversation

Copy link
Contributor

Copilot AI commented Jan 22, 2026

Three production-blocking issues: validation script never executed (logic unreachable), API calls failed on any transient error, and concurrent workflow runs could race.

Changes

Fixed unreachable validation in scripts/validate_pr.py

  • Main validation logic was trapped inside user_exists() after early returns
  • Moved to proper main() function—validation now actually runs

Added retry logic to sync scripts

def create_session(token):
    """Create a requests session with retry logic and exponential backoff."""
    session = requests.Session()
    retries = Retry(
        total=RETRY_TOTAL,
        backoff_factor=RETRY_BACKOFF_FACTOR,
        status_forcelist=[429, 500, 502, 503, 504],
        respect_retry_after_header=True
    )
    session.mount("https://", HTTPAdapter(max_retries=retries))
    session.headers.update(auth_headers(token))
    return session
  • Applied to all API calls in github_to_yaml.py and yaml_to_github.py
  • Handles rate limits, transient 5xx errors with exponential backoff

Added workflow concurrency control

concurrency:
  group: team-sync-${{ github.repository }}
  cancel-in-progress: false
  • Prevents simultaneous YAML→GitHub and GitHub→YAML syncs from racing
  • Jobs queue instead of canceling

Added comprehensive test suite

  • Created tests/test_validate_pr.py with 10 test cases for validation logic, pagination, and user existence checks
  • Created tests/test_retry_logic.py with 6 test cases verifying retry configuration consistency and session creation
  • Added .github/workflows/run-tests.yml for automated testing on PRs and pushes
  • Added tests/README.md with testing documentation
  • Updated main README with development and testing instructions
  • All 16 tests passing ✅
Original prompt

Summary

This PR addresses three critical issues identified during production readiness review that must be fixed before this can be safely used in production.

Critical Fixes Required

1. Fix broken validation script (scripts/validate_pr.py)

Problem: The main validation logic (lines 52-99) is incorrectly indented inside the user_exists() function and never executes. The validation workflow appears to pass but doesn't actually validate anything.

Current broken code:

def user_exists(login: str) -> bool:
    """Check if a GitHub user exists."""
    try:
        r = requests.get(f"{API}/users/{login}", headers=H, timeout=60)
        if r.status_code == 403:
            # Check if it's a rate limit issue
            if (
                "X-RateLimit-Remaining" in r.headers
                and r.headers["X-RateLimit-Remaining"] == "0"
            ):
                print(f"ERROR: GitHub API rate limit exceeded")
                sys.exit(1)
        return r.status_code == 200
    except requests.exceptions.RequestException as e:
        print(f"ERROR: Failed to check user '{login}': {e}")
        sys.exit(1)

    # THIS CODE IS UNREACHABLE - it's inside user_exists() but should be in main()
    # Load teams.yaml
    teams_path = Path("teams.yaml")
    cfg = yaml.safe_load(teams_path.read_text(encoding="utf-8")) or {}
    # ... rest of validation logic that never runs

Fix: Move the main validation logic out of user_exists() and into a proper main() function that gets called.

Commit message: fix(validate): move main logic out of user_exists function


2. Add retry logic with exponential backoff to sync scripts

Problem: Both scripts/github_to_yaml.py and scripts/yaml_to_github.py use raw requests.get() calls that fail immediately on any transient error (503, rate limit, network blip). A single API hiccup crashes the entire sync.

Fix: Add a shared retry mechanism using requests.adapters.HTTPAdapter with urllib3.util.retry.Retry:

from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session(token):
    session = requests.Session()
    retries = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
        respect_retry_after_header=True
    )
    session.mount("https://", HTTPAdapter(max_retries=retries))
    session.headers.update(auth_headers(token))
    return session

Apply this to:

  • scripts/github_to_yaml.py - update paginate() to use session
  • scripts/yaml_to_github.py - update paginate() and all API calls in reconcile_team(), invite_by_login(), get_user_id() to use session

Commit message: fix(sync): add retry logic with exponential backoff for API calls


3. Add concurrency control to workflows

Problem: If someone merges a YAML change while the hourly export is running (or vice versa), race conditions can occur leading to inconsistent state.

Fix: Add concurrency block to both workflow files:

For .github/workflows/settings-to-yaml-sync.yml:

concurrency:
  group: team-sync-${{ github.repository }}
  cancel-in-progress: false

For .github/workflows/yaml-to-settings-sync.yml:

concurrency:
  group: team-sync-${{ github.repository }}
  cancel-in-progress: false

Using the same group name ensures only one sync runs at a time. cancel-in-progress: false means jobs queue rather than cancel each other.

Commit message: fix(workflows): add concurrency control to prevent race conditions


Commit Strategy

Please make three separate atomic commits in this order:

  1. fix(validate): move main logic out of user_exists function
  2. fix(sync): add retry logic with exponential backoff for API calls
  3. fix(workflows): add concurrency control to prevent race conditions

Each commit should be self-contained and the codebase should work correctly after each commit.

This pull request was created from Copilot chat.


💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

Copilot AI and others added 5 commits January 22, 2026 21:56
Co-authored-by: lubianat <7917951+lubianat@users.noreply.github.com>
Co-authored-by: lubianat <7917951+lubianat@users.noreply.github.com>
Co-authored-by: lubianat <7917951+lubianat@users.noreply.github.com>
Co-authored-by: lubianat <7917951+lubianat@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix broken validation script in production workflow fix: unreachable validation logic, add API retry, prevent workflow races Jan 22, 2026
Copilot AI requested a review from lubianat January 22, 2026 22:03
@lubianat
Copy link
Member

Without tests it is always harder to approve PRs...

@lubianat lubianat requested a review from Copilot January 22, 2026 22:09
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes three critical production-blocking issues: unreachable validation logic that prevented PR validation from running, lack of retry handling for API calls that caused failures on transient errors, and missing concurrency control that could lead to race conditions between workflow runs.

Changes:

  • Moved validation logic from inside user_exists() to a proper main() function so validation actually executes
  • Added retry logic with exponential backoff to all GitHub API calls in sync scripts
  • Added concurrency control to both workflow files to prevent simultaneous syncs

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
scripts/validate_pr.py Extracted validation logic into main() function to fix unreachable code
scripts/yaml_to_github.py Added session-based retry mechanism for all API calls
scripts/github_to_yaml.py Added session-based retry mechanism for all API calls
.github/workflows/yaml-to-settings-sync.yml Added concurrency control to prevent race conditions
.github/workflows/settings-to-yaml-sync.yml Added concurrency control and corrected cron comment

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- cron: "00 * * * *" # hourly

concurrency:
group: team-sync-${{ github.repository }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this actually provide safety against racing conditions? Say an hourly update and a pull request merge at the same time?

Also, if there are manual settings that have been changed (i.e. the state of the org is different from the previous yaml file) in the same hour a merge happens, there is a racing condition this will not catch. Right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The concurrency control prevents workflows from running simultaneously, but you're correct that it doesn't fully prevent the race condition you described.

What it prevents: Two workflows starting at the exact same time (e.g., hourly export + manual merge).

What it doesn't prevent: The scenario where:

  1. Export starts at 0:00, reads current GitHub state
  2. YAML merge happens at 0:30 (export still running)
  3. Export completes at 0:45 and creates a PR that might conflict

However, this is partially mitigated because:

  • The export creates a PR (not direct push), so conflicts are visible during review
  • The YAML→GitHub sync runs immediately on merge, so the window is small
  • GitHub's PR merge will detect file conflicts if both modified teams.yaml

The fundamental issue is that this is a bidirectional sync system. True prevention would require distributed locking or a single source of truth. Given the current architecture (GitHub UI changes allowed), the PR-based review is the safety mechanism.

If this is a concern in practice, consider:

  • Disabling manual GitHub UI changes (YAML-only)
  • Adding a check in the export to detect if main changed during the run
  • Implementing optimistic locking with version markers in teams.yaml

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is the other way around, and doesnt have to do with concurrency in workflows. The issue is the yaml overriding the manual changes even before a PR is open for the sync

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is the other way around, and doesnt have to do with concurrency in workflows. The issue is the yaml overriding the manual changes even before a PR is open for the sync

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely right - I misunderstood the race condition. The actual issue is:

  1. Admin manually adds user X to a team in GitHub UI
  2. Someone merges a YAML change (before hourly export captures user X)
  3. YAML→GitHub sync runs, reads teams.yaml (which lacks user X)
  4. Line 151-165 in yaml_to_github.py removes user X (because it's in GitHub but not YAML)

This is data loss, not just a merge conflict. The concurrency control I added doesn't prevent this.

Per your other comment, this seems like a separate issue from the three production-blocking bugs fixed in this PR (unreachable validation, missing retries, workflow concurrency). The sync race would need architectural changes (e.g., merge instead of replace, version markers, or YAML-only mode).

Co-authored-by: lubianat <7917951+lubianat@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Jan 23, 2026

Without tests it is always harder to approve PRs...

Added comprehensive test suite in commit 584bbb3:

  • 16 test cases covering validation logic, pagination, retry configuration, and session creation
  • Automatic test runs via GitHub Actions workflow
  • See tests/README.md for details

All tests passing ✅

Copilot AI requested a review from lubianat January 23, 2026 09:49
Copy link
Member

@lubianat lubianat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments on sync. May ne separate in other pr

@lubianat lubianat marked this pull request as ready for review January 23, 2026 11:29
@lubianat lubianat merged commit 09aa496 into main Jan 23, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants