Single transient GraphQL failure during pass-2 strands an already-cut release #1203

@klhq

Environment details

  • Programming language: TypeScript / Node (project uses Bun, action runs on Node)
  • OS: GitHub Runner ubuntu-latest
  • Language runtime version: Node 20 (action's runtime)
  • Package version: googleapis/release-please-action@v4. Also reproducible on @v5. Underlying release-please 17.3.0 and 17.6.0.

Steps to reproduce

  1. Have a workflow that runs release-please-action on push to main, with a downstream deploy job gated on needs.release-please.outputs['<path>--release_created'] == 'true'.
  2. Merge a release-please PR (e.g. chore(main): release X.Y.Z).
  3. While the workflow runs, encounter a transient GraphQL failure during pass-2 (the commit.history query issued by createPullRequests). The easiest natural reproduction is to run the workflow during a GitHub service incident. We hit this during the incident on 2026-04-27, 16:48Z to 19:02Z. Symptom in logs:
    Creating 1 releases for pull #41                    (pass 1 succeeded, tag v1.10.1 cut)
    Fetching merge commits on branch main with cursor: undefined  (pass 2 starts)
    ##[error]release-please failed: Request failed due to following response errors:
     - Something went wrong while executing your query on 2026-04-27T18:46:36Z.
       Please include `2401:3F4A02:1120138:44A0780:69EFAF0C` when reporting...
    
  4. Observe: the GitHub release and tag for X.Y.Z are created (pass-1 completed), but the release-please job's conclusion is failure, the downstream deploy job is skipped, and re-running the workflow does not recover. The action no longer re-emits release_created=true on subsequent runs because the release already exists.

What's wrong

release-please-action's main() (verified byte-identical in v4 and v5) runs two passes in one step:

  1. manifest.createReleases() cuts the release and writes <path>--release_created=true via core.setOutput.
  2. manifest.createPullRequests() scans commit.history to draft the next release PR. Independent of the release just cut.

If pass-2 throws, core.setFailed marks the entire step failed. Because the job then concludes as failure, jobs that depend on it through needs: are skipped by default, so the release_created output never reaches the deploy gate, even though pass-1 wrote it before pass-2 ran. Once the tag exists, no later run re-emits release_created=true for that release, so the deploy gate becomes permanently unreachable for it.

Same symptom independently reported in #1202 today (Rust project, v5), confirming this is neither language- nor repo-specific. Recurring reports: #867 (2023), #976, #977 (2024), #1202 (2026). Each prior issue has been closed without addressing the coupling.

Three independent root issues

(A) Retry only catches HTTP 502. From release-please/src/github.ts graphqlRequest:

if ((err as GitHubAPIError).status !== 502) {
  throw err;
}

The 2026-04-27 incident returned a GraphqlResponseError (HTTP 200, with errors: [{message: "Something went wrong..."}] in the body). There was no 502, so there was no retry. Retry should also cover 503, 504, and any GraphqlResponseError whose messages match /something went wrong|server error/i. The retry check is identical in 17.3.0 and 17.6.0.
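
A minimal sketch of what a broader retry predicate could look like. This is not the code in release-please; ErrorLike and isRetryableGraphqlError are hypothetical names, and the error shape (an optional status plus an optional errors array) is an assumption based on the failure described above.

```typescript
// Hypothetical error shape: an HTTP status and/or a GraphQL "errors" array.
// The real classes live in release-please and the Octokit libraries.
interface ErrorLike {
  status?: number;
  errors?: Array<{ message?: string }>;
}

const RETRYABLE_STATUSES = new Set([502, 503, 504]);
// No "g" flag, so RegExp.test() carries no state between calls.
const TRANSIENT_MESSAGE = /something went wrong|server error/i;

// Returns true when the error looks transient and the request is worth retrying.
function isRetryableGraphqlError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) {
    return false;
  }
  const e = err as ErrorLike;
  if (e.status !== undefined && RETRYABLE_STATUSES.has(e.status)) {
    return true;
  }
  // HTTP 200 responses can still carry transient errors in the body.
  const bodyErrors = Array.isArray(e.errors) ? e.errors : [];
  return bodyErrors.some(({ message }) => TRANSIENT_MESSAGE.test(message ?? ""));
}
```

With a predicate like this, the existing check would become if (!isRetryableGraphqlError(err)) throw err; and the surrounding retry loop would stay as it is.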

(B) Pass-1 and pass-2 share a single failure surface. Wrapping pass-2 in its own try/catch and surfacing pass-2 errors as core.warning (rather than core.setFailed) would let pass-1's outputs survive a transient pass-2 failure.
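
The decoupling could be sketched roughly as follows. ManifestLike and Reporter are hypothetical stand-ins for the release-please manifest and @actions/core, and the single release_created output is a simplification of the per-path outputs the action actually writes.

```typescript
// Hypothetical stand-in for the release-please manifest.
interface ManifestLike {
  createReleases(): Promise<unknown[]>;
  createPullRequests(): Promise<unknown[]>;
}

// Hypothetical stand-in for the subset of @actions/core used here.
interface Reporter {
  setOutput(name: string, value: string): void;
  warning(message: string): void;
  setFailed(message: string): void;
}

async function runBothPasses(manifest: ManifestLike, core: Reporter): Promise<void> {
  // Pass 1: a failure here is still fatal, because no release was cut.
  try {
    const releases = await manifest.createReleases();
    core.setOutput("release_created", String(releases.length > 0));
  } catch (err) {
    core.setFailed(`release-please failed: ${String(err)}`);
    return;
  }

  // Pass 2: a failure is reported as a warning so the step still succeeds
  // and the outputs written by pass 1 stay usable downstream.
  try {
    await manifest.createPullRequests();
  } catch (err) {
    core.warning(`pass-2 failed, release outputs kept: ${String(err)}`);
  }
}
```

One trade-off of this design is that a persistent pass-2 failure would only show up as a warning, so an opt-in input to keep the current strict behaviour might be worth considering.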

(C) Pass-2 is unconditional even when pointless. When HEAD is the release commit pass-1 just cut, pass-2 has zero new commits to scan and zero PRs to open. A short-circuit (if (headSha === justCutReleaseSha) return;) would skip the API call entirely for the most common trigger and eliminate this failure mode there.
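
A rough sketch of the short-circuit check, assuming each release returned by pass-1 exposes the commit SHA it was cut from. CutRelease and canSkipPass2 are hypothetical names, and how the action would obtain headSha is left open.

```typescript
// Hypothetical minimal view of a release returned by pass 1.
interface CutRelease {
  sha: string;
}

// True when pass 1 cut at least one release from the commit that is currently
// HEAD, so there are no newer commits for pass 2 to scan.
function canSkipPass2(headSha: string, justCut: readonly CutRelease[]): boolean {
  return justCut.some((release) => release.sha === headSha);
}
```

In the action this would guard the createPullRequests call, for example if (!canSkipPass2(headSha, releases)) { await manifest.createPullRequests(); }. Whether the skip is safe for every manifest configuration (for example separate pull requests per component) is not established here and would need checking.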

Workaround for affected users

continue-on-error: true on the action step preserves pass-1's outputs even if pass-2 throws. Pass-1 outputs are written via core.setOutput before pass-2 begins (verified in src/index.ts main()), so downstream gates on <path>--release_created remain accurate.

Confirmed v5 does not fix this: src/index.ts main() is byte-identical between v4 and v5. The v5.0.0 release notes are limited to a Node 24 bump and a release-please lib bump from 17.3.0 to 17.6.0, neither of which changes retry coverage or pass coupling.

Asks (in priority order)

  1. Decouple pass-2 failures from pass-1 outputs at the action level (issue B). Easiest, fully within this repo.
  2. Broaden retry coverage in release-please to include 503, 504, and matching GraphqlResponseError (issue A).
  3. Short-circuit pass-2 when no work is possible (issue C).
