ENOENT race condition during cache restoration on ephemeral self-hosted runners #479

@lifeofmoo

Contributing guidelines

I've found a bug, and:

  • The documentation does not mention anything about my problem
  • There are no open or closed issues that are related to my problem

Description

docker/setup-buildx-action@v3 with version: latest intermittently fails with ENOENT: no such file or directory, copyfile on ephemeral Kubernetes-based self-hosted GitHub Actions runners. The error occurs while the action populates the hosted tool cache and copies the binary into place, which suggests a race condition between the cache step reporting completion and the file actually being available on disk.

Environment

  • Runner: Self-hosted Kubernetes ephemeral runners (k8s-runners)
  • Action Version: docker/setup-buildx-action@v3
  • Buildx Version: version: latest (downloads v0.31.1)
  • Setup Date: Workflows created February 11, 2026
  • Frequency: Intermittent (not every run)
  • Shared Workflow: ......./build-push-deploy.yaml@main (serves 200+ repos)

Reproduction

Failing Workflow (gha-java-app)

Workflow created: 2026-02-11
Last failure: 2026-02-12

- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3
  with:
    version: latest
    cache-binary: true

Full Error Log:

Download buildx from GitHub Releases
  Downloading https://github.com/docker/buildx/releases/download/v0.31.1/buildx-v0.31.1.linux-amd64
  Received 61235200 of 61235200 (100.0%), 54.8 MBs/sec
  /usr/bin/tar --extract -z --file=/home/runner/work/_temp/8eadc7e2-2a5b-41e5-b25b-e01f76fc6ad4/tmp.tar.gz -C /home/runner/work/_temp/8eadc7e2-2a5b-41e5-b25b-e01f76fc6ad4
  Caching buildx-dl-bin-0.31.1-linux-x64 to GitHub Actions cache
  Received 59080704 of 59080704 (100.0%), 29.2 MBs/sec
  Cache Size: ~57 MB (59080704 B)
  Cache saved successfully
  Cache Path: /opt/hostedtoolcache/buildx-dl-bin/0.31.1/linux-x64
  Cache Expiration: 2026-03-14 18:17:25 +0000 UTC
  Process to be cached: /opt/hostedtoolcache/buildx-dl-bin/0.31.1/linux-x64/docker-buildx
  Cached to hosted tool cache /opt/hostedtoolcache/buildx-dl-bin/0.31.1/linux-x64
  Buildx binary found in /home/runner/.docker/buildx/.bin/0.31.1/linux-x64/docker-buildx
Error: ENOENT: no such file or directory, copyfile '/opt/hostedtoolcache/buildx-dl-bin/0.31.1/linux-x64/docker-buildx' -> '/home/runner/.docker/buildx/.bin/0.31.1/linux-x64/docker-buildx'

Working Workflow (web-frontend)

Same workflow, same runner type, 30 minutes earlier - SUCCESS:

Download buildx from GitHub Releases
  Use 0.31.1 version spec cache key for v0.31.1
  Cache hit for: buildx-dl-bin-0.31.1-linux-x64
  Received 59080704 of 59080704 (100.0%), 33.4 MBs/sec
  Cache Size: ~57 MB (59080704 B)
  Cache restored successfully
  Restored buildx-dl-bin-0.31.1-linux-x64 from GitHub Actions cache
  Cached to hosted tool cache /opt/hostedtoolcache/buildx-dl-bin/0.31.1/linux-x64
  Buildx binary found in /home/runner/.docker/buildx/.bin/0.31.1/linux-x64/docker-buildx
Install buildx
  Docker plugin mode
  ✅ Success (no error)

Analysis

Race Condition Hypothesis

The error pattern suggests a timing issue:

  1. ✅ Cache restoration reports success: "Cached to hosted tool cache"
  2. ✅ Binary detection succeeds: "Buildx binary found in /home/runner/.docker/buildx/.bin/..."
  3. ❌ Immediate copyfile operation fails with ENOENT

Theory: On ephemeral self-hosted runners, there may be a delay between:

  • Cache service reporting "cached to hosted tool cache"
  • Actual filesystem availability of the cached file

The intermittent nature (same workflow succeeds for web-frontend, fails for gha-java-app) suggests:

  • Not a permissions issue (would be consistent)
  • Not a configuration issue (would affect all repos)
  • Likely a timing/race condition in cache restoration path
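
One way to gather evidence for or against this hypothesis is a diagnostic step that inspects both paths from the copyfile error right after a failure. The following is a minimal sketch, not something that has been run on these runners: the step name is hypothetical, the paths are copied from the error log above, and it assumes the step is placed directly after "Set up Docker Buildx" in the shared workflow.

```yaml
# Hypothetical diagnostic step; runs only when an earlier step in the job failed.
- name: Inspect buildx cache paths after failure
  if: failure()
  run: |
    # Source and destination directories taken from the copyfile error.
    ls -la /opt/hostedtoolcache/buildx-dl-bin/0.31.1/linux-x64 || true
    ls -la /home/runner/.docker/buildx/.bin/0.31.1/linux-x64 || true
    # Show which filesystem backs the tool cache on this runner pod.
    df -h /opt/hostedtoolcache || true
```

If the source file is present by the time this step runs, that would be consistent with a timing issue; if it is still missing, the file was probably never written to that path.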

Version Timeline

  • docker/setup-buildx-action v3.9.0 released: 2025-02-06
  • docker/setup-buildx-action v3.10.0 released: 2025-02-26
  • gha-java-app workflows created: 2026-02-11 (after all of the releases listed here)
  • v3.12.0 (latest): 2025-12-19

With version: latest, the action currently resolves to buildx v0.31.1, which is downloaded whenever there is no cache hit.

Impact

  • Scope: 200+ production repositories using shared workflow
  • Frequency: Intermittent failures on newly-created workflows
  • Workaround: Manual retry usually fails
  • Minor Risk: Cannot modify shared workflow without affecting entire fleet

Similar Issues

Searched closed issues - found #423 with EACCES: permission denied, copyfile but:

  • Different error (EACCES vs ENOENT)
  • Different copy direction (to cli-plugins vs to .bin)
  • Different resolution (permissions fix)

No exact match found for this ENOENT race condition.

Requested Fix

Potential solutions:

  1. Add retry logic with backoff when copyfile fails with ENOENT
  2. Add explicit filesystem sync/wait after cache restoration
  3. Verify file existence before reporting "Buildx binary found"
  4. Document timing considerations for ephemeral runners
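
Until something like option 1 exists in the action itself, a workflow-level approximation is to let the first attempt fail softly and run the action a second time in the same job. This is an untested sketch under stated assumptions: the step id `buildx` and the second step's name are hypothetical, and nothing here shows that a second attempt actually succeeds. Because the failing log reports "Cache saved successfully" before the error, the second attempt would presumably take the cache-hit path seen in the working run, but that is an assumption.

```yaml
- name: Set up Docker Buildx
  id: buildx                      # hypothetical id, used by the fallback below
  continue-on-error: true         # do not fail the job on the first attempt
  uses: docker/setup-buildx-action@v3
  with:
    version: latest
    cache-binary: true

# Runs only if the first attempt failed (outcome is the result
# before continue-on-error is applied).
- name: Set up Docker Buildx (second attempt)
  if: steps.buildx.outcome == 'failure'
  uses: docker/setup-buildx-action@v3
  with:
    version: latest
    cache-binary: true
```

Unlike a manual re-run, which starts on a fresh ephemeral pod, both attempts here run in the same pod, so the second one sees whatever state the first left on disk.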

Additional Context

  • Other workflow commands using same runners work reliably
  • Only affects setup-buildx-action specifically
  • Issue appeared around Feb 11, 2026 for new workflows

Full logs available upon request.

Expected behaviour

The action should reliably:

  1. Restore buildx binary from GitHub Actions cache
  2. Cache to hosted tool cache
  3. Copy binary to final destination
  4. Proceed with buildx setup

This workflow has worked reliably across ~200 production repositories since 2024.

Actual behaviour

The action intermittently fails with:

Cached to hosted tool cache /opt/hostedtoolcache/buildx-dl-bin/0.31.1/linux-x64
Buildx binary found in /home/runner/.docker/buildx/.bin/0.31.1/linux-x64/docker-buildx
Error: ENOENT: no such file or directory, copyfile '/opt/hostedtoolcache/buildx-dl-bin/0.31.1/linux-x64/docker-buildx' -> '/home/runner/.docker/buildx/.bin/0.31.1/linux-x64/docker-buildx'

Key observation: The action reports "Cached to hosted tool cache" AND "Buildx binary found", yet immediately fails with ENOENT when attempting the copy operation. This suggests the file hasn't fully materialized despite cache success indicators.

Repository URL

No response

Workflow run URL

No response

YAML workflow

This is embedded in a shared reusable workflow:


- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3
  with:
    version: latest
    cache-binary: true


Called from consuming workflow:


name: Build and Deploy

on:
  push:
    branches:
      - main

jobs:
  build:
    uses: org/shared-workflows/.github/workflows/build-push-deploy.yaml@main
    with:
      STACK: dev
      K8S_REPO: org/app_eks
    secrets: inherit

Workflow logs

Identical to the logs under Reproduction above: the failing run (new workflow, created Feb 11, 2026) and the working run (existing workflow, same runner type, 30 minutes earlier).

BuildKit logs

Not applicable - error occurs before BuildKit initialization

Additional info

Self-Hosted Runner Configuration

Runner Controller:

  • Controller: summerwind/actions-runner-controller:v0.27.6
  • Runner Image: summerwind/actions-runner:latest
  • GitHub Actions Runner Version: 2.331.0
  • Ephemeral Mode: true (confirmed - pods destroyed after each job)

Docker Environment:

  • Docker Version: 29.1.3 (built Dec 12, 2025)
  • Docker API: 1.52
  • containerd: v2.2.0
  • runc: 1.3.4
  • Docker Buildx (pre-installed): v0.30.1

Version Information:

  • Runner comes with buildx v0.30.1 pre-installed
  • GitHub Action downloads v0.31.1 with version: latest
  • Bug is in v0.31.1 binary itself, not related to version upgrade process
  • v0.30.1 works correctly in all scenarios

Test Results & Root Cause Confirmation

Test 1: v0.30.1 (2026-02-12 14:40 UTC)

Modified shared workflow to use version: v0.30.1 (matching runner's pre-installed version).

Test Branch: buildx in github-actions-k8s repository
Test Workflow: https://github.com/MYLtd/gha-java-app/actions/runs/21946618833

Result: SUCCESS - Build completed without ENOENT error


Test 2: v0.31.1 Explicit Version (2026-02-12 15:15 UTC)

Modified shared workflow to use explicit version: v0.31.1 with cache-binary: true.

Purpose: Determine whether v0.31.1 itself has a bug, or whether the issue only arises from the version mismatch (v0.30.1→v0.31.1 upgrade).

Test Branch: test-v0.31.1 in github-actions-k8s repository
Test Workflow: https://github.com/MYLtd/gha-java-app/actions/runs/21947309220

Result: FAILED - ENOENT error occurred:

ENOENT: no such file or directory, copyfile '/opt/hostedtoolcache/buildx-dl-bin/0.31.1/linux-x64/docker-buildx' 
-> '/home/runner/.docker/buildx/.bin/0.31.1/linux-x64/docker-buildx'

Conclusion:

The ENOENT error is caused by a regression bug in buildx v0.31.1 during cache-save operations on ephemeral Kubernetes-based GitHub Actions runners.

Test Evidence:

Test 2 indicates this is not a version mismatch issue:

  • Used explicit version: v0.31.1 (no pre-existing version to upgrade from)
  • Still produced identical ENOENT error
  • Therefore, the bug is in v0.31.1's cache-save implementation itself

Failure Pattern:

  1. ❌ v0.31.1 + first-time download → ENOENT (cache save bug triggers)
  2. ✅ v0.31.1 + cache hit → SUCCESS (no cache save needed, bug bypassed)
  3. ✅ v0.30.1 + first-time download → SUCCESS (no bug in v0.30.1)
  4. ❌ v0.31.1 + explicit version → ENOENT (confirms bug is in v0.31.1 binary)

Potential Root Cause:

With buildx v0.31.1, the action reports "Cache saved successfully" before the file is actually available on ephemeral runner filesystems. The subsequent copyfile operation fails because the source file doesn't exist yet, despite the action claiming it was successfully cached.

This is a race condition bug introduced in v0.31.1, likely in the tool cache or filesystem interaction code.

Recommended Actions:

  1. Immediate Fix (Production): Pin to version: v0.30.1 in shared workflow.
  2. Monitor: Check if v0.31.0 has the same bug or if regression was introduced in v0.31.1 specifically
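
Applied to the step shown under "YAML workflow", the pin from action 1 would look roughly like this. Only the version line changes; keeping cache-binary: true is an assumption, since the Test 1 description does not say whether it was set.

```yaml
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3
  with:
    # Pinned instead of "latest"; v0.30.1 is the version that passed Test 1.
    version: v0.30.1
    # Assumption: unchanged from the original step.
    cache-binary: true
```

Because this lives in the shared workflow, the pin applies to every consuming repository at once.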
