Commit 71ad211

sjarmak and claude committed
fix: eliminate 30-min chown timeout on large-repo Docker overlay2
The runtime setup in claude_baseline_agent.py ran an unconditional `chown -R claude:claude /workspace /app /testbed`, which walked 350K+ files on overlay2, took 30+ minutes, and caused agent timeouts. Replace it with a stat-based ownership probe (2ms when the dirs are already owned by claude), with a bounded maxdepth-1 fallback for mismatched dirs.

Also demonstrate the clone-as-claude Dockerfile pattern on ccx-domain-071: create the claude user first, switch to USER claude, then git clone. This eliminates the chown -R layer that duplicated all repo data in overlay2 (2.6GB→1.31GB).

Update the _CHOWN_OPTIMIZATION comment in generate_sgonly_dockerfiles.py to document that sg_only/artifact Dockerfiles (empty workspace) are fine, but baseline Dockerfiles should use clone-as-claude instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent af3e69a commit 71ad211
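The probe-then-bounded-repair idea described in the commit message can be sketched in Python as follows. This is only an illustration under stated assumptions: the commit itself implements the logic as a shell one-liner, the function names `dirs_owned_by` and `bounded_repair_targets` are hypothetical, and the sketch compares numeric uids rather than the user name `claude`.

```python
import os
from pathlib import Path


def dirs_owned_by(dirs, uid):
    """Return True if every *existing* directory in `dirs` is owned by `uid`.

    Missing directories are skipped, roughly mirroring `[ -d "$d" ] || continue`.
    Only one stat call per directory is made, so the cost does not depend on
    how many files live underneath.
    """
    for d in dirs:
        p = Path(d)
        if not p.is_dir():
            continue
        if p.stat().st_uid != uid:
            return False
    return True


def bounded_repair_targets(d, uid):
    """List `d` itself plus its direct children that are not owned by `uid`.

    This approximates `find "$d" -maxdepth 1 -not -user ...`: it never
    descends below the first level, so the work stays bounded.
    """
    root = Path(d)
    candidates = [root, *root.iterdir()]
    # lstat: do not follow symlinks, matching find's default behaviour.
    return [p for p in candidates if p.lstat().st_uid != uid]


if __name__ == "__main__":
    me = os.getuid()
    workdirs = ["/workspace", "/app", "/testbed"]
    if dirs_owned_by(workdirs, me):
        print("chown-skip: dirs already owned")
    else:
        for d in workdirs:
            if Path(d).is_dir():
                print(d, "->", bounded_repair_targets(d, me))
```

The sketch only reports which paths would need repair; it does not call `os.chown`, which would require root.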

3 files changed, +38 -18 lines changed

agents/claude_baseline_agent.py

Lines changed: 19 additions & 4 deletions
@@ -1074,13 +1074,28 @@ def create_run_agent_commands(self, instruction: str) -> list[ExecInput]:

 # The outer command:
 # 1. Creates claude user
-# 2. Writes system prompt and wrapper script from base64
-# 3. Runs wrapper script as claude user (or root in hybrid SG modes)
-# 4. Fixes permissions after run
+# 2. Probes writability; only chowns if needed (avoids 30+ min
+# no-op stat() walk on large repos in Docker overlay2)
+# 3. Writes system prompt and wrapper script from base64
+# 4. Runs wrapper script as claude user (or root in hybrid SG modes)
+# 5. Fixes permissions after run
 setup_cmds = (
 "id -u claude &>/dev/null || adduser -D -s /bin/bash claude 2>/dev/null || adduser --disabled-password --gecos '' claude 2>/dev/null || true && "
 "chown -R claude:claude /logs 2>/dev/null || true && "
-"chown -R claude:claude /workspace /app /testbed 2>/dev/null || true"
+# Probe ownership via stat (not touch — touch succeeds as root).
+# If every existing workdir is already owned by claude, skip the
+# expensive recursive chown that can take 30+ min on overlay2.
+"{ _ok=1; for d in /workspace /app /testbed; do "
+"[ -d \"$d\" ] || continue; "
+"[ \"$(stat -c %U \"$d\" 2>/dev/null)\" = claude ] || _ok=0; "
+"done; [ \"$_ok\" = 1 ]; } && "
+"echo 'chown-skip: dirs already owned by claude' || "
+"{ echo 'chown-fix: running bounded ownership repair'; "
+"for d in /workspace /app /testbed; do "
+"[ -d \"$d\" ] || continue; "
+"chown claude:claude \"$d\" 2>/dev/null; "
+"find \"$d\" -maxdepth 1 -not -user claude -exec chown claude:claude {} + 2>/dev/null; "
+"done; true; }"
 )

 file_cmds = ""
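The added lines interleave Python comments with string fragments inside one parenthesized expression. As a minimal sketch (the variable name `probe_cmd` is hypothetical; the fragments are copied from the added lines above), the following shows that Python joins the adjacent literals into a single shell command and that the comments never reach the shell:

```python
# Adjacent string literals are concatenated at compile time; comments placed
# between them are ordinary Python comments and are not part of the result.
probe_cmd = (
    # Probe: succeeds only if every existing workdir is owned by claude.
    "{ _ok=1; for d in /workspace /app /testbed; do "
    "[ -d \"$d\" ] || continue; "
    "[ \"$(stat -c %U \"$d\" 2>/dev/null)\" = claude ] || _ok=0; "
    "done; [ \"$_ok\" = 1 ]; } && "
    "echo 'chown-skip: dirs already owned by claude' || "
    # Fallback: repair ownership of each workdir and its direct children only.
    "{ echo 'chown-fix: running bounded ownership repair'; "
    "for d in /workspace /app /testbed; do "
    "[ -d \"$d\" ] || continue; "
    "chown claude:claude \"$d\" 2>/dev/null; "
    "find \"$d\" -maxdepth 1 -not -user claude -exec chown claude:claude {} + 2>/dev/null; "
    "done; true; }"
)

# Print the assembled one-liner for inspection.
print(probe_cmd)
```

The assembled command has the shape `{ probe; } && echo skip || { repair; }`, so the repair group runs when the probe group exits non-zero.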

benchmarks/ccb_mcp_domain/ccx-domain-071/environment/Dockerfile

Lines changed: 11 additions & 9 deletions
@@ -11,9 +11,16 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
 default-jdk \
 && rm -rf /var/lib/apt/lists/*

+# Create claude user BEFORE cloning so files are owned correctly from the
+# start. This avoids a post-clone chown -R layer that doubles image size
+# and takes 15-30 min on overlay2 (copy-on-write duplicates every inode).
+RUN adduser --disabled-password --gecos '' claude 2>/dev/null || true
+RUN mkdir -p /workspace /logs/agent /logs/verifier && \
+chown -R claude:claude /workspace /logs
+
+# Clone as claude — files land claude-owned, no separate chown layer needed.
+USER claude
 WORKDIR /workspace
-
-# Clone local checkout repos (baseline config: agent has local access to these)
 RUN git clone --depth 1 https://github.com/sg-evals/kafka--0753c489 /workspace/kafka--0753c489
 RUN git clone --depth 1 https://github.com/sg-evals/flink--0cc95fcc /workspace/flink--0cc95fcc
 RUN git clone --depth 1 https://github.com/sg-evals/camel--1006f047 /workspace/camel--1006f047
@@ -23,12 +30,7 @@ RUN git config --global user.email "agent@example.com" && \
 git config --global user.name "Agent" && \
 git config --global safe.directory '*'

-# Create log directories
-RUN mkdir -p /logs/agent /logs/verifier
-
-# Pre-create claude user and set ownership at build time so Harbor's
-# runtime chown is a no-op (avoids 15-30 min delay on large repos).
-RUN (adduser --disabled-password --gecos '' claude 2>/dev/null || true) && \
-for d in /workspace /app /testbed /logs; do [ -d "$d" ] && chown -R claude:claude "$d"; done || true
+# Switch back to root for Harbor's runtime setup
+USER root

 ENTRYPOINT []
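To spot other baseline Dockerfiles that still use the older pattern, a small line-based check along the following lines might help. It is a rough heuristic sketch, not part of the commit: the function name `find_chown_after_clone` is hypothetical, it does not join backslash continuations into full `RUN` instructions, and the example URLs are placeholders.

```python
def find_chown_after_clone(dockerfile_text):
    """Return 1-based line numbers where `chown -R` appears on a line
    that comes after an earlier line containing `git clone`.

    A recursive chown that follows a clone in a later instruction rewrites
    every cloned file into a new layer, which is what the clone-as-claude
    pattern is meant to avoid.
    """
    seen_clone = False
    hits = []
    for lineno, line in enumerate(dockerfile_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # ignore Dockerfile comments
        # Check chown first so a clone on the same line does not flag itself.
        if seen_clone and "chown -R" in stripped:
            hits.append(lineno)
        if "git clone" in stripped:
            seen_clone = True
    return hits


OLD_STYLE = """\
RUN git clone --depth 1 https://example.invalid/repo /workspace/repo
RUN (adduser --disabled-password --gecos '' claude || true) && \\
    for d in /workspace /logs; do chown -R claude:claude "$d"; done
"""

NEW_STYLE = """\
RUN adduser --disabled-password --gecos '' claude || true
RUN mkdir -p /workspace /logs && chown -R claude:claude /workspace /logs
USER claude
RUN git clone --depth 1 https://example.invalid/repo /workspace/repo
USER root
"""

if __name__ == "__main__":
    print("old style:", find_chown_after_clone(OLD_STYLE))
    print("new style:", find_chown_after_clone(NEW_STYLE))
```

In the new style the only `chown -R` runs while `/workspace` is still empty, so the check finds nothing to report.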

scripts/generate_sgonly_dockerfiles.py

Lines changed: 8 additions & 5 deletions
@@ -127,12 +127,15 @@


 # Block injected before ENTRYPOINT to pre-create the claude user and set
-# ownership at build time. Harbor's runtime ``chown -R claude:claude /workspace``
-# then becomes a near-instant no-op (files already owned by claude), saving
-# 15-30 min on large-repo images (e.g. Firefox 5.4 GB workspace).
+# ownership at build time. The agent's runtime setup (claude_baseline_agent.py)
+# probes ownership via stat and skips chown if dirs are already claude-owned.
+#
+# NOTE: For sg_only/artifact Dockerfiles this is fine (workspace is empty).
+# For BASELINE Dockerfiles with git clones, prefer the clone-as-claude pattern
+# instead: create claude user first, USER claude, then git clone. This avoids
+# a chown -R layer that doubles image size on Docker overlay2.
 _CHOWN_OPTIMIZATION = """\
-# Pre-create claude user and set ownership at build time so Harbor's
-# runtime chown is a no-op (avoids 15-30 min delay on large repos).
+# Pre-create claude user and set ownership at build time.
 RUN (adduser --disabled-password --gecos '' claude 2>/dev/null || true) && \\
 for d in /workspace /app /testbed /logs; do [ -d "$d" ] && chown -R claude:claude "$d"; done || true
 """
