fix: filter AWF infrastructure lines from engine failure context #25314
When the Copilot/Claude CLI exits with code 1 before producing any substantive output (as observed in the Apr 8 systemic outage), the buildEngineFailureContext fallback previously showed AWF infrastructure shutdown messages (Container awf-squid Removed, [WARN] Command completed with exit code: 1, Process exiting with code: 1) as "Last agent output", which was confusing and not useful for diagnosis.

Fix:
- Add INFRA_LINE_RE pattern (consistent with parse_copilot_log.cjs) to filter AWF infrastructure lines from the fallback tail
- When log contains only infrastructure lines → show dedicated "engine terminated before producing output / possible transient issue" message
- When actual agent output exists → show filtered last 10 lines only

Adds 4 new tests covering the new behavior.

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/3f85c846-cf45-418e-9ff8-200607fb878f
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Pull request overview
Improves diagnostic output for workflow engine failures by removing AWF (container/firewall wrapper) infrastructure noise from the buildEngineFailureContext fallback and emitting a dedicated “startup failure” message when the engine produces no real output.
Changes:
- Filter AWF infrastructure lines from `agent-stdio.log` before selecting the fallback “tail” context.
- Add a dedicated startup-failure message when logs contain only infrastructure lines.
- Add new tests covering infra-only logs, mixed logs, `[entrypoint]`/`[health-check]` prefixes, and engine ID inclusion; add a patch changeset.
| File | Description |
|---|---|
| actions/setup/js/handle_agent_failure.cjs | Filters infra lines from fallback tail and adds startup-failure messaging when no engine output exists. |
| actions/setup/js/handle_agent_failure.test.cjs | Adds test cases validating infra filtering and the new startup-failure behavior. |
| .changeset/patch-fix-infra-lines-in-engine-failure-context.md | Documents the patch-level change for release notes/versioning. |
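
The filtering step described under "Changes" can be sketched roughly as follows. This is a minimal illustration, not the code in `handle_agent_failure.cjs`: the helper name `selectFallbackTail`, the blank-line handling, and the sample log lines are assumptions made for the example, and the pattern is copied from the diff under review. Note that the `Container|Network|Volume` alternative, as quoted in the diff, expects a leading space on the line.

```javascript
// Pattern copied from the diff under review; treat it as a snapshot,
// not as the canonical definition.
const INFRA_LINE_RE = /^\[(INFO|WARN|SUCCESS|ERROR|entrypoint|health-check)\]|^ (?:Container|Network|Volume) |^Process exiting with code:/;

// Hypothetical helper (not part of the PR): drop blank lines and AWF
// infrastructure lines, then keep at most the last `maxLines` lines.
// Skipping blank lines is an assumption of this sketch.
function selectFallbackTail(logText, maxLines = 10) {
  const engineLines = logText
    .split(/\r?\n/)
    .filter(line => line.trim() !== "" && !INFRA_LINE_RE.test(line));
  return engineLines.slice(-maxLines);
}

// Made-up log content for illustration only.
const sampleLog = [
  "Resolving model configuration",
  "Error: request failed with status 503",
  "[WARN] Command completed with exit code: 1",
  " Container awf-squid  Removed",
  "Process exiting with code: 1",
].join("\n");

console.log(selectFallbackTail(sampleLog));
```

With a log that contains only infrastructure lines the helper returns an empty array, which is the condition the dedicated startup-failure message is meant to handle.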
Copilot's findings
- Files reviewed: 3/3 changed files
- Comments generated: 2
    // (e.g., Copilot API service unavailable, rate-limiting, token not yet provisioned).
    core.info("agent-stdio.log contains only infrastructure lines — engine likely failed at startup (possible transient failure)");
    let context = `\n**⚠️ Engine Failure**: The${engineLabel} engine terminated before producing output.\n\n`;
    context +=
      "The engine exited immediately without producing any output. This often indicates a transient infrastructure issue (e.g., service unavailable, API rate limiting). " +
      "If this failure recurs, check the GitHub Copilot status page and review the firewall audit logs.\n\n";
The startup-failure message always tells users to check the “GitHub Copilot status page”, but this code path is used for any engine ID (e.g. tests already cover claude). For non-copilot engines this guidance is inaccurate/misleading. Consider making the wording engine-agnostic or conditionally mentioning the Copilot status page only when GH_AW_ENGINE_ID === "copilot" (and using a generic provider/status message otherwise).
Suggested change:

    - // (e.g., Copilot API service unavailable, rate-limiting, token not yet provisioned).
    - core.info("agent-stdio.log contains only infrastructure lines — engine likely failed at startup (possible transient failure)");
    - let context = `\n**⚠️ Engine Failure**: The${engineLabel} engine terminated before producing output.\n\n`;
    - context +=
    -   "The engine exited immediately without producing any output. This often indicates a transient infrastructure issue (e.g., service unavailable, API rate limiting). " +
    -   "If this failure recurs, check the GitHub Copilot status page and review the firewall audit logs.\n\n";
    + // (e.g., API service unavailable, rate-limiting, token not yet provisioned).
    + core.info("agent-stdio.log contains only infrastructure lines — engine likely failed at startup (possible transient failure)");
    + const recurringFailureGuidance =
    +   process.env.GH_AW_ENGINE_ID === "copilot"
    +     ? "If this failure recurs, check the GitHub Copilot status page and review the firewall audit logs.\n\n"
    +     : "If this failure recurs, check the provider status page (if available) and review the firewall audit logs.\n\n";
    + let context = `\n**⚠️ Engine Failure**: The${engineLabel} engine terminated before producing output.\n\n`;
    + context +=
    +   "The engine exited immediately without producing any output. This often indicates a transient infrastructure issue (e.g., service unavailable, API rate limiting). " +
    +   recurringFailureGuidance;

    // AWF infrastructure lines written by the firewall/container wrapper — not produced by
    // the engine itself. They must be filtered out of the fallback tail so the failure
    // context surfaces actual agent output rather than container lifecycle noise
    // (e.g. "Container awf-squid Removed", "[WARN] Command completed with exit code: 1",
    // "Process exiting with code: 1"). Uses the same pattern as parse_copilot_log.cjs.
    // Note: INFO/WARN/SUCCESS/ERROR are uppercase (AWF wrapper convention); entrypoint and
    // health-check are lowercase (container script convention) — mixed casing is intentional
    // and reflects the actual log output format produced by different AWF components.
    const INFRA_LINE_RE = /^\[(INFO|WARN|SUCCESS|ERROR|entrypoint|health-check)\]|^ (?:Container|Network|Volume) |^Process exiting with code:/;

INFRA_LINE_RE is duplicated here and in actions/setup/js/parse_copilot_log.cjs:156. Since the comment explicitly says this must stay in sync, consider extracting it into a shared constant/module (e.g. alongside other log parsing helpers) to prevent future drift when the infrastructure log format changes.
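
One way to picture the suggested extraction is sketched below. It is only an illustration in a single file: the `logParserShared` object stands in for a shared module (in a real CommonJS layout it would be `module.exports`, loaded by each consumer with `require()`), and both consumer helpers, `parseableLines` and `hasOnlyInfraLines`, are hypothetical names rather than functions from the repository. The constant name `AWF_INFRA_LINE_RE` and the module name `log_parser_shared.cjs` are the ones used in the follow-up commit.

```javascript
// Stand-in for log_parser_shared.cjs. In a real CommonJS layout this object
// would be `module.exports`, and each consumer would load it with require().
const logParserShared = Object.freeze({
  AWF_INFRA_LINE_RE: /^\[(INFO|WARN|SUCCESS|ERROR|entrypoint|health-check)\]|^ (?:Container|Network|Volume) |^Process exiting with code:/,
});

// Consumer 1 (stand-in for parse_copilot_log.cjs): hypothetical helper that
// skips infrastructure lines before parsing.
function parseableLines(lines) {
  const { AWF_INFRA_LINE_RE } = logParserShared;
  return lines.filter(line => !AWF_INFRA_LINE_RE.test(line));
}

// Consumer 2 (stand-in for handle_agent_failure.cjs): hypothetical check for
// a log that holds nothing but infrastructure lines.
function hasOnlyInfraLines(lines) {
  const { AWF_INFRA_LINE_RE } = logParserShared;
  return lines.length > 0 && lines.every(line => AWF_INFRA_LINE_RE.test(line));
}

console.log(parseableLines(["[entrypoint] starting", "hello"]));
console.log(hasOnlyInfraLines(["[health-check] ok", "Process exiting with code: 1"]));
// The pattern is case-sensitive, so a lowercase "[info]" prefix is treated as engine output.
console.log(hasOnlyInfraLines(["[info] lower-case prefix"]));
```

Because both helpers read the same constant, a change to the infrastructure log format only needs to be made in one place.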
@copilot review comments
…tus page guidance engine-specific

- Extract INFRA_LINE_RE as AWF_INFRA_LINE_RE into log_parser_shared.cjs so parse_copilot_log.cjs and handle_agent_failure.cjs share the same constant (no more duplication risk when infra log format changes)
- Make startup-failure status page guidance engine-specific: copilot gets "GitHub Copilot status page", all other engines get "provider status page"
- Add test verifying provider-agnostic guidance for claude engine
- Update existing engine ID startup-failure test to also verify Copilot status page text

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/de02b350-d94e-4ee6-aceb-0e1f8d2b9ae8
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Both review comments addressed in f40da83:
…25314)

Creates a new workflow that:
- Triggers on PRs labeled "implementation" or with >100 new lines in business logic dirs
- Detects missing ADRs and generates draft using AI analysis of PR diff
- Posts actionable comment guiding author to complete and commit the ADR
- Blocks merge until ADR is linked via REQUEST_CHANGES review
- Verifies implementation matches existing ADR when one is present
- ADRs stored as numbered Markdown files in /docs/adr/
- Configurable business logic paths via .design-gate.yml

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/3d100092-4ec3-4509-8739-d270f7d79996
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
During the Apr 8 systemic outage, 13+ workflows failed with exit code 1 before producing any agent output. The `buildEngineFailureContext` fallback was showing AWF container lifecycle messages as "Last agent output" in failure issues, which was confusing and useless for diagnosis.

Changes

- Infrastructure line filter: Uses the shared `AWF_INFRA_LINE_RE` constant (extracted into `log_parser_shared.cjs`) in `buildEngineFailureContext`. Infrastructure lines are stripped before the fallback tail is selected, so only actual engine output appears.
- Shared constant: `AWF_INFRA_LINE_RE` is now defined once in `log_parser_shared.cjs` and imported by both `parse_copilot_log.cjs` and `handle_agent_failure.cjs`, eliminating duplication and preventing future drift.
- Startup-failure detection: When the log contains only infrastructure lines (the engine exited before producing anything), a dedicated message is shown instead of the generic "terminated unexpectedly" message with a useless tail. The message is engine-aware: `copilot` engine failures reference the GitHub Copilot status page; all other engines (claude, codex, custom) reference a generic provider status page.
- No change to issue creation logic: Failure issues are still created in all cases; only the diagnostic context surfaced in those issues is improved.

5 new tests covering: infra-only log → startup-failure message; mixed log → infra lines excluded from tail; `[entrypoint]`/`[health-check]` prefix handling; engine ID label in startup-failure message; provider-agnostic status page guidance for non-copilot engines.
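
The startup-failure branch and its engine-aware guidance could look roughly like the sketch below. It assumes the log has already been filtered, so `engineOutputLines` holds only real engine output. The helper name `buildFallbackContext`, the way `engineLabel` is formatted, and the layout of the non-empty branch are assumptions for illustration; the message strings for the startup-failure case follow the snippet quoted in the review, and the real code reads the engine ID from `process.env.GH_AW_ENGINE_ID` rather than taking it as a parameter.

```javascript
// Hypothetical helper (not the PR's function): `engineOutputLines` is assumed
// to be the agent log with AWF infrastructure lines already removed.
function buildFallbackContext(engineOutputLines, engineId) {
  // Label formatting is an assumption; the reviewed code only shows `The${engineLabel} engine`.
  const engineLabel = engineId ? ` \`${engineId}\`` : "";

  if (engineOutputLines.length === 0) {
    // Infra-only log: the engine exited before producing any output.
    const statusPage =
      engineId === "copilot"
        ? "the GitHub Copilot status page"
        : "the provider status page (if available)";
    return (
      `\n**⚠️ Engine Failure**: The${engineLabel} engine terminated before producing output.\n\n` +
      "The engine exited immediately without producing any output. This often indicates a transient infrastructure issue (e.g., service unavailable, API rate limiting). " +
      `If this failure recurs, check ${statusPage} and review the firewall audit logs.\n\n`
    );
  }

  // Real output exists: show only the last 10 filtered lines.
  // The exact wording and layout of this branch are assumed for the sketch.
  const tail = engineOutputLines.slice(-10).join("\n");
  return `\n**⚠️ Engine Failure**: The${engineLabel} engine terminated unexpectedly.\n\nLast agent output:\n${tail}\n`;
}

console.log(buildFallbackContext([], "copilot"));
console.log(buildFallbackContext([], "claude"));
console.log(buildFallbackContext(["Error: request failed"], "claude"));
```

Passing the engine ID as an argument keeps the sketch easy to test without touching environment variables.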