…reduce token usage Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
…to artifact files Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Done in 046df74. The step now:
Directory layout:
Pull request overview
This PR optimizes the CI Failure Doctor workflow to reduce token consumption, previously ~1.5M tokens per run. It pre-downloads logs and artifacts before the agent session starts, then applies generic error-detection heuristics to create focused hint files that the agent can examine first.
Changes:
- Added a pre-download step that fetches failed job logs and artifacts, applies grep heuristics to identify error lines, and creates a summary file for the agent to start with
- Updated agent instructions to prioritize pre-downloaded hint files over full log retrieval
- Reduced fallback `tail_lines` parameter from 200 to 100 for MCP log retrieval
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `.github/workflows/ci-doctor.md` | Added pre-download step with generic error heuristics, new Pre-Analysis Data section, and updated investigation protocol to use pre-downloaded files first |
| `.github/workflows/ci-doctor.lock.yml` | Auto-generated lock file with updated metadata hash and compiled pre-download step |
Comments suppressed due to low confidence (1)
.github/workflows/ci-doctor.md:118
- The grep pattern `exit status [^0]` will match "exit status 1" but also "exit status " followed by any non-zero character (including letters, symbols, etc.), not just non-zero exit codes. The pattern should be `exit status [1-9]` to match only non-zero numeric exit codes, or `exit (code|status) [1-9]` to cover both common patterns.

      grep -n -iE "(error[: ]|ERROR|FAIL|panic:|fatal[: ]|undefined[: ]|exception|exit status [^0])" \
    find "$ARTIFACT_DIR" -type f | while read -r f; do
      echo " $f"
    done

The `find` command on line 142 in the summary section could fail or produce unexpected output if `$ARTIFACT_DIR` is empty or doesn't exist. While the directory is created earlier, if artifact download fails, the `find` might still print an error to stderr. Consider adding a check: `[ -d "$ARTIFACT_DIR" ] && find "$ARTIFACT_DIR" -type f || echo " (no artifacts downloaded)"` to handle this edge case gracefully.

    - find "$ARTIFACT_DIR" -type f | while read -r f; do
    -   echo " $f"
    - done
    + if [ -d "$ARTIFACT_DIR" ]; then
    +   find "$ARTIFACT_DIR" -type f | while read -r f; do
    +     echo " $f"
    +   done
    + else
    +   echo " (no artifacts downloaded)"
    + fi

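The guard above covers a missing directory; an existing but empty directory would still print nothing. A minimal sketch that handles both cases follows. The function name `list_artifacts` is hypothetical and not part of the PR.

```shell
# Hypothetical helper: list artifact files, or print a placeholder when the
# directory is missing or contains no regular files.
# The body runs in a subshell so its variables do not leak to the caller.
list_artifacts() (
  dir="$1"
  files=""
  if [ -d "$dir" ]; then
    files=$(find "$dir" -type f | sort)
  fi
  if [ -n "$files" ]; then
    printf '%s\n' "$files" | while IFS= read -r f; do
      echo "  $f"
    done
  else
    echo "  (no artifacts downloaded)"
  fi
)
```

Calling `list_artifacts "$ARTIFACT_DIR"` in the summary section would then always emit at least one line.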
    # Apply generic heuristics: find lines with common error indicators
    HINTS_FILE="$FILTERED_DIR/job-${JOB_ID}-hints.txt"
    grep -n -iE "(error[: ]|ERROR|FAIL|panic:|fatal[: ]|undefined[: ]|exception|exit status [^0])" \

The grep pattern `exit status [^0]` will match "exit status 1" but also "exit status " followed by any non-zero character (including letters, symbols, etc.), not just non-zero exit codes. The pattern should be `exit status [1-9]` to match only non-zero numeric exit codes, or `exit (code|status) [1-9]` to cover both common patterns.
This issue also appears on line 118 of the same file.

    - grep -n -iE "(error[: ]|ERROR|FAIL|panic:|fatal[: ]|undefined[: ]|exception|exit status [^0])" \
    + grep -n -iE "(error[: ]|ERROR|FAIL|panic:|fatal[: ]|undefined[: ]|exception|exit status [1-9])" \

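The difference between the two character classes can be checked on a few sample lines. This is a small sketch; the sample lines are made up for illustration and do not come from a real CI log.

```shell
# Made-up sample lines; only some report a non-zero numeric exit code.
sample_lines() {
  printf '%s\n' \
    'exit status 0' \
    'exit status 1' \
    'exit status unknown' \
    'exit code 2'
}

# Original class: any character except "0" after "exit status ".
old_matches=$(sample_lines | grep -iE 'exit status [^0]')

# Suggested class: a leading digit 1-9, for both "code" and "status".
new_matches=$(sample_lines | grep -iE 'exit (code|status) [1-9]')

printf 'old pattern:\n%s\n\nnew pattern:\n%s\n' "$old_matches" "$new_matches"
```

The old pattern keeps `exit status unknown` and misses `exit code 2`; the suggested alternation does the opposite.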
    -name "*.txt" -o -name "*.log" -o -name "*.json" \
    -o -name "*.xml" -o -name "*.out" -o -name "*.err" \
    \) | while read -r ARTIFACT_FILE; do
      REL_PATH="${ARTIFACT_FILE#$ARTIFACT_DIR/}"

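The variable substitution `${ARTIFACT_FILE#$ARTIFACT_DIR/}` should quote `$ARTIFACT_DIR` so that the path is matched literally. Unquoted, its value is interpreted as a glob pattern, so a path containing characters such as `*`, `?`, or `[` may fail to match (word splitting is not the concern here, since the expansion sits inside a double-quoted assignment). Change to `${ARTIFACT_FILE#"$ARTIFACT_DIR"/}` for safer string manipulation.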

    - REL_PATH="${ARTIFACT_FILE#$ARTIFACT_DIR/}"
    + REL_PATH="${ARTIFACT_FILE#"$ARTIFACT_DIR"/}"

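A short sketch of when the inner quotes change the result. The paths are hypothetical. With a space in the directory name both forms behave the same, but with a glob character such as `[` only the quoted form strips the prefix.

```shell
# Hypothetical paths, for illustration only.

# Case 1: directory name with a space. Both forms strip the prefix.
ARTIFACT_DIR='/tmp/my artifacts'
ARTIFACT_FILE='/tmp/my artifacts/report.txt'
space_unquoted="${ARTIFACT_FILE#$ARTIFACT_DIR/}"
space_quoted="${ARTIFACT_FILE#"$ARTIFACT_DIR"/}"

# Case 2: directory name with glob characters. Unquoted, "[1]" is read as a
# bracket expression matching the single character "1", so the prefix does
# not match and nothing is removed.
ARTIFACT_DIR='/tmp/art[1]'
ARTIFACT_FILE='/tmp/art[1]/report.txt'
glob_unquoted="${ARTIFACT_FILE#$ARTIFACT_DIR/}"
glob_quoted="${ARTIFACT_FILE#"$ARTIFACT_DIR"/}"

printf '%s\n' "$space_unquoted" "$space_quoted" "$glob_unquoted" "$glob_quoted"
```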
CI Failure Doctor was consuming 1.5M tokens/run because the agent loaded full log payloads into conversation context, then burned 5–8 LLM turns running Python grep scripts to locate the actual error.
Changes
Pre-download step (frontmatter `steps:`)
Runs before the agent session starts, so it adds zero extra agent turns:
- `gh api` → `failed-jobs.json` and `/tmp/ci-doctor/logs/job-<id>.log`
- `gh run download` → `/tmp/ci-doctor/artifacts/` (auto-extracted by artifact name)
- Generic heuristics (`error:`, `FAIL`, `panic:`, `fatal:`, `undefined:`, `exception`, non-zero exit status) applied to both log files and artifact text files (`.txt`, `.log`, `.json`, `.xml`, `.out`, `.err`) → filtered hint files in `/tmp/ci-doctor/filtered/*-hints.txt`
- `/tmp/ci-doctor/summary.txt` enumerating all file locations and the first 3 hint matches per file

Directory layout:
- `/tmp/ci-doctor/logs/failed-jobs.json`
- `/tmp/ci-doctor/artifacts/`
- `/tmp/ci-doctor/filtered/`
- `/tmp/ci-doctor/summary.txt`

Agent instructions
- `summary.txt` as the starting point
- `get_job_logs` reduced from `tail_lines=200` → `tail_lines=100`

The heuristics are intentionally generic (no job-name-specific patterns) so they work across any CI job type and artifact format.
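The hint-extraction part of the step can be sketched roughly as follows. This is an illustration under assumptions, not the compiled step: the function name `write_hints` and the summary format are hypothetical, and the exit-status alternative uses the `[1-9]` form suggested in review.

```shell
# Hypothetical helper: grep one file for generic error indicators, write the
# numbered matches to a hints file, and append the first three matches to a
# summary file. The body runs in a subshell so variables stay local.
write_hints() (
  src="$1"
  hints="$2"
  summary="$3"
  pattern='(error[: ]|FAIL|panic:|fatal[: ]|undefined[: ]|exception|exit (code|status) [1-9])'
  if grep -n -iE "$pattern" "$src" > "$hints"; then
    {
      echo "$src -> $hints"
      head -n 3 "$hints" | sed 's/^/  /'
    } >> "$summary"
  else
    # No matches: drop the empty hints file so the summary stays accurate.
    rm -f "$hints"
  fi
)
```

Because the pattern does not depend on job names, the same call would apply to a job log or to any text artifact.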
Security: context expressions (`${{ github.event.workflow_run.id }}`) are passed via `env:` vars rather than interpolated directly in the shell script to prevent template injection.
Original prompt
This section details the original issue you should resolve
<issue_title>[q] fix(ci-doctor): add error pre-location heuristics to reduce token usage</issue_title>
<issue_description>## Q Workflow Optimization Report
Issues Found (from live data)
CI Failure Doctor was consuming up to ~1.5M tokens per run (10 runs analyzed, 8.7M total tokens over 7 days, about 0.87M per run on average).
Root cause analysis of runs `22260230861` (1.55M tokens) and `22260291762` (1.47M tokens):
- `get_job_logs` returned oversized MCP payloads (flagged as "too large for MCP response"), causing the Copilot agent to iterate over them with multiple Python parsing scripts before finding the actual error
- `tail_lines=200` was retrieving large verbose CI log sections when the error was typically in 2–3 lines
- `get_workflow_run` → `list_workflow_jobs` → `get_job_logs` (large payload) → 5–8 Python grep scripts → error found

Changes Made
`.github/workflows/ci-doctor.md`
Added Phase 1.5: Error Pre-Location Heuristics, a new phase inserted between Phase 1 (triage) and Phase 2 (log analysis) that:
- uses `list_workflow_jobs` (zero extra API calls)
- greps for job-specific patterns (`lint-go` → grep for `unused`, `SA`; `test-unit` → grep for `--- FAIL`; `build` → grep for `undefined:`)
- uses `tail_lines=50` when failure type is clear, `tail_lines=100` when ambiguous

Reduced `tail_lines` from 200 → 100 in Phase 2 for initial log retrieval
Added instruction to grep-extract error context (10 lines before/after the match) instead of loading the full log payload into conversation context
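
The grep-extract instruction could look something like the sketch below. The function name `extract_error_context` is hypothetical; `-B` and `-A` are the context options offered by GNU and BSD grep.

```shell
# Hypothetical helper: print each match with 10 lines of context on either
# side, prefixed with line numbers, instead of reading the whole log.
# -e protects patterns that start with a dash, such as "--- FAIL".
extract_error_context() (
  log_file="$1"
  pattern="$2"
  grep -n -E -B 10 -A 10 -e "$pattern" "$log_file"
)
```

For a single failing test, `extract_error_context job.log '--- FAIL'` would return at most 21 lines regardless of how long the log is.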
Expected Improvements
- `tail_lines=50` for clear-cut failures (lint, build) reduces MCP payload size significantly

Validation
✅ `ci-doctor.md` compiled successfully with `gh aw compile`
Note: `.lock.yml` will be regenerated automatically after merge.
References
- `22260230861` (1.55M tokens), `22260291762` (1.47M tokens), `22260087423` (876K tokens)
- `/tmp/gh-aw/aw-mcp/logs/run-22260230861/agent-stdio.log`

To apply the patch locally: