fix(ci-doctor): pre-download logs and artifacts, apply generic error heuristics to reduce token usage #17719

Merged
pelikhan merged 3 commits into main from copilot/optimize-ci-doctor-error-handling
Feb 22, 2026

Copilot AI commented Feb 22, 2026

CI Failure Doctor was consuming 1.5M tokens/run because the agent loaded full log payloads into conversation context, then burned 5–8 LLM turns running Python grep scripts to locate the actual error.

Changes

Pre-download step (frontmatter steps:)
Runs before the agent session starts — zero extra agent turns:

  • Fetches failed jobs + their failed step names via gh api → failed-jobs.json
  • Downloads each failed job's full log → /tmp/ci-doctor/logs/job-<id>.log
  • Downloads and unpacks all workflow run artifacts via gh run download → /tmp/ci-doctor/artifacts/ (auto-extracted by artifact name)
  • Applies generic grep heuristics (case-insensitive: error:, FAIL, panic:, fatal:, undefined:, exception, non-zero exit status) to both log files and artifact text files (.txt, .log, .json, .xml, .out, .err) → filtered hint files in /tmp/ci-doctor/filtered/*-hints.txt
  • Writes /tmp/ci-doctor/summary.txt enumerating all file locations and first 3 hint matches per file
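
The hint-extraction bullet above can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual script: the helper name `write_hints` is hypothetical, and the pattern is the one quoted in the review comments on this PR, using the `[1-9]` variant suggested there.

```shell
# Sketch only: write_hints is a hypothetical helper, not the PR's real code.

# Generic error indicators, matched case-insensitively (grep -i).
HINT_PATTERN='(error[: ]|ERROR|FAIL|panic:|fatal[: ]|undefined[: ]|exception|exit status [1-9])'

# write_hints LOG_FILE HINTS_FILE
# Copies every matching line, prefixed with its line number, into HINTS_FILE.
# grep exits with status 1 when nothing matches, so "|| true" keeps a
# "set -e" script running and simply leaves an empty hints file.
write_hints() {
  local log_file="$1" hints_file="$2"
  grep -n -iE "$HINT_PATTERN" "$log_file" > "$hints_file" || true
}
```

Because `grep -n` keeps the original line numbers, each hint line can later be used to jump back into the full log.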

Directory layout:

| Path | Contents |
| --- | --- |
| /tmp/ci-doctor/logs/ | Job log files + failed-jobs.json |
| /tmp/ci-doctor/artifacts/ | Unpacked artifact files (by artifact name) |
| /tmp/ci-doctor/filtered/ | Hint files for both logs and artifacts |
| /tmp/ci-doctor/summary.txt | Starting point listing all locations + first hints |
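
As a rough sketch of how a summary file over this layout could be produced (the helper name `write_summary` and the section headings are assumptions for illustration, not taken from the PR):

```shell
# Sketch only: write_summary is a hypothetical helper.
# write_summary BASE_DIR
# Lists every file under logs/, artifacts/ and filtered/, then shows the
# first three matches from each non-empty hint file.
write_summary() {
  local base="$1" sub hints
  {
    echo "== Files =="
    for sub in logs artifacts filtered; do
      if [ -d "$base/$sub" ]; then
        find "$base/$sub" -type f | sort
      fi
    done
    echo "== First hints =="
    for hints in "$base"/filtered/*-hints.txt; do
      # Skips empty hint files and the literal pattern left by an unmatched glob.
      [ -s "$hints" ] || continue
      echo "--- $hints"
      head -n 3 "$hints"
    done
  } > "$base/summary.txt"
}
```

Calling `write_summary /tmp/ci-doctor` would then leave the overview in `/tmp/ci-doctor/summary.txt`.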

Agent instructions

  • New Pre-Analysis Data section directs the agent to summary.txt as the starting point
  • Phase 2 revised: read hint files first, inspect ±10 lines around hinted line numbers, check artifacts for structured output, only load the full log if hints are insufficient
  • Fallback get_job_logs reduced from tail_lines=200 → tail_lines=100
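
The "inspect ±10 lines around hinted line numbers" instruction could look something like this in practice. A minimal sketch, assuming the hint files use the `LINE:text` format that `grep -n` produces; `show_context` is a hypothetical helper name.

```shell
# Sketch only: show_context is a hypothetical helper.
# show_context FILE LINE [RADIUS]
# Prints LINE plus RADIUS lines on either side (default 10), clamped at line 1.
show_context() {
  local file="$1" line="$2" radius="${3:-10}"
  local start=$(( line - radius ))
  local end=$(( line + radius ))
  if [ "$start" -lt 1 ]; then
    start=1
  fi
  sed -n "${start},${end}p" "$file"
}

# Usage: take the line number from the first hint and view it in context.
#   first_line=$(head -n 1 hints.txt | cut -d: -f1)
#   show_context job.log "$first_line"
```

This keeps the amount of log text read per hint to at most 21 lines instead of the whole file.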

The heuristics are intentionally generic (no job-name-specific patterns) so they work across any CI job type and artifact format.

Security: context expressions (${{ github.event.workflow_run.id }}) are passed via env: vars rather than interpolated directly in the shell script to prevent template injection.
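
Seen from the shell side, the env: approach might look like the sketch below. It assumes the step maps the expression to an environment variable, called `RUN_ID` here (a hypothetical name; the actual variable name is not shown in this thread). The digits-only check is an extra precaution added for illustration and is not part of the PR as described.

```shell
# Sketch only: the script reads the run ID from the environment instead of
# having the ${{ ... }} expression text pasted into the script body.
# RUN_ID is a hypothetical name, assumed to be set via the step's env: block.

# is_numeric_id VALUE
# Succeeds only when VALUE is a non-empty string of digits.
is_numeric_id() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Example guard at the top of a script:
#   is_numeric_id "${RUN_ID:-}" || { echo "unexpected RUN_ID" >&2; exit 1; }
#   gh run download "$RUN_ID" -D /tmp/ci-doctor/artifacts
```

Since the value only ever appears as a quoted variable, the shell treats it as data rather than as script text.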

Original prompt

This section details the original issue you should resolve

<issue_title>[q] fix(ci-doctor): add error pre-location heuristics to reduce token usage</issue_title>
<issue_description>## Q Workflow Optimization Report

Issues Found (from live data)

CI Failure Doctor was consuming 1.5M tokens per run (10 runs analyzed, 8.7M total tokens over 7 days).

Root cause analysis of runs 22260230861 (1.55M tokens) and 22260291762 (1.47M tokens):

  • The agent called get_job_logs which returned oversized MCP payloads (flagged as "too large for MCP response"), causing the Copilot agent to iterate over them with multiple Python parsing scripts before finding the actual error
  • tail_lines=200 was retrieving large verbose CI log sections when the error was typically in 2–3 lines
  • No lightweight pre-triage step: the agent had no way to classify the failure type before loading logs, so it loaded everything and searched iteratively
  • Typical turn pattern: get_workflow_run → list_workflow_jobs → get_job_logs (large payload) → 5–8 Python grep scripts → error found

Changes Made

.github/workflows/ci-doctor.md

  1. Added Phase 1.5: Error Pre-Location Heuristics — a new phase inserted between Phase 1 (triage) and Phase 2 (log analysis) that:

    • Extracts the exact step name(s) that failed from list_workflow_jobs (zero extra API calls)
    • Classifies failure type from job name heuristics (e.g., lint-go → grep for unused, SA; test-unit → grep for --- FAIL; build → grep for undefined:)
    • Builds a targeted grep pattern before any log download
    • Sets a decision gate: tail_lines=50 when failure type is clear, tail_lines=100 when ambiguous
  2. Reduced tail_lines from 200 → 100 in Phase 2 for initial log retrieval

  3. Added instruction to grep-extract error context (10 lines before/after the match) instead of loading the full log payload into conversation context
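
For illustration, the job-name classification proposed in this issue (which the merged PR replaced with generic patterns) could be sketched as below. The job-name prefixes and patterns come from the examples in the list above; the fallback pattern and both helper names are assumptions.

```shell
# Sketch only: both helpers are hypothetical.
# pattern_for_job JOB_NAME
# Prints a grep -E pattern for the failure type implied by the job name.
pattern_for_job() {
  case "$1" in
    lint-go*)   printf '%s\n' 'unused|SA' ;;
    test-unit*) printf '%s\n' '--- FAIL' ;;
    build*)     printf '%s\n' 'undefined:' ;;
    *)          printf '%s\n' 'error|FAIL' ;;  # assumed fallback, not from the issue
  esac
}

# tail_lines_for_job JOB_NAME
# Decision gate: 50 lines when the failure type is clear, 100 when ambiguous.
tail_lines_for_job() {
  case "$1" in
    lint-go*|test-unit*|build*) echo 50 ;;
    *)                          echo 100 ;;
  esac
}

# Usage: pass the pattern with -e, because '--- FAIL' starts with a dash.
#   grep -n -E -e "$(pattern_for_job test-unit)" job.log
```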

Expected Improvements

  • Token reduction estimate: ~40–60% fewer tokens per run by avoiding iterative Python parsing of large payload files
  • Targeted grep patterns mean the error is found in 1–2 shell commands instead of 5–8 Python scripts
  • tail_lines=50 for clear-cut failures (lint, build) reduces MCP payload size significantly

Validation

ci-doctor.md compiled successfully with gh aw compile

Note: .lock.yml will be regenerated automatically after merge.

References

  • Run IDs investigated: 22260230861 (1.55M tokens), 22260291762 (1.47M tokens), 22260087423 (876K tokens)
  • 10-run aggregate: 8.7M total tokens across successful CI Doctor runs
  • Agent log showing iterative Python parsing: /tmp/gh-aw/aw-mcp/logs/run-22260230861/agent-stdio.log

🎩 Equipped by Q

  • expires on Feb 24, 2026, 1:23 PM UTC

[!NOTE]
This was originally intended as a pull request, but the git push operation failed.

Workflow Run: View run details and download patch artifact

The patch file is available in the agent-artifacts artifact in the workflow run linked above.

To apply the patch locally:

# Download the artifact from the workflow run https://github.com/github/gh-aw/actions/runs/22277901060
# (Use GitHub MCP tools if gh CLI is not available)
gh run download 22277901060 -n agent-artifacts -D /tmp/agent-artifacts-22277901060

# The patch file will be at agent-artifacts/tmp/gh-aw/aw-q-ci-doctor-token-optimization.patch after download
# Apply the patch
git am /tmp/agent-artifacts-22277901060/aw-q-ci-doctor-token-optimization.patch
Show patch preview (60 of 60 lines)
From 251a0692dbf88c9f343939a6660adf73781fd77f Mon Sep 17 00:00:00 2001
From: Copilot <copilot@github.com>
Date: Sun, 22 Feb 2026 13:22:03 +0000
Subject: [PATCH] fix(ci-doctor): add pre-location heuristics to reduce token
 usage

- Add Phase 1.5: Error Pre-Location Heuristics that classifies failure
  type from job/step names before downloading any logs
- Build targeted grep patterns per failure type (lint-go, test, build,
  validate-yaml) to extract only relevant error lines
- Reduce tail_lines from 200 to 100 for initial log retrieval
- Add decision gate: 50 lines for clear failures, 100 for ambiguous
- Instruct agent to grep-extract error context instead of loading full
  log into conversation context

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .github/workflows/ci-doctor.md | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/ci-doctor.md b/.github/workflows/ci-doctor.md
inde...

</details>



<!-- START COPILOT CODING AGENT SUFFIX -->

- Fixes github/gh-aw#17711


…reduce token usage

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix CI Doctor to improve token usage with error pre-location heuristics" to "fix(ci-doctor): pre-download logs and apply generic error heuristics to reduce token usage" on Feb 22, 2026

pelikhan left a comment

@copilot update step to download and unpack all files and artifacts to files, then produce filtered files as well.

…to artifact files

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

Copilot AI commented Feb 22, 2026

> @copilot update step to download and unpack all files and artifacts to files, then produce filtered files as well.

Done in 046df74. The step now:

  1. Downloads and unpacks all workflow run artifacts to /tmp/ci-doctor/artifacts/ via gh run download (auto-extracts by artifact name)
  2. Applies the same generic grep heuristics to artifact text files (.txt, .log, .json, .xml, .out, .err) → filtered hint files in /tmp/ci-doctor/filtered/artifact-<name>-hints.txt
  3. The summary at /tmp/ci-doctor/summary.txt enumerates all log files, all artifact files, and all hint files with their first few matches

Directory layout:

  • /tmp/ci-doctor/logs/ — job log files + failed-jobs.json
  • /tmp/ci-doctor/artifacts/ — unpacked artifact files
  • /tmp/ci-doctor/filtered/ — hint files for both logs and artifacts
  • /tmp/ci-doctor/summary.txt — starting point for the agent

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • https://api.github.com/graphql
    • Triggering command: /usr/bin/gh /usr/bin/gh api graphql -f query=query($owner: String!, $name: String!) { repository(owner: $owner, name: $name) { hasDiscussionsEnabled } } -f owner=github -f name=gh-aw /home/REDACTED/work/gh-aw/gh-aw/.github/workflows/ci-doctor.md /home/REDACTED/work/gh-aw/gh-aw/.github/workflows/claude-code-user-docs-review.md /home/REDACTED/work/gh-aw/gh-aw/.github/workflows/cli-consistency-checker.md (http block)

If you need me to access, download, or install something from one of these locations, you can either:

Copilot AI changed the title from "fix(ci-doctor): pre-download logs and apply generic error heuristics to reduce token usage" to "fix(ci-doctor): pre-download logs and artifacts, apply generic error heuristics to reduce token usage" on Feb 22, 2026
pelikhan marked this pull request as ready for review February 22, 2026 13:56
Copilot AI review requested due to automatic review settings February 22, 2026 13:56
pelikhan merged commit 6221cdc into main Feb 22, 2026
1 check passed
pelikhan deleted the copilot/optimize-ci-doctor-error-handling branch February 22, 2026 13:56

Copilot AI left a comment

Pull request overview

This PR optimizes the CI Failure Doctor workflow to reduce token consumption from ~1.5M tokens per run by pre-downloading logs and artifacts before the agent session starts, then applying generic error-detection heuristics to create focused hint files that the agent can examine first.

Changes:

  • Added a pre-download step that fetches failed job logs and artifacts, applies grep heuristics to identify error lines, and creates a summary file for the agent to start with
  • Updated agent instructions to prioritize pre-downloaded hint files over full log retrieval
  • Reduced fallback tail_lines parameter from 200 to 100 for MCP log retrieval

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| .github/workflows/ci-doctor.md | Added pre-download step with generic error heuristics, new Pre-Analysis Data section, and updated investigation protocol to use pre-downloaded files first |
| .github/workflows/ci-doctor.lock.yml | Auto-generated lock file with updated metadata hash and compiled pre-download step |
Comments suppressed due to low confidence (1)

.github/workflows/ci-doctor.md:118

  • The grep pattern exit status [^0] will match "exit status 1" but also "exit status " followed by any character other than 0 (including letters, symbols, etc.), not just non-zero exit codes. The pattern should be exit status [1-9] to match only exit codes that start with a non-zero digit, or exit (code|status) [1-9] to cover both common patterns.
        grep -n -iE "(error[: ]|ERROR|FAIL|panic:|fatal[: ]|undefined[: ]|exception|exit status [^0])" \

Comment on lines +142 to +144
find "$ARTIFACT_DIR" -type f | while read -r f; do
echo " $f"
done
Copilot AI Feb 22, 2026

The find command on line 142 in the summary section could fail or produce unexpected output if $ARTIFACT_DIR is empty or doesn't exist. While the directory is created earlier, if artifact download fails, the find might still print an error to stderr. Consider adding a check: [ -d "$ARTIFACT_DIR" ] && find "$ARTIFACT_DIR" -type f || echo " (no artifacts downloaded)" to handle this edge case gracefully.

Suggested change
-find "$ARTIFACT_DIR" -type f | while read -r f; do
-  echo " $f"
-done
+if [ -d "$ARTIFACT_DIR" ]; then
+  find "$ARTIFACT_DIR" -type f | while read -r f; do
+    echo " $f"
+  done
+else
+  echo " (no artifacts downloaded)"
+fi

# Apply generic heuristics: find lines with common error indicators
HINTS_FILE="$FILTERED_DIR/job-${JOB_ID}-hints.txt"
grep -n -iE "(error[: ]|ERROR|FAIL|panic:|fatal[: ]|undefined[: ]|exception|exit status [^0])" \
Copilot AI Feb 22, 2026

The grep pattern exit status [^0] will match "exit status 1" but also "exit status " followed by any character other than 0 (including letters, symbols, etc.), not just non-zero exit codes. The pattern should be exit status [1-9] to match only exit codes that start with a non-zero digit, or exit (code|status) [1-9] to cover both common patterns.

This issue also appears on line 118 of the same file.

Suggested change
-grep -n -iE "(error[: ]|ERROR|FAIL|panic:|fatal[: ]|undefined[: ]|exception|exit status [^0])" \
+grep -n -iE "(error[: ]|ERROR|FAIL|panic:|fatal[: ]|undefined[: ]|exception|exit status [1-9])" \
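
The difference between the two character classes can be checked with a small sketch like the one below. The sample lines and the helper name `count_matches` are made up for illustration.

```shell
# Sketch only: count_matches is a hypothetical helper.
# count_matches PATTERN
# Prints how many stdin lines match PATTERN (extended regex, case-insensitive).
# grep -c exits with status 1 on zero matches, hence "|| true".
count_matches() {
  grep -ciE "$1" || true
}

sample='exit status 1
exit status 0
exit status unknown'

# [^0] accepts any character except 0, so "unknown" also matches: prints 2
printf '%s\n' "$sample" | count_matches 'exit status [^0]'
# [1-9] accepts only a non-zero digit: prints 1
printf '%s\n' "$sample" | count_matches 'exit status [1-9]'
```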

-name "*.txt" -o -name "*.log" -o -name "*.json" \
-o -name "*.xml" -o -name "*.out" -o -name "*.err" \
\) | while read -r ARTIFACT_FILE; do
REL_PATH="${ARTIFACT_FILE#$ARTIFACT_DIR/}"
Copilot AI Feb 22, 2026

The parameter expansion ${ARTIFACT_FILE#$ARTIFACT_DIR/} should quote $ARTIFACT_DIR so that the directory path is matched literally; unquoted, any glob characters in the path (*, ?, [) are interpreted as a pattern. Change to ${ARTIFACT_FILE#"$ARTIFACT_DIR"/} for safer string manipulation.

Suggested change
-REL_PATH="${ARTIFACT_FILE#$ARTIFACT_DIR/}"
+REL_PATH="${ARTIFACT_FILE#"$ARTIFACT_DIR"/}"
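
A small sketch of when the inner quoting changes the result in bash. The paths and both helper names are made up for illustration. Inside a double-quoted expansion a space in the directory name is handled either way; the two forms differ when the name contains glob characters such as `[`.

```shell
# Sketch only: both helpers are hypothetical, for side-by-side comparison.
# strip_prefix_unquoted FILE DIR: DIR is treated as a glob pattern.
strip_prefix_unquoted() {
  local file="$1" dir="$2"
  printf '%s\n' "${file#$dir/}"
}

# strip_prefix_quoted FILE DIR: DIR is matched literally.
strip_prefix_quoted() {
  local file="$1" dir="$2"
  printf '%s\n' "${file#"$dir"/}"
}

# With a space in the directory name, both forms strip the prefix:
#   strip_prefix_unquoted '/tmp/my dir/a.log' '/tmp/my dir'
#   strip_prefix_quoted   '/tmp/my dir/a.log' '/tmp/my dir'
# With '[1]' in the name, the unquoted form reads it as a bracket expression,
# fails to match the literal "[1]", and returns the path unchanged:
#   strip_prefix_unquoted '/tmp/run[1]/a.log' '/tmp/run[1]'
#   strip_prefix_quoted   '/tmp/run[1]/a.log' '/tmp/run[1]'
```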
