[WIP] Automated CI Failure Analysis and Transient Failure Tracking #21228
Conversation
- Update the error message to reflect that the error is likely a known transient failure.
- Track explicit issue numbers with errors so we can more easily track information about an error over time.
This doesn't work - the artifact downloads are shitty (at least on my McWifi) and the HTML parsing of the logs is seemingly error prone - every time I run it, it parses the HTML it downloads differently, sometimes taking multiple iterations of writing complex Python scripts. Part of the problem could be fixed if we had the JSON output introduced in this PR - it is a bit chicken-and-egg that way. I was hoping to use this as a request for comment and get advice on how to harden it. Are we vaguely okay with saving the JSON artifacts? Do the Claude-heads have ideas on how to harden it - maybe guide it toward a more reproducible process somehow? Do we have Claude-skeptics who think this should just be a script we can run directly instead of a Claude command? I would love to have someone else using it and making sure it isn't overly John and reflects a broader Galaxy sensibility. If this worked correctly, we would have all the failing artifacts locally and could easily write another Claude command to just fix the PR. Once that loop could be automated, I could write a little script to review all my results every morning and summarize the fixes - this is whole days of my work some days because I suck at context switching.
    
This is an amazing idea! 💯

Yeah, from what I can see in this very PR, the JSON output from …

Maybe even having a "simple script" that, from the CI logs, extracts the relevant failures with some pattern matching or similar, so we have a more focused output, something like:

Example summary for a test failure

Just a series of blocks with the test that failed + reason, and then the stack trace (or any other structured format). It may be a bit difficult to find a good way to extract the failures since every test suite might be a little different 🤔

I think having just this focused summary will already be extremely helpful to identify which tests failed and for what reason, without scanning large reports. If we want to feed this summarized version to Claude or any other AI agent to improve the report even more, create the PR comment, etc., that would likely yield better results.
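A minimal sketch of what such a "simple script" could look like, assuming plain pytest-style log lines (the pattern and the example path are hypothetical and would need tuning per test suite):

```python
import re
import sys

# Heuristic pattern for pytest-style summary lines such as
# "FAILED lib/galaxy_test/api/test_histories.py::test_import - AssertionError: ...".
FAILURE_RE = re.compile(r"^(FAILED|ERROR)\s+(\S+)(?:\s+-\s+(.*))?$")


def extract_failures(log_path):
    """Return (outcome, test id, short reason) tuples found in a raw CI log."""
    failures = []
    with open(log_path, errors="replace") as fh:
        for line in fh:
            match = FAILURE_RE.match(line.strip())
            if match:
                outcome, test_id, reason = match.groups()
                failures.append((outcome, test_id, reason or "see full log"))
    return failures


if __name__ == "__main__":
    for outcome, test_id, reason in extract_failures(sys.argv[1]):
        print(f"{outcome}: {test_id}\n  reason: {reason}\n")
```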
    
@davelopez I think you're right about the size of the outputs being very large - but I think Claude is smart enough to convert to JSON and just check for errors. Also, thank you for pointing out the size of the artifacts - for some reason the JSON outputs are bigger than the HTML outputs. I would never have caught that of course - that is wild, and I am going to push a fix that I hope fixes this.

... So I made some serious progress on this by offloading all of the artifact downloading and HTML -> JSON conversion to a standalone project (https://github.com/jmchilton/gh-ci-artifacts). It isn't Galaxy-specific, but it can read a config file and be adapted to Galaxy - I am pushing that as well. This makes the artifact download basically instantaneous. In addition to downloading and converting all the artifacts, it produces a little interactive explorer on the local disk that is much faster than the GitHub interface. With an update to the command to use this, I tried it on #21232 and it produces this output: … I mean, I bet that is good enough for Claude to fix the PR directly. I also added a …
    
Use `--json-report --json-report-file=PATH` after `--` instead of a custom arg.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Force-pushed from f6ca57a to 265b44f
    
Automated CI Failure Analysis and Transient Failure Tracking
The Problem
Galaxy developers and PR reviewers spend significant time analyzing test failures. With 500+ test failures per month, the manual triage workflow is repeated dozens of times per day by PR reviewers and maintainers.
For contributors (especially first-time or irregular contributors), it's worse.
The Cost
- Time per PR review: 5-10 minutes of context switching and log inspection
- Frequency: ~100 PRs/month with CI failures
- Annual cost: ~50-100 hours of reviewer time
- Data transfer: 10-50 MB of HTML logs per failure (slow on mobile/poor connections)
- Context switching: a major productivity killer that breaks flow state
For 20 years, we've tried to eliminate transient failures. We haven't succeeded. They're an inherent characteristic of a complex system with browser automation, race conditions, external service dependencies, and parallel test execution. We should (and do) work to minimize them, but pretending we'll ever get to zero is unrealistic.
We need tooling to manage them efficiently.
The Solution
This PR introduces three improvements that work together:
1. JSON Test Artifacts (10x-100x Smaller, 100x Faster to Parse)
Current state: test failures are only available as large HTML reports.
New state: a structured JSON report is uploaded alongside them, parseable with `jq` or `json.load()`.

Impact: artifact downloads go from 5-10 seconds to <1 second. Analysis scripts run in milliseconds instead of seconds.
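For example, once the JSON report is downloaded, a few lines of Python can pull out just the failures. This sketch assumes a pytest-json-report style layout; the exact schema Galaxy emits may differ:

```python
import json
import sys


def failed_tests(report_path):
    """Yield (test id, failure detail) pairs from a JSON test report."""
    with open(report_path) as fh:
        report = json.load(fh)
    for test in report.get("tests", []):
        if test.get("outcome") in ("failed", "error"):
            # "longrepr" holds the traceback in pytest-json-report output;
            # adjust the key names to match the actual report schema.
            yield test["nodeid"], test.get("call", {}).get("longrepr", "")


if __name__ == "__main__":
    for node_id, detail in failed_tests(sys.argv[1]):
        print(node_id)
        print(detail)
        print("-" * 40)
```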
2. Transient Failure Decorator
What it does: a test annotated with the `@transient_failure` decorator re-raises its original error with the tracked issue number prefixed, e.g. `TRANSIENT FAILURE [Issue #12345]: <original error>`.

Impact: this helps everyone, not just AI/automation users. Raw CI logs now show the `TRANSIENT FAILURE [Issue #12345]: <original error>` message instead of a bare, unexplained failure. The first tells you immediately: known issue, tracked, safe to re-run. The second forces investigation.
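Usage might look something like this (the test name and issue number are hypothetical, and the exact decorator signature - e.g. whether the issue number is passed as an argument - is an assumption):

```python
from galaxy.util.unittest_utils import transient_failure


@transient_failure(issue=12345)  # hypothetical tracking issue
def test_workflow_editor_run(self):
    ...
```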
3. Automated Analysis Commands
For developers using Claude Code (or potentially other AI assistants):
- `/summarize_ci <PR#>` - downloads the failing artifacts to `database/pr_reviews/<PR#>/` and writes an analysis to `database/pr_reviews/<PR#>/summary`
- `/summarize_ci_post <PR#>` - posts that summary back to the PR

Benefits
For PR Reviewers (Manual or Automated)
Speed:
- Before: 5-10 minutes per PR with failures
- After: 30 seconds per PR - run `/summarize_ci <PR#>` (10s), or manually: look at the error message, see "TRANSIENT FAILURE [Issue #12345]", done.
- Saved per PR: 4-9 minutes
- Saved per month: 7-15 hours
- Saved per year: 80-180 hours
For Contributors
Result: faster feedback, less anxiety, a better experience.
For Maintainers
Marking a newly discovered transient failure becomes explicit with `/mark_transient`. Bonus: we can generate metrics - which tests are flakiest? Which issues are most common? (See the sketch below.)
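A rough sketch of such a metrics pass, assuming failure records that carry the tracked issue number (the record format and test ids here are hypothetical):

```python
from collections import Counter

# Hypothetical failure records: (test id, tracked issue number or None).
failures = [
    ("lib/galaxy_test/selenium/test_workflow_editor.py::test_run", 12345),
    ("lib/galaxy_test/api/test_histories.py::test_import", None),
    ("lib/galaxy_test/selenium/test_workflow_editor.py::test_run", 12345),
]

flakiest_tests = Counter(test_id for test_id, _ in failures)
known_transient_issues = Counter(issue for _, issue in failures if issue is not None)

print("Flakiest tests:", flakiest_tests.most_common(3))
print("Most common transient issues:", known_transient_issues.most_common(3))
```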
For Everyone
Data transfer: JSON artifacts are a fraction of the size of the HTML reports (~500 KB vs ~36 MB per failure).
Cognitive load: known transient failures are labeled up front, so less time is lost to investigation and context switching.
Implementation Details
Test Decorator (`lib/galaxy/util/unittest_utils/__init__.py`)

Safe exception handling:
- Re-raises using the original exception type where possible, falling back to `Exception` if the constructor is incompatible
- Preserves the original traceback via exception chaining (`from e`)
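A minimal sketch of the wrapper behavior described above - not the actual implementation - assuming the decorator takes the tracking issue number as an argument:

```python
import functools


def transient_failure(issue: int):
    """Label a test's failures as a known transient tracked by a GitHub issue."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as e:
                message = f"TRANSIENT FAILURE [Issue #{issue}]: {e}"
                try:
                    # Keep the original exception type when its constructor
                    # accepts a single message argument...
                    new_exc = type(e)(message)
                except TypeError:
                    # ...otherwise fall back to a plain Exception.
                    new_exc = Exception(message)
                # Chain with `from e` so the original traceback is preserved.
                raise new_exc from e

        return wrapper

    return decorator
```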
Workflow Changes (10 files)

Modified all test workflows to add the `--structured_data_report_file` flag.

Storage impact: adds ~1-2% per failure (500 KB JSON vs 36 MB HTML).
Automation Commands (`.claude/commands/`)

Three new commands for Claude Code users, including:
- `summarize_ci.md` - Analyze PR failures
- `summarize_ci_post.md` - Comment on PR

Non-Claude users can work from the JSON artifacts and the `TRANSIENT FAILURE` labels directly.
Addressing Root Causes
This is not a substitute for fixing transient failures. It's a management tool.
We've had transient test failures for 20 years. We should fix them when we can. But we also need to be pragmatic.
We will keep working to reduce them. But while they exist (and they always will to some degree), we need efficient ways to manage them.
This PR makes transient failures visible and trackable.
What This Doesn't Do
What This Does Do
Changes
Workflows (10 files):
- `.github/workflows/api.yaml`
- `.github/workflows/cwl_conformance.yaml`
- `.github/workflows/framework_tools.yaml`
- `.github/workflows/framework_workflows.yaml`
- `.github/workflows/integration.yaml`
- `.github/workflows/integration_selenium.yaml`
- `.github/workflows/main_tools.yaml`
- `.github/workflows/playwright.yaml`
- `.github/workflows/selenium.yaml`
- `.github/workflows/toolshed.yaml`

Library:
- `lib/galaxy/util/unittest_utils/__init__.py` - Add `@transient_failure` decorator

Commands:
- `.claude/commands/summarize_ci.md` - CI analysis
- `.claude/commands/summarize_ci_post.md` - Post summaries

Documentation:
- `WHY_JSON.md` - Rationale
- `ENABLE_JSON.md` - Usage guide

Example Workflow
For Claude Code users: run `/summarize_ci <PR#>`, review the summary in `database/pr_reviews/<PR#>/summary`, and optionally post it with `/summarize_ci_post <PR#>`.

For manual reviewers: read the error message; a `TRANSIENT FAILURE [Issue #...]` prefix means a known, tracked issue that is safe to re-run.

For contributors: a labeled transient failure means the failure is a known issue rather than something your change broke - re-run the job.

Marking new transient failures: annotate the test with `@transient_failure` (referencing the tracking issue) so future failures are labeled.
Summary
This PR makes transient failures explicit instead of implicit.
Benefits everyone: it doesn't hide problems, it makes them manageable.
We've lived with transient failures for 20 years. Let's manage them efficiently while we work to reduce them.