Skip to content

Conversation

@jmchilton
Copy link
Member

Automated CI Failure Analysis and Transient Failure Tracking

The Problem

Galaxy developers and PR reviewers spend significant time analyzing test failures. With 500+ test failures per month, the manual workflow is:

  1. Open GitHub Actions run (click, wait for page load)
  2. Find failed job (scroll through matrix)
  3. Click job (new page, 1-5 second load)
  4. Scroll through 10,000+ lines of logs
  5. Search for "FAILED" or error messages
  6. Determine if it's a known transient failure or new bug
  7. Check if issue already exists
  8. Decide: re-run, investigate, or merge

This happens dozens of times per day for PR reviewers and maintainers.

For contributors (especially first-time or irregular), it's worse:

  • They don't know which failures are transient
  • They waste time investigating flaky tests
  • They're unsure if their PR is safe to merge
  • They wait for maintainer guidance

The Cost

Time per PR review: 5-10 minutes of context switching and log inspection
Frequency: ~100 PRs/month with CI failures
Annual cost: ~50-100 hours of reviewer time
Data transfer: 10-50 MB of HTML logs per failure (slow on mobile/poor connections)
Context switching: Major productivity killer - breaks flow state

For 20 years, we've tried to eliminate transient failures. We haven't succeeded. They're an inherent characteristic of a complex system with browser automation, race conditions, external service dependencies, and parallel test execution. We should (and do) work to minimize them, but pretending we'll ever get to zero is unrealistic.

We need tooling to manage them efficiently.

The Solution

This PR introduces three improvements that work together:

1. JSON Test Artifacts (10x-100x Smaller, 100x Faster to Parse)

Current state:

  • HTML reports: 1.6 MB - 36 MB per failure
  • Requires parsing HTML to extract test names
  • Complex scraping logic

New state:

  • JSON reports: 100 KB - 500 KB per failure (alongside HTML)
  • Direct parsing with jq or json.load()
  • 90% smaller downloads
  • 100x simpler code

Impact: Artifact downloads go from 5-10 seconds to <1 second. Analysis scripts run in milliseconds instead of seconds.

2. Transient Failure Decorator

from galaxy.util.unittest_utils import transient_failure

@transient_failure(issue=12345)
def test_flaky_selenium_interaction(self):
    # Test that sometimes fails due to race condition
    ...

What it does:

  • Catches test exceptions
  • Rewraps message: TRANSIENT FAILURE [Issue #12345]: <original error>
  • Re-raises (test still fails, visible in all outputs)
  • Links to GitHub issue tracking the problem

Impact:

  • Manual reviewers see: Error message explicitly says "TRANSIENT FAILURE" with issue link
  • Automated tools see: Pattern-matchable marker for categorization
  • Contributors see: Clear signal this is known issue, not their fault

This helps everyone, not just AI/automation users. Looking at raw CI logs now shows:

FAILED test_workflow.py::test_save - TRANSIENT FAILURE [Issue #12345]: TimeoutException

Instead of:

FAILED test_workflow.py::test_save - TimeoutException

The first tells you immediately: known issue, tracked, safe to re-run. The second forces investigation.

3. Automated Analysis Commands

For developers using Claude Code (or potentially other AI assistants):

/summarize_ci <PR#>

  • Downloads JSON artifacts (fast) to database/pr_reviews/<PR#>/
  • Categorizes: ✅ Transient vs ❌ New
  • Shows issue numbers
  • Writes to database/pr_reviews/<PR#>/summary

/summarize_ci_post <PR#>

  • Posts categorized failure summary as PR comment
  • Helps reviewers and contributors see status at a glance

Benefits

For PR Reviewers (Manual or Automated)

Speed:

  • Before: 5-10 minutes per PR with failures

    • Navigate to Actions (15s)
    • Find failed job (20s)
    • Open logs (5-10s load time)
    • Scroll/search logs (2-5 min)
    • Check if known issue (1-3 min)
  • After: 30 seconds per PR

    • Run /summarize_ci <PR#> (10s)
    • Review categorized summary (10s)
    • Post to PR if helpful (10s)

Or manually: Look at error message, see "TRANSIENT FAILURE [Issue #12345]", done.

Saved per PR: 4-9 minutes
Saved per month: 7-15 hours
Saved per year: 80-180 hours

For Contributors

Before:

  • Test fails
  • Unsure if their code caused it
  • Wait for maintainer input (hours to days)
  • Maybe investigate unnecessarily

After:

  • Test fails with "TRANSIENT FAILURE [Issue Separated data PVC #12345]"
  • Immediately see: known issue, not my fault
  • Click issue link for context
  • Re-run or wait for maintainer with confidence

Result: Faster feedback, less anxiety, better experience

For Maintainers

Before:

  • Same transient failures seen repeatedly
  • Issue tracking is ad-hoc
  • No systematic categorization
  • Each reviewer re-investigates

After:

  • Mark once with /mark_transient
  • Automatic issue tracking
  • All future failures categorized
  • Pattern visibility over time

Bonus: Can generate metrics - which tests are flakiest? Which issues most common?

For Everyone

Data transfer:

  • Before: 1.6-36 MB HTML download per failure analysis
  • After: 100-500 KB JSON download
  • Savings: 90-98% bandwidth (helps mobile/slow connections)

Cognitive load:

  • Before: Mental tracking of "what transient failures do we have?"
  • After: Explicit annotation in code + issue tracker

Implementation Details

Test Decorator (lib/galaxy/util/unittest_utils/__init__.py)

def transient_failure(issue: int):
    """Mark test as known transient failure."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as e:
                msg = f"TRANSIENT FAILURE [Issue #{issue}]: {str(e)}"
                try:
                    raise type(e)(msg) from e
                except (TypeError, AttributeError):
                    raise Exception(msg) from e
        return wrapper
    return decorator

Safe exception handling:

  • Tries to preserve original exception type
  • Falls back to plain Exception if constructor incompatible
  • Always preserves exception chain (from e)

Workflow Changes (10 files)

Modified all test workflows to add --structured_data_report_file flag:

# Before
./run_tests.sh --coverage -playwright lib/galaxy_test/selenium -- --num-shards=3

# After
./run_tests.sh --coverage -playwright lib/galaxy_test/selenium \
  --structured_data_report_file run_playwright_tests.json \
  -- --num-shards=3

Storage impact: Adds ~1-2% per failure (500 KB JSON vs 36 MB HTML)

Automation Commands (.claude/commands/)

Three new commands for Claude Code users:

  • summarize_ci.md - Analyze PR failures
  • summarize_ci_post.md - Comment on PR

Non-Claude users can:

  • Use the decorator manually
  • Read marked failures in logs/CI output
  • Benefit from clearer error messages

Addressing Root Causes

This is not a substitute for fixing transient failures. It's a management tool.

We've had transient test failures for 20 years. We should fix them when we can. But we also need to be pragmatic:

  • Some are inherent to browser automation (Selenium/Playwright timing)
  • Some are race conditions in complex workflows
  • Some are external service flakiness (GitHub, test data sources)
  • Some are resource contention in CI runners

We will keep working to reduce them. But while they exist (and they always will to some degree), we need efficient ways to manage them.

This PR makes transient failures visible and trackable:

  • Issues show which tests are problematic
  • Patterns emerge (can target worst offenders)
  • Metrics enable prioritization
  • Clear distinction from new bugs

What This Doesn't Do

  • Doesn't hide failures - Tests still fail, just categorized
  • Doesn't prevent investigation - Issues track each one
  • Doesn't reduce CI time - Same tests run, same failures
  • Doesn't require AI - Decorator benefits manual review too

What This Does Do

  • Saves reviewer time - 80-180 hours/year
  • Reduces bandwidth - 90% smaller artifacts
  • Improves contributor experience - Clear signals
  • Enables tracking - Systematic issue management
  • Provides data - Can measure and prioritize

Changes

Workflows (10 files):

  • .github/workflows/api.yaml
  • .github/workflows/cwl_conformance.yaml
  • .github/workflows/framework_tools.yaml
  • .github/workflows/framework_workflows.yaml
  • .github/workflows/integration.yaml
  • .github/workflows/integration_selenium.yaml
  • .github/workflows/main_tools.yaml
  • .github/workflows/playwright.yaml
  • .github/workflows/selenium.yaml
  • .github/workflows/toolshed.yaml

Library:

  • lib/galaxy/util/unittest_utils/__init__.py - Add @transient_failure decorator

Commands:

  • .claude/commands/summarize_ci.md - CI analysis
  • .claude/commands/summarize_ci_post.md - Post summaries

Documentation:

  • WHY_JSON.md - Rationale
  • ENABLE_JSON.md - Usage guide

Example Workflow

For Claude Code users:

# Analyze PR
/summarize_ci 21218

# Output:
# ✅ Known transient (2): test_foo [#12345], test_bar [#67890]
# ❌ New failures (1): test_baz
#
# Summary and artifacts saved to database/pr_reviews/21218/

# Post to PR (explicit PR#)
/summarize_ci_post 21218 "Only transient failures - safe to re-run"

# OR simpler: post latest analyzed PR
/summarize_ci_post "Only transient failures - safe to re-run"

For manual reviewers:

# Look at CI logs, see:
TRANSIENT FAILURE [Issue #12345]: TimeoutException

# Think: "Known issue, safe to re-run" ✓

For contributors:

# See failure in PR:
TRANSIENT FAILURE [Issue #12345]: Element not found

# Think: "Not my fault, re-run or wait for maintainer" ✓

Marking new transient failures:

# Create GitHub issue #12345 for the flaky test
# Then add decorator to test:

from galaxy.util.unittest_utils import transient_failure

@transient_failure(issue=12345)
def test_flaky_selenium_interaction(self):
    # Test code here
    ...

# Commit the change - future failures will be auto-categorized

Summary

This PR makes transient failures explicit instead of implicit.

  • Faster analysis (4-9 min → 30 sec)
  • Clearer communication (marked in logs)
  • Better tracking (GitHub issues)
  • Reduced bandwidth (90% smaller)
  • Improved contributor experience

Benefits everyone:

  • AI users: Automated analysis
  • Manual reviewers: Clear error markers
  • Contributors: Less confusion
  • Maintainers: Better tracking

Doesn't hide problems, makes them manageable.

We've lived with transient failures for 20 years. Let's manage them efficiently while we work to reduce them.

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

- Update error message to reflect the error is likely a known transient failure.
- Track explicit issue numbers with errors so we can more easily track information about the error over time.
@jmchilton
Copy link
Member Author

This doesn't work - the artifact downloads are shitty (at least on my McWifi) and the HTML parsing of the logs is seemingly error prone - every time I run it - it parses HTML it downloads differently - sometimes taking multiple iterations at writing complex Python scripts. Part of the problem could be fixed if we have the JSON output introduced in this PR - it is a bit chicken and egg that way.

I was hoping for a request for comment on this and advice on how to harden it. Are we vaguely okay with saving the JSON artifacts? Do Claude-heads have an idea how to harden it - maybe guide it toward a more reproducible process somehow. Do we have Claude-skeptics that think this should just be a script we can run directly instead of a Claude command? I would love to have someone else using it and making sure it wasn't overly John and reflected a broader Galaxy sensibility.

If this worked correctly - it would have all the failing artifacts locally and we could easily write another Claude command to just fix the PR. Once that loop could be automated - I could write a little script to just review all my results every morning and summarize the fixes - this is whole days of my work somedays because I suck at context switching.

CC @dannon @mvdbeek @davelopez

@davelopez
Copy link
Contributor

This is an amazing idea! 💯

part of the problem could be fixed if we have the JSON output introduced in this PR

Yeah, from what I can see in this very PR, the JSON output from --structured_data_report_file run_integration_tests.json (example here https://github.com/galaxyproject/galaxy/actions/runs/18987278423/artifacts/4435478537) is a 4 MB compressed file that turns into an +80 MB JSON file. I bet any model will choke on that (I'll certainly do), so I won't trust them to perform well on that massive input context. Likely, we will need something more summarized already if I'm reading the issue correctly.

Maybe even having a "simple script" that, from the CI logs, extracts the relevant failures with some pattern matching or similar, so we have a more focused output, something like:

Example summary for a test failure

Just a series of blocks with the test that failed + reason, and then the stack trace (or any other structured format).

==================================== TEST FAILURE ====================================
ERROR test/integration/objectstore/test_rucio_objectstore.py::test_tools[multi_output_assign_primary] - subprocess.CalledProcessError: Command '['docker', 'run', '-p', '9000:80', '-d', '--name', 'TestRucioObjectStoreIntegration_container', '--rm', 'savannah.ornl.gov/ndip/public-docker/rucio:1.29.8']' returned non-zero exit status 125.

==================================== ERRORS ====================================
__________ ERROR at setup of test_tools[multi_output_assign_primary] ___________

    def _instance() -> Iterator[IntegrationInstance]:
        instance = clazz()
>       instance.setUpClass()

clazz      = <class 'integration.objectstore.test_rucio_objectstore.TestRucioObjectStoreIntegration'>
instance   = <integration.objectstore.test_rucio_objectstore.TestRucioObjectStoreIntegration object at 0x7fbc04d8c0d0>

lib/galaxy_test/driver/integration_util.py:215: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test/integration/objectstore/_base.py:256: in setUpClass
    start_rucio(cls.container_name)
        __class__  = <class 'integration.objectstore._base.BaseRucioObjectStoreIntegrationTestCase'>
        cls        = <class 'integration.objectstore.test_rucio_objectstore.TestRucioObjectStoreIntegration'>
test/integration/objectstore/_base.py:134: in start_rucio
    docker_run("savannah.ornl.gov/ndip/public-docker/rucio:1.29.8", container_name, ports=ports)
        container_name = 'TestRucioObjectStoreIntegration_container'
        ports      = [(9000, 80)]
test/integration/objectstore/_base.py:400: in docker_run
    subprocess.check_call(cmd)
        args       = ()
        cmd        = ['docker',
 'run',
 '-p',
 '9000:80',
 '-d',
 '--name',
 'TestRucioObjectStoreIntegration_container',
 '--rm',
 'savannah.ornl.gov/ndip/public-docker/rucio:1.29.8']
        container_port = 9000
        detach     = True
        host_port  = 80
        image      = 'savannah.ornl.gov/ndip/public-docker/rucio:1.29.8'
        name       = 'TestRucioObjectStoreIntegration_container'
        ports      = [(9000, 80)]
        remove     = True
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

popenargs = (['docker', 'run', '-p', '9000:80', '-d', '--name', ...],)
kwargs = {}, retcode = 125
cmd = ['docker', 'run', '-p', '9000:80', '-d', '--name', ...]

    def check_call(*popenargs, **kwargs):
        """Run command with arguments.  Wait for command to complete.  If
        the exit code was zero then return, otherwise raise
        CalledProcessError.  The CalledProcessError object will have the
        return code in the returncode attribute.
    
        The arguments are the same as for the call function.  Example:
    
        check_call(["ls", "-l"])
        """
        retcode = call(*popenargs, **kwargs)
        if retcode:
            cmd = kwargs.get("args")
            if cmd is None:
                cmd = popenargs[0]
>           raise CalledProcessError(retcode, cmd)
E           subprocess.CalledProcessError: Command '['docker', 'run', '-p', '9000:80', '-d', '--name', 'TestRucioObjectStoreIntegration_container', '--rm', 'savannah.ornl.gov/ndip/public-docker/rucio:1.29.8']' returned non-zero exit status 125.

cmd        = ['docker',
 'run',
 '-p',
 '9000:80',
 '-d',
 '--name',
 'TestRucioObjectStoreIntegration_container',
 '--rm',
 'savannah.ornl.gov/ndip/public-docker/rucio:1.29.8']
kwargs     = {}
popenargs  = (['docker',
  'run',
  '-p',
  '9000:80',
  '-d',
  '--name',
  'TestRucioObjectStoreIntegration_container',
  '--rm',
  'savannah.ornl.gov/ndip/public-docker/rucio:1.29.8'],)
retcode    = 125

/opt/hostedtoolcache/Python/3.9.24/x64/lib/python3.9/subprocess.py:373: CalledProcessError
---------------------------- Captured stderr setup -----------------------------
Unable to find image 'savannah.ornl.gov/ndip/public-docker/rucio:1.29.8' locally
docker: Error response from daemon: Head "https://savannah.ornl.gov/v2/ndip/public-docker/rucio/manifests/1.29.8": read tcp 10.1.0.246:59272->128.219.177.135:443: read: connection reset by peer

Run 'docker run --help' for more information
________________ ERROR at setup of test_tools[composite_output] ________________

    def _instance() -> Iterator[IntegrationInstance]:
        instance = clazz()
>       instance.setUpClass()

clazz      = <class 'integration.objectstore.test_rucio_objectstore.TestRucioObjectStoreIntegration'>
instance   = <integration.objectstore.test_rucio_objectstore.TestRucioObjectStoreIntegration object at 0x7fbc04d8c0d0>

lib/galaxy_test/driver/integration_util.py:215: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test/integration/objectstore/_base.py:256: in setUpClass
    start_rucio(cls.container_name)
        __class__  = <class 'integration.objectstore._base.BaseRucioObjectStoreIntegrationTestCase'>
        cls        = <class 'integration.objectstore.test_rucio_objectstore.TestRucioObjectStoreIntegration'>
test/integration/objectstore/_base.py:134: in start_rucio
    docker_run("savannah.ornl.gov/ndip/public-docker/rucio:1.29.8", container_name, ports=ports)
        container_name = 'TestRucioObjectStoreIntegration_container'
        ports      = [(9000, 80)]
test/integration/objectstore/_base.py:400: in docker_run
    subprocess.check_call(cmd)
        args       = ()
        cmd        = ['docker',
 'run',
 '-p',
 '9000:80',
 '-d',
 '--name',
 'TestRucioObjectStoreIntegration_container',
 '--rm',
 'savannah.ornl.gov/ndip/public-docker/rucio:1.29.8']
        container_port = 9000
        detach     = True
        host_port  = 80
        image      = 'savannah.ornl.gov/ndip/public-docker/rucio:1.29.8'
        name       = 'TestRucioObjectStoreIntegration_container'
        ports      = [(9000, 80)]
        remove     = True
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

popenargs = (['docker', 'run', '-p', '9000:80', '-d', '--name', ...],)
kwargs = {}, retcode = 125
cmd = ['docker', 'run', '-p', '9000:80', '-d', '--name', ...]

    def check_call(*popenargs, **kwargs):
        """Run command with arguments.  Wait for command to complete.  If
        the exit code was zero then return, otherwise raise
        CalledProcessError.  The CalledProcessError object will have the
        return code in the returncode attribute.
    
        The arguments are the same as for the call function.  Example:
    
        check_call(["ls", "-l"])
        """
        retcode = call(*popenargs, **kwargs)
        if retcode:
            cmd = kwargs.get("args")
            if cmd is None:
                cmd = popenargs[0]
>           raise CalledProcessError(retcode, cmd)
E           subprocess.CalledProcessError: Command '['docker', 'run', '-p', '9000:80', '-d', '--name', 'TestRucioObjectStoreIntegration_container', '--rm', 'savannah.ornl.gov/ndip/public-docker/rucio:1.29.8']' returned non-zero exit status 125.

cmd        = ['docker',
 'run',
 '-p',
 '9000:80',
 '-d',
 '--name',
 'TestRucioObjectStoreIntegration_container',
 '--rm',
 'savannah.ornl.gov/ndip/public-docker/rucio:1.29.8']
kwargs     = {}
popenargs  = (['docker',
  'run',
  '-p',
  '9000:80',
  '-d',
  '--name',
  'TestRucioObjectStoreIntegration_container',
  '--rm',
  'savannah.ornl.gov/ndip/public-docker/rucio:1.29.8'],)
retcode    = 125

/opt/hostedtoolcache/Python/3.9.24/x64/lib/python3.9/subprocess.py:373: CalledProcessError

==================================== TEST FAILURE ====================================
...
Another test failure summary...
...

It may be a bit difficult to find a good way to extract the failures since every test suite might be a little different 🤔

I think, having just this focused summary will already be extremely helpful to identify which tests failed and for what reason without scanning large reports. If we want to feed this summarized version to Claude or any other AI agent to improve the report even more, create the PR comment, etc., that would likely yield better results.

@jmchilton
Copy link
Member Author

@davelopez I think you're right about the size the outputs being very large - but I think Claude is smart enough to convert to JSON and just check for errors. Also thank you for pointing out the size of the artifacts - for some reason the JSON outputs are bigger than the HTML outputs - I would have never caught that of course - that is wild and I am going to push a fix I hope fixes this.

...

So I made some serious progress on this by offloading all of the artifact downloading and HTML -> JSON conversion to a standalone project (https://github.com/jmchilton/gh-ci-artifacts). It isn't Galaxy-specific but it can read a config file and be adapted to Galaxy - I am pushing that as well. This makes the artifact download basically instantaneous. In addition to pushing and converting all the artifacts - it produces a little interactive explorer to the local disk that is much faster than the Github Interface.

We an update to the command to use this I tried:

/summarize_ci 21232

on #21232 and it produces this output:

Analyzed PR #21232 - found critical issue affecting all tests:

  Root cause: Missing getLinkById method in Tool.vue component causes:
  - 1 ESLint error
  - 3 Jest unit test failures
  - 16 Selenium integration test failures

  Breakdown:
  - 2 linting/build failures
  - 0 transient failures (safe to re-run)
  - 16 new test failures requiring investigation

  All failures trace back to same code issue - need to implement/restore
  getLinkById method.

  Full details: database/pr_reviews/21232/summary
  Interactive viewer: open database/pr_reviews/21232/index.html

I mean I bet that is good enough for Claude to fix the PR directly. I also added a --wait argument to gh-ci-artifacts - so I think we could teach Claude to wait for the CI to complete and then reanalyze - the way it will do automatically for local tests and such. The waiting only polls once every half hour - so what I'm imaging if having a few PRs I'm working on in worktrees with their own claude session and when they get to the "make CI green" stage Claude is just waiting on them patiently no consuming tokens or anything.

Use --json-report --json-report-file=PATH after -- instead of custom arg.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants