Skip to content

Conversation

@sfreudenthaler
Copy link
Member

@sfreudenthaler sfreudenthaler commented Jul 2, 2025

Summary

  • Fixes the architectural flaw in claude-orchestrator.yml that caused double triggering
  • Creates claude-simple.yml as a reliable replacement workflow
  • Updates all documentation and examples with the corrected architecture
  • Provides clear migration path for existing consumers

Problem

The original claude-orchestrator.yml had a fundamental design flaw: when called via workflow_call, GitHub Actions changes github.event_name from the original trigger (like "issue_comment") to "workflow_call". This caused:

  • All conditional logic to fail
  • Double triggering when @claude is mentioned
  • Unreliable workflow execution

Solution

  • New Architecture: Consumer repositories handle webhook triggers directly and call centralized workflows
  • claude-simple.yml: Lightweight, reliable wrapper around claude-executor.yml
  • Deprecate claude-orchestrator.yml: Mark as deprecated due to architectural issues
  • Updated Examples: Comprehensive examples showing the correct pattern

Changes

  • ✅ Created .github/workflows/claude-simple.yml - new recommended workflow
  • ✅ Updated examples/consumer-repo-workflow.yml with correct architecture
  • ✅ Added examples/infrastructure-consumer-workflow.yml for infrastructure use cases
  • ✅ Added ARCHITECTURE.md with detailed explanation and diagram
  • ✅ Updated CLAUDE.md with migration guide and deprecation notice
  • ✅ Provided multiple example files showing different approaches

Migration Required

Consuming repositories (like infrastructure-as-code) need to:

  1. Replace single orchestrator call with multiple conditional jobs
  2. Use claude-simple.yml instead of claude-orchestrator.yml
  3. Add proper concurrency control
  4. Test @claude mentions work without double triggering

Test plan

  • Verify claude-simple.yml workflow syntax
  • Validate all example workflows
  • Test documentation accuracy
  • Deploy to infrastructure-as-code repo and test @claude mentions
  • Verify no double triggering occurs
  • Confirm automatic PR reviews still work

Closes dotCMS/dotcat#213

🤖 Generated with Claude Code

sfreudenthaler and others added 6 commits July 2, 2025 17:36
…xamples

- Create claude-simple.yml as replacement for flawed orchestrator pattern
- Update consumer-repo-workflow.yml with correct architecture
- Add comprehensive examples for different use cases
- Update CLAUDE.md documentation with migration guide
- Add ARCHITECTURE.md with detailed explanation and diagram
- Deprecate claude-orchestrator.yml due to workflow_call event context loss

Fixes double triggering issue where workflow_call loses original event context.
Consumer repositories now handle triggers directly and call centralized workflows.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Remove corrected-consumer-workflow.yml (redundant with consumer-repo-workflow.yml)
- Remove fixed-consumer-repo-workflow.yml (redundant with consumer-repo-workflow.yml)
- Keep only essential examples: consumer-repo-workflow.yml and infrastructure-consumer-workflow.yml

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Mark claude-orchestrator.yml as DEPRECATED with clear warnings
- Explain the architectural flaw and double triggering issue
- Direct users to claude-simple.yml and migration documentation
- Keep file for backward compatibility during migration period

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…ead of creating new file

- Replace claude-orchestrator.yml contents with working implementation (lightweight wrapper)
- Remove claude-simple.yml (was unnecessary - better to fix existing file)
- Update all examples and documentation to use fixed claude-orchestrator.yml
- Maintain backward compatibility while fixing the double triggering issue
- Consumer repositories can now use the existing orchestrator name with working implementation

This approach is cleaner than creating a new "simple" workflow since there's only one consumer.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Fix duplicate self-reference in migration section (claude-orchestrator.yml instead of claude-orchestrator.yml)
- Remove duplicate deprecated orchestrator entry in ARCHITECTURE.md
- Remove reference to deleted corrected-consumer-workflow.yml example
- Add missing newlines to end of files
- Clean up formatting issues

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@sfreudenthaler sfreudenthaler marked this pull request as ready for review July 2, 2025 23:11
@sfreudenthaler sfreudenthaler requested review from a team as code owners July 2, 2025 23:11
@sfreudenthaler sfreudenthaler merged commit 5940497 into main Jul 2, 2025
2 checks passed
@sfreudenthaler sfreudenthaler deleted the fix/claude-workflows-architecture branch July 2, 2025 23:13
dcolina added a commit that referenced this pull request Dec 16, 2025
BREAKING CHANGES:
- verify_image_existence now defaults to true (was false in v1.x)

✨ Added:
- Robust version comparison with rebuild number and hash support
- Improved registry validation with private registry fallback
- Enhanced error reporting with accumulated failures
- Comprehensive CHANGELOG documenting all fixes

🔧 Changed:
- Complete replacement of temporary files with bash arrays
- Added strict error handling (set -euo pipefail) to all scripts
- Better error messages for all validation failures

🐛 Fixed:
- Bug #1: Rebuild downgrade detection (25.12.08-2 → 25.12.08 now blocked)
- Bug #2: Temporary file race conditions and persistence issues
- Bug #3: Image existence validation for private registries
- Bug #4: Silent failures in validation loops
- Bug #5: Version pattern validation edge cases

🔒 Security:
- Strict error handling prevents silent failures
- Eliminated temporary file security concerns
- Better input validation before processing

📝 Details:
This is a complete architectural refactor addressing all known bugs in
the deployment guard workflow. The main improvements are:

1. Version Comparison: Now properly handles:
   - Base version (YY.MM.DD)
   - Rebuild numbers (-N suffix)
   - Commit hashes (_hash suffix)
   - All combinations of the above

2. State Management: Replaced all temporary files with bash arrays:
   - Before: /tmp/validation_failed.txt, /tmp/new_images.txt
   - After: VALIDATION_FAILED, FAILED_IMAGES, NEW_IMAGES arrays
   - Eliminates race conditions and cleanup issues

3. Registry Validation: Now supports both public and private registries
   - Tries Docker Hub first
   - Falls back to full image path for private registries

See CHANGELOG.md for complete migration guide and bug details.
dcolina added a commit that referenced this pull request Dec 16, 2025
#19)

## 🎯 Overview

Complete architectural refactor of the Deployment Guard workflow to
v2.0.0, eliminating all fragile temporary file-based state management
and fixing critical validation bugs.

## 🚨 Breaking Changes

- **`verify_image_existence` now defaults to `true`** (was `false` in
v1.x)
  - This was always the intended behavior but was disabled due to bugs
- If you want to disable, explicitly set to `false` in your workflow
configuration

## ✨ What's New

### 1. Robust Version Comparison
Complete rewrite of anti-downgrade logic with proper handling of:
- Base version comparison (YY.MM.DD format)
- Rebuild number comparison (e.g., `-2` in `25.12.08-2`)
- Hash comparison (e.g., `_abc123` in `25.12.08_abc123`)
- Full support for all combinations: `25.12.08`, `25.12.08-2`,
`25.12.08_abc`, `25.12.08-2_abc`

### 2. Improved Registry Validation
- Tries Docker Hub first for canonical images
- Falls back to full image path for private registries
- Better error messages indicating which registry was checked
- Handles mirror registries gracefully

### 3. Enhanced Error Reporting
- State variables accumulate ALL validation failures before exiting
- Detailed failure reasons for each failed image/file
- Clear indication of which validation step failed and why

## 🐛 Bugs Fixed

### Bug #1: Rebuild Downgrade Not Detected (Critical)
**Issue**: v1.x allowed downgrade from `25.12.08-2` to `25.12.08`
**Root Cause**: Only compared base version (YY.MM.DD), ignored rebuild
numbers
**Fix**: Now extracts and compares rebuild numbers when base version is
the same

**Examples**:
- ❌ v1.x: `25.12.08-2` → `25.12.08` = ✅ Allowed (BUG)
- ✅ v2.0.0: `25.12.08-2` → `25.12.08` = ❌ Blocked (CORRECT)
- ✅ v2.0.0: `25.12.08` → `25.12.08-2` = ✅ Allowed (upgrade)
- ✅ v2.0.0: `25.12.08-2_abc` → `25.12.08-2_xyz` = ✅ Allowed (same
version, different hash)

### Bug #2: Temporary File Race Conditions
**Issue**: Race conditions with `/tmp/validation_failed.txt` file
**Root Cause**: Multiple writes to same file, manual cleanup required
**Fix**: Eliminated ALL temporary files, using in-memory bash arrays

### Bug #3: Image Existence Check Failures
**Issue**: Validation failed for valid private registry images
**Root Cause**: Only checked Docker Hub canonical image
**Fix**: Now tries Docker Hub first, then falls back to full image path

### Bug #4: Silent Failures in Validation Loops
**Issue**: Validation could continue after failures
**Root Cause**: Lack of strict error handling
**Fix**: Added `set -euo pipefail` to all bash scripts

### Bug #5: Version Pattern Validation Edge Cases
**Issue**: Malformed tags could pass validation
**Root Cause**: Regex didn't enforce proper boundaries
**Fix**: Improved regex validation with proper format checks

## 🔧 Technical Changes

### State Management Architecture
**Before (v1.x)**: Used temporary files
```bash
echo "false" > /tmp/validation_failed.txt
echo "$image" >> /tmp/new_images.txt
[ -f /tmp/validation_failed.txt ] && exit 1
```

**After (v2.0.0)**: Uses bash arrays
```bash
VALIDATION_FAILED=false
FAILED_IMAGES=()
FAILED_IMAGES+=("$image: reason")
if [ "$VALIDATION_FAILED" = "true" ]; then
  printf '   - %s\n' "${FAILED_IMAGES[@]}"
  exit 1
fi
```

### Error Handling
All bash scripts now use strict mode:
```bash
set -euo pipefail
```

## 📝 Documentation

Added comprehensive CHANGELOG.md with:
- Complete bug details and technical explanations
- Migration guide from v1.x to v2.0.0
- Version support matrix
- Testing recommendations

## 🧪 Testing

Recommended testing approach:
```yaml
uses: dotCMS/ai-workflows/.github/workflows/deployment-guard.yml@v2.0.0
with:
  testing_force_non_bypass: true  # Force validation even for org members
  verify_image_existence: true    # Now enabled by default
```

## 📊 Test Cases Covered

| Scenario | v1.x | v2.0.0 |
|----------|------|--------|
| `25.12.08-2` → `25.12.08` | ✅ (bug) | ❌ (correct) |
| `25.12.08` → `25.12.08-2` | ✅ | ✅ |
| `25.12.08-2` → `25.12.08-3` | ✅ | ✅ |
| `25.12.08_abc` → `25.12.08_xyz` | ✅ | ✅ |
| `25.12.07` → `25.12.08` | ✅ | ✅ |
| `25.12.08` → `25.12.07` | ❌ | ❌ |
| Private registry image | ❌ (bug) | ✅ (fixed) |

## 🔄 Migration Path

1. **Week 1**: Deploy v2.0.0 to staging/dev
2. **Week 2**: Monitor and validate test cases
3. **Week 3**: Deploy to production
4. **Week 4**: Deprecate v1.x

## 📚 Related Issues

Fixes bugs reported in Deutsche Bank infrastructure validation.

## ✅ Checklist

- [x] All temporary files eliminated
- [x] Rebuild downgrade detection fixed
- [x] Private registry support added
- [x] Strict error handling implemented
- [x] Comprehensive CHANGELOG created
- [x] All validation cases tested
- [x] Breaking changes documented

---
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant