Skip to content

Conversation

@sfreudenthaler
Copy link
Member

@sfreudenthaler sfreudenthaler commented Jun 27, 2025

PR: Centralize Claude Workflows – Replace Pilot from dotcms/infrastructure-as-code

Overview

This PR introduces a centralized, reusable Claude workflow system designed to replace the pilot implementation previously used in the dotcms/infrastructure-as-code repository.

Related PR to cleanup the IaC repo side.

Why This Change?

  • Centralization & DRY:
    By consolidating all Claude-related GitHub Actions into a single, reusable orchestrator and executor pattern, we eliminate duplication and make ongoing maintenance much easier.
  • Replace the Pilot:
    The previous approach in dotcms/infrastructure-as-code was a proof-of-concept. This PR formalizes and generalizes that approach for use across all repositories.
  • Maintain Flexibility:
    While the core logic is now centralized, each repository can still customize prompts, allowed tools, runner selection, and other parameters as needed—ensuring flexibility is not sacrificed for maintainability.

What’s Changed?

  • Removed:
    The standalone claude-code-review.yml workflow, as its functionality is now fully covered by the orchestrator + executor pattern.
  • Added/Updated:
    • claude-orchestrator.yml: Routes all Claude triggers (PRs, issues, comments, reviews) to the correct execution mode.
    • claude-executor.yml: Handles the actual execution of Claude actions, with configurable parameters.
  • Documentation:
    Updated to reflect the new centralized approach and how to customize at the repo level.

Benefits

  • Single source of truth for Claude automation logic.
  • Easier updates—fixes and improvements are made in one place.
  • Repo-level customization remains fully supported via workflow inputs.
  • Cleaner, more maintainable codebase.

Migration Notes

  • Repositories previously using the pilot workflow in dotcms/infrastructure-as-code should now reference the centralized orchestrator workflow.
  • See the updated README for instructions on customizing the workflow for your repository’s needs.

In summary:
This PR replaces the pilot Claude workflow with a robust, maintainable, and flexible system that can be easily adopted across all repositories, keeping things DRY while supporting per-repo customization.

sfreudenthaler and others added 10 commits June 27, 2025 16:34
Add centralized, reusable GitHub Actions workflows for Claude AI automation:

- claude-executor.yml: Configurable execution engine with custom tool allowlists,
  timeouts, and runner specifications
- claude-orchestrator.yml: Orchestrates all Claude triggers (comments, PRs, issues)
  with configurable prompts and conditional logic

Key features:
- Parameterized inputs for repository-specific customization
- Support for interactive (@claude mentions) and automatic modes
- Configurable allowed tools for different repository types
- Flexible timeout and runner configuration
- Comprehensive trigger event handling

This enables DRY implementation of Claude workflows across all dotCMS repositories
while maintaining repository-specific configurations.

Usage example:
```yaml
jobs:
  claude:
    uses: dotCMS/claude-workflows/.github/workflows/claude-orchestrator.yml@main
    with:
      allowed_tools: |
        Bash(terraform plan)
        Bash(git status)
    secrets:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

Resolves: dotCMS/dotcat#213
Add GitHub Actions workflow to validate YAML syntax and workflow structure before PRs can be merged. Includes yamllint configuration for consistent formatting.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add reusable claude-code-review.yml workflow with configurable review focus
- Require each repository to provide its own ANTHROPIC_API_KEY secret
- Implement cost tracking and security isolation through per-repo API keys
- Add comprehensive documentation with setup instructions
- Include example workflow for consuming repositories
- Enhance test validation to check secret requirements in reusable workflows

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Remove trailing spaces from all workflow files
- Add missing newlines at end of files
- Increase yamllint line-length limit to 240 for shell commands
- Fix formatting in claude-code-review.yml, claude-executor.yml, claude-orchestrator.yml, test.yml, and examples

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@sfreudenthaler sfreudenthaler changed the title feat: add initial reusable Claude AI workflow templates Centralize Claude Workflows – Replace Pilot from dotcms/infrastructure-as-code Jul 2, 2025
@sfreudenthaler sfreudenthaler enabled auto-merge (squash) July 2, 2025 19:17
Copy link
Member

@yolabingo yolabingo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥

@sfreudenthaler sfreudenthaler merged commit 38e1e40 into main Jul 2, 2025
2 checks passed
@sfreudenthaler sfreudenthaler deleted the feat/initial-reusable-workflows branch July 2, 2025 20:19
dcolina added a commit that referenced this pull request Dec 16, 2025
Root cause: /tmp directory is NOT cleaned between jobs in GitHub Actions,
causing temp files to persist and contaminate subsequent workflow runs.

This caused two critical bugs:
- Bug #1: /tmp/validation_failed.txt persisted from previous runs, causing
  valid PRs to fail with exit code 1
- Bug #2: /tmp/new_images.txt and /tmp/old_images.txt accumulated data
  from multiple test PRs, causing wrong images to be processed

Solution: Add explicit cleanup at the start of each step that uses temp files.

Changes:
1. validate-changed-files: Clean /tmp/disallowed_files.txt
2. validate-image-only-changed: Clean /tmp/validation_failed.txt,
   /tmp/new_images.txt, /tmp/old_images.txt
3. validate-images: Clean /tmp/validation_failed.txt

Fixes: Bug #1 - Exit code 1 on successful validations
Fixes: Bug #2 - Wrong image processed (Test Case #6)
Related: Test Cases #1, #2, #6
dcolina added a commit that referenced this pull request Dec 16, 2025
## Bug Fixes: Temp File Cleanup (Bug #1 and Bug #2)

Fixes critical bugs by adding explicit cleanup of temp files at the
start of each job.

**Root Cause:** /tmp directory persists between jobs in GitHub Actions.

**Bugs Fixed:**
- Bug #1: Exit code 1 on successful validations (Test Cases #1, #2)
- Bug #2: Wrong image processed (Test Case #6)

**Solution:** Add `rm -f` cleanup at start of each step using temp
files.

---
Will recreate tag v1.1.1 after merge to avoid changing caller workflows.
dcolina added a commit that referenced this pull request Dec 16, 2025
BREAKING CHANGES:
- verify_image_existence now defaults to true (was false in v1.x)

✨ Added:
- Robust version comparison with rebuild number and hash support
- Improved registry validation with private registry fallback
- Enhanced error reporting with accumulated failures
- Comprehensive CHANGELOG documenting all fixes

🔧 Changed:
- Complete replacement of temporary files with bash arrays
- Added strict error handling (set -euo pipefail) to all scripts
- Better error messages for all validation failures

🐛 Fixed:
- Bug #1: Rebuild downgrade detection (25.12.08-2 → 25.12.08 now blocked)
- Bug #2: Temporary file race conditions and persistence issues
- Bug #3: Image existence validation for private registries
- Bug #4: Silent failures in validation loops
- Bug #5: Version pattern validation edge cases

🔒 Security:
- Strict error handling prevents silent failures
- Eliminated temporary file security concerns
- Better input validation before processing

📝 Details:
This is a complete architectural refactor addressing all known bugs in
the deployment guard workflow. The main improvements are:

1. Version Comparison: Now properly handles:
   - Base version (YY.MM.DD)
   - Rebuild numbers (-N suffix)
   - Commit hashes (_hash suffix)
   - All combinations of the above

2. State Management: Replaced all temporary files with bash arrays:
   - Before: /tmp/validation_failed.txt, /tmp/new_images.txt
   - After: VALIDATION_FAILED, FAILED_IMAGES, NEW_IMAGES arrays
   - Eliminates race conditions and cleanup issues

3. Registry Validation: Now supports both public and private registries
   - Tries Docker Hub first
   - Falls back to full image path for private registries

See CHANGELOG.md for complete migration guide and bug details.
dcolina added a commit that referenced this pull request Dec 16, 2025
#19)

## 🎯 Overview

Complete architectural refactor of the Deployment Guard workflow to
v2.0.0, eliminating all fragile temporary file-based state management
and fixing critical validation bugs.

## 🚨 Breaking Changes

- **`verify_image_existence` now defaults to `true`** (was `false` in
v1.x)
  - This was always the intended behavior but was disabled due to bugs
- If you want to disable, explicitly set to `false` in your workflow
configuration

## ✨ What's New

### 1. Robust Version Comparison
Complete rewrite of anti-downgrade logic with proper handling of:
- Base version comparison (YY.MM.DD format)
- Rebuild number comparison (e.g., `-2` in `25.12.08-2`)
- Hash comparison (e.g., `_abc123` in `25.12.08_abc123`)
- Full support for all combinations: `25.12.08`, `25.12.08-2`,
`25.12.08_abc`, `25.12.08-2_abc`

### 2. Improved Registry Validation
- Tries Docker Hub first for canonical images
- Falls back to full image path for private registries
- Better error messages indicating which registry was checked
- Handles mirror registries gracefully

### 3. Enhanced Error Reporting
- State variables accumulate ALL validation failures before exiting
- Detailed failure reasons for each failed image/file
- Clear indication of which validation step failed and why

## 🐛 Bugs Fixed

### Bug #1: Rebuild Downgrade Not Detected (Critical)
**Issue**: v1.x allowed downgrade from `25.12.08-2` to `25.12.08`
**Root Cause**: Only compared base version (YY.MM.DD), ignored rebuild
numbers
**Fix**: Now extracts and compares rebuild numbers when base version is
the same

**Examples**:
- ❌ v1.x: `25.12.08-2` → `25.12.08` = ✅ Allowed (BUG)
- ✅ v2.0.0: `25.12.08-2` → `25.12.08` = ❌ Blocked (CORRECT)
- ✅ v2.0.0: `25.12.08` → `25.12.08-2` = ✅ Allowed (upgrade)
- ✅ v2.0.0: `25.12.08-2_abc` → `25.12.08-2_xyz` = ✅ Allowed (same
version, different hash)

### Bug #2: Temporary File Race Conditions
**Issue**: Race conditions with `/tmp/validation_failed.txt` file
**Root Cause**: Multiple writes to same file, manual cleanup required
**Fix**: Eliminated ALL temporary files, using in-memory bash arrays

### Bug #3: Image Existence Check Failures
**Issue**: Validation failed for valid private registry images
**Root Cause**: Only checked Docker Hub canonical image
**Fix**: Now tries Docker Hub first, then falls back to full image path

### Bug #4: Silent Failures in Validation Loops
**Issue**: Validation could continue after failures
**Root Cause**: Lack of strict error handling
**Fix**: Added `set -euo pipefail` to all bash scripts

### Bug #5: Version Pattern Validation Edge Cases
**Issue**: Malformed tags could pass validation
**Root Cause**: Regex didn't enforce proper boundaries
**Fix**: Improved regex validation with proper format checks

## 🔧 Technical Changes

### State Management Architecture
**Before (v1.x)**: Used temporary files
```bash
echo "false" > /tmp/validation_failed.txt
echo "$image" >> /tmp/new_images.txt
[ -f /tmp/validation_failed.txt ] && exit 1
```

**After (v2.0.0)**: Uses bash arrays
```bash
VALIDATION_FAILED=false
FAILED_IMAGES=()
FAILED_IMAGES+=("$image: reason")
if [ "$VALIDATION_FAILED" = "true" ]; then
  printf '   - %s\n' "${FAILED_IMAGES[@]}"
  exit 1
fi
```

### Error Handling
All bash scripts now use strict mode:
```bash
set -euo pipefail
```

## 📝 Documentation

Added comprehensive CHANGELOG.md with:
- Complete bug details and technical explanations
- Migration guide from v1.x to v2.0.0
- Version support matrix
- Testing recommendations

## 🧪 Testing

Recommended testing approach:
```yaml
uses: dotCMS/ai-workflows/.github/workflows/deployment-guard.yml@v2.0.0
with:
  testing_force_non_bypass: true  # Force validation even for org members
  verify_image_existence: true    # Now enabled by default
```

## 📊 Test Cases Covered

| Scenario | v1.x | v2.0.0 |
|----------|------|--------|
| `25.12.08-2` → `25.12.08` | ✅ (bug) | ❌ (correct) |
| `25.12.08` → `25.12.08-2` | ✅ | ✅ |
| `25.12.08-2` → `25.12.08-3` | ✅ | ✅ |
| `25.12.08_abc` → `25.12.08_xyz` | ✅ | ✅ |
| `25.12.07` → `25.12.08` | ✅ | ✅ |
| `25.12.08` → `25.12.07` | ❌ | ❌ |
| Private registry image | ❌ (bug) | ✅ (fixed) |

## 🔄 Migration Path

1. **Week 1**: Deploy v2.0.0 to staging/dev
2. **Week 2**: Monitor and validate test cases
3. **Week 3**: Deploy to production
4. **Week 4**: Deprecate v1.x

## 📚 Related Issues

Fixes bugs reported in Deutsche Bank infrastructure validation.

## ✅ Checklist

- [x] All temporary files eliminated
- [x] Rebuild downgrade detection fixed
- [x] Private registry support added
- [x] Strict error handling implemented
- [x] Comprehensive CHANGELOG created
- [x] All validation cases tested
- [x] Breaking changes documented

---
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants