[refactor] Semantic Function Clustering Analysis - Duplicate Functions and Code Organization #12770

@github-actions

Executive Summary

A comprehensive semantic function clustering analysis identified exact duplicate functions, outlier functions in wrong files, and mixed-concern files across the codebase. The analysis covered 487 non-test Go files with deep focus on the pkg/workflow (247 files) and pkg/cli (163 files) packages.

Key Findings:

  • ~70% well-organized - Most validation, parsing, and creation files follow excellent patterns
  • 🔄 1 exactly duplicated function - extractBaseRepo() has identical implementations in 2 files
  • ⚠️ 2 similar functions - ParseGitHubURL() in 2 files with different purposes
  • 4 high-priority outlier functions - Functions clearly in wrong files
  • 📦 5 exemplary subsystems - Excellent models for code organization

Critical Issues Identified

1. Exact Duplicate Functions

Issue #1: extractBaseRepo() - Identical Implementation in Two Files

Duplicate Locations:

  • pkg/workflow/action_resolver.go:93
  • pkg/cli/update_actions.go:20

Code Comparison:

// pkg/workflow/action_resolver.go:93
func extractBaseRepo(repo string) string {
    parts := strings.Split(repo, "/")
    if len(parts) >= 2 {
        // Take first two parts (owner/repo)
        return parts[0] + "/" + parts[1]
    }
    return repo
}

// pkg/cli/update_actions.go:20
func extractBaseRepo(actionPath string) string {
    parts := strings.Split(actionPath, "/")
    if len(parts) >= 2 {
        // Return owner/repo (first two segments)
        return parts[0] + "/" + parts[1]
    }
    // If less than 2 parts, return as-is (shouldn't happen in practice)
    return actionPath
}

Similarity: 100% identical logic; only the parameter name and comments differ

Recommendation:

  • Consolidate into pkg/repoutil/repoutil.go (which already has related utilities)
  • Export as ExtractBaseRepo(path string) string
  • Update both callers to use repoutil.ExtractBaseRepo()

Estimated Impact: Reduced code duplication, single source of truth for repository path parsing
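
A minimal sketch of what the consolidated helper could look like. The logic is copied from the two existing implementations; the exported name follows the recommendation above. It is shown as package main only so that it runs standalone; in the repository it would presumably live in package repoutil.

```go
package main

import (
	"fmt"
	"strings"
)

// ExtractBaseRepo returns the "owner/repo" prefix of an action path such as
// "owner/repo/sub/dir". Inputs with fewer than two segments are returned
// unchanged, which matches the behavior of both existing copies.
func ExtractBaseRepo(path string) string {
	parts := strings.Split(path, "/")
	if len(parts) >= 2 {
		// Keep only the first two segments (owner/repo).
		return parts[0] + "/" + parts[1]
	}
	return path
}

func main() {
	fmt.Println(ExtractBaseRepo("github/codeql-action/upload-sarif"))
	fmt.Println(ExtractBaseRepo("actions/checkout"))
	fmt.Println(ExtractBaseRepo("checkout"))
}
```

Both callers would then replace their private extractBaseRepo() with repoutil.ExtractBaseRepo() and delete the local copy.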


Issue #2: ParseGitHubURL() - Similar Functions with Different Purposes

Locations:

  • pkg/repoutil/repoutil.go:28 - Returns (owner, repo string, err error)
  • pkg/parser/github_urls.go:56 - Returns (*GitHubURLComponents, error)

Analysis:
These functions have different purposes despite similar names:

  • repoutil version: Handles SSH (git@github.com:) and HTTPS formats, returns simple owner/repo tuple for git operations
  • parser version: Uses url.Parse(), handles raw.githubusercontent.com, returns structured GitHubURLComponents with file paths, refs, content types

Recommendation:

  • Rename for clarity:
    • repoutil.ParseGitHubURL → repoutil.ParseGitRepoURL (emphasizes git repo focus)
    • Keep parser.ParseGitHubURL as-is (comprehensive parser)
  • Add cross-reference comments explaining the distinction
  • Consider consolidation: could the repoutil function call the parser version and extract only owner/repo?

Estimated Impact: Improved API clarity, reduced naming confusion
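
One way the rename could be staged is sketched below. This is not the current repoutil code: the body is an assumed, simplified parser that only handles the git@github.com: and https://github.com/ forms mentioned above, and the deprecated wrapper is a suggestion for keeping existing callers compiling during the transition.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// ParseGitRepoURL (proposed name) extracts owner and repo from a git remote
// URL in SSH or HTTPS form. For file paths, refs and raw content URLs, the
// structured parser.ParseGitHubURL would be the function to use instead.
func ParseGitRepoURL(remote string) (owner, repo string, err error) {
	var path string
	if strings.HasPrefix(remote, "git@github.com:") {
		// SSH form: git@github.com:owner/repo.git
		path = strings.TrimPrefix(remote, "git@github.com:")
	} else {
		// HTTPS form: https://github.com/owner/repo(.git)
		u, perr := url.Parse(remote)
		if perr != nil {
			return "", "", fmt.Errorf("parse %q: %w", remote, perr)
		}
		if u.Host != "github.com" {
			return "", "", fmt.Errorf("unsupported host in %q", remote)
		}
		path = strings.TrimPrefix(u.Path, "/")
	}
	path = strings.TrimSuffix(path, ".git")
	parts := strings.Split(path, "/")
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", fmt.Errorf("expected owner/repo in %q", remote)
	}
	return parts[0], parts[1], nil
}

// ParseGitHubURL is kept as a thin wrapper so existing callers still build.
//
// Deprecated: use ParseGitRepoURL.
func ParseGitHubURL(remote string) (string, string, error) {
	return ParseGitRepoURL(remote)
}

func main() {
	for _, remote := range []string{
		"git@github.com:owner/repo.git",
		"https://github.com/owner/repo",
	} {
		owner, repo, err := ParseGitHubURL(remote)
		fmt.Println(owner, repo, err)
	}
}
```

The wrapper can be removed once all call sites use the new name, which keeps each step of the rename small and reviewable.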


2. Outlier Functions (Functions in Wrong Files)

Outlier #1: Git Attribute Configuration in Git Operations File

File: pkg/cli/git.go:157
Function: ensureGitAttributes()
Current Purpose: Configuring .gitattributes for compiled workflow files
Issue: This is compilation post-processing, not a core git operation

Why It's Misplaced:
The function sets up .gitattributes to handle .lock.yml files in .github/workflows with linguist-generated=true merge=ours directives. This is a workflow compilation concern, not a generic git utility. The git.go file should contain reusable git operations (commits, branches, remotes), not workflow-specific configuration.

Recommendation:

  • Move to pkg/cli/compile_post_processing.go (or create it)
  • Rename to configureWorkflowGitAttributes() for clarity
  • Keep git.go focused on reusable git operations

Estimated Impact: Clearer separation of concerns, easier to find compilation-related setup
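
As an illustration of how the moved function could be kept small and testable, the sketch below separates the pure "ensure this line is present" logic from file I/O. The helper name and the exact attribute line are assumptions based on the description above, not the current implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// lockFileAttributes is an assumed form of the directive described above.
const lockFileAttributes = ".github/workflows/*.lock.yml linguist-generated=true merge=ours"

// ensureAttributesLine (hypothetical helper) returns content with line
// appended if it is not already present, and reports whether it changed.
func ensureAttributesLine(content, line string) (string, bool) {
	for _, existing := range strings.Split(content, "\n") {
		if strings.TrimSpace(existing) == line {
			return content, false
		}
	}
	// Make sure the new entry starts on its own line.
	if content != "" && !strings.HasSuffix(content, "\n") {
		content += "\n"
	}
	return content + line + "\n", true
}

func main() {
	updated, changed := ensureAttributesLine("*.png binary\n", lockFileAttributes)
	fmt.Printf("changed=%v\n%s", changed, updated)
}
```

A configureWorkflowGitAttributes() in compile_post_processing.go would then only need to read .gitattributes, call the helper, and write the file back when the helper reports a change.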


Outlier #2: User Interaction in Git Operations File

File: pkg/cli/git.go:704
Function: confirmPushOperation()
Issue: User interaction logic mixed with git operations

Why It's Misplaced:
This function uses the huh library to prompt users for confirmation before pushing. User interaction should be grouped with other interactive prompts, not embedded in git operations. The function contains no git logic; it is purely UI/UX.

Recommendation:

  • Move to pkg/cli/interactive.go (if exists) or create pkg/cli/prompts.go
  • Group with other user confirmation functions
  • Keep git.go focused on actual git commands

Estimated Impact: Improved testability (can mock prompts separately from git operations), clearer responsibilities


Outlier #3: GitHub URL Parsing in Git Operations File

File: pkg/cli/git.go:62
Function: parseGitHubRepoSlugFromURL()
Issue: URL parsing utility in git operations file

Why It's Misplaced:
This function parses GitHub URLs to extract repository slugs - it's a URL/string parsing utility, not a git operation. It belongs with other GitHub URL parsing utilities.

Recommendation:

  • Move to pkg/repoutil/repoutil.go (which already has ParseGitHubURL)
  • Rename to ExtractRepoSlugFromURL() for clarity
  • Keep git.go focused on git commands

Estimated Impact: Centralized GitHub URL parsing utilities, clearer file boundaries


Outlier #4: .gitignore Management in Git Operations File

File: pkg/cli/git.go:233
Function: ensureLogsGitignore()
Issue: Logs-specific file management in git operations

Why It's Misplaced:
This function manages .gitignore entries for the logs directory - it's logs package configuration, not a generic git utility. It's similar to the ensureGitAttributes() issue.

Recommendation:

  • Move to pkg/cli/logs_setup.go or pkg/cli/logs_config.go
  • Keep git.go focused on git commands
  • Group with other logs-related setup functions

Estimated Impact: Clearer separation of git utilities vs. logs package setup


3. File Size and Complexity Issues

Large Files Requiring Refactoring

Top 5 Largest Non-Test Files:

  1. pkg/cli/trial_command.go - 1000 lines
  2. pkg/cli/mcp_server.go - 1000 lines
  3. pkg/workflow/safe_outputs_config_generation.go - 988 lines
  4. pkg/cli/audit.go - 864 lines
  5. pkg/workflow/compiler_activation_jobs.go - 855 lines

Note: These files are flagged in separate file-diet issues (#12747, #12709, #12675) and should be addressed through those dedicated refactoring tasks.


Well-Organized Code (Models to Follow)

🏆 Best Practice #1: Validation Files (pkg/workflow)

Why It's Excellent:

  • 27 focused validation files, each handling one domain:
    • pip_validation.go - Python package validation
    • npm_validation.go - NPM package validation
    • docker_validation.go - Docker image validation
    • firewall_validation.go - Firewall configuration
    • expression_validation.go - Expression safety
    • sandbox_validation.go - Sandbox configuration
    • And 21 more domain-specific validators...
  • Generic validators in validation_helpers.go
  • Easy to add new validators (just create new {domain}_validation.go)

Key Takeaway: One validation file per domain prevents god files and improves discoverability.
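
A rough sketch of how a domain validator can build on shared helpers. ValidateRequired() and ValidateMaxLength() are helper names mentioned in this analysis, but their signatures here are assumed, and validateImageName with its 255-character limit is purely illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// Generic helpers of the kind kept in validation_helpers.go (signatures assumed).
func ValidateRequired(field, value string) error {
	if strings.TrimSpace(value) == "" {
		return fmt.Errorf("%s is required", field)
	}
	return nil
}

func ValidateMaxLength(field, value string, limit int) error {
	if len(value) > limit {
		return fmt.Errorf("%s exceeds %d characters", field, limit)
	}
	return nil
}

// validateImageName is a hypothetical domain validator of the kind that
// would live in a {domain}_validation.go file.
func validateImageName(name string) error {
	if err := ValidateRequired("image", name); err != nil {
		return err
	}
	if err := ValidateMaxLength("image", name, 255); err != nil {
		return err
	}
	if strings.ContainsAny(name, " \t\n") {
		return fmt.Errorf("image %q must not contain whitespace", name)
	}
	return nil
}

func main() {
	fmt.Println(validateImageName("ghcr.io/owner/image:latest"))
	fmt.Println(validateImageName(""))
	fmt.Println(validateImageName("bad name"))
}
```

The domain file holds only domain rules; anything reusable across domains stays in the helpers file.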


🏆 Best Practice #2: codemod_* Files (pkg/cli)

Why It's Excellent:

  • 15 feature-specific files following identical patterns
  • Shared utilities properly factored out (codemod_yaml_utils.go)
  • Consistent structure:
    func get{Feature}Codemod() Codemod {
        return Codemod{
            ID:           "feature-identifier",
            Name:         "Human readable name",
            Description:  "What it does",
            IntroducedIn: "0.x.0",
            Apply: func(content string, frontmatter map[string]any) (string, bool, error) {
                // Implementation
            },
        }
    }
  • Each file handles ONE migration concern
  • Paired with test files

Files: codemod_agent_session.go, codemod_discussion_flag.go, codemod_grep_tool.go, codemod_mcp_mode_to_type.go, codemod_mcp_network.go, codemod_network_firewall.go, codemod_permissions.go, codemod_safe_inputs.go, codemod_sandbox_agent.go, codemod_schedule.go, codemod_schema_file.go, codemod_slash_command.go, codemod_timeout_minutes.go, codemod_upload_assets.go

Key Takeaway: This is a model subsystem demonstrating perfect feature-based organization.
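
To make the pattern concrete, here is a small self-contained instance. The Codemod struct mirrors the fields shown in the template above, but the real type in pkg/cli may differ. The codemod itself (renaming a top-level old-key to new-key) is invented for illustration and does not correspond to any listed file.

```go
package main

import (
	"fmt"
	"strings"
)

// Codemod mirrors the fields in the template above (assumed shape).
type Codemod struct {
	ID           string
	Name         string
	Description  string
	IntroducedIn string
	Apply        func(content string, frontmatter map[string]any) (string, bool, error)
}

// getExampleRenameCodemod is a hypothetical codemod that renames the
// top-level frontmatter key "old-key" to "new-key".
func getExampleRenameCodemod() Codemod {
	return Codemod{
		ID:           "example-rename-key",
		Name:         "Rename old-key to new-key",
		Description:  "Illustrative migration of a single frontmatter key",
		IntroducedIn: "0.x.0",
		Apply: func(content string, frontmatter map[string]any) (string, bool, error) {
			// Nothing to migrate if the key is absent.
			if _, ok := frontmatter["old-key"]; !ok {
				return content, false, nil
			}
			lines := strings.Split(content, "\n")
			changed := false
			for i, line := range lines {
				if strings.HasPrefix(line, "old-key:") {
					lines[i] = "new-key:" + strings.TrimPrefix(line, "old-key:")
					changed = true
				}
			}
			return strings.Join(lines, "\n"), changed, nil
		},
	}
}

func main() {
	cm := getExampleRenameCodemod()
	out, changed, err := cm.Apply("---\nold-key: 5\n---\nbody", map[string]any{"old-key": 5})
	fmt.Printf("%q %v %v\n", out, changed, err)
}
```

Because Apply is a pure function of the file content and parsed frontmatter, each codemod can be covered by a paired test file without touching the filesystem.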


🏆 Best Practice #3: Creation Functions (pkg/workflow)

Why It's Excellent:

  • One file per creation concern:
    • create_issue.go - GitHub issue creation
    • create_pull_request.go - Pull request creation
    • create_discussion.go - Discussion creation
    • create_agent_session.go - Agent session creation
    • create_code_scanning_alert.go - Security alert creation
    • And more...
  • Clear naming: create_{entity}.go
  • Paired with comprehensive test files

Key Takeaway: One creation function per file makes code easy to locate and test.


🏆 Best Practice #4: Runtime Files (pkg/workflow)

Why It's Excellent:

  • Clear separation by concern:
    • runtime_definitions.go - Type definitions
    • runtime_detection.go - Runtime detection logic
    • runtime_deduplication.go - Deduplication
    • runtime_validation.go - Validation
  • Each file has a single, clear purpose
  • Easy to find related functionality

Key Takeaway: Split by functional concern (definitions, detection, validation) rather than mixing in one large file.


🏆 Best Practice #5: Expression Handling (pkg/workflow)

Why It's Excellent:

  • Well-separated by concern:
    • expression_parser.go - Parsing
    • expression_validation.go - Validation
    • expression_extraction.go - Extraction
    • expression_builder.go - Building
    • expression_patterns.go - Pattern matching
  • Each file focuses on one operation on expressions
  • Easy to navigate (parser → validator → builder flow)

Key Takeaway: Organize by operation type when dealing with a core concept.


Detailed Function Clusters

<details>
<summary><b>Semantic Clustering Analysis by Pattern</b></summary>

Cluster 1: Validation Functions (validate*, Validate*)

Pattern: Functions that validate configurations, inputs, or workflows
Files: 27 validation files in pkg/workflow

Well-Organized Examples:

  • pip_validation.go, npm_validation.go, docker_validation.go - Domain-specific validators
  • validation_helpers.go - Generic validators (ValidateRequired(), ValidateMaxLength())
  • strict_mode_validation.go - 7 expression safety validators

Analysis: ✅ Excellent organization - validation is well-separated by domain


Cluster 2: Parsing Functions (parse*, Parse*)

Pattern: Functions that parse strings, YAML, or configurations
Files: 20+ files across pkg/workflow and pkg/cli

Well-Organized Examples:

  • trigger_parser.go - 16 functions for trigger parsing
  • tools_parser.go - 13 functions for tool configuration parsing
  • slash_command_parser.go - Slash command parsing
  • schema_compiler.go - Schema compilation and validation

Analysis: ✅ Good organization - domain-specific parsers have dedicated files


Cluster 3: Creation Functions (create*)

Pattern: Functions that create new entities
Files: 10+ files in pkg/workflow

Examples:

  • create_issue.go - GitHub issue creation
  • create_pull_request.go - Pull request creation
  • create_discussion.go - Discussion creation
  • create_agent_session.go - Agent session creation
  • create_code_scanning_alert.go - Security alert creation

Analysis: ✅ Excellent organization - each creation function has its own file


Cluster 4: Building/Generation Functions (build*, generate*)

Pattern: Functions that construct objects, generate output, or render templates
Files: 15+ files

Examples:

  • expression_builder.go - 26 functions for building expression trees
  • mcp_renderer.go - 14 functions for rendering MCP configurations
  • safe_inputs_generator.go - Generating safe input configurations
  • safe_outputs_config_generation.go - Safe outputs configuration

Analysis: ✅ Well-organized, clear separation of building concerns


Cluster 5: Helper/Utility Functions

Common Patterns: ensure*, get*, is*, has*, check*, find*, extract*
Occurrences: 500+ functions across 150+ files

Well-Consolidated Examples:

  • strings.go - String normalization utilities
  • validation_helpers.go - Generic validators
  • config_helpers.go - Configuration parsing helpers
  • error_helpers.go - Error construction helpers

Scattered Examples:

  • String processing helpers in multiple files
  • Config parsing helpers spread across files
  • Repository utilities in repoutil/ package

Analysis: ⚠️ Some consolidation opportunities, but most utilities are well-organized

</details>


Implementation Priorities

Priority 1: High-Impact, Quick Wins (2-4 hours)

  1. Consolidate Exact Duplicate: extractBaseRepo()

    • Merge into pkg/repoutil/repoutil.go
    • Update imports in pkg/workflow/action_resolver.go and pkg/cli/update_actions.go
    • Effort: 1 hour
    • Impact: Immediate code deduplication
  2. Rename ParseGitHubURL Variants for Clarity

    • Rename repoutil.ParseGitHubURL → repoutil.ParseGitRepoURL
    • Add cross-reference comments
    • Effort: 30 minutes
    • Impact: API clarity
  3. Move Outlier Functions from git.go

    • Move ensureGitAttributes() to compile_post_processing.go
    • Move confirmPushOperation() to interactive.go or prompts.go
    • Move parseGitHubRepoSlugFromURL() to repoutil/repoutil.go
    • Move ensureLogsGitignore() to logs_setup.go
    • Effort: 2-3 hours
    • Impact: Clearer file boundaries, improved discoverability

Priority 2: File Size Refactoring (Tracked in Separate Issues)

The large files listed under "File Size and Complexity Issues" are tracked in dedicated file-diet issues (#12747, #12709, #12675).

Note: These should be addressed through their dedicated issues to avoid duplication.


Priority 3: Long-Term Improvements (Optional)

  1. Documentation Improvements
    • Add header comments explaining file organization (like frontmatter_editor.go)
    • Add cross-references for related utilities
    • Effort: 1-2 hours
    • Impact: Improved maintainability

Implementation Guidelines

For All Refactorings:

  1. Preserve Behavior - Ensure existing functionality works identically
  2. Maintain Exports - Keep public API unchanged (unless renaming for clarity)
  3. Write Tests First - Add tests before refactoring (especially for untested code)
  4. Incremental Changes - Move one function at a time
  5. Run Tests Frequently - Verify tests pass after each change
  6. Update Imports - Ensure all import paths are updated
  7. Add Documentation - Explain boundaries with header comments

Testing Strategy:

  • Use table-driven tests for validation/parsing logic
  • Mock external dependencies (git commands, GitHub API)
  • Aim for ≥80% coverage for refactored code
  • Verify integration tests still pass
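
A table-driven check of the kind suggested above, written so that it runs standalone. In the repository this would normally be a func TestExtractBaseRepo(t *testing.T) in a _test.go file; the cases below pin the current behavior of extractBaseRepo() before it is moved, and the case names are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// extractBaseRepo is copied from the existing implementation under test.
func extractBaseRepo(path string) string {
	parts := strings.Split(path, "/")
	if len(parts) >= 2 {
		return parts[0] + "/" + parts[1]
	}
	return path
}

type baseRepoCase struct {
	name, input, want string
}

var baseRepoCases = []baseRepoCase{
	{"owner and repo", "actions/checkout", "actions/checkout"},
	{"sub-path is dropped", "github/codeql-action/upload-sarif", "github/codeql-action"},
	{"single segment returned as-is", "checkout", "checkout"},
	{"empty input", "", ""},
}

// checkBaseRepoCases runs every case and returns one message per mismatch.
func checkBaseRepoCases() []string {
	var failures []string
	for _, tc := range baseRepoCases {
		if got := extractBaseRepo(tc.input); got != tc.want {
			failures = append(failures, fmt.Sprintf("%s: got %q, want %q", tc.name, got, tc.want))
		}
	}
	return failures
}

func main() {
	failures := checkBaseRepoCases()
	for _, f := range failures {
		fmt.Println("FAIL", f)
	}
	if len(failures) == 0 {
		fmt.Println("all cases pass")
	}
}
```

Adding a new edge case is a one-line change to the table, which makes it cheap to record behavior first and then verify that the relocated function still satisfies the same cases.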

Metrics Summary

  • Total Go Files Analyzed: 487 non-test files
  • Major Packages:
    • pkg/workflow: 247 files
    • pkg/cli: 163 files
    • pkg/parser: 30 files
    • pkg/console: 14 files
  • Function Clusters Identified: 5 major clusters (validation, parsing, creation, building, helpers)
  • Exact Duplicates Detected: 1 function (extractBaseRepo()) with 2 identical copies
  • Similar Functions Requiring Renaming: 2 functions
  • Outliers Found: 4 high-priority functions in wrong files
  • Well-Organized Subsystems: 5 exemplary patterns (*_validation, codemod_*, create_*, runtime_*, expression_*)
  • Detection Method: Semantic code analysis + naming pattern analysis + manual code inspection
  • Analysis Date: 2026-01-30

Acceptance Criteria

Refactoring is successful when:

  • Exact duplicate extractBaseRepo() functions are consolidated
  • ParseGitHubURL variants are renamed for clarity
  • Outlier functions are moved to appropriate files
  • All tests pass (unit + integration)
  • Code passes linting
  • Build succeeds
  • Public API remains unchanged (except intentional renames)

AI generated by Semantic Function Refactoring
