[P0] LLM Retry Logic Infinite Loop Risk #104

@jeremyeder

Summary

The rate limit retry logic in LLMEnricher.enrich_skill() calls itself recursively with no retry limit, creating a potential infinite loop. If the Anthropic API returns rate limit errors repeatedly (e.g., account suspended, quota exhausted), the method retries indefinitely, ending in a hang or, once Python's recursion limit is reached, a RecursionError.

Impact

  • Real scenario: API key revoked → retry forever → production system hangs or overflows the stack
  • User cannot limit retries (no max retry parameter)
  • Each retry consumes stack space (recursive calls)
  • Unnecessary API costs if the quota is not completely exhausted

Location

  • File: src/agentready/learners/llm_enricher.py
  • Lines: 93-99
  • Function: LLMEnricher.enrich_skill()

Current Code

except RateLimitError as e:
    logger.warning(f"Rate limit hit for {skill.skill_id}: {e}")
    # Exponential backoff
    retry_after = int(getattr(e, "retry_after", 60))
    logger.info(f"Retrying after {retry_after} seconds...")
    sleep(retry_after)
    return self.enrich_skill(skill, repository, finding, use_cache)

Solution

Add bounded retry with graceful fallback:

import random  # Needed for the jitter calculation below
from time import sleep

def enrich_skill(
    self,
    skill: DiscoveredSkill,
    repository: Repository,
    finding: Finding,
    use_cache: bool = True,
    max_retries: int = 3,
    _retry_count: int = 0,
) -> DiscoveredSkill:
    """Enrich skill with LLM-generated content.

    Args:
        skill: Skill to enrich
        repository: Repository context
        finding: Assessment finding
        use_cache: Use cached responses if available (default: True)
        max_retries: Maximum retry attempts for rate limits (default: 3)
        _retry_count: Internal retry counter (do not set manually)

    Returns:
        Enriched skill with LLM content, or original skill if enrichment fails
    """
    # ... existing code ...

    except RateLimitError as e:
        # Check if max retries exceeded
        if _retry_count >= max_retries:
            logger.error(
                f"Max retries ({max_retries}) exceeded for {skill.skill_id}. "
                f"Falling back to heuristic skill. "
                f"Check API quota: https://console.anthropic.com/settings/limits"
            )
            return skill  # Graceful fallback

        # Calculate backoff with jitter
        retry_after = int(getattr(e, "retry_after", 60))
        jitter = random.uniform(0, min(retry_after * 0.1, 5))
        total_wait = retry_after + jitter

        logger.warning(
            f"Rate limit hit for {skill.skill_id} "
            f"(retry {_retry_count + 1}/{max_retries}): {e}"
        )
        logger.info(f"Retrying after {total_wait:.1f} seconds...")

        sleep(total_wait)

        return self.enrich_skill(
            skill, repository, finding, use_cache, max_retries, _retry_count + 1
        )
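The recursive solution above is bounded, but each retry still adds a stack frame. As a minimal sketch of an alternative, the same policy can be written as a loop. The helper name `call_with_bounded_retry`, the `sleep_fn` parameter, and the stand-in `RateLimitError` class are assumptions made for illustration; they are not part of the existing code or the SDK.

```python
import logging
import random
from time import sleep
from typing import Callable, TypeVar

logger = logging.getLogger(__name__)
T = TypeVar("T")


class RateLimitError(Exception):
    """Stand-in for the SDK's rate limit exception (assumption for this sketch)."""

    def __init__(self, message: str, retry_after: float = 60):
        super().__init__(message)
        self.retry_after = retry_after


def call_with_bounded_retry(
    call: Callable[[], T],
    fallback: T,
    max_retries: int = 3,
    sleep_fn: Callable[[float], None] = sleep,
) -> T:
    """Run `call`, retrying on RateLimitError at most `max_retries` times.

    Returns `fallback` once the retry budget is spent.
    """
    for attempt in range(max_retries + 1):  # 1 initial attempt + max_retries retries
        try:
            return call()
        except RateLimitError as e:
            if attempt >= max_retries:
                logger.error("Max retries (%d) exceeded; falling back.", max_retries)
                return fallback
            # Same wait policy as the proposed fix: retry_after plus capped jitter
            retry_after = float(getattr(e, "retry_after", 60))
            jitter = random.uniform(0, min(retry_after * 0.1, 5))
            logger.warning(
                "Rate limit hit (retry %d/%d): %s", attempt + 1, max_retries, e
            )
            sleep_fn(retry_after + jitter)
    return fallback  # Reached only if max_retries is negative
```

Passing `sleep_fn` as a parameter keeps the helper testable without real delays, and the loop keeps stack depth constant regardless of `max_retries`.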

Testing

# 1. Unit tests for retry behavior
pytest tests/unit/test_llm_enricher.py::test_llm_enricher_max_retries -v
pytest tests/unit/test_llm_enricher.py::test_llm_enricher_successful_retry -v

# 2. Manual test with invalid API key (should fail gracefully)
export ANTHROPIC_API_KEY="invalid-key"
agentready learn . --enable-llm --llm-max-retries 2

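# Expected: Retries 2 times, then falls back to heuristic
# Note: an invalid key usually produces an authentication error (HTTP 401) rather than
# a rate limit error (HTTP 429), so this run may fall back without exercising the retry path.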
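The two unit tests named above could follow the pattern sketched here. The `enrich` function and the `client.complete` method are simplified, hypothetical stand-ins so the example runs on its own; the real tests would target LLMEnricher.enrich_skill() and patch its API client and `sleep`.

```python
from unittest.mock import Mock


class RateLimitError(Exception):
    """Stand-in for the SDK exception (assumption for this sketch)."""


def enrich(client, skill, max_retries=3, sleep_fn=lambda s: None, _retry_count=0):
    """Simplified stand-in for LLMEnricher.enrich_skill() with bounded retry."""
    try:
        return client.complete(skill)
    except RateLimitError:
        if _retry_count >= max_retries:
            return skill  # Fall back to the unenriched skill
        sleep_fn(0)  # No real waiting in tests
        return enrich(client, skill, max_retries, sleep_fn, _retry_count + 1)


def test_max_retries_falls_back():
    client = Mock()
    client.complete.side_effect = RateLimitError("limited")  # Every call fails
    assert enrich(client, "heuristic", max_retries=2) == "heuristic"
    assert client.complete.call_count == 3  # 1 initial call + 2 retries


def test_successful_retry():
    client = Mock()
    # First call is rate limited, second call succeeds
    client.complete.side_effect = [RateLimitError("limited"), "enriched"]
    assert enrich(client, "heuristic", max_retries=2) == "enriched"
    assert client.complete.call_count == 2
```

Asserting on `call_count` pins down the off-by-one question directly: `max_retries=2` means three API calls in total.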

Acceptance Criteria

  • max_retries parameter added to function signature
  • Retry counter checked before recursive call
  • Graceful fallback to heuristic skill on max retries
  • Jitter added to prevent thundering herd
  • CLI option --llm-max-retries added
  • Unit tests for retry limit added
  • Unit tests for successful retry added
  • Documentation updated with retry behavior
  • Error messages include helpful context (API quota link)
  • All existing tests pass

Best Practices Applied

  1. Backoff with jitter: Waits for the server-suggested retry_after plus random jitter, which prevents thundering herd
  2. Bounded retries: Prevents infinite loops
  3. Graceful degradation: Falls back to heuristic on failure
  4. User control: CLI option for retry limit
  5. Helpful errors: Links to API quota page
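Note that the proposed code waits for `retry_after` plus jitter on every attempt; the delay does not grow between retries. If true exponential backoff is wanted, one common variant is "full jitter", sketched below. The function name `backoff_delay` and its `base` and `cap` defaults are illustrative assumptions, not values taken from the codebase.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Return a delay in seconds for the given zero-based retry attempt.

    The upper bound doubles each attempt (base * 2**attempt) up to `cap`;
    the actual delay is drawn uniformly from [0, upper bound] ("full jitter").
    """
    upper = min(cap, base * (2 ** attempt))
    return random.uniform(0, upper)
```

A caller could combine the two signals, for example by sleeping for the larger of `retry_after` and `backoff_delay(attempt)`, so that a server hint is honored while repeated failures still spread out over time.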

References

  • Full remediation plan: .plans/code-review-remediation-plan.md

Labels: security, bug, P0, llm
Milestone: v1.24.0
