
Conversation

@Gkrumbach07 (Collaborator)

Summary

This PR fixes a critical performance issue discovered after merging #428 (pagination and parallel SSAR checks).

Problem

The default Kubernetes client rate limits (QPS=5, Burst=10) were causing significant throttling during parallel SSAR checks:

"Waited before sending request" delay="1.997446758s" reason="client-side throttling"

Solution

Increase the client-side rate limits (see the sketch below):

  • QPS: 5 → 100
  • Burst: 10 → 200
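
A minimal sketch of what this looks like in Go, assuming the backend builds its clients from a rest.Config as noted for components/backend/server/k8s.go (the surrounding function and the use of in-cluster config are assumptions):

package server

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// buildBaseClient sketches how the backend's base client could be configured.
func buildBaseClient() (*kubernetes.Clientset, *rest.Config, error) {
	config, err := rest.InClusterConfig()
	if err != nil {
		return nil, nil, err
	}

	// Raise the client-side rate limits from the defaults (QPS=5, Burst=10)
	// so parallel SSAR checks are not throttled by client-go.
	config.QPS = 100
	config.Burst = 200

	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		return nil, nil, err
	}
	return clientset, config, nil
}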

Performance Impact

Tested with 51 namespaces:

  • Before: 8.2 seconds
  • After: 50 milliseconds
  • Improvement: ~164x faster

Testing

Verified on dev cluster with 51 managed namespaces. Timing logs confirmed SSAR checks completed in 30ms with the increased rate limits.

- Increase QPS from 5 to 100 and Burst from 10 to 200
- This fixes client-side throttling during parallel SSAR checks
- Performance improvement: 8.2s -> 50ms for 51 namespaces (~164x faster)

The default Kubernetes client rate limits (QPS=5, Burst=10) cause
significant throttling when performing parallel SSAR checks for
project listing. This was identified via timing logs showing
'client-side throttling' delays of ~2s per batch.
github-actions bot (Contributor) commented Dec 4, 2025

Claude Code Review

Summary

This PR increases Kubernetes client rate limits (QPS: 5→100, Burst: 10→200) to fix performance bottlenecks discovered after PR #428 added parallel SSAR checks. The change addresses client-side throttling that was causing 2-second delays during project listing operations.

Performance Impact: Testing shows ~164x improvement (8.2s → 50ms for 51 namespaces)

Issues by Severity

🟡 Major Issues

1. User-scoped clients don't inherit rate limit settings

Location: components/backend/handlers/middleware.go:92-102

The rate limits are only set on BaseKubeConfig during initialization, but GetK8sClientsForRequest() creates user-scoped clients by copying the base config:

cfg := *BaseKubeConfig  // Shallow copy
cfg.BearerToken = token
// QPS and Burst ARE copied here, so this is actually fine

Status: ✅ Actually OK - The shallow copy of rest.Config includes QPS/Burst fields, so user-scoped clients DO inherit the rate limits. However, this could be made more explicit with a comment.
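
To make that explicit, the copy site could carry a short comment along these lines (the wording is only a suggestion):

// Value-copy the base config. The copy carries QPS and Burst along with it,
// so user-scoped clients inherit the raised rate limits.
cfg := *BaseKubeConfig
cfg.BearerToken = token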

2. No rate limit configuration for operator clients

Location: components/operator/ (not visible in this PR, but worth noting)

The operator likely uses similar client initialization patterns. If it performs parallel operations (e.g., watching multiple CRs, spawning Jobs), it may also benefit from increased rate limits.

Recommendation: Review operator client initialization at components/operator/internal/config/config.go and consider applying similar rate limit increases if parallel operations are present.
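
If the operator is built on controller-runtime (an assumption; the review only points at components/operator/internal/config/config.go), the equivalent change would sit where the manager's rest.Config is created, for example:

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func newManager() (ctrl.Manager, error) {
	cfg := ctrl.GetConfigOrDie()

	// Same idea as the backend change; only worth raising if the operator
	// actually issues bursts of parallel API requests.
	cfg.QPS = 100
	cfg.Burst = 200

	return ctrl.NewManager(cfg, ctrl.Options{})
}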

3. Rate limits are hardcoded rather than configurable

Location: components/backend/server/k8s.go:42-43

config.QPS = 100
config.Burst = 200

Issue: No way to tune these values without code changes. For different deployment sizes (10 vs 1000 namespaces), optimal values may differ.

Recommendation: Make rate limits configurable via environment variables:

qps := getEnvFloat32("K8S_CLIENT_QPS", 100.0)
burst := getEnvInt("K8S_CLIENT_BURST", 200)
config.QPS = qps
config.Burst = burst

This lets operators tune the limits for different deployment sizes without a code change or image rebuild.
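
The getEnvFloat32 and getEnvInt helpers above don't exist in the PR; a minimal sketch of what they could look like:

import (
	"os"
	"strconv"
)

// getEnvFloat32 reads key as a float32, falling back to def when the variable
// is unset or unparseable.
func getEnvFloat32(key string, def float32) float32 {
	if v := os.Getenv(key); v != "" {
		if f, err := strconv.ParseFloat(v, 32); err == nil {
			return float32(f)
		}
	}
	return def
}

// getEnvInt reads key as an int with the same fallback behavior.
func getEnvInt(key string, def int) int {
	if v := os.Getenv(key); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return def
}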

🔵 Minor Issues

4. Comment could be more specific about the issue

Location: components/backend/server/k8s.go:40-41

Current comment:

// Increase rate limits for parallel operations (default is QPS=5, Burst=10)
// This is needed for parallel SSAR checks when listing projects

Suggestion: Add context about the problem being solved:

// Increase rate limits for parallel operations (default is QPS=5, Burst=10)
// This is needed for parallel SSAR checks when listing projects (PR #428)
// which spawns 10 concurrent goroutines. Default limits caused 2s throttling
// delays. Testing shows 164x performance improvement with these values.

5. No unit tests for rate limit configuration

Location: components/backend/server/k8s_test.go (doesn't exist)

Issue: No automated validation that rate limits are actually set correctly.

Recommendation: Add a test:

func TestInitK8sClientsRateLimits(t *testing.T) {
    // Mock K8s config and verify QPS/Burst are set
    // Verify user-scoped clients inherit settings
}
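
One possible shape for that test, assuming the QPS/Burst assignment is factored into a small helper (the real init function may set the fields inline instead):

import (
	"testing"

	"k8s.io/client-go/rest"
)

// applyRateLimits stands in for the part of the backend's client init that
// sets the limits; it is assumed here for the sake of the test sketch.
func applyRateLimits(cfg *rest.Config) {
	cfg.QPS = 100
	cfg.Burst = 200
}

func TestRateLimitsAppliedAndInherited(t *testing.T) {
	base := &rest.Config{Host: "https://example.invalid"}
	applyRateLimits(base)

	if base.QPS != 100 || base.Burst != 200 {
		t.Fatalf("base config: got QPS=%v Burst=%v", base.QPS, base.Burst)
	}

	// User-scoped clients are built from a value copy of the base config;
	// the copy must keep the raised limits.
	userCfg := *base
	userCfg.BearerToken = "user-token"
	if userCfg.QPS != 100 || userCfg.Burst != 200 {
		t.Fatalf("copied config: got QPS=%v Burst=%v", userCfg.QPS, userCfg.Burst)
	}
}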

Positive Highlights

  • Excellent problem identification - clear documentation of the issue with timing logs
  • Measured performance impact - concrete before/after metrics (8.2s → 50ms)
  • Appropriate values - QPS=100 and Burst=200 are reasonable for 10 concurrent workers
  • Minimal, focused change - only touches what's necessary
  • Good comment placement - explains why the change is needed

Architecture Alignment

  • Follows backend patterns - changes made in the correct location (server/k8s.go)
  • Security intact - rate limits don't affect authentication or authorization logic
  • No RBAC bypass - user-scoped clients still enforce permissions correctly

Performance Considerations

Current implementation is sound for the described use case:

  • The 10 parallel workers share the request's client and its rate limiter, so QPS=100 comfortably covers the ~51 SSAR calls per listing request (see the sketch below)
  • Burst=200 absorbs the initial spike when all workers fire at once
  • The K8s API server's own defaults accommodate this rate (the client-side limits were the bottleneck)
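
For reference, the parallel SSAR pattern this sizing assumes looks roughly like the sketch below; the worker limit, verb, and resource are assumptions, and #428 may structure the checks differently:

import (
	"context"
	"sync"

	authorizationv1 "k8s.io/api/authorization/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// allowedNamespaces issues one SelfSubjectAccessReview per namespace with at
// most 10 in flight; every Create call draws on the client's QPS/Burst budget,
// which is why the old QPS=5 queued these calls behind the rate limiter.
func allowedNamespaces(ctx context.Context, client kubernetes.Interface, namespaces []string) []string {
	sem := make(chan struct{}, 10) // assumed worker limit from #428
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		allowed []string
	)

	for _, ns := range namespaces {
		wg.Add(1)
		go func(ns string) {
			defer wg.Done()
			sem <- struct{}{}
			defer func() { <-sem }()

			ssar := &authorizationv1.SelfSubjectAccessReview{
				Spec: authorizationv1.SelfSubjectAccessReviewSpec{
					ResourceAttributes: &authorizationv1.ResourceAttributes{
						Namespace: ns,
						Verb:      "get",        // assumed verb
						Resource:  "namespaces", // assumed resource
					},
				},
			}
			res, err := client.AuthorizationV1().SelfSubjectAccessReviews().Create(ctx, ssar, metav1.CreateOptions{})
			if err != nil || !res.Status.Allowed {
				return
			}
			mu.Lock()
			allowed = append(allowed, ns)
			mu.Unlock()
		}(ns)
	}
	wg.Wait()
	return allowed
}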

Potential future concerns:

  • If worker count increases beyond 20-30, may need further tuning
  • Large clusters (>500 namespaces) may hit K8s API server limits
  • Consider implementing adaptive rate limiting based on cluster size

Testing Validation

Tested on real cluster - Dev cluster with 51 namespaces
Timing logs confirm fix - SSAR checks complete in 30-50ms
⚠️ Missing edge case testing:

  • Behavior when K8s API server is under load
  • Behavior with 500+ namespaces
  • Impact on other backend operations during parallel SSAR checks

Recommendations (Prioritized)

High Priority (Address before merge)

None - code is safe to merge as-is.

Medium Priority (Consider for follow-up PR)

  1. Make rate limits configurable via env vars - allows tuning without a code change or image rebuild
  2. Review operator client configuration - Apply similar fixes if needed
  3. Add integration test - Verify rate limits survive client creation

Low Priority (Future improvements)

  1. Add monitoring/metrics - Track rate limit exhaustion events
  2. Document tuning guidelines - Help operators choose appropriate values
  3. Consider adaptive rate limiting - Adjust based on cluster size/load

Security Review

  • No security concerns - rate limits are a performance optimization
  • Token handling unchanged - no impact on authentication patterns
  • RBAC enforcement intact - user permissions still validated via SSAR

Conclusion

Recommendation: ✅ APPROVE

This is a well-targeted fix for a real performance issue. The change is minimal, well-documented, and the performance improvement is substantial. While there are suggestions for making rate limits configurable, the current hardcoded values are reasonable and don't block merging.

The code follows established patterns in the codebase and doesn't introduce any security or correctness issues. User-scoped clients inherit the rate limit settings via config copy, so all client types benefit from the fix.


Review performed by Claude Code using repository standards from CLAUDE.md, backend-development.md, security-standards.md, and k8s-client-usage.md patterns.



What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

Gkrumbach07 merged commit 6605008 into main on Dec 4, 2025 (20 checks passed).
Gkrumbach07 deleted the fix/k8s-rate-limits branch on December 4, 2025 at 20:35.