
Conversation

@Gkrumbach07 (Collaborator)

Summary

This PR fixes a critical performance issue discovered after merging #428 (pagination and parallel SSAR checks).

Problem

The default Kubernetes client rate limits (QPS=5, Burst=10) were causing significant throttling during parallel SSAR checks:

"Waited before sending request" delay="1.997446758s" reason="client-side throttling"

Solution

Increase the client-side rate limits (see the sketch below):

  • QPS: 5 → 100
  • Burst: 10 → 200
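
A minimal sketch of what this looks like in Go, assuming the backend builds its clients from a rest.Config as noted for components/backend/server/k8s.go (the surrounding function and the use of in-cluster config are assumptions):

package server

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// buildBaseClient sketches how the backend's base client could be configured.
func buildBaseClient() (*kubernetes.Clientset, *rest.Config, error) {
	config, err := rest.InClusterConfig()
	if err != nil {
		return nil, nil, err
	}

	// Raise the client-side rate limits from the defaults (QPS=5, Burst=10)
	// so parallel SSAR checks are not throttled by client-go.
	config.QPS = 100
	config.Burst = 200

	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		return nil, nil, err
	}
	return clientset, config, nil
}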

Performance Impact

Tested with 51 namespaces:

  • Before: 8.2 seconds
  • After: 50 milliseconds
  • Improvement: ~164x faster

Testing

Verified on dev cluster with 51 managed namespaces. Timing logs confirmed SSAR checks completed in 30ms with the increased rate limits.

- Increase QPS from 5 to 100 and Burst from 10 to 200
- This fixes client-side throttling during parallel SSAR checks
- Performance improvement: 8.2s -> 50ms for 51 namespaces (~164x faster)

The default Kubernetes client rate limits (QPS=5, Burst=10) cause
significant throttling when performing parallel SSAR checks for
project listing. This was identified via timing logs showing
'client-side throttling' delays of ~2s per batch.
github-actions bot (Contributor) commented Dec 4, 2025

Claude Code Review

Summary

This PR increases Kubernetes client rate limits (QPS: 5→100, Burst: 10→200) to fix performance bottlenecks discovered after PR #428 added parallel SSAR checks. The change addresses client-side throttling that was causing 2-second delays during project listing operations.

Performance Impact: Testing shows ~164x improvement (8.2s → 50ms for 51 namespaces)

Issues by Severity

🟡 Major Issues

1. User-scoped clients don't inherit rate limit settings

Location: components/backend/handlers/middleware.go:92-102

The rate limits are only set on BaseKubeConfig during initialization, but GetK8sClientsForRequest() creates user-scoped clients by copying the base config:

cfg := *BaseKubeConfig  // Shallow copy
cfg.BearerToken = token
// QPS and Burst ARE copied here, so this is actually fine

Status: ✅ Actually OK - The shallow copy of rest.Config includes QPS/Burst fields, so user-scoped clients DO inherit the rate limits. However, this could be made more explicit with a comment.
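
To make that explicit, the copy site could carry a short comment along these lines (the wording is only a suggestion):

// Value-copy the base config. The copy carries QPS and Burst along with it,
// so user-scoped clients inherit the raised rate limits.
cfg := *BaseKubeConfig
cfg.BearerToken = token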

2. No rate limit configuration for operator clients

Location: components/operator/ (not visible in this PR, but worth noting)

The operator likely uses similar client initialization patterns. If it performs parallel operations (e.g., watching multiple CRs, spawning Jobs), it may also benefit from increased rate limits.

Recommendation: Review operator client initialization at components/operator/internal/config/config.go and consider applying similar rate limit increases if parallel operations are present.
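
If the operator is built on controller-runtime (an assumption; the review only points at components/operator/internal/config/config.go), the equivalent change would sit where the manager's rest.Config is created, for example:

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func newManager() (ctrl.Manager, error) {
	cfg := ctrl.GetConfigOrDie()

	// Same idea as the backend change; only worth raising if the operator
	// actually issues bursts of parallel API requests.
	cfg.QPS = 100
	cfg.Burst = 200

	return ctrl.NewManager(cfg, ctrl.Options{})
}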

3. Rate limits are hardcoded rather than configurable

Location: components/backend/server/k8s.go:42-43

config.QPS = 100
config.Burst = 200

Issue: No way to tune these values without code changes. For different deployment sizes (10 vs 1000 namespaces), optimal values may differ.

Recommendation: Make rate limits configurable via environment variables:

qps := getEnvFloat32("K8S_CLIENT_QPS", 100.0)
burst := getEnvInt("K8S_CLIENT_BURST", 200)
config.QPS = qps
config.Burst = burst

This lets operators tune the limits for different deployment sizes without a code change or image rebuild.
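
The getEnvFloat32 and getEnvInt helpers above don't exist in the PR; a minimal sketch of what they could look like:

import (
	"os"
	"strconv"
)

// getEnvFloat32 reads key as a float32, falling back to def when the variable
// is unset or unparseable.
func getEnvFloat32(key string, def float32) float32 {
	if v := os.Getenv(key); v != "" {
		if f, err := strconv.ParseFloat(v, 32); err == nil {
			return float32(f)
		}
	}
	return def
}

// getEnvInt reads key as an int with the same fallback behavior.
func getEnvInt(key string, def int) int {
	if v := os.Getenv(key); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return def
}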

🔵 Minor Issues

4. Comment could be more specific about the issue

Location: components/backend/server/k8s.go:40-41

Current comment:

// Increase rate limits for parallel operations (default is QPS=5, Burst=10)
// This is needed for parallel SSAR checks when listing projects

Suggestion: Add context about the problem being solved:

// Increase rate limits for parallel operations (default is QPS=5, Burst=10)
// This is needed for parallel SSAR checks when listing projects (PR #428)
// which spawns 10 concurrent goroutines. Default limits caused 2s throttling
// delays. Testing shows 164x performance improvement with these values.

5. No unit tests for rate limit configuration

Location: components/backend/server/k8s_test.go (doesn't exist)

Issue: No automated validation that rate limits are actually set correctly.

Recommendation: Add a test:

func TestInitK8sClientsRateLimits(t *testing.T) {
    // Mock K8s config and verify QPS/Burst are set
    // Verify user-scoped clients inherit settings
}
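
One possible shape for that test, assuming the QPS/Burst assignment is factored into a small helper (the real init function may set the fields inline instead):

import (
	"testing"

	"k8s.io/client-go/rest"
)

// applyRateLimits stands in for the part of the backend's client init that
// sets the limits; it is assumed here for the sake of the test sketch.
func applyRateLimits(cfg *rest.Config) {
	cfg.QPS = 100
	cfg.Burst = 200
}

func TestRateLimitsAppliedAndInherited(t *testing.T) {
	base := &rest.Config{Host: "https://example.invalid"}
	applyRateLimits(base)

	if base.QPS != 100 || base.Burst != 200 {
		t.Fatalf("base config: got QPS=%v Burst=%v", base.QPS, base.Burst)
	}

	// User-scoped clients are built from a value copy of the base config;
	// the copy must keep the raised limits.
	userCfg := *base
	userCfg.BearerToken = "user-token"
	if userCfg.QPS != 100 || userCfg.Burst != 200 {
		t.Fatalf("copied config: got QPS=%v Burst=%v", userCfg.QPS, userCfg.Burst)
	}
}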

Positive Highlights

  • Excellent problem identification - clear documentation of the issue with timing logs
  • Measured performance impact - concrete before/after metrics (8.2s → 50ms)
  • Appropriate values - QPS=100 and Burst=200 are reasonable for 10 concurrent workers
  • Minimal, focused change - only touches what's necessary
  • Good comment placement - explains why the change is needed

Architecture Alignment

  • Follows backend patterns - changes made in the correct location (server/k8s.go)
  • Security intact - rate limits don't affect authentication or authorization logic
  • No RBAC bypass - user-scoped clients still enforce permissions correctly

Performance Considerations

Current implementation is sound for the described use case:

  • The 10 parallel workers share the request's client and its rate limiter, so QPS=100 comfortably covers the ~51 SSAR calls per listing request (see the sketch below)
  • Burst=200 absorbs the initial spike when all workers fire at once
  • The K8s API server's own defaults accommodate this rate (the client-side limits were the bottleneck)
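
For reference, the parallel SSAR pattern this sizing assumes looks roughly like the sketch below; the worker limit, verb, and resource are assumptions, and #428 may structure the checks differently:

import (
	"context"
	"sync"

	authorizationv1 "k8s.io/api/authorization/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// allowedNamespaces issues one SelfSubjectAccessReview per namespace with at
// most 10 in flight; every Create call draws on the client's QPS/Burst budget,
// which is why the old QPS=5 queued these calls behind the rate limiter.
func allowedNamespaces(ctx context.Context, client kubernetes.Interface, namespaces []string) []string {
	sem := make(chan struct{}, 10) // assumed worker limit from #428
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		allowed []string
	)

	for _, ns := range namespaces {
		wg.Add(1)
		go func(ns string) {
			defer wg.Done()
			sem <- struct{}{}
			defer func() { <-sem }()

			ssar := &authorizationv1.SelfSubjectAccessReview{
				Spec: authorizationv1.SelfSubjectAccessReviewSpec{
					ResourceAttributes: &authorizationv1.ResourceAttributes{
						Namespace: ns,
						Verb:      "get",        // assumed verb
						Resource:  "namespaces", // assumed resource
					},
				},
			}
			res, err := client.AuthorizationV1().SelfSubjectAccessReviews().Create(ctx, ssar, metav1.CreateOptions{})
			if err != nil || !res.Status.Allowed {
				return
			}
			mu.Lock()
			allowed = append(allowed, ns)
			mu.Unlock()
		}(ns)
	}
	wg.Wait()
	return allowed
}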

Potential future concerns:

  • If worker count increases beyond 20-30, may need further tuning
  • Large clusters (>500 namespaces) may hit K8s API server limits
  • Consider implementing adaptive rate limiting based on cluster size

Testing Validation

Tested on real cluster - Dev cluster with 51 namespaces
Timing logs confirm fix - SSAR checks complete in 30-50ms
⚠️ Missing edge case testing:

  • Behavior when K8s API server is under load
  • Behavior with 500+ namespaces
  • Impact on other backend operations during parallel SSAR checks

Recommendations (Prioritized)

High Priority (Address before merge)

None - code is safe to merge as-is.

Medium Priority (Consider for follow-up PR)

  1. Make rate limits configurable via env vars - allows tuning without a code change or image rebuild
  2. Review operator client configuration - Apply similar fixes if needed
  3. Add integration test - Verify rate limits survive client creation

Low Priority (Future improvements)

  1. Add monitoring/metrics - Track rate limit exhaustion events
  2. Document tuning guidelines - Help operators choose appropriate values
  3. Consider adaptive rate limiting - Adjust based on cluster size/load

Security Review

  • No security concerns - rate limits are a performance optimization
  • Token handling unchanged - no impact on authentication patterns
  • RBAC enforcement intact - user permissions still validated via SSAR

Conclusion

Recommendation: ✅ APPROVE

This is a well-targeted fix for a real performance issue. The change is minimal, well-documented, and the performance improvement is substantial. While there are suggestions for making rate limits configurable, the current hardcoded values are reasonable and don't block merging.

The code follows established patterns in the codebase and doesn't introduce any security or correctness issues. User-scoped clients inherit the rate limit settings via config copy, so all client types benefit from the fix.


Review performed by Claude Code using repository standards from CLAUDE.md, backend-development.md, security-standards.md, and k8s-client-usage.md patterns.



What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

Gkrumbach07 merged commit 6605008 into main on Dec 4, 2025 (20 checks passed).
Gkrumbach07 deleted the fix/k8s-rate-limits branch on December 4, 2025 at 20:35.