feat(kb): add adjustable concurrency and batching to uploads and embeddings#1198
feat(kb): add adjustable concurrency and batching to uploads and embeddings#1198waleedlatif1 merged 1 commit intostagingfrom
Conversation
There was a problem hiding this comment.
Greptile Summary
This PR introduces comprehensive configuration options for Knowledge Base (KB) processing operations by making concurrency limits, batch sizes, timeouts, and retry parameters adjustable through environment variables. The changes span three key files:
Environment Configuration (lib/env.ts): Adds 9 new environment variables with sensible defaults for KB processing configuration, including KB_CONFIG_CONCURRENCY_LIMIT (20), KB_CONFIG_BATCH_SIZE (20), KB_CONFIG_MAX_ATTEMPTS (3), and various timeout/delay parameters. These variables are well-documented and provide backward compatibility through default values.
Document Service (lib/knowledge/documents/service.ts): Replaces hardcoded processing parameters with the new environment-based configuration. Creates separate configuration objects for different processing modes - a conservative PROCESSING_CONFIG for in-memory processing and a more aggressive REDIS_PROCESSING_CONFIG for Redis-backed queues. The service now dynamically calculates concurrency limits and batch sizes based on the environment variables, with mathematical transformations to optimize for different processing backends.
Background Processing (background/knowledge-processing.ts): Updates the Trigger.dev task configuration to use the new environment variables for maxDuration, retry settings (maxAttempts, factor, minTimeoutInMs, maxTimeoutInMs), and queue concurrencyLimit. This ensures consistent configuration across all KB processing components.
The implementation allows operators to tune KB processing performance based on infrastructure capacity, API rate limits, and deployment environment requirements without code changes. The system maintains different behavior profiles - more conservative settings for fallback in-memory processing versus optimized settings when Redis queuing is available. This change integrates well with the existing codebase architecture and follows the pattern of using environment variables for operational configuration seen elsewhere in the application.
Confidence score: 3/5
- This PR introduces significant configurability improvements but has some potential runtime safety issues
- Score reflects concerns about mathematical operations that could result in zero values and missing fallback handling in critical paths
- Pay close attention to
apps/sim/lib/knowledge/documents/service.tsandapps/sim/background/knowledge-processing.tsfor potential runtime failures
3 files reviewed, 3 comments
| batchSize: 10, | ||
| delayBetweenBatches: 200, | ||
| delayBetweenDocuments: 100, | ||
| maxConcurrentDocuments: Math.max(1, Math.floor(env.KB_CONFIG_CONCURRENCY_LIMIT / 5)) || 4, |
There was a problem hiding this comment.
logic: The || 4 fallback won't trigger because Math.max(1, ...) ensures the result is never falsy. Consider using nullish coalescing if the env var might be undefined.
| delayBetweenBatches: env.KB_CONFIG_DELAY_BETWEEN_BATCHES * 2, | ||
| delayBetweenDocuments: env.KB_CONFIG_DELAY_BETWEEN_DOCUMENTS * 2, |
There was a problem hiding this comment.
logic: Direct multiplication without null checks could result in NaN if env variables are undefined. Consider adding validation or default values.
| retryDelay: env.KB_CONFIG_MIN_TIMEOUT, | ||
| maxRetries: env.KB_CONFIG_MAX_ATTEMPTS, |
There was a problem hiding this comment.
logic: Using environment variables directly without validation could cause runtime errors. Consider adding type checks or default values.
Summary
add adjustable concurrency and batching to uploads and embeddings
Type of Change
Testing
Defaults all remain standard config.
Checklist