Skip to content

feat(kb): add adjustable concurrency and batching to uploads and embeddings#1198

Merged
waleedlatif1 merged 1 commit intostagingfrom
improvement/kb-concurrency
Aug 30, 2025
Merged

feat(kb): add adjustable concurrency and batching to uploads and embeddings#1198
waleedlatif1 merged 1 commit intostagingfrom
improvement/kb-concurrency

Conversation

@waleedlatif1
Copy link
Collaborator

Summary

add adjustable concurrency and batching to uploads and embeddings

Type of Change

  • New feature

Testing

Defaults all remain standard config.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel
Copy link

vercel bot commented Aug 30, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
sim Building Building Preview Comment Aug 30, 2025 1:37am
1 Skipped Deployment
Project Deployment Preview Comments Updated (UTC)
docs Skipped Skipped Aug 30, 2025 1:37am

@waleedlatif1 waleedlatif1 merged commit a383830 into staging Aug 30, 2025
3 of 4 checks passed
@waleedlatif1 waleedlatif1 deleted the improvement/kb-concurrency branch August 30, 2025 01:37
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This PR introduces comprehensive configuration options for Knowledge Base (KB) processing operations by making concurrency limits, batch sizes, timeouts, and retry parameters adjustable through environment variables. The changes span three key files:

Environment Configuration (lib/env.ts): Adds 9 new environment variables with sensible defaults for KB processing configuration, including KB_CONFIG_CONCURRENCY_LIMIT (20), KB_CONFIG_BATCH_SIZE (20), KB_CONFIG_MAX_ATTEMPTS (3), and various timeout/delay parameters. These variables are well-documented and provide backward compatibility through default values.

Document Service (lib/knowledge/documents/service.ts): Replaces hardcoded processing parameters with the new environment-based configuration. Creates separate configuration objects for different processing modes - a conservative PROCESSING_CONFIG for in-memory processing and a more aggressive REDIS_PROCESSING_CONFIG for Redis-backed queues. The service now dynamically calculates concurrency limits and batch sizes based on the environment variables, with mathematical transformations to optimize for different processing backends.

Background Processing (background/knowledge-processing.ts): Updates the Trigger.dev task configuration to use the new environment variables for maxDuration, retry settings (maxAttempts, factor, minTimeoutInMs, maxTimeoutInMs), and queue concurrencyLimit. This ensures consistent configuration across all KB processing components.

The implementation allows operators to tune KB processing performance based on infrastructure capacity, API rate limits, and deployment environment requirements without code changes. The system maintains different behavior profiles - more conservative settings for fallback in-memory processing versus optimized settings when Redis queuing is available. This change integrates well with the existing codebase architecture and follows the pattern of using environment variables for operational configuration seen elsewhere in the application.

Confidence score: 3/5

  • This PR introduces significant configurability improvements but has some potential runtime safety issues
  • Score reflects concerns about mathematical operations that could result in zero values and missing fallback handling in critical paths
  • Pay close attention to apps/sim/lib/knowledge/documents/service.ts and apps/sim/background/knowledge-processing.ts for potential runtime failures

3 files reviewed, 3 comments

Edit Code Review Bot Settings | Greptile

batchSize: 10,
delayBetweenBatches: 200,
delayBetweenDocuments: 100,
maxConcurrentDocuments: Math.max(1, Math.floor(env.KB_CONFIG_CONCURRENCY_LIMIT / 5)) || 4,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: The || 4 fallback won't trigger because Math.max(1, ...) ensures the result is never falsy. Consider using nullish coalescing if the env var might be undefined.

Comment on lines +43 to +44
delayBetweenBatches: env.KB_CONFIG_DELAY_BETWEEN_BATCHES * 2,
delayBetweenDocuments: env.KB_CONFIG_DELAY_BETWEEN_DOCUMENTS * 2,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: Direct multiplication without null checks could result in NaN if env variables are undefined. Consider adding validation or default values.

Comment on lines +62 to +63
retryDelay: env.KB_CONFIG_MIN_TIMEOUT,
maxRetries: env.KB_CONFIG_MAX_ATTEMPTS,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: Using environment variables directly without validation could cause runtime errors. Consider adding type checks or default values.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant