What problem does this solve?
Many LLM providers enforce rate and concurrency limits (requests per minute, maximum concurrent requests) on their APIs. Currently, Clawith has no built-in mechanism to control the number of simultaneous outgoing LLM requests. This leads to several problems:
- No concurrency control – When multiple Agents are active simultaneously (e.g. via the Aware/Pulse Engine triggers, A2A conversations, or multi-channel IM messages), the system can fire an unbounded number of parallel LLM calls, easily exceeding a provider's rate limit and resulting in HTTP 429 errors or dropped connections.
- Shared API key, shared limit – The current LLM configuration (`PROVIDER_REGISTRY` in `llm_client.py`) is keyed per-provider, not per-API-key. Multiple model entries (e.g. `gpt-4o`, `gpt-4o-mini`, `o1-preview`) may all use the same OpenAI API key and therefore share a single concurrency quota from the provider, but the system treats them independently, with no awareness of this shared ceiling.
- No way to group models – There is no facility to declare "these 5 models share a pool of N concurrent requests" or "these two providers should be merged into one limit group because they route through the same gateway."
- Silent failures – When the provider rejects a request due to concurrency overflow, the user sees a generic `LLMError` rather than a queued/retried response.
Proposed solution
Introduce a Concurrency Limit Management layer with the following capabilities:
1. Per-group concurrency semaphores
Allow administrators to define named concurrency groups. Each group holds:
- A list of LLM config IDs (or provider + model patterns) that belong to the group
- A maximum concurrent request count (`max_concurrency`)

Internally this could be backed by `asyncio.Semaphore` instances keyed by group name.
2. Default auto-grouping by provider
Out of the box, group all models under the same provider (e.g. all openai models → one semaphore). Admins can override this by creating explicit custom groups.
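One possible lookup order for this override behavior (a sketch; `resolve_group` and its parameters are assumed names, not existing code): check explicit custom groups first, then fall back to the provider name as the implicit default group.

```python
def resolve_group(model_id: str, provider: str,
                  custom_groups: dict[str, list[str]]) -> str:
    """Map a model to its concurrency group.

    An explicit custom group wins; otherwise the provider name acts as
    the implicit default group, so all of a provider's models share one
    semaphore out of the box.
    """
    for group_name, members in custom_groups.items():
        if model_id in members:
            return group_name
    return provider
```

For example, `resolve_group("gpt-4o", "openai", {"custom-gateway": ["gpt-4o"]})` would yield the custom group, while `resolve_group("gpt-4o-mini", "openai", {})` falls back to the `openai` default group.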
3. Custom cross-provider grouping
Support merging models from different providers into a single group. This covers scenarios where:
- A reverse proxy / API gateway fronts multiple providers under one rate limit
- An organization uses a shared quota across multiple API keys
- Specific high-traffic models need a dedicated pool separate from the default
Example configuration concept:
```yaml
concurrency_groups:
  - name: openai-shared
    max_concurrency: 5
    models: ["gpt-4o", "gpt-4o-mini", "o1-preview"]
  - name: deepseek-group
    max_concurrency: 3
    providers: ["deepseek"]
  - name: custom-gateway
    max_concurrency: 10
    models: ["claude-3-opus", "gpt-4o"]  # behind same proxy
```
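Once parsed (e.g. from YAML into a plain dict), such a config could be materialized into semaphores in one pass. This is a sketch under the assumption that the key names mirror the example above:

```python
import asyncio

# Assumed shape: the YAML example above, already parsed into a dict.
parsed_config = {
    "concurrency_groups": [
        {"name": "openai-shared", "max_concurrency": 5,
         "models": ["gpt-4o", "gpt-4o-mini", "o1-preview"]},
        {"name": "deepseek-group", "max_concurrency": 3,
         "providers": ["deepseek"]},
        {"name": "custom-gateway", "max_concurrency": 10,
         "models": ["claude-3-opus", "gpt-4o"]},
    ]
}

def build_limiters(cfg: dict) -> dict[str, asyncio.Semaphore]:
    """Build one asyncio.Semaphore per configured group."""
    return {group["name"]: asyncio.Semaphore(group["max_concurrency"])
            for group in cfg["concurrency_groups"]}

limiters = build_limiters(parsed_config)
```

Note that `gpt-4o` appears in both `openai-shared` and `custom-gateway` in the example, so the design would also need a precedence rule (or validation) for models listed in more than one group.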
4. Graceful queuing & visibility
- When a group is at capacity, new LLM requests wait in a queue (with configurable timeout) rather than failing immediately
- Expose current concurrency utilization via an admin API endpoint and optionally in the frontend dashboard
- Log warnings when approaching the limit (e.g. 80% utilization)
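The queuing, timeout, and 80%-utilization warning could be combined in one small class. A minimal sketch, assuming illustrative names (`ConcurrencyGroup`, `queue_timeout` are not existing identifiers):

```python
import asyncio
import logging

logger = logging.getLogger("concurrency")

class ConcurrencyGroup:
    """Queue callers at capacity and warn when utilization is high."""

    def __init__(self, name: str, max_concurrency: int,
                 queue_timeout: float = 30.0):
        self.name = name
        self.max_concurrency = max_concurrency
        self.queue_timeout = queue_timeout
        self._sem = asyncio.Semaphore(max_concurrency)
        self._in_flight = 0

    async def run(self, make_request):
        # Wait for a free slot instead of failing immediately, but give up
        # with a clear error after the configurable timeout rather than
        # surfacing the provider's 429 as a generic LLMError.
        try:
            await asyncio.wait_for(self._sem.acquire(), self.queue_timeout)
        except asyncio.TimeoutError:
            raise RuntimeError(
                f"group '{self.name}' at capacity; queue timed out")
        self._in_flight += 1
        if self._in_flight >= 0.8 * self.max_concurrency:
            logger.warning("group %s at %d/%d concurrent requests",
                           self.name, self._in_flight, self.max_concurrency)
        try:
            return await make_request()
        finally:
            self._in_flight -= 1
            self._sem.release()
```

The `_in_flight` counter doubles as the data source for the admin API endpoint: utilization per group is simply `_in_flight / max_concurrency`.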
5. Admin UI
Add a concurrency management section in the enterprise settings page to configure groups without editing config files.
Why this is useful
- Reliability – Avoids HTTP 429 errors by keeping request volume within provider limits
- Fairness – Ensures no single Agent or trigger monopolizes the LLM capacity
- Flexibility – Adapts to any deployment topology (single provider, multi-provider, gateway-proxied)
- Cost control – Prevents unexpected burst usage that could spike API costs
- Production readiness – Essential for multi-tenant SaaS deployments where many Agents run concurrently
Willing to contribute?
Yes – I would be interested in working on this.