
[Feature Request]: Concurrency control for LLM calls while generating knowledge graph #5917

Open
@Randname666

Description

Is there an existing issue for the same feature request?

  • I have checked the existing issues.

Is your feature request related to a problem?

It seems that too many concurrent LLM chat requests are sent while generating the knowledge graph, which causes trouble for LLM backends.

- For a remote API, this can result in failed requests due to concurrency/rate limits set by the API provider. Example:

 Exception: **ERROR**: Error code: 500 - {'error': {'message': 'concurrency exceeded', 'type': 'runtime_error', 'param': None, 'code': '20034'}}
ragflow-server  | ERROR:root:error extracting graph
ragflow-server  | Traceback (most recent call last):
ragflow-server  |   File "/ragflow/graphrag/light/graph_extractor.py", line 95, in _process_single_content
ragflow-server  |     final_result = self._chat(hint_prompt, [{"role": "user", "content": "Output:"}], gen_conf)
ragflow-server  |   File "/ragflow/graphrag/general/extractor.py", line 65, in _chat
ragflow-server  |     raise Exception(response)
ragflow-server  | Exception: **ERROR**: Error code: 500 - {'error': {'message': 'concurrency exceeded', 'type': 'runtime_error', 'param': None, 'code': '20034'}}
ragflow-server  | ERROR:root:error extracting graph
ragflow-server  | Traceback (most recent call last):
ragflow-server  |   File "/ragflow/graphrag/light/graph_extractor.py", line 95, in _process_single_content
ragflow-server  |     final_result = self._chat(hint_prompt, [{"role": "user", "content": "Output:"}], gen_conf)
ragflow-server  |   File "/ragflow/graphrag/general/extractor.py", line 65, in _chat
ragflow-server  |     raise Exception(response)
ragflow-server  | Exception: **ERROR**: Error code: 500 - {'error': {'message': 'concurrency exceeded', 'type': 'runtime_error', 'param': None, 'code': '20034'}}


- For a locally deployed LLM (especially one with relatively limited resources), this may cause the backend to offload part of the model to the CPU due to a resource bottleneck, which unnecessarily hurts performance. Example:

Hardware: Tesla P40.
Generating graph:
NAME                  ID              SIZE     PROCESSOR          UNTIL
qwq-32b-rag:latest    b175c9dc4138    32 GB    23%/77% CPU/GPU    Forever

Chatting with a single user:
NAME                  ID              SIZE     PROCESSOR          UNTIL
qwq-32b-rag:latest    b175c9dc4138    32 GB    100% GPU           Forever

Judging by the generation speed, this amounts to roughly a 40~50% performance hit.

Describe the feature you'd like

Concurrency control for LLM requests during knowledge graph generation: for example, a configurable limit on how many LLM requests are sent at the same time, or stop issuing new requests once the number of unfinished requests reaches a certain limit. A rough sketch of the idea is shown below.
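
For illustration only, here is a minimal sketch of the kind of throttle meant above, assuming an asyncio-based extraction pipeline; call_llm, limited_call, extract_graph and MAX_CONCURRENT_LLM_CALLS are hypothetical names, not RAGflow APIs. A semaphore caps how many LLM requests are in flight at once:

import asyncio

# Hypothetical setting; in practice it would come from user configuration.
MAX_CONCURRENT_LLM_CALLS = 4

llm_semaphore = asyncio.Semaphore(MAX_CONCURRENT_LLM_CALLS)

async def call_llm(prompt: str) -> str:
    # Stand-in for the real chat request to the LLM backend.
    await asyncio.sleep(1)
    return f"result for: {prompt}"

async def limited_call(prompt: str) -> str:
    # At most MAX_CONCURRENT_LLM_CALLS coroutines get past this point at once;
    # the rest wait here instead of piling requests onto the backend.
    async with llm_semaphore:
        return await call_llm(prompt)

async def extract_graph(chunks: list[str]) -> list[str]:
    # One task per chunk, but the semaphore decides how many requests are
    # actually outstanding at any moment.
    return await asyncio.gather(*(limited_call(c) for c in chunks))

if __name__ == "__main__":
    print(asyncio.run(extract_graph([f"chunk {i}" for i in range(10)])))

Exposing the limit as a setting would let remote-API users match their provider's concurrency quota and local users keep the model fully on the GPU.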

Describe implementation you've considered

No response

Documentation, adoption, use case

Additional information

Issue #5257 might be related.
