Is there an existing issue for the same feature request?
- I have checked the existing issues.
Is your feature request related to a problem?
It seems that too many concurrent LLM chat requests are sent while generating the knowledge graph, which causes trouble for LLM backends.
- For a remote API, this can result in failed requests due to the concurrency/rate limits set by the API provider. Example:
Exception: **ERROR**: Error code: 500 - {'error': {'message': 'concurrency exceeded', 'type': 'runtime_error', 'param': None, 'code': '20034'}}
ragflow-server | ERROR:root:error extracting graph
ragflow-server | Traceback (most recent call last):
ragflow-server | File "/ragflow/graphrag/light/graph_extractor.py", line 95, in _process_single_content
ragflow-server | final_result = self._chat(hint_prompt, [{"role": "user", "content": "Output:"}], gen_conf)
ragflow-server | File "/ragflow/graphrag/general/extractor.py", line 65, in _chat
ragflow-server | raise Exception(response)
ragflow-server | Exception: **ERROR**: Error code: 500 - {'error': {'message': 'concurrency exceeded', 'type': 'runtime_error', 'param': None, 'code': '20034'}}
- For a locally deployed LLM (especially one with relatively limited resources), this may cause the backend to offload part of the model to the CPU due to a resource bottleneck, which unnecessarily hurts performance.
Example:
Hardware: Tesla P40.
Generating graph:
NAME ID SIZE PROCESSOR UNTIL
qwq-32b-rag:latest b175c9dc4138 32 GB 23%/77% CPU/GPU Forever
Chatting with a single user:
NAME ID SIZE PROCESSOR UNTIL
qwq-32b-rag:latest b175c9dc4138 32 GB 100% GPU Forever
Eyeballing it, this amounts to roughly a 40–50% performance hit.
Describe the feature you'd like
Concurrency control for LLM requests during knowledge graph generation: for example, a configurable limit on how many LLM requests are sent at the same time, or pausing new requests once the number of unfinished requests reaches a certain limit. A rough sketch of one possible approach is included below.
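As a rough illustration only (not RAGFlow's actual code), a global semaphore around the chat call would give both behaviours at once: new requests simply block as soon as the configured number is already in flight. All names here (`MAX_CONCURRENT_LLM_REQUESTS`, `limited_chat`, `extract_all`, `build_prompt`, `chat_fn`) are hypothetical placeholders:

```python
# Hypothetical sketch, not RAGFlow's implementation: cap in-flight LLM requests
# with a semaphore so knowledge-graph extraction can't flood the backend.
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_LLM_REQUESTS = 4              # placeholder; ideally user-configurable
_llm_slots = threading.Semaphore(MAX_CONCURRENT_LLM_REQUESTS)

def limited_chat(chat_fn, system_prompt, history, gen_conf):
    """Forward the call to the LLM, but block while the configured number of
    requests is already outstanding, so nothing new starts above the cap."""
    with _llm_slots:
        return chat_fn(system_prompt, history, gen_conf)

def extract_all(chunks, chat_fn, gen_conf, build_prompt):
    """Extraction can still fan out across chunks with a thread pool; the
    semaphore alone decides how many chat requests the backend sees at once."""
    with ThreadPoolExecutor(max_workers=16) as pool:
        futures = [
            pool.submit(limited_chat, chat_fn, build_prompt(chunk),
                        [{"role": "user", "content": "Output:"}], gen_conf)
            for chunk in chunks
        ]
        return [f.result() for f in futures]
```

Exposing the limit as a knowledge-base or environment setting (rather than a hard-coded constant) would let remote-API users match their provider's concurrency quota and let local users keep the model fully on the GPU.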
Describe implementation you've considered
No response
Documentation, adoption, use case
Additional information
Issue #5257 might be related.