Is there an existing issue for the same feature request?
- I have checked the existing issues.
Is your feature request related to a problem?
It seems that too many concurrent LLM chat requests are sent while generating the knowledge graph, which causes trouble for LLM backends.
- For a remote API, this can result in failed requests due to the concurrency/rate limits set by the API provider. Example:
Exception: **ERROR**: Error code: 500 - {'error': {'message': 'concurrency exceeded', 'type': 'runtime_error', 'param': None, 'code': '20034'}}
ragflow-server | ERROR:root:error extracting graph
ragflow-server | Traceback (most recent call last):
ragflow-server | File "/ragflow/graphrag/light/graph_extractor.py", line 95, in _process_single_content
ragflow-server | final_result = self._chat(hint_prompt, [{"role": "user", "content": "Output:"}], gen_conf)
ragflow-server | File "/ragflow/graphrag/general/extractor.py", line 65, in _chat
ragflow-server | raise Exception(response)
ragflow-server | Exception: **ERROR**: Error code: 500 - {'error': {'message': 'concurrency exceeded', 'type': 'runtime_error', 'param': None, 'code': '20034'}}
- For a locally deployed LLM (especially one with relatively limited resources), this may cause the backend to offload part of the model to the CPU due to a resource bottleneck, which unnecessarily hurts performance.
Example:
Hardware: Tesla P40.
Generating graph:
NAME ID SIZE PROCESSOR UNTIL
qwq-32b-rag:latest b175c9dc4138 32 GB 23%/77% CPU/GPU Forever
Chatting with a single user:
NAME ID SIZE PROCESSOR UNTIL
qwq-32b-rag:latest b175c9dc4138 32 GB 100% GPU Forever
Eyeballing it, this amounts to roughly a 40–50% performance hit.
Describe the feature you'd like
Concurrency control for LLM requests during knowledge graph generation: for example, a configurable limit on how many LLM requests are sent at the same time, or pausing new requests once the number of unfinished requests reaches a certain limit. A rough sketch of one possible approach is included below.
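As a rough illustration only (not RAGFlow's actual code), a global semaphore around the chat call would give both behaviours at once: new requests simply block as soon as the configured number is already in flight. All names here (`MAX_CONCURRENT_LLM_REQUESTS`, `limited_chat`, `extract_all`, `build_prompt`, `chat_fn`) are hypothetical placeholders:

```python
# Hypothetical sketch, not RAGFlow's implementation: cap in-flight LLM requests
# with a semaphore so knowledge-graph extraction can't flood the backend.
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_LLM_REQUESTS = 4              # placeholder; ideally user-configurable
_llm_slots = threading.Semaphore(MAX_CONCURRENT_LLM_REQUESTS)

def limited_chat(chat_fn, system_prompt, history, gen_conf):
    """Forward the call to the LLM, but block while the configured number of
    requests is already outstanding, so nothing new starts above the cap."""
    with _llm_slots:
        return chat_fn(system_prompt, history, gen_conf)

def extract_all(chunks, chat_fn, gen_conf, build_prompt):
    """Extraction can still fan out across chunks with a thread pool; the
    semaphore alone decides how many chat requests the backend sees at once."""
    with ThreadPoolExecutor(max_workers=16) as pool:
        futures = [
            pool.submit(limited_chat, chat_fn, build_prompt(chunk),
                        [{"role": "user", "content": "Output:"}], gen_conf)
            for chunk in chunks
        ]
        return [f.result() for f in futures]
```

Exposing the limit as a knowledge-base or environment setting (rather than a hard-coded constant) would let remote-API users match their provider's concurrency quota and let local users keep the model fully on the GPU.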
Describe implementation you've considered
No response
Documentation, adoption, use case
Additional information
Issue #5257 might be related.