Description
Is there an existing issue for the same bug?
- I have checked the existing issues.
RAGFlow workspace code commit ID
v0.17.0
RAGFlow image version
v0.17.0
Other environment information
Ubuntu 24.04 Server LTS
2x NVIDIA A16
Actual behavior
I created a knowledge base from my dataset.
When I serve /models/DeepSeek-R1-Distill-Qwen-32B-AWQ with vLLM, chat works normally.
Expected behavior
I expected chat to work the same way when I serve /models/DeepSeek-R1-Distill-Qwen-32B-GGUF/DeepSeek-R1-Distill-Qwen-32B-Q5_K_M.gguf with the same vLLM setup.
Instead, chat does not work properly: the LLM is extremely slow and does not return normal responses.
I checked the vLLM logs and they look normal:
INFO 03-10 19:39:03 logger.py:39] Received request chatcmpl-57ec2b77c3a34770988b9d0ef83db2e6: prompt: '你是一个智能助手,请总结知识库的内容来回答问题,请列举知识库中的数据详细回答。当所有知识库内容都与问题无关时,你的回答必须包括“知识库中未找到您要的答案!”这句话。回答需要考虑聊天历史。\n 以下是知识库:\n \n------\nDocument: 计算机工段控制系统操作规程.docx \nRelevant fragments as following:\n1. 5.2 操作必须由取得相应资格证的人员持证上岗操作。操作必须同时至少两人进行,一人监护,一人 操作。特别重要或复杂组态修改操作由班长或工段负责人监护。5.3 操作必须在监护人的监护下进行,监护人不得擅离职守、做与监护工作无关的事。6 异常情况上报\n2. 5 环境要求5.1 按操作项目规范要求穿戴好劳动保护用品。5.2 操作必须由取得相应资格证的人员持证上岗操作。操作必须同时至少两人进行,一人监护,一人 操作。特别重要或复杂操作由班长或工段负责人监护。5.3 操作必须在监护人的监护下进行,监护人不得擅离职守、做与监护工作无关的事。\n\n\n------\n\nDocument: 电仪运行部设备检修规程.docx \nRelevant fragments as following:\n1. 3)根据大修项目及工作内容,制定出材料、备品、备件计划。4)准备好工序卡、网络图、质检计划、安全技术措施等。5)对绝缘材料、备品、备件做必要的试验,鉴定其质量好坏,能否使用。6)能够事先加工的部件,要画图加工制作。工具准备:现场使用的工具要有 数,并列出 工具清单,对于电动工具要有绝缘合格证,并按正确方法使用。\n\n\n------\n\nDocument: store.txt \nRelevant fragments as following:\n1. (4)作业面堆积大量饭盒、烟头和矿泉水瓶61.请简述开工前针对设备的工艺处置措施及具体要求(1)隔断—氢气、物料输送管道和反应釜应做分段隔离,优先采用盲板隔断(2)吹扫—惰性气体吹扫,吹扫中应严格控制吹扫介质压力和进气量\n2. (4)作业面堆积大量饭盒、烟头和矿泉水瓶61.请简述开工前针对设备的工艺处置精 措施及具体要求(1)隔断—氢气、物料输送管道和反应釜应做分段隔离,优先采用盲板隔断(2)吹扫—惰性气体吹扫,吹扫中应严格控制吹扫介质压力和进气量\n3. 特点:一级释放源做5 0 密闭化处理,例如取样点做密闭\n\n### Query:\n你可以做什么<|User|>你可以做什么<|Assistant|>', params: SamplingParams(n=1, presence_penalty=0.4, frequency_penalty=0.7, repetition_penalty=1.0, temperature=0.1, top_p=0.3, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=31, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO: 172.20.0.1:51596 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 03-10 19:39:03 engine.py:280] Added request chatcmpl-57ec2b77c3a34770988b9d0ef83db2e6.
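For reference, the vLLM OpenAI-compatible endpoint can also be queried directly, bypassing RAGFlow, to check whether the slow/abnormal responses come from vLLM itself. This is only a minimal sketch, assuming the default 8000:8000 port mapping and the VLLM_API_KEY from the compose file below; the model name is the GGUF file path as served.

    # Minimal sketch: send one chat completion straight to the vLLM container
    # and time it, to separate vLLM behavior from RAGFlow behavior.
    # Assumptions: localhost:8000 is reachable and the API key below is the
    # VLLM_API_KEY set in the compose file.
    import time
    import requests

    BASE_URL = "http://localhost:8000/v1"
    API_KEY = "550e8400-e29b-41d4-a716-446655440000"
    MODEL = "/models/DeepSeek-R1-Distill-Qwen-32B-GGUF/DeepSeek-R1-Distill-Qwen-32B-Q5_K_M.gguf"

    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": "你可以做什么"}],
        "max_tokens": 256,
        "temperature": 0.1,
    }

    start = time.time()
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=payload,
        timeout=600,
    )
    resp.raise_for_status()
    elapsed = time.time() - start

    data = resp.json()
    print(f"elapsed: {elapsed:.1f}s")
    print("completion tokens:", data["usage"]["completion_tokens"])
    print(data["choices"][0]["message"]["content"])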
Steps to reproduce
1. Add the GGUF model served by vLLM in RAGFlow.
2. Create a chat assistant.
3. Start a chat.
Additional information
My vLLM docker-compose.yml:
services:
  vllm-deepseek-gguf:
    image: vllm/vllm-openai:latest
    container_name: vllm-deepseek-gguf
    # restart: no
    shm_size: '64g'
    # use the host's IPC mode to improve performance
    ipc: host
    ports:
      - 8000:8000
    volumes:
      - ./cache:/workspace/.cache
      - /data/projects/LLM_Models/Original_Models:/models
    entrypoint: python3
    #command: -m vllm.entrypoints.openai.api_server --port=5000 --host=0.0.0.0 ${H2OGPT_VLLM_ARGS}
    command:
      - "-m"
      - "vllm.entrypoints.openai.api_server"
      - "--model"
      - "/models/DeepSeek-R1-Distill-Qwen-32B-GGUF/DeepSeek-R1-Distill-Qwen-32B-Q5_K_M.gguf"
      - "--host"
      - "0.0.0.0"
      - "--port"
      - "8000"
      - "--tensor-parallel-size"
      - "2"
      - "--gpu-memory-utilization"
      - "0.95"
      - "--max-model-len"
      - "4096"
      - "--enforce-eager"
      - "--distributed-executor-backend"
      - "ray"
      - "--trust-remote-code"
      - "--quantization" # used when serving a quantized model
      - "gguf"
    # env_file:
    #   - .env
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
      - PYTHONPATH=/workspace
      # - PYTORCH_MULTIPROCESSING_START_METHOD=spawn  # force the spawn start method
      - PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
      - NCCL_P2P_DISABLE=1 # disable if the GPUs do not support P2P (e.g. no NVLink)
      - NCCL_IB_DISABLE=1 # disable if there is no InfiniBand
      - NCCL_DEBUG=INFO # print NCCL debug output
      - VLLM_API_KEY=550e8400-e29b-41d4-a716-446655440000
    healthcheck:
      # without API-key authentication:
      # test: [ "CMD", "curl", "-f", "http://0.0.0.0:8001/v1/models" ]
      test: [ "CMD", "curl", "-f", "-H", "Authorization: Bearer 550e8400-e29b-41d4-a716-446655440000", "http://0.0.0.0:8000/v1/models" ]
      interval: 30s
      timeout: 5s
      retries: 20
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['6', '7']
              capabilities: [gpu]
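As a quick check outside RAGFlow, the container can be queried the same way the healthcheck does. This minimal sketch (assuming the same endpoint and API key as above) lists the served models; the model name configured in RAGFlow has to match the id that vLLM registers for the GGUF file.

    # Minimal sketch: list the models served by the container above,
    # mirroring the compose healthcheck. Assumption: localhost:8000 and the
    # VLLM_API_KEY from the compose file.
    import requests

    resp = requests.get(
        "http://localhost:8000/v1/models",
        headers={"Authorization": "Bearer 550e8400-e29b-41d4-a716-446655440000"},
        timeout=10,
    )
    resp.raise_for_status()
    for model in resp.json()["data"]:
        print(model["id"])  # the model name sent by RAGFlow must match this id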