[Bug]: DeepSeek-R1-Distill-Qwen-32B-Q5_K_M.gguf can't be used normally by RAGFlow #5887

Open
@phoenixZZZ

Description

Is there an existing issue for the same bug?

  • I have checked the existing issues.

RAGFlow workspace code commit ID

v0.17.0

RAGFlow image version

v0.17.0

Other environment information

Ubuntu 24.04 Server LTS
2× NVIDIA A16

Actual behavior

I created a knowledge base from my dataset.
With /models/DeepSeek-R1-Distill-Qwen-32B-AWQ served in the vLLM environment, I can chat normally.

Expected behavior

I tried to use /models/DeepSeek-R1-Distill-Qwen-32B-GGUF/DeepSeek-R1-Distill-Qwen-32B-Q5_K_M.gguf in the same vLLM environment.
Chat no longer works properly: the LLM is far too slow and does not return normal responses.
I checked the vLLM logs, and they look normal:
INFO 03-10 19:39:03 logger.py:39] Received request chatcmpl-57ec2b77c3a34770988b9d0ef83db2e6: prompt: '你是一个智能助手,请总结知识库的内容来回答问题,请列举知识库中的数据详细回答。当所有知识库内容都与问题无关时,你的回答必须包括“知识库中未找到您要的答案!”这句话。回答需要考虑聊天历史。\n 以下是知识库:\n \n------\nDocument: 计算机工段控制系统操作规程.docx \nRelevant fragments as following:\n1. 5.2 操作必须由取得相应资格证的人员持证上岗操作。操作必须同时至少两人进行,一人监护,一人 操作。特别重要或复杂组态修改操作由班长或工段负责人监护。5.3 操作必须在监护人的监护下进行,监护人不得擅离职守、做与监护工作无关的事。6 异常情况上报\n2. 5 环境要求5.1 按操作项目规范要求穿戴好劳动保护用品。5.2 操作必须由取得相应资格证的人员持证上岗操作。操作必须同时至少两人进行,一人监护,一人 操作。特别重要或复杂操作由班长或工段负责人监护。5.3 操作必须在监护人的监护下进行,监护人不得擅离职守、做与监护工作无关的事。\n\n\n------\n\nDocument: 电仪运行部设备检修规程.docx \nRelevant fragments as following:\n1. 3)根据大修项目及工作内容,制定出材料、备品、备件计划。4)准备好工序卡、网络图、质检计划、安全技术措施等。5)对绝缘材料、备品、备件做必要的试验,鉴定其质量好坏,能否使用。6)能够事先加工的部件,要画图加工制作。工具准备:现场使用的工具要有 数,并列出 工具清单,对于电动工具要有绝缘合格证,并按正确方法使用。\n\n\n------\n\nDocument: store.txt \nRelevant fragments as following:\n1. (4)作业面堆积大量饭盒、烟头和矿泉水瓶61.请简述开工前针对设备的工艺处置措施及具体要求(1)隔断—氢气、物料输送管道和反应釜应做分段隔离,优先采用盲板隔断(2)吹扫—惰性气体吹扫,吹扫中应严格控制吹扫介质压力和进气量\n2. (4)作业面堆积大量饭盒、烟头和矿泉水瓶61.请简述开工前针对设备的工艺处置精 措施及具体要求(1)隔断—氢气、物料输送管道和反应釜应做分段隔离,优先采用盲板隔断(2)吹扫—惰性气体吹扫,吹扫中应严格控制吹扫介质压力和进气量\n3. 特点:一级释放源做5 0 密闭化处理,例如取样点做密闭\n\n### Query:\n你可以做什么<|User|>你可以做什么<|Assistant|>', params: SamplingParams(n=1, presence_penalty=0.4, frequency_penalty=0.7, repetition_penalty=1.0, temperature=0.1, top_p=0.3, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=31, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO: 172.20.0.1:51596 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 03-10 19:39:03 engine.py:280] Added request chatcmpl-57ec2b77c3a34770988b9d0ef83db2e6.
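
For a quick cross-check outside RAGFlow, the same endpoint can be queried directly with curl (a sketch: the API key, port, and model path are taken from the compose file below; max_tokens is an arbitrary test value):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 550e8400-e29b-41d4-a716-446655440000" \
  -d '{
    "model": "/models/DeepSeek-R1-Distill-Qwen-32B-GGUF/DeepSeek-R1-Distill-Qwen-32B-Q5_K_M.gguf",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }'

If this request is also slow or returns abnormal output, the problem is in the vLLM GGUF serving path rather than in RAGFlow.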

Steps to reproduce

1. Create the model
2. Create the assistant
3. Chat

Additional information

My vLLM docker-compose.yml:
services:
  vllm-deepseek-gguf:
    image: vllm/vllm-openai:latest
    container_name: vllm-deepseek-gguf
    # restart: no
    shm_size: '64g'
    # use the host's IPC namespace for better performance
    ipc: host
    ports:
      - 8000:8000
    volumes:
      - ./cache:/workspace/.cache
      - /data/projects/LLM_Models/Original_Models:/models
    entrypoint: python3
      #command: -m vllm.entrypoints.openai.api_server --port=5000 --host=0.0.0.0 ${H2OGPT_VLLM_ARGS}
    command: 
      - "-m"
      - "vllm.entrypoints.openai.api_server" 
      - "--model"
      - "/models/DeepSeek-R1-Distill-Qwen-32B-GGUF/DeepSeek-R1-Distill-Qwen-32B-Q5_K_M.gguf"
      - "--host"
      - "0.0.0.0" 
      - "--port" 
      - "8000" 
      - "--tensor-parallel-size" 
      - "2" 
      - "--gpu-memory-utilization"
      - "0.95"
      - "--max-model-len"
      - "4096"
      - "--enforce-eager" 
      - "--distributed-executor-backend"
      - "ray"
      - "--trust-remote-code"
      - "--quantization" # 如果是量化版本,该参数进行使用
      - "gguf" 
    # env_file:
    #   - .env
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
      - PYTHONPATH=/workspace
      # - PYTORCH_MULTIPROCESSING_START_METHOD=spawn  # force the spawn start method
      - PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
      - NCCL_P2P_DISABLE=1  # disable if the GPUs do not support P2P (e.g. no NVLink)
      - NCCL_IB_DISABLE=1   # disable if there is no InfiniBand
      - NCCL_DEBUG=INFO     # print NCCL debug info
      - VLLM_API_KEY=550e8400-e29b-41d4-a716-446655440000  
    healthcheck:
      # variant without API-key verification
      # test: [ "CMD", "curl", "-f", "http://0.0.0.0:8001/v1/models" ]
      test: [ "CMD", "curl", "-f", "-H", "Authorization: Bearer 550e8400-e29b-41d4-a716-446655440000","http://0.0.0.0:8000/v1/models" ]
      interval: 30s
      timeout: 5s
      retries: 20
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            device_ids: ['6', '7']
            capabilities: [gpu]
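
For reference, the compose entrypoint/command above is equivalent to this single invocation (flags copied verbatim from the compose file; a convenience for testing the GGUF model outside Docker):

python3 -m vllm.entrypoints.openai.api_server \
  --model /models/DeepSeek-R1-Distill-Qwen-32B-GGUF/DeepSeek-R1-Distill-Qwen-32B-Q5_K_M.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 4096 \
  --enforce-eager \
  --distributed-executor-backend ray \
  --trust-remote-code \
  --quantization gguf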
