Description
Is there an existing issue for the same bug?
- I have checked the existing issues.
RAGFlow workspace code commit ID
v0.17.0
RAGFlow image version
v0.17.0
Other environment information
Ubuntu 24.04 Server LTS
2x NVIDIA A16
Actual behavior
I created a knowledge base from my dataset.
When I serve /models/DeepSeek-R1-Distill-Qwen-32B-AWQ with vLLM, chat works normally.
Expected behavior
I expected chat to work the same way when I serve /models/DeepSeek-R1-Distill-Qwen-32B-GGUF/DeepSeek-R1-Distill-Qwen-32B-Q5_K_M.gguf with the same vLLM setup.
Instead, chat does not work properly: the LLM is extremely slow and does not return normal responses.
I checked the vLLM logs and they look normal:
INFO 03-10 19:39:03 logger.py:39] Received request chatcmpl-57ec2b77c3a34770988b9d0ef83db2e6: prompt: '你是一个智能助手,请总结知识库的内容来回答问题,请列举知识库中的数据详细回答。当所有知识库内容都与问题无关时,你的回答必须包括“知识库中未找到您要的答案!”这句话。回答需要考虑聊天历史。\n 以下是知识库:\n \n------\nDocument: 计算机工段控制系统操作规程.docx \nRelevant fragments as following:\n1. 5.2 操作必须由取得相应资格证的人员持证上岗操作。操作必须同时至少两人进行,一人监护,一人 操作。特别重要或复杂组态修改操作由班长或工段负责人监护。5.3 操作必须在监护人的监护下进行,监护人不得擅离职守、做与监护工作无关的事。6 异常情况上报\n2. 5 环境要求5.1 按操作项目规范要求穿戴好劳动保护用品。5.2 操作必须由取得相应资格证的人员持证上岗操作。操作必须同时至少两人进行,一人监护,一人 操作。特别重要或复杂操作由班长或工段负责人监护。5.3 操作必须在监护人的监护下进行,监护人不得擅离职守、做与监护工作无关的事。\n\n\n------\n\nDocument: 电仪运行部设备检修规程.docx \nRelevant fragments as following:\n1. 3)根据大修项目及工作内容,制定出材料、备品、备件计划。4)准备好工序卡、网络图、质检计划、安全技术措施等。5)对绝缘材料、备品、备件做必要的试验,鉴定其质量好坏,能否使用。6)能够事先加工的部件,要画图加工制作。工具准备:现场使用的工具要有 数,并列出 工具清单,对于电动工具要有绝缘合格证,并按正确方法使用。\n\n\n------\n\nDocument: store.txt \nRelevant fragments as following:\n1. (4)作业面堆积大量饭盒、烟头和矿泉水瓶61.请简述开工前针对设备的工艺处置措施及具体要求(1)隔断—氢气、物料输送管道和反应釜应做分段隔离,优先采用盲板隔断(2)吹扫—惰性气体吹扫,吹扫中应严格控制吹扫介质压力和进气量\n2. (4)作业面堆积大量饭盒、烟头和矿泉水瓶61.请简述开工前针对设备的工艺处置精 措施及具体要求(1)隔断—氢气、物料输送管道和反应釜应做分段隔离,优先采用盲板隔断(2)吹扫—惰性气体吹扫,吹扫中应严格控制吹扫介质压力和进气量\n3. 特点:一级释放源做5 0 密闭化处理,例如取样点做密闭\n\n### Query:\n你可以做什么<|User|>你可以做什么<|Assistant|>', params: SamplingParams(n=1, presence_penalty=0.4, frequency_penalty=0.7, repetition_penalty=1.0, temperature=0.1, top_p=0.3, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=31, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO: 172.20.0.1:51596 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 03-10 19:39:03 engine.py:280] Added request chatcmpl-57ec2b77c3a34770988b9d0ef83db2e6.
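For reference, the vLLM OpenAI-compatible endpoint can also be queried directly, bypassing RAGFlow, to check whether the slow/abnormal responses come from vLLM itself. This is only a minimal sketch, assuming the default 8000:8000 port mapping and the VLLM_API_KEY from the compose file below; the model name is the GGUF file path as served.

    # Minimal sketch: send one chat completion straight to the vLLM container
    # and time it, to separate vLLM behavior from RAGFlow behavior.
    # Assumptions: localhost:8000 is reachable and the API key below is the
    # VLLM_API_KEY set in the compose file.
    import time
    import requests

    BASE_URL = "http://localhost:8000/v1"
    API_KEY = "550e8400-e29b-41d4-a716-446655440000"
    MODEL = "/models/DeepSeek-R1-Distill-Qwen-32B-GGUF/DeepSeek-R1-Distill-Qwen-32B-Q5_K_M.gguf"

    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": "你可以做什么"}],
        "max_tokens": 256,
        "temperature": 0.1,
    }

    start = time.time()
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=payload,
        timeout=600,
    )
    resp.raise_for_status()
    elapsed = time.time() - start

    data = resp.json()
    print(f"elapsed: {elapsed:.1f}s")
    print("completion tokens:", data["usage"]["completion_tokens"])
    print(data["choices"][0]["message"]["content"])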
Steps to reproduce
1. Add the GGUF model served by vLLM in RAGFlow.
2. Create a chat assistant.
3. Start a chat.
Additional information
My vLLM docker-compose.yml:
services:
  vllm-deepseek-gguf:
    image: vllm/vllm-openai:latest
    container_name: vllm-deepseek-gguf
    # restart: no
    shm_size: '64g'
    # use the host's IPC mode to improve performance
    ipc: host
    ports:
      - 8000:8000
    volumes:
      - ./cache:/workspace/.cache
      - /data/projects/LLM_Models/Original_Models:/models
    entrypoint: python3
    #command: -m vllm.entrypoints.openai.api_server --port=5000 --host=0.0.0.0 ${H2OGPT_VLLM_ARGS}
    command:
      - "-m"
      - "vllm.entrypoints.openai.api_server"
      - "--model"
      - "/models/DeepSeek-R1-Distill-Qwen-32B-GGUF/DeepSeek-R1-Distill-Qwen-32B-Q5_K_M.gguf"
      - "--host"
      - "0.0.0.0"
      - "--port"
      - "8000"
      - "--tensor-parallel-size"
      - "2"
      - "--gpu-memory-utilization"
      - "0.95"
      - "--max-model-len"
      - "4096"
      - "--enforce-eager"
      - "--distributed-executor-backend"
      - "ray"
      - "--trust-remote-code"
      - "--quantization" # used when serving a quantized model
      - "gguf"
    # env_file:
    #   - .env
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
      - PYTHONPATH=/workspace
      # - PYTORCH_MULTIPROCESSING_START_METHOD=spawn  # force the spawn start method
      - PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
      - NCCL_P2P_DISABLE=1 # disable if the GPUs do not support P2P (e.g. no NVLink)
      - NCCL_IB_DISABLE=1 # disable if there is no InfiniBand
      - NCCL_DEBUG=INFO # print NCCL debug output
      - VLLM_API_KEY=550e8400-e29b-41d4-a716-446655440000
    healthcheck:
      # without API-key authentication:
      # test: [ "CMD", "curl", "-f", "http://0.0.0.0:8001/v1/models" ]
      test: [ "CMD", "curl", "-f", "-H", "Authorization: Bearer 550e8400-e29b-41d4-a716-446655440000", "http://0.0.0.0:8000/v1/models" ]
      interval: 30s
      timeout: 5s
      retries: 20
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['6', '7']
              capabilities: [gpu]
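As a quick check outside RAGFlow, the container can be queried the same way the healthcheck does. This minimal sketch (assuming the same endpoint and API key as above) lists the served models; the model name configured in RAGFlow has to match the id that vLLM registers for the GGUF file.

    # Minimal sketch: list the models served by the container above,
    # mirroring the compose healthcheck. Assumption: localhost:8000 and the
    # VLLM_API_KEY from the compose file.
    import requests

    resp = requests.get(
        "http://localhost:8000/v1/models",
        headers={"Authorization": "Bearer 550e8400-e29b-41d4-a716-446655440000"},
        timeout=10,
    )
    resp.raise_for_status()
    for model in resp.json()["data"]:
        print(model["id"])  # the model name sent by RAGFlow must match this id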