
[Bug]: Missing "type":"function" in OpenAI-Compatible Streaming Tool Calls with specific tool_choice #16340

@iGmainC

Description


Your current environment

The output of `python collect_env.py`
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.12.4 | packaged by Anaconda, Inc. | (main, Jun 18 2024, 15:12:24) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-100-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4090
Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   43 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          128
On-line CPU(s) list:             0-127
Vendor ID:                       AuthenticAMD
Model name:                      AMD EPYC 7542 32-Core Processor
CPU family:                      23
Model:                           49
Thread(s) per core:              2
Core(s) per socket:              32
Socket(s):                       2
Stepping:                        0
Frequency boost:                 enabled
CPU max MHz:                     2900.0000
CPU min MHz:                     1500.0000
BogoMIPS:                        5800.18
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
Virtualization:                  AMD-V
L1d cache:                       2 MiB (64 instances)
L1i cache:                       2 MiB (64 instances)
L2 cache:                        32 MiB (64 instances)
L3 cache:                        256 MiB (16 instances)
NUMA node(s):                    8
NUMA node0 CPU(s):               0-7,64-71
NUMA node1 CPU(s):               8-15,72-79
NUMA node2 CPU(s):               16-23,80-87
NUMA node3 CPU(s):               24-31,88-95
NUMA node4 CPU(s):               32-39,96-103
NUMA node5 CPU(s):               40-47,104-111
NUMA node6 CPU(s):               48-55,112-119
NUMA node7 CPU(s):               56-63,120-127
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==2.1.3
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pynvml==11.5.3
[pip3] pyzmq==26.2.0
[pip3] torch==2.6.0
[pip3] torchaudio==2.6.0
[pip3] torchvision==0.21.0
[pip3] transformers==4.51.1
[pip3] triton==3.2.0
[conda] numpy                     2.1.3                    pypi_0    pypi
[conda] nvidia-cublas-cu12        12.4.5.8                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.4.127                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.2.1.3                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.5.147               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.6.1.9                 pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.3.1.170               pypi_0    pypi
[conda] nvidia-cusparselt-cu12    0.6.2                    pypi_0    pypi
[conda] nvidia-nccl-cu12          2.21.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.4.127                 pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.4.127                 pypi_0    pypi
[conda] pynvml                    11.5.3                   pypi_0    pypi
[conda] pyzmq                     26.2.0                   pypi_0    pypi
[conda] torch                     2.6.0                    pypi_0    pypi
[conda] torchaudio                2.6.0                    pypi_0    pypi
[conda] torchvision               0.21.0                   pypi_0    pypi
[conda] transformers              4.51.1                   pypi_0    pypi
[conda] triton                    3.2.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.3
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
	GPU0	NIC0	NIC1	NIC2	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	PHB	PHB	SYS	40-47,104-111	5		N/A
NIC0	PHB	 X 	PIX	SYS
NIC1	PHB	PIX	 X 	SYS
NIC2	SYS	SYS	SYS	 X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_2
  NIC1: mlx5_3
  NIC2: mlx5_bond_0

CUDA_HOME=/usr/local/cuda
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

🐛 Describe the bug

Issue Overview

When using streaming responses with a specific tool choice ("tool_choice": {"type": "function", "function": {"name": "xxx"}}), vLLM's streaming output does not comply with the OpenAI API specification: the first tool-call chunk is missing the required "type":"function" field.

Steps to Reproduce

  1. Start the vLLM server with the following command:

    vllm serve Qwen/Qwen2.5-7B-Instruct-AWQ \
      --quantization awq \
      --served-model-name qwen2.5-7b-instruct \
      --api-key xxx \
      --enable-auto-tool-choice \
      --tool-call-parser hermes \
      --gpu-memory-utilization 0.9 \
      --max-model-len 8192 \
      --max-num-seqs 512
  2. Make a request with specific tool_choice:

    curl --location --request POST 'http://127.0.0.1:8000/v1/chat/completions' \
      --header 'Authorization: Bearer xxx' \
      --header 'Content-Type: application/json' \
      --data-raw '{
        "model": "qwen2.5-7b-instruct",
        "messages": [{"role": "user", "content": "What is the weather like in Boston today?"}],
        "tools": [{
          "type": "function",
          "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
              "type": "object",
              "properties": {
                "location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
              },
              "required": ["location"]
            }
          }
        }],
        "tool_choice": {"type": "function", "function": {"name": "get_current_weather"}},
        "stream": true
      }'
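For completeness, the same request can be reproduced with the openai Python client instead of curl. This is a minimal sketch; the base URL, API key, and model name are taken from the serve command above, and the loop just prints what the first tool-call delta contains:

    from openai import OpenAI

    client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="xxx")

    stream = client.chat.completions.create(
        model="qwen2.5-7b-instruct",
        messages=[{"role": "user", "content": "What is the weather like in Boston today?"}],
        tools=[{
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "description": "Get the current weather in a given location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"},
                        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                    },
                    "required": ["location"],
                },
            },
        }],
        tool_choice={"type": "function", "function": {"name": "get_current_weather"}},
        stream=True,
    )

    for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.tool_calls:
            # With a named tool_choice, `type` comes back as None here instead of "function".
            print(delta.tool_calls[0].type, delta.tool_calls[0].function)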

Current Output (with specific tool_choice)

data: {"id":"chatcmpl-e529e3b7d9bf4eddb2ae87a687b02c65","object":"chat.completion.chunk","created":1744189839,"model":"qwen2.5-7b-instruct","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-e529e3b7d9bf4eddb2ae87a687b02c65","object":"chat.completion.chunk","created":1744189839,"model":"qwen2.5-7b-instruct","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"name":"get_current_weather","arguments":"{\n"}}]},"logprobs":null,"finish_reason":null}]}

Expected Output (OpenAI standard)

data: {"id":"chatcmpl-BKLwktzPcPGbK0ha9zurbznPoRDYD","object":"chat.completion.chunk","created":1744190622,"model":"gpt-4o-mini-2024-07-18","choices":[{"index":0,"delta":{"role":"assistant","content":null,"tool_calls":[{"index":0,"id":"call_qj5ZjQ2f60WZVrKUkG1LgWxK","type":"function","function":{"name":"get_current_weather","arguments":""}}],"logprobs":null,"finish_reason":null}]},"usage":null}

Additional Information

Interestingly, when using "tool_choice": "auto" instead of specifying a function, the response correctly includes "type":"function" in the tool calls:

data: {"id":"chatcmpl-787541e939a84cda87217a19f891a45b","object":"chat.completion.chunk","created":1744199068,"model":"qwen2.5-7b-instruct","choices":[{"index":0,"delta":{"tool_calls":[{"id":"chatcmpl-tool-a0027b6dbc0f43e080c3b15a804e6a14","type":"function","index":0,"function":{"name":"get_current_weather"}}]},"logprobs":null,"finish_reason":null}]}

This inconsistency causes compatibility issues with clients expecting OpenAI-compliant responses.

Impact

This formatting issue breaks client applications that perform strict validation on the OpenAI API response structure:

  1. My application uses Vercel's AI SDK, which strictly validates responses from AI servers; vLLM's current output fails this validation.
  2. The Vercel AI SDK explicitly expects this structure, as shown in its source at https://github.com/vercel/ai/blob/main/packages/openai/src/openai-chat-language-model.ts#L655
  3. Any other library or application that strictly validates OpenAI API responses will hit the same failure when using vLLM with a specific tool_choice.

This issue makes vLLM's OpenAI compatibility layer unusable with tools such as Vercel's AI SDK whenever a specific tool_choice is set.
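To make the failure concrete, here is an illustrative strict check of the kind these clients perform (this is not Vercel's actual validation code, only a sketch of the shape it enforces), run against the first tool-call chunk vLLM currently emits:

    import json

    # First tool-call chunk as emitted by vLLM with a named tool_choice (see "Current Output" above).
    chunk = json.loads(
        '{"choices":[{"index":0,"delta":{"tool_calls":'
        '[{"index":0,"function":{"name":"get_current_weather","arguments":"{\\n"}}]},'
        '"logprobs":null,"finish_reason":null}]}'
    )

    tool_call = chunk["choices"][0]["delta"]["tool_calls"][0]
    # OpenAI-strict clients require `type` to be present and equal to "function";
    # this assertion fails against the vLLM output because the field is missing.
    assert tool_call.get("type") == "function", f"missing/unexpected type: {tool_call.get('type')!r}"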

Proposed Solution

Since vLLM positions itself as an OpenAI-compatible server, I believe the appropriate fix belongs in vLLM's OpenAI compatibility layer itself: the whole point of an API-compatible server is to provide 100% format and functionality compatibility with the original API.

This discrepancy forces developers to add conditional logic to detect whether they are talking to vLLM or the official OpenAI API, or to patch third-party libraries to accommodate the difference. vLLM should ensure its output format strictly adheres to the OpenAI API specification in all cases, so that it remains a drop-in replacement and stays compatible with the ecosystem of tools built around the OpenAI API format.
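
As a rough illustration of what I mean (only a sketch, not a tested patch: the class and field names below are based on my reading of vllm/entrypoints/openai/protocol.py in 0.8.3 and may not match the current code exactly), the first streamed chunk for a named tool_choice should carry an id and "type":"function", just as the auto path already does:

    import uuid

    from vllm.entrypoints.openai.protocol import (DeltaFunctionCall,
                                                  DeltaMessage, DeltaToolCall)

    def first_named_tool_call_delta(function_name: str) -> DeltaMessage:
        # Hypothetical helper: build the first tool-call delta so that it
        # includes the id and type fields required by the OpenAI schema.
        return DeltaMessage(tool_calls=[
            DeltaToolCall(
                index=0,
                id=f"chatcmpl-tool-{uuid.uuid4().hex}",  # id, as in the auto path
                type="function",                         # the field currently missing
                function=DeltaFunctionCall(name=function_name, arguments=""),
            )
        ])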

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
