Your current environment
The output of python collect_env.py
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 4.0.0
Libc version: glibc-2.35
Python version: 3.12.10 (main, Apr 9 2025, 08:55:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-116-ycgpu-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3
Nvidia driver version: 535.161.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 180
On-line CPU(s) list: 0-179
Vendor ID: AuthenticAMD
Model name: AMD EPYC-Milan Processor
CPU family: 25
Model: 1
Thread(s) per core: 2
Core(s) per socket: 45
Socket(s): 2
Stepping: 1
BogoMIPS: 7199.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat umip pku ospke vaes vpclmulqdq rdpid fsrm
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 2.8 MiB (90 instances)
L1i cache: 2.8 MiB (90 instances)
L2 cache: 45 MiB (90 instances)
L3 cache: 64 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-89
NUMA node1 CPU(s): 90-179
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] flashinfer-python==0.2.1.post2+cu124torch2.6
[pip3] numpy==2.2.5
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.4.0
[pip3] torch==2.6.0
[pip3] torchaudio==2.6.0
[pip3] torchvision==0.21.0
[pip3] transformers==4.51.3
[pip3] triton==3.2.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.5.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 0-89 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 0-89 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 0-89 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 0-89 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 90-179 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 90-179 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 90-179 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X 90-179 1 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_REQUIRE_CUDA=cuda>=12.4 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536
NCCL_VERSION=2.20.5-1
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_PRODUCT_NAME=CUDA
VLLM_USAGE_SOURCE=production-docker-image
CUDA_VERSION=12.4.0
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
🐛 Describe the bug
Bug Report: TypeError When Processing Canceled Requests with Tracing Enabled
Environment:
- vLLM version: 0.8.5.post1
- Docker image: vllm/vllm-openai:latest
- Hardware: 8x NVIDIA H100 GPUs
Description:
When running vLLM with tracing enabled through the --otlp-traces-endpoint parameter, the engine crashes with a TypeError when processing a canceled request. The error occurs because the code attempts to calculate end-to-end processing time by subtracting metrics.arrival_time (a float) from metrics.finished_time (which is None for canceled requests).
Steps to Reproduce:
- Start vLLM server with tracing enabled:
vllm serve /mnt/models/Qwen3-235B-A22B-FP8 --enable-reasoning --reasoning-parser deepseek_r1 --enable-expert-parallel --tensor-parallel-size 8 --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072 --enable-prefix-caching --enable-chunked-prefill --otlp-traces-endpoint "collector.tracing.cloud.yandex.net:4317" --kv-cache-dtype fp8 --trust-remote-code --enable-server-load-tracking --host ::
- Submit a request to the vLLM engine
- Cancel the request before it completes (e.g., by dropping the client connection, as in the sketch after this list)
- The server crashes with the TypeError
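One way to reproduce the cancellation in step 3 (an illustrative sketch, not the reporter's exact client; the host, port, and model name are assumptions based on the serve command above) is to open a streaming completion request and drop the connection after the first chunk:

# Hypothetical reproduction client. Assumes the server from step 1 is
# reachable at http://localhost:8000 and serves the OpenAI-compatible API.
import requests

payload = {
    "model": "/mnt/models/Qwen3-235B-A22B-FP8",
    "prompt": "Write a very long story about tracing in vLLM.",
    "max_tokens": 4096,
    "stream": True,
}

with requests.post("http://localhost:8000/v1/completions",
                   json=payload, stream=True, timeout=300) as resp:
    # Read only the first streamed chunk, then leave the `with` block;
    # closing the connection aborts the in-flight request on the server,
    # so (per this report) metrics.finished_time stays None for it.
    next(resp.iter_lines())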
Expected Behavior:
The server should gracefully handle canceled requests, perhaps by skipping the e2e_time calculation or providing a default value when finished_time is None.
Actual Behavior:
The server crashes with:
TypeError: unsupported operand type(s) for -: 'NoneType' and 'float'
Traceback/Debugging Information:
ERROR 05-19 07:08:29 [engine.py:160] Traceback (most recent call last):
...
ERROR 05-19 07:08:29 [engine.py:160] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1967, in create_trace_span
ERROR 05-19 07:08:29 [engine.py:160] e2e_time = metrics.finished_time - metrics.arrival_time
ERROR 05-19 07:08:29 [engine.py:160] ~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
ERROR 05-19 07:08:29 [engine.py:160] TypeError: unsupported operand type(s) for -: 'NoneType' and 'float'
Additional Context/Observations:
The issue is located in llm_engine.py in the create_trace_span method. The relevant code is:
with self.tracer.start_as_current_span(
        "llm_request",
        kind=SpanKind.SERVER,
        context=trace_context,
        start_time=arrival_time_nano_seconds) as seq_span:
    metrics = seq_group.metrics
    ttft = metrics.first_token_time - metrics.arrival_time
    e2e_time = metrics.finished_time - metrics.arrival_time  # <-- Error happens here
    # ... more code ...

The problem occurs specifically when a request is canceled, which causes metrics.finished_time to be None while metrics.arrival_time is a valid float value.
A simple fix would be to add a null check before performing the calculation:
if metrics.finished_time is not None:
    e2e_time = metrics.finished_time - metrics.arrival_time
    seq_span.set_attribute("e2e_time", e2e_time)
# Add appropriate else logic if needed for canceled requests
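For completeness, an illustrative sketch of the same guard (untested, and assuming first_token_time can likewise be None when a request is canceled before any token is produced; attribute names follow the snippet above rather than vLLM's actual span attribute constants):

# Illustrative sketch only, not a tested patch. Assumes first_token_time,
# like finished_time, may be None for requests aborted early.
metrics = seq_group.metrics
if metrics.first_token_time is not None:
    ttft = metrics.first_token_time - metrics.arrival_time
    seq_span.set_attribute("ttft", ttft)
if metrics.finished_time is not None:
    e2e_time = metrics.finished_time - metrics.arrival_time
    seq_span.set_attribute("e2e_time", e2e_time)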
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.