### Describe the Bug

Three GPUs are needed for disaggregated (disagg) EPD serving: one GPU each for the encode (E), prefill (P), and decode (D) workers. If fewer GPUs are available than the EPD graph requires, the service "starts up" normally and then hangs when the user makes an inference call, with no sign or error that there are not enough GPUs available to run the service.
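For context, the GPU requests behind the 3-GPU requirement, reconstructed here from the resolved config printed in the logs below (the actual layout of `configs/disagg.yaml` may differ):

```yaml
# Each of the three workers requests one dedicated GPU,
# so the graph needs 3 GPUs in total.
VllmEncodeWorker:
  ServiceArgs:
    workers: 1
    resources:
      gpu: 1
VllmPrefillWorker:
  ServiceArgs:
    workers: 1
    resources:
      gpu: 1
VllmDecodeWorker:
  ServiceArgs:
    workers: 1
    resources:
      gpu: 1
```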
### Steps to Reproduce

On a machine with 2 GPUs:

```bash
cd examples/multimodal
dynamo serve graphs.disagg:Frontend -f configs/disagg.yaml
```

Once you see the line `Starting Processor instance with all registered endpoints`, go to another terminal window and make a curl request:
```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "llava-hf/llava-1.5-7b-hf",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What is in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
          }
        }
      ]
    }
  ],
  "max_tokens": 300,
  "stream": false
}'
```
### Expected Behavior

When serving, an error should be thrown at or around the GPU discovery step, saying something like "number of available GPUs is insufficient for serving your graph, X GPUs required".
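For illustration, a minimal sketch of the kind of fail-fast check that could run at the GPU discovery step (the logs below already show `resource._discover_gpus: Discovered 2 GPUs`). The function name and NVML usage are assumptions for the sketch, not dynamo's actual allocator API:

```python
# Hypothetical pre-flight validation, assuming NVML via nvidia-ml-py
# (already installed in this environment). Not dynamo's allocator API.
import pynvml


def validate_gpu_budget(required_gpus: int) -> None:
    """Raise before serving if the graph requests more GPUs than exist."""
    pynvml.nvmlInit()
    try:
        available = pynvml.nvmlDeviceGetCount()
    finally:
        pynvml.nvmlShutdown()
    if available < required_gpus:
        raise RuntimeError(
            f"Number of available GPUs is insufficient for serving your "
            f"graph: {required_gpus} GPUs required, {available} available."
        )


# For this disagg EPD graph: one GPU each for encode, prefill, and decode.
validate_gpu_budget(required_gpus=3)
```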
### Actual Behavior

The service appears to start up normally but hangs at inference time with no errors. In the logs below, the allocator warns that 0 GPUs remain yet proceeds anyway, assigning `CUDA_VISIBLE_DEVICES: '2'` to the encode worker on a machine whose only devices are 0 and 1; the encode worker then falls back to numpy ("Failed to import cupy ... CUDA is not available"), and the request never completes.
### Environment

```
System Information:
OS: Ubuntu 24.04
Glibc Version: 2.39
GCC Version: 13.3.0
Cmake Version: 3.28.3
Rust Version: 1.87.0
Docker Version: 28.2.2
CPU Information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: AuthenticAMD
Model name: AMD Ryzen Threadripper PRO 5975WX 32-Cores
CPU family: 25
Model: 8
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 1
Stepping: 2
Frequency boost: enabled
CPU(s) scaling MHz: 46%
CPU max MHz: 7006.6401
CPU min MHz: 1800.0000
BogoMIPS: 7186.51
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
Virtualization: AMD-V
L1d cache: 1 MiB (32 instances)
L1i cache: 1 MiB (32 instances)
L2 cache: 16 MiB (32 instances)
L3 cache: 128 MiB (4 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-63
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Python Version: 3.12.3
Python Platform: Linux-6.8.0-57-generic-x86_64-with-glibc2.39
Python Packages:
ai-dynamo: 0.3.0
ai-dynamo-runtime: 0.3.0
ai-dynamo-vllm: 0.8.4.post2
genai-perf: Not installed
nixl: 0.3.0
numpy: 1.26.4
nvidia-cublas-cu12: 12.4.5.8
nvidia-cuda-cupti-cu12: 12.4.127
nvidia-cuda-nvrtc-cu12: 12.4.127
nvidia-cuda-runtime-cu12: 12.4.127
nvidia-cudnn-cu12: 9.1.0.70
nvidia-cufft-cu12: 11.2.1.3
nvidia-curand-cu12: 10.3.5.147
nvidia-cusolver-cu12: 11.6.1.9
nvidia-cusparse-cu12: 12.3.1.170
nvidia-ml-py: 12.575.51
nvidia-nccl-cu12: 2.21.5
nvidia-nvjitlink-cu12: 12.4.127
nvidia-nvtx-cu12: 12.4.127
pyzmq: 26.4.0
tensorrt_llm: Not installed
torch: 2.6.0
torchaudio: 2.6.0
transformers: 4.52.4
tritonclient: 2.53.0
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20 Driver Version: 570.133.20 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:2B:00.0 Off | Off |
| 30% 45C P8 18W / 300W | 223MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA RTX A6000 Off | 00000000:61:00.0 Off | Off |
| 30% 41C P8 22W / 300W | 15MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
```
### Additional Context

Raw disagg logs:

```
dynamo serve graphs.disagg:Frontend -f configs/disagg.yaml
2025-06-03T23:57:40.033Z INFO utils.resolve_service_config: Running dynamo serve with config: {'Common': {'model': 'llava-hf/llava-1.5-7b-hf', 'block-size': 64, 'max-model-len': 4096, 'kv-transfer-config': '{"kv_connector":"DynamoNixlConnector"}'}, 'Processor': {'router': 'round-robin', 'common-configs': ['model', 'block-size']}, 'VllmDecodeWorker': {'remote-prefill': True, 'conditional-disagg': True, 'max-local-prefill-length': 10, 'max-prefill-queue-size': 2, 'ServiceArgs': {'workers': 1, 'resources': {'gpu': 1}}, 'common-configs': ['model', 'block-size', 'max-model-len', 'kv-transfer-config']}, 'VllmPrefillWorker': {'max-num-batched-tokens': 16384, 'ServiceArgs': {'workers': 1, 'resources': {'gpu': 1}}, 'common-configs': ['model', 'block-size', 'max-model-len', 'kv-transfer-config']}, 'VllmEncodeWorker': {'tensor-parallel-size': 1, 'router': 'random', 'ServiceArgs': {'workers': 1, 'resources': {'gpu': 1}}, 'common-configs': ['model']}}
2025-06-03T23:57:41.625Z INFO __init__: Utilizing cupy to enable GPU acceleration.
2025-06-03T23:57:43.732Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:43.733Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:43.735Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:43.740Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:43.741Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:43.743Z INFO __init__.resolve_current_platform_cls_qualname: Automatically detected platform cuda.
2025-06-03T23:57:43.930Z INFO nixl: NIXL is available
2025-06-03T23:57:44.871Z INFO encode_worker: Using cupy for array operations (GPU mode).
╭───────────────── Dynamo Serve ──────────────────╮
│ Starting Dynamo service: graphs.disagg:Frontend │
╰─────────────────────────────────────────────────╯
2025-06-03T23:57:44.982Z INFO resource._discover_gpus: Discovered 2 GPUs
2025-06-03T23:57:45.010Z INFO resource._discover_gpus: Discovered 2 GPUs
2025-06-03T23:57:45.010Z INFO allocator.get_resource_envs: Getting resource envs for service Frontend
2025-06-03T23:57:45.010Z INFO allocator.get_resource_envs: Using configured worker count: 1
2025-06-03T23:57:45.010Z INFO allocator.get_resource_envs: Final resource allocation - workers: 1, envs: []
2025-06-03T23:57:45.010Z INFO serving.serve_dynamo_graph: Serving dynamo graph with namespace dynamo
2025-06-03T23:57:45.010Z INFO serving.serve_dynamo_graph: Clearing namespace dynamo before serving
2025-06-03T23:57:45.010Z INFO allocator.get_resource_envs: Getting resource envs for service Processor
2025-06-03T23:57:45.010Z INFO allocator.get_resource_envs: Using configured worker count: 1
2025-06-03T23:57:45.010Z INFO allocator.get_resource_envs: Final resource allocation - workers: 1, envs: []
2025-06-03T23:57:45.010Z INFO serving.create_dynamo_watcher: Created watcher for Processor's in the dynamo namespace
2025-06-03T23:57:45.010Z INFO allocator.get_resource_envs: Getting resource envs for service VllmDecodeWorker
2025-06-03T23:57:45.011Z INFO allocator.get_resource_envs: GPU requirement found: 1
2025-06-03T23:57:45.011Z INFO allocator.get_resource_envs: Using configured worker count: 1
2025-06-03T23:57:45.011Z INFO allocator.get_resource_envs: GPU allocation enabled
2025-06-03T23:57:45.011Z INFO allocator.get_resource_envs: Local deployment detected. Allocating GPUs for 1 workers of 'VllmDecodeWorker'
2025-06-03T23:57:45.011Z INFO allocator.get_resource_envs: GPU 0 (NVIDIA RTX A6000): Memory: 47.2GB free / 48.0GB total, Utilization: 0%, Temperature: 49°C
2025-06-03T23:57:45.011Z INFO allocator.get_resource_envs: Final resource allocation - workers: 1, envs: [{'CUDA_VISIBLE_DEVICES': '0'}]
2025-06-03T23:57:45.011Z INFO serving.create_dynamo_watcher: Created watcher for VllmDecodeWorker's in the dynamo namespace
2025-06-03T23:57:45.011Z INFO allocator.get_resource_envs: Getting resource envs for service VllmPrefillWorker
2025-06-03T23:57:45.011Z INFO allocator.get_resource_envs: GPU requirement found: 1
2025-06-03T23:57:45.011Z INFO allocator.get_resource_envs: Using configured worker count: 1
2025-06-03T23:57:45.011Z INFO allocator.get_resource_envs: GPU allocation enabled
2025-06-03T23:57:45.011Z INFO allocator.get_resource_envs: Local deployment detected. Allocating GPUs for 1 workers of 'VllmPrefillWorker'
2025-06-03T23:57:45.012Z INFO allocator.get_resource_envs: GPU 1 (NVIDIA RTX A6000): Memory: 47.4GB free / 48.0GB total, Utilization: 0%, Temperature: 43°C
2025-06-03T23:57:45.012Z INFO allocator.get_resource_envs: Final resource allocation - workers: 1, envs: [{'CUDA_VISIBLE_DEVICES': '1'}]
2025-06-03T23:57:45.012Z INFO serving.create_dynamo_watcher: Created watcher for VllmPrefillWorker's in the dynamo namespace
2025-06-03T23:57:45.012Z INFO allocator.get_resource_envs: Getting resource envs for service VllmEncodeWorker
2025-06-03T23:57:45.012Z INFO allocator.get_resource_envs: GPU requirement found: 1
2025-06-03T23:57:45.012Z INFO allocator.get_resource_envs: Using configured worker count: 1
2025-06-03T23:57:45.012Z INFO allocator.get_resource_envs: GPU allocation enabled
2025-06-03T23:57:45.012Z INFO allocator.get_resource_envs: Local deployment detected. Allocating GPUs for 1 workers of 'VllmEncodeWorker'
2025-06-03T23:57:45.012Z WARN allocator.assign_gpus: Requested 1 GPUs, but only 0 are remaining. Serving may fail due to inadequate GPUs. Set DYN_DISABLE_AUTO_GPU_ALLOCATION=1 to disable automatic allocation and allocate GPUs manually.
2025-06-03T23:57:45.012Z WARN allocator.assign_gpus: Not enough GPUs to be assigned, 1 is requested
2025-06-03T23:57:45.012Z INFO allocator.get_resource_envs: Final resource allocation - workers: 1, envs: [{'CUDA_VISIBLE_DEVICES': '2'}]
2025-06-03T23:57:45.013Z INFO serving.create_dynamo_watcher: Created watcher for VllmEncodeWorker's in the dynamo namespace
2025-06-03T23:57:45.013Z INFO serving.serve_dynamo_graph: Created watcher for Frontend with 1 workers in the dynamo namespace
2025-06-03T23:57:45.013Z INFO arbiter._ensure_ioloop: Installing handle_callback_exception to loop
2025-06-03T23:57:45.013Z INFO sighandler.__init__: Registering signals...
2025-06-03T23:57:45.014Z INFO arbiter.start: Starting master on pid 527996
2025-06-03T23:57:45.015Z INFO arbiter.initialize: sockets started
2025-06-03T23:57:45.052Z INFO arbiter.start: Arbiter now waiting for commands
2025-06-03T23:57:45.052Z INFO watcher._start: dynamo_Processor started
2025-06-03T23:57:45.084Z INFO watcher._start: dynamo_VllmDecodeWorker started
2025-06-03T23:57:45.115Z INFO watcher._start: dynamo_VllmPrefillWorker started
2025-06-03T23:57:45.146Z INFO watcher._start: dynamo_VllmEncodeWorker started
2025-06-03T23:57:45.175Z INFO watcher._start: dynamo_Frontend started
2025-06-03T23:57:45.175Z INFO serving.<lambda>: Starting graphs.disagg:Frontend (Press CTRL+C to quit)
2025-06-03T23:57:47.791Z INFO __init__: Utilizing cupy to enable GPU acceleration.
2025-06-03T23:57:47.793Z INFO __init__: Utilizing cupy to enable GPU acceleration.
2025-06-03T23:57:47.827Z INFO __init__: Utilizing cupy to enable GPU acceleration.
2025-06-03T23:57:47.832Z INFO __init__: Utilizing cupy to enable GPU acceleration.
2025-06-03T23:57:47.882Z INFO __init__: Utilizing cupy to enable GPU acceleration.
2025-06-03T23:57:49.730Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:49.731Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:49.733Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:49.739Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:49.740Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:49.741Z INFO __init__.resolve_current_platform_cls_qualname: Automatically detected platform cuda.
2025-06-03T23:57:49.795Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:49.796Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:49.796Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:49.797Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:49.798Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:49.799Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:49.803Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:49.803Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:49.805Z INFO __init__.resolve_current_platform_cls_qualname: Automatically detected platform cuda.
2025-06-03T23:57:49.805Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:49.806Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:49.807Z INFO __init__.resolve_current_platform_cls_qualname: Automatically detected platform cuda.
2025-06-03T23:57:49.811Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:49.812Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:49.814Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:49.819Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:49.820Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:49.822Z INFO __init__.resolve_current_platform_cls_qualname: Automatically detected platform cuda.
2025-06-03T23:57:49.836Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:49.837Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:49.841Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:49.846Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:49.847Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:57:49.849Z INFO __init__.resolve_current_platform_cls_qualname: Automatically detected platform cuda.
2025-06-03T23:57:49.926Z INFO nixl: NIXL is available
2025-06-03T23:57:49.989Z INFO nixl: NIXL is available
2025-06-03T23:57:49.996Z INFO nixl: NIXL is available
2025-06-03T23:57:50.013Z INFO nixl: NIXL is available
2025-06-03T23:57:50.036Z INFO nixl: NIXL is available
2025-06-03T23:57:50.692Z INFO encode_worker: Using cupy for array operations (GPU mode).
2025-06-03T23:57:50.772Z INFO serve_dynamo.dyn_worker: [Processor:1] Registering component dynamo/Processor
2025-06-03T23:57:50.773Z INFO serve_dynamo.dyn_worker: [Processor:1] Created Processor component
2025-06-03T23:57:50.773Z INFO config.as_args: [Processor:1] Running Processor with args=['--model', 'llava-hf/llava-1.5-7b-hf', '--block-size', '64', '--router', 'round-robin']
2025-06-03T23:57:50.815Z INFO encode_worker: Using cupy for array operations (GPU mode).
2025-06-03T23:57:50.872Z INFO encode_worker: Using cupy for array operations (GPU mode).
2025-06-03T23:57:50.880Z INFO encode_worker: Using cupy for array operations (GPU mode).
2025-06-03T23:57:50.890Z WARN encode_worker: Failed to import cupy, falling back to numpy: CUDA is not available..
2025-06-03T23:57:50.895Z INFO serve_dynamo.dyn_worker: [VllmPrefillWorker:1] Registering component dynamo/VllmPrefillWorker
2025-06-03T23:57:50.896Z INFO serve_dynamo.dyn_worker: [VllmPrefillWorker:1] Created VllmPrefillWorker component
2025-06-03T23:57:50.896Z INFO config.as_args: [VllmPrefillWorker:1] Running VllmPrefillWorker with args=['--model', 'llava-hf/llava-1.5-7b-hf', '--block-size', '64', '--max-model-len', '4096', '--kv-transfer-config', '{"kv_connector":"DynamoNixlConnector"}', '--max-num-batched-tokens', '16384']
2025-06-03T23:57:50.946Z INFO prefill_worker.__init__: [VllmPrefillWorker:1] Chunked prefill is not supported yet, setting to False
2025-06-03T23:57:50.946Z INFO prefill_worker.__init__: [VllmPrefillWorker:1] Async output processing is not supported yet, setting to True
2025-06-03T23:57:50.946Z INFO prefill_worker.__init__: [VllmPrefillWorker:1] Prefill must be done eagerly, setting to True
2025-06-03T23:57:50.946Z INFO prefill_worker.__init__: [VllmPrefillWorker:1] Prefix caching is not supported yet in prefill worker, setting to False
2025-06-03T23:57:50.953Z INFO serve_dynamo.dyn_worker: [VllmDecodeWorker:1] Registering component dynamo/VllmDecodeWorker
2025-06-03T23:57:50.953Z INFO serve_dynamo.dyn_worker: [Frontend:1] Registering component dynamo/Frontend
2025-06-03T23:57:50.954Z INFO serve_dynamo.dyn_worker: [VllmDecodeWorker:1] Created VllmDecodeWorker component
2025-06-03T23:57:50.954Z INFO config.as_args: [VllmDecodeWorker:1] Running VllmDecodeWorker with args=['--model', 'llava-hf/llava-1.5-7b-hf', '--block-size', '64', '--max-model-len', '4096', '--kv-transfer-config', '{"kv_connector":"DynamoNixlConnector"}', '--remote-prefill', '--conditional-disagg', '--max-local-prefill-length', '10', '--max-prefill-queue-size', '2']
2025-06-03T23:57:50.954Z INFO serve_dynamo.dyn_worker: [Frontend:1] Created Frontend component
2025-06-03T23:57:50.954Z INFO serve_dynamo.dyn_worker: [Frontend:1] Starting Frontend instance with all registered endpoints
2025-06-03T23:57:50.954Z INFO serve_dynamo.dyn_worker: [Frontend:1] No Dynamo endpoints found in service Frontend but keeping service alive
2025-06-03T23:57:50.954Z INFO serve_dynamo.add_fastapi_routes: [Frontend:1] Registering regular endpoint /v1/chat/completions
2025-06-03T23:57:50.955Z INFO serve_dynamo.add_fastapi_routes: [Frontend:1] Added API route /v1/chat/completions to FastAPI app
2025-06-03T23:57:50.955Z INFO serve_dynamo.web_worker: [Frontend:1] Starting FastAPI server on 0.0.0.0:8000 with routes: ['/v1/chat/completions']
2025-06-03T23:57:50.969Z INFO serve_dynamo.dyn_worker: [VllmEncodeWorker:1] Registering component dynamo/VllmEncodeWorker
2025-06-03T23:57:50.970Z INFO serve_dynamo.dyn_worker: [VllmEncodeWorker:1] Created VllmEncodeWorker component
2025-06-03T23:57:50.970Z INFO config.as_args: [VllmEncodeWorker:1] Running VllmEncodeWorker with args=['--model', 'llava-hf/llava-1.5-7b-hf', '--tensor-parallel-size', '1', '--router', 'random']
2025-06-03T23:57:50.971Z INFO server._serve: [Frontend:1] Started server process [528212]
2025-06-03T23:57:50.971Z INFO on.startup: [Frontend:1] Waiting for application startup.
2025-06-03T23:57:50.971Z INFO on.startup: [Frontend:1] Application startup complete.
2025-06-03T23:57:50.971Z INFO server._log_started_message: [Frontend:1] Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
2025-06-03T23:57:51.006Z INFO decode_worker.__init__: [VllmDecodeWorker:1] Prefill queue: nats://localhost:4222:vllm
2025-06-03T23:57:51.006Z INFO decode_worker.__init__: [VllmDecodeWorker:1] Chunked prefill is not supported yet, setting to False
2025-06-03T23:57:51.006Z INFO decode_worker.__init__: [VllmDecodeWorker:1] Preemption mode is not supported yet, setting to swap
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 20.74it/s]
2025-06-03T23:57:52.192Z INFO encode_worker.async_init: [VllmEncodeWorker:1] Startup started.
Backend UCX was instantiated
Initialized NIXL agent: 2b529c7d-4d88-43a3-b093-170f62d69942
2025-06-03T23:57:52.403Z INFO encode_worker.async_init: [VllmEncodeWorker:1] Startup completed.
2025-06-03T23:57:52.403Z INFO serve_dynamo.dyn_worker: [VllmEncodeWorker:1] Starting VllmEncodeWorker instance with all registered endpoints
2025-06-03T23:57:58.034Z INFO config._resolve_task: This model supports multiple tasks: {'generate', 'reward', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
2025-06-03T23:57:58.040Z INFO api_server.build_async_engine_client_from_engine_args: Started engine process with PID 529513
2025-06-03T23:57:58.104Z INFO config._resolve_task: This model supports multiple tasks: {'reward', 'classify', 'embed', 'generate', 'score'}. Defaulting to 'generate'.
2025-06-03T23:57:58.189Z INFO config._resolve_task: This model supports multiple tasks: {'score', 'generate', 'reward', 'classify', 'embed'}. Defaulting to 'generate'.
2025-06-03T23:57:58.193Z INFO api_server.build_async_engine_client_from_engine_args: Started engine process with PID 529516
2025-06-03T23:58:02.174Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:58:02.175Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:58:02.177Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:58:02.182Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:58:02.183Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:58:02.184Z INFO __init__.resolve_current_platform_cls_qualname: Automatically detected platform cuda.
2025-06-03T23:58:02.341Z INFO nixl: NIXL is available
2025-06-03T23:58:02.347Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:58:02.348Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:58:02.350Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:58:02.355Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:58:02.356Z WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm
2025-06-03T23:58:02.358Z INFO __init__.resolve_current_platform_cls_qualname: Automatically detected platform cuda.
2025-06-03T23:58:02.517Z INFO nixl: NIXL is available
2025-06-03T23:58:03.000Z INFO llm_engine.__init__: Initializing a V0 LLM engine (v0.8.4) with config: model='llava-hf/llava-1.5-7b-hf', speculative_config=None, tokenizer='llava-hf/llava-1.5-7b-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=llava-hf/llava-1.5-7b-hf, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
2025-06-03T23:58:03.193Z INFO llm_engine.__init__: Initializing a V0 LLM engine (v0.8.4) with config: model='llava-hf/llava-1.5-7b-hf', speculative_config=None, tokenizer='llava-hf/llava-1.5-7b-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=llava-hf/llava-1.5-7b-hf, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True,
2025-06-03T23:58:04.282Z INFO cuda.get_attn_backend_cls: Using Flash Attention backend.
2025-06-03T23:58:04.462Z INFO cuda.get_attn_backend_cls: Using Flash Attention backend.
2025-06-03T23:58:04.690Z INFO parallel_state.initialize_model_parallel: rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
2025-06-03T23:58:04.690Z INFO model_runner.load_model: Starting to load model llava-hf/llava-1.5-7b-hf...
2025-06-03T23:58:04.854Z INFO parallel_state.initialize_model_parallel: rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
2025-06-03T23:58:04.854Z INFO model_runner.load_model: Starting to load model llava-hf/llava-1.5-7b-hf...
2025-06-03T23:58:05.006Z INFO cuda.get_attn_backend_cls: Using Flash Attention backend.
2025-06-03T23:58:05.025Z INFO config.init_with_cudagraph_sizes: cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248]
2025-06-03T23:58:05.154Z INFO cuda.get_attn_backend_cls: Using Flash Attention backend.
2025-06-03T23:58:05.172Z INFO config.init_with_cudagraph_sizes: cudagraph sizes specified by model runner [] is overridden by config []
2025-06-03T23:58:05.321Z INFO weight_utils.download_weights_from_hf: Using model weights format ['*.safetensors']
2025-06-03T23:58:05.460Z INFO weight_utils.download_weights_from_hf: Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:00<00:01, 1.52it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:00<00:01, 1.61it/s]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:01<00:00, 1.37it/s]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:01<00:00, 1.37it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00, 1.41it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00, 1.42it/s]
2025-06-03T23:58:07.787Z INFO loader.load_model: Loading weights took 2.26 seconds
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00, 1.39it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00, 1.41it/s]
2025-06-03T23:58:07.980Z INFO model_runner.load_model: Model loading took 13.1343 GiB and 3.100958 seconds
2025-06-03T23:58:08.099Z INFO loader.load_model: Loading weights took 2.31 seconds
2025-06-03T23:58:08.291Z INFO model_runner.load_model: Model loading took 13.1343 GiB and 3.248275 seconds
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
2025-06-03T23:58:10.635Z INFO worker.determine_num_available_blocks: Memory profiling takes 2.49 seconds
the current vLLM instance can use total_gpu_memory (47.39GiB) x gpu_memory_utilization (0.90) = 42.65GiB
model weights take 13.13GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 0.49GiB; the rest of the memory reserved for KV Cache is 28.97GiB.
2025-06-03T23:58:10.801Z INFO executor_base.initialize_cache: # cuda blocks: 926, # CPU blocks: 128
2025-06-03T23:58:10.801Z INFO executor_base.initialize_cache: Maximum concurrency for 4096 tokens per request: 14.47x
2025-06-03T23:58:12.767Z INFO worker.determine_num_available_blocks: Memory profiling takes 4.29 seconds
the current vLLM instance can use total_gpu_memory (47.41GiB) x gpu_memory_utilization (0.90) = 42.67GiB
model weights take 13.13GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 1.55GiB; the rest of the memory reserved for KV Cache is 27.93GiB.
2025-06-03T23:58:12.971Z INFO executor_base.initialize_cache: # cuda blocks: 893, # CPU blocks: 128
2025-06-03T23:58:12.971Z INFO executor_base.initialize_cache: Maximum concurrency for 4096 tokens per request: 13.95x
2025-06-03T23:58:13.201Z INFO model_runner.capture_model: Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 11%|████████▌ | 4/35 [00:01<00:15, 2.07it/s]2025-06-03T23:58:15.375Z INFO llm_engine._initialize_kv_caches: init engine (profile, create kv cache, warmup model) took 7.08 seconds
2025-06-03T23:58:15.399Z INFO nixl.__init__: Initializing NIXL wrapper
Backend UCX was instantiated
Initialized NIXL agent: 251736b8-a84c-46b3-9a68-3a0a7d50657d
Capturing CUDA graph shapes: 14%|██████████▋ | 5/35 [00:02<00:14, 2.07it/s]Backend UCX was instantiated
Initialized NIXL agent: d5d864a0-5fc3-4b32-a418-b1b78c010ed1
Workers ready: [7587887212314322840]
2025-06-03T23:58:16.035Z INFO prefill_worker.async_init: [VllmPrefillWorker:1] Initialization complete.
2025-06-03T23:58:16.035Z INFO serve_dynamo.dyn_worker: [VllmPrefillWorker:1] Starting VllmPrefillWorker instance with all registered endpoints
2025-06-03T23:58:16.035Z INFO prefill_worker.prefill_queue_handler: [VllmPrefillWorker:1] Prefill queue handler entered
2025-06-03T23:58:16.035Z INFO prefill_worker.prefill_queue_handler: [VllmPrefillWorker:1] Prefill queue: nats://localhost:4222:vllm
2025-06-03T23:58:16.038Z INFO prefill_worker.prefill_queue_handler: [VllmPrefillWorker:1] prefill queue handler started
2025-06-03T23:58:16.039Z INFO prefill_worker.prefill_queue_handler: [VllmPrefillWorker:1] Dequeued prefill request: 56d80da5-d65b-4f79-b8b8-c8d723bf88e4
2025-06-03T23:58:16.039Z INFO prefill_worker.generate: [VllmPrefillWorker:1] Received prefill request { id: 56d80da5-d65b-4f79-b8b8-c8d723bf88e4, engine_id: cc9a3288-5a0a-402b-b1e7-38f5a9d3eba7 }.
Capturing CUDA graph shapes: 17%|████████████▊ | 6/35 [00:02<00:13, 2.10it/s]2025-06-03T23:58:16.265Z INFO _client._send_single_request: [VllmEncodeWorker:1] HTTP Request: GET http://images.cocodataset.org/test2017/000000155781.jpg "HTTP/1.1 200 OK"
Capturing CUDA graph shapes: 100%|██████████████████████████████████████████████████████████████████████████| 35/35 [00:15<00:00, 2.31it/s]
2025-06-03T23:58:28.344Z INFO model_runner.capture_model: Graph capturing finished in 15 secs, took 0.24 GiB
2025-06-03T23:58:28.344Z INFO llm_engine._initialize_kv_caches: init engine (profile, create kv cache, warmup model) took 20.36 seconds
2025-06-03T23:58:28.362Z INFO nixl.__init__: Initializing NIXL wrapper
Backend UCX was instantiated
Initialized NIXL agent: c7c0eb15-5af1-45e5-a5e8-1a582d85ec64
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00, 1.25it/s]
2025-06-03T23:58:32.030Z INFO decode_worker.async_init: [VllmDecodeWorker:1] Initialization complete { configuration: Disaggregated }.
2025-06-03T23:58:33.887Z INFO serve_dynamo.dyn_worker: [VllmDecodeWorker:1] Starting VllmDecodeWorker instance with all registered endpoints
2025-06-03T23:58:34.125Z INFO logging.check_required_workers: [Processor:1] Waiting for more workers to be ready.
Current: 1, Required: 1
Workers ready: [7587887212314322836]
2025-06-03T23:58:34.129Z INFO serve_dynamo.dyn_worker: [Processor:1] Starting Processor instance with all registered endpoints
2025-06-03T23:59:35.386Z INFO httptools_impl.send: [Frontend:1] 127.0.0.1:42238 - "POST /v1/chat/completions HTTP/1.1" 200
2025-06-03T23:59:35.394Z INFO chat_utils._log_chat_template_content_format: Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
2025-06-03T23:59:35.399Z INFO decode_worker.generate: [VllmDecodeWorker:1] Received multimodal request { id: d60125a3-66c8-40fe-a25b-91e43f34f593 }.
2025-06-03T23:59:35.402Z INFO disagg_router.prefill_remote: [VllmDecodeWorker:1] Remote prefill: True (prefill length: 23/10, prefill queue size: 0/2)
2025-06-03T23:59:35.403Z INFO engine._handle_process_request: Added request d60125a3-66c8-40fe-a25b-91e43f34f593.
2025-06-03T23:59:35.404Z INFO metrics.log: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 90.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.1%, CPU KV cache usage: 0.0%.
2025-06-03T23:59:40.406Z INFO metrics.log: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.1%, CPU KV cache usage: 0.0%.
```
### Screenshots

_No response_