
[BUG]: Error should be thrown if number of GPUs are insufficient #1373

Open
@hhzhang16

Description


Describe the Bug

Three GPUs are needed for disaggregated EPD serving: one GPU each for the Encode, Prefill, and Decode workers. If the number of GPUs available to the user is less than the number the EPD service needs, the service "starts up" normally and then hangs when the user makes an inference call, with no sign or error that the user doesn't have enough GPUs to run the service.
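For context, the 3-GPU total comes straight from the per-service GPU requests in the disagg.yaml ServiceArgs blocks (see the resolved config in the logs below). A minimal sketch of summing those requests; the helper name is illustrative, not Dynamo's actual API:

```python
# Sum the per-service GPU requests from a resolved config dict shaped like
# the one logged by utils.resolve_service_config. Hypothetical helper name.
def count_required_gpus(config: dict) -> int:
    total = 0
    for settings in config.values():
        if not isinstance(settings, dict):
            continue
        args = settings.get("ServiceArgs", {})
        workers = args.get("workers", 1)
        gpus = args.get("resources", {}).get("gpu", 0)
        total += workers * gpus
    return total

# The disagg config requests one GPU each for the decode, prefill, and
# encode workers, so three GPUs are required in total.
config = {
    "Common": {"model": "llava-hf/llava-1.5-7b-hf"},
    "Processor": {"router": "round-robin"},
    "VllmDecodeWorker": {"ServiceArgs": {"workers": 1, "resources": {"gpu": 1}}},
    "VllmPrefillWorker": {"ServiceArgs": {"workers": 1, "resources": {"gpu": 1}}},
    "VllmEncodeWorker": {"ServiceArgs": {"workers": 1, "resources": {"gpu": 1}}},
}
print(count_required_gpus(config))  # 3
```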

Steps to Reproduce

On a machine with 2 GPUs:

  1. cd examples/multimodal
  2. dynamo serve graphs.disagg:Frontend -f configs/disagg.yaml
  3. Once you see the line "Starting Processor instance with all registered endpoints", go to another terminal window and make a curl request:
curl http://localhost:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '{
      "model": "llava-hf/llava-1.5-7b-hf",
      "messages": [
        {
          "role": "user",
          "content": [
            {
              "type": "text",
              "text": "What is in this image?"
            },
            {
              "type": "image_url",
              "image_url": {
                "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
              }
            }
          ]
        }
      ],
      "max_tokens": 300,
      "stream": false
    }'

Expected Behavior

When serving, an error should be thrown at (or around) the GPU discovery step, saying something like "number of available GPUs is insufficient for serving your graph, X GPUs required".
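One possible shape for that check, failing fast at startup instead of warning and continuing (function name and message are illustrative, not Dynamo's API):

```python
# Hypothetical fail-fast capacity check: raise during GPU discovery rather
# than logging a warning and assigning a nonexistent device index, as the
# allocator currently does (see CUDA_VISIBLE_DEVICES=2 in the logs below).
def check_gpu_capacity(required: int, available: int) -> None:
    if available < required:
        raise RuntimeError(
            f"number of available GPUs is insufficient for serving your graph, "
            f"{required} GPUs required (only {available} available)"
        )

check_gpu_capacity(required=2, available=2)  # ok, serving proceeds
try:
    check_gpu_capacity(required=3, available=2)  # the scenario in this report
except RuntimeError as e:
    print(e)
```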

Actual Behavior

The service appears to start up normally, but hangs at inference time with no errors.

Environment

System Information:
OS: Ubuntu 24.04
Glibc Version: 2.39
GCC Version: 13.3.0
Cmake Version: 3.28.3
Rust Version: 1.87.0
Docker Version: 28.2.2

CPU Information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: AuthenticAMD
Model name: AMD Ryzen Threadripper PRO 5975WX 32-Cores
CPU family: 25
Model: 8
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 1
Stepping: 2
Frequency boost: enabled
CPU(s) scaling MHz: 46%
CPU max MHz: 7006.6401
CPU min MHz: 1800.0000
BogoMIPS: 7186.51
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
Virtualization: AMD-V
L1d cache: 1 MiB (32 instances)
L1i cache: 1 MiB (32 instances)
L2 cache: 16 MiB (32 instances)
L3 cache: 128 MiB (4 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-63
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Python Version: 3.12.3
Python Platform: Linux-6.8.0-57-generic-x86_64-with-glibc2.39

Python Packages:
ai-dynamo: 0.3.0
ai-dynamo-runtime: 0.3.0
ai-dynamo-vllm: 0.8.4.post2
genai-perf: Not installed
nixl: 0.3.0
numpy: 1.26.4
nvidia-cublas-cu12: 12.4.5.8
nvidia-cuda-cupti-cu12: 12.4.127
nvidia-cuda-nvrtc-cu12: 12.4.127
nvidia-cuda-runtime-cu12: 12.4.127
nvidia-cudnn-cu12: 9.1.0.70
nvidia-cufft-cu12: 11.2.1.3
nvidia-curand-cu12: 10.3.5.147
nvidia-cusolver-cu12: 11.6.1.9
nvidia-cusparse-cu12: 12.3.1.170
nvidia-ml-py: 12.575.51
nvidia-nccl-cu12: 2.21.5
nvidia-nvjitlink-cu12: 12.4.127
nvidia-nvtx-cu12: 12.4.127
pyzmq: 26.4.0
tensorrt_llm: Not installed
torch: 2.6.0
torchaudio: 2.6.0
transformers: 4.52.4
tritonclient: 2.53.0

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20 Driver Version: 570.133.20 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:2B:00.0 Off | Off |
| 30% 45C P8 18W / 300W | 223MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA RTX A6000 Off | 00000000:61:00.0 Off | Off |
| 30% 41C P8 22W / 300W | 15MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

Additional Context

Raw disagg logs:

dynamo serve graphs.disagg:Frontend -f configs/disagg.yaml

2025-06-03T23:57:40.033Z  INFO utils.resolve_service_config: Running dynamo serve with config: {'Common': {'model': 'llava-hf/llava-1.5-7b-hf', 'block-size': 64, 'max-model-len': 4096, 'kv-transfer-config': '{"kv_connector":"DynamoNixlConnector"}'}, 'Processor': {'router': 'round-robin', 'common-configs': ['model', 'block-size']}, 'VllmDecodeWorker': {'remote-prefill': True, 'conditional-disagg': True, 'max-local-prefill-length': 10, 'max-prefill-queue-size': 2, 'ServiceArgs': {'workers': 1, 'resources': {'gpu': 1}}, 'common-configs': ['model', 'block-size', 'max-model-len', 'kv-transfer-config']}, 'VllmPrefillWorker': {'max-num-batched-tokens': 16384, 'ServiceArgs': {'workers': 1, 'resources': {'gpu': 1}}, 'common-configs': ['model', 'block-size', 'max-model-len', 'kv-transfer-config']}, 'VllmEncodeWorker': {'tensor-parallel-size': 1, 'router': 'random', 'ServiceArgs': {'workers': 1, 'resources': {'gpu': 1}}, 'common-configs': ['model']}}   
2025-06-03T23:57:41.625Z  INFO __init__: Utilizing cupy to enable GPU acceleration.   
2025-06-03T23:57:43.732Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:43.733Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:43.735Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:43.740Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:43.741Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:43.743Z  INFO __init__.resolve_current_platform_cls_qualname: Automatically detected platform cuda.   
2025-06-03T23:57:43.930Z  INFO nixl: NIXL is available   
2025-06-03T23:57:44.871Z  INFO encode_worker: Using cupy for array operations (GPU mode).   
╭───────────────── Dynamo Serve ──────────────────╮
│ Starting Dynamo service: graphs.disagg:Frontend │
╰─────────────────────────────────────────────────╯
2025-06-03T23:57:44.982Z  INFO resource._discover_gpus: Discovered 2 GPUs   
2025-06-03T23:57:45.010Z  INFO resource._discover_gpus: Discovered 2 GPUs   
2025-06-03T23:57:45.010Z  INFO allocator.get_resource_envs: Getting resource envs for service Frontend   
2025-06-03T23:57:45.010Z  INFO allocator.get_resource_envs: Using configured worker count: 1   
2025-06-03T23:57:45.010Z  INFO allocator.get_resource_envs: Final resource allocation - workers: 1, envs: []   
2025-06-03T23:57:45.010Z  INFO serving.serve_dynamo_graph: Serving dynamo graph with namespace dynamo   
2025-06-03T23:57:45.010Z  INFO serving.serve_dynamo_graph: Clearing namespace dynamo before serving   
2025-06-03T23:57:45.010Z  INFO allocator.get_resource_envs: Getting resource envs for service Processor   
2025-06-03T23:57:45.010Z  INFO allocator.get_resource_envs: Using configured worker count: 1   
2025-06-03T23:57:45.010Z  INFO allocator.get_resource_envs: Final resource allocation - workers: 1, envs: []   
2025-06-03T23:57:45.010Z  INFO serving.create_dynamo_watcher: Created watcher for Processor's in the dynamo namespace   
2025-06-03T23:57:45.010Z  INFO allocator.get_resource_envs: Getting resource envs for service VllmDecodeWorker   
2025-06-03T23:57:45.011Z  INFO allocator.get_resource_envs: GPU requirement found: 1   
2025-06-03T23:57:45.011Z  INFO allocator.get_resource_envs: Using configured worker count: 1   
2025-06-03T23:57:45.011Z  INFO allocator.get_resource_envs: GPU allocation enabled   
2025-06-03T23:57:45.011Z  INFO allocator.get_resource_envs: Local deployment detected. Allocating GPUs for 1 workers of 'VllmDecodeWorker'   
2025-06-03T23:57:45.011Z  INFO allocator.get_resource_envs: GPU 0 (NVIDIA RTX A6000): Memory: 47.2GB free / 48.0GB total, Utilization: 0%, Temperature: 49°C   
2025-06-03T23:57:45.011Z  INFO allocator.get_resource_envs: Final resource allocation - workers: 1, envs: [{'CUDA_VISIBLE_DEVICES': '0'}]   
2025-06-03T23:57:45.011Z  INFO serving.create_dynamo_watcher: Created watcher for VllmDecodeWorker's in the dynamo namespace   
2025-06-03T23:57:45.011Z  INFO allocator.get_resource_envs: Getting resource envs for service VllmPrefillWorker   
2025-06-03T23:57:45.011Z  INFO allocator.get_resource_envs: GPU requirement found: 1   
2025-06-03T23:57:45.011Z  INFO allocator.get_resource_envs: Using configured worker count: 1   
2025-06-03T23:57:45.011Z  INFO allocator.get_resource_envs: GPU allocation enabled   
2025-06-03T23:57:45.011Z  INFO allocator.get_resource_envs: Local deployment detected. Allocating GPUs for 1 workers of 'VllmPrefillWorker'   
2025-06-03T23:57:45.012Z  INFO allocator.get_resource_envs: GPU 1 (NVIDIA RTX A6000): Memory: 47.4GB free / 48.0GB total, Utilization: 0%, Temperature: 43°C   
2025-06-03T23:57:45.012Z  INFO allocator.get_resource_envs: Final resource allocation - workers: 1, envs: [{'CUDA_VISIBLE_DEVICES': '1'}]   
2025-06-03T23:57:45.012Z  INFO serving.create_dynamo_watcher: Created watcher for VllmPrefillWorker's in the dynamo namespace   
2025-06-03T23:57:45.012Z  INFO allocator.get_resource_envs: Getting resource envs for service VllmEncodeWorker   
2025-06-03T23:57:45.012Z  INFO allocator.get_resource_envs: GPU requirement found: 1   
2025-06-03T23:57:45.012Z  INFO allocator.get_resource_envs: Using configured worker count: 1   
2025-06-03T23:57:45.012Z  INFO allocator.get_resource_envs: GPU allocation enabled   
2025-06-03T23:57:45.012Z  INFO allocator.get_resource_envs: Local deployment detected. Allocating GPUs for 1 workers of 'VllmEncodeWorker'   
2025-06-03T23:57:45.012Z  WARN allocator.assign_gpus: Requested 1 GPUs, but only 0 are remaining. Serving may fail due to inadequate GPUs. Set DYN_DISABLE_AUTO_GPU_ALLOCATION=1 to disable automatic allocation and allocate GPUs manually.   
2025-06-03T23:57:45.012Z  WARN allocator.assign_gpus: Not enough GPUs to be assigned, 1 is requested   
2025-06-03T23:57:45.012Z  INFO allocator.get_resource_envs: Final resource allocation - workers: 1, envs: [{'CUDA_VISIBLE_DEVICES': '2'}]   
2025-06-03T23:57:45.013Z  INFO serving.create_dynamo_watcher: Created watcher for VllmEncodeWorker's in the dynamo namespace   
2025-06-03T23:57:45.013Z  INFO serving.serve_dynamo_graph: Created watcher for Frontend with 1 workers in the dynamo namespace   
2025-06-03T23:57:45.013Z  INFO arbiter._ensure_ioloop: Installing handle_callback_exception to loop   
2025-06-03T23:57:45.013Z  INFO sighandler.__init__: Registering signals...   
2025-06-03T23:57:45.014Z  INFO arbiter.start: Starting master on pid 527996   
2025-06-03T23:57:45.015Z  INFO arbiter.initialize: sockets started   
2025-06-03T23:57:45.052Z  INFO arbiter.start: Arbiter now waiting for commands   
2025-06-03T23:57:45.052Z  INFO watcher._start: dynamo_Processor started   
2025-06-03T23:57:45.084Z  INFO watcher._start: dynamo_VllmDecodeWorker started   
2025-06-03T23:57:45.115Z  INFO watcher._start: dynamo_VllmPrefillWorker started   
2025-06-03T23:57:45.146Z  INFO watcher._start: dynamo_VllmEncodeWorker started   
2025-06-03T23:57:45.175Z  INFO watcher._start: dynamo_Frontend started   
2025-06-03T23:57:45.175Z  INFO serving.<lambda>: Starting graphs.disagg:Frontend (Press CTRL+C to quit)   
2025-06-03T23:57:47.791Z  INFO __init__: Utilizing cupy to enable GPU acceleration.   
2025-06-03T23:57:47.793Z  INFO __init__: Utilizing cupy to enable GPU acceleration.   
2025-06-03T23:57:47.827Z  INFO __init__: Utilizing cupy to enable GPU acceleration.   
2025-06-03T23:57:47.832Z  INFO __init__: Utilizing cupy to enable GPU acceleration.   
2025-06-03T23:57:47.882Z  INFO __init__: Utilizing cupy to enable GPU acceleration.   
2025-06-03T23:57:49.730Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:49.731Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:49.733Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:49.739Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:49.740Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:49.741Z  INFO __init__.resolve_current_platform_cls_qualname: Automatically detected platform cuda.   
2025-06-03T23:57:49.795Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:49.796Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:49.796Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:49.797Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:49.798Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:49.799Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:49.803Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:49.803Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:49.805Z  INFO __init__.resolve_current_platform_cls_qualname: Automatically detected platform cuda.   
2025-06-03T23:57:49.805Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:49.806Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:49.807Z  INFO __init__.resolve_current_platform_cls_qualname: Automatically detected platform cuda.   
2025-06-03T23:57:49.811Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:49.812Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:49.814Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:49.819Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:49.820Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:49.822Z  INFO __init__.resolve_current_platform_cls_qualname: Automatically detected platform cuda.   
2025-06-03T23:57:49.836Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:49.837Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:49.841Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:49.846Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:49.847Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:57:49.849Z  INFO __init__.resolve_current_platform_cls_qualname: Automatically detected platform cuda.   
2025-06-03T23:57:49.926Z  INFO nixl: NIXL is available   
2025-06-03T23:57:49.989Z  INFO nixl: NIXL is available   
2025-06-03T23:57:49.996Z  INFO nixl: NIXL is available   
2025-06-03T23:57:50.013Z  INFO nixl: NIXL is available   
2025-06-03T23:57:50.036Z  INFO nixl: NIXL is available   
2025-06-03T23:57:50.692Z  INFO encode_worker: Using cupy for array operations (GPU mode).   
2025-06-03T23:57:50.772Z  INFO serve_dynamo.dyn_worker: [Processor:1] Registering component dynamo/Processor   
2025-06-03T23:57:50.773Z  INFO serve_dynamo.dyn_worker: [Processor:1] Created Processor component   
2025-06-03T23:57:50.773Z  INFO config.as_args: [Processor:1] Running Processor with args=['--model', 'llava-hf/llava-1.5-7b-hf', '--block-size', '64', '--router', 'round-robin']   
2025-06-03T23:57:50.815Z  INFO encode_worker: Using cupy for array operations (GPU mode).   
2025-06-03T23:57:50.872Z  INFO encode_worker: Using cupy for array operations (GPU mode).   
2025-06-03T23:57:50.880Z  INFO encode_worker: Using cupy for array operations (GPU mode).   
2025-06-03T23:57:50.890Z  WARN encode_worker: Failed to import cupy, falling back to numpy: CUDA is not available..   
2025-06-03T23:57:50.895Z  INFO serve_dynamo.dyn_worker: [VllmPrefillWorker:1] Registering component dynamo/VllmPrefillWorker   
2025-06-03T23:57:50.896Z  INFO serve_dynamo.dyn_worker: [VllmPrefillWorker:1] Created VllmPrefillWorker component   
2025-06-03T23:57:50.896Z  INFO config.as_args: [VllmPrefillWorker:1] Running VllmPrefillWorker with args=['--model', 'llava-hf/llava-1.5-7b-hf', '--block-size', '64', '--max-model-len', '4096', '--kv-transfer-config', '{"kv_connector":"DynamoNixlConnector"}', '--max-num-batched-tokens', '16384']   
2025-06-03T23:57:50.946Z  INFO prefill_worker.__init__: [VllmPrefillWorker:1] Chunked prefill is not supported yet, setting to False   
2025-06-03T23:57:50.946Z  INFO prefill_worker.__init__: [VllmPrefillWorker:1] Async output processing is not supported yet, setting to True   
2025-06-03T23:57:50.946Z  INFO prefill_worker.__init__: [VllmPrefillWorker:1] Prefill must be done eagerly, setting to True   
2025-06-03T23:57:50.946Z  INFO prefill_worker.__init__: [VllmPrefillWorker:1] Prefix caching is not supported yet in prefill worker, setting to False   
2025-06-03T23:57:50.953Z  INFO serve_dynamo.dyn_worker: [VllmDecodeWorker:1] Registering component dynamo/VllmDecodeWorker   
2025-06-03T23:57:50.953Z  INFO serve_dynamo.dyn_worker: [Frontend:1] Registering component dynamo/Frontend   
2025-06-03T23:57:50.954Z  INFO serve_dynamo.dyn_worker: [VllmDecodeWorker:1] Created VllmDecodeWorker component   
2025-06-03T23:57:50.954Z  INFO config.as_args: [VllmDecodeWorker:1] Running VllmDecodeWorker with args=['--model', 'llava-hf/llava-1.5-7b-hf', '--block-size', '64', '--max-model-len', '4096', '--kv-transfer-config', '{"kv_connector":"DynamoNixlConnector"}', '--remote-prefill', '--conditional-disagg', '--max-local-prefill-length', '10', '--max-prefill-queue-size', '2']   
2025-06-03T23:57:50.954Z  INFO serve_dynamo.dyn_worker: [Frontend:1] Created Frontend component   
2025-06-03T23:57:50.954Z  INFO serve_dynamo.dyn_worker: [Frontend:1] Starting Frontend instance with all registered endpoints   
2025-06-03T23:57:50.954Z  INFO serve_dynamo.dyn_worker: [Frontend:1] No Dynamo endpoints found in service Frontend but keeping service alive   
2025-06-03T23:57:50.954Z  INFO serve_dynamo.add_fastapi_routes: [Frontend:1] Registering regular endpoint /v1/chat/completions   
2025-06-03T23:57:50.955Z  INFO serve_dynamo.add_fastapi_routes: [Frontend:1] Added API route /v1/chat/completions to FastAPI app   
2025-06-03T23:57:50.955Z  INFO serve_dynamo.web_worker: [Frontend:1] Starting FastAPI server on 0.0.0.0:8000 with routes: ['/v1/chat/completions']   
2025-06-03T23:57:50.969Z  INFO serve_dynamo.dyn_worker: [VllmEncodeWorker:1] Registering component dynamo/VllmEncodeWorker   
2025-06-03T23:57:50.970Z  INFO serve_dynamo.dyn_worker: [VllmEncodeWorker:1] Created VllmEncodeWorker component   
2025-06-03T23:57:50.970Z  INFO config.as_args: [VllmEncodeWorker:1] Running VllmEncodeWorker with args=['--model', 'llava-hf/llava-1.5-7b-hf', '--tensor-parallel-size', '1', '--router', 'random']   
2025-06-03T23:57:50.971Z  INFO server._serve: [Frontend:1] Started server process [528212]   
2025-06-03T23:57:50.971Z  INFO on.startup: [Frontend:1] Waiting for application startup.   
2025-06-03T23:57:50.971Z  INFO on.startup: [Frontend:1] Application startup complete.   
2025-06-03T23:57:50.971Z  INFO server._log_started_message: [Frontend:1] Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)   
2025-06-03T23:57:51.006Z  INFO decode_worker.__init__: [VllmDecodeWorker:1] Prefill queue: nats://localhost:4222:vllm   
2025-06-03T23:57:51.006Z  INFO decode_worker.__init__: [VllmDecodeWorker:1] Chunked prefill is not supported yet, setting to False   
2025-06-03T23:57:51.006Z  INFO decode_worker.__init__: [VllmDecodeWorker:1] Preemption mode is not supported yet, setting to swap   
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 20.74it/s]
2025-06-03T23:57:52.192Z  INFO encode_worker.async_init: [VllmEncodeWorker:1] Startup started.   
Backend UCX was instantiated
Initialized NIXL agent: 2b529c7d-4d88-43a3-b093-170f62d69942
2025-06-03T23:57:52.403Z  INFO encode_worker.async_init: [VllmEncodeWorker:1] Startup completed.   
2025-06-03T23:57:52.403Z  INFO serve_dynamo.dyn_worker: [VllmEncodeWorker:1] Starting VllmEncodeWorker instance with all registered endpoints   
2025-06-03T23:57:58.034Z  INFO config._resolve_task: This model supports multiple tasks: {'generate', 'reward', 'score', 'embed', 'classify'}. Defaulting to 'generate'.   
2025-06-03T23:57:58.040Z  INFO api_server.build_async_engine_client_from_engine_args: Started engine process with PID 529513   
2025-06-03T23:57:58.104Z  INFO config._resolve_task: This model supports multiple tasks: {'reward', 'classify', 'embed', 'generate', 'score'}. Defaulting to 'generate'.   
2025-06-03T23:57:58.189Z  INFO config._resolve_task: This model supports multiple tasks: {'score', 'generate', 'reward', 'classify', 'embed'}. Defaulting to 'generate'.   
2025-06-03T23:57:58.193Z  INFO api_server.build_async_engine_client_from_engine_args: Started engine process with PID 529516   
2025-06-03T23:58:02.174Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:58:02.175Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:58:02.177Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:58:02.182Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:58:02.183Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:58:02.184Z  INFO __init__.resolve_current_platform_cls_qualname: Automatically detected platform cuda.   
2025-06-03T23:58:02.341Z  INFO nixl: NIXL is available   
2025-06-03T23:58:02.347Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:58:02.348Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:58:02.350Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:58:02.355Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:58:02.356Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-06-03T23:58:02.358Z  INFO __init__.resolve_current_platform_cls_qualname: Automatically detected platform cuda.   
2025-06-03T23:58:02.517Z  INFO nixl: NIXL is available   
2025-06-03T23:58:03.000Z  INFO llm_engine.__init__: Initializing a V0 LLM engine (v0.8.4) with config: model='llava-hf/llava-1.5-7b-hf', speculative_config=None, tokenizer='llava-hf/llava-1.5-7b-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=llava-hf/llava-1.5-7b-hf, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,    
2025-06-03T23:58:03.193Z  INFO llm_engine.__init__: Initializing a V0 LLM engine (v0.8.4) with config: model='llava-hf/llava-1.5-7b-hf', speculative_config=None, tokenizer='llava-hf/llava-1.5-7b-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=llava-hf/llava-1.5-7b-hf, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True,    
2025-06-03T23:58:04.282Z  INFO cuda.get_attn_backend_cls: Using Flash Attention backend.   
2025-06-03T23:58:04.462Z  INFO cuda.get_attn_backend_cls: Using Flash Attention backend.   
2025-06-03T23:58:04.690Z  INFO parallel_state.initialize_model_parallel: rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0   
2025-06-03T23:58:04.690Z  INFO model_runner.load_model: Starting to load model llava-hf/llava-1.5-7b-hf...   
2025-06-03T23:58:04.854Z  INFO parallel_state.initialize_model_parallel: rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0   
2025-06-03T23:58:04.854Z  INFO model_runner.load_model: Starting to load model llava-hf/llava-1.5-7b-hf...   
2025-06-03T23:58:05.006Z  INFO cuda.get_attn_backend_cls: Using Flash Attention backend.   
2025-06-03T23:58:05.025Z  INFO config.init_with_cudagraph_sizes: cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248]   
2025-06-03T23:58:05.154Z  INFO cuda.get_attn_backend_cls: Using Flash Attention backend.   
2025-06-03T23:58:05.172Z  INFO config.init_with_cudagraph_sizes: cudagraph sizes specified by model runner [] is overridden by config []   
2025-06-03T23:58:05.321Z  INFO weight_utils.download_weights_from_hf: Using model weights format ['*.safetensors']   
2025-06-03T23:58:05.460Z  INFO weight_utils.download_weights_from_hf: Using model weights format ['*.safetensors']   
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:00<00:01,  1.52it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:00<00:01,  1.61it/s]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:01<00:00,  1.37it/s]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:01<00:00,  1.37it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.41it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.42it/s]

2025-06-03T23:58:07.787Z  INFO loader.load_model: Loading weights took 2.26 seconds   
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.39it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.41it/s]

2025-06-03T23:58:07.980Z  INFO model_runner.load_model: Model loading took 13.1343 GiB and 3.100958 seconds   
2025-06-03T23:58:08.099Z  INFO loader.load_model: Loading weights took 2.31 seconds   
2025-06-03T23:58:08.291Z  INFO model_runner.load_model: Model loading took 13.1343 GiB and 3.248275 seconds   
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
2025-06-03T23:58:10.635Z  INFO worker.determine_num_available_blocks: Memory profiling takes 2.49 seconds
the current vLLM instance can use total_gpu_memory (47.39GiB) x gpu_memory_utilization (0.90) = 42.65GiB
model weights take 13.13GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 0.49GiB; the rest of the memory reserved for KV Cache is 28.97GiB.   
2025-06-03T23:58:10.801Z  INFO executor_base.initialize_cache: # cuda blocks: 926, # CPU blocks: 128   
2025-06-03T23:58:10.801Z  INFO executor_base.initialize_cache: Maximum concurrency for 4096 tokens per request: 14.47x   
2025-06-03T23:58:12.767Z  INFO worker.determine_num_available_blocks: Memory profiling takes 4.29 seconds
the current vLLM instance can use total_gpu_memory (47.41GiB) x gpu_memory_utilization (0.90) = 42.67GiB
model weights take 13.13GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 1.55GiB; the rest of the memory reserved for KV Cache is 27.93GiB.   
2025-06-03T23:58:12.971Z  INFO executor_base.initialize_cache: # cuda blocks: 893, # CPU blocks: 128   
2025-06-03T23:58:12.971Z  INFO executor_base.initialize_cache: Maximum concurrency for 4096 tokens per request: 13.95x   
2025-06-03T23:58:13.201Z  INFO model_runner.capture_model: Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.   
Capturing CUDA graph shapes:  11%|████████▌                                                                  | 4/35 [00:01<00:15,  2.07it/s]2025-06-03T23:58:15.375Z  INFO llm_engine._initialize_kv_caches: init engine (profile, create kv cache, warmup model) took 7.08 seconds   
2025-06-03T23:58:15.399Z  INFO nixl.__init__: Initializing NIXL wrapper   
Backend UCX was instantiated
Initialized NIXL agent: 251736b8-a84c-46b3-9a68-3a0a7d50657d
Capturing CUDA graph shapes:  14%|██████████▋                                                                | 5/35 [00:02<00:14,  2.07it/s]Backend UCX was instantiated
Initialized NIXL agent: d5d864a0-5fc3-4b32-a418-b1b78c010ed1
Workers ready: [7587887212314322840]
2025-06-03T23:58:16.035Z  INFO prefill_worker.async_init: [VllmPrefillWorker:1] Initialization complete.   
2025-06-03T23:58:16.035Z  INFO serve_dynamo.dyn_worker: [VllmPrefillWorker:1] Starting VllmPrefillWorker instance with all registered endpoints   
2025-06-03T23:58:16.035Z  INFO prefill_worker.prefill_queue_handler: [VllmPrefillWorker:1] Prefill queue handler entered   
2025-06-03T23:58:16.035Z  INFO prefill_worker.prefill_queue_handler: [VllmPrefillWorker:1] Prefill queue: nats://localhost:4222:vllm   
2025-06-03T23:58:16.038Z  INFO prefill_worker.prefill_queue_handler: [VllmPrefillWorker:1] prefill queue handler started   
2025-06-03T23:58:16.039Z  INFO prefill_worker.prefill_queue_handler: [VllmPrefillWorker:1] Dequeued prefill request: 56d80da5-d65b-4f79-b8b8-c8d723bf88e4   
2025-06-03T23:58:16.039Z  INFO prefill_worker.generate: [VllmPrefillWorker:1] Received prefill request { id: 56d80da5-d65b-4f79-b8b8-c8d723bf88e4, engine_id: cc9a3288-5a0a-402b-b1e7-38f5a9d3eba7 }.   
Capturing CUDA graph shapes:  17%|████████████▊                                                              | 6/35 [00:02<00:13,  2.10it/s]2025-06-03T23:58:16.265Z  INFO _client._send_single_request: [VllmEncodeWorker:1] HTTP Request: GET http://images.cocodataset.org/test2017/000000155781.jpg "HTTP/1.1 200 OK"   
Capturing CUDA graph shapes: 100%|██████████████████████████████████████████████████████████████████████████| 35/35 [00:15<00:00,  2.31it/s]
2025-06-03T23:58:28.344Z  INFO model_runner.capture_model: Graph capturing finished in 15 secs, took 0.24 GiB   
2025-06-03T23:58:28.344Z  INFO llm_engine._initialize_kv_caches: init engine (profile, create kv cache, warmup model) took 20.36 seconds   
2025-06-03T23:58:28.362Z  INFO nixl.__init__: Initializing NIXL wrapper   
Backend UCX was instantiated
Initialized NIXL agent: c7c0eb15-5af1-45e5-a5e8-1a582d85ec64
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.25it/s]
2025-06-03T23:58:32.030Z  INFO decode_worker.async_init: [VllmDecodeWorker:1] Initialization complete { configuration: Disaggregated }.   
2025-06-03T23:58:33.887Z  INFO serve_dynamo.dyn_worker: [VllmDecodeWorker:1] Starting VllmDecodeWorker instance with all registered endpoints   
2025-06-03T23:58:34.125Z  INFO logging.check_required_workers: [Processor:1] Waiting for more workers to be ready. Current: 1, Required: 1   
Workers ready: [7587887212314322836]
2025-06-03T23:58:34.129Z  INFO serve_dynamo.dyn_worker: [Processor:1] Starting Processor instance with all registered endpoints   
2025-06-03T23:59:35.386Z  INFO httptools_impl.send: [Frontend:1] 127.0.0.1:42238 - "POST /v1/chat/completions HTTP/1.1" 200   
2025-06-03T23:59:35.394Z  INFO chat_utils._log_chat_template_content_format: Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.   
2025-06-03T23:59:35.399Z  INFO decode_worker.generate: [VllmDecodeWorker:1] Received multimodal request { id: d60125a3-66c8-40fe-a25b-91e43f34f593 }.   
2025-06-03T23:59:35.402Z  INFO disagg_router.prefill_remote: [VllmDecodeWorker:1] Remote prefill: True (prefill length: 23/10, prefill queue size: 0/2)   
2025-06-03T23:59:35.403Z  INFO engine._handle_process_request: Added request d60125a3-66c8-40fe-a25b-91e43f34f593.   
2025-06-03T23:59:35.404Z  INFO metrics.log: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 90.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.1%, CPU KV cache usage: 0.0%.   
2025-06-03T23:59:40.406Z  INFO metrics.log: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.1%, CPU KV cache usage: 0.0%.
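For reference, the fail-fast behavior requested under Expected Behavior could look something like the sketch below. `check_gpu_requirements` is a hypothetical helper, not part of the dynamo codebase; the error wording mirrors the message suggested in this report, and the required count of 3 comes from the disagg EPD graph (one GPU each for Encode, Prefill, and Decode).

```python
def check_gpu_requirements(required: int, available: int) -> None:
    """Raise at startup if the host cannot satisfy the graph's GPU needs.

    Hypothetical sketch: for the disagg EPD graph, `required` would be 3
    (one GPU each for the Encode, Prefill, and Decode workers).
    """
    if available < required:
        raise RuntimeError(
            "number of available GPUs is insufficient for serving your graph: "
            f"{required} GPUs required, {available} available"
        )


# The available count could come from the CUDA runtime at the GPU discovery
# step (e.g. torch.cuda.device_count()); it is passed explicitly here so the
# check stays independent of any particular discovery mechanism.
check_gpu_requirements(required=3, available=3)  # passes silently
```

Running this check before worker startup would surface the misconfiguration immediately, instead of letting the service hang at inference time as in the log above.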

Screenshots

No response

Labels

bug (Something isn't working), dynamo-serve (Relates to dynamo-serve)
