[Bug]: ValueError: No available memory for the cache blocks on main branch after commit 46f98893 #14992

Closed
@engchina

Description

Your current environment

The output of `python collect_env.py`
INFO 03-18 10:28:31 [__init__.py:256] Automatically detected platform cuda.
Collecting environment information...
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.25.0
Libc version: glibc-2.35

Python version: 3.12.9 | packaged by Anaconda, Inc. | (main, Feb  6 2025, 18:56:27) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.6.85
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 4090
GPU 2: NVIDIA GeForce RTX 4090
GPU 3: NVIDIA GeForce RTX 4090

Nvidia driver version: 560.94
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.6.0
/usr/local/cuda-12.3/targets/x86_64-linux/lib/libcudnn.so.8.9.7
/usr/local/cuda-12.3/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.9.7
/usr/local/cuda-12.3/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.9.7
/usr/local/cuda-12.3/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.9.7
/usr/local/cuda-12.3/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.9.7
/usr/local/cuda-12.3/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.9.7
/usr/local/cuda-12.3/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        48 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               128
On-line CPU(s) list:                  0-127
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) Gold 6430
CPU family:                           6
Model:                                143
Thread(s) per core:                   2
Core(s) per socket:                   64
Socket(s):                            1
Stepping:                             8
BogoMIPS:                             4200.00
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 avx512vbmi umip waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b fsrm md_clear serialize tsxldtrk arch_lbr avx512_fp16 flush_l1d arch_capabilities
Virtualization:                       VT-x
Hypervisor vendor:                    Microsoft
Virtualization type:                  full
L1d cache:                            3 MiB (64 instances)
L1i cache:                            2 MiB (64 instances)
L2 cache:                             128 MiB (64 instances)
L3 cache:                             60 MiB (1 instance)
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.3.0
[pip3] torch==2.6.0
[pip3] torchaudio==2.6.0
[pip3] torchvision==0.21.0
[pip3] transformers==4.50.0.dev0
[pip3] triton==3.2.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-cublas-cu12        12.4.5.8                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.4.127                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.2.1.3                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.5.147               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.6.1.9                 pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.3.1.170               pypi_0    pypi
[conda] nvidia-cusparselt-cu12    0.6.2                    pypi_0    pypi
[conda] nvidia-nccl-cu12          2.21.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.4.127                 pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.4.127                 pypi_0    pypi
[conda] pyzmq                     26.3.0                   pypi_0    pypi
[conda] torch                     2.6.0                    pypi_0    pypi
[conda] torchaudio                2.6.0                    pypi_0    pypi
[conda] torchvision               0.21.0                   pypi_0    pypi
[conda] transformers              4.50.0.dev0              pypi_0    pypi
[conda] triton                    3.2.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.7.4.dev474+g3556a414
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     SYS     SYS                             N/A
GPU1    SYS      X      SYS     SYS                             N/A
GPU2    SYS     SYS      X      SYS                             N/A
GPU3    SYS     SYS     SYS      X                              N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NCCL_P2P_DISABLE=1
CUDA_DEVICE_ORDER=PCI_BUS_ID
CUDA_VISIBLE_DEVICES=3,1,0
CUDA_VISIBLE_DEVICES=3,1,0
LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64:
NCCL_IB_DISABLE=1
CUDA_HOME=/usr/local/cuda-12.6
CUDA_HOME=/usr/local/cuda-12.6
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

🐛 Describe the bug

How to reproduce:

Run `vllm serve`:

vllm serve /root/HuggingFaceCache/models--google--gemma-3-27b-it --trust-remote-code --served-model-name gpt-4o --gpu-memory-utilization 0.99 --tensor-parallel-size 4 --port 8000 --api-key sk-123456 --max-model-len 32768 --enable-chunked-prefill --limit-mm-per-prompt image=3

Error log:

INFO 03-18 10:21:18 [__init__.py:256] Automatically detected platform cuda.
INFO 03-18 10:21:20 [api_server.py:966] vLLM API server version 0.7.4.dev474+g3556a414
INFO 03-18 10:21:20 [api_server.py:967] args: Namespace(subparser='serve', model_tag='/root/HuggingFaceCache/models--google--gemma-3-27b-it', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='sk-123456', lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/root/HuggingFaceCache/models--google--gemma-3-27b-it', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=32768, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=4, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.99, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt={'image': 3}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=True, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['gpt-4o'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, 
kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f8e7a3d1bc0>)
INFO 03-18 10:21:20 [config.py:2521] For Gemma 2 and 3, we downcast float32 to bfloat16 instead of float16 by default. Please specify `dtype` if you want to use float16.
INFO 03-18 10:21:20 [config.py:2579] Downcasting torch.float32 to torch.bfloat16.
INFO 03-18 10:21:26 [config.py:583] This model supports multiple tasks: {'score', 'generate', 'classify', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 03-18 10:21:27 [config.py:1499] Defaulting to use mp for distributed inference
INFO 03-18 10:21:27 [config.py:1677] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 03-18 10:21:33 [__init__.py:256] Automatically detected platform cuda.
INFO 03-18 10:21:35 [core.py:53] Initializing a V1 LLM engine (v0.7.4.dev474+g3556a414) with config: model='/root/HuggingFaceCache/models--google--gemma-3-27b-it', speculative_config=None, tokenizer='/root/HuggingFaceCache/models--google--gemma-3-27b-it', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=gpt-4o, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 03-18 10:21:35 [multiproc_worker_utils.py:310] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 03-18 10:21:35 [custom_cache_manager.py:19] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 03-18 10:21:35 [shm_broadcast.py:258] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 10485760, 10, 'psm_bea7a810'), local_subscribe_addr='ipc:///tmp/8470ae2e-a8ed-4ce6-8d2f-c9bab3efae29', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 03-18 10:21:38 [__init__.py:256] Automatically detected platform cuda.
WARNING 03-18 10:21:41 [utils.py:2282] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7febc2321280>
(VllmWorker rank=0 pid=903349) INFO 03-18 10:21:41 [shm_broadcast.py:258] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_bc5ca0ca'), local_subscribe_addr='ipc:///tmp/35fd9a76-e96b-4512-87d3-f43e50b0a70e', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 03-18 10:21:45 [__init__.py:256] Automatically detected platform cuda.
WARNING 03-18 10:21:47 [utils.py:2282] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fcda94fb860>
(VllmWorker rank=1 pid=903752) INFO 03-18 10:21:47 [shm_broadcast.py:258] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_88bac3ba'), local_subscribe_addr='ipc:///tmp/34354183-bd2a-4958-afa5-d4b985115f84', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 03-18 10:21:50 [__init__.py:256] Automatically detected platform cuda.
WARNING 03-18 10:21:53 [utils.py:2282] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f61327717c0>
(VllmWorker rank=2 pid=904143) INFO 03-18 10:21:53 [shm_broadcast.py:258] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_5299412c'), local_subscribe_addr='ipc:///tmp/3baeb637-c2de-491d-9c4d-a9cd8fe7ed09', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 03-18 10:21:56 [__init__.py:256] Automatically detected platform cuda.
WARNING 03-18 10:21:59 [utils.py:2282] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f964d721cd0>
(VllmWorker rank=3 pid=904497) INFO 03-18 10:21:59 [shm_broadcast.py:258] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_1ef6a97a'), local_subscribe_addr='ipc:///tmp/8a556fb4-4813-4dd8-8777-f3311fc2aa69', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=903752) INFO 03-18 10:22:00 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=903752) INFO 03-18 10:22:00 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=0 pid=903349) INFO 03-18 10:22:00 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=903349) INFO 03-18 10:22:00 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=3 pid=904497) INFO 03-18 10:22:00 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorker rank=3 pid=904497) INFO 03-18 10:22:00 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=2 pid=904143) INFO 03-18 10:22:00 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorker rank=2 pid=904143) INFO 03-18 10:22:00 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=3 pid=904497) WARNING 03-18 10:22:01 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=2 pid=904143) WARNING 03-18 10:22:01 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=0 pid=903349) WARNING 03-18 10:22:01 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=1 pid=903752) WARNING 03-18 10:22:01 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=0 pid=903349) INFO 03-18 10:22:01 [shm_broadcast.py:258] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_4c674e73'), local_subscribe_addr='ipc:///tmp/329e55a3-aff6-4711-a5bd-37cb00742a7d', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=903349) INFO 03-18 10:22:01 [parallel_state.py:948] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorker rank=0 pid=903349) WARNING 03-18 10:22:01 [interface.py:305] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(VllmWorker rank=0 pid=903349) INFO 03-18 10:22:01 [cuda.py:215] Using Flash Attention backend on V1 engine.
(VllmWorker rank=2 pid=904143) INFO 03-18 10:22:01 [parallel_state.py:948] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2
(VllmWorker rank=1 pid=903752) INFO 03-18 10:22:01 [parallel_state.py:948] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1
(VllmWorker rank=2 pid=904143) WARNING 03-18 10:22:01 [interface.py:305] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(VllmWorker rank=1 pid=903752) WARNING 03-18 10:22:01 [interface.py:305] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(VllmWorker rank=2 pid=904143) INFO 03-18 10:22:01 [cuda.py:215] Using Flash Attention backend on V1 engine.
(VllmWorker rank=3 pid=904497) INFO 03-18 10:22:01 [parallel_state.py:948] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3
(VllmWorker rank=1 pid=903752) INFO 03-18 10:22:01 [cuda.py:215] Using Flash Attention backend on V1 engine.
(VllmWorker rank=3 pid=904497) WARNING 03-18 10:22:01 [interface.py:305] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(VllmWorker rank=3 pid=904497) INFO 03-18 10:22:01 [cuda.py:215] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=903349) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
(VllmWorker rank=3 pid=904497) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
(VllmWorker rank=2 pid=904143) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
(VllmWorker rank=1 pid=903752) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
(VllmWorker rank=2 pid=904143) INFO 03-18 10:22:07 [gpu_model_runner.py:1112] Starting to load model /root/HuggingFaceCache/models--google--gemma-3-27b-it...
(VllmWorker rank=0 pid=903349) INFO 03-18 10:22:07 [gpu_model_runner.py:1112] Starting to load model /root/HuggingFaceCache/models--google--gemma-3-27b-it...
(VllmWorker rank=3 pid=904497) INFO 03-18 10:22:07 [gpu_model_runner.py:1112] Starting to load model /root/HuggingFaceCache/models--google--gemma-3-27b-it...
(VllmWorker rank=2 pid=904143) INFO 03-18 10:22:07 [config.py:3206] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=0 pid=903349) INFO 03-18 10:22:07 [config.py:3206] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=3 pid=904497) INFO 03-18 10:22:07 [config.py:3206] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=1 pid=903752) INFO 03-18 10:22:07 [gpu_model_runner.py:1112] Starting to load model /root/HuggingFaceCache/models--google--gemma-3-27b-it...
(VllmWorker rank=1 pid=903752) INFO 03-18 10:22:08 [config.py:3206] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=2 pid=904143) WARNING 03-18 10:22:08 [topk_topp_sampler.py:63] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=3 pid=904497) WARNING 03-18 10:22:08 [topk_topp_sampler.py:63] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=0 pid=903349) WARNING 03-18 10:22:08 [topk_topp_sampler.py:63] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
Loading safetensors checkpoint shards:   0% Completed | 0/12 [00:00<?, ?it/s]
(VllmWorker rank=1 pid=903752) WARNING 03-18 10:22:09 [topk_topp_sampler.py:63] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
Loading safetensors checkpoint shards:   8% Completed | 1/12 [00:03<00:34,  3.18s/it]
Loading safetensors checkpoint shards:  17% Completed | 2/12 [00:06<00:32,  3.27s/it]
Loading safetensors checkpoint shards:  25% Completed | 3/12 [00:09<00:29,  3.33s/it]
Loading safetensors checkpoint shards:  33% Completed | 4/12 [00:13<00:26,  3.32s/it]
Loading safetensors checkpoint shards:  42% Completed | 5/12 [00:16<00:23,  3.33s/it]
Loading safetensors checkpoint shards:  50% Completed | 6/12 [00:19<00:18,  3.07s/it]
Loading safetensors checkpoint shards:  58% Completed | 7/12 [00:19<00:11,  2.21s/it]
Loading safetensors checkpoint shards:  67% Completed | 8/12 [00:22<00:10,  2.55s/it]
Loading safetensors checkpoint shards:  75% Completed | 9/12 [00:26<00:08,  2.81s/it]
Loading safetensors checkpoint shards:  83% Completed | 10/12 [00:29<00:06,  3.03s/it]
Loading safetensors checkpoint shards:  92% Completed | 11/12 [00:33<00:03,  3.12s/it]
Loading safetensors checkpoint shards: 100% Completed | 12/12 [00:36<00:00,  3.24s/it]
Loading safetensors checkpoint shards: 100% Completed | 12/12 [00:36<00:00,  3.05s/it]
(VllmWorker rank=0 pid=903349)
(VllmWorker rank=2 pid=904143) INFO 03-18 10:22:45 [loader.py:429] Loading weights took 36.77 seconds
(VllmWorker rank=0 pid=903349) INFO 03-18 10:22:45 [loader.py:429] Loading weights took 36.76 seconds
(VllmWorker rank=1 pid=903752) INFO 03-18 10:22:45 [loader.py:429] Loading weights took 36.73 seconds
(VllmWorker rank=3 pid=904497) INFO 03-18 10:22:45 [loader.py:429] Loading weights took 36.91 seconds
(VllmWorker rank=2 pid=904143) INFO 03-18 10:22:45 [gpu_model_runner.py:1124] Model loading took 13.1666 GB and 37.816293 seconds
(VllmWorker rank=3 pid=904497) INFO 03-18 10:22:46 [gpu_model_runner.py:1124] Model loading took 13.1666 GB and 37.926920 seconds
(VllmWorker rank=0 pid=903349) INFO 03-18 10:22:46 [gpu_model_runner.py:1124] Model loading took 13.1666 GB and 37.939645 seconds
(VllmWorker rank=1 pid=903752) INFO 03-18 10:22:46 [gpu_model_runner.py:1124] Model loading took 13.1666 GB and 37.823492 seconds
(VllmWorker rank=1 pid=903752) INFO 03-18 10:22:46 [gpu_model_runner.py:1342] Encoder cache will be initialized with a budget of 2048 tokens, and profiled with 8 image items of the maximum feature size.
(VllmWorker rank=3 pid=904497) INFO 03-18 10:22:46 [gpu_model_runner.py:1342] Encoder cache will be initialized with a budget of 2048 tokens, and profiled with 8 image items of the maximum feature size.
(VllmWorker rank=2 pid=904143) INFO 03-18 10:22:46 [gpu_model_runner.py:1342] Encoder cache will be initialized with a budget of 2048 tokens, and profiled with 8 image items of the maximum feature size.
(VllmWorker rank=0 pid=903349) INFO 03-18 10:22:46 [gpu_model_runner.py:1342] Encoder cache will be initialized with a budget of 2048 tokens, and profiled with 8 image items of the maximum feature size.
(VllmWorker rank=0 pid=903349) INFO 03-18 10:23:14 [backends.py:409] Using cache directory: /root/.cache/vllm/torch_compile_cache/e114398272/rank_0_0 for vLLM's torch.compile
(VllmWorker rank=0 pid=903349) INFO 03-18 10:23:14 [backends.py:419] Dynamo bytecode transform time: 19.84 s
(VllmWorker rank=1 pid=903752) INFO 03-18 10:23:14 [backends.py:409] Using cache directory: /root/.cache/vllm/torch_compile_cache/e114398272/rank_1_0 for vLLM's torch.compile
(VllmWorker rank=1 pid=903752) INFO 03-18 10:23:14 [backends.py:419] Dynamo bytecode transform time: 19.89 s
(VllmWorker rank=3 pid=904497) INFO 03-18 10:23:14 [backends.py:409] Using cache directory: /root/.cache/vllm/torch_compile_cache/e114398272/rank_3_0 for vLLM's torch.compile
(VllmWorker rank=3 pid=904497) INFO 03-18 10:23:14 [backends.py:419] Dynamo bytecode transform time: 19.94 s
(VllmWorker rank=2 pid=904143) INFO 03-18 10:23:14 [backends.py:409] Using cache directory: /root/.cache/vllm/torch_compile_cache/e114398272/rank_2_0 for vLLM's torch.compile
(VllmWorker rank=2 pid=904143) INFO 03-18 10:23:14 [backends.py:419] Dynamo bytecode transform time: 20.09 s
(VllmWorker rank=0 pid=903349) INFO 03-18 10:23:20 [backends.py:132] Cache the graph of shape None for later use
(VllmWorker rank=1 pid=903752) INFO 03-18 10:23:20 [backends.py:132] Cache the graph of shape None for later use
(VllmWorker rank=3 pid=904497) INFO 03-18 10:23:20 [backends.py:132] Cache the graph of shape None for later use
(VllmWorker rank=2 pid=904143) INFO 03-18 10:23:20 [backends.py:132] Cache the graph of shape None for later use
(VllmWorker rank=3 pid=904497) INFO 03-18 10:24:38 [backends.py:144] Compiling a graph for general shape takes 83.08 s
(VllmWorker rank=0 pid=903349) INFO 03-18 10:24:38 [backends.py:144] Compiling a graph for general shape takes 83.30 s
(VllmWorker rank=1 pid=903752) INFO 03-18 10:24:39 [backends.py:144] Compiling a graph for general shape takes 83.66 s
(VllmWorker rank=2 pid=904143) INFO 03-18 10:24:39 [backends.py:144] Compiling a graph for general shape takes 83.82 s
(VllmWorker rank=3 pid=904497) INFO 03-18 10:25:35 [monitor.py:33] torch.compile takes 103.02 s in total
(VllmWorker rank=2 pid=904143) INFO 03-18 10:25:35 [monitor.py:33] torch.compile takes 103.91 s in total
(VllmWorker rank=1 pid=903752) INFO 03-18 10:25:35 [monitor.py:33] torch.compile takes 103.55 s in total
(VllmWorker rank=0 pid=903349) INFO 03-18 10:25:35 [monitor.py:33] torch.compile takes 103.14 s in total
ERROR 03-18 10:25:43 [core.py:337] EngineCore hit an exception: Traceback (most recent call last):
ERROR 03-18 10:25:43 [core.py:337]   File "/root/myvllm/vllm_main_oom/vllm/v1/engine/core.py", line 329, in run_engine_core
ERROR 03-18 10:25:43 [core.py:337]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 03-18 10:25:43 [core.py:337]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-18 10:25:43 [core.py:337]   File "/root/myvllm/vllm_main_oom/vllm/v1/engine/core.py", line 284, in __init__
ERROR 03-18 10:25:43 [core.py:337]     super().__init__(vllm_config, executor_class, log_stats)
ERROR 03-18 10:25:43 [core.py:337]   File "/root/myvllm/vllm_main_oom/vllm/v1/engine/core.py", line 62, in __init__
ERROR 03-18 10:25:43 [core.py:337]     num_gpu_blocks, num_cpu_blocks = self._initialize_kv_caches(
ERROR 03-18 10:25:43 [core.py:337]                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-18 10:25:43 [core.py:337]   File "/root/myvllm/vllm_main_oom/vllm/v1/engine/core.py", line 124, in _initialize_kv_caches
ERROR 03-18 10:25:43 [core.py:337]     kv_cache_configs = get_kv_cache_configs(vllm_config, kv_cache_specs,
ERROR 03-18 10:25:43 [core.py:337]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-18 10:25:43 [core.py:337]   File "/root/myvllm/vllm_main_oom/vllm/v1/core/kv_cache_utils.py", line 576, in get_kv_cache_configs
ERROR 03-18 10:25:43 [core.py:337]     check_enough_kv_cache_memory(vllm_config, kv_cache_spec,
ERROR 03-18 10:25:43 [core.py:337]   File "/root/myvllm/vllm_main_oom/vllm/v1/core/kv_cache_utils.py", line 468, in check_enough_kv_cache_memory
ERROR 03-18 10:25:43 [core.py:337]     raise ValueError("No available memory for the cache blocks. "
ERROR 03-18 10:25:43 [core.py:337] ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
ERROR 03-18 10:25:43 [core.py:337]
CRITICAL 03-18 10:25:43 [core_client.py:260] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
./gemma_3_27b.sh: line 12: 901470 Killed                  vllm serve /root/HuggingFaceCache/models--google--gemma-3-27b-it --trust-remote-code --served-model-name gpt-4o --gpu-memory-utilization 0.99 --tensor-parallel-size 4 --port 8000 --api-key sk-123456 --max-model-len 32768 --enable-chunked-prefill --limit-mm-per-prompt image=3
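
For reference, watching per-GPU memory while the engine initializes would show how much memory is already taken on each card before the KV-cache check runs; a simple monitoring loop such as the following could be used (just a sketch, output not captured here):

# poll GPU memory usage once per second during startup
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1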

I used `git log` to check out commit 46f98893 ([V1] Fix model parameterization for structured output tests, #14833), and that revision works fine.
The error was introduced somewhere after 46f98893, but I am not sure by which commit, and testing every commit by hand is time consuming; a `git bisect` sketch over this range is included after the commit log below.
My guess is that the upgrade from torch==2.5.1 to torch==2.6.0 is the problem,
but I cannot install commit 14f301b to check, because there is no wheel for that commit.

# git log --oneline
5eeabc2a (HEAD -> main, origin/main, origin/HEAD) [Bugfix] Fix bnb quantization for models with both HF-format and Mistral-format weights (#14950)
18551e82 [V1] TPU - Fix CI/CD runner (#14974)
e41e1602 [V1] Guard Against Main Thread Usage (#14972)
b89fb2a4 [CI/Build] Use `AutoModelForImageTextToText` to load VLMs in tests (#14945)
5340b0e2 [Bugfix] Fix interface for Olmo2 on V1 (#14976)
37e38061 (tag: v0.8.0rc2) [Bugfix] Make Gemma3 MM V0 only for now (#14971)
c0efdd65 [Fix][Structured Output] using vocab_size to construct matcher (#14868)
aaaec52a [Bugfix][Model] Mixtral: use unused head_dim config argument (#14961)
e1eb45d3 [Bugfix] Fix precommit - line too long in pixtral.py (#14960)
89fca671 [V1] Default MLA to V1 (#14921)
d20b0c13 Add patch merger (#14957)
166a168b [Doc] Fix misleading log during multi-modal profiling (#14955)
2bb0e1a7 [Bugfix][ROCm] running new process using spawn method for rocm in tests. (#14810)
6eaf1e5c [Misc] Add `--seed` option to offline multi-modal examples (#14934)
868a8c5b [Bugfix] Fix Ultravox on V1 (#14929)
b4ad56c1 [V1][TPU] Apply the ragged paged attention kernel fix and remove the padding. (#14846)
69698f25 fix minor miscalled method (#14327)
cd0cd851 [MISC] More AMD unused var clean up (#14926)
0a74bfce setup.py: drop assumption about local `main` branch (#14692)
dd3b8658 [Doc] Add vLLM Beijing meetup slide (#14938)
9b87a579 [Misc][XPU] Use None as device capacity for XPU (#14932)
b539222d [V1] Remove input cache client (#14864)
8d6cf895 (tag: v0.8.0rc1) [V1] [Spec Decode] Support random sampling for spec decode (#13933)
583a9778 [Benchmark] Do not save detailed info to json by default (#14879)
a73e183e [Misc] Replace os environ to monkeypatch in test suite (#14516)
1e799b7e [BugFix] Fix MLA + V1 + TP==1 causing reinitialization of cuda context (#14910)
7f6c5ee0 [V1][Minor] Add __repr__ to ConstantList (#14907)
faa02757 [V1] Optimize the overhead of rewinding (#14905)
8a5a9b70 [CI/Build] Update defaults for test reproducibility (#14893)
bb3aeddf [CI] Nightly Tests (#14898)
aecc780d [V1] Enable Entrypoints Tests (#14903)
90df7f23 [Doc] Add guidance for using `ccache` with `pip install -e .` in doc (#14901)
b9b5bdfc [Misc] Catching Ray Compiled Graph PP test failures for V1 (#14847)
31060b27 [V1][BugFix] Detect interleaved sliding window attention (#14896)
fc1f6771 [BugFix][V1] Fix overhead related to bad_words sampling when not in use (#14894)
f6137adb Revert "[Bugfix] Limit profiling run sequence length by max_model_len (#14785) (#14892)
e53b1350 [Bugfix] Explicitly disable Phi-4-multimodal in V1 (#14889)
d30aa7e9 [Bugfix] Limit profiling run sequence length by max_model_len (#14785)
d1ad2a57 [V1] [Spec Decode] Fix ngram tests (#14878)
b82662d9 [BugFix] Fix torch distributed stateless PG backend init (#14870)
71c1e071 [Kernel] Add more tuned configs (#14877)
b30c75dd [V1] Remove V0 fallback for mistral-tokenizer (#14873)
def232e1 [VLM] Clean up Phi-4-MM ViT implementation (#14812)
3453b964 [Misc][Doc] Minor benchmark README update (#14874)
61c6a5a7 [VLM] Merged multi-modal processor for Pixtral (#12211)
74bc397b [Core] Expose API endpoint `/is_sleeping` (#14312)
f58aea00 [CI][Intel GPU] refine intel GPU ci docker build (#14860)
3556a414 [VLM] Limit multimodal input cache by memory (#14805)
9ed6ee92 [Bugfix] EAGLE output norm bug (#14464)
ee3778d5 [Build/CI] Upgrade jinja2 to get 3 moderate CVE fixes (#14839)
aaacf173 [Doc] V1 user guide (#13991)
4c7629ca [V1][Structured Output] calculate vocab_size eagerly (#14851)
e0fdfa16 [CI/Build] Delete LoRA bias test (#14849)
5952d8ab [Attention] Get rid of mla cache alignment (#14842)
a2ae4965 [CPU] Support FP8 KV cache (#14741)
877e3522 [Docs] Add new East Coast vLLM Meetup slides to README and meetups.md (#14852)
d4d93db2 [V1] V1 Enablement Oracle  (#13726)
8c0d15d5 [Misc][Easy] Annotate unused vars in the csrc files (#14798)
97ac781c [Misc] Remove misleading message in gemma2 and gemma3 (#14850)
776dcec8 Disable outlines cache by default (#14837)
ccf02fcb Revert "[Model] Mamba2 Prefill Performance Tweaks: Fixing Flurry of U… (#14848)
acaea3bb [Bugfix][V1] Fix flashinfer sampling (#14815)
9f374227 [Neuron][CI] update docker run command (#14829)
dd344e03 [Bugfix] Fix torch_xla in V0 which can't handle None seed introduced … (#14844)
54a88044 [Doc] More neutral K8s deployment guide (#14084)
bbd94a19 [Build/CI] Upgrade aiohttp to incldue CVE fix (#14840)
233ffce1 [Build/CI] Move ninja to common deps (#14835)
40677783 [CI] Add TPU v1 test (#14834)
14f301b5 Update to torch==2.6.0 (#12721)
46f98893 [V1] Fix model parameterization for structured output tests (#14833)
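
To narrow this down without testing every commit manually, a `git bisect` between the known-good commit 46f98893 and current main should only need a handful of steps (a sketch, assuming a build from source at each step since there are no wheels for intermediate commits):

# mark the range: current main is bad, 46f98893 is known good
git bisect start
git bisect bad 5eeabc2a
git bisect good 46f98893
# at each step: rebuild from source (e.g. pip install -e .), rerun the
# vllm serve command above, then mark the result until git prints the
# first bad commit
git bisect good    # or: git bisect bad
git bisect reset   # when done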

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
