[Bug]: Unable to deploy NVFP4 quantized model #19853

Open
Description

@leo-bujdei-leonte

Your current environment

The output of python collect_env.py
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version                : Could not collect
CMake version                : version 3.31.0
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.7.0+cu126
Is debug build               : False
CUDA used to build PyTorch   : 12.6
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.11.10 (main, Dec  4 2024, 11:59:58) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-6.8.0-1018-aws-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.4.131
CUDA_MODULE_LOADING set to   : LAZY
GPU models and configuration : GPU 0: NVIDIA A10G
Nvidia driver version        : 550.127.05
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        48 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               8
On-line CPU(s) list:                  0-7
Vendor ID:                            AuthenticAMD
Model name:                           AMD EPYC 7R32
CPU family:                           23
Model:                                49
Thread(s) per core:                   2
Core(s) per socket:                   4
Socket(s):                            1
Stepping:                             0
BogoMIPS:                             5599.99
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            128 KiB (4 instances)
L1i cache:                            128 KiB (4 instances)
L2 cache:                             2 MiB (4 instances)
L3 cache:                             16 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-7
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow:   Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-cufile-cu12==1.11.1.6
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] pyzmq==27.0.0
[pip3] torch==2.7.0
[pip3] torchaudio==2.7.0
[pip3] torchvision==0.22.0
[pip3] transformers==4.52.4
[pip3] triton==3.3.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
Neuron SDK Version           : N/A
vLLM Version                 : 0.9.1
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-7     0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
LD_LIBRARY_PATH=/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/cuda-12.4/lib:/usr/local/cuda-12.4/lib64:/usr/local/cuda-12.4:/usr/local/cuda-12.4/targets/x86_64-linux/lib/:/usr/local/cuda-12.4/extras/CUPTI/lib64:/usr/local/lib:/usr/lib:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/cuda-12.4/lib:/usr/local/cuda-12.4/lib64:/usr/local/cuda-12.4:/usr/local/cuda-12.4/targets/x86_64-linux/lib/:/usr/local/cuda-12.4/extras/CUPTI/lib64:/usr/local/lib:/usr/lib:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/cuda-12.4/lib:/usr/local/cuda-12.4/lib64:/usr/local/cuda-12.4:/usr/local/cuda-12.4/targets/x86_64-linux/lib/:/usr/local/cuda-12.4/extras/CUPTI/lib64:/usr/local/lib:/usr/lib
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

🐛 Describe the bug

I am trying to deploy an NVFP4-quantized model, as newly supported in 0.9.1. I copied the minimal setup from #18312:

from vllm import LLM, SamplingParams

prompts = [
    "The Swiss Alps are",
    "Brad Marchand is",
    "The Toronto Maple Leafs are",
]

# Create a sampling params object
sampling_params = SamplingParams(temperature=0.80, top_p=0.95, max_tokens=40, min_tokens=10)
llm = LLM("nm-testing/TinyLlama-1.1B-Chat-v1.0-NVFP4A4")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

This fails with the following output:

output & stack trace
INFO 06-19 09:53:33 [__init__.py:244] Automatically detected platform cuda.
INFO 06-19 09:53:47 [config.py:823] This model supports multiple tasks: {'score', 'generate', 'embed', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 06-19 09:53:48 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 06-19 09:53:49 [core.py:455] Waiting for init message from front-end.
INFO 06-19 09:53:49 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='nm-testing/TinyLlama-1.1B-Chat-v1.0-NVFP4A4', speculative_config=None, tokenizer='nm-testing/TinyLlama-1.1B-Chat-v1.0-NVFP4A4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=nm-testing/TinyLlama-1.1B-Chat-v1.0-NVFP4A4, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 06-19 09:53:49 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7cd939b4de10>
INFO 06-19 09:53:50 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 06-19 09:53:50 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 06-19 09:53:50 [gpu_model_runner.py:1595] Starting to load model nm-testing/TinyLlama-1.1B-Chat-v1.0-NVFP4A4...
INFO 06-19 09:53:50 [gpu_model_runner.py:1600] Loading model from scratch...
ERROR 06-19 09:53:50 [core.py:515] EngineCore failed to start.
ERROR 06-19 09:53:50 [core.py:515] Traceback (most recent call last):
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 506, in run_engine_core
ERROR 06-19 09:53:50 [core.py:515]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 06-19 09:53:50 [core.py:515]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 390, in __init__
ERROR 06-19 09:53:50 [core.py:515]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 76, in __init__
ERROR 06-19 09:53:50 [core.py:515]     self.model_executor = executor_class(vllm_config)
ERROR 06-19 09:53:50 [core.py:515]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 53, in __init__
ERROR 06-19 09:53:50 [core.py:515]     self._init_executor()
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 48, in _init_executor
ERROR 06-19 09:53:50 [core.py:515]     self.collective_rpc("load_model")
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 06-19 09:53:50 [core.py:515]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 06-19 09:53:50 [core.py:515]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/utils.py", line 2671, in run_method
ERROR 06-19 09:53:50 [core.py:515]     return func(*args, **kwargs)
ERROR 06-19 09:53:50 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 180, in load_model
ERROR 06-19 09:53:50 [core.py:515]     self.model_runner.load_model()
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1601, in load_model
ERROR 06-19 09:53:50 [core.py:515]     self.model = model_loader.load_model(
ERROR 06-19 09:53:50 [core.py:515]                  ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/model_loader/base_loader.py", line 38, in load_model
ERROR 06-19 09:53:50 [core.py:515]     model = initialize_model(vllm_config=vllm_config,
ERROR 06-19 09:53:50 [core.py:515]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/model_loader/utils.py", line 62, in initialize_model
ERROR 06-19 09:53:50 [core.py:515]     return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 06-19 09:53:50 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 521, in __init__
ERROR 06-19 09:53:50 [core.py:515]     self.model = self._init_model(vllm_config=vllm_config,
ERROR 06-19 09:53:50 [core.py:515]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 567, in _init_model
ERROR 06-19 09:53:50 [core.py:515]     return LlamaModel(vllm_config=vllm_config,
ERROR 06-19 09:53:50 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 152, in __init__
ERROR 06-19 09:53:50 [core.py:515]     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 346, in __init__
ERROR 06-19 09:53:50 [core.py:515]     self.start_layer, self.end_layer, self.layers = make_layers(
ERROR 06-19 09:53:50 [core.py:515]                                                     ^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 626, in make_layers
ERROR 06-19 09:53:50 [core.py:515]     [PPMissingLayer() for _ in range(start_layer)] + [
ERROR 06-19 09:53:50 [core.py:515]                                                      ^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 627, in <listcomp>
ERROR 06-19 09:53:50 [core.py:515]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
ERROR 06-19 09:53:50 [core.py:515]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 348, in <lambda>
ERROR 06-19 09:53:50 [core.py:515]     lambda prefix: layer_type(config=config,
ERROR 06-19 09:53:50 [core.py:515]                    ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 263, in __init__
ERROR 06-19 09:53:50 [core.py:515]     self.self_attn = LlamaAttention(
ERROR 06-19 09:53:50 [core.py:515]                      ^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 148, in __init__
ERROR 06-19 09:53:50 [core.py:515]     self.qkv_proj = QKVParallelLinear(
ERROR 06-19 09:53:50 [core.py:515]                     ^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 851, in __init__
ERROR 06-19 09:53:50 [core.py:515]     super().__init__(input_size=input_size,
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 397, in __init__
ERROR 06-19 09:53:50 [core.py:515]     super().__init__(input_size,
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 243, in __init__
ERROR 06-19 09:53:50 [core.py:515]     self.quant_method = quant_config.get_quant_method(self,
ERROR 06-19 09:53:50 [core.py:515]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 94, in get_quant_method
ERROR 06-19 09:53:50 [core.py:515]     scheme = self.get_scheme(layer=layer, layer_name=prefix)
ERROR 06-19 09:53:50 [core.py:515]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 489, in get_scheme
ERROR 06-19 09:53:50 [core.py:515]     scheme = self._get_scheme_from_parts(  # type: ignore
ERROR 06-19 09:53:50 [core.py:515]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 414, in _get_scheme_from_parts
ERROR 06-19 09:53:50 [core.py:515]     raise NotImplementedError(
ERROR 06-19 09:53:50 [core.py:515] NotImplementedError: No compressed-tensors compatible scheme was found.
Process EngineCore_0:
Traceback (most recent call last):
  File "/home/ubuntu/.pyenv/versions/3.11.10/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/ubuntu/.pyenv/versions/3.11.10/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 519, in run_engine_core
    raise e
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 506, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 390, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 76, in __init__
    self.model_executor = executor_class(vllm_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 53, in __init__
    self._init_executor()
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 48, in _init_executor
    self.collective_rpc("load_model")
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/utils.py", line 2671, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 180, in load_model
    self.model_runner.load_model()
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1601, in load_model
    self.model = model_loader.load_model(
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/model_loader/base_loader.py", line 38, in load_model
    model = initialize_model(vllm_config=vllm_config,
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/model_loader/utils.py", line 62, in initialize_model
    return model_class(vllm_config=vllm_config, prefix=prefix)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 521, in __init__
    self.model = self._init_model(vllm_config=vllm_config,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 567, in _init_model
    return LlamaModel(vllm_config=vllm_config,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 152, in __init__
    old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 346, in __init__
    self.start_layer, self.end_layer, self.layers = make_layers(
                                                    ^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 626, in make_layers
    [PPMissingLayer() for _ in range(start_layer)] + [
                                                     ^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 627, in <listcomp>
    maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 348, in <lambda>
    lambda prefix: layer_type(config=config,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 263, in __init__
    self.self_attn = LlamaAttention(
                     ^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 148, in __init__
    self.qkv_proj = QKVParallelLinear(
                    ^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 851, in __init__
    super().__init__(input_size=input_size,
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 397, in __init__
    super().__init__(input_size,
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 243, in __init__
    self.quant_method = quant_config.get_quant_method(self,
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 94, in get_quant_method
    scheme = self.get_scheme(layer=layer, layer_name=prefix)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 489, in get_scheme
    scheme = self._get_scheme_from_parts(  # type: ignore
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 414, in _get_scheme_from_parts
    raise NotImplementedError(
NotImplementedError: No compressed-tensors compatible scheme was found.
<output truncated due to github char limit>
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

It looks like the quantization config differs slightly from what vLLM expects. In vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py, CompressedTensorsConfig._get_scheme_from_parts runs the quant-type check _is_fp4a4_nvfp4, which fails because of the strategy in the quantization spec:

weight_quant = QuantizationArgs(
    num_bits=4,
    type='float',
    symmetric=True,
    group_size=16,
    strategy='group', # <--- should be 'tensor_group'
    block_structure=None,
    dynamic=False,
    actorder=None,
    observer='minmax',
    observer_kwargs={},
)
input_quant = QuantizationArgs(
    num_bits=4,
    type='float',
    symmetric=True,
    group_size=16,
    strategy='group', # <--- should be 'tensor_group'
    block_structure=None,
    dynamic=False,
    actorder=None,
    observer='minmax',
    observer_kwargs={},
)
quant_format = 'nvfp4-pack-quantized'
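
As a possible workaround (an untested sketch; it assumes the checkpoint's config.json follows the usual compressed-tensors layout, where each config_groups entry carries weights and input_activations dicts with a strategy field), the strategy could be patched to 'tensor_group' in a local copy of the checkpoint before loading:

import json
from pathlib import Path

from huggingface_hub import snapshot_download

# Download the checkpoint into an editable local directory.
local_dir = snapshot_download(
    "nm-testing/TinyLlama-1.1B-Chat-v1.0-NVFP4A4", local_dir="./tinyllama-nvfp4"
)
cfg_path = Path(local_dir) / "config.json"
cfg = json.loads(cfg_path.read_text())

# Assumption: compressed-tensors config format, with per-group
# 'weights'/'input_activations' quantization args.
for group in cfg["quantization_config"]["config_groups"].values():
    for key in ("weights", "input_activations"):
        if isinstance(group.get(key), dict) and group[key].get("strategy") == "group":
            group[key]["strategy"] = "tensor_group"

cfg_path.write_text(json.dumps(cfg, indent=2))
# Then load the patched copy instead: LLM("./tinyllama-nvfp4")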

I also tested a Qwen3 8B model NVFP4-quantized with llm-compressor. Its quant spec seems correctly configured:

weight_quant = QuantizationArgs(
    num_bits=4,
    type='float',
    symmetric=True,
    group_size=16,
    strategy='tensor_group', # differs from above
    block_structure=None,
    dynamic=False,
    actorder=None,
    observer='minmax',
    observer_kwargs={},
)
input_quant = QuantizationArgs(
    num_bits=4,
    type='float',
    symmetric=True,
    group_size=16,
    strategy='tensor_group', # differs from above
    block_structure=None,
    dynamic='local', # differs from above
    actorder=None,
    observer='minmax',
    observer_kwargs={},
)
quant_format = 'nvfp4-pack-quantized'
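
For reference, these QuantizationArgs dumps can be reproduced without starting the engine (a sketch, assuming the standard compressed-tensors config_groups layout in config.json and that compressed_tensors.quantization.QuantizationArgs is importable, as the dumps above suggest):

import json

from compressed_tensors.quantization import QuantizationArgs
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download("nm-testing/TinyLlama-1.1B-Chat-v1.0-NVFP4A4", "config.json")
with open(cfg_path) as f:
    qcfg = json.load(f)["quantization_config"]

# Print the parsed weight/input quantization args per config group.
for name, group in qcfg["config_groups"].items():
    print(name, "weights:", QuantizationArgs(**group["weights"]))
    print(name, "inputs: ", QuantizationArgs(**group["input_activations"]))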

However, the program then crashes at a later step, during torch.compile:

output & stack trace
<truncated identical output due to github char limit>
INFO 06-19 10:27:33 [gpu_model_runner.py:1600] Loading model from scratch...
WARNING 06-19 10:27:33 [compressed_tensors_w4a4_nvfp4.py:38] Current platform does not support cutlass NVFP4. Running emulations.
INFO 06-19 10:27:34 [cuda.py:252] Using Flash Attention backend on V1 engine.
WARNING 06-19 10:27:34 [compressed_tensors_w4a4_nvfp4.py:38] Current platform does not support cutlass NVFP4. Running emulations.
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.17it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.62it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.53it/s]

INFO 06-19 10:27:35 [default_loader.py:272] Loading weights took 1.38 seconds
INFO 06-19 10:27:36 [gpu_model_runner.py:1624] Model loading took 6.3905 GiB and 2.158821 seconds
INFO 06-19 10:28:14 [backends.py:462] Using cache directory: /home/ubuntu/.cache/vllm/torch_compile_cache/da6494da1e/rank_0_0 for vLLM's torch.compile
INFO 06-19 10:28:14 [backends.py:472] Dynamo bytecode transform time: 37.87 s
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] Triton compilation failed: triton_per_fused__to_copy_abs_clamp_eq_index_put_lift_fresh_max_mul_reciprocal_where_1
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] def triton_per_fused__to_copy_abs_clamp_eq_index_put_lift_fresh_max_mul_reciprocal_where_1(in_out_ptr0, in_ptr0, in_ptr1, in_ptr2, in_ptr3, out_ptr0, out_ptr1, xnumel, r0_numel, XBLOCK : tl.constexpr):
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     r0_numel = 16
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     R0_BLOCK: tl.constexpr = 16
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     rnumel = r0_numel
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     RBLOCK: tl.constexpr = R0_BLOCK
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     xoffset = tl.program_id(0) * XBLOCK
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     xmask = xindex < xnumel
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     r0_index = tl.arange(0, R0_BLOCK)[None, :]
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     r0_offset = 0
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     r0_mask = tl.full([XBLOCK, R0_BLOCK], True, tl.int1)
<truncated low level output due to github char limit>
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp88 = tmp87 > tmp83
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp89 = tl.where(tmp88, tmp45, tmp87)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tl.store(out_ptr1 + (r0_2 + 16*x3), tmp42, xmask)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tl.store(in_out_ptr0 + (r0_2 + 16*x3), tmp89, xmask)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tl.store(out_ptr0 + (x3), tmp16, xmask)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] 
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] metadata: {'signature': {'in_out_ptr0': '*fp32', 'in_ptr0': '*bf16', 'in_ptr1': '*fp32', 'in_ptr2': '*bf16', 'in_ptr3': '*fp32', 'out_ptr0': '*bf16', 'out_ptr1': '*fp32', 'xnumel': 'i32', 'r0_numel': 'i32', 'XBLOCK': 'constexpr'}, 'device': 0, 'constants': {'XBLOCK': 1}, 'configs': [{(0,): [['tt.divisibility', 16]], (1,): [['tt.divisibility', 16]], (2,): [['tt.divisibility', 16]], (3,): [['tt.divisibility', 16]], (4,): [['tt.divisibility', 16]], (5,): [['tt.divisibility', 16]], (6,): [['tt.divisibility', 16]], (7,): [['tt.divisibility', 16]], (8,): [['tt.divisibility', 16]]}], 'device_type': 'cuda', 'num_warps': 2, 'num_stages': 1, 'debug': True, 'cc': 86}
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] Traceback (most recent call last):
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/language/core.py", line 34, in wrapper
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     return fn(*args, **kwargs)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]            ^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/language/core.py", line 1043, in to
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     return cast(self, dtype, fp_downcast_rounding, bitcast, _builder=_builder)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/language/core.py", line 34, in wrapper
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     return fn(*args, **kwargs)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]            ^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/language/core.py", line 1772, in cast
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     return semantic.cast(input, dtype, _builder, fp_downcast_rounding)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/language/semantic.py", line 874, in cast
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     return tl.tensor(builder.create_fp_to_fp(input.handle, dst_ty.to_ir(builder), fp_downcast_rounding), dst_ty)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]                                                            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/language/core.py", line 652, in to_ir
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     return builder.get_block_ty(self.element_ty.to_ir(builder), self.shape)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/language/core.py", line 524, in to_ir
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     raise ValueError(f'type {self} not supported in this architecture. '
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] ValueError: type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] 
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] The above exception was the direct cause of the following exception:
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] 
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] Traceback (most recent call last):
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 537, in _precompile_config
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     binary = triton.compile(*compile_args, **compile_kwargs)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/compiler/compiler.py", line 278, in compile
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     module = src.make_ir(options, codegen_fns, module_map, context)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/compiler/compiler.py", line 81, in make_ir
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] triton.compiler.errors.CompilationError: at 45:12:
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp15 = tl.where(xmask, tmp13, float("-inf"))
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp16 = triton_helpers.max2(tmp15, 1)[:, None]
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp17 = tmp11.to(tl.float32)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp20 = tmp16.to(tl.float32)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp21 = 0.16666666666666666
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp22 = tmp20 * tmp21
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp23 = tmp19 * tmp22
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp24 = -448.0
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp25 = triton_helpers.maximum(tmp23, tmp24)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp26 = 448.0
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp27 = triton_helpers.minimum(tmp25, tmp26)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp28 = tmp27.to(tl.float8e4nv)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]             ^
ERROR 06-19 10:28:20 [core.py:515] EngineCore failed to start.
ERROR 06-19 10:28:20 [core.py:515] Traceback (most recent call last):
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 506, in run_engine_core
ERROR 06-19 10:28:20 [core.py:515]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 06-19 10:28:20 [core.py:515]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 390, in __init__
ERROR 06-19 10:28:20 [core.py:515]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 83, in __init__
ERROR 06-19 10:28:20 [core.py:515]     self._initialize_kv_caches(vllm_config)
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 141, in _initialize_kv_caches
ERROR 06-19 10:28:20 [core.py:515]     available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 06-19 10:28:20 [core.py:515]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
ERROR 06-19 10:28:20 [core.py:515]     output = self.collective_rpc("determine_available_memory")
ERROR 06-19 10:28:20 [core.py:515]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 06-19 10:28:20 [core.py:515]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 06-19 10:28:20 [core.py:515]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/utils.py", line 2671, in run_method
ERROR 06-19 10:28:20 [core.py:515]     return func(*args, **kwargs)
ERROR 06-19 10:28:20 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-19 10:28:20 [core.py:515]     return func(*args, **kwargs)
ERROR 06-19 10:28:20 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 205, in determine_available_memory
ERROR 06-19 10:28:20 [core.py:515]     self.model_runner.profile_run()
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2012, in profile_run
ERROR 06-19 10:28:20 [core.py:515]     hidden_states = self._dummy_run(self.max_num_tokens)
ERROR 06-19 10:28:20 [core.py:515]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-19 10:28:20 [core.py:515]     return func(*args, **kwargs)
ERROR 06-19 10:28:20 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1847, in _dummy_run
ERROR 06-19 10:28:20 [core.py:515]     outputs = model(
ERROR 06-19 10:28:20 [core.py:515]               ^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-19 10:28:20 [core.py:515]     return self._call_impl(*args, **kwargs)
ERROR 06-19 10:28:20 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-19 10:28:20 [core.py:515]     return forward_call(*args, **kwargs)
ERROR 06-19 10:28:20 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/qwen3.py", line 301, in forward
ERROR 06-19 10:28:20 [core.py:515]     hidden_states = self.model(input_ids, positions, intermediate_tensors,
ERROR 06-19 10:28:20 [core.py:515]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 239, in __call__
ERROR 06-19 10:28:20 [core.py:515]     output = self.compiled_callable(*args, **kwargs)
ERROR 06-19 10:28:20 [core.py:515]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 663, in _fn
ERROR 06-19 10:28:20 [core.py:515]     raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
ERROR 06-19 10:28:20 [core.py:515]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 760, in _compile_fx_inner
ERROR 06-19 10:28:20 [core.py:515]     raise InductorError(e, currentframe()).with_traceback(
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 745, in _compile_fx_inner
ERROR 06-19 10:28:20 [core.py:515]     mb_compiled_graph = fx_codegen_and_compile(
ERROR 06-19 10:28:20 [core.py:515]                         ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 1295, in fx_codegen_and_compile
ERROR 06-19 10:28:20 [core.py:515]     return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
ERROR 06-19 10:28:20 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 1197, in codegen_and_compile
ERROR 06-19 10:28:20 [core.py:515]     compiled_fn = graph.compile_to_module().call
ERROR 06-19 10:28:20 [core.py:515]                   ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/graph.py", line 2083, in compile_to_module
ERROR 06-19 10:28:20 [core.py:515]     return self._compile_to_module()
ERROR 06-19 10:28:20 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/graph.py", line 2130, in _compile_to_module
ERROR 06-19 10:28:20 [core.py:515]     mod = PyCodeCache.load_by_key_path(
ERROR 06-19 10:28:20 [core.py:515]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 2747, in load_by_key_path
ERROR 06-19 10:28:20 [core.py:515]     mod = _reload_python_module(key, path)
ERROR 06-19 10:28:20 [core.py:515]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/compile_tasks.py", line 36, in _reload_python_module
ERROR 06-19 10:28:20 [core.py:515]     exec(code, mod.__dict__, mod.__dict__)
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/.cache/vllm/torch_compile_cache/da6494da1e/rank_0_0/inductor_cache/db/cdbabxxysvzdh5a2wghjuc5u2rz5ytt7a2s6eh633mh3elksf5xx.py", line 174, in <module>
ERROR 06-19 10:28:20 [core.py:515]     triton_per_fused__to_copy_abs_clamp_eq_index_put_lift_fresh_max_mul_reciprocal_where_1 = async_compile.triton('triton_per_fused__to_copy_abs_clamp_eq_index_put_lift_fresh_max_mul_reciprocal_where_1', '''
ERROR 06-19 10:28:20 [core.py:515]                                                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/async_compile.py", line 346, in triton
ERROR 06-19 10:28:20 [core.py:515]     kernel.precompile(warm_cache_only=False)
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 276, in precompile
ERROR 06-19 10:28:20 [core.py:515]     self._precompile_worker()
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 296, in _precompile_worker
ERROR 06-19 10:28:20 [core.py:515]     compile_results.append(self._precompile_config(c))
ERROR 06-19 10:28:20 [core.py:515]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 537, in _precompile_config
ERROR 06-19 10:28:20 [core.py:515]     binary = triton.compile(*compile_args, **compile_kwargs)
ERROR 06-19 10:28:20 [core.py:515]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/compiler/compiler.py", line 278, in compile
ERROR 06-19 10:28:20 [core.py:515]     module = src.make_ir(options, codegen_fns, module_map, context)
ERROR 06-19 10:28:20 [core.py:515]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/compiler/compiler.py", line 81, in make_ir
ERROR 06-19 10:28:20 [core.py:515]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
ERROR 06-19 10:28:20 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] torch._inductor.exc.InductorError: CompilationError: at 45:12:
ERROR 06-19 10:28:20 [core.py:515]     tmp15 = tl.where(xmask, tmp13, float("-inf"))
ERROR 06-19 10:28:20 [core.py:515]     tmp16 = triton_helpers.max2(tmp15, 1)[:, None]
ERROR 06-19 10:28:20 [core.py:515]     tmp17 = tmp11.to(tl.float32)
ERROR 06-19 10:28:20 [core.py:515]     tmp20 = tmp16.to(tl.float32)
ERROR 06-19 10:28:20 [core.py:515]     tmp21 = 0.16666666666666666
ERROR 06-19 10:28:20 [core.py:515]     tmp22 = tmp20 * tmp21
ERROR 06-19 10:28:20 [core.py:515]     tmp23 = tmp19 * tmp22
ERROR 06-19 10:28:20 [core.py:515]     tmp24 = -448.0
ERROR 06-19 10:28:20 [core.py:515]     tmp25 = triton_helpers.maximum(tmp23, tmp24)
ERROR 06-19 10:28:20 [core.py:515]     tmp26 = 448.0
ERROR 06-19 10:28:20 [core.py:515]     tmp27 = triton_helpers.minimum(tmp25, tmp26)
ERROR 06-19 10:28:20 [core.py:515]     tmp28 = tmp27.to(tl.float8e4nv)
ERROR 06-19 10:28:20 [core.py:515]             ^
ERROR 06-19 10:28:20 [core.py:515] 
ERROR 06-19 10:28:20 [core.py:515] Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
ERROR 06-19 10:28:20 [core.py:515] 
Process EngineCore_0:
Traceback (most recent call last):
  File "/home/ubuntu/.pyenv/versions/3.11.10/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/ubuntu/.pyenv/versions/3.11.10/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 519, in run_engine_core
    raise e
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 506, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 390, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 83, in __init__
    self._initialize_kv_caches(vllm_config)
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 141, in _initialize_kv_caches
    available_gpu_memory = self.model_executor.determine_available_memory()
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
    output = self.collective_rpc("determine_available_memory")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/utils.py", line 2671, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 205, in determine_available_memory
    self.model_runner.profile_run()
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2012, in profile_run
    hidden_states = self._dummy_run(self.max_num_tokens)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1847, in _dummy_run
    outputs = model(
              ^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/qwen3.py", line 301, in forward
    hidden_states = self.model(input_ids, positions, intermediate_tensors,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 239, in __call__
    output = self.compiled_callable(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 663, in _fn
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 760, in _compile_fx_inner
    raise InductorError(e, currentframe()).with_traceback(
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 745, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 1295, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 1197, in codegen_and_compile
    compiled_fn = graph.compile_to_module().call
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/graph.py", line 2083, in compile_to_module
    return self._compile_to_module()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/graph.py", line 2130, in _compile_to_module
    mod = PyCodeCache.load_by_key_path(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 2747, in load_by_key_path
    mod = _reload_python_module(key, path)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/compile_tasks.py", line 36, in _reload_python_module
    exec(code, mod.__dict__, mod.__dict__)
  File "/home/ubuntu/.cache/vllm/torch_compile_cache/da6494da1e/rank_0_0/inductor_cache/db/cdbabxxysvzdh5a2wghjuc5u2rz5ytt7a2s6eh633mh3elksf5xx.py", line 174, in <module>
    triton_per_fused__to_copy_abs_clamp_eq_index_put_lift_fresh_max_mul_reciprocal_where_1 = async_compile.triton('triton_per_fused__to_copy_abs_clamp_eq_index_put_lift_fresh_max_mul_reciprocal_where_1', '''
                                                                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/async_compile.py", line 346, in triton
    kernel.precompile(warm_cache_only=False)
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 276, in precompile
    self._precompile_worker()
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 296, in _precompile_worker
    compile_results.append(self._precompile_config(c))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 537, in _precompile_config
    binary = triton.compile(*compile_args, **compile_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/compiler/compiler.py", line 278, in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/compiler/compiler.py", line 81, in make_ir
    return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._inductor.exc.InductorError: CompilationError: at 45:12:
    tmp15 = tl.where(xmask, tmp13, float("-inf"))
    tmp16 = triton_helpers.max2(tmp15, 1)[:, None]
    tmp17 = tmp11.to(tl.float32)
    tmp20 = tmp16.to(tl.float32)
    tmp21 = 0.16666666666666666
    tmp22 = tmp20 * tmp21
    tmp23 = tmp19 * tmp22
    tmp24 = -448.0
    tmp25 = triton_helpers.maximum(tmp23, tmp24)
    tmp26 = 448.0
    tmp27 = triton_helpers.minimum(tmp25, tmp26)
    tmp28 = tmp27.to(tl.float8e4nv)
            ^

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
<output truncated due to github char limit>
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
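
The kernel metadata above shows 'cc': 86, which matches the A10G (SM 8.6); Triton's fp8e4nv (FP8 E4M3) type requires a newer architecture (SM 8.9+), consistent with the ValueError. A possible workaround sketch (untested; it assumes the eager NVFP4 emulation path itself works on pre-SM89 GPUs) is to disable compilation entirely:

from vllm import LLM

# enforce_eager=True disables torch.compile/Inductor (and CUDA graphs), so the
# NVFP4 emulation never reaches the Triton fp8e4nv cast that SM 8.6 cannot compile.
llm = LLM("nm-testing/TinyLlama-1.1B-Chat-v1.0-NVFP4A4", enforce_eager=True)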

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
