[Bug]: Unable to deploy NVFP4 quantized model #19853

Open
Description

@leo-bujdei-leonte

Your current environment

The output of python collect_env.py
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version                : Could not collect
CMake version                : version 3.31.0
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.7.0+cu126
Is debug build               : False
CUDA used to build PyTorch   : 12.6
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.11.10 (main, Dec  4 2024, 11:59:58) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-6.8.0-1018-aws-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.4.131
CUDA_MODULE_LOADING set to   : LAZY
GPU models and configuration : GPU 0: NVIDIA A10G
Nvidia driver version        : 550.127.05
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        48 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               8
On-line CPU(s) list:                  0-7
Vendor ID:                            AuthenticAMD
Model name:                           AMD EPYC 7R32
CPU family:                           23
Model:                                49
Thread(s) per core:                   2
Core(s) per socket:                   4
Socket(s):                            1
Stepping:                             0
BogoMIPS:                             5599.99
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            128 KiB (4 instances)
L1i cache:                            128 KiB (4 instances)
L2 cache:                             2 MiB (4 instances)
L3 cache:                             16 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-7
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow:   Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-cufile-cu12==1.11.1.6
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] pyzmq==27.0.0
[pip3] torch==2.7.0
[pip3] torchaudio==2.7.0
[pip3] torchvision==0.22.0
[pip3] transformers==4.52.4
[pip3] triton==3.3.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
Neuron SDK Version           : N/A
vLLM Version                 : 0.9.1
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-7     0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
LD_LIBRARY_PATH=/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/cuda-12.4/lib:/usr/local/cuda-12.4/lib64:/usr/local/cuda-12.4:/usr/local/cuda-12.4/targets/x86_64-linux/lib/:/usr/local/cuda-12.4/extras/CUPTI/lib64:/usr/local/lib:/usr/lib:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/cuda-12.4/lib:/usr/local/cuda-12.4/lib64:/usr/local/cuda-12.4:/usr/local/cuda-12.4/targets/x86_64-linux/lib/:/usr/local/cuda-12.4/extras/CUPTI/lib64:/usr/local/lib:/usr/lib:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/cuda-12.4/lib:/usr/local/cuda-12.4/lib64:/usr/local/cuda-12.4:/usr/local/cuda-12.4/targets/x86_64-linux/lib/:/usr/local/cuda-12.4/extras/CUPTI/lib64:/usr/local/lib:/usr/lib
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

🐛 Describe the bug

I am trying to deploy an NVFP4-quantized model, as newly supported in 0.9.1. I copied the minimal setup from #18312:

from vllm import LLM, SamplingParams

prompts = [
    "The Swiss Alps are",
    "Brad Marchand is",
    "The Toronto Maple Leafs are",
]

# Create a sampling params object
sampling_params = SamplingParams(temperature=0.80, top_p=0.95, max_tokens=40, min_tokens=10)
llm = LLM("nm-testing/TinyLlama-1.1B-Chat-v1.0-NVFP4A4")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

This fails with the following output:

output & stack trace
INFO 06-19 09:53:33 [__init__.py:244] Automatically detected platform cuda.
INFO 06-19 09:53:47 [config.py:823] This model supports multiple tasks: {'score', 'generate', 'embed', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 06-19 09:53:48 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 06-19 09:53:49 [core.py:455] Waiting for init message from front-end.
INFO 06-19 09:53:49 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='nm-testing/TinyLlama-1.1B-Chat-v1.0-NVFP4A4', speculative_config=None, tokenizer='nm-testing/TinyLlama-1.1B-Chat-v1.0-NVFP4A4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=nm-testing/TinyLlama-1.1B-Chat-v1.0-NVFP4A4, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 06-19 09:53:49 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7cd939b4de10>
INFO 06-19 09:53:50 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 06-19 09:53:50 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 06-19 09:53:50 [gpu_model_runner.py:1595] Starting to load model nm-testing/TinyLlama-1.1B-Chat-v1.0-NVFP4A4...
INFO 06-19 09:53:50 [gpu_model_runner.py:1600] Loading model from scratch...
ERROR 06-19 09:53:50 [core.py:515] EngineCore failed to start.
ERROR 06-19 09:53:50 [core.py:515] Traceback (most recent call last):
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 506, in run_engine_core
ERROR 06-19 09:53:50 [core.py:515]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 06-19 09:53:50 [core.py:515]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 390, in __init__
ERROR 06-19 09:53:50 [core.py:515]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 76, in __init__
ERROR 06-19 09:53:50 [core.py:515]     self.model_executor = executor_class(vllm_config)
ERROR 06-19 09:53:50 [core.py:515]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 53, in __init__
ERROR 06-19 09:53:50 [core.py:515]     self._init_executor()
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 48, in _init_executor
ERROR 06-19 09:53:50 [core.py:515]     self.collective_rpc("load_model")
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 06-19 09:53:50 [core.py:515]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 06-19 09:53:50 [core.py:515]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/utils.py", line 2671, in run_method
ERROR 06-19 09:53:50 [core.py:515]     return func(*args, **kwargs)
ERROR 06-19 09:53:50 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 180, in load_model
ERROR 06-19 09:53:50 [core.py:515]     self.model_runner.load_model()
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1601, in load_model
ERROR 06-19 09:53:50 [core.py:515]     self.model = model_loader.load_model(
ERROR 06-19 09:53:50 [core.py:515]                  ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/model_loader/base_loader.py", line 38, in load_model
ERROR 06-19 09:53:50 [core.py:515]     model = initialize_model(vllm_config=vllm_config,
ERROR 06-19 09:53:50 [core.py:515]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/model_loader/utils.py", line 62, in initialize_model
ERROR 06-19 09:53:50 [core.py:515]     return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 06-19 09:53:50 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 521, in __init__
ERROR 06-19 09:53:50 [core.py:515]     self.model = self._init_model(vllm_config=vllm_config,
ERROR 06-19 09:53:50 [core.py:515]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 567, in _init_model
ERROR 06-19 09:53:50 [core.py:515]     return LlamaModel(vllm_config=vllm_config,
ERROR 06-19 09:53:50 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 152, in __init__
ERROR 06-19 09:53:50 [core.py:515]     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 346, in __init__
ERROR 06-19 09:53:50 [core.py:515]     self.start_layer, self.end_layer, self.layers = make_layers(
ERROR 06-19 09:53:50 [core.py:515]                                                     ^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 626, in make_layers
ERROR 06-19 09:53:50 [core.py:515]     [PPMissingLayer() for _ in range(start_layer)] + [
ERROR 06-19 09:53:50 [core.py:515]                                                      ^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 627, in <listcomp>
ERROR 06-19 09:53:50 [core.py:515]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
ERROR 06-19 09:53:50 [core.py:515]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 348, in <lambda>
ERROR 06-19 09:53:50 [core.py:515]     lambda prefix: layer_type(config=config,
ERROR 06-19 09:53:50 [core.py:515]                    ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 263, in __init__
ERROR 06-19 09:53:50 [core.py:515]     self.self_attn = LlamaAttention(
ERROR 06-19 09:53:50 [core.py:515]                      ^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 148, in __init__
ERROR 06-19 09:53:50 [core.py:515]     self.qkv_proj = QKVParallelLinear(
ERROR 06-19 09:53:50 [core.py:515]                     ^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 851, in __init__
ERROR 06-19 09:53:50 [core.py:515]     super().__init__(input_size=input_size,
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 397, in __init__
ERROR 06-19 09:53:50 [core.py:515]     super().__init__(input_size,
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 243, in __init__
ERROR 06-19 09:53:50 [core.py:515]     self.quant_method = quant_config.get_quant_method(self,
ERROR 06-19 09:53:50 [core.py:515]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 94, in get_quant_method
ERROR 06-19 09:53:50 [core.py:515]     scheme = self.get_scheme(layer=layer, layer_name=prefix)
ERROR 06-19 09:53:50 [core.py:515]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 489, in get_scheme
ERROR 06-19 09:53:50 [core.py:515]     scheme = self._get_scheme_from_parts(  # type: ignore
ERROR 06-19 09:53:50 [core.py:515]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 414, in _get_scheme_from_parts
ERROR 06-19 09:53:50 [core.py:515]     raise NotImplementedError(
ERROR 06-19 09:53:50 [core.py:515] NotImplementedError: No compressed-tensors compatible scheme was found.
Process EngineCore_0:
Traceback (most recent call last):
  File "/home/ubuntu/.pyenv/versions/3.11.10/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/ubuntu/.pyenv/versions/3.11.10/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 519, in run_engine_core
    raise e
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 506, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 390, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 76, in __init__
    self.model_executor = executor_class(vllm_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 53, in __init__
    self._init_executor()
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 48, in _init_executor
    self.collective_rpc("load_model")
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/utils.py", line 2671, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 180, in load_model
    self.model_runner.load_model()
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1601, in load_model
    self.model = model_loader.load_model(
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/model_loader/base_loader.py", line 38, in load_model
    model = initialize_model(vllm_config=vllm_config,
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/model_loader/utils.py", line 62, in initialize_model
    return model_class(vllm_config=vllm_config, prefix=prefix)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 521, in __init__
    self.model = self._init_model(vllm_config=vllm_config,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 567, in _init_model
    return LlamaModel(vllm_config=vllm_config,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 152, in __init__
    old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 346, in __init__
    self.start_layer, self.end_layer, self.layers = make_layers(
                                                    ^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 626, in make_layers
    [PPMissingLayer() for _ in range(start_layer)] + [
                                                     ^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 627, in <listcomp>
    maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 348, in <lambda>
    lambda prefix: layer_type(config=config,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 263, in __init__
    self.self_attn = LlamaAttention(
                     ^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 148, in __init__
    self.qkv_proj = QKVParallelLinear(
                    ^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 851, in __init__
    super().__init__(input_size=input_size,
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 397, in __init__
    super().__init__(input_size,
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 243, in __init__
    self.quant_method = quant_config.get_quant_method(self,
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 94, in get_quant_method
    scheme = self.get_scheme(layer=layer, layer_name=prefix)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 489, in get_scheme
    scheme = self._get_scheme_from_parts(  # type: ignore
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 414, in _get_scheme_from_parts
    raise NotImplementedError(
NotImplementedError: No compressed-tensors compatible scheme was found.
<output truncated due to github char limit>
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

It looks like the quantization config differs slightly from what vLLM expects. In vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py, CompressedTensorsConfig._get_scheme_from_parts runs the quant-type check _is_fp4a4_nvfp4, which fails because of the strategy in the quantization spec:

weight_quant = QuantizationArgs(
    num_bits=4,
    type='float',
    symmetric=True,
    group_size=16,
    strategy='group', # <--- should be 'tensor_group'
    block_structure=None,
    dynamic=False,
    actorder=None,
    observer='minmax',
    observer_kwargs={},
)
input_quant = QuantizationArgs(
    num_bits=4,
    type='float',
    symmetric=True,
    group_size=16,
    strategy='group', # <--- should be 'tensor_group'
    block_structure=None,
    dynamic=False,
    actorder=None,
    observer='minmax',
    observer_kwargs={},
)
quant_format = 'nvfp4-pack-quantized'
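
As a possible workaround (an untested sketch; it assumes the checkpoint's config.json follows the usual compressed-tensors layout, where each config_groups entry carries weights and input_activations dicts with a strategy field), the strategy could be patched to 'tensor_group' in a local copy of the checkpoint before loading:

import json
from pathlib import Path

from huggingface_hub import snapshot_download

# Download the checkpoint into an editable local directory.
local_dir = snapshot_download(
    "nm-testing/TinyLlama-1.1B-Chat-v1.0-NVFP4A4", local_dir="./tinyllama-nvfp4"
)
cfg_path = Path(local_dir) / "config.json"
cfg = json.loads(cfg_path.read_text())

# Assumption: compressed-tensors config format, with per-group
# 'weights'/'input_activations' quantization args.
for group in cfg["quantization_config"]["config_groups"].values():
    for key in ("weights", "input_activations"):
        if isinstance(group.get(key), dict) and group[key].get("strategy") == "group":
            group[key]["strategy"] = "tensor_group"

cfg_path.write_text(json.dumps(cfg, indent=2))
# Then load the patched copy instead: LLM("./tinyllama-nvfp4")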

I also tested a Qwen3 8B model NVFP4-quantized with llm-compressor. Its quant spec seems correctly configured:

weight_quant = QuantizationArgs(
    num_bits=4,
    type='float',
    symmetric=True,
    group_size=16,
    strategy='tensor_group', # differs from above
    block_structure=None,
    dynamic=False,
    actorder=None,
    observer='minmax',
    observer_kwargs={},
)
input_quant = QuantizationArgs(
    num_bits=4,
    type='float',
    symmetric=True,
    group_size=16,
    strategy='tensor_group', # differs from above
    block_structure=None,
    dynamic='local', # differs from above
    actorder=None,
    observer='minmax',
    observer_kwargs={},
)
quant_format = 'nvfp4-pack-quantized'
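
For reference, these QuantizationArgs dumps can be reproduced without starting the engine (a sketch, assuming the standard compressed-tensors config_groups layout in config.json and that compressed_tensors.quantization.QuantizationArgs is importable, as the dumps above suggest):

import json

from compressed_tensors.quantization import QuantizationArgs
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download("nm-testing/TinyLlama-1.1B-Chat-v1.0-NVFP4A4", "config.json")
with open(cfg_path) as f:
    qcfg = json.load(f)["quantization_config"]

# Print the parsed weight/input quantization args per config group.
for name, group in qcfg["config_groups"].items():
    print(name, "weights:", QuantizationArgs(**group["weights"]))
    print(name, "inputs: ", QuantizationArgs(**group["input_activations"]))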

However, the program then crashes at a later step, during torch.compile:

output & stack trace
<truncated identical output due to github char limit>
INFO 06-19 10:27:33 [gpu_model_runner.py:1600] Loading model from scratch...
WARNING 06-19 10:27:33 [compressed_tensors_w4a4_nvfp4.py:38] Current platform does not support cutlass NVFP4. Running emulations.
INFO 06-19 10:27:34 [cuda.py:252] Using Flash Attention backend on V1 engine.
WARNING 06-19 10:27:34 [compressed_tensors_w4a4_nvfp4.py:38] Current platform does not support cutlass NVFP4. Running emulations.
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.17it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.62it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.53it/s]

INFO 06-19 10:27:35 [default_loader.py:272] Loading weights took 1.38 seconds
INFO 06-19 10:27:36 [gpu_model_runner.py:1624] Model loading took 6.3905 GiB and 2.158821 seconds
INFO 06-19 10:28:14 [backends.py:462] Using cache directory: /home/ubuntu/.cache/vllm/torch_compile_cache/da6494da1e/rank_0_0 for vLLM's torch.compile
INFO 06-19 10:28:14 [backends.py:472] Dynamo bytecode transform time: 37.87 s
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] Triton compilation failed: triton_per_fused__to_copy_abs_clamp_eq_index_put_lift_fresh_max_mul_reciprocal_where_1
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] def triton_per_fused__to_copy_abs_clamp_eq_index_put_lift_fresh_max_mul_reciprocal_where_1(in_out_ptr0, in_ptr0, in_ptr1, in_ptr2, in_ptr3, out_ptr0, out_ptr1, xnumel, r0_numel, XBLOCK : tl.constexpr):
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     r0_numel = 16
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     R0_BLOCK: tl.constexpr = 16
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     rnumel = r0_numel
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     RBLOCK: tl.constexpr = R0_BLOCK
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     xoffset = tl.program_id(0) * XBLOCK
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     xmask = xindex < xnumel
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     r0_index = tl.arange(0, R0_BLOCK)[None, :]
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     r0_offset = 0
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     r0_mask = tl.full([XBLOCK, R0_BLOCK], True, tl.int1)
<truncated low level output due to github char limit>
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp88 = tmp87 > tmp83
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp89 = tl.where(tmp88, tmp45, tmp87)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tl.store(out_ptr1 + (r0_2 + 16*x3), tmp42, xmask)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tl.store(in_out_ptr0 + (r0_2 + 16*x3), tmp89, xmask)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tl.store(out_ptr0 + (x3), tmp16, xmask)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] 
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] metadata: {'signature': {'in_out_ptr0': '*fp32', 'in_ptr0': '*bf16', 'in_ptr1': '*fp32', 'in_ptr2': '*bf16', 'in_ptr3': '*fp32', 'out_ptr0': '*bf16', 'out_ptr1': '*fp32', 'xnumel': 'i32', 'r0_numel': 'i32', 'XBLOCK': 'constexpr'}, 'device': 0, 'constants': {'XBLOCK': 1}, 'configs': [{(0,): [['tt.divisibility', 16]], (1,): [['tt.divisibility', 16]], (2,): [['tt.divisibility', 16]], (3,): [['tt.divisibility', 16]], (4,): [['tt.divisibility', 16]], (5,): [['tt.divisibility', 16]], (6,): [['tt.divisibility', 16]], (7,): [['tt.divisibility', 16]], (8,): [['tt.divisibility', 16]]}], 'device_type': 'cuda', 'num_warps': 2, 'num_stages': 1, 'debug': True, 'cc': 86}
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] Traceback (most recent call last):
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/language/core.py", line 34, in wrapper
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     return fn(*args, **kwargs)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]            ^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/language/core.py", line 1043, in to
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     return cast(self, dtype, fp_downcast_rounding, bitcast, _builder=_builder)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/language/core.py", line 34, in wrapper
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     return fn(*args, **kwargs)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]            ^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/language/core.py", line 1772, in cast
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     return semantic.cast(input, dtype, _builder, fp_downcast_rounding)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/language/semantic.py", line 874, in cast
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     return tl.tensor(builder.create_fp_to_fp(input.handle, dst_ty.to_ir(builder), fp_downcast_rounding), dst_ty)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]                                                            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/language/core.py", line 652, in to_ir
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     return builder.get_block_ty(self.element_ty.to_ir(builder), self.shape)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/language/core.py", line 524, in to_ir
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     raise ValueError(f'type {self} not supported in this architecture. '
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] ValueError: type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] 
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] The above exception was the direct cause of the following exception:
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] 
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] Traceback (most recent call last):
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 537, in _precompile_config
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     binary = triton.compile(*compile_args, **compile_kwargs)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/compiler/compiler.py", line 278, in compile
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     module = src.make_ir(options, codegen_fns, module_map, context)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/compiler/compiler.py", line 81, in make_ir
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] triton.compiler.errors.CompilationError: at 45:12:
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp15 = tl.where(xmask, tmp13, float("-inf"))
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp16 = triton_helpers.max2(tmp15, 1)[:, None]
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp17 = tmp11.to(tl.float32)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp20 = tmp16.to(tl.float32)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp21 = 0.16666666666666666
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp22 = tmp20 * tmp21
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp23 = tmp19 * tmp22
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp24 = -448.0
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp25 = triton_helpers.maximum(tmp23, tmp24)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp26 = 448.0
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp27 = triton_helpers.minimum(tmp25, tmp26)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]     tmp28 = tmp27.to(tl.float8e4nv)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]             ^
ERROR 06-19 10:28:20 [core.py:515] EngineCore failed to start.
ERROR 06-19 10:28:20 [core.py:515] Traceback (most recent call last):
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 506, in run_engine_core
ERROR 06-19 10:28:20 [core.py:515]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 06-19 10:28:20 [core.py:515]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 390, in __init__
ERROR 06-19 10:28:20 [core.py:515]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 83, in __init__
ERROR 06-19 10:28:20 [core.py:515]     self._initialize_kv_caches(vllm_config)
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 141, in _initialize_kv_caches
ERROR 06-19 10:28:20 [core.py:515]     available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 06-19 10:28:20 [core.py:515]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
ERROR 06-19 10:28:20 [core.py:515]     output = self.collective_rpc("determine_available_memory")
ERROR 06-19 10:28:20 [core.py:515]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 06-19 10:28:20 [core.py:515]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 06-19 10:28:20 [core.py:515]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/utils.py", line 2671, in run_method
ERROR 06-19 10:28:20 [core.py:515]     return func(*args, **kwargs)
ERROR 06-19 10:28:20 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-19 10:28:20 [core.py:515]     return func(*args, **kwargs)
ERROR 06-19 10:28:20 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 205, in determine_available_memory
ERROR 06-19 10:28:20 [core.py:515]     self.model_runner.profile_run()
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2012, in profile_run
ERROR 06-19 10:28:20 [core.py:515]     hidden_states = self._dummy_run(self.max_num_tokens)
ERROR 06-19 10:28:20 [core.py:515]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-19 10:28:20 [core.py:515]     return func(*args, **kwargs)
ERROR 06-19 10:28:20 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1847, in _dummy_run
ERROR 06-19 10:28:20 [core.py:515]     outputs = model(
ERROR 06-19 10:28:20 [core.py:515]               ^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-19 10:28:20 [core.py:515]     return self._call_impl(*args, **kwargs)
ERROR 06-19 10:28:20 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-19 10:28:20 [core.py:515]     return forward_call(*args, **kwargs)
ERROR 06-19 10:28:20 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/qwen3.py", line 301, in forward
ERROR 06-19 10:28:20 [core.py:515]     hidden_states = self.model(input_ids, positions, intermediate_tensors,
ERROR 06-19 10:28:20 [core.py:515]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 239, in __call__
ERROR 06-19 10:28:20 [core.py:515]     output = self.compiled_callable(*args, **kwargs)
ERROR 06-19 10:28:20 [core.py:515]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 663, in _fn
ERROR 06-19 10:28:20 [core.py:515]     raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
ERROR 06-19 10:28:20 [core.py:515]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 760, in _compile_fx_inner
ERROR 06-19 10:28:20 [core.py:515]     raise InductorError(e, currentframe()).with_traceback(
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 745, in _compile_fx_inner
ERROR 06-19 10:28:20 [core.py:515]     mb_compiled_graph = fx_codegen_and_compile(
ERROR 06-19 10:28:20 [core.py:515]                         ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 1295, in fx_codegen_and_compile
ERROR 06-19 10:28:20 [core.py:515]     return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
ERROR 06-19 10:28:20 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 1197, in codegen_and_compile
ERROR 06-19 10:28:20 [core.py:515]     compiled_fn = graph.compile_to_module().call
ERROR 06-19 10:28:20 [core.py:515]                   ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/graph.py", line 2083, in compile_to_module
ERROR 06-19 10:28:20 [core.py:515]     return self._compile_to_module()
ERROR 06-19 10:28:20 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/graph.py", line 2130, in _compile_to_module
ERROR 06-19 10:28:20 [core.py:515]     mod = PyCodeCache.load_by_key_path(
ERROR 06-19 10:28:20 [core.py:515]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 2747, in load_by_key_path
ERROR 06-19 10:28:20 [core.py:515]     mod = _reload_python_module(key, path)
ERROR 06-19 10:28:20 [core.py:515]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/compile_tasks.py", line 36, in _reload_python_module
ERROR 06-19 10:28:20 [core.py:515]     exec(code, mod.__dict__, mod.__dict__)
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/.cache/vllm/torch_compile_cache/da6494da1e/rank_0_0/inductor_cache/db/cdbabxxysvzdh5a2wghjuc5u2rz5ytt7a2s6eh633mh3elksf5xx.py", line 174, in <module>
ERROR 06-19 10:28:20 [core.py:515]     triton_per_fused__to_copy_abs_clamp_eq_index_put_lift_fresh_max_mul_reciprocal_where_1 = async_compile.triton('triton_per_fused__to_copy_abs_clamp_eq_index_put_lift_fresh_max_mul_reciprocal_where_1', '''
ERROR 06-19 10:28:20 [core.py:515]                                                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/async_compile.py", line 346, in triton
ERROR 06-19 10:28:20 [core.py:515]     kernel.precompile(warm_cache_only=False)
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 276, in precompile
ERROR 06-19 10:28:20 [core.py:515]     self._precompile_worker()
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 296, in _precompile_worker
ERROR 06-19 10:28:20 [core.py:515]     compile_results.append(self._precompile_config(c))
ERROR 06-19 10:28:20 [core.py:515]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 537, in _precompile_config
ERROR 06-19 10:28:20 [core.py:515]     binary = triton.compile(*compile_args, **compile_kwargs)
ERROR 06-19 10:28:20 [core.py:515]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/compiler/compiler.py", line 278, in compile
ERROR 06-19 10:28:20 [core.py:515]     module = src.make_ir(options, codegen_fns, module_map, context)
ERROR 06-19 10:28:20 [core.py:515]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515]   File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/compiler/compiler.py", line 81, in make_ir
ERROR 06-19 10:28:20 [core.py:515]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
ERROR 06-19 10:28:20 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] torch._inductor.exc.InductorError: CompilationError: at 45:12:
ERROR 06-19 10:28:20 [core.py:515]     tmp15 = tl.where(xmask, tmp13, float("-inf"))
ERROR 06-19 10:28:20 [core.py:515]     tmp16 = triton_helpers.max2(tmp15, 1)[:, None]
ERROR 06-19 10:28:20 [core.py:515]     tmp17 = tmp11.to(tl.float32)
ERROR 06-19 10:28:20 [core.py:515]     tmp20 = tmp16.to(tl.float32)
ERROR 06-19 10:28:20 [core.py:515]     tmp21 = 0.16666666666666666
ERROR 06-19 10:28:20 [core.py:515]     tmp22 = tmp20 * tmp21
ERROR 06-19 10:28:20 [core.py:515]     tmp23 = tmp19 * tmp22
ERROR 06-19 10:28:20 [core.py:515]     tmp24 = -448.0
ERROR 06-19 10:28:20 [core.py:515]     tmp25 = triton_helpers.maximum(tmp23, tmp24)
ERROR 06-19 10:28:20 [core.py:515]     tmp26 = 448.0
ERROR 06-19 10:28:20 [core.py:515]     tmp27 = triton_helpers.minimum(tmp25, tmp26)
ERROR 06-19 10:28:20 [core.py:515]     tmp28 = tmp27.to(tl.float8e4nv)
ERROR 06-19 10:28:20 [core.py:515]             ^
ERROR 06-19 10:28:20 [core.py:515] 
ERROR 06-19 10:28:20 [core.py:515] Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
ERROR 06-19 10:28:20 [core.py:515] 
Process EngineCore_0:
Traceback (most recent call last):
  File "/home/ubuntu/.pyenv/versions/3.11.10/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/ubuntu/.pyenv/versions/3.11.10/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 519, in run_engine_core
    raise e
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 506, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 390, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 83, in __init__
    self._initialize_kv_caches(vllm_config)
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 141, in _initialize_kv_caches
    available_gpu_memory = self.model_executor.determine_available_memory()
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
    output = self.collective_rpc("determine_available_memory")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/utils.py", line 2671, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 205, in determine_available_memory
    self.model_runner.profile_run()
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2012, in profile_run
    hidden_states = self._dummy_run(self.max_num_tokens)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1847, in _dummy_run
    outputs = model(
              ^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/qwen3.py", line 301, in forward
    hidden_states = self.model(input_ids, positions, intermediate_tensors,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 239, in __call__
    output = self.compiled_callable(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 663, in _fn
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 760, in _compile_fx_inner
    raise InductorError(e, currentframe()).with_traceback(
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 745, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 1295, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 1197, in codegen_and_compile
    compiled_fn = graph.compile_to_module().call
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/graph.py", line 2083, in compile_to_module
    return self._compile_to_module()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/graph.py", line 2130, in _compile_to_module
    mod = PyCodeCache.load_by_key_path(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 2747, in load_by_key_path
    mod = _reload_python_module(key, path)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/compile_tasks.py", line 36, in _reload_python_module
    exec(code, mod.__dict__, mod.__dict__)
  File "/home/ubuntu/.cache/vllm/torch_compile_cache/da6494da1e/rank_0_0/inductor_cache/db/cdbabxxysvzdh5a2wghjuc5u2rz5ytt7a2s6eh633mh3elksf5xx.py", line 174, in <module>
    triton_per_fused__to_copy_abs_clamp_eq_index_put_lift_fresh_max_mul_reciprocal_where_1 = async_compile.triton('triton_per_fused__to_copy_abs_clamp_eq_index_put_lift_fresh_max_mul_reciprocal_where_1', '''
                                                                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/async_compile.py", line 346, in triton
    kernel.precompile(warm_cache_only=False)
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 276, in precompile
    self._precompile_worker()
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 296, in _precompile_worker
    compile_results.append(self._precompile_config(c))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 537, in _precompile_config
    binary = triton.compile(*compile_args, **compile_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/compiler/compiler.py", line 278, in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/compiler/compiler.py", line 81, in make_ir
    return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._inductor.exc.InductorError: CompilationError: at 45:12:
    tmp15 = tl.where(xmask, tmp13, float("-inf"))
    tmp16 = triton_helpers.max2(tmp15, 1)[:, None]
    tmp17 = tmp11.to(tl.float32)
    tmp20 = tmp16.to(tl.float32)
    tmp21 = 0.16666666666666666
    tmp22 = tmp20 * tmp21
    tmp23 = tmp19 * tmp22
    tmp24 = -448.0
    tmp25 = triton_helpers.maximum(tmp23, tmp24)
    tmp26 = 448.0
    tmp27 = triton_helpers.minimum(tmp25, tmp26)
    tmp28 = tmp27.to(tl.float8e4nv)
            ^

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
<output truncated due to github char limit>
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
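
The kernel metadata above shows 'cc': 86, which matches the A10G (SM 8.6); Triton's fp8e4nv (FP8 E4M3) type requires a newer architecture (SM 8.9+), consistent with the ValueError. A possible workaround sketch (untested; it assumes the eager NVFP4 emulation path itself works on pre-SM89 GPUs) is to disable compilation entirely:

from vllm import LLM

# enforce_eager=True disables torch.compile/Inductor (and CUDA graphs), so the
# NVFP4 emulation never reaches the Triton fp8e4nv cast that SM 8.6 cannot compile.
llm = LLM("nm-testing/TinyLlama-1.1B-Chat-v1.0-NVFP4A4", enforce_eager=True)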

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
