Your current environment
The output of `python collect_env.py`:
==============================
System Info
==============================
OS : Ubuntu 22.04.5 LTS (x86_64)
GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version : Could not collect
CMake version : version 3.31.0
Libc version : glibc-2.35
==============================
PyTorch Info
==============================
PyTorch version : 2.7.0+cu126
Is debug build : False
CUDA used to build PyTorch : 12.6
ROCM used to build PyTorch : N/A
==============================
Python Environment
==============================
Python version : 3.11.10 (main, Dec 4 2024, 11:59:58) [GCC 11.4.0] (64-bit runtime)
Python platform : Linux-6.8.0-1018-aws-x86_64-with-glibc2.35
==============================
CUDA / GPU Info
==============================
Is CUDA available : True
CUDA runtime version : 12.4.131
CUDA_MODULE_LOADING set to : LAZY
GPU models and configuration : GPU 0: NVIDIA A10G
Nvidia driver version : 550.127.05
cuDNN version : Could not collect
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
==============================
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7R32
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 0
BogoMIPS: 5599.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 128 KiB (4 instances)
L1i cache: 128 KiB (4 instances)
L2 cache: 2 MiB (4 instances)
L3 cache: 16 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-cufile-cu12==1.11.1.6
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] pyzmq==27.0.0
[pip3] torch==2.7.0
[pip3] torchaudio==2.7.0
[pip3] torchvision==0.22.0
[pip3] transformers==4.52.4
[pip3] triton==3.3.0
[conda] Could not collect
==============================
vLLM Info
==============================
ROCM Version : Could not collect
Neuron SDK Version : N/A
vLLM Version : 0.9.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 0-7 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
==============================
Environment Variables
==============================
LD_LIBRARY_PATH=/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/cuda-12.4/lib:/usr/local/cuda-12.4/lib64:/usr/local/cuda-12.4:/usr/local/cuda-12.4/targets/x86_64-linux/lib/:/usr/local/cuda-12.4/extras/CUPTI/lib64:/usr/local/lib:/usr/lib:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/cuda-12.4/lib:/usr/local/cuda-12.4/lib64:/usr/local/cuda-12.4:/usr/local/cuda-12.4/targets/x86_64-linux/lib/:/usr/local/cuda-12.4/extras/CUPTI/lib64:/usr/local/lib:/usr/lib:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/cuda-12.4/lib:/usr/local/cuda-12.4/lib64:/usr/local/cuda-12.4:/usr/local/cuda-12.4/targets/x86_64-linux/lib/:/usr/local/cuda-12.4/extras/CUPTI/lib64:/usr/local/lib:/usr/lib
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
🐛 Describe the bug
I am trying to deploy an NVFP4 quantized model as newly supported in 0.9.1. I copied the minimal setup from #18312:
```python
import torch
from vllm import LLM, SamplingParams

prompts = [
    "The Swiss Alps are",
    "Brad Marchand is",
    "The Toronto Maple Leafs are",
]

# Create a sampling params object for greedy sampling
sampling_params = SamplingParams(temperature=0.80, top_p=0.95, max_tokens=40, min_tokens=10)

llm = LLM("nm-testing/TinyLlama-1.1B-Chat-v1.0-NVFP4A4")  # <-- crashes here

# Never reached: the engine fails while constructing LLM above.
outputs = llm.generate(prompts, sampling_params)
```
This fails with the following output:
output & stack trace
INFO 06-19 09:53:33 [__init__.py:244] Automatically detected platform cuda.
INFO 06-19 09:53:47 [config.py:823] This model supports multiple tasks: {'score', 'generate', 'embed', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 06-19 09:53:48 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 06-19 09:53:49 [core.py:455] Waiting for init message from front-end.
INFO 06-19 09:53:49 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='nm-testing/TinyLlama-1.1B-Chat-v1.0-NVFP4A4', speculative_config=None, tokenizer='nm-testing/TinyLlama-1.1B-Chat-v1.0-NVFP4A4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=nm-testing/TinyLlama-1.1B-Chat-v1.0-NVFP4A4, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 06-19 09:53:49 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7cd939b4de10>
INFO 06-19 09:53:50 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 06-19 09:53:50 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 06-19 09:53:50 [gpu_model_runner.py:1595] Starting to load model nm-testing/TinyLlama-1.1B-Chat-v1.0-NVFP4A4...
INFO 06-19 09:53:50 [gpu_model_runner.py:1600] Loading model from scratch...
ERROR 06-19 09:53:50 [core.py:515] EngineCore failed to start.
ERROR 06-19 09:53:50 [core.py:515] Traceback (most recent call last):
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 506, in run_engine_core
ERROR 06-19 09:53:50 [core.py:515] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 06-19 09:53:50 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 390, in __init__
ERROR 06-19 09:53:50 [core.py:515] super().__init__(vllm_config, executor_class, log_stats,
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 76, in __init__
ERROR 06-19 09:53:50 [core.py:515] self.model_executor = executor_class(vllm_config)
ERROR 06-19 09:53:50 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 53, in __init__
ERROR 06-19 09:53:50 [core.py:515] self._init_executor()
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 48, in _init_executor
ERROR 06-19 09:53:50 [core.py:515] self.collective_rpc("load_model")
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 06-19 09:53:50 [core.py:515] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 06-19 09:53:50 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/utils.py", line 2671, in run_method
ERROR 06-19 09:53:50 [core.py:515] return func(*args, **kwargs)
ERROR 06-19 09:53:50 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 180, in load_model
ERROR 06-19 09:53:50 [core.py:515] self.model_runner.load_model()
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1601, in load_model
ERROR 06-19 09:53:50 [core.py:515] self.model = model_loader.load_model(
ERROR 06-19 09:53:50 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/model_loader/base_loader.py", line 38, in load_model
ERROR 06-19 09:53:50 [core.py:515] model = initialize_model(vllm_config=vllm_config,
ERROR 06-19 09:53:50 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/model_loader/utils.py", line 62, in initialize_model
ERROR 06-19 09:53:50 [core.py:515] return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 06-19 09:53:50 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 521, in __init__
ERROR 06-19 09:53:50 [core.py:515] self.model = self._init_model(vllm_config=vllm_config,
ERROR 06-19 09:53:50 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 567, in _init_model
ERROR 06-19 09:53:50 [core.py:515] return LlamaModel(vllm_config=vllm_config,
ERROR 06-19 09:53:50 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 152, in __init__
ERROR 06-19 09:53:50 [core.py:515] old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 346, in __init__
ERROR 06-19 09:53:50 [core.py:515] self.start_layer, self.end_layer, self.layers = make_layers(
ERROR 06-19 09:53:50 [core.py:515] ^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 626, in make_layers
ERROR 06-19 09:53:50 [core.py:515] [PPMissingLayer() for _ in range(start_layer)] + [
ERROR 06-19 09:53:50 [core.py:515] ^
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 627, in <listcomp>
ERROR 06-19 09:53:50 [core.py:515] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
ERROR 06-19 09:53:50 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 348, in <lambda>
ERROR 06-19 09:53:50 [core.py:515] lambda prefix: layer_type(config=config,
ERROR 06-19 09:53:50 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 263, in __init__
ERROR 06-19 09:53:50 [core.py:515] self.self_attn = LlamaAttention(
ERROR 06-19 09:53:50 [core.py:515] ^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 148, in __init__
ERROR 06-19 09:53:50 [core.py:515] self.qkv_proj = QKVParallelLinear(
ERROR 06-19 09:53:50 [core.py:515] ^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 851, in __init__
ERROR 06-19 09:53:50 [core.py:515] super().__init__(input_size=input_size,
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 397, in __init__
ERROR 06-19 09:53:50 [core.py:515] super().__init__(input_size,
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 243, in __init__
ERROR 06-19 09:53:50 [core.py:515] self.quant_method = quant_config.get_quant_method(self,
ERROR 06-19 09:53:50 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 94, in get_quant_method
ERROR 06-19 09:53:50 [core.py:515] scheme = self.get_scheme(layer=layer, layer_name=prefix)
ERROR 06-19 09:53:50 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 489, in get_scheme
ERROR 06-19 09:53:50 [core.py:515] scheme = self._get_scheme_from_parts( # type: ignore
ERROR 06-19 09:53:50 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 09:53:50 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 414, in _get_scheme_from_parts
ERROR 06-19 09:53:50 [core.py:515] raise NotImplementedError(
ERROR 06-19 09:53:50 [core.py:515] NotImplementedError: No compressed-tensors compatible scheme was found.
Process EngineCore_0:
Traceback (most recent call last):
File "/home/ubuntu/.pyenv/versions/3.11.10/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/home/ubuntu/.pyenv/versions/3.11.10/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 519, in run_engine_core
raise e
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 506, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 390, in __init__
super().__init__(vllm_config, executor_class, log_stats,
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 76, in __init__
self.model_executor = executor_class(vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 53, in __init__
self._init_executor()
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 48, in _init_executor
self.collective_rpc("load_model")
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/utils.py", line 2671, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 180, in load_model
self.model_runner.load_model()
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1601, in load_model
self.model = model_loader.load_model(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/model_loader/base_loader.py", line 38, in load_model
model = initialize_model(vllm_config=vllm_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/model_loader/utils.py", line 62, in initialize_model
return model_class(vllm_config=vllm_config, prefix=prefix)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 521, in __init__
self.model = self._init_model(vllm_config=vllm_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 567, in _init_model
return LlamaModel(vllm_config=vllm_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 152, in __init__
old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 346, in __init__
self.start_layer, self.end_layer, self.layers = make_layers(
^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 626, in make_layers
[PPMissingLayer() for _ in range(start_layer)] + [
^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 627, in <listcomp>
maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 348, in <lambda>
lambda prefix: layer_type(config=config,
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 263, in __init__
self.self_attn = LlamaAttention(
^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 148, in __init__
self.qkv_proj = QKVParallelLinear(
^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 851, in __init__
super().__init__(input_size=input_size,
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 397, in __init__
super().__init__(input_size,
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 243, in __init__
self.quant_method = quant_config.get_quant_method(self,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 94, in get_quant_method
scheme = self.get_scheme(layer=layer, layer_name=prefix)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 489, in get_scheme
scheme = self._get_scheme_from_parts( # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 414, in _get_scheme_from_parts
raise NotImplementedError(
NotImplementedError: No compressed-tensors compatible scheme was found.
<output truncated due to github char limit>
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
It looks like the quantization config differs slightly from what vLLM expects. In `vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py`, `CompressedTensorsConfig._get_scheme_from_parts` runs the quant-type check `_is_fp4a4_nvfp4`, which fails because of the quantization strategy in the spec:
```python
weight_quant = QuantizationArgs(
    num_bits=4,
    type='float',
    symmetric=True,
    group_size=16,
    strategy='group',  # <--- should be 'tensor_group'
    block_structure=None,
    dynamic=False,
    actorder=None,
    observer='minmax',
    observer_kwargs={},
)
input_quant = QuantizationArgs(
    num_bits=4,
    type='float',
    symmetric=True,
    group_size=16,
    strategy='group',  # <--- should be 'tensor_group'
    block_structure=None,
    dynamic=False,
    actorder=None,
    observer='minmax',
    observer_kwargs={},
)
quant_format = 'nvfp4-pack-quantized'
```
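For context, the gate that selects the NVFP4 scheme appears to boil down to a strategy/shape check along these lines (a paraphrased sketch inferred from the behavior above, not the verbatim vLLM source; field names follow compressed-tensors' `QuantizationArgs`):

```python
# Paraphrased sketch of the condition _is_fp4a4_nvfp4 seems to enforce.
# With strategy='group' (as in this checkpoint) it returns False, so no
# compressed-tensors scheme matches and NotImplementedError is raised.
def looks_like_fp4a4_nvfp4(weight_quant, input_quant) -> bool:
    def is_nvfp4(args) -> bool:
        return (
            args.num_bits == 4
            and args.type == "float"
            and args.group_size == 16
            and args.strategy == "tensor_group"  # 'group' fails this test
        )

    return is_nvfp4(weight_quant) and is_nvfp4(input_quant)
```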
I also tested a Qwen 3 8B model NVFP4-quantized with llm-compressor. Its quant spec seems correctly configured:
```python
weight_quant = QuantizationArgs(
    num_bits=4,
    type='float',
    symmetric=True,
    group_size=16,
    strategy='tensor_group',  # differs from above
    block_structure=None,
    dynamic=False,
    actorder=None,
    observer='minmax',
    observer_kwargs={},
)
input_quant = QuantizationArgs(
    num_bits=4,
    type='float',
    symmetric=True,
    group_size=16,
    strategy='tensor_group',  # differs from above
    block_structure=None,
    dynamic='local',  # differs from above
    actorder=None,
    observer='minmax',
    observer_kwargs={},
)
quant_format = 'nvfp4-pack-quantized'
```
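(As an aside, the strategy field can be checked straight from a checkpoint's `config.json` without starting the engine; a minimal sketch, assuming the usual compressed-tensors `quantization_config` layout with a `config_groups` section:)

```python
# Minimal sketch: read the quantization strategies out of config.json.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download("nm-testing/TinyLlama-1.1B-Chat-v1.0-NVFP4A4", "config.json")
with open(path) as f:
    qcfg = json.load(f)["quantization_config"]

for name, group in qcfg["config_groups"].items():
    weights = group["weights"]
    inputs = group.get("input_activations") or {}
    print(name, "weights strategy:", weights.get("strategy"),
          "| input strategy:", inputs.get("strategy"))
```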
However, despite the correct-looking spec, the program crashes at a later step, when vLLM tries to `torch.compile` the model:
output & stack trace
<truncated identical output due to github char limit>
INFO 06-19 10:27:33 [gpu_model_runner.py:1600] Loading model from scratch...
WARNING 06-19 10:27:33 [compressed_tensors_w4a4_nvfp4.py:38] Current platform does not support cutlass NVFP4. Running emulations.
INFO 06-19 10:27:34 [cuda.py:252] Using Flash Attention backend on V1 engine.
WARNING 06-19 10:27:34 [compressed_tensors_w4a4_nvfp4.py:38] Current platform does not support cutlass NVFP4. Running emulations.
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 1.17it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.62it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.53it/s]
INFO 06-19 10:27:35 [default_loader.py:272] Loading weights took 1.38 seconds
INFO 06-19 10:27:36 [gpu_model_runner.py:1624] Model loading took 6.3905 GiB and 2.158821 seconds
INFO 06-19 10:28:14 [backends.py:462] Using cache directory: /home/ubuntu/.cache/vllm/torch_compile_cache/da6494da1e/rank_0_0 for vLLM's torch.compile
INFO 06-19 10:28:14 [backends.py:472] Dynamo bytecode transform time: 37.87 s
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] Triton compilation failed: triton_per_fused__to_copy_abs_clamp_eq_index_put_lift_fresh_max_mul_reciprocal_where_1
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] def triton_per_fused__to_copy_abs_clamp_eq_index_put_lift_fresh_max_mul_reciprocal_where_1(in_out_ptr0, in_ptr0, in_ptr1, in_ptr2, in_ptr3, out_ptr0, out_ptr1, xnumel, r0_numel, XBLOCK : tl.constexpr):
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] r0_numel = 16
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] R0_BLOCK: tl.constexpr = 16
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] rnumel = r0_numel
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] RBLOCK: tl.constexpr = R0_BLOCK
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] xoffset = tl.program_id(0) * XBLOCK
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] xmask = xindex < xnumel
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] r0_index = tl.arange(0, R0_BLOCK)[None, :]
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] r0_offset = 0
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] r0_mask = tl.full([XBLOCK, R0_BLOCK], True, tl.int1)
<truncated low level output due to github char limit>
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] tmp88 = tmp87 > tmp83
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] tmp89 = tl.where(tmp88, tmp45, tmp87)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] tl.store(out_ptr1 + (r0_2 + 16*x3), tmp42, xmask)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] tl.store(in_out_ptr0 + (r0_2 + 16*x3), tmp89, xmask)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] tl.store(out_ptr0 + (x3), tmp16, xmask)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] metadata: {'signature': {'in_out_ptr0': '*fp32', 'in_ptr0': '*bf16', 'in_ptr1': '*fp32', 'in_ptr2': '*bf16', 'in_ptr3': '*fp32', 'out_ptr0': '*bf16', 'out_ptr1': '*fp32', 'xnumel': 'i32', 'r0_numel': 'i32', 'XBLOCK': 'constexpr'}, 'device': 0, 'constants': {'XBLOCK': 1}, 'configs': [{(0,): [['tt.divisibility', 16]], (1,): [['tt.divisibility', 16]], (2,): [['tt.divisibility', 16]], (3,): [['tt.divisibility', 16]], (4,): [['tt.divisibility', 16]], (5,): [['tt.divisibility', 16]], (6,): [['tt.divisibility', 16]], (7,): [['tt.divisibility', 16]], (8,): [['tt.divisibility', 16]]}], 'device_type': 'cuda', 'num_warps': 2, 'num_stages': 1, 'debug': True, 'cc': 86}
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] Traceback (most recent call last):
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/language/core.py", line 34, in wrapper
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] return fn(*args, **kwargs)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] ^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/language/core.py", line 1043, in to
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] return cast(self, dtype, fp_downcast_rounding, bitcast, _builder=_builder)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/language/core.py", line 34, in wrapper
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] return fn(*args, **kwargs)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] ^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/language/core.py", line 1772, in cast
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] return semantic.cast(input, dtype, _builder, fp_downcast_rounding)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/language/semantic.py", line 874, in cast
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] return tl.tensor(builder.create_fp_to_fp(input.handle, dst_ty.to_ir(builder), fp_downcast_rounding), dst_ty)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] ^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/language/core.py", line 652, in to_ir
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] return builder.get_block_ty(self.element_ty.to_ir(builder), self.shape)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/language/core.py", line 524, in to_ir
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] raise ValueError(f'type {self} not supported in this architecture. '
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] ValueError: type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] The above exception was the direct cause of the following exception:
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0]
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] Traceback (most recent call last):
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 537, in _precompile_config
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] binary = triton.compile(*compile_args, **compile_kwargs)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/compiler/compiler.py", line 278, in compile
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] module = src.make_ir(options, codegen_fns, module_map, context)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/compiler/compiler.py", line 81, in make_ir
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] triton.compiler.errors.CompilationError: at 45:12:
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] tmp15 = tl.where(xmask, tmp13, float("-inf"))
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] tmp16 = triton_helpers.max2(tmp15, 1)[:, None]
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] tmp17 = tmp11.to(tl.float32)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] tmp20 = tmp16.to(tl.float32)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] tmp21 = 0.16666666666666666
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] tmp22 = tmp20 * tmp21
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] tmp23 = tmp19 * tmp22
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] tmp24 = -448.0
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] tmp25 = triton_helpers.maximum(tmp23, tmp24)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] tmp26 = 448.0
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] tmp27 = triton_helpers.minimum(tmp25, tmp26)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] tmp28 = tmp27.to(tl.float8e4nv)
[rank0]:E0619 10:28:20.556000 2582540 torch/_inductor/runtime/triton_heuristics.py:539] [0/0] ^
ERROR 06-19 10:28:20 [core.py:515] EngineCore failed to start.
ERROR 06-19 10:28:20 [core.py:515] Traceback (most recent call last):
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 506, in run_engine_core
ERROR 06-19 10:28:20 [core.py:515] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 06-19 10:28:20 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 390, in __init__
ERROR 06-19 10:28:20 [core.py:515] super().__init__(vllm_config, executor_class, log_stats,
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 83, in __init__
ERROR 06-19 10:28:20 [core.py:515] self._initialize_kv_caches(vllm_config)
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 141, in _initialize_kv_caches
ERROR 06-19 10:28:20 [core.py:515] available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 06-19 10:28:20 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
ERROR 06-19 10:28:20 [core.py:515] output = self.collective_rpc("determine_available_memory")
ERROR 06-19 10:28:20 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 06-19 10:28:20 [core.py:515] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 06-19 10:28:20 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/utils.py", line 2671, in run_method
ERROR 06-19 10:28:20 [core.py:515] return func(*args, **kwargs)
ERROR 06-19 10:28:20 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-19 10:28:20 [core.py:515] return func(*args, **kwargs)
ERROR 06-19 10:28:20 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 205, in determine_available_memory
ERROR 06-19 10:28:20 [core.py:515] self.model_runner.profile_run()
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2012, in profile_run
ERROR 06-19 10:28:20 [core.py:515] hidden_states = self._dummy_run(self.max_num_tokens)
ERROR 06-19 10:28:20 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-19 10:28:20 [core.py:515] return func(*args, **kwargs)
ERROR 06-19 10:28:20 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1847, in _dummy_run
ERROR 06-19 10:28:20 [core.py:515] outputs = model(
ERROR 06-19 10:28:20 [core.py:515] ^^^^^^
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-19 10:28:20 [core.py:515] return self._call_impl(*args, **kwargs)
ERROR 06-19 10:28:20 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-19 10:28:20 [core.py:515] return forward_call(*args, **kwargs)
ERROR 06-19 10:28:20 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/qwen3.py", line 301, in forward
ERROR 06-19 10:28:20 [core.py:515] hidden_states = self.model(input_ids, positions, intermediate_tensors,
ERROR 06-19 10:28:20 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 239, in __call__
ERROR 06-19 10:28:20 [core.py:515] output = self.compiled_callable(*args, **kwargs)
ERROR 06-19 10:28:20 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 663, in _fn
ERROR 06-19 10:28:20 [core.py:515] raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
ERROR 06-19 10:28:20 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 760, in _compile_fx_inner
ERROR 06-19 10:28:20 [core.py:515] raise InductorError(e, currentframe()).with_traceback(
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 745, in _compile_fx_inner
ERROR 06-19 10:28:20 [core.py:515] mb_compiled_graph = fx_codegen_and_compile(
ERROR 06-19 10:28:20 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 1295, in fx_codegen_and_compile
ERROR 06-19 10:28:20 [core.py:515] return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
ERROR 06-19 10:28:20 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 1197, in codegen_and_compile
ERROR 06-19 10:28:20 [core.py:515] compiled_fn = graph.compile_to_module().call
ERROR 06-19 10:28:20 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/graph.py", line 2083, in compile_to_module
ERROR 06-19 10:28:20 [core.py:515] return self._compile_to_module()
ERROR 06-19 10:28:20 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/graph.py", line 2130, in _compile_to_module
ERROR 06-19 10:28:20 [core.py:515] mod = PyCodeCache.load_by_key_path(
ERROR 06-19 10:28:20 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 2747, in load_by_key_path
ERROR 06-19 10:28:20 [core.py:515] mod = _reload_python_module(key, path)
ERROR 06-19 10:28:20 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/compile_tasks.py", line 36, in _reload_python_module
ERROR 06-19 10:28:20 [core.py:515] exec(code, mod.__dict__, mod.__dict__)
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/.cache/vllm/torch_compile_cache/da6494da1e/rank_0_0/inductor_cache/db/cdbabxxysvzdh5a2wghjuc5u2rz5ytt7a2s6eh633mh3elksf5xx.py", line 174, in <module>
ERROR 06-19 10:28:20 [core.py:515] triton_per_fused__to_copy_abs_clamp_eq_index_put_lift_fresh_max_mul_reciprocal_where_1 = async_compile.triton('triton_per_fused__to_copy_abs_clamp_eq_index_put_lift_fresh_max_mul_reciprocal_where_1', '''
ERROR 06-19 10:28:20 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/async_compile.py", line 346, in triton
ERROR 06-19 10:28:20 [core.py:515] kernel.precompile(warm_cache_only=False)
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 276, in precompile
ERROR 06-19 10:28:20 [core.py:515] self._precompile_worker()
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 296, in _precompile_worker
ERROR 06-19 10:28:20 [core.py:515] compile_results.append(self._precompile_config(c))
ERROR 06-19 10:28:20 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 537, in _precompile_config
ERROR 06-19 10:28:20 [core.py:515] binary = triton.compile(*compile_args, **compile_kwargs)
ERROR 06-19 10:28:20 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/compiler/compiler.py", line 278, in compile
ERROR 06-19 10:28:20 [core.py:515] module = src.make_ir(options, codegen_fns, module_map, context)
ERROR 06-19 10:28:20 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/compiler/compiler.py", line 81, in make_ir
ERROR 06-19 10:28:20 [core.py:515] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
ERROR 06-19 10:28:20 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 10:28:20 [core.py:515] torch._inductor.exc.InductorError: CompilationError: at 45:12:
ERROR 06-19 10:28:20 [core.py:515] tmp15 = tl.where(xmask, tmp13, float("-inf"))
ERROR 06-19 10:28:20 [core.py:515] tmp16 = triton_helpers.max2(tmp15, 1)[:, None]
ERROR 06-19 10:28:20 [core.py:515] tmp17 = tmp11.to(tl.float32)
ERROR 06-19 10:28:20 [core.py:515] tmp20 = tmp16.to(tl.float32)
ERROR 06-19 10:28:20 [core.py:515] tmp21 = 0.16666666666666666
ERROR 06-19 10:28:20 [core.py:515] tmp22 = tmp20 * tmp21
ERROR 06-19 10:28:20 [core.py:515] tmp23 = tmp19 * tmp22
ERROR 06-19 10:28:20 [core.py:515] tmp24 = -448.0
ERROR 06-19 10:28:20 [core.py:515] tmp25 = triton_helpers.maximum(tmp23, tmp24)
ERROR 06-19 10:28:20 [core.py:515] tmp26 = 448.0
ERROR 06-19 10:28:20 [core.py:515] tmp27 = triton_helpers.minimum(tmp25, tmp26)
ERROR 06-19 10:28:20 [core.py:515] tmp28 = tmp27.to(tl.float8e4nv)
ERROR 06-19 10:28:20 [core.py:515] ^
ERROR 06-19 10:28:20 [core.py:515]
ERROR 06-19 10:28:20 [core.py:515] Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
ERROR 06-19 10:28:20 [core.py:515]
Process EngineCore_0:
Traceback (most recent call last):
File "/home/ubuntu/.pyenv/versions/3.11.10/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/home/ubuntu/.pyenv/versions/3.11.10/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 519, in run_engine_core
raise e
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 506, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 390, in __init__
super().__init__(vllm_config, executor_class, log_stats,
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 83, in __init__
self._initialize_kv_caches(vllm_config)
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 141, in _initialize_kv_caches
available_gpu_memory = self.model_executor.determine_available_memory()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
output = self.collective_rpc("determine_available_memory")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/utils.py", line 2671, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 205, in determine_available_memory
self.model_runner.profile_run()
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2012, in profile_run
hidden_states = self._dummy_run(self.max_num_tokens)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1847, in _dummy_run
outputs = model(
^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/model_executor/models/qwen3.py", line 301, in forward
hidden_states = self.model(input_ids, positions, intermediate_tensors,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 239, in __call__
output = self.compiled_callable(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 663, in _fn
raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 760, in _compile_fx_inner
raise InductorError(e, currentframe()).with_traceback(
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 745, in _compile_fx_inner
mb_compiled_graph = fx_codegen_and_compile(
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 1295, in fx_codegen_and_compile
return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 1197, in codegen_and_compile
compiled_fn = graph.compile_to_module().call
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/graph.py", line 2083, in compile_to_module
return self._compile_to_module()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/graph.py", line 2130, in _compile_to_module
mod = PyCodeCache.load_by_key_path(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 2747, in load_by_key_path
mod = _reload_python_module(key, path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/compile_tasks.py", line 36, in _reload_python_module
exec(code, mod.__dict__, mod.__dict__)
File "/home/ubuntu/.cache/vllm/torch_compile_cache/da6494da1e/rank_0_0/inductor_cache/db/cdbabxxysvzdh5a2wghjuc5u2rz5ytt7a2s6eh633mh3elksf5xx.py", line 174, in <module>
triton_per_fused__to_copy_abs_clamp_eq_index_put_lift_fresh_max_mul_reciprocal_where_1 = async_compile.triton('triton_per_fused__to_copy_abs_clamp_eq_index_put_lift_fresh_max_mul_reciprocal_where_1', '''
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/async_compile.py", line 346, in triton
kernel.precompile(warm_cache_only=False)
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 276, in precompile
self._precompile_worker()
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 296, in _precompile_worker
compile_results.append(self._precompile_config(c))
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 537, in _precompile_config
binary = triton.compile(*compile_args, **compile_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/compiler/compiler.py", line 278, in compile
module = src.make_ir(options, codegen_fns, module_map, context)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/git/polywhirl/.venv_temp/lib/python3.11/site-packages/triton/compiler/compiler.py", line 81, in make_ir
return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._inductor.exc.InductorError: CompilationError: at 45:12:
tmp15 = tl.where(xmask, tmp13, float("-inf"))
tmp16 = triton_helpers.max2(tmp15, 1)[:, None]
tmp17 = tmp11.to(tl.float32)
tmp20 = tmp16.to(tl.float32)
tmp21 = 0.16666666666666666
tmp22 = tmp20 * tmp21
tmp23 = tmp19 * tmp22
tmp24 = -448.0
tmp25 = triton_helpers.maximum(tmp23, tmp24)
tmp26 = 448.0
tmp27 = triton_helpers.minimum(tmp25, tmp26)
tmp28 = tmp27.to(tl.float8e4nv)
^
Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
<output truncated due to github char limit>
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
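For what it's worth, the A10G is compute capability 8.6 (`'cc': 86` in the kernel metadata above), and Triton only supports the `fp8e4nv` (E4M3) dtype on newer architectures, which is presumably why the Inductor-generated emulation kernel fails to compile here. A possible (untested) way to sidestep the compiled kernel is to disable compilation entirely:

```python
# Untested workaround sketch: enforce_eager=True skips torch.compile, so the
# failing Triton kernel is never generated. Whether the eager NVFP4 emulation
# path then works on SM 8.6 is an open question.
from vllm import LLM

llm = LLM(
    "path/to/your-nvfp4-model",  # hypothetical: substitute the Qwen 3 8B checkpoint used above
    enforce_eager=True,
)
```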