
[Bug]: Qwen3-Next Fails when running Guided Choice #24881

@Blaze-DSP

Description

🐛 Describe the bug

containers:
  - name: vllm-openai
    image: vllm/vllm-openai:v0.10.2
    imagePullPolicy: IfNotPresent
    env:
      - name: VLLM_FLASH_ATTN_VERSION
        value: "3"
      - name: VLLM_ALLOW_LONG_MAX_MODEL_LEN
        value: "1"
    command:
      - vllm
      - serve
      - /mnt/models/qwen3-next
      - --host
      - "0.0.0.0"
      - --port
      - "8000"
      - --uvicorn-log-level
      - warning
      - --enable-log-requests
      - --enable-log-outputs
      - --served-model-name
      - qwen3-next
      - --trust-remote-code
      - --gpu-memory-utilization
      - "0.9"
      - --enable-prefix-caching
      - --max-model-len
      - "12288"
      - --enable-auto-tool-choice
      - --tool-call-parser
      - hermes
      - --tensor-parallel-size
      - "4"
      - --speculative-config
      - '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

When using guided choice, the service crashes with a [CUDA] illegal memory access error.

Client:

system_prompt = "You are a helpful assistant. Answer the following question concisely. Give only the final Answer, don't give justifications."
user_prompt = f"""Question: "{transcription}"

Answer: 
"""
import openai

qwen3= openai.AsyncOpenAI(base_url=...,api_key=...)

system_prompt = "You are a helpful assistant. Answer the following question concisely. Give only the final Answer, don't give justifications."
user_prompt = f"""Question: "{transcription}"

Answer: 
"""

req = {
    "model": model,
    "messages": [
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": user_prompt,
        }
    ],
    "temperature": 0,
    "max_completion_tokens": 1024,
    "max_tokens": 1024,
    "extra_body": {"guided_choice": ["Valid", "Invalid"]},
}

start = time.perf_counter()
response = await qwen3.chat.completions.create(**req)
print("\nTime:", time.perf_counter() - start)
print("\nResponse:", response.choices[0].message.content)

Logs:

(Worker_TP2 pid=568) INFO 09-15 04:42:14 [custom_all_reduce.py:203] Registering 10392 cuda graph addresses
(Worker_TP3 pid=569) INFO 09-15 04:42:15 [custom_all_reduce.py:203] Registering 10392 cuda graph addresses
(Worker_TP0 pid=566) INFO 09-15 04:42:15 [custom_all_reduce.py:203] Registering 10392 cuda graph addresses
(Worker_TP1 pid=567) INFO 09-15 04:42:15 [custom_all_reduce.py:203] Registering 10392 cuda graph addresses
(Worker_TP2 pid=568) INFO 09-15 04:42:15 [gpu_model_runner.py:3118] Graph capturing finished in 60 secs, took 1.60 GiB
(Worker_TP2 pid=568) INFO 09-15 04:42:15 [gpu_worker.py:391] Free memory on device (78.6/79.2 GiB) on startup. Desired GPU memory utilization is (0.9, 71.28 GiB). Actual usage is 37.99 GiB for weight, 4.71 GiB for peak activation, 1.84 GiB for non-torch memory, and 1.6 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=26848822374` to fit into requested memory, or `--kv-cache-memory=34704236032` to fully utilize gpu memory. Current kv cache memory in use is 28719481958 bytes.
(Worker_TP1 pid=567) INFO 09-15 04:42:15 [gpu_model_runner.py:3118] Graph capturing finished in 60 secs, took 1.60 GiB
(Worker_TP1 pid=567) INFO 09-15 04:42:15 [gpu_worker.py:391] Free memory on device (78.6/79.2 GiB) on startup. Desired GPU memory utilization is (0.9, 71.28 GiB). Actual usage is 37.99 GiB for weight, 4.71 GiB for peak activation, 1.84 GiB for non-torch memory, and 1.6 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=26848822374` to fit into requested memory, or `--kv-cache-memory=34704236032` to fully utilize gpu memory. Current kv cache memory in use is 28719481958 bytes.
(Worker_TP0 pid=566) INFO 09-15 04:42:15 [gpu_model_runner.py:3118] Graph capturing finished in 60 secs, took 1.60 GiB
(Worker_TP0 pid=566) INFO 09-15 04:42:15 [gpu_worker.py:391] Free memory on device (78.6/79.2 GiB) on startup. Desired GPU memory utilization is (0.9, 71.28 GiB). Actual usage is 37.99 GiB for weight, 4.71 GiB for peak activation, 1.84 GiB for non-torch memory, and 1.6 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=26848822374` to fit into requested memory, or `--kv-cache-memory=34704236032` to fully utilize gpu memory. Current kv cache memory in use is 28719481958 bytes.
(Worker_TP3 pid=569) INFO 09-15 04:42:15 [gpu_model_runner.py:3118] Graph capturing finished in 60 secs, took 1.60 GiB
(Worker_TP3 pid=569) INFO 09-15 04:42:15 [gpu_worker.py:391] Free memory on device (78.6/79.2 GiB) on startup. Desired GPU memory utilization is (0.9, 71.28 GiB). Actual usage is 37.99 GiB for weight, 4.71 GiB for peak activation, 1.84 GiB for non-torch memory, and 1.6 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=26848822374` to fit into requested memory, or `--kv-cache-memory=34704236032` to fully utilize gpu memory. Current kv cache memory in use is 28719481958 bytes.
(EngineCore_DP0 pid=433) INFO 09-15 04:42:15 [core.py:218] init engine (profile, create kv cache, warmup model) took 141.72 seconds
(APIServer pid=1) INFO 09-15 04:42:16 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 7491
(APIServer pid=1) INFO 09-15 04:42:16 [async_llm.py:180] Torch profiler disabled. AsyncLLM CPU traces will not be collected.
(APIServer pid=1) INFO 09-15 04:42:16 [api_server.py:1692] Supported_tasks: ['generate']
(APIServer pid=1) WARNING 09-15 04:42:16 [__init__.py:1695] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1) INFO 09-15 04:42:16 [serving_responses.py:130] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=1) INFO 09-15 04:42:16 [serving_responses.py:159] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
(APIServer pid=1) INFO 09-15 04:42:16 [serving_chat.py:97] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
(APIServer pid=1) INFO 09-15 04:42:16 [serving_chat.py:137] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=1) INFO 09-15 04:42:16 [serving_completion.py:76] Using default completion sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=1) INFO 09-15 04:42:16 [api_server.py:1971] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:36] Available routes are:
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /docs, Methods: GET, HEAD
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /health, Methods: GET
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /load, Methods: GET
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /ping, Methods: POST
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /ping, Methods: GET
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /version, Methods: GET
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /v1/embeddings, Methods: POST
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /pooling, Methods: POST
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /classify, Methods: POST
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /score, Methods: POST
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /v1/score, Methods: POST
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /rerank, Methods: POST
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /v1/rerank, Methods: POST
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /v2/rerank, Methods: POST
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 09-15 04:42:16 [launcher.py:44] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 09-15 04:42:18 [chat_utils.py:538] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) INFO 09-15 04:42:18 [logger.py:40] Received request chatcmpl-56b1beb3755646eaba1fa1441c13def0: prompt: '<|im_start|>user\nHello! How are you?<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=12274, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None.
(APIServer pid=1) INFO 09-15 04:42:18 [async_llm.py:321] Added request chatcmpl-56b1beb3755646eaba1fa1441c13def0.
(APIServer pid=1) INFO 09-15 04:43:29 [logger.py:71] Generated response chatcmpl-56b1beb3755646eaba1fa1441c13def0: output: "Hello! 😊 I'm doing great—thanks for asking! How about you? I hope you're having a wonderful day! 🌟", output_token_ids: [9707, 0, 26525, 232, 358, 2776, 3730, 2244, 2293, 45493, 369, 10161, 0, 2585, 911, 498, 30, 358, 3900, 498, 2299, 3432, 264, 11117, 1899, 0, 11162, 234, 253, 151645], finish_reason: stop
(APIServer pid=1) INFO 09-15 04:43:29 [logger.py:40] Received request chatcmpl-31fd118f7c904e71bc2ec4946b36e72d: prompt: '<|im_start|>user\nDefine AI in one line.<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=12274, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None.
(APIServer pid=1) INFO 09-15 04:43:29 [async_llm.py:321] Added request chatcmpl-31fd118f7c904e71bc2ec4946b36e72d.
(APIServer pid=1) INFO 09-15 04:43:30 [logger.py:71] Generated response chatcmpl-31fd118f7c904e71bc2ec4946b36e72d: output: 'AI is the simulation of human intelligence in machines that are programmed to think, learn, and perform tasks like reasoning, problem-solving, and decision-making.', output_token_ids: [15469, 374, 279, 19038, 315, 3738, 11229, 304, 12645, 429, 525, 55068, 311, 1744, 11, 3960, 11, 323, 2736, 9079, 1075, 32711, 11, 3491, 98146, 11, 323, 5480, 27746, 13, 151645], finish_reason: stop
(APIServer pid=1) INFO 09-15 04:43:30 [logger.py:40] Received request chatcmpl-efc1583ab16b4c1a965fe60bf5ac6da4: prompt: '<|im_start|>user\nExplain Quantum Mechanics within 100 words.<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=12269, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None.
(APIServer pid=1) INFO 09-15 04:43:30 [async_llm.py:321] Added request chatcmpl-efc1583ab16b4c1a965fe60bf5ac6da4.
(APIServer pid=1) INFO 09-15 04:43:37 [loggers.py:123] Engine 000: Avg prompt throughput: 2.8 tokens/s, Avg generation throughput: 6.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 09-15 04:43:37 [metrics.py:96] SpecDecoding metrics: Mean acceptance length: 2.50, Accepted throughput: 0.45 tokens/s, Drafted throughput: 0.60 tokens/s, Accepted: 36 tokens, Drafted: 48 tokens, Per-position acceptance rate: 0.833, 0.667, Avg Draft acceptance rate: 75.0%
(APIServer pid=1) INFO 09-15 04:43:47 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 09-15 04:43:50 [logger.py:71] Generated response chatcmpl-efc1583ab16b4c1a965fe60bf5ac6da4: output: 'Quantum Mechanics is the physics theory describing nature at atomic and subatomic scales, where particles behave as both waves and particles (wave-particle duality). It uses probability waves (wavefunctions) to predict outcomes, not certainties. Key principles include superposition (objects exist in multiple states at once) and entanglement (linked particles affect each other instantly, regardless of distance). Observing a system collapses its wavefunction into one state. Governed by the Schrödinger equation, it underpins modern tech like lasers, semiconductors, and quantum computers—challenging classical intuition with inherent uncertainty and non-locality.', output_token_ids: [44220, 372, 76823, 374, 279, 21321, 10126, 22692, 6993, 518, 24510, 323, 1186, 6618, 28405, 11, 1380, 18730, 35692, 438, 2176, 16876, 323, 18730, 320, 30398, 2268, 7058, 294, 10733, 568, 1084, 5711, 18927, 16876, 320, 30398, 21409, 8, 311, 7023, 19554, 11, 537, 2777, 61024, 13, 5309, 16170, 2924, 2256, 3487, 320, 19210, 3000, 304, 5248, 5302, 518, 3055, 8, 323, 1197, 524, 986, 320, 43133, 18730, 7802, 1817, 1008, 21818, 11, 15484, 315, 6010, 568, 30843, 287, 264, 1849, 86358, 1181, 12060, 1688, 1119, 825, 1584, 13, 7955, 291, 553, 279, 5016, 131315, 67, 5137, 23606, 11, 432, 1212, 74558, 6481, 13014, 1075, 71375, 11, 5234, 1924, 1058, 1087, 11, 323, 30128, 18495, 2293, 331, 33769, 287, 28824, 56251, 448, 36988, 26826, 323, 2477, 40060, 487, 13, 151645], finish_reason: stop
(APIServer pid=1) INFO 09-15 04:43:57 [loggers.py:123] Engine 000: Avg prompt throughput: 1.9 tokens/s, Avg generation throughput: 13.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 09-15 04:43:57 [metrics.py:96] SpecDecoding metrics: Mean acceptance length: 2.34, Accepted throughput: 3.75 tokens/s, Drafted throughput: 5.60 tokens/s, Accepted: 75 tokens, Drafted: 112 tokens, Per-position acceptance rate: 0.821, 0.518, Avg Draft acceptance rate: 67.0%
(APIServer pid=1) INFO 09-15 04:44:07 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 09-15 04:46:28 [logger.py:40] Received request chatcmpl-08edc681bb2a6be90bde32f2623d169f: prompt: '<|im_start|>system\nYou are a helpful assistant. Answer the following question concisely. Give only the final Answer, don\'t give justifications.<|im_end|>\n<|im_start|>user\nQuestion: "Here comes a perfectly valid argument. First of all, whoever is a schoolmate of Sandra is not a stepsister of Priscilla. In consequence, whoever is not a stepsister of Priscilla is a schoolmate of Sandra. Is the argument, given the explicitly stated premises, deductively valid or invalid? Options: Valid Invalid Answer the question."\n\nAnswer: \n<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1024, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=['Valid', 'Invalid'], grammar=None, json_object=None, backend=None, backend_was_auto=False, disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, whitespace_pattern=None, structural_tag=None), extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None.
(APIServer pid=1) INFO 09-15 04:46:28 [async_llm.py:321] Added request chatcmpl-08edc681bb2a6be90bde32f2623d169f.
[04:46:59] /project/cpp/grammar_matcher.cc:370: Warning: The matcher has terminated after accepting the stop token, but is trying to accept new token with id 151643.
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654] WorkerProc hit an exception.
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654] Traceback (most recent call last):
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 649, in worker_busy_loop
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     output = func(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]              ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return func(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 436, in execute_model
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     output = self.model_runner.execute_model(scheduler_output,
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return func(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2064, in execute_model
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     model_output = self.model(
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]                    ^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 119, in __call__
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return self.runnable(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return self._call_impl(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return forward_call(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 1169, in forward
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     hidden_states = self.model(input_ids, positions, intermediate_tensors,
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 312, in __call__
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     model_output = self.forward(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 930, in forward
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     def forward(
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 375, in __call__
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return super().__call__(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return self._call_impl(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return forward_call(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return fn(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 848, in call_wrapped
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return self._wrapped_call(self, *args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 424, in __call__
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     raise e
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 411, in __call__
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return self._call_impl(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return forward_call(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "<eval_with_key>.98", line 350, in forward
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     submod_1 = self.submod_1(getitem, s72, getitem_1);  getitem = submod_1 = None
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 848, in call_wrapped
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return self._wrapped_call(self, *args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 424, in __call__
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     raise e
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 411, in __call__
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return self._call_impl(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return forward_call(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "<eval_with_key>.2", line 5, in forward
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     gdn_attention = torch.ops.vllm.gdn_attention(x_3, self_attention_output, 'model.layers.0.linear_attn');  x_3 = self_attention_output = gdn_attention = None
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1243, in __call__
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return self._op(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 1229, in gdn_attention
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     self._forward(hidden_states=hidden_states, output=output)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 538, in _forward
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     fused_recurrent_gated_delta_rule(
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/fused_recurrent.py", line 352, in fused_recurrent_gated_delta_rule
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     o, final_state = FusedRecurrentFunction.apply(
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 576, in apply
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return super().apply(*args, **kwargs)  # type: ignore[misc]
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/fused_recurrent.py", line 246, in forward
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     o, final_state = fused_recurrent_gated_delta_rule_fwd(
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/fused_recurrent.py", line 194, in fused_recurrent_gated_delta_rule_fwd
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     fused_recurrent_gated_delta_rule_fwd_kernel[grid](
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 390, in <lambda>
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/autotuner.py", line 453, in run
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return self.fn.run(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 617, in run
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     ^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 498, in __getattribute__
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     self._init_handles()
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 490, in _init_handles
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     self.module, self.function, self.n_regs, self.n_spills, self.n_max_threads = driver.active.utils.load_binary(
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]                                                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654] RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654] Traceback (most recent call last):
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 649, in worker_busy_loop
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     output = func(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]              ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return func(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 436, in execute_model
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     output = self.model_runner.execute_model(scheduler_output,
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return func(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2064, in execute_model
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     model_output = self.model(
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]                    ^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 119, in __call__
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return self.runnable(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return self._call_impl(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return forward_call(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 1169, in forward
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     hidden_states = self.model(input_ids, positions, intermediate_tensors,
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 312, in __call__
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     model_output = self.forward(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 930, in forward
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     def forward(
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 375, in __call__
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return super().__call__(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return self._call_impl(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return forward_call(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return fn(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 848, in call_wrapped
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return self._wrapped_call(self, *args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 424, in __call__
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     raise e
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 411, in __call__
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return self._call_impl(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return forward_call(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "<eval_with_key>.98", line 350, in forward
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     submod_1 = self.submod_1(getitem, s72, getitem_1);  getitem = submod_1 = None
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 848, in call_wrapped
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return self._wrapped_call(self, *args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 424, in __call__
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     raise e
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 411, in __call__
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return self._call_impl(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return forward_call(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "<eval_with_key>.2", line 5, in forward
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     gdn_attention = torch.ops.vllm.gdn_attention(x_3, self_attention_output, 'model.layers.0.linear_attn');  x_3 = self_attention_output = gdn_attention = None
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1243, in __call__
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return self._op(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 1229, in gdn_attention
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     self._forward(hidden_states=hidden_states, output=output)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 538, in _forward
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     fused_recurrent_gated_delta_rule(
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/fused_recurrent.py", line 352, in fused_recurrent_gated_delta_rule
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     o, final_state = FusedRecurrentFunction.apply(
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 576, in apply
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return super().apply(*args, **kwargs)  # type: ignore[misc]
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/fused_recurrent.py", line 246, in forward
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     o, final_state = fused_recurrent_gated_delta_rule_fwd(
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/fused_recurrent.py", line 194, in fused_recurrent_gated_delta_rule_fwd
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     fused_recurrent_gated_delta_rule_fwd_kernel[grid](
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 390, in <lambda>
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/autotuner.py", line 453, in run
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     return self.fn.run(*args, **kwargs)
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 617, in run
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     ^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 498, in __getattribute__
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     self._init_handles()
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 490, in _init_handles
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]     self.module, self.function, self.n_regs, self.n_spills, self.n_max_threads = driver.active.utils.load_binary(
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]                                                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654] RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
(Worker_TP2 pid=568) ERROR 09-15 04:47:00 [multiproc_executor.py:654]
...
...
...
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [dump_input.py:69] Dumping input data for V1 LLM engine (v0.10.2) with config: model='/mnt/models/qwen3-next', speculative_config=SpeculativeConfig(method='qwen3_next_mtp', model='/mnt/models/qwen3-next', num_spec_tokens=2), tokenizer='/mnt/models/qwen3-next', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=12288, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=qwen3-next, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,1],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null},
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [dump_input.py:76] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['chatcmpl-08edc681bb2a6be90bde32f2623d169f'], resumed_from_preemption=[false], new_token_ids=[], new_block_ids=[null], num_computed_tokens=[117]), num_scheduled_tokens={chatcmpl-08edc681bb2a6be90bde32f2623d169f: 2}, total_num_scheduled_tokens=2, scheduled_spec_decode_tokens={chatcmpl-08edc681bb2a6be90bde32f2623d169f: [151645]}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], structured_output_request_ids={chatcmpl-08edc681bb2a6be90bde32f2623d169f: 0}, grammar_bitmask=array([[ 0,  0,  0, ...,  0,  0,  0],
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [dump_input.py:76]        [-1, -1, -1, ..., -1, -1, -1]], shape=(2, 4748), dtype=int32), kv_connector_metadata=null)
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [dump_input.py:79] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.001335113484646211, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0), spec_decoding_stats=None, num_corrupted_reqs=0)
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720] Traceback (most recent call last):
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 711, in run_engine_core
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 738, in run_busy_loop
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720]     self._process_engine_step()
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 764, in _process_engine_step
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 292, in step
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720]     model_output = self.execute_model_with_error_logging(
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 278, in execute_model_with_error_logging
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720]     raise err
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 269, in execute_model_with_error_logging
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720]     return model_fn(scheduler_output)
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720]            ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 176, in execute_model
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720]     (output, ) = self.collective_rpc(
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720]                  ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 259, in collective_rpc
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720]     result = get_response(w, dequeue_timeout,
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 243, in get_response
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720]     raise RuntimeError(
(EngineCore_DP0 pid=433) ERROR 09-15 04:47:00 [core.py:720] RuntimeError: Worker failed with error 'Triton Error [CUDA]: an illegal memory access was encountered', please check the stack trace above for the root cause
(APIServer pid=1) ERROR 09-15 04:47:00 [async_llm.py:485] AsyncLLM output_handler failed.
(APIServer pid=1) ERROR 09-15 04:47:00 [async_llm.py:485] Traceback (most recent call last):
(APIServer pid=1) ERROR 09-15 04:47:00 [async_llm.py:485]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 444, in output_handler
(APIServer pid=1) ERROR 09-15 04:47:00 [async_llm.py:485]     outputs = await engine_core.get_output_async()
(APIServer pid=1) ERROR 09-15 04:47:00 [async_llm.py:485]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 09-15 04:47:00 [async_llm.py:485]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 845, in get_output_async
(APIServer pid=1) ERROR 09-15 04:47:00 [async_llm.py:485]     raise self._format_exception(outputs) from None
(APIServer pid=1) ERROR 09-15 04:47:00 [async_llm.py:485] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=1) INFO 09-15 04:47:00 [async_llm.py:411] Request chatcmpl-08edc681bb2a6be90bde32f2623d169f failed (engine dead).
(Worker_TP0 pid=566) INFO 09-15 04:47:00 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP0 pid=566) INFO 09-15 04:47:00 [multiproc_executor.py:587] WorkerProc shutting down.
(Worker_TP2 pid=568) INFO 09-15 04:47:00 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP1 pid=567) INFO 09-15 04:47:00 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP2 pid=568) INFO 09-15 04:47:00 [multiproc_executor.py:587] WorkerProc shutting down.
(Worker_TP3 pid=569) INFO 09-15 04:47:00 [multiproc_executor.py:546] Parent process exited, terminating worker
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:621 'an illegal memory access was encountered'
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:621 'an illegal memory access was encountered'
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:621 'an illegal memory access was encountered'
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:621 'an illegal memory access was encountered'
nanobind: leaked 2 instances!
 - leaked instance 0x7fb1880691b8 of type "xgrammar.xgrammar_bindings.CompiledGrammar"
 - leaked instance 0x7fb194239de8 of type "xgrammar.xgrammar_bindings.GrammarMatcher"
nanobind: leaked 2 types!
 - leaked type "xgrammar.xgrammar_bindings.GrammarMatcher"
 - leaked type "xgrammar.xgrammar_bindings.CompiledGrammar"
nanobind: leaked 16 functions!
 - leaked function "deserialize_json"
 - leaked function ""
 - leaked function "_debug_print_internal_state"
 - leaked function "find_jump_forward_string"
 - leaked function "is_terminated"
 - leaked function "reset"
 - leaked function ""
 - leaked function "accept_string"
 - leaked function "accept_token"
 - leaked function "rollback"
 - leaked function ""
 - leaked function ""
 - leaked function "fill_next_token_bitmask"
 - leaked function "__init__"
 - leaked function "serialize_json"
 - leaked function ""
nanobind: this is likely caused by a reference counting issue in the binding code.
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 4 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
