
[Bug]: Blackwell, no kernel image is available for execution on the device #20193


Your current environment

The output of python collect_env.py
==============================
        System Info
==============================
OS                           : Ubuntu 24.04.2 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version                : 20.1.6 (++20250528122018+47addd4540b4-1~exp1~20250528002033.124)
CMake version                : version 3.28.3
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.7.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Jun 18 2025, 17:59:45) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.8.93
CUDA_MODULE_LOADING set to   : LAZY
GPU models and configuration : GPU 0: NVIDIA GeForce RTX 5080 Laptop GPU
Nvidia driver version        : 576.40
cuDNN version                : Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.10.1
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        42 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               24
On-line CPU(s) list:                  0-23
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Core(TM) Ultra 9 275HX
CPU family:                           6
Model:                                198
Thread(s) per core:                   1
Core(s) per socket:                   24
Socket(s):                            1
Stepping:                             2
BogoMIPS:                             6144.00
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities
Virtualization:                       VT-x
Hypervisor vendor:                    Microsoft
Virtualization type:                  full
L1d cache:                            1.1 MiB (24 instances)
L1i cache:                            1.5 MiB (24 instances)
L2 cache:                             72 MiB (24 instances)
L3 cache:                             36 MiB (1 instance)
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.3.14
[pip3] nvidia-cuda-cupti-cu12==12.8.57
[pip3] nvidia-cuda-nvrtc-cu12==12.8.61
[pip3] nvidia-cuda-runtime-cu12==12.8.57
[pip3] nvidia-cudnn-cu12==9.7.1.26
[pip3] nvidia-cufft-cu12==11.3.3.41
[pip3] nvidia-cufile-cu12==1.13.0.11
[pip3] nvidia-curand-cu12==10.3.9.55
[pip3] nvidia-cusolver-cu12==11.7.2.55
[pip3] nvidia-cusparse-cu12==12.5.7.53
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.8.61
[pip3] nvidia-nvtx-cu12==12.8.55
[pip3] pyzmq==27.0.0
[pip3] torch==2.7.0+cu128
[pip3] torchaudio==2.7.0+cu128
[pip3] torchvision==0.22.0+cu128
[pip3] transformers==4.53.0
[pip3] triton==3.3.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
Neuron SDK Version           : N/A
vLLM Version                 : 0.9.1
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X                              N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
CUDA_HOME=/usr/local/cuda-12.8
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

🐛 Describe the bug

Hi there.

I can't get vLLM to work with CUDA no matter what I do.

I'm running on an RTX 5080 Laptop GPU. I installed everything in a fresh virtual environment (Python 3.10).

I have tried running Qwen3 8B and 0.6B FP8 (with quantization set to fp8). They take a long time to load and then hang; a minimal sketch of that attempt is below.
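
A rough sketch of what I ran for the FP8 case (same environment variables as in the full script further down; the exact arguments may have differed slightly, and Qwen/Qwen3-8B-FP8 is the checkpoint mentioned in the script's comment):

import os
# Same backend selection as the full script below.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM, SamplingParams

# Loading reaches the weight-loading / profiling stage and then hangs.
model = LLM(
    "Qwen/Qwen3-8B-FP8",
    quantization="fp8",
    max_model_len=2056,
    gpu_memory_utilization=0.9,
    enforce_eager=True,
)
outputs = model.generate(["Hello"], SamplingParams(max_tokens=16))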

When I try the 14B AWQ model (quantization="awq_marlin"), it fails with this error:


Fetching 12 files: 100%|██████████| 12/12 [00:00<00:00, 1685.81it/s]
INFO 06-27 17:49:21 [config.py:823] This model supports multiple tasks: {'score', 'reward', 'classify', 'embed', 'generate'}. Defaulting to 'generate'.
WARNING 06-27 17:49:21 [config.py:3271] Casting torch.float16 to torch.bfloat16.
INFO 06-27 17:49:23 [awq_marlin.py:116] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
WARNING 06-27 17:49:23 [cuda.py:91] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 06-27 17:49:23 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.1) with config: model='/home/tomwright/PycharmProjects/vllm_0.9_torch/models/Qwen_Qwen3-14B-AWQ_local', speculative_config=None, tokenizer='/home/tomwright/PycharmProjects/vllm_0.9_torch/models/Qwen_Qwen3-14B-AWQ_local', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2056, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=/home/tomwright/PycharmProjects/vllm_0.9_torch/models/Qwen_Qwen3-14B-AWQ_local, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}, use_cached_outputs=False, 
WARNING 06-27 17:49:24 [interface.py:376] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 06-27 17:49:24 [cuda.py:256] Using FlashInfer backend.
INFO 06-27 17:49:25 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 06-27 17:49:25 [model_runner.py:1171] Starting to load model /home/tomwright/PycharmProjects/vllm_0.9_torch/models/Qwen_Qwen3-14B-AWQ_local...
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:02<00:02,  2.46s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:05<00:00,  2.52s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:05<00:00,  2.51s/it]
INFO 06-27 17:49:30 [default_loader.py:272] Loading weights took 5.20 seconds
INFO 06-27 17:49:32 [model_runner.py:1203] Model loading took 9.3620 GiB and 6.456744 seconds
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.12/code.py", line 90, in runcode
[rank0]:     exec(code, self.locals)
[rank0]:   File "<input>", line 81, in <module>
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 243, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 501, in from_engine_args
[rank0]:     return engine_cls.from_vllm_config(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 477, in from_vllm_config
[rank0]:     return cls(
[rank0]:            ^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 268, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 413, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 104, in determine_num_available_blocks
[rank0]:     results = self.collective_rpc("determine_num_available_blocks")
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
[rank0]:     answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/utils.py", line 2671, in run_method
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/worker/worker.py", line 256, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1300, in profile_run
[rank0]:     self._dummy_run(max_num_batched_tokens, max_num_seqs)
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1426, in _dummy_run
[rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1844, in execute_model
[rank0]:     hidden_or_intermediate_states = model_executable(
[rank0]:                                     ^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3.py", line 301, in forward
[rank0]:     hidden_states = self.model(input_ids, positions, intermediate_tensors,
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 173, in __call__
[rank0]:     return self.forward(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen2.py", line 354, in forward
[rank0]:     hidden_states, residual = layer(
[rank0]:                               ^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3.py", line 214, in forward
[rank0]:     hidden_states = self.self_attn(
[rank0]:                     ^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3.py", line 133, in forward
[rank0]:     qkv, _ = self.qkv_proj(hidden_states)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 487, in forward
[rank0]:     output_parallel = self.quant_method.apply(self, input_, bias)
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 308, in apply
[rank0]:     return apply_awq_marlin_linear(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 456, in apply_awq_marlin_linear
[rank0]:     output = ops.gptq_marlin_gemm(reshaped_x,
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/_custom_ops.py", line 1054, in gptq_marlin_gemm
[rank0]:     return torch.ops._C.gptq_marlin_gemm(a, c, b_q_weight, b_scales,
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/torch/_ops.py", line 1158, in __call__
[rank0]:     return self._op(*args, **(kwargs or {}))
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: CUDA error: no kernel image is available for execution on the device
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

script:

from vllm import LLM, SamplingParams
import os
from typing import Optional, List, Literal
from transformers import AutoTokenizer
from vllm.sampling_params import GuidedDecodingParams
from pydantic import BaseModel, Field
from huggingface_hub import snapshot_download
import pandas as pd
tmp = pd.read_excel("/home/tomwright/PycharmProjects/bertopcing/test.xlsx")
os.environ['VLLM_ATTENTION_BACKEND']="FLASHINFER"
os.environ['VLLM_USE_V1'] = "0"
doc_ls = []

# Filter out Topic -1 and group by "Topic"
grouped_topics = tmp[tmp["Topic"] != -1].groupby("Topic")

for topic_id, group_df in grouped_topics:
    # Filter for representative documents
    rep_docs_df = group_df[group_df["Representative_document"] == True]

    # Get list of "Document" values
    # If no representative documents, this will be an empty list
    representative_documents = rep_docs_df["Document"].tolist()

    # Get ONE LLM value from the group (e.g., the first one)
    # and ensure it's represented as a string version of a list.
    if not group_df.empty:
        llm_value_for_group = group_df["LLM"].iloc[0]

        # Ensure the LLM value is stringified as if it were a list
        if isinstance(llm_value_for_group, list):
            llm_str = str(llm_value_for_group)
        else:
            # If it's not a list (e.g. a string, number), wrap it in a list then stringify
            llm_str = str([llm_value_for_group])

        doc_ls.append({
            "rep_docs": representative_documents,
            "LLM": llm_str  # This is now a string like "['model_name']" or "['item1', 'item2']"
        })

model_id = "Qwen/Qwen3-14B-AWQ" # Or Qwen/Qwen3-8B-FP8 if preferred
tokenizer = AutoTokenizer.from_pretrained(model_id)


local_model_path = f"/home/tomwright/PycharmProjects/vllm_0.9_torch/models/{model_id.replace('/', '_')}_local"

# Download the model snapshot to the specified local directory

snapshot_download(repo_id=model_id, local_dir=local_model_path, local_dir_use_symlinks=False)

t = []
DEFAULT_PROMPT_TEMPLATE = """You are an expert in categorising health related community score cards. I have a topic described by the following keywords: [KEYWORDS]
The topic is further described by the following documents:
[DOCUMENTS]

Based on the information above, please generate a concise, descriptive topic label (max 5 words) that accurately represents the topic.
Topic Label:"""
for x in doc_ls:
    tmp2 = DEFAULT_PROMPT_TEMPLATE
    tmp2= tmp2.replace("[DOCUMENTS]", "\n\n".join(x['rep_docs']))
    tmp2 = tmp2.replace("[KEYWORDS]", x['LLM'])
    t.append(tmp2)

messages = []
for text in t:
    messages.append([  # Each conversation needs to be a list of messages

        {
            "role": "user",
            "content": text
        }
    ])
prompts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)

# Rest of your imports and code
model = LLM(local_model_path,
            gpu_memory_utilization=0.9,
            max_model_len=2056,
            dtype='bfloat16',
            quantization='awq_marlin',
            enforce_eager=True,
            #enable_reasoning=True,
            #reasoning_parser="qwen3",
            #kv_cache_dtype="fp8"
            )

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=30,
    top_p=0.8,
    top_k=20,
    min_p=0,
    presence_penalty=1.5,
)

outputs = model.generate(
    prompts=prompts,
    sampling_params=sampling_params,
)
a = []
import json
tmp = json.loads(outputs[0].outputs[0].text)
print(tmp)
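
For what it's worth, here is a quick diagnostic sketch comparing the GPU's compute capability against the CUDA architectures the installed PyTorch wheel ships kernels for. It does not cover vLLM's own compiled extensions (which is where gptq_marlin_gemm lives and which may target a different arch list), but it at least shows whether the base build knows about this GPU:

import torch

# Compute capability of the visible GPU; the RTX 5080 should report (12, 0), i.e. sm_120 (Blackwell).
print("device capability:", torch.cuda.get_device_capability(0))

# CUDA architectures the installed PyTorch build was compiled for.
print("compiled arch list:", torch.cuda.get_arch_list())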
