Your current environment
The output of `python collect_env.py`:
==============================
System Info
==============================
OS : Ubuntu 24.04.2 LTS (x86_64)
GCC version : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version : 20.1.6 (++20250528122018+47addd4540b4-1~exp1~20250528002033.124)
CMake version : version 3.28.3
Libc version : glibc-2.39
==============================
PyTorch Info
==============================
PyTorch version : 2.7.0+cu128
Is debug build : False
CUDA used to build PyTorch : 12.8
ROCM used to build PyTorch : N/A
==============================
Python Environment
==============================
Python version : 3.12.3 (main, Jun 18 2025, 17:59:45) [GCC 13.3.0] (64-bit runtime)
Python platform : Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
==============================
CUDA / GPU Info
==============================
Is CUDA available : True
CUDA runtime version : 12.8.93
CUDA_MODULE_LOADING set to : LAZY
GPU models and configuration : GPU 0: NVIDIA GeForce RTX 5080 Laptop GPU
Nvidia driver version : 576.40
cuDNN version : Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.10.1
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
==============================
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 42 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) Ultra 9 275HX
CPU family: 6
Model: 198
Thread(s) per core: 1
Core(s) per socket: 24
Socket(s): 1
Stepping: 2
BogoMIPS: 6144.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities
Virtualization: VT-x
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 1.1 MiB (24 instances)
L1i cache: 1.5 MiB (24 instances)
L2 cache: 72 MiB (24 instances)
L3 cache: 36 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.3.14
[pip3] nvidia-cuda-cupti-cu12==12.8.57
[pip3] nvidia-cuda-nvrtc-cu12==12.8.61
[pip3] nvidia-cuda-runtime-cu12==12.8.57
[pip3] nvidia-cudnn-cu12==9.7.1.26
[pip3] nvidia-cufft-cu12==11.3.3.41
[pip3] nvidia-cufile-cu12==1.13.0.11
[pip3] nvidia-curand-cu12==10.3.9.55
[pip3] nvidia-cusolver-cu12==11.7.2.55
[pip3] nvidia-cusparse-cu12==12.5.7.53
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.8.61
[pip3] nvidia-nvtx-cu12==12.8.55
[pip3] pyzmq==27.0.0
[pip3] torch==2.7.0+cu128
[pip3] torchaudio==2.7.0+cu128
[pip3] torchvision==0.22.0+cu128
[pip3] transformers==4.53.0
[pip3] triton==3.3.0
[conda] Could not collect
==============================
vLLM Info
==============================
ROCM Version : Could not collect
Neuron SDK Version : N/A
vLLM Version : 0.9.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
==============================
Environment Variables
==============================
CUDA_HOME=/usr/local/cuda-12.8
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
🐛 Describe the bug
Hi there.
I can't get vLLM to work with CUDA no matter what I do.
I'm running on an RTX 5080 Laptop GPU. Everything was installed in a fresh virtual environment (Python 3.10).
I have tried running Qwen3 8B and 0.6B in FP8 (with quantization set to "fp8"); they take a long time to load and then hang.
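Roughly how the FP8 runs were invoked (a sketch only; the exact call isn't reproduced here, so the model ID and arguments are assumptions based on the description and the full script at the bottom):

from vllm import LLM

# Sketch of the FP8 attempt (not the exact command); the 0.6B FP8 variant was tried the same way.
model = LLM("Qwen/Qwen3-8B-FP8",
            quantization="fp8",
            gpu_memory_utilization=0.9,
            max_model_len=2056,
            enforce_eager=True)
# Loading takes a long time and then hangs.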
When I try the 14B AWQ model (quantization="awq_marlin"), it fails with this error:
Fetching 12 files: 100%|██████████| 12/12 [00:00<00:00, 1685.81it/s]
INFO 06-27 17:49:21 [config.py:823] This model supports multiple tasks: {'score', 'reward', 'classify', 'embed', 'generate'}. Defaulting to 'generate'.
WARNING 06-27 17:49:21 [config.py:3271] Casting torch.float16 to torch.bfloat16.
INFO 06-27 17:49:23 [awq_marlin.py:116] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
WARNING 06-27 17:49:23 [cuda.py:91] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 06-27 17:49:23 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.1) with config: model='/home/tomwright/PycharmProjects/vllm_0.9_torch/models/Qwen_Qwen3-14B-AWQ_local', speculative_config=None, tokenizer='/home/tomwright/PycharmProjects/vllm_0.9_torch/models/Qwen_Qwen3-14B-AWQ_local', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2056, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=/home/tomwright/PycharmProjects/vllm_0.9_torch/models/Qwen_Qwen3-14B-AWQ_local, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}, use_cached_outputs=False,
WARNING 06-27 17:49:24 [interface.py:376] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 06-27 17:49:24 [cuda.py:256] Using FlashInfer backend.
INFO 06-27 17:49:25 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 06-27 17:49:25 [model_runner.py:1171] Starting to load model /home/tomwright/PycharmProjects/vllm_0.9_torch/models/Qwen_Qwen3-14B-AWQ_local...
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:02<00:02, 2.46s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:05<00:00, 2.52s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:05<00:00, 2.51s/it]
INFO 06-27 17:49:30 [default_loader.py:272] Loading weights took 5.20 seconds
INFO 06-27 17:49:32 [model_runner.py:1203] Model loading took 9.3620 GiB and 6.456744 seconds
[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/lib/python3.12/code.py", line 90, in runcode
[rank0]: exec(code, self.locals)
[rank0]: File "<input>", line 81, in <module>
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 243, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 501, in from_engine_args
[rank0]: return engine_cls.from_vllm_config(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 477, in from_vllm_config
[rank0]: return cls(
[rank0]: ^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 268, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 413, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks())
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 104, in determine_num_available_blocks
[rank0]: results = self.collective_rpc("determine_num_available_blocks")
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/utils.py", line 2671, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/worker/worker.py", line 256, in determine_num_available_blocks
[rank0]: self.model_runner.profile_run()
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1300, in profile_run
[rank0]: self._dummy_run(max_num_batched_tokens, max_num_seqs)
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1426, in _dummy_run
[rank0]: self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1844, in execute_model
[rank0]: hidden_or_intermediate_states = model_executable(
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3.py", line 301, in forward
[rank0]: hidden_states = self.model(input_ids, positions, intermediate_tensors,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 173, in __call__
[rank0]: return self.forward(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen2.py", line 354, in forward
[rank0]: hidden_states, residual = layer(
[rank0]: ^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3.py", line 214, in forward
[rank0]: hidden_states = self.self_attn(
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3.py", line 133, in forward
[rank0]: qkv, _ = self.qkv_proj(hidden_states)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 487, in forward
[rank0]: output_parallel = self.quant_method.apply(self, input_, bias)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 308, in apply
[rank0]: return apply_awq_marlin_linear(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 456, in apply_awq_marlin_linear
[rank0]: output = ops.gptq_marlin_gemm(reshaped_x,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/vllm/_custom_ops.py", line 1054, in gptq_marlin_gemm
[rank0]: return torch.ops._C.gptq_marlin_gemm(a, c, b_q_weight, b_scales,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/tomwright/PycharmProjects/vllm_try/venv/lib/python3.12/site-packages/torch/_ops.py", line 1158, in __call__
[rank0]: return self._op(*args, **(kwargs or {}))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: CUDA error: no kernel image is available for execution on the device
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
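For reference, a quick check (a sketch to run in the same venv) that prints which compute capabilities the installed PyTorch wheel targets. Note that the failing op (torch.ops._C.gptq_marlin_gemm) comes from vLLM's own compiled extension, which is built separately from PyTorch, so this only covers the PyTorch side:

import torch

# RTX 5080 (Blackwell) should report compute capability (12, 0), i.e. sm_120.
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))
# Architectures the installed PyTorch build ships kernels for; if sm_120 (or a
# compatible PTX target) is absent here or from the vLLM build, this error is expected.
print(torch.cuda.get_arch_list())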
Script:
from vllm import LLM, SamplingParams
import os
from typing import Optional, List, Literal
from transformers import AutoTokenizer
from vllm.sampling_params import GuidedDecodingParams
from pydantic import BaseModel, Field
from huggingface_hub import snapshot_download
import pandas as pd

tmp = pd.read_excel("/home/tomwright/PycharmProjects/bertopcing/test.xlsx")

os.environ['VLLM_ATTENTION_BACKEND'] = "FLASHINFER"
os.environ['VLLM_USE_V1'] = "0"

doc_ls = []
# Filter out Topic -1 and group by "Topic"
grouped_topics = tmp[tmp["Topic"] != -1].groupby("Topic")
for topic_id, group_df in grouped_topics:
    # Filter for representative documents
    rep_docs_df = group_df[group_df["Representative_document"] == True]
    # Get list of "Document" values
    # If no representative documents, this will be an empty list
    representative_documents = rep_docs_df["Document"].tolist()
    # Get ONE LLM value from the group (e.g., the first one)
    # and ensure it's represented as a string version of a list.
    if not group_df.empty:
        llm_value_for_group = group_df["LLM"].iloc[0]
        # Ensure the LLM value is stringified as if it were a list
        if isinstance(llm_value_for_group, list):
            llm_str = str(llm_value_for_group)
        else:
            # If it's not a list (e.g. a string, number), wrap it in a list then stringify
            llm_str = str([llm_value_for_group])
        doc_ls.append({
            "rep_docs": representative_documents,
            "LLM": llm_str  # This is now a string like "['model_name']" or "['item1', 'item2']"
        })

model_id = "Qwen/Qwen3-14B-AWQ"  # Or Qwen/Qwen3-8B-FP8 if preferred
tokenizer = AutoTokenizer.from_pretrained(model_id)
local_model_path = f"/home/tomwright/PycharmProjects/vllm_0.9_torch/models/{model_id.replace('/', '_')}_local"
# Download the model snapshot to the specified local directory
snapshot_download(repo_id=model_id, local_dir=local_model_path, local_dir_use_symlinks=False)

t = []
DEFAULT_PROMPT_TEMPLATE = """You are an expert in categorising health related community score cards. I have a topic described by the following keywords: [KEYWORDS]
The topic is further described by the following documents:
[DOCUMENTS]
Based on the information above, please generate a concise, descriptive topic label (max 5 words) that accurately represents the topic.
Topic Label:"""

for x in doc_ls:
    tmp2 = DEFAULT_PROMPT_TEMPLATE
    tmp2 = tmp2.replace("[DOCUMENTS]", "\n\n".join(x['rep_docs']))
    tmp2 = tmp2.replace("[KEYWORDS]", x['LLM'])
    t.append(tmp2)

messages = []
for text in t:
    messages.append([  # Each conversation needs to be a list of messages
        {
            "role": "user",
            "content": text
        }
    ])

prompts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)

# Rest of your imports and code
model = LLM(local_model_path,
            gpu_memory_utilization=0.9,
            max_model_len=2056,
            dtype='bfloat16',
            quantization='awq_marlin',
            enforce_eager=True,
            # enable_reasoning=True,
            # reasoning_parser="qwen3",
            # kv_cache_dtype="fp8"
            )

sampling_params = SamplingParams(temperature=0.7, max_tokens=30, top_p=0.8,
                                 top_k=20,
                                 min_p=0,
                                 presence_penalty=1.5,
                                 )

outputs = model.generate(
    prompts=prompts,
    sampling_params=sampling_params,
)

a = []
import json
tmp = json.loads(outputs[0].outputs[0].text)
print(tmp)