I was playing with a simple example:
```python
from vllm.entrypoints.llm import LLM
from vllm.sampling_params import SamplingParams

MODEL_NAME = "TheBloke/vicuna-7B-v1.5-16k-GPTQ"

llm = LLM(model=MODEL_NAME, quantization="gptq", gpu_memory_utilization=0.5)

sampling_params = SamplingParams(
    max_tokens=600,
    top_p=0.95,
    temperature=0.7,
    presence_penalty=0.5,
    frequency_penalty=0.5,
)

outputs = llm.generate(
    [
        """A chat between a curious user and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the user's questions.\n\n
USER: Write Python script converting RGB image to grayscale.\n
ASSISTANT: """,
    ],
    sampling_params=sampling_params,
)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r},\nGenerated text: {generated_text!r}")
    print()
```
Running this, I received nonsensical output: either random characters or a few words looped endlessly.
When I switched back to the 4k model, everything was fine. I'm not sure if this is related to GPTQ or to the 16k Vicuna in general.
I'm not able to check the unquantized or AWQ-quantized model, as it doesn't fit on my GPU during inference.
Relevant to GPTQ: #916
What inspired me to try a model with a shorter context (see the last answer there): #590
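
For reference, a quick way to see what actually distinguishes the 16k checkpoint from the 4k one is to inspect its Hugging Face config. This is only a diagnostic sketch and assumes the extended context is declared via a `rope_scaling` entry in the checkpoint's `config.json`; if that entry is present but not applied by the engine, it would point at the long-context side rather than GPTQ:

```python
# Diagnostic sketch: inspect what makes the 16k variant special.
# Assumption: the longer context is declared via a `rope_scaling` entry in
# config.json; whether the engine honors it is what would separate a
# "GPTQ problem" from a "16k-context problem".
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("TheBloke/vicuna-7B-v1.5-16k-GPTQ")
print("max_position_embeddings:", cfg.max_position_embeddings)
print("rope_scaling:", getattr(cfg, "rope_scaling", None))
```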