I was playing with a simple example:
```python
from vllm.entrypoints.llm import LLM
from vllm.sampling_params import SamplingParams

MODEL_NAME = "TheBloke/vicuna-7B-v1.5-16k-GPTQ"

llm = LLM(model=MODEL_NAME, quantization="gptq", gpu_memory_utilization=0.5)

sampling_params = SamplingParams(
    max_tokens=600,
    top_p=0.95,
    temperature=0.7,
    presence_penalty=0.5,
    frequency_penalty=0.5,
)

outputs = llm.generate(
    [
        """A chat between a curious user and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the user's questions.\n\n
USER: Write Python script converting RGB image to grayscale.\n
ASSISTANT: """,
    ],
    sampling_params=sampling_params,
)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r},\nGenerated text: {generated_text!r}")
    print()
```
Running this, I received nonsensical output: either random characters or a few words looped endlessly.
When I switched back to the 4k model, everything was fine. I'm not sure if this is related to GPTQ or to the 16k Vicuna in general.
I'm not able to check the unquantized or AWQ-quantized model, as it doesn't fit on my GPU during inference.
Relevant to GPTQ: #916
What inspired me to try a model with a shorter context (see the last answer there): #590
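
For reference, a quick way to see what actually distinguishes the 16k checkpoint from the 4k one is to inspect its Hugging Face config. This is only a diagnostic sketch and assumes the extended context is declared via a `rope_scaling` entry in the checkpoint's `config.json`; if that entry is present but not applied by the engine, it would point at the long-context side rather than GPTQ:

```python
# Diagnostic sketch: inspect what makes the 16k variant special.
# Assumption: the longer context is declared via a `rope_scaling` entry in
# config.json; whether the engine honors it is what would separate a
# "GPTQ problem" from a "16k-context problem".
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("TheBloke/vicuna-7B-v1.5-16k-GPTQ")
print("max_position_embeddings:", cfg.max_position_embeddings)
print("rope_scaling:", getattr(cfg, "rope_scaling", None))
```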