
Bugged inference on Vicuna 7B GPTQ 16k #1213

Closed as not planned

Description

@pweglik

I was playing with a simple example:

from vllm import LLM, SamplingParams

MODEL_NAME = "TheBloke/vicuna-7B-v1.5-16k-GPTQ"

llm = LLM(model=MODEL_NAME, quantization="gptq", gpu_memory_utilization=0.5)

sampling_params = SamplingParams(
    max_tokens=600,
    top_p=0.95,
    temperature=0.7,
    presence_penalty=0.5,
    frequency_penalty=0.5,
)

outputs = llm.generate(
    [
        """A chat between a curious user and an artificial intelligence assistant.
        The assistant gives helpful, detailed, and polite answers to the user's questions.\n\n
        USER: Write Python script converting RGB image to grayscale.\n
        ASSISTANT: """,
    ],
    sampling_params=sampling_params,
)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r},\nGenerated text: {generated_text!r}")
    print()

and I received garbage output: either random characters or a few words looped endlessly.
When I switched back to the 4k model, everything was fine. I'm not sure whether this is related to GPTQ or to the 16k Vicuna in general.
I'm not able to check the unquantized or AWQ-quantized model, as neither fits on my GPU during inference.
Relevant to GPTQ: #916
What inspired me to check a model with shorter context (see the last answer there): #590
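In case it helps with reproduction, here is a rough sketch of how the two builds can be compared (the 4k repo name TheBloke/vicuna-7B-v1.5-GPTQ is an assumption on my part; substitute whichever 4k GPTQ build is handy). It runs the same prompt greedily against one model per process and also prints the rope_scaling entry from the HF config, since the 16k variant is supposed to rely on linear RoPE scaling:

# Rough sketch: run the same prompt with greedy decoding against the 16k GPTQ
# build and a 4k GPTQ build, one model per process so the two engines never
# share the GPU. The 4k repo name is an assumption; adjust as needed.
import sys

from transformers import AutoConfig
from vllm import LLM, SamplingParams

# e.g.  python compare.py TheBloke/vicuna-7B-v1.5-16k-GPTQ
#       python compare.py TheBloke/vicuna-7B-v1.5-GPTQ
model_name = sys.argv[1]

# Print what rope_scaling (if any) the HF config carries; the 16k variant
# should declare linear RoPE scaling, the 4k variant should not.
cfg = AutoConfig.from_pretrained(model_name)
print("rope_scaling:", getattr(cfg, "rope_scaling", None))

llm = LLM(model=model_name, quantization="gptq", gpu_memory_utilization=0.5)
params = SamplingParams(max_tokens=200, temperature=0.0)  # greedy, so runs are comparable
outputs = llm.generate(
    ["USER: Write a Python script converting an RGB image to grayscale.\nASSISTANT: "],
    sampling_params=params,
)
print(outputs[0].outputs[0].text)

If the 4k build produces sensible text and the 16k build produces gibberish under identical settings, that points at the long-context (RoPE-scaled) variant rather than GPTQ itself.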
