
Potential degradation in sampling / too repetitive #712

Closed as not planned

Description

@syskn

Moving from Discussions #471

Hi guys, great work!

I have been experimenting with the library for several weeks and immediately noticed that sampled tokens (with the same temperature and other settings) are significantly more deterministic with vLLM than with HF Transformers using the same models. With temperature lower than 0.7, the first 5-10 sampled tokens are often exactly the same across a few different generations, sometimes even recreating the original text from the dataset verbatim, as if greedy decoding were going on (when it is not). This unfortunately leads to a significant repetition issue I've never seen with HF.
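
For context, a minimal sketch of the kind of side-by-side comparison I'm describing (the model name, prompt, and sampling settings below are just placeholders, not my exact script):

```python
from vllm import LLM, SamplingParams
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"  # placeholder
prompt = "Some prompt taken from the dataset ..."

# vLLM: sample the same prompt several times at temperature < 0.7
llm = LLM(model=model_name)
params = SamplingParams(temperature=0.6, top_p=1.0, max_tokens=32)
for out in llm.generate([prompt] * 5, params):
    print(out.outputs[0].text)

# HF Transformers: same model and sampling settings for comparison
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
for _ in range(5):
    ids = model.generate(
        **inputs, do_sample=True, temperature=0.6, top_p=1.0, max_new_tokens=32
    )
    print(tokenizer.decode(ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

With vLLM the five samples often start with identical tokens; with HF they diverge after the first token or two.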

The issue is not related to special tokens such as <s></s>. I modified vLLM so that it never generates those special tokens, similar to HF's bad_words_ids, but the issue persists. (Those special tokens also make inference quality significantly worse, especially with non-chat prompts, but that is a separate issue.)
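
On the HF side, blocking the special tokens looks roughly like this (a sketch reusing the tokenizer, model, and inputs from the snippet above; what I patched into vLLM is the analogous behavior):

```python
# Prevent HF from ever sampling <s>/</s> via bad_words_ids
bad_words_ids = [[tokenizer.bos_token_id], [tokenizer.eos_token_id]]
ids = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.6,
    max_new_tokens=32,
    bad_words_ids=bad_words_ids,
)
```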

In the meantime, I have also been checking and modifying the codebase to see if there is any discrepancy in the sampling process, but I have not been able to pin down the difference. I suspect either the CUDA kernels or the block partitioning in PagedAttention.

#590: Related topic with actual examples. That topic is about GPT-J, but the issue also appears with other architectures such as Llama and NeoX.
