
Potential degradation in sampling / too repetitive #712

Closed as not planned

Description

@syskn

Moving from Discussions #471

Hi guys, great work!

I have been experimenting with the library for several weeks and immediately noticed that sampled tokens (with the same temperature and other settings) are significantly more deterministic with vLLM than with HF Transformers using the same models. With temperature lower than 0.7, the first 5-10 sampled tokens are often exactly the same across a few different generations, sometimes even recreating the original text from the dataset verbatim, as if greedy decoding were going on (when it is not). This unfortunately leads to a significant repetition issue I've never seen with HF.
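
For context, a minimal sketch of the kind of side-by-side comparison I'm describing (the model name, prompt, and sampling settings below are just placeholders, not my exact script):

```python
from vllm import LLM, SamplingParams
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"  # placeholder
prompt = "Some prompt taken from the dataset ..."

# vLLM: sample the same prompt several times at temperature < 0.7
llm = LLM(model=model_name)
params = SamplingParams(temperature=0.6, top_p=1.0, max_tokens=32)
for out in llm.generate([prompt] * 5, params):
    print(out.outputs[0].text)

# HF Transformers: same model and sampling settings for comparison
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
for _ in range(5):
    ids = model.generate(
        **inputs, do_sample=True, temperature=0.6, top_p=1.0, max_new_tokens=32
    )
    print(tokenizer.decode(ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

With vLLM the five samples often start with identical tokens; with HF they diverge after the first token or two.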

The issue is not related to special tokens such as <s></s>. I modified vLLM so that it never generates those special tokens, similar to HF's bad_words_ids, but the issue persists. (Those special tokens also make inference quality significantly worse, especially with non-chat prompts, but that is a separate issue.)
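
On the HF side, blocking the special tokens looks roughly like this (a sketch reusing the tokenizer, model, and inputs from the snippet above; what I patched into vLLM is the analogous behavior):

```python
# Prevent HF from ever sampling <s>/</s> via bad_words_ids
bad_words_ids = [[tokenizer.bos_token_id], [tokenizer.eos_token_id]]
ids = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.6,
    max_new_tokens=32,
    bad_words_ids=bad_words_ids,
)
```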

In the meantime, I have also been checking and modifying the codebase to see if there is any discrepancy in the sampling process, but I have not been able to pin down the difference. I suspect either the CUDA kernels or the block partitioning in PagedAttention.

#590: Related topic with actual examples. That topic is about GPT-J, but the issue also appears with other architectures such as Llama and NeoX.
