Question about implementation of top-k sampling (5.3.2 Top-k sampling) #326
-
Hi @rasbt,

There is the following code snippet used in the book as the top-k sampling implementation:

```python
new_logits = torch.where(
    condition=next_token_logits < top_logits[-1],
    input=torch.tensor(float('-inf')),
    other=next_token_logits
)
```

Could you please help with these questions: is there a reason to use `float('-inf')` here rather than `torch.inf`? And wouldn't the following implementation be simpler (probably it is also a little bit faster)?

```python
new_logits = torch.full_like(next_token_logits, -torch.inf)
new_logits[top_pos] = next_token_logits[top_pos]
new_logits
```

Thank you.
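For reference, here is a self-contained sketch running both variants side by side; the tensor values are just illustrative, and `top_logits`/`top_pos` are assumed to come from `torch.topk` as in the book:

```python
import torch

# Illustrative logits for a toy vocabulary; top_k = 3.
next_token_logits = torch.tensor(
    [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)
top_k = 3
top_logits, top_pos = torch.topk(next_token_logits, top_k)

# Variant from the book: threshold against the k-th largest logit.
new_logits_where = torch.where(
    condition=next_token_logits < top_logits[-1],
    input=torch.tensor(float('-inf')),
    other=next_token_logits,
)

# Proposed alternative: start from -inf and copy back only the top-k entries.
new_logits_fill = torch.full_like(next_token_logits, -torch.inf)
new_logits_fill[top_pos] = next_token_logits[top_pos]

# Without duplicate logit values, both produce the same result.
print(torch.equal(new_logits_where, new_logits_fill))  # True
```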
-
I think there may be no difference between them; you can test this simply in the REPL:

```python
>>> import math
>>> import numpy as np
>>> import torch
>>> torch.inf == math.inf == np.inf == float('inf')
True
>>> -torch.inf == -math.inf == -np.inf == -float('inf')
True
```
-
Good call. I definitely could have used `torch.inf`. It's just muscle memory at this point because I believe it didn't exist in early versions of PyTorch (pre-2.0 or so).

Overall, I like your alternative implementation. Thanks for sharing that! I think the fact that it doesn't allow duplication can be seen as a pro or con. In my implementation, if you have a top-3 setting and duplicates like in

```python
[0.412314, 0.412314, -0.5, 0.1, 0.2, 1.0, 0.8, ...]
```

it will not strictly be top 3 anymore but top 3+, e.g.,

```python
[0.412314, 0.412314, 1.0, 0.8, ...]
```

and the sampling will treat both tokens with equal probability. In your implementation, it will choose one over the other (I think the one that has the lower index, due to how `torch.topk` picks among tied values).
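Here's a minimal sketch, using the illustrative duplicate values above, that makes the difference visible:

```python
import torch

# Duplicate values from the example above; top_k = 3.
logits = torch.tensor([0.412314, 0.412314, -0.5, 0.1, 0.2, 1.0, 0.8])
top_logits, top_pos = torch.topk(logits, 3)

# Threshold-based masking keeps *every* logit >= the 3rd largest,
# so both tied duplicates survive ("top 3+"): 4 candidates remain.
thresholded = torch.where(logits < top_logits[-1],
                          torch.tensor(float('-inf')), logits)

# Index-based masking keeps exactly 3 entries; torch.topk returns
# only one of the two tied positions.
indexed = torch.full_like(logits, -torch.inf)
indexed[top_pos] = logits[top_pos]

print((thresholded > -torch.inf).sum().item())  # 4
print((indexed > -torch.inf).sum().item())      # 3
```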
I wouldn't say one behavior is "more correct" or "better" than the other; it's just based on what you want. In any case, I added a code comment with your alternative to the notebook in case readers want to follow this discussion here. Thanks for sharing this!