Open
Description
While the on-GPU sampling is neat, moving logits to the CPU might still be faster.
One idea would be to have logitsample(f::Function, logits) fall back to logitsample(f(logits)), with specialized methods such as logitsample(::Top_pk, logits) that get better time complexity by using a partial sort.
Some rough benchmarks show that logitsample ∘ Top_p(0.5) on 100k logits takes ~2 milliseconds on an A6000, which on its own caps inference at roughly 500 sampling steps per second.
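To make the dispatch idea concrete, here is a minimal CPU-side sketch. It is not the package's implementation: `Top_k` is a hypothetical, simplified stand-in for `Top_pk` (top-k only, no top-p), the `rng` keyword is an assumption added for reproducibility, and Gumbel-max sampling is swapped in as one simple way to sample from logits. Only `partialsortperm` and the other `Base`/`Random` calls are real APIs.

```julia
using Random

# Hypothetical top-k transform; subtyping Function lets it also hit the generic fallback.
struct Top_k <: Function
    k::Int
end

# Generic transform: keep the k largest logits and set the rest to -Inf.
# This path uses a full sort, O(n log n). Ties at the threshold are all kept.
function (t::Top_k)(logits::AbstractVector{T}) where {T<:AbstractFloat}
    threshold = sort(logits; rev=true)[t.k]
    return [x >= threshold ? x : T(-Inf) for x in logits]
end

# Sample an index from softmax(logits) using the Gumbel-max trick.
function logitsample(logits::AbstractVector{T}; rng=Random.default_rng()) where {T<:AbstractFloat}
    u = rand(rng, T, length(logits))
    return argmax(logits .- log.(-log.(u)))
end

# Fallback: apply the transform to all logits, then sample.
logitsample(f::Function, logits::AbstractVector; rng=Random.default_rng()) =
    logitsample(f(logits); rng=rng)

# Specialization: a partial sort finds the indices of the k largest logits
# without fully sorting the vector, then sampling runs over only those k candidates.
function logitsample(t::Top_k, logits::AbstractVector; rng=Random.default_rng())
    idx = partialsortperm(logits, 1:t.k; rev=true)
    return idx[logitsample(logits[idx]; rng=rng)]
end
```

With this layout, `logitsample(Top_k(40), logits)` takes the partial-sort path, while any other callable, for example `logitsample(x -> x ./ 0.7f0, logits)`, goes through the fallback. A top-p specialization would need a different strategy, since the cutoff depends on cumulative probability rather than a fixed count.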