After #114, the server decodes every running sequence at every generation step. This adds significant overhead, especially when a slow tokenizer is used (e.g., for LLaMA).
Inference latency breakdown (bs 8, input 32, output 128):

| Model | Tokenizer | Avg latency | Tokenizer time |
|---|---|---|---|
| opt-13b | fast | 3.57 s | 0.14 s |
| llama-13b | slow | 5.28 s | 1.97 s |
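
A minimal sketch of where the overhead comes from, assuming the server re-decodes the full token sequence with the HF tokenizer at every step (the model ID and the sampling stand-in below are placeholders, not the server's actual code):

```python
import time

from transformers import AutoTokenizer

# Placeholder model ID; any model with a slow (SentencePiece) tokenizer
# shows the same effect.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-13b", use_fast=False)

token_ids = tokenizer("The quick brown fox").input_ids
start = time.perf_counter()
for _ in range(128):  # one generation step per iteration
    token_ids.append(token_ids[-1])  # stand-in for a freshly sampled token
    text = tokenizer.decode(token_ids)  # full-sequence decode: O(seq_len) per step
print(f"128 full-sequence decodes took {time.perf_counter() - start:.2f} s")
```

Because each step decodes the whole sequence so far, total decoding work grows quadratically with output length, which is why the slow tokenizer dominates the llama-13b latency above.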