I have a few questions about the inference efficiency of DeepSeek-V2.
1. The paper states: "In order to efficiently deploy DeepSeek-V2 for service, we first convert its parameters into the precision of FP8."
Are both the storage and the computation performed entirely in FP8? Does this harm the model's performance? (A minimal sketch of what I understand by FP8 conversion is included after these questions.)
2. The paper also states: "On a single node with 8 H800 GPUs, DeepSeek-V2 achieves a generation throughput exceeding 50K tokens per second, which is 5.76 times the maximum generation throughput of DeepSeek 67B. In addition, the prompt input throughput of DeepSeek-V2 exceeds 100K tokens per second."
Is this throughput measured with test requests of 128K context length? Can we reproduce it using vllm-project/vllm#4650? (A rough throughput-measurement sketch is included below as well.)
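For reference, here is a minimal sketch of what I understand by converting parameters to FP8: per-tensor scaling into the E4M3 range in PyTorch. It only illustrates the general idea; it is not DeepSeek-V2's actual quantization pipeline or its FP8 kernels, and the tensor shape and dtype are placeholders.

```python
# Minimal per-tensor FP8 (E4M3) weight-quantization sketch in PyTorch.
# Illustrative only: NOT DeepSeek-V2's actual FP8 conversion or kernels.
import torch

def quantize_to_fp8(weight: torch.Tensor):
    """Scale a weight tensor into the E4M3 range and store it as FP8."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # ~448 for E4M3
    scale = weight.abs().max().clamp(min=1e-12) / fp8_max
    w_fp8 = (weight / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale  # FP8 storage plus one per-tensor scale

def dequantize(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate higher-precision weight for computation."""
    return w_fp8.to(torch.float16) * scale

w = torch.randn(4096, 4096, dtype=torch.float16)  # placeholder weight
w_fp8, scale = quantize_to_fp8(w)
err = (w - dequantize(w_fp8, scale)).abs().max()
print(f"max quantization error: {err.item():.4f}")
```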
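And here is roughly how I would try to measure generation throughput with vLLM's offline API, assuming a build that includes the DeepSeek-V2 support from vllm-project/vllm#4650. The model id, batch size, prompt, and sampling settings are illustrative assumptions, not the configuration behind the paper's numbers.

```python
# Rough generation-throughput measurement with vLLM's offline API.
# Assumes DeepSeek-V2 support (vllm-project/vllm#4650) is installed.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V2",  # assumed model id
    tensor_parallel_size=8,           # one node, 8 GPUs
    trust_remote_code=True,
)
sampling = SamplingParams(temperature=0.0, max_tokens=256)

prompts = ["Explain multi-head latent attention."] * 64  # illustrative batch

start = time.time()
outputs = llm.generate(prompts, sampling)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"generation throughput: {generated / elapsed:.1f} tokens/s")
```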
Our open-source code (vllm-project/vllm#4650) is not the inference code used in the API platform, so it cannot achieve the throughput reported in the paper. @zhouheyun
What's the average inference context length used to achieve the throughput claimed in the paper? @luofuli