
Reproduce inference benchmark mentioned in the paper #21

Open
zhouheyun opened this issue May 11, 2024 · 4 comments
zhouheyun commented May 11, 2024

I have a few questions about the inference efficiency of DeepSeek-V2.

1.

> In order to efficiently deploy DeepSeek-V2 for service, we first convert its parameters into the precision of FP8.

Is all storage and computation performed in FP8? Does this harm the model's performance?
2.

> On a single node with 8 H800 GPUs, DeepSeek-V2 achieves a generation throughput exceeding 50K tokens per second, which is 5.76 times the maximum generation throughput of DeepSeek 67B. In addition, the prompt input throughput of DeepSeek-V2 exceeds 100K tokens per second.

Is this throughput achieved with test requests of 128K context length? Can we reproduce it using vllm-project/vllm#4650?
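
A note on question 1: "converting parameters into the precision of FP8" typically means quantizing the stored weights to 8-bit floating point; whether activations and matmuls also run in FP8 is a separate design choice, which is what the question is getting at. Below is a minimal weight-only FP8 sketch in PyTorch to illustrate the general idea; it is not DeepSeek's actual conversion pipeline, and the per-tensor scaling, tensor shapes, and BF16 compute path are assumptions made for illustration.

```python
import torch

# Minimal weight-only FP8 (E4M3) quantization sketch -- an illustration of the
# general idea, NOT DeepSeek-V2's deployment pipeline. Per-tensor scaling is
# assumed; production systems often use finer-grained (per-channel/block) scales.
def quantize_to_fp8(w: torch.Tensor):
    scale = w.abs().max() / torch.finfo(torch.float8_e4m3fn).max
    w_fp8 = (w / scale).to(torch.float8_e4m3fn)  # stored at 1 byte per element
    return w_fp8, scale

def dequantize(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Upcast for the matmul; native FP8 kernels (e.g. on Hopper GPUs) can skip this.
    return w_fp8.to(torch.bfloat16) * scale

w = torch.randn(4096, 4096, dtype=torch.bfloat16)   # placeholder weight matrix
x = torch.randn(8, 4096, dtype=torch.bfloat16)      # placeholder activations
w_fp8, s = quantize_to_fp8(w)
y = x @ dequantize(w_fp8, s).t()  # weights stored in FP8, compute done in BF16 here
```

Whether this hurts quality depends on how much of the pipeline stays in FP8 (weights only vs. weights plus activations) and on the scaling granularity, which is exactly what question 1 asks the authors to clarify.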

luofuli (Contributor) commented May 14, 2024

Our open-source code (vllm-project/vllm#4650) is not the inference code used in the API platform, so it cannot achieve the throughput reported in the paper. @zhouheyun

zhouheyun commented May 14, 2024

> Our open-source code (vllm-project/vllm#4650) is not the inference code used in the API platform, so it cannot achieve the throughput reported in the paper. @zhouheyun

What's the average inference context length used to achieve the throughput claimed in the paper? @luofuli

luofuli (Contributor) commented May 27, 2024

32K context length @zhouheyun

ArtificialZeng commented

How many tokens/s can this open-source version (vllm-project/vllm#4650) achieve on 8×H800?
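
For a rough answer, one can measure it directly: time a batch of requests through the open-source vLLM path and divide the number of generated tokens by wall-clock time. The sketch below is a minimal measurement harness under assumed settings (the model id, batch size, prompt set, and max_tokens are placeholders, and tensor_parallel_size=8 assumes the single-node 8-GPU setup from the paper); it is not the setup behind the paper's 50K tokens/s figure, which per the maintainers used their internal FP8 serving stack and around 32K average context.

```python
# Rough generation-throughput measurement with vLLM, assuming a build that
# includes the DeepSeek-V2 support from vllm-project/vllm#4650.
# This is a sketch with placeholder settings, not the paper's benchmark harness.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V2",  # assumed Hugging Face model id
    tensor_parallel_size=8,           # single node, 8 GPUs (e.g. 8x H800)
    trust_remote_code=True,
)

# Placeholder prompts; to mirror the paper's setting you would want realistic
# requests (the maintainers mention ~32K average context length above).
prompts = ["Explain mixture-of-experts inference in one paragraph."] * 256
params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Generation throughput: {generated / elapsed:.1f} tokens/s")
```

Numbers will vary a lot with batch size, context length, quantization, and vLLM version, so results from this sketch are only indicative.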
