Inference speed and memory usage of Qwen1.5-14b #12015

Open
WeiguangHan opened this issue Sep 4, 2024 · 3 comments
@WeiguangHan
Contributor

I have tested the inference speed and memory usage of Qwen1.5-14b on my machine using the example in ipex-llm. The peak CPU memory usage when loading Qwen1.5-14b in 4-bit is about 24GB, and the peak GPU memory usage is about 10GB. The inference speed is about 4-5 tokens/s. I set the environment variables SYCL_CACHE_PERSISTENT=1 and BIGDL_LLM_XMX_DISABLED=1. Do the inference speed and CPU/GPU memory usage meet expectations? I think the peak CPU usage is too high and the speed is a little slow.

device
Intel(R) Core(TM) Ultra 7 155H 3.80 GHz
32.0 GB (31.6 GB available)

env
intel-extension-for-pytorch 2.1.10+xpu
torch 2.1.0a0+cxx11.abi
transformers 4.44.2
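
For reference, the 4-bit loading flow I'm following from the ipex-llm GPU example looks roughly like this (a minimal sketch, not the exact example script; the model path and prompt are placeholders):

```python
# Minimal sketch of 4-bit loading on an Intel GPU with ipex-llm
# (the model path and prompt are placeholders, not the exact example script)
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "Qwen/Qwen1.5-14B-Chat"  # placeholder

# load_in_4bit=True converts the weights to 4-bit (sym_int4) while loading
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)
model = model.half().to("xpu")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

with torch.inference_mode():
    input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids.to("xpu")
    output = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```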

@JinheTang
Contributor

Hi @WeiguangHan, we will take a look at this issue and try to reproduce it first. We'll let you know if there's any progress.

@JinheTang
Contributor

JinheTang commented Sep 10, 2024

Hi @WeiguangHan, we cannot reproduce the issue on an Ultra 5 125H CPU.

The CPU usage when running the qwen1.5 example script turned out to be pretty normal: given that the initial usage is about 9GB, the peak CPU memory usage for loading the Qwen1.5-14B model (converted to int4 using save.py) is about 10GB. The inference speed is 9.2 tokens/s when n-predict is set to the default 32.
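
For context, converting the model to int4 once and reloading it on later runs roughly follows the ipex-llm save/load pattern (a minimal sketch, not the exact save.py; paths are placeholders):

```python
# Minimal sketch of converting Qwen1.5-14B to int4 once and reloading it later
# with ipex-llm's save_low_bit / load_low_bit (paths are placeholders,
# not the exact save.py used above)
from ipex_llm.transformers import AutoModelForCausalLM

# first run: convert to sym_int4 while loading, then persist the low-bit weights
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-14B-Chat",
                                             load_in_4bit=True,
                                             trust_remote_code=True)
model.save_low_bit("./qwen1.5-14b-int4")  # placeholder output directory

# later runs: load the already-converted weights directly, which avoids
# materializing the full-precision checkpoint and keeps peak host memory lower
model = AutoModelForCausalLM.load_low_bit("./qwen1.5-14b-int4",
                                          trust_remote_code=True).to("xpu")
```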

Also, please note that it is recommended to run performance evaluation with the all-in-one benchmark (https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/dev/benchmark/all-in-one); a reference config was attached as a screenshot. Below is the demo output on our machine:

,model,1st token avg latency (ms),2+ avg latency (ms/token),encoder time (ms),input/output tokens,batch_size,actual input/output tokens,num_beams,low_bit,cpu_embedding,model loading time (s),peak mem (GB),streaming,use_fp16_torch_dtype
0,/Qwen1.5-14B-Chat,4517.94,96.96,0.0,1024-128,1,1024-128,1,sym_int4,False,16.18,9.94921875,False,N/A
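
For reference, a minimal config.yaml sketch for the all-in-one benchmark that mirrors the demo output above (exact keys and options may vary by ipex-llm version; local_model_hub is a placeholder path):

```yaml
# Sketch of an all-in-one benchmark config.yaml matching the demo run above.
# Exact keys/options may vary by ipex-llm version; local_model_hub is a placeholder.
repo_id:
  - 'Qwen/Qwen1.5-14B-Chat'
local_model_hub: '/path/to/models'
warm_up: 1
num_trials: 3
num_beams: 1           # matches num_beams=1 in the output row above
low_bit: 'sym_int4'    # matches low_bit=sym_int4 in the output row above
batch_size: 1
in_out_pairs:
  - '1024-128'         # matches input/output tokens in the output row above
test_api:
  - 'transformer_int4_gpu'
cpu_embedding: False   # matches cpu_embedding=False in the output row above
```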

@WeiguangHan
Contributor Author

> Hi @WeiguangHan, we cannot reproduce the issue on an Ultra 5 125H CPU. […]

Thanks a lot. The CPU of my computer is an Ultra 7 155H, so it should theoretically perform better. I will try it again according to your instructions.
