Inference speed and memory usage of Qwen1.5-14b #12015

Open
WeiguangHan opened this issue Sep 4, 2024 · 3 comments
@WeiguangHan
Contributor

I have tested the inference speed and memory usage of Qwen1.5-14b on my machine using the example in ipex-llm. The peak CPU memory usage when loading Qwen1.5-14b in 4-bit is about 24GB, and the peak GPU memory usage is about 10GB. The inference speed is about 4-5 tokens/s. I set the environment variables SYCL_CACHE_PERSISTENT=1 and BIGDL_LLM_XMX_DISABLED=1. Do the inference speed and CPU/GPU memory usage meet expectations? I think the peak CPU usage is too high and the speed is a little slow.

device
Intel(R) Core(TM) Ultra 7 155H 3.80 GHz
32.0 GB (31.6 GB available)

env
intel-extension-for-pytorch 2.1.10+xpu
torch 2.1.0a0+cxx11.abi
transformers 4.44.2
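
For reference, the 4-bit loading flow I'm following from the ipex-llm GPU example looks roughly like this (a minimal sketch, not the exact example script; the model path and prompt are placeholders):

```python
# Minimal sketch of 4-bit loading on an Intel GPU with ipex-llm
# (the model path and prompt are placeholders, not the exact example script)
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "Qwen/Qwen1.5-14B-Chat"  # placeholder

# load_in_4bit=True converts the weights to 4-bit (sym_int4) while loading
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)
model = model.half().to("xpu")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

with torch.inference_mode():
    input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids.to("xpu")
    output = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```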

@JinheTang
Contributor

Hi @WeiguangHan, we will take a look at this issue and try to reproduce it first. We'll let you know if there's any progress.

@JinheTang
Contributor

JinheTang commented Sep 10, 2024

Hi @WeiguangHan, we cannot reproduce the issue on an Ultra 5 125H CPU.

The CPU usage when running the qwen1.5 example script turned out to be pretty normal: given that the initial usage is about 9GB, the peak CPU memory usage for loading the Qwen1.5-14B model (converted to int4 using save.py) is about 10GB. The inference speed is 9.2 tokens/s when n-predict is set to the default 32.
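
For context, converting the model to int4 once and reloading it on later runs roughly follows the ipex-llm save/load pattern (a minimal sketch, not the exact save.py; paths are placeholders):

```python
# Minimal sketch of converting Qwen1.5-14B to int4 once and reloading it later
# with ipex-llm's save_low_bit / load_low_bit (paths are placeholders,
# not the exact save.py used above)
from ipex_llm.transformers import AutoModelForCausalLM

# first run: convert to sym_int4 while loading, then persist the low-bit weights
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-14B-Chat",
                                             load_in_4bit=True,
                                             trust_remote_code=True)
model.save_low_bit("./qwen1.5-14b-int4")  # placeholder output directory

# later runs: load the already-converted weights directly, which avoids
# materializing the full-precision checkpoint and keeps peak host memory lower
model = AutoModelForCausalLM.load_low_bit("./qwen1.5-14b-int4",
                                          trust_remote_code=True).to("xpu")
```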

Also, please note that it is recommended to run performance evaluation with the all-in-one benchmark (https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/dev/benchmark/all-in-one); a reference config was attached as a screenshot. Below is the demo output on our machine:

,model,1st token avg latency (ms),2+ avg latency (ms/token),encoder time (ms),input/output tokens,batch_size,actual input/output tokens,num_beams,low_bit,cpu_embedding,model loading time (s),peak mem (GB),streaming,use_fp16_torch_dtype
0,/Qwen1.5-14B-Chat,4517.94,96.96,0.0,1024-128,1,1024-128,1,sym_int4,False,16.18,9.94921875,False,N/A
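
For reference, a minimal config.yaml sketch for the all-in-one benchmark that mirrors the demo output above (exact keys and options may vary by ipex-llm version; local_model_hub is a placeholder path):

```yaml
# Sketch of an all-in-one benchmark config.yaml matching the demo run above.
# Exact keys/options may vary by ipex-llm version; local_model_hub is a placeholder.
repo_id:
  - 'Qwen/Qwen1.5-14B-Chat'
local_model_hub: '/path/to/models'
warm_up: 1
num_trials: 3
num_beams: 1           # matches num_beams=1 in the output row above
low_bit: 'sym_int4'    # matches low_bit=sym_int4 in the output row above
batch_size: 1
in_out_pairs:
  - '1024-128'         # matches input/output tokens in the output row above
test_api:
  - 'transformer_int4_gpu'
cpu_embedding: False   # matches cpu_embedding=False in the output row above
```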

@WeiguangHan
Contributor Author

> Hi @WeiguangHan, we cannot reproduce the issue on an Ultra 5 125H CPU. […]

Thanks a lot. The CPU of my computer is an Ultra 7 155H, so it should theoretically perform better. I will try it again according to your instructions.
