Description
Hi,
I am using ipex to apply bf16 to the SpeechT5 model. For the bf16 setup I use both ipex.optimize(model, dtype=torch.bfloat16) and torch.cpu.amp.autocast(enabled=True, dtype=torch.bfloat16, cache_enabled=True) in the code, roughly as in the sketch below.
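Roughly, the bf16 setup looks like the following. This is only a minimal sketch: the checkpoint name (microsoft/speecht5_tts), the input text, and the zero speaker embedding are placeholders, not the exact values from my script.

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech

# Placeholder checkpoint; the real script may use a different SpeechT5 variant.
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
model.eval()

# Apply ipex kernel/weight optimizations for bf16 inference.
model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = processor(text="Hello, world.", return_tensors="pt")
# Dummy speaker embedding for illustration; the real script presumably
# loads an x-vector (dim 512) from a dataset.
speaker_embeddings = torch.zeros(1, 512)

# Run inference under bf16 autocast on CPU.
with torch.no_grad(), torch.cpu.amp.autocast(enabled=True, dtype=torch.bfloat16, cache_enabled=True):
    speech = model.generate_speech(inputs["input_ids"], speaker_embeddings)
```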
I find that when I just run the script without numactl, bf16 indeed gets better performance than fp32, as follows:
However, when I use numactl -m 0 -C 0-13 to run the script, bf16 has worse performance than fp32:
- fp32
- bf16
Could you please give me some hints about this phenomenon? Does ipex bf16 end up slower than fp32 when the program is bound to specific cores?
Also, since the runs using numactl (bound to specific cores) seem to be far better than the default run (using all cores), we would appreciate your suggestions on how to get a performance gain from ipex bf16 in the numactl cases.