Description
Hi,
I am using ipex to apply bf16 to the SpeechT5 model. For the bf16 setup I use both ipex.optimize(model, dtype=torch.bfloat16) and torch.cpu.amp.autocast(enabled=True, dtype=torch.bfloat16, cache_enabled=True) in the code, roughly as in the sketch below.
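Roughly, the bf16 setup looks like the following. This is only a minimal sketch: the checkpoint name (microsoft/speecht5_tts), the input text, and the zero speaker embedding are placeholders, not the exact values from my script.

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech

# Placeholder checkpoint; the real script may use a different SpeechT5 variant.
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
model.eval()

# Apply ipex kernel/weight optimizations for bf16 inference.
model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = processor(text="Hello, world.", return_tensors="pt")
# Dummy speaker embedding for illustration; the real script presumably
# loads an x-vector (dim 512) from a dataset.
speaker_embeddings = torch.zeros(1, 512)

# Run inference under bf16 autocast on CPU.
with torch.no_grad(), torch.cpu.amp.autocast(enabled=True, dtype=torch.bfloat16, cache_enabled=True):
    speech = model.generate_speech(inputs["input_ids"], speaker_embeddings)
```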
I find that when I just run the script without numactl, bf16 indeed gets better performance than fp32, as follows:
However, when I use numactl -m 0 -C 0-13 to run the script, bf16 has worse performance than fp32:
- fp32
- bf16
Could you please give me some hints about this phenomenon? Does ipex bf16 end up slower than fp32 when the program is bound to specific cores?
Also, since the runs using numactl (bound to specific cores) seem to be far better than the default run (using all cores), we would appreciate your suggestions on how to get a performance gain from ipex bf16 in the numactl cases.