
Unexplainable bf16 performance drop when using numactl to bind specific cores #387

Open
@Spycsh

Describe the issue

Hi,

I am using ipex to run the SpeechT5 model in bf16. For the bf16 setup I use both ipex.optimize(model, dtype=torch.bfloat16) and the context manager with torch.cpu.amp.autocast(enabled=True, dtype=torch.bfloat16, cache_enabled=True): in the code. When I run the script without numactl, bf16 does perform better than fp32, as follows:

  • fp32
    image

  • bf16
    image
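
For reference, the bf16 setup described above can be sketched roughly as follows. This is a minimal sketch and not the actual script: a small stand-in module replaces SpeechT5 (an assumption made for brevity), the input shape is arbitrary, and the ipex import is made optional so the snippet still runs where ipex is not installed.

```python
import torch

try:
    import intel_extension_for_pytorch as ipex  # optional in this sketch
except ImportError:
    ipex = None

# Stand-in for the SpeechT5 model (assumption): any nn.Module in eval mode.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()).eval()
x = torch.randn(8, 64)  # arbitrary dummy input

if ipex is not None:
    # Prepare the model for bf16 inference with ipex.
    model = ipex.optimize(model, dtype=torch.bfloat16)

# Run inference under CPU autocast so eligible ops execute in bf16.
with torch.no_grad(), torch.cpu.amp.autocast(
    enabled=True, dtype=torch.bfloat16, cache_enabled=True
):
    out = model(x)

print(out.dtype)
```

Printing the output dtype is a quick way to confirm that the forward pass really ran in bf16 before comparing timings.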

However, when I run the script with numactl -m 0 -C 0-13, bf16 performs worse than fp32:

  • fp32
    image

  • bf16
    image
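
When comparing the two runs, it may help to confirm which cores the process is actually allowed to use. The snippet below is a small diagnostic sketch, assuming Linux (os.sched_getaffinity is not available on every platform, so there is a fallback). The helper name visible_cpus is hypothetical, and whether OMP_NUM_THREADS is the relevant setting here is also an assumption. Under numactl -C 0-13 one would expect 14 CPU ids to be listed.

```python
import os


def visible_cpus():
    """Return the sorted list of CPU ids this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):  # available on Linux
        return sorted(os.sched_getaffinity(0))
    # Fallback for other platforms: assume every CPU is usable.
    return list(range(os.cpu_count() or 1))


cpus = visible_cpus()
omp_threads = os.environ.get("OMP_NUM_THREADS", "<unset>")

print(f"allowed CPUs: {len(cpus)} -> {cpus}")
print(f"OMP_NUM_THREADS: {omp_threads}")
```

Running it once as plain python and once under the numactl command above shows whether the binding and the thread-count setting match in both the fp32 and bf16 runs.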

Could you please give me some hints about this behavior? Does ipex bf16 perform worse than fp32 when the program is bound to specific cores?

Also, since the runs under numactl (bound to specific cores) seem to be far faster than the default run (all cores available), we would appreciate your suggestions on how to get a performance gain from ipex bf16 when running under numactl.
