OpenBLAS threading affinity #7197

Closed
@luotao1

Description

The OpenBLAS inference benchmark in IntelOptimizedPaddle.md seems wrong.
In all three networks, images/second decreases as BatchSize increases.

  • VGG-19

| BatchSize | 1 | 2 | 4 | 8 | 16 |
|-----------|------|------|------|------|------|
| OpenBLAS | 1.07 | 1.08 | 1.06 | 0.88 | 0.65 |

  • ResNet-50

| BatchSize | 1 | 2 | 4 | 8 | 16 |
|-----------|------|------|------|------|------|
| OpenBLAS | 3.35 | 3.19 | 3.09 | 2.55 | 1.96 |

  • GoogLeNet

| BatchSize | 1 | 2 | 4 | 8 | 16 |
|-----------|-------|-------|-------|------|------|
| OpenBLAS | 12.04 | 11.31 | 10.00 | 9.07 | 4.34 |

Possible Reason

Why does images/second increase with BatchSize in the training benchmark, but not here?

`OPENBLAS_NUM_THREADS * trainer_count = core number`

The minimum BatchSize used in training is 64, which is larger than the core number (40 in the experiment). Thus, we export OPENBLAS_NUM_THREADS=1 and set trainer_count=40.

However, in inference the BatchSize is smaller than the core number. For example, when BatchSize=2, we export OPENBLAS_NUM_THREADS=20 and set trainer_count=2, which may cause a conflict in thread affinity.
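The thread split described above can be sketched as a shell snippet. Note that `TRAINER_COUNT` here is just a shell variable standing in for PaddlePaddle's `trainer_count` flag (it is not a real environment variable the framework reads); only `OPENBLAS_NUM_THREADS` is actually consumed by OpenBLAS:

```shell
# Number of physical cores on the machine (40 in the experiment).
CORES=40

# Training: BatchSize >= 64 > CORES, so run one BLAS thread per trainer.
export OPENBLAS_NUM_THREADS=1
TRAINER_COUNT=$CORES                              # trainer_count=40

# Inference with BatchSize=2: trainer_count is capped by the batch size,
# so the remaining cores are given to OpenBLAS threads.
BATCH_SIZE=2
TRAINER_COUNT=$BATCH_SIZE                         # trainer_count=2
export OPENBLAS_NUM_THREADS=$((CORES / TRAINER_COUNT))   # 20

# The invariant from the text: threads-per-trainer * trainers = cores.
echo $((OPENBLAS_NUM_THREADS * TRAINER_COUNT))    # prints 40
```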

How can I disable OpenBLAS threading affinity at runtime?

You can define the OPENBLAS_MAIN_FREE or GOTOBLAS_MAIN_FREE environment variable to disable threading affinity at runtime. For example, before running, export OPENBLAS_MAIN_FREE=1.
Alternatively, you can disable the affinity feature by enabling NO_AFFINITY=1 in Makefile.rule. https://github.com/xianyi/OpenBLAS/wiki/Faq#no_affinity
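Both options from the FAQ, side by side (the runtime variable needs no rebuild; the Makefile.rule change requires recompiling OpenBLAS):

```shell
# Option 1: disable thread affinity at runtime, before launching inference.
export OPENBLAS_MAIN_FREE=1
# GOTOBLAS_MAIN_FREE=1 is the legacy (GotoBLAS-era) equivalent.

# Option 2: disable affinity permanently at build time by setting, in
# OpenBLAS's Makefile.rule:
#   NO_AFFINITY = 1
# and then rebuilding the library.
```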

Thus, I exported OPENBLAS_MAIN_FREE=1 and re-ran the VGG inference test; the results speed up:

| BatchSize | 1 | 2 | 4 | 8 | 16 |
|-----------|-------------|-------------|-------------|-------------|-------------|
| OpenBLAS | 1.07 → 1.08 | 1.08 → 1.99 | 1.06 → 3.64 | 0.88 → 3.57 | 0.65 → 2.27 |
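For reference, the before/after pairs in the table above work out to the following speedup ratios (a small awk one-liner, since shell arithmetic is integer-only):

```shell
# Speedup ratio (after / before) for each BatchSize in the VGG-19 table.
for pair in 1.07:1.08 1.08:1.99 1.06:3.64 0.88:3.57 0.65:2.27; do
  before=${pair%:*}
  after=${pair#*:}
  awk -v a="$after" -v b="$before" 'BEGIN { printf "%.2fx\n", a / b }'
done
# prints: 1.01x 1.84x 3.43x 4.06x 3.49x (one per line)
```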

@tensor-tang Can you help double check this result?

Solution

If OpenBLAS threading affinity affects the elapsed time this much, should we set it automatically in the program, as we do for MKL?

@tensor-tang Can you give some suggestion about this?
