Description
The OpenBLAS inference benchmark seems wrong in IntelOptimizedPaddle.md.
In all three networks, images/second decreases as BatchSize increases, when it should stay flat or improve.
- VGG-19
| BatchSize | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| OpenBLAS (images/sec) | 1.07 | 1.08 | 1.06 | 0.88 | 0.65 |
- ResNet-50
| BatchSize | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| OpenBLAS (images/sec) | 3.35 | 3.19 | 3.09 | 2.55 | 1.96 |
- GoogLeNet
| BatchSize | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| OpenBLAS (images/sec) | 12.04 | 11.31 | 10.00 | 9.07 | 4.34 |
Possible Reason
Why does images/second increase with BatchSize in the training benchmark, but not in inference?
In the benchmark we keep `OPENBLAS_NUM_THREADS * trainer_count = core number`. The minimum BatchSize used in training is 64, which is larger than the core number (40 in the experiment). Thus, we export `OPENBLAS_NUM_THREADS=1` and `trainer_count=40`.
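In shell terms, the training runs are configured roughly like this (a minimal sketch; the `paddle train` invocation and flags are illustrative, assuming the 40-core machine from the experiment):

```bash
# Training on a 40-core machine: one OpenBLAS thread per trainer,
# so OPENBLAS_NUM_THREADS * trainer_count = 1 * 40 = core number.
export OPENBLAS_NUM_THREADS=1
paddle train --use_gpu=false --trainer_count=40  # plus the usual model/config flags
```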
However, in inference the BatchSize is smaller than the core number. For example, when `BatchSize=2`, we export `OPENBLAS_NUM_THREADS=20` and `trainer_count=2`, which may cause a conflict in thread affinity.
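The inference setup then looks roughly like this (a sketch, same caveats as above; that the per-instance thread pinning overlaps is my reading of the suspected affinity conflict, not a confirmed diagnosis):

```bash
# Inference with BatchSize=2 on the same 40-core machine:
# 20 OpenBLAS threads x 2 trainers = 40 threads in total, but with
# affinity enabled each OpenBLAS instance may pin its threads to the
# same cores, so the two instances contend instead of spreading out.
export OPENBLAS_NUM_THREADS=20
paddle train --use_gpu=false --trainer_count=2  # plus the usual model/config flags
```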
How can I disable OpenBLAS threading affinity at runtime?
From the OpenBLAS FAQ: you can define the OPENBLAS_MAIN_FREE or GOTOBLAS_MAIN_FREE environment variable to disable threading affinity at runtime, e.g. `export OPENBLAS_MAIN_FREE=1` before running. Alternatively, you can disable the affinity feature by enabling `NO_AFFINITY=1` in Makefile.rule. See https://github.com/xianyi/OpenBLAS/wiki/Faq#no_affinity.
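Concretely, the two options from the FAQ look like this (a sketch; the `make` line assumes OpenBLAS is being built from source in its own checkout):

```bash
# Option 1 (runtime): disable thread affinity for subsequent runs.
export OPENBLAS_MAIN_FREE=1
# Option 2 (build time): compile OpenBLAS with affinity disabled,
# overriding the NO_AFFINITY setting from Makefile.rule.
make NO_AFFINITY=1
```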
Thus, I exported `OPENBLAS_MAIN_FREE=1` and re-ran the VGG inference test; the result speeds up:
| BatchSize | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| OpenBLAS (images/sec, before -> after) | 1.07 -> 1.08 | 1.08 -> 1.99 | 1.06 -> 3.64 | 0.88 -> 3.57 | 0.65 -> 2.27 |
@tensor-tang Can you help double check this result?
Solution
If OpenBLAS threading affinity affects the elapsed time, should we set it automatically in the program, as is done for MKL?
- Should we add `export OPENBLAS_MAIN_FREE` in paddle/scripts/submit_local.sh.in and python/paddle/v2/init.py, or directly add `NO_AFFINITY=1` in openblas.cmake (see the sketch after this list)?
- Is `OPENBLAS_MAIN_FREE` related to hyperthreading, like `KMP_AFFINITY` is?
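For the first option, the shell-script change might look like this minimal sketch (the guard against overriding a user-supplied value is my assumption, not existing Paddle code):

```bash
# Sketch for paddle/scripts/submit_local.sh.in: enable OPENBLAS_MAIN_FREE
# only if the user has not already set it themselves.
if [ -z "${OPENBLAS_MAIN_FREE}" ]; then
    export OPENBLAS_MAIN_FREE=1
fi
```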
@tensor-tang Can you give some suggestions about this?