Description
The OpenBLAS inference benchmark seems wrong in IntelOptimizedPaddle.md.
In all three networks, images/second decreases as BatchSize increases, when it should stay flat or improve.
- VGG-19
| BatchSize | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| OpenBLAS (images/sec) | 1.07 | 1.08 | 1.06 | 0.88 | 0.65 |
- ResNet-50
| BatchSize | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| OpenBLAS (images/sec) | 3.35 | 3.19 | 3.09 | 2.55 | 1.96 |
- GoogLeNet
| BatchSize | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| OpenBLAS (images/sec) | 12.04 | 11.31 | 10.00 | 9.07 | 4.34 |
Possible Reason
Why does images/second increase with BatchSize in the training benchmark, but not in inference?
In the benchmark we keep `OPENBLAS_NUM_THREADS * trainer_count = core number`. The minimum BatchSize used in training is 64, which is larger than the core number (40 in the experiment). Thus, we export `OPENBLAS_NUM_THREADS=1` and `trainer_count=40`.
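In shell terms, the training runs are configured roughly like this (a minimal sketch; the `paddle train` invocation and flags are illustrative, assuming the 40-core machine from the experiment):

```bash
# Training on a 40-core machine: one OpenBLAS thread per trainer,
# so OPENBLAS_NUM_THREADS * trainer_count = 1 * 40 = core number.
export OPENBLAS_NUM_THREADS=1
paddle train --use_gpu=false --trainer_count=40  # plus the usual model/config flags
```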
However, in inference the BatchSize is smaller than the core number. For example, when `BatchSize=2`, we export `OPENBLAS_NUM_THREADS=20` and `trainer_count=2`, which may cause a conflict in thread affinity.
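The inference setup then looks roughly like this (a sketch, same caveats as above; that the per-instance thread pinning overlaps is my reading of the suspected affinity conflict, not a confirmed diagnosis):

```bash
# Inference with BatchSize=2 on the same 40-core machine:
# 20 OpenBLAS threads x 2 trainers = 40 threads in total, but with
# affinity enabled each OpenBLAS instance may pin its threads to the
# same cores, so the two instances contend instead of spreading out.
export OPENBLAS_NUM_THREADS=20
paddle train --use_gpu=false --trainer_count=2  # plus the usual model/config flags
```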
How can I disable OpenBLAS threading affinity at runtime?
From the OpenBLAS FAQ: you can define the OPENBLAS_MAIN_FREE or GOTOBLAS_MAIN_FREE environment variable to disable threading affinity at runtime, e.g. `export OPENBLAS_MAIN_FREE=1` before running. Alternatively, you can disable the affinity feature by enabling `NO_AFFINITY=1` in Makefile.rule. See https://github.com/xianyi/OpenBLAS/wiki/Faq#no_affinity.
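Concretely, the two options from the FAQ look like this (a sketch; the `make` line assumes OpenBLAS is being built from source in its own checkout):

```bash
# Option 1 (runtime): disable thread affinity for subsequent runs.
export OPENBLAS_MAIN_FREE=1
# Option 2 (build time): compile OpenBLAS with affinity disabled,
# overriding the NO_AFFINITY setting from Makefile.rule.
make NO_AFFINITY=1
```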
Thus, I exported `OPENBLAS_MAIN_FREE=1` and re-ran the VGG inference test; the result speeds up:
| BatchSize | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| OpenBLAS (images/sec, before -> after) | 1.07 -> 1.08 | 1.08 -> 1.99 | 1.06 -> 3.64 | 0.88 -> 3.57 | 0.65 -> 2.27 |
@tensor-tang Can you help double check this result?
Solution
If OpenBLAS threading affinity affects the elapsed time, should we set it automatically in the program, as is done for MKL?
- Should we add `export OPENBLAS_MAIN_FREE` in paddle/scripts/submit_local.sh.in and python/paddle/v2/init.py, or directly add `NO_AFFINITY=1` in openblas.cmake (see the sketch after this list)?
- Is `OPENBLAS_MAIN_FREE` related to hyperthreading, like `KMP_AFFINITY` is?
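For the first option, the shell-script change might look like this minimal sketch (the guard against overriding a user-supplied value is my assumption, not existing Paddle code):

```bash
# Sketch for paddle/scripts/submit_local.sh.in: enable OPENBLAS_MAIN_FREE
# only if the user has not already set it themselves.
if [ -z "${OPENBLAS_MAIN_FREE}" ]; then
    export OPENBLAS_MAIN_FREE=1
fi
```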
@tensor-tang Can you give some suggestions about this?