
Threaded MKL for paddle #2379

@wanglovesyang

Description


I read cblas.cmake in Paddle and found that Paddle links against libmkl_sequential.so, which means all matrix operations on the CPU are done by a single core (within one trainer). This is reasonable on common server nodes (128 GB + 12 cores). However, I am currently running on an Intel Xeon Phi CPU, which has 256 cores. 128 GB of memory cannot hold 256 trainers if I want to make use of all the computing resources.
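For context, the link logic in question looks roughly like the following. This is a minimal sketch, assuming an `MKL_ROOT` variable that points at the local MKL install; the actual variable names and paths in Paddle's cblas.cmake differ.

```cmake
# Sketch of a sequential MKL link line: libmkl_sequential.so serves as the
# threading layer, so every BLAS/GEMM call runs on a single core.
find_library(MKL_INTEL_LIB NAMES mkl_intel_lp64 PATHS ${MKL_ROOT}/lib/intel64)
find_library(MKL_SEQ_LIB   NAMES mkl_sequential PATHS ${MKL_ROOT}/lib/intel64)
find_library(MKL_CORE_LIB  NAMES mkl_core       PATHS ${MKL_ROOT}/lib/intel64)
set(CBLAS_LIBRARIES ${MKL_INTEL_LIB} ${MKL_SEQ_LIB} ${MKL_CORE_LIB})
```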

Hence, I switched to libmkl_intel_thread.so (by changing the cmake file) to parallelize Paddle's GEMM operations, so that I could reach 100% CPU usage while running only 10 trainers. Unfortunately, training with the threaded library (1 h/pass at 100% CPU) is much slower than with libmkl_sequential.so on 10 trainers (0.5 h/pass at 5% CPU). This result makes no sense to me. Could anyone help me look into this problem?
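For reference, the change I made is roughly the following. Again a sketch rather than Paddle's exact cblas.cmake contents: the variable names are assumptions, `MKL_INTEL_LIB` and `MKL_CORE_LIB` are found as in the sketch above, and the libiomp5 location varies by MKL installation.

```cmake
# Sketch of switching from the sequential to the threaded MKL layer.
# The threaded layer also needs the Intel OpenMP runtime (libiomp5).
find_library(MKL_THREAD_LIB NAMES mkl_intel_thread PATHS ${MKL_ROOT}/lib/intel64)
find_library(IOMP5_LIB      NAMES iomp5            PATHS ${MKL_ROOT}/../lib/intel64)
set(CBLAS_LIBRARIES ${MKL_INTEL_LIB} ${MKL_THREAD_LIB} ${MKL_CORE_LIB}
    ${IOMP5_LIB} pthread m dl)
```

One thing worth noting: with the threaded layer, MKL defaults to roughly one worker thread per physical core, so running several trainers side by side may require capping each one (e.g. via MKL_NUM_THREADS or OMP_NUM_THREADS) to avoid oversubscribing the cores.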

Labels: User (used to mark user issues)