
Threaded MKL for paddle #2379

@wanglovesyang

Description


I read cblas.cmake in Paddle and found that Paddle links against libmkl_sequential.so, which means all matrix operations on the CPU are done by a single core (within one trainer). This is reasonable on common server nodes (128 GB + 12 cores). However, I am currently running on an Intel Xeon Phi CPU, which has 256 cores. 128 GB of memory cannot hold 256 trainers if I want to make use of all the computing resources.
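For context, the link logic in question looks roughly like the following. This is a minimal sketch, assuming an `MKL_ROOT` variable that points at the local MKL install; the actual variable names and paths in Paddle's cblas.cmake differ.

```cmake
# Sketch of a sequential MKL link line: libmkl_sequential.so serves as the
# threading layer, so every BLAS/GEMM call runs on a single core.
find_library(MKL_INTEL_LIB NAMES mkl_intel_lp64 PATHS ${MKL_ROOT}/lib/intel64)
find_library(MKL_SEQ_LIB   NAMES mkl_sequential PATHS ${MKL_ROOT}/lib/intel64)
find_library(MKL_CORE_LIB  NAMES mkl_core       PATHS ${MKL_ROOT}/lib/intel64)
set(CBLAS_LIBRARIES ${MKL_INTEL_LIB} ${MKL_SEQ_LIB} ${MKL_CORE_LIB})
```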

Hence, I switched to libmkl_intel_thread.so (by changing the cmake file) to parallelize Paddle's GEMM operations, so that I could reach 100% CPU usage while running only 10 trainers. Unfortunately, training with the threaded library (1 h/pass at 100% CPU) is much slower than with libmkl_sequential.so on 10 trainers (0.5 h/pass at 5% CPU). This result makes no sense to me. Could anyone help me look into this problem?
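For reference, the change I made is roughly the following. Again a sketch rather than Paddle's exact cblas.cmake contents: the variable names are assumptions, `MKL_INTEL_LIB` and `MKL_CORE_LIB` are found as in the sketch above, and the libiomp5 location varies by MKL installation.

```cmake
# Sketch of switching from the sequential to the threaded MKL layer.
# The threaded layer also needs the Intel OpenMP runtime (libiomp5).
find_library(MKL_THREAD_LIB NAMES mkl_intel_thread PATHS ${MKL_ROOT}/lib/intel64)
find_library(IOMP5_LIB      NAMES iomp5            PATHS ${MKL_ROOT}/../lib/intel64)
set(CBLAS_LIBRARIES ${MKL_INTEL_LIB} ${MKL_THREAD_LIB} ${MKL_CORE_LIB}
    ${IOMP5_LIB} pthread m dl)
```

One thing worth noting: with the threaded layer, MKL defaults to roughly one worker thread per physical core, so running several trainers side by side may require capping each one (e.g. via MKL_NUM_THREADS or OMP_NUM_THREADS) to avoid oversubscribing the cores.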

Labels: User (used to mark user issues)