
LoD implementation hurts performance #7713


Description

@dzhwinter

Our current LoD implementation is based on

#ifndef PADDLE_WITH_CUDA
template <typename T>
using Vector = std::vector<T>;
#else
// With CUDA, the LoD vector lives in page-locked (pinned) host memory
// so the device can read it directly.
template <typename T>
using Vector = thrust::host_vector<
    T, thrust::system::cuda::experimental::pinned_allocator<T>>;
#endif

  1. The LoD Vector hurts performance.
    To be compatible with the GPU, the vector is allocated in page-locked (pinned) memory so that the GPU device can access it directly. However, pinned memory does not fit our needs well.
    https://www.cs.virginia.edu/~mwb7w/cuda_support/pinned_tradeoff.html
    Pinned memory only beats cudaMalloc for allocations of 16 MB or more, but our LoD always contains a small piece of data, so we pay a 10-100x higher price (see the first sketch after this list).

  2. LoD should support multiple devices.
    LoD must be accessible on the GPU device because some operators use the LoD position pointer directly:
    https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/math/sequence_padding.cu#L200
    To improve performance, we could customize the pinned_allocator to reach a higher speed (thrust's pinned_allocator allocates through cudaMallocHost). But if we want to support ARM and AMD devices, a shared address is not visible across all of them, so we need a different interface per device (see the second sketch below).
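
A minimal micro-benchmark sketch of the allocation cost described in point 1. The iteration count and the 64-element size are illustrative assumptions, not numbers from this issue; the point is only that cudaMallocHost (which thrust's pinned_allocator uses) is far more expensive per call than an ordinary heap allocation for tiny buffers.

#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
  constexpr int kIters = 1000;
  constexpr size_t kElems = 64;  // a tiny LoD-sized buffer (assumption)
  using clock = std::chrono::steady_clock;

  // Tiny pinned allocations, as thrust's pinned_allocator performs.
  auto t0 = clock::now();
  for (int i = 0; i < kIters; ++i) {
    void* p = nullptr;
    cudaMallocHost(&p, kElems * sizeof(size_t));
    cudaFreeHost(p);
  }
  auto pinned_us = std::chrono::duration_cast<std::chrono::microseconds>(
                       clock::now() - t0).count();

  // The same allocations from the ordinary heap, as std::vector does.
  size_t sink = 0;
  auto t1 = clock::now();
  for (int i = 0; i < kIters; ++i) {
    std::vector<size_t> v(kElems, i);
    sink += v[0];  // keep the allocation from being optimized away
  }
  auto heap_us = std::chrono::duration_cast<std::chrono::microseconds>(
                     clock::now() - t1).count();

  std::printf("pinned: %lld us, heap: %lld us (sink=%zu)\n",
              static_cast<long long>(pinned_us),
              static_cast<long long>(heap_us), sink);
  return 0;
}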
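
One possible shape for the per-device interface suggested in point 2: keep the LoD in a plain std::vector on the host and copy it to the device only when an operator asks for a device pointer. This is only a sketch; DeviceVector and CUDAData are hypothetical names, not Paddle's actual API, and an ARM or AMD backend would plug its own copy path in behind the same host-side interface instead of assuming one shared address.

#include <cstddef>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical sketch: host data stays in a cheap std::vector, and a device
// copy is made lazily, so CPU-only code never pays the pinned-memory price.
template <typename T>
class DeviceVector {
 public:
  explicit DeviceVector(std::vector<T> data) : cpu_(std::move(data)) {}
  ~DeviceVector() {
    if (gpu_ != nullptr) cudaFree(gpu_);
  }

  // Host-side access: ordinary std::vector semantics, no CUDA involved.
  const T* CPUData() const { return cpu_.data(); }
  size_t size() const { return cpu_.size(); }

  // Device-side access: copied to the GPU on first use.
  const T* CUDAData() {
    if (gpu_ == nullptr) {
      cudaMalloc(reinterpret_cast<void**>(&gpu_), cpu_.size() * sizeof(T));
      cudaMemcpy(gpu_, cpu_.data(), cpu_.size() * sizeof(T),
                 cudaMemcpyHostToDevice);
    }
    return gpu_;
  }

 private:
  std::vector<T> cpu_;
  T* gpu_ = nullptr;
};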
