Now our LoD implementation is based on `Paddle/paddle/framework/lod_tensor.h`, lines 35 to 41 at commit 8e08b0a:
```cpp
template <typename T>
using Vector = std::vector<T>;
#else
template <typename T>
using Vector = thrust::host_vector<
    T, thrust::system::cuda::experimental::pinned_allocator<T>>;
#endif
```
LoD Vector hurts performance
To be compatible with GPU, the vector is allocated in page-locked (pinned) memory so that the GPU device can access it directly. However, pinned memory does not fit our needs well:
https://www.cs.virginia.edu/~mwb7w/cuda_support/pinned_tradeoff.html
Only when a buffer is >= 16MB does pinned memory achieve higher speed compared with cudaMalloc, but our LoD always contains a small piece of data, so it pays a 10-100x higher price.
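To make the cost concrete, here is a minimal micro-benchmark sketch (not part of Paddle; the buffer size and iteration count are arbitrary assumptions) comparing the allocation cost of small pinned buffers via `cudaMallocHost` against plain `malloc`:

```cpp
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
  const size_t kBytes = 4 * 1024;  // a typical small LoD payload (assumption)
  const int kIters = 1000;

  cudaFree(0);  // warm up: create the CUDA context before timing

  auto t0 = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < kIters; ++i) {
    void* p = nullptr;
    cudaMallocHost(&p, kBytes);  // page-locked (pinned) allocation
    cudaFreeHost(p);
  }
  auto t1 = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < kIters; ++i) {
    void* p = malloc(kBytes);    // ordinary pageable allocation
    free(p);
  }
  auto t2 = std::chrono::high_resolution_clock::now();

  auto us = [](auto a, auto b) {
    return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
  };
  printf("pinned:   %lld us\n", (long long)us(t0, t1));
  printf("pageable: %lld us\n", (long long)us(t1, t2));
  return 0;
}
```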
LoD should support multi-device.
The reason the LoD must be accessible on the GPU device is that some operators use the LoD position pointer directly, e.g.
https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/math/sequence_padding.cu#L200
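For illustration, a simplified kernel sketch (hypothetical names and a simplified padding layout; the real logic lives in sequence_padding.cu) shows why the raw LoD offsets must be readable from the device:

```cpp
#include <cstddef>

// Copies each variable-length sequence into a fixed-size padded buffer.
// lod_offsets is the LoD data: the kernel dereferences it on the device,
// so it must live in GPU-accessible memory.
__global__ void SequenceToPaddingKernel(float* padding, const float* sequence,
                                        const size_t* lod_offsets,
                                        size_t max_seq_len, size_t width) {
  size_t seq_idx = blockIdx.x;                       // one block per sequence
  size_t begin = lod_offsets[seq_idx];               // device-side LoD read
  size_t len = lod_offsets[seq_idx + 1] - begin;
  for (size_t i = threadIdx.x; i < max_seq_len * width; i += blockDim.x) {
    size_t step = i / width;
    size_t col = i % width;
    padding[(seq_idx * max_seq_len + step) * width + col] =
        step < len ? sequence[(begin + step) * width + col] : 0.0f;
  }
}

// Launch example: SequenceToPaddingKernel<<<num_seqs, 256>>>(...);
```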
To improve performance, we could customize the pinned_allocator to reach a higher speed (the default pinned_allocator calls cudaMallocHost for each allocation). But if we want to support ARM or AMD devices, a shared address space is not visible across all the devices, so we need a different interface per device.
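As a rough sketch of what a per-device interface could look like (all names here are hypothetical, not an actual Paddle API): the host side keeps a plain std::vector, and each device backend supplies its own copy primitive to materialize the data where its kernels can reach it:

```cpp
#include <cstddef>
#include <vector>

template <typename T>
class MixedVector {
 public:
  // Host-side mutation goes through an ordinary std::vector,
  // so small LoDs never pay the pinned-allocation price.
  void push_back(const T& v) {
    cpu_.push_back(v);
    dirty_ = true;
  }
  const T* CPUData() const { return cpu_.data(); }

  // Each device backend implements its own synchronization: CUDA would
  // cudaMemcpy into a cached device buffer, while ARM or AMD backends
  // would use their own copy primitives behind the same interface.
  const T* DeviceData(/* const platform::Place& place */) {
    if (dirty_) {
      // CopyToDevice(cpu_.data(), cpu_.size());  // backend-specific
      dirty_ = false;
    }
    return device_ptr_;
  }

 private:
  std::vector<T> cpu_;             // always valid on host, cheap to allocate
  const T* device_ptr_ = nullptr;  // lazily synchronized device copy
  bool dirty_ = true;
};
```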