
LoD implementation hurts performance #7713


Description

@dzhwinter

Our current LoD implementation is based on

#ifndef PADDLE_WITH_CUDA
template <typename T>
using Vector = std::vector<T>;
#else
// With CUDA, the LoD vector lives in page-locked (pinned) host memory
// so the device can read it directly.
template <typename T>
using Vector = thrust::host_vector<
    T, thrust::system::cuda::experimental::pinned_allocator<T>>;
#endif

  1. The LoD Vector hurts performance.
    To be compatible with the GPU, the vector is allocated in page-locked (pinned) memory so that the GPU device can access it directly. However, pinned memory does not fit our needs well.
    https://www.cs.virginia.edu/~mwb7w/cuda_support/pinned_tradeoff.html
    Pinned memory only beats cudaMalloc for allocations of 16 MB or more, but our LoD always contains a small piece of data, so we pay a 10-100x higher price (see the first sketch after this list).

  2. LoD should support multiple devices.
    LoD must be accessible on the GPU device because some operators use the LoD position pointer directly:
    https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/math/sequence_padding.cu#L200
    To improve performance, we could customize the pinned_allocator to reach a higher speed (thrust's pinned_allocator allocates through cudaMallocHost). But if we want to support ARM and AMD devices, a shared address is not visible across all of them, so we need a different interface per device (see the second sketch below).
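
A minimal micro-benchmark sketch of the allocation cost described in point 1. The iteration count and the 64-element size are illustrative assumptions, not numbers from this issue; the point is only that cudaMallocHost (which thrust's pinned_allocator uses) is far more expensive per call than an ordinary heap allocation for tiny buffers.

#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
  constexpr int kIters = 1000;
  constexpr size_t kElems = 64;  // a tiny LoD-sized buffer (assumption)
  using clock = std::chrono::steady_clock;

  // Tiny pinned allocations, as thrust's pinned_allocator performs.
  auto t0 = clock::now();
  for (int i = 0; i < kIters; ++i) {
    void* p = nullptr;
    cudaMallocHost(&p, kElems * sizeof(size_t));
    cudaFreeHost(p);
  }
  auto pinned_us = std::chrono::duration_cast<std::chrono::microseconds>(
                       clock::now() - t0).count();

  // The same allocations from the ordinary heap, as std::vector does.
  size_t sink = 0;
  auto t1 = clock::now();
  for (int i = 0; i < kIters; ++i) {
    std::vector<size_t> v(kElems, i);
    sink += v[0];  // keep the allocation from being optimized away
  }
  auto heap_us = std::chrono::duration_cast<std::chrono::microseconds>(
                     clock::now() - t1).count();

  std::printf("pinned: %lld us, heap: %lld us (sink=%zu)\n",
              static_cast<long long>(pinned_us),
              static_cast<long long>(heap_us), sink);
  return 0;
}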
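
One possible shape for the per-device interface suggested in point 2: keep the LoD in a plain std::vector on the host and copy it to the device only when an operator asks for a device pointer. This is only a sketch; DeviceVector and CUDAData are hypothetical names, not Paddle's actual API, and an ARM or AMD backend would plug its own copy path in behind the same host-side interface instead of assuming one shared address.

#include <cstddef>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical sketch: host data stays in a cheap std::vector, and a device
// copy is made lazily, so CPU-only code never pays the pinned-memory price.
template <typename T>
class DeviceVector {
 public:
  explicit DeviceVector(std::vector<T> data) : cpu_(std::move(data)) {}
  ~DeviceVector() {
    if (gpu_ != nullptr) cudaFree(gpu_);
  }

  // Host-side access: ordinary std::vector semantics, no CUDA involved.
  const T* CPUData() const { return cpu_.data(); }
  size_t size() const { return cpu_.size(); }

  // Device-side access: copied to the GPU on first use.
  const T* CUDAData() {
    if (gpu_ == nullptr) {
      cudaMalloc(reinterpret_cast<void**>(&gpu_), cpu_.size() * sizeof(T));
      cudaMemcpy(gpu_, cpu_.data(), cpu_.size() * sizeof(T),
                 cudaMemcpyHostToDevice);
    }
    return gpu_;
  }

 private:
  std::vector<T> cpu_;
  T* gpu_ = nullptr;
};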
