Repair nccl op test #8575
Conversation
@@ -121,27 +134,11 @@ class NCCLTester : public ::testing::Test {
  std::vector<p::DeviceContext *> dev_ctxs;
  f::Scope g_scope;
  std::mutex mu;
  std::vector<int> gpu_list;
Data members of classes should have a trailing underscore.
Refer to: Variable Names
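For illustration, a minimal sketch of the suggested renaming (applied to the members in the hunk above; this is not the actual patch):

class NCCLTester : public ::testing::Test {
  // ...
 protected:
  // Trailing underscores mark class data members, per the referenced naming rule.
  std::vector<p::DeviceContext *> dev_ctxs_;
  f::Scope g_scope_;
  std::mutex mu_;
  std::vector<int> gpu_list_;
};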
    auto op = f::OpRegistry::CreateOp(*op1);
    VLOG(1) << "invoke NCCLInitOp.";
    op->Run(g_scope, cpu_place);
    VLOG(1) << "NCCLInitOp finished.";
  }

  int GetGPUData(int gpu_id) { return gpu_id + 42; }
This function is necessary because simply setting the GPU data to gpu_id won't expose the incorrect-linking error. More specifically, Paddle might be compiled against the NCCL 1.3 header while dynamically linked with the NCCL2 shared library.
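As a hedged sketch of why the offset helps (variable names are illustrative, not the test's actual checking code): with plain gpu_id, GPU 0 contributes all zeros, so a corrupted reduction can still look plausible, whereas GetGPUData makes every contribution non-trivial.

// Expected allreduce sum when every rank sends GetGPUData(gpu_id):
int expected_sum = 0;
for (int gpu_id : gpu_list) {
  expected_sum += GetGPUData(gpu_id);  // gpu_id + 42, never all zeros
}
// A header/library mismatch that corrupts the reduction now produces a
// visibly wrong sum instead of a coincidentally "correct" one.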
@@ -97,7 +108,7 @@ class NCCLTester : public ::testing::Test {
    send_tensor->Resize(kDims);
    send_tensor->mutable_data<T>(kDims, place);
Maybe line 108 is unnecessary.
@@ -97,7 +108,7 @@ class NCCLTester : public ::testing::Test {
     send_tensor->Resize(kDims);
     send_tensor->mutable_data<T>(kDims, place);

-    std::vector<T> send_vector(f::product(kDims), gpu_id);
+    std::vector<T> send_vector(f::product(kDims), GetGPUData(gpu_id));
     paddle::framework::TensorFromVector<T>(send_vector, *ctx, send_tensor);
     ctx->Wait();
Is it necessary to synchronize here? I think the copy at line 179 will already synchronize the GPU and CPU.
The copy is a cudaMemcpyAsync. Looks like we need to add a wait at line 179...

Paddle/paddle/fluid/memory/memcpy.cc, lines 30 to 36 at 0d49b92:
template <>
void Copy<platform::CPUPlace, platform::CUDAPlace>(
    platform::CPUPlace dst_place, void* dst, platform::CUDAPlace src_place,
    const void* src, size_t num, cudaStream_t stream) {
  platform::SetDeviceId(src_place.device);
  platform::GpuMemcpyAsync(dst, src, num, cudaMemcpyDeviceToHost, stream);
}
I don't think so, because the memory is pageable on the CPU side: the copy doesn't return until it has completed. The current CUDA Runtime documentation states:

Asynchronous (Memcpy):
- For transfers from device memory to pageable host memory, the function will return only once the copy has completed.
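A self-contained CUDA sketch of the quoted behavior (buffer names and sizes are illustrative): a device-to-host cudaMemcpyAsync into pageable memory blocks until the copy finishes, while the same call into pinned memory is truly asynchronous and needs an explicit synchronize.

#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
  const size_t n = 1 << 20;
  float *dev = nullptr;
  cudaMalloc((void **)&dev, n * sizeof(float));
  cudaMemset(dev, 0, n * sizeof(float));

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Pageable host memory: per the documentation quoted above, this
  // device-to-host cudaMemcpyAsync returns only once the copy has
  // completed, so reading `pageable` right away is safe.
  std::vector<float> pageable(n);
  cudaMemcpyAsync(pageable.data(), dev, n * sizeof(float),
                  cudaMemcpyDeviceToHost, stream);
  printf("pageable[0] = %f (valid without a sync)\n", pageable[0]);

  // Pinned (page-locked) host memory: the same call is truly asynchronous,
  // so the buffer must not be read before the stream is synchronized.
  float *pinned = nullptr;
  cudaMallocHost((void **)&pinned, n * sizeof(float));
  cudaMemcpyAsync(pinned, dev, n * sizeof(float), cudaMemcpyDeviceToHost,
                  stream);
  cudaStreamSynchronize(stream);  // required before touching `pinned`
  printf("pinned[0] = %f (valid only after the sync)\n", pinned[0]);

  cudaFreeHost(pinned);
  cudaFree(dev);
  cudaStreamDestroy(stream);
  return 0;
}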
👍
LGTM!
fix #8582