I was profiling the performance of multiple zfp compressions/decompressions and noticed that the newer (unreleased) zfp code contains a "device init" function that is called whenever the execution mode is set for a zfp stream.
|
// warm up device by launching dummy kernel |
|
bool device_init() |
|
{ |
|
bool success = true; |
|
ErrorCheck error; |
|
|
|
// allocate device memory |
|
unsigned int* d_word = NULL; |
|
cudaMalloc(&d_word, sizeof(*d_word)); |
|
success &= error.check("zfp device init - cudaMalloc"); |
|
|
|
// launch a dummy kernel |
|
device_init_kernel<<<1, 1>>>(d_word); |
|
success &= error.check("zfp device init - kernel"); |
|
|
|
// allocate host memory |
|
unsigned int* h_word = NULL; |
|
cudaMallocHost(&h_word, sizeof(*h_word)); |
|
success &= error.check("zfp device init - cudaMallocHost"); |
|
|
|
// copy from device to host |
|
if (success) { |
|
cudaMemcpy(h_word, d_word, sizeof(*h_word), cudaMemcpyDeviceToHost); |
|
success &= error.check("zfp device init - cudaMemcpy"); |
|
success &= (*h_word == ZFP_MAGIC); |
|
} |
|
|
|
// free host memory |
|
cudaFreeHost(h_word); |
|
success &= error.check("zfp device init - cudaFreeHost"); |
|
|
|
// free device memory |
|
cudaFree(d_word); |
|
success &= error.check("zfp device init - cudaFree"); |
|
|
|
return success; |
|
} |
called from here
|
zfp_bool |
|
zfp_internal_cuda_init(zfp_exec_params_cuda* params) |
|
{ |
|
// ensure GPU word size equals CPU word size |
|
if (sizeof(Word) != sizeof(bitstream_word)) |
|
return false; |
|
|
|
// perform expensive query of device properties only once |
|
static bool initialized = false; |
|
static cudaDeviceProp prop; |
|
if (!initialized && cudaGetDeviceProperties(&prop, 0) != cudaSuccess) |
|
return zfp_false; |
|
initialized = true; |
|
|
|
// cache device properties |
|
params->processors = prop.multiProcessorCount; |
|
params->grid_size[0] = prop.maxGridSize[0]; |
|
params->grid_size[1] = prop.maxGridSize[1]; |
|
params->grid_size[2] = prop.maxGridSize[2]; |
|
|
|
// launch device warm-up kernel |
|
return (zfp_bool)zfp::cuda::internal::device_init(); |
|
} |
Here is a list of events from a profile I took that shows two different compression calls:
For our use case, the data to be compressed is relatively small (8MB-32MB), so the compression call itself only runs for ~40us, while the other calls related to this device init code incur up to 100s of microseconds of latency. I was able to resolve this by only calling zfp::cuda::internal::device_init the first time zfp_internal_cuda_init is called. I think this is an acceptable approach, since I believe the device init function is just making sure that the application has correctly enabled the CUDA runtime and that the application is able to run the kernels. After device init is succeeds once, it is likely that it will succeed throughout the course of the application. Do I have a misconception about the purpose of this code or is my suggested code change valid?
I was profiling the performance of multiple zfp compressions/decompressions and noticed that the newer (unreleased) zfp code contains a "device init" function that is called whenever the execution mode is set for a zfp stream.
zfp/src/cuda/device.cuh
Lines 17 to 53 in cccbb9d
called from here
zfp/src/cuda/interface.cu
Lines 20 to 42 in cccbb9d
Here is a list of events from a profile I took that shows two different compression calls:
For our use case, the data to be compressed is relatively small (8MB-32MB), so the compression call itself only runs for ~40us, while the other calls related to this device init code incur up to 100s of microseconds of latency. I was able to resolve this by only calling
zfp::cuda::internal::device_initthe first timezfp_internal_cuda_initis called. I think this is an acceptable approach, since I believe the device init function is just making sure that the application has correctly enabled the CUDA runtime and that the application is able to run the kernels. After device init is succeeds once, it is likely that it will succeed throughout the course of the application. Do I have a misconception about the purpose of this code or is my suggested code change valid?