Skip to content

Large overhead due to device init code #293

@ndcontini

Description

@ndcontini

I was profiling the performance of multiple zfp compressions/decompressions and noticed that the newer (unreleased) zfp code contains a "device init" function that is called whenever the execution mode is set for a zfp stream.

zfp/src/cuda/device.cuh

Lines 17 to 53 in cccbb9d

// warm up device by launching dummy kernel
bool device_init()
{
bool success = true;
ErrorCheck error;
// allocate device memory
unsigned int* d_word = NULL;
cudaMalloc(&d_word, sizeof(*d_word));
success &= error.check("zfp device init - cudaMalloc");
// launch a dummy kernel
device_init_kernel<<<1, 1>>>(d_word);
success &= error.check("zfp device init - kernel");
// allocate host memory
unsigned int* h_word = NULL;
cudaMallocHost(&h_word, sizeof(*h_word));
success &= error.check("zfp device init - cudaMallocHost");
// copy from device to host
if (success) {
cudaMemcpy(h_word, d_word, sizeof(*h_word), cudaMemcpyDeviceToHost);
success &= error.check("zfp device init - cudaMemcpy");
success &= (*h_word == ZFP_MAGIC);
}
// free host memory
cudaFreeHost(h_word);
success &= error.check("zfp device init - cudaFreeHost");
// free device memory
cudaFree(d_word);
success &= error.check("zfp device init - cudaFree");
return success;
}

called from here

zfp_bool
zfp_internal_cuda_init(zfp_exec_params_cuda* params)
{
// ensure GPU word size equals CPU word size
if (sizeof(Word) != sizeof(bitstream_word))
return false;
// perform expensive query of device properties only once
static bool initialized = false;
static cudaDeviceProp prop;
if (!initialized && cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
return zfp_false;
initialized = true;
// cache device properties
params->processors = prop.multiProcessorCount;
params->grid_size[0] = prop.maxGridSize[0];
params->grid_size[1] = prop.maxGridSize[1];
params->grid_size[2] = prop.maxGridSize[2];
// launch device warm-up kernel
return (zfp_bool)zfp::cuda::internal::device_init();
}

Here is a list of events from a profile I took that shows two different compression calls:

Image

For our use case, the data to be compressed is relatively small (8MB-32MB), so the compression call itself only runs for ~40us, while the other calls related to this device init code incur up to 100s of microseconds of latency. I was able to resolve this by only calling zfp::cuda::internal::device_init the first time zfp_internal_cuda_init is called. I think this is an acceptable approach, since I believe the device init function is just making sure that the application has correctly enabled the CUDA runtime and that the application is able to run the kernels. After device init is succeeds once, it is likely that it will succeed throughout the course of the application. Do I have a misconception about the purpose of this code or is my suggested code change valid?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions