Large overhead due to device init code

I was profiling the performance of multiple zfp compressions/decompressions and noticed that the newer (unreleased) zfp code contains a "device init" function that is called whenever the execution mode is set for a zfp stream.

https://github.com/llnl/zfp/blob/cccbb9d5e69a2998f63eb06153ca3b7b63901b56/src/cuda/device.cuh#L17-L53

called from here
https://github.com/llnl/zfp/blob/cccbb9d5e69a2998f63eb06153ca3b7b63901b56/src/cuda/interface.cu#L20-L42

Here is a list of events from a profile I took that shows two different compression calls:

<img width="556" height="593" alt="Image" src="https://github.com/user-attachments/assets/41321a65-050e-4514-87ff-89f8762708ac" />

For our use case, the data to be compressed is relatively small (8MB-32MB), so the compression call itself only runs for ~40us, while the other calls related to this device init code incur up to 100s of microseconds of latency. I was able to resolve this by only calling `zfp::cuda::internal::device_init` the first time `zfp_internal_cuda_init` is called. I think this is an acceptable approach, since I believe the device init function is just making sure that the application has correctly enabled the CUDA runtime and that the application is able to run the kernels. After device init is succeeds once, it is likely that it will succeed throughout the course of the application. Do I have a misconception about the purpose of this code or is my suggested code change valid?

	// warm up device by launching dummy kernel
	bool device_init()
	{
	bool success = true;
	ErrorCheck error;

	// allocate device memory
	unsigned int* d_word = NULL;
	cudaMalloc(&d_word, sizeof(*d_word));
	success &= error.check("zfp device init - cudaMalloc");

	// launch a dummy kernel
	device_init_kernel<<<1, 1>>>(d_word);
	success &= error.check("zfp device init - kernel");

	// allocate host memory
	unsigned int* h_word = NULL;
	cudaMallocHost(&h_word, sizeof(*h_word));
	success &= error.check("zfp device init - cudaMallocHost");

	// copy from device to host
	if (success) {
	cudaMemcpy(h_word, d_word, sizeof(*h_word), cudaMemcpyDeviceToHost);
	success &= error.check("zfp device init - cudaMemcpy");
	success &= (*h_word == ZFP_MAGIC);
	}

	// free host memory
	cudaFreeHost(h_word);
	success &= error.check("zfp device init - cudaFreeHost");

	// free device memory
	cudaFree(d_word);
	success &= error.check("zfp device init - cudaFree");

	return success;
	}

	zfp_bool
	zfp_internal_cuda_init(zfp_exec_params_cuda* params)
	{
	// ensure GPU word size equals CPU word size
	if (sizeof(Word) != sizeof(bitstream_word))
	return false;

	// perform expensive query of device properties only once
	static bool initialized = false;
	static cudaDeviceProp prop;
	if (!initialized && cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
	return zfp_false;
	initialized = true;

	// cache device properties
	params->processors = prop.multiProcessorCount;
	params->grid_size[0] = prop.maxGridSize[0];
	params->grid_size[1] = prop.maxGridSize[1];
	params->grid_size[2] = prop.maxGridSize[2];

	// launch device warm-up kernel
	return (zfp_bool)zfp::cuda::internal::device_init();
	}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Large overhead due to device init code #293

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Large overhead due to device init code #293

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions