A lightweight CLI tool to measure CPU-side latency of CUDA Driver Virtual Memory Management (VMM) APIs, including:
- `cuMemGetAllocationGranularity`
- `cuMemAddressReserve` / `cuMemAddressFree`
- `cuMemCreate` / `cuMemRelease`
- `cuMemMap` / `cuMemUnmap`
- `cuMemSetAccess`
Outputs per-API statistics per size: avg, p50, p95. Console shows a table; optional CSV appends records.
- Linux with an NVIDIA driver (providing `libcuda.so`).
- CUDA Toolkit headers (`cuda.h`) for the build. If you don't have `nvcc`, the fallback g++ build links `-lcuda` and still requires the headers.
- CUDA 11+ recommended; the device must support VMM.
Preferred: `nvcc` (if available):

```sh
cd /data/workspace/vmm
make
```

If `nvcc` is not available, use the g++ fallback (headers and library required):

```sh
make gpp

# If headers/libraries are not in default paths:
make gpp INCLUDES="-I/usr/local/cuda/include" LDFLAGS="-L/usr/local/cuda/lib64 -lcuda"
```

You can inspect the detected CUDA paths:

```sh
make print-vars
```

Run the benchmark:

```sh
./bin/vmm_bench [options]
```

The console prints a table like:
```
API                           | avg (us)     | p50 (us)     | p95 (us)
------------------------------+--------------+--------------+--------------
cuMemGetAllocationGranularity |        0.030 |        0.030 |        0.030
cuMemAddressReserve           |        0.210 |        0.210 |        0.230
...
```
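The avg/p50/p95 columns can be computed from the raw per-iteration samples roughly as follows. This is a sketch: the nearest-rank percentile method and the function names are assumptions, and the tool's exact percentile definition may differ.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Nearest-rank percentile over the samples (sorts a local copy).
// Assumption: the tool may use a slightly different interpolation.
double percentile(std::vector<double> v, double p) {
    std::sort(v.begin(), v.end());
    size_t idx = static_cast<size_t>(p / 100.0 * (v.size() - 1) + 0.5);
    return v[idx];
}

// Arithmetic mean of the samples.
double average(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0) / v.size();
}
```

For example, `percentile({1, 2, 3, 4, 5}, 50)` yields the middle sample, 3.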
- `--device N`: Select CUDA device id (default: 0)
- `--iters I`: Iterations per test (default: 100)
- `--warmup W`: Warmup iterations per API per size (default: 10)
- `--threads T`: Concurrent threads for the multi-thread benchmark (default: 1)
- `--min-granularity`: Use minimum granularity (default: recommended granularity)
- `--csv path`: Append CSV output to `path` (created if it does not exist)
- `--sizes list`: Comma-separated sizes. Supports k/m/g suffixes, e.g. `1m,8m,64m`
- `--help`: Show usage
Single-thread, default recommended granularity:

```sh
./bin/vmm_bench --iters 200 --warmup 20 --sizes 1m,8m,64m --csv results.csv
```

Use minimum granularity:

```sh
./bin/vmm_bench --min-granularity --sizes 2m,32m --iters 50
```

Multi-thread (4 threads):

```sh
./bin/vmm_bench --threads 4 --iters 100 --warmup 10 --sizes 2m,8m,32m --csv mt_results.csv
```

CSV columns: `device,size_bytes,api,avg_us,p50_us,p95_us`
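The "header written once" behavior noted below can be implemented by checking whether the file exists before opening it in append mode. A minimal sketch; `append_csv` is an illustrative name, not the tool's actual symbol:

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <sys/stat.h>

// Append one record; write the header only when the file is first created.
// Column layout matches the CSV columns listed above.
void append_csv(const std::string& path, int device, size_t size_bytes,
                const std::string& api, double avg, double p50, double p95) {
    struct stat st;
    const bool existed = (stat(path.c_str(), &st) == 0); // check before open
    std::ofstream out(path, std::ios::app);              // creates if absent
    if (!existed)
        out << "device,size_bytes,api,avg_us,p50_us,p95_us\n";
    out << device << ',' << size_bytes << ',' << api << ','
        << avg << ',' << p50 << ',' << p95 << '\n';
}
```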
- Time unit is microseconds (us) in the current version.
- Records are appended; header written once when file is created.
- Sizes are automatically aligned to the selected granularity.
- Granularity is derived from `cuMemGetAllocationGranularity` using either `RECOMMENDED` (default) or `MINIMUM` (with `--min-granularity`).
- VMM constraints require mapping sizes, addresses, and offsets to be multiples of the minimum granularity.
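The "automatically aligned" behavior amounts to rounding each requested size up to a multiple of the granularity. Only the arithmetic is sketched here; the granularity value itself comes from `cuMemGetAllocationGranularity`:

```cpp
#include <cstddef>

// Round `size` up to the next multiple of `granularity`.
// `granularity` is a power of two in practice, but this formula
// works for any non-zero value.
size_t align_up(size_t size, size_t granularity) {
    return ((size + granularity - 1) / granularity) * granularity;
}
```

For example, a 1 MiB request with a 2 MiB granularity is rounded up to 2 MiB.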
- The tool uses the device's primary context; each thread calls `cuCtxSetCurrent(ctx)` before VMM operations.
- Multi-threaded VMM API calls may contend inside the driver; expect latency distributions to widen. Consider increasing warmup and iterations for stable statistics.
- If you need strict isolation, a future variant can create independent per-thread contexts.
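The per-call timing behind these statistics is plain wall-clock measurement around each driver call. A sketch of the pattern, assuming a monotonic clock; `time_call_us` and `benchmark` are illustrative names, not the tool's actual symbols:

```cpp
#include <chrono>
#include <vector>

// Time one invocation of a callable; return elapsed microseconds.
// steady_clock is monotonic, so it is safe for measuring intervals.
template <typename F>
double time_call_us(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}

// Collect `iters` timed samples after `warmup` untimed calls.
template <typename F>
std::vector<double> benchmark(F&& f, int warmup, int iters) {
    for (int i = 0; i < warmup; ++i) f();
    std::vector<double> samples;
    samples.reserve(iters);
    for (int i = 0; i < iters; ++i) samples.push_back(time_call_us(f));
    return samples;
}
```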
`nvcc: No such file or directory`
- Install the CUDA Toolkit or use the g++ fallback: `make gpp`
- Ensure headers are in `CUDA_INCLUDE_DIR` (e.g., `/usr/local/cuda/include`) and libraries in `CUDA_LIB_DIR` (e.g., `/usr/local/cuda/lib64`).
`fatal error: cuda.h: No such file or directory`
- Install the CUDA Toolkit headers or pass the include path: `INCLUDES="-I/usr/local/cuda/include"`
`CUDA Driver API error 201: invalid device context`
- Occurs if a primary context is destroyed incorrectly. The tool retains and releases the primary context via `cuDevicePrimaryCtxRetain` / `cuDevicePrimaryCtxRelease`.
Runtime cannot find `libcuda.so`
- Ensure the NVIDIA driver is installed; set `LD_LIBRARY_PATH` to include the driver library directory.
- This tool measures CPU-side latency of API calls, not kernel execution time.
- `cuMemMap` requires a follow-up `cuMemSetAccess` to make the memory accessible; both calls are measured separately.
- `cuMemUnmap` must unmap an entire previously mapped contiguous range.
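The full call sequence these notes describe can be sketched end to end. This is illustrative only: error checking is elided, a VMM-capable device and retained context are assumed, and the function name is hypothetical.

```cpp
#include <cuda.h>

// Illustrative VMM round trip (error checking elided). Each of these
// driver calls is what the benchmark times individually.
void vmm_roundtrip(CUdevice dev, size_t size) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;

    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop,
                                  CU_MEM_ALLOC_GRANULARITY_RECOMMENDED);
    size = ((size + gran - 1) / gran) * gran;  // align to granularity

    CUdeviceptr va = 0;
    cuMemAddressReserve(&va, size, 0, 0, 0);   // reserve virtual range

    CUmemGenericAllocationHandle h;
    cuMemCreate(&h, size, &prop, 0);           // create physical allocation

    cuMemMap(va, size, 0, h, 0);               // map physical into VA range

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(va, size, &access, 1);      // make the mapping accessible

    cuMemUnmap(va, size);                      // unmap the entire mapped range
    cuMemRelease(h);
    cuMemAddressFree(va, size);
}
```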
MIT (or your preferred license)