staryxchen/cuda_vmm_api_bench

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CUDA VMM API Benchmark

A lightweight CLI tool to measure CPU-side latency of CUDA Driver Virtual Memory Management (VMM) APIs, including:

  • cuMemGetAllocationGranularity
  • cuMemAddressReserve / cuMemAddressFree
  • cuMemCreate / cuMemRelease
  • cuMemMap / cuMemUnmap
  • cuMemSetAccess

For each API and each size, the tool reports avg, p50, and p95 latency. Results are printed as a console table; with --csv, records are also appended to a CSV file.

Requirements

  • Linux with NVIDIA driver (providing libcuda.so).
  • CUDA Toolkit headers (cuda.h) are required to build. If you don’t have nvcc, the g++ fallback links against -lcuda but still needs the headers.
  • CUDA 11+ recommended; device must support VMM.

Build

Preferred: nvcc (if available)

cd /data/workspace/vmm
make

If nvcc is not available, use g++ fallback (headers and library required):

make gpp
# If headers/libraries are not in default paths:
make gpp INCLUDES="-I/usr/local/cuda/include" LDFLAGS="-L/usr/local/cuda/lib64 -lcuda"

You can inspect detected CUDA paths:

make print-vars

Run

./bin/vmm_bench [options]

Console prints a table like:

API                           |     avg (us) |     p50 (us) |     p95 (us)
------------------------------+--------------+--------------+--------------
cuMemGetAllocationGranularity |        0.030 |        0.030 |        0.030
cuMemAddressReserve           |        0.210 |        0.210 |        0.230
...

CLI Options

  • --device N Select CUDA device id (default: 0)
  • --iters I Iterations per test (default: 100)
  • --warmup W Warmup iterations per API per size (default: 10)
  • --threads T Concurrent threads for multi-thread benchmark (default: 1)
  • --min-granularity Use minimum granularity (default: recommended granularity)
  • --csv path Append CSV output to path (created if it does not exist)
  • --sizes list Comma-separated sizes. Supports k/m/g suffix, e.g. 1m,8m,64m
  • --help Show usage

Examples

Single-thread, default recommended granularity:

./bin/vmm_bench --iters 200 --warmup 20 --sizes 1m,8m,64m --csv results.csv

Use minimum granularity:

./bin/vmm_bench --min-granularity --sizes 2m,32m --iters 50

Multi-thread (4 threads):

./bin/vmm_bench --threads 4 --iters 100 --warmup 10 --sizes 2m,8m,32m --csv mt_results.csv

CSV Format

CSV columns: device,size_bytes,api,avg_us,p50_us,p95_us

  • Time unit is microseconds (us) in the current version.
  • Records are appended; header written once when file is created.

Sizes and Granularity

  • Sizes are automatically aligned to the selected granularity.
  • Granularity is derived from cuMemGetAllocationGranularity using either RECOMMENDED (default) or MINIMUM (with --min-granularity).
  • VMM constraints require mapping sizes/addresses/offsets to be multiples of the minimum granularity.

Concurrency Notes

  • The tool uses the device primary context; each thread calls cuCtxSetCurrent(ctx) before VMM operations.
  • Multi-threaded VMM API calls may contend inside the driver; expect distributions to widen. Consider increasing warmup and iterations for stable stats.
  • If you need strict isolation, a future variant could create independent per-thread contexts.

Troubleshooting

  • nvcc: No such file or directory
    • Install CUDA Toolkit or use g++ fallback: make gpp
    • Ensure headers in CUDA_INCLUDE_DIR (e.g., /usr/local/cuda/include) and libraries in CUDA_LIB_DIR (e.g., /usr/local/cuda/lib64).
  • fatal error: cuda.h: No such file or directory
    • Install CUDA Toolkit headers or pass include path: INCLUDES="-I/usr/local/cuda/include"
  • CUDA Driver API error 201: invalid device context
    • Occurs if a primary context is destroyed incorrectly. The tool retains and releases the primary context correctly via cuDevicePrimaryCtxRetain/cuDevicePrimaryCtxRelease.
  • Runtime cannot find libcuda.so
    • Ensure NVIDIA driver is installed; set LD_LIBRARY_PATH to include the driver library directory.

Notes

  • This tool measures CPU-side latency of API calls, not kernel execution time.
  • cuMemMap requires cuMemSetAccess to make memory accessible; both calls are measured separately.
  • cuMemUnmap must unmap an entire previously mapped contiguous range.
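
The notes above correspond to the standard VMM lifecycle: reserve an address range, create a physical allocation, map it, then enable access; teardown runs in reverse. A hedged sketch of one such pass, assuming device 0 and a single granule (this mirrors the driver API documentation, not necessarily the tool's exact code):

```cuda
#include <cuda.h>
#include <cstdio>
#include <cstdlib>

#define CHECK(call) do { CUresult r = (call); if (r != CUDA_SUCCESS) { \
    const char *s; cuGetErrorString(r, &s); \
    fprintf(stderr, "%s failed: %s\n", #call, s); exit(1); } } while (0)

int main() {
    CHECK(cuInit(0));
    CUdevice dev; CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx; CHECK(cuDevicePrimaryCtxRetain(&ctx, dev));
    CHECK(cuCtxSetCurrent(ctx));

    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;

    size_t gran = 0;
    CHECK(cuMemGetAllocationGranularity(&gran, &prop,
          CU_MEM_ALLOC_GRANULARITY_RECOMMENDED));
    size_t size = gran;  // one granule; real sizes must be multiples of this

    CUdeviceptr va;                                // 1. reserve virtual range
    CHECK(cuMemAddressReserve(&va, size, 0, 0, 0));
    CUmemGenericAllocationHandle h;                // 2. create physical memory
    CHECK(cuMemCreate(&h, size, &prop, 0));
    CHECK(cuMemMap(va, size, 0, h, 0));            // 3. map into the range

    CUmemAccessDesc acc = {};                      // 4. enable RW access
    acc.location = prop.location;
    acc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    CHECK(cuMemSetAccess(va, size, &acc, 1));

    // Teardown in reverse order; cuMemUnmap covers the entire mapped range.
    CHECK(cuMemUnmap(va, size));
    CHECK(cuMemRelease(h));
    CHECK(cuMemAddressFree(va, size));
    CHECK(cuDevicePrimaryCtxRelease(dev));
    printf("ok: granularity=%zu bytes\n", gran);
    return 0;
}
```

Each numbered step is one of the timed APIs; memory is not device-accessible until step 4 completes.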

License

MIT (or your preferred license)
