This project demonstrates how to overlap CUDA memory transfers and kernel execution using:
- Multiple CUDA streams
- Pinned (page-locked) host memory
- Asynchronous transfers with `cudaMemcpyAsync`
- A simple SAXPY-like compute (`z = a*x + b`)
The goal is to show how PCIe transfers and kernel compute can run concurrently, with host/device synchronization coordinating the pipeline, to maximize GPU utilization.
```
streams-and-pinned-mem/
├── CMakeLists.txt
├── overlap_streams.cu
├── README.md                      ← (this file)
├── scripts/
│   └── check_cuda_streams_status.sh
└── build/                         (generated)
```
Each stream executes its operations in order, but different streams can run in parallel (see the sketch after this list):
- Independent compute and memcpy paths
- Helps hide PCIe transfer latency
- Enables multi-chunk pipelining
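A minimal sketch of the stream setup (the stream count and variable names here are illustrative assumptions, not necessarily those used in `overlap_streams.cu`):

```cuda
#include <cuda_runtime.h>

int main() {
    // Work issued to a single stream executes in issue order;
    // work issued to different streams is allowed to overlap.
    const int NUM_STREAMS = 4;                 // arbitrary for this sketch
    cudaStream_t streams[NUM_STREAMS];

    for (int i = 0; i < NUM_STREAMS; ++i)
        cudaStreamCreate(&streams[i]);

    // ... enqueue async copies and kernels on streams[i] here ...

    for (int i = 0; i < NUM_STREAMS; ++i)
        cudaStreamDestroy(streams[i]);
    return 0;
}
```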
Pinned memory allows:
- True asynchronous DMA transfers
- Higher PCIe bandwidth
- The transfer/compute overlap this demo relies on (pinned memory is required for it)
Allocated using:
```cuda
cudaHostAlloc(&h_x, N * sizeof(float), cudaHostAllocDefault);
```

The program uses N streams, each responsible for a chunk:
H2D copy → Kernel → D2H copy
All streams operate concurrently, creating a pipeline.
```
Stream 0: [H2D]--[Compute]--[D2H]
Stream 1:     [H2D]--[Compute]--[D2H]
Stream 2:         [H2D]--[Compute]--[D2H]
Stream 3:             [H2D]--[Compute]--[D2H]
```
Result: PCIe transfers and kernels run at the same time, improving throughput.
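The following is a minimal, self-contained sketch of this pattern. The sizes, names, and launch parameters are illustrative assumptions and error checking is omitted; the real implementation is in `overlap_streams.cu`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial SAXPY-like kernel: z = a*x + b (matches the demo's compute).
__global__ void saxpy_like(const float* x, float* z, float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = a * x[i] + b;
}

int main() {
    const int N           = 1 << 24;            // total elements (arbitrary)
    const int NUM_STREAMS = 4;                   // one chunk per stream
    const int CHUNK       = N / NUM_STREAMS;     // assumes N divides evenly
    const float a = 2.0f, b = 1.0f;

    // Pinned host buffers: required for truly asynchronous DMA transfers.
    float *h_x, *h_z;
    cudaHostAlloc(&h_x, N * sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc(&h_z, N * sizeof(float), cudaHostAllocDefault);
    for (int i = 0; i < N; ++i) h_x[i] = float(i);

    float *d_x, *d_z;
    cudaMalloc(&d_x, N * sizeof(float));
    cudaMalloc(&d_z, N * sizeof(float));

    cudaStream_t streams[NUM_STREAMS];
    for (int s = 0; s < NUM_STREAMS; ++s) cudaStreamCreate(&streams[s]);

    // Each stream owns one chunk: H2D copy -> kernel -> D2H copy.
    // Operations within a stream are ordered; different streams may overlap.
    const int threads = 256;
    const int blocks  = (CHUNK + threads - 1) / threads;
    for (int s = 0; s < NUM_STREAMS; ++s) {
        const size_t off   = size_t(s) * CHUNK;
        const size_t bytes = CHUNK * sizeof(float);

        cudaMemcpyAsync(d_x + off, h_x + off, bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        saxpy_like<<<blocks, threads, 0, streams[s]>>>(d_x + off, d_z + off,
                                                       a, b, CHUNK);
        cudaMemcpyAsync(h_z + off, d_z + off, bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    cudaDeviceSynchronize();                     // wait for every stream
    printf("z[0] = %f (expected %f)\n", h_z[0], a * h_x[0] + b);

    for (int s = 0; s < NUM_STREAMS; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_x);      cudaFree(d_z);
    cudaFreeHost(h_x);  cudaFreeHost(h_z);
    return 0;
}
```

Each chunk's copy and kernel are serialized inside its own stream, while chunks in different streams are free to overlap; that overlap is what produces the pipeline shown in the diagram above.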
The compute is intentionally simple:
```cuda
z[i] = a * x[i] + b;
```

This allows the demo to focus on stream behavior, not algorithm complexity.
- Linux (WSL2 Ubuntu recommended)
- NVIDIA GPU + driver
- CUDA Toolkit installed system-wide (`/usr/local/cuda`)
```bash
rm -rf build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
./build/overlap_streams
```

Included under `scripts/check_cuda_streams_status.sh`:
- Detects `nvcc`
- Detects GPU compute capability
- Confirms pinned memory support
- Prints all CUDA runtime library versions
- Warns if conda CUDA overrides system CUDA
Run:
```bash
bash scripts/check_cuda_streams_status.sh
```

System CUDA is almost always safer:

```bash
which nvcc
# should be /usr/local/cuda/bin/nvcc
hash -r
```

Use:

```bash
nvprof ./build/overlap_streams
```

or Nsight Systems.
- NVIDIA CUDA Programming Guide
- “Streams and Concurrency” — official CUDA samples
- Nsight Systems Profiling Tutorials
Samuel Huang
GitHub: FlosMume
MIT License