CUDA C++ practice project for the RTX 4070 SUPER: explore GPU concurrency, pinned memory, and Nsight profiling. Includes SAXPY and 2D blur kernels for practicing optimization, stream overlap, and timing analysis, targeting the NVIDIA Developer Technology Engineering skill set.

FlosMume/cpp-cuda-deepvision-rtx-starter

DeepVision-RTX Starter

This is an enhanced README combining:

  • GitHub-friendly formatting
  • Technical deep dive
  • ASCII diagrams
  • Rendered PNG diagrams
  • Profiling workflow
  • Architecture explanations
  • GPU hardware reasoning

Architecture Diagram (SVG)

[Figure: Architecture Diagram]

Streams Overlap Diagram (SVG)

[Figure: Streams Timeline]

ASCII Architecture Diagram

   +-------------------+      PCIe / DMA      +---------------------+
   |      CPU Host     |  ----------------->  |     GPU Global      |
   |  (Pinned Memory)  |  <-----------------  |      Memory         |
   +-------------------+                      +---------------------+
            |                                           |
            | Launch Kernels                            |
            v                                           v
      +--------------+                          +------------------+
      | CUDA Driver  |                          |  SMs (Ada 8.9)   |
      | Runtime API  |                          |  Warps / Threads |
      +--------------+                          +------------------+

ASCII Streams Overlap Diagram

Time →
----------------------------------------------------------------------
H2D Stream 0: [========== copy =========]
H2D Stream 1:         [========== copy =========]
Compute Stream 2:             [==== kernel ====]
D2H Stream 0:                           [==== copy ====]
----------------------------------------------------------------------

Build Instructions

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
./build/deepvision_rtx

Profiling Instructions (Nsight Systems & Nsight Compute)

nsys profile -o nsys_report ./build/deepvision_rtx
ncu --set full ./build/deepvision_rtx

Project Structure

src/
   main.cpp            # Entry point
   main.cu             # Separate experimental demo
   conv_kernels.cu     # SAXPY + blur3x3 kernels
   conv_kernels.cuh    # Kernel declarations
   utils/
       check_cuda.hpp  # Error-checking utilities

GPU Architecture Notes (RTX 4070 SUPER - Ada 8.9)

  • SM count: 56 (7,168 CUDA cores; 46 is the non-SUPER RTX 4070)
  • Warp size: 32
  • Max threads/block: 1024
  • Memory bandwidth: 504 GB/s
  • Concurrent copy/compute supported
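  • Best performance is achieved when:
    • Host buffers are pinned (page-locked), so copies can run asynchronously
    • H2D and D2H copies overlap with kernel execution in separate streams
    • Kernels keep enough warps resident per SM (good occupancy) to hide memory latency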

Roadmap

  • Add shared-memory tiled blur
  • Add constant-memory kernel variants
  • Add half-precision path (FP16)
  • Add Tensor Core WMMA version
  • Add occupancy analysis + roofline plot
  • Add Nsight Compute performance tables
  • Add multi-kernel pipelines
  • Compare against cuDNN for 3×3 conv
