Welcome! This repository is my personal sandbox where I am tracking my progress as I learn high-performance computing and GPU acceleration from scratch.
Note
Disclaimer for the Records: This is an educational, work-in-progress repository built for learning and personal tracking. It is not a professional production reference or enterprise-grade code library. I am exploring the foundational concepts, breaking things, and documenting what I learn along the way!
- The Goal: Learn how to move data from the CPU to the GPU and add millions of numbers in parallel.
- Core Concepts Mastered:
  - Allocating VRAM using `cudaMalloc` and cleaning it up using `cudaFree`.
  - Shipping data back and forth across the PCIe bus using `cudaMemcpy`.
  - Writing my first custom GPU kernel (`__global__`) and calculating global indexes using `blockIdx.x * blockDim.x + threadIdx.x`.
- Why 256 threads? I learned that the GPU schedules threads in groups of 32 (called a "warp"). 256 is a multiple of 32, so each block splits into exactly 8 full warps and no warp is left partially filled. That keeps the hardware happy and aligned!
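To tie the concepts in the list above together, here is a minimal sketch of how they might fit into one vector addition. The names `vectorAdd` and `addOnGpu` are my own placeholders (not from any library), and the error handling is deliberately simple, so treat this as an illustration under those assumptions rather than a polished implementation.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Kernel: each thread adds one pair of elements.
__global__ void vectorAdd(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread
    if (i < n) {                                    // threads past the end do nothing
        out[i] = a[i] + b[i];
    }
}

// Hypothetical host wrapper (name is an assumption): copies the inputs to the
// GPU, launches the kernel, copies the result back, and frees the VRAM.
cudaError_t addOnGpu(const float* hostA, const float* hostB, float* hostOut, int n)
{
    if (n <= 0) return cudaSuccess;  // nothing to add; avoids a zero-block launch

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* devA = nullptr;
    float* devB = nullptr;
    float* devOut = nullptr;

    // Allocate VRAM for the two inputs and the output.
    cudaError_t err = cudaMalloc(&devA, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&devB, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&devOut, bytes);

    // Ship the inputs from the CPU to the GPU.
    if (err == cudaSuccess) err = cudaMemcpy(devA, hostA, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(devB, hostB, bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        const int threadsPerBlock = 256;  // 8 warps of 32 threads
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
        vectorAdd<<<blocks, threadsPerBlock>>>(devA, devB, devOut, n);
        err = cudaGetLastError();         // catches launch configuration errors
    }

    // cudaMemcpy waits for the kernel to finish before copying the result back.
    if (err == cudaSuccess) err = cudaMemcpy(hostOut, devOut, bytes, cudaMemcpyDeviceToHost);

    // cudaFree accepts null pointers, so this is safe even after a failed allocation.
    cudaFree(devA);
    cudaFree(devB);
    cudaFree(devOut);
    return err;
}
```

Calling `addOnGpu(a, b, out, n)` from `main` with three host arrays of length `n` should return `cudaSuccess` and leave `a[i] + b[i]` in `out[i]`. Any other return value can be turned into a readable message with `cudaGetErrorString`.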
To make sure I don't forget how the scaling logic works, here is the math I'm tracking for grid configurations:
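If I have an array size `n` and a block size of 256 threads, I launch `(n + 256 - 1) / 256` blocks, where the integer division rounds the block count up.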
This ensures that if I have leftover elements, an extra block is automatically created to handle them, and the `if (i < n)` boundary check inside my kernel keeps the surplus threads in that last block from reading or writing past the end of the array.
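To check the rounding behavior with concrete numbers, here is a small host-side sketch. It is plain C++ that `nvcc` should also accept unchanged inside a `.cu` file. The helper names `blocksFor` and `idleThreads` are placeholders I made up for illustration, and the block size of 256 is the one assumed throughout this README.

```cpp
// Threads per warp on current NVIDIA GPUs.
constexpr int kWarpSize = 32;

// Ceiling division: how many blocks of threadsPerBlock threads cover n elements.
constexpr int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Threads launched beyond n; these are the ones the if (i < n) check turns away.
constexpr int idleThreads(int n, int threadsPerBlock)
{
    return blocksFor(n, threadsPerBlock) * threadsPerBlock - n;
}

// Compile-time checks of the arithmetic.
static_assert(256 % kWarpSize == 0, "256 threads split into 8 full warps");
static_assert(blocksFor(1024, 256) == 4, "exact multiple: no extra block");
static_assert(blocksFor(1025, 256) == 5, "one leftover element adds a block");
static_assert(blocksFor(1000, 256) == 4, "1000 elements still need 4 blocks");
static_assert(idleThreads(1000, 256) == 24, "4 * 256 - 1000 threads sit idle");
```

For a million elements, `blocksFor(1000000, 256)` comes out to 3907 blocks, which launches 1,000,192 threads, so 192 of them fail the boundary check and do nothing.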
- IDE: Microsoft Visual Studio (using the CUDA Runtime templates)
- Target Architecture: x64 Debug/Release
- Important Fix: If Visual Studio complains about duplicate symbols (`LNK2005`), remember to right-click the default `kernel.cu` placeholder file and select Exclude From Build!
- Implement a 2D Matrix Multiplication kernel.
- Explore Shared Memory allocation to make memory access even faster.
- Measure and compare actual execution times of a standard CPU `for` loop and my GPU kernels.