Local CUDA workbench for developing, testing, and benchmarking Tensara-style GPU kernel solutions before submitting them to the platform.
The repo is intentionally organized around a simple loop:
- implement one or more CUDA kernels for a problem
- expose a Tensara-compatible `extern "C" solution(...)` entry point
- verify correctness against small expected cases and generated CPU references
- benchmark representative input sizes and launch configurations
- summarize the useful findings in a per-problem results file
This is not a general CUDA library. Each problem file is a self-contained
local harness for one Tensara problem. The harness code exists to make
iteration fast: it can launch different kernel variants behind the same
exported solution routine, run CPU-backed verification, and collect local
timing data.
Current problem files:
- `P1_1D_CONVOLUTIONS.cu`: 1D same-padding convolution / cross-correlation.
- `P3_RELU.cu`: elementwise ReLU over a row-major matrix.
- `P4_MVM.cu`: matrix-vector multiplication over a row-major matrix.
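For flavor, here is a minimal sketch of the kind of kernel these files contain, using a P3-style scalar elementwise ReLU. The kernel name is illustrative; the actual variants in `P3_RELU.cu` (such as the `float4` kernel) differ in vectorization and tail handling.

```cuda
// Minimal scalar ReLU over a row-major rows x cols matrix, treated as a
// flat array of n = rows * cols floats. Illustrative sketch only.
__global__ void relu_scalar(const float* in, float* out, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = fmaxf(in[i], 0.0f);  // ReLU: max(x, 0)
    }
}
```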
Detailed correctness and benchmark notes live next to each problem file.
Each problem follows the same broad structure:
- CPU reference implementation for correctness checks.
- One or more CUDA kernel implementations.
- A Tensara-facing `extern "C"` launcher that receives device pointers.
- A local host-side runner that handles allocation, copies, timing, and checks.
- A default correctness-oriented run.
- A heavier `--skip-cpu` benchmark run for larger sizes and launch sweeps.
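The timing part of the host-side runner is typically built on CUDA events. A hedged sketch of such a helper follows; the function name and callback shape are illustrative, not the harnesses' actual API:

```cuda
// Hypothetical timing helper using CUDA events. The caller passes a
// function that enqueues the kernel launch under test.
float time_kernel_ms(void (*launch)(void)) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    launch();                      // enqueue the kernel under test
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);    // wait until the kernel has finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```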
The exported `solution(...)` function should stay close to what Tensara
expects: it should launch device work using the provided device pointers, not
own the full host allocation or verification flow. Local-only testing belongs
in the harness around it.
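A minimal sketch of such a thin entry point, assuming a hypothetical ReLU-style problem signature (the exact parameters are problem-specific and set by Tensara):

```cuda
__global__ void relu_kernel(const float* in, float* out, size_t n);

// Thin Tensara-facing entry point: it only launches device work on the
// device pointers it receives. Allocation, copies, and verification stay
// in the local harness around it.
extern "C" void solution(const float* d_in, float* d_out, size_t n) {
    const int block = 256;
    const int grid = (int)((n + block - 1) / block);
    relu_kernel<<<grid, block>>>(d_in, d_out, n);
}
```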
Raw run logs are kept as `.txt` files:
- `*_with_cpu.txt`: CPU-backed correctness-oriented runs.
- `*_skip_cpu.txt`: larger benchmark-oriented runs where expensive CPU checks are skipped.
The result tables use these status labels:
- `cpu=PASS`: CPU output matched a hard-coded expected answer.
- `cpu=REF`: CPU output was generated and used as the GPU verification reference.
- `cpu=SKIP`: CPU reference generation was skipped.
- `gpu=PASS`: GPU output matched the expected output or CPU reference.
- `gpu=SKIP`: GPU verification was skipped.
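The matching behind `gpu=PASS` is typically an elementwise tolerance check on the copied-back GPU output. A hedged host-side sketch, with illustrative tolerance values that may differ from what the harnesses actually use:

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical elementwise comparison behind a gpu=PASS result: the
// copied-back GPU output must match the CPU reference within a combined
// absolute + relative tolerance.
bool outputs_match(const float* gpu, const float* ref, std::size_t n,
                   float abs_tol = 1e-4f, float rel_tol = 1e-4f) {
    for (std::size_t i = 0; i < n; ++i) {
        float diff = std::fabs(gpu[i] - ref[i]);
        if (diff > abs_tol + rel_tol * std::fabs(ref[i])) return false;
    }
    return true;
}
```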
The markdown result files summarize the raw logs instead of duplicating every row. They are the place to record which variants are correct, which launch shapes are promising, and which benchmark rows look noisy or suspicious.
The latest saved logs cover CPU-backed and skip-CPU runs for all current CUDA problem files:
- `P1_1D_CONVOLUTIONS.cu`: `bstride_c` is the strongest current heavy-run kernel.
  - It wins 59 of 66 comparable skip-CPU configurations.
  - Best large-filter row: `web_2` uses `bstride_c` at 1.135 ms.
- `P3_RELU.cu`: `float4` is correct on odd shapes and scalar tail cases.
  - It wins 78 of 83 comparable skip-CPU configurations.
  - Best Tensara-size row: `8192 x 8192` uses `float4` at 2.983 ms.
- `P4_MVM.cu`: `warp` is the strongest current matrix-vector kernel.
  - It wins 65 of 80 comparable skip-CPU configurations.
  - Best Tensara-size row: `4096 x 4096` uses `warp` at 0.365 ms.
Local timings are useful for iteration, but they are not a substitute for Tensara leaderboard measurements. Treat them as directional data:
- compare kernel variants under the same harness and input set
- check odd sizes and tail cases, especially for vectorized kernels
- rerun suspicious rows before drawing conclusions
- prefer correctness evidence from CPU-backed runs before trusting benchmark-only results
The current local benchmark environment used for the saved result files is an NVIDIA GeForce RTX 3050 Laptop GPU.
The repository has been developed with Codex assistance for harness structure, test generation, benchmark organization, and documentation. Kernel strategy and implementation details should still be reviewed against the CUDA code and the raw result logs before submission.