Local CUDA workbench for developing, testing, and benchmarking Tensara-style GPU kernel solutions before submitting them to the platform.
The repo is intentionally organized around a simple loop:
- implement one or more CUDA kernels for a problem
- expose a Tensara-compatible `extern "C" solution(...)` entry point
- verify correctness against small expected cases and generated CPU references
- benchmark representative input sizes and launch configurations
- summarize the useful findings in a per-problem results file
This is not a general CUDA library. Each problem file is a self-contained
local harness for one Tensara problem. The harness code exists to make
iteration fast: it can launch different kernel variants behind the same
exported solution routine, run CPU-backed verification, and collect local
timing data.
Current problem files:
- `P1_1D_CONVOLUTIONS.cu`: 1D same-padding convolution / cross-correlation.
- `P3_RELU.cu`: elementwise ReLU over a row-major matrix.
- `P4_MVM.cu`: matrix-vector multiplication over a row-major matrix.
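For flavor, here is a minimal sketch of the kind of kernel these files contain, using a P3-style scalar elementwise ReLU. The kernel name is illustrative; the actual variants in `P3_RELU.cu` (such as the `float4` kernel) differ in vectorization and tail handling.

```cuda
// Minimal scalar ReLU over a row-major rows x cols matrix, treated as a
// flat array of n = rows * cols floats. Illustrative sketch only.
__global__ void relu_scalar(const float* in, float* out, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = fmaxf(in[i], 0.0f);  // ReLU: max(x, 0)
    }
}
```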
Detailed correctness and benchmark notes live next to each problem file.
Each problem follows the same broad structure:
- CPU reference implementation for correctness checks.
- One or more CUDA kernel implementations.
- A Tensara-facing `extern "C"` launcher that receives device pointers.
- A local host-side runner that handles allocation, copies, timing, and checks.
- A default correctness-oriented run.
- A heavier `--skip-cpu` benchmark run for larger sizes and launch sweeps.
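The timing part of the host-side runner is typically built on CUDA events. A hedged sketch of such a helper follows; the function name and callback shape are illustrative, not the harnesses' actual API:

```cuda
// Hypothetical timing helper using CUDA events. The caller passes a
// function that enqueues the kernel launch under test.
float time_kernel_ms(void (*launch)(void)) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    launch();                      // enqueue the kernel under test
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);    // wait until the kernel has finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```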
The exported `solution(...)` function should stay close to what Tensara
expects: it should launch device work using the provided device pointers, not
own the full host allocation or verification flow. Local-only testing belongs
in the harness around it.
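A minimal sketch of such a thin entry point, assuming a hypothetical ReLU-style problem signature (the exact parameters are problem-specific and set by Tensara):

```cuda
__global__ void relu_kernel(const float* in, float* out, size_t n);

// Thin Tensara-facing entry point: it only launches device work on the
// device pointers it receives. Allocation, copies, and verification stay
// in the local harness around it.
extern "C" void solution(const float* d_in, float* d_out, size_t n) {
    const int block = 256;
    const int grid = (int)((n + block - 1) / block);
    relu_kernel<<<grid, block>>>(d_in, d_out, n);
}
```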
Raw run logs are kept as `.txt` files:
- `*_with_cpu.txt`: CPU-backed correctness-oriented runs.
- `*_skip_cpu.txt`: larger benchmark-oriented runs where expensive CPU checks are skipped.
The result tables use these status labels:
- `cpu=PASS`: CPU output matched a hard-coded expected answer.
- `cpu=REF`: CPU output was generated and used as the GPU verification reference.
- `cpu=SKIP`: CPU reference generation was skipped.
- `gpu=PASS`: GPU output matched the expected output or CPU reference.
- `gpu=SKIP`: GPU verification was skipped.
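The matching behind `gpu=PASS` is typically an elementwise tolerance check on the copied-back GPU output. A hedged host-side sketch, with illustrative tolerance values that may differ from what the harnesses actually use:

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical elementwise comparison behind a gpu=PASS result: the
// copied-back GPU output must match the CPU reference within a combined
// absolute + relative tolerance.
bool outputs_match(const float* gpu, const float* ref, std::size_t n,
                   float abs_tol = 1e-4f, float rel_tol = 1e-4f) {
    for (std::size_t i = 0; i < n; ++i) {
        float diff = std::fabs(gpu[i] - ref[i]);
        if (diff > abs_tol + rel_tol * std::fabs(ref[i])) return false;
    }
    return true;
}
```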
The markdown result files summarize the raw logs instead of duplicating every row. They are the place to record which variants are correct, which launch shapes are promising, and which benchmark rows look noisy or suspicious.
The latest saved logs cover CPU-backed and skip-CPU runs for all current CUDA problem files:
- `P1_1D_CONVOLUTIONS.cu`: `bstride_c` is the strongest current heavy-run kernel.
  - It wins 59 of 66 comparable skip-CPU configurations.
  - Best large-filter row: `web_2` uses `bstride_c` at 1.135 ms.
- `P3_RELU.cu`: `float4` is correct on odd shapes and scalar tail cases.
  - It wins 78 of 83 comparable skip-CPU configurations.
  - Best Tensara-size row: `8192 x 8192` uses `float4` at 2.983 ms.
- `P4_MVM.cu`: `warp` is the strongest current matrix-vector kernel.
  - It wins 65 of 80 comparable skip-CPU configurations.
  - Best Tensara-size row: `4096 x 4096` uses `warp` at 0.365 ms.
Local timings are useful for iteration, but they are not a substitute for Tensara leaderboard measurements. Treat them as directional data:
- compare kernel variants under the same harness and input set
- check odd sizes and tail cases, especially for vectorized kernels
- rerun suspicious rows before drawing conclusions
- prefer correctness evidence from CPU-backed runs before trusting benchmark-only results
The current local benchmark environment used for the saved result files is an NVIDIA GeForce RTX 3050 Laptop GPU.
The repository has been developed with Codex assistance for harness structure, test generation, benchmark organization, and documentation. Kernel strategy and implementation details should still be reviewed against the CUDA code and the raw result logs before submission.