kernel-optimization

Here are 10 public repositories matching this topic...

RightNow-AI / autokernel

Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels.

gpu cuda pytorch triton kernel-optimization autoresearch

Updated Mar 19, 2026
Python

AMD-AGI / GEAK

Generating Efficient AI-Centric Kernels

python gpu triton hip brain llm kernel-optimization

Updated Jun 13, 2026
Python

WecoAI / weco-cli

Production-Grade Autoresearch. Ideal for GPU kernels, ML model development, feature engineering, prompt engineering, and other optimizable code.

machine-learning code-generation code-optimization prompt-engineering kernel-optimization

Updated Jun 12, 2026
Python

jhinpan / ROCmKernelWiki

AMD CDNA/RDNA (MI300 gfx942 / MI350 gfx950 / RDNA4 gfx1201) GPU kernel optimization knowledge base, packaged as a Claude Code skill. 7,400+ merged-PR references + 53 ISA-grounded synthesis pages. Inspired by MIT Han Lab's KernelWiki.

hip gemm rocm gpu-kernels cdna amd-gpu mi300 flash-attention kernel-optimization mi350

Updated Jun 11, 2026
Python

AICL-Lab / diy-flash-attention

Learn Triton by building FlashAttention from scratch — V2 kernels, persistent threads, mask DSL, profiling toolkit, bilingual docs

tutorial cuda pytorch triton educational attention-mechanism gpu-programming forward-pass flash-attention kernel-optimization online-softmax

Updated May 25, 2026
Python

IntelLabs / Triton8

Automatic Triton kernel generation and optimization for Intel GPU, powered by Claude Code.

triton code-generation gpu-computing intel-gpu xpu llm-agents kernel-optimization

Updated May 12, 2026
Python

0sec-labs / noeris

Noeris — autonomous kernel fusion discovery + Triton autotuning for LLM kernels and Gemma layer deeper fusion (A100/H100 wins).

benchmarking cuda pytorch triton autotuning gemma gpu-kernels github-actions kernel-fusion llm-training llm-inference kernel-optimization

Updated May 27, 2026
Python

mohamedabbouda / Flash-Attention-GPU-Kernel

Triton FlashAttention kernel with PyTorch autograd, correctness tests, and GPU benchmarks.

cuda transformers pytorch triton gpu-programming flashattention kernel-optimization

Updated May 28, 2026
Python

ssmall256 / mps-kernels-skill

Skill pack for custom PyTorch MPS kernels on Apple Silicon (examples, tests, and optimization patterns).

python machine-learning deep-learning metal gpu pytorch mps apple-silicon kernel-optimization metal-shading-language pytorch-mps

Updated Feb 16, 2026
Python

Teascented-swimmingstroke954 / autokernel

Optimize PyTorch GPU kernels by autonomously profiling, extracting, and improving Triton or CUDA C++ code for better performance and efficiency.

rust reinforcement-learning kernel deep-learning gpu cuda configuration pytorch triton halide tensor kconfig tvm kernel-optimization autoresearch

Updated Jun 13, 2026
Python
