14 stars written in CUDA

LLM training in simple, raw C/CUDA

CUDA · 25,644 stars · 2,949 forks · Updated Oct 2, 2024
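That "raw C/CUDA" style is easiest to see in a standalone kernel. Below is a minimal sketch of a GELU forward kernel of the kind an LLM training loop needs, using the common tanh approximation; it is a generic illustration, not code taken from the repo.

```cuda
// Generic sketch of a "raw CUDA" elementwise kernel: GELU forward using
// the tanh approximation. Not code from the repo itself.
#include <cuda_runtime.h>
#include <math.h>

__global__ void gelu_forward(float* out, const float* inp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = inp[i];
    float cube = 0.044715f * x * x * x;
    // 0.7978845608f ~= sqrt(2 / pi)
    out[i] = 0.5f * x * (1.0f + tanhf(0.7978845608f * (x + cube)));
}
```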

📚 200+ Tensor/CUDA Core kernels: ⚡️ flash-attn-mma and ⚡️ hgemm built with WMMA, MMA, and CuTe, reaching 98%~100% of cuBLAS/FA2 TFLOPS 🎉.

CUDA · 2,339 stars · 249 forks · Updated Feb 7, 2025
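The WMMA route that hgemm kernels like these take can be sketched in a few lines: each warp owns one 16x16 output tile and accumulates over K with mma_sync. A minimal sketch, assuming row-major A, column-major B, all dimensions multiples of 16, and an sm_70+ GPU; the layout and names are illustrative, not the repo's API.

```cuda
// Minimal WMMA sketch: each warp computes one 16x16 tile of C = A * B.
// Assumes M, N, K are multiples of 16; A row-major, B column-major,
// C row-major. Requires sm_70 or newer. Illustrative, not the repo's code.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_hgemm(const half* A, const half* B, float* C,
                           int M, int N, int K) {
    // Map one warp to one 16x16 output tile (warp-uniform, so the early
    // return below is safe for the collective wmma operations).
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;
    if (warpM * 16 >= M || warpN * 16 >= N) return;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int k = 0; k < K; k += 16) {
        // Leading dimension is K for row-major A and for column-major B.
        wmma::load_matrix_sync(a, A + warpM * 16 * K + k, K);
        wmma::load_matrix_sync(b, B + warpN * 16 * K + k, K);
        wmma::mma_sync(acc, a, b, acc);  // acc += a * b on Tensor Cores
    }
    wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, acc, N,
                            wmma::mem_row_major);
}
```

Reaching cuBLAS-level TFLOPS from here is all about shared-memory staging, swizzling, and pipelining; the fragment-level structure stays the same.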

FlashInfer: Kernel Library for LLM Serving

CUDA · 2,088 stars · 215 forks · Updated Feb 18, 2025

A throughput-oriented high-performance serving framework for LLMs

CUDA · 737 stars · 29 forks · Updated Sep 21, 2024

Flash Attention in ~100 lines of CUDA (forward pass only)

CUDA · 699 stars · 61 forks · Updated Dec 30, 2024
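The forward pass fits in ~100 lines because of online softmax: streaming over the keys while carrying a running max and running sum means the N x N score matrix is never materialized. Below is a minimal one-thread-per-query sketch of that recurrence, assuming row-major [N][d] tensors and d <= 128; it is untiled and untuned, unlike the repo's kernel.

```cuda
// Online-softmax attention sketch (the core idea behind the flash
// attention forward pass): one pass over keys, rescaling partial results
// as the running max changes. One thread per query row -- illustrative.
// Q, K, V, O are [N][d] row-major; names and layout are assumptions.
#include <cuda_runtime.h>
#include <math.h>

__global__ void attention_forward(const float* Q, const float* K,
                                  const float* V, float* O,
                                  int N, int d) {
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= N) return;

    const float scale = rsqrtf((float)d);
    float m = -INFINITY;   // running max of scores seen so far
    float l = 0.0f;        // running sum of exp(score - m)
    float acc[128];        // accumulated output row; assumes d <= 128
    for (int j = 0; j < d; ++j) acc[j] = 0.0f;

    for (int k = 0; k < N; ++k) {
        float s = 0.0f;
        for (int j = 0; j < d; ++j) s += Q[q * d + j] * K[k * d + j];
        s *= scale;

        float m_new = fmaxf(m, s);
        float correction = expf(m - m_new);  // rescale old partial sums
        float p = expf(s - m_new);
        l = l * correction + p;
        for (int j = 0; j < d; ++j)
            acc[j] = acc[j] * correction + p * V[k * d + j];
        m = m_new;
    }
    for (int j = 0; j < d; ++j) O[q * d + j] = acc[j] / l;
}
```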

Examples demonstrating the available options for programming multiple GPUs in a single node or across a cluster

CUDA · 610 stars · 119 forks · Updated Oct 30, 2024
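The simplest of those options is a single host thread driving every device through cudaSetDevice, with one stream per GPU so the launches overlap. A minimal sketch along those lines; the buffer names and sizes are illustrative.

```cuda
// Single-node multi-GPU sketch: split one workload across all visible
// GPUs and launch the same kernel on each device asynchronously.
// One host thread, cudaSetDevice per device; sizes are illustrative.
#include <cuda_runtime.h>
#include <vector>

__global__ void scale(float* x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev == 0) return 1;
    const int N = 1 << 20;
    int chunk = (N + ndev - 1) / ndev;

    std::vector<float*> bufs(ndev);
    std::vector<cudaStream_t> streams(ndev);
    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);              // subsequent calls target GPU d
        cudaStreamCreate(&streams[d]);
        cudaMalloc(&bufs[d], chunk * sizeof(float));
        cudaMemsetAsync(bufs[d], 0, chunk * sizeof(float), streams[d]);
        int grid = (chunk + 255) / 256;
        scale<<<grid, 256, 0, streams[d]>>>(bufs[d], chunk, 2.0f);
    }
    for (int d = 0; d < ndev; ++d) {   // wait for every device to finish
        cudaSetDevice(d);
        cudaStreamSynchronize(streams[d]);
        cudaFree(bufs[d]);
        cudaStreamDestroy(streams[d]);
    }
    return 0;
}
```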

A CUDA learning journey based on the book "CUDA Programming: Basics and Practice" (《cuda编程-基础与实践》) by Fan Zheyong.

CUDA · 276 stars · 60 forks · Updated Jan 15, 2024

REEF is a GPU-accelerated DNN inference serving system that enables instant kernel preemption and biased concurrent execution in GPU scheduling.

CUDA · 90 stars · 9 forks · Updated Dec 24, 2022

A Flash Attention implementation using CuTe.

CUDA · 69 stars · 3 forks · Updated Dec 17, 2024

TileFusion is a highly efficient kernel template library designed to elevate the level of abstraction in CUDA C for processing tiles.

CUDA · 55 stars · 5 forks · Updated Feb 18, 2025

FP8 flash attention implemented with the cutlass library on the Ada architecture

CUDA · 53 stars · 3 forks · Updated Aug 12, 2024

A simple convolution implementation with both CPU-only and GPU-only (CUDA) versions

CUDA · 2 stars · Updated Jan 14, 2019
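The CPU-only/GPU-only split such a repo makes can be shown with a valid-mode 1D convolution written both ways; the sizes and names here are illustrative, not the repo's own.

```cuda
// Minimal CPU-vs-GPU 1D convolution: the same valid-mode convolution as
// a plain loop and as a CUDA kernel. Illustrative sketch.
#include <cuda_runtime.h>
#include <stdio.h>

void conv1d_cpu(const float* x, const float* w, float* y, int n, int k) {
    for (int i = 0; i + k <= n; ++i) {
        float s = 0.0f;
        for (int j = 0; j < k; ++j) s += x[i + j] * w[j];
        y[i] = s;
    }
}

__global__ void conv1d_gpu(const float* x, const float* w, float* y,
                           int n, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + k > n) return;          // valid mode: output length n - k + 1
    float s = 0.0f;
    for (int j = 0; j < k; ++j) s += x[i + j] * w[j];
    y[i] = s;
}

int main() {
    const int n = 1024, k = 5, m = n - k + 1;
    float *x, *w, *y;               // unified memory keeps the demo short
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&w, k * sizeof(float));
    cudaMallocManaged(&y, m * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;
    for (int j = 0; j < k; ++j) w[j] = 1.0f / k;
    conv1d_gpu<<<(m + 255) / 256, 256>>>(x, w, y, n, k);
    cudaDeviceSynchronize();
    printf("y[0] = %f (expected 1.0)\n", y[0]);
    cudaFree(x); cudaFree(w); cudaFree(y);
    return 0;
}
```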