14 stars written in CUDA

LLM training in simple, raw C/CUDA

CUDA · 25,644 stars · 2,949 forks · Updated Oct 2, 2024
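That "raw C/CUDA" style is easiest to see in a standalone kernel. Below is a minimal sketch of a GELU forward kernel of the kind an LLM training loop needs, using the common tanh approximation; it is a generic illustration, not code taken from the repo.

```cuda
// Generic sketch of a "raw CUDA" elementwise kernel: GELU forward using
// the tanh approximation. Not code from the repo itself.
#include <cuda_runtime.h>
#include <math.h>

__global__ void gelu_forward(float* out, const float* inp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = inp[i];
    float cube = 0.044715f * x * x * x;
    // 0.7978845608f ~= sqrt(2 / pi)
    out[i] = 0.5f * x * (1.0f + tanhf(0.7978845608f * (x + cube)));
}
```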

📚 200+ Tensor/CUDA Core kernels: ⚡️ flash-attn-mma and ⚡️ hgemm built with WMMA, MMA, and CuTe, reaching 98%~100% of cuBLAS/FA2 TFLOPS 🎉.

CUDA · 2,339 stars · 249 forks · Updated Feb 7, 2025
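The WMMA route that hgemm kernels like these take can be sketched in a few lines: each warp owns one 16x16 output tile and accumulates over K with mma_sync. A minimal sketch, assuming row-major A, column-major B, all dimensions multiples of 16, and an sm_70+ GPU; the layout and names are illustrative, not the repo's API.

```cuda
// Minimal WMMA sketch: each warp computes one 16x16 tile of C = A * B.
// Assumes M, N, K are multiples of 16; A row-major, B column-major,
// C row-major. Requires sm_70 or newer. Illustrative, not the repo's code.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_hgemm(const half* A, const half* B, float* C,
                           int M, int N, int K) {
    // Map one warp to one 16x16 output tile (warp-uniform, so the early
    // return below is safe for the collective wmma operations).
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;
    if (warpM * 16 >= M || warpN * 16 >= N) return;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int k = 0; k < K; k += 16) {
        // Leading dimension is K for row-major A and for column-major B.
        wmma::load_matrix_sync(a, A + warpM * 16 * K + k, K);
        wmma::load_matrix_sync(b, B + warpN * 16 * K + k, K);
        wmma::mma_sync(acc, a, b, acc);  // acc += a * b on Tensor Cores
    }
    wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, acc, N,
                            wmma::mem_row_major);
}
```

Reaching cuBLAS-level TFLOPS from here is all about shared-memory staging, swizzling, and pipelining; the fragment-level structure stays the same.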

FlashInfer: Kernel Library for LLM Serving

CUDA · 2,088 stars · 215 forks · Updated Feb 18, 2025

A throughput-oriented high-performance serving framework for LLMs

CUDA · 737 stars · 29 forks · Updated Sep 21, 2024

Flash Attention in ~100 lines of CUDA (forward pass only)

CUDA · 699 stars · 61 forks · Updated Dec 30, 2024
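The forward pass fits in ~100 lines because of online softmax: streaming over the keys while carrying a running max and running sum means the N x N score matrix is never materialized. Below is a minimal one-thread-per-query sketch of that recurrence, assuming row-major [N][d] tensors and d <= 128; it is untiled and untuned, unlike the repo's kernel.

```cuda
// Online-softmax attention sketch (the core idea behind the flash
// attention forward pass): one pass over keys, rescaling partial results
// as the running max changes. One thread per query row -- illustrative.
// Q, K, V, O are [N][d] row-major; names and layout are assumptions.
#include <cuda_runtime.h>
#include <math.h>

__global__ void attention_forward(const float* Q, const float* K,
                                  const float* V, float* O,
                                  int N, int d) {
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= N) return;

    const float scale = rsqrtf((float)d);
    float m = -INFINITY;   // running max of scores seen so far
    float l = 0.0f;        // running sum of exp(score - m)
    float acc[128];        // accumulated output row; assumes d <= 128
    for (int j = 0; j < d; ++j) acc[j] = 0.0f;

    for (int k = 0; k < N; ++k) {
        float s = 0.0f;
        for (int j = 0; j < d; ++j) s += Q[q * d + j] * K[k * d + j];
        s *= scale;

        float m_new = fmaxf(m, s);
        float correction = expf(m - m_new);  // rescale old partial sums
        float p = expf(s - m_new);
        l = l * correction + p;
        for (int j = 0; j < d; ++j)
            acc[j] = acc[j] * correction + p * V[k * d + j];
        m = m_new;
    }
    for (int j = 0; j < d; ++j) O[q * d + j] = acc[j] / l;
}
```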

Examples demonstrating the available options for programming multiple GPUs in a single node or across a cluster

CUDA · 610 stars · 119 forks · Updated Oct 30, 2024
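The simplest of those options is a single host thread driving every device through cudaSetDevice, with one stream per GPU so the launches overlap. A minimal sketch along those lines; the buffer names and sizes are illustrative.

```cuda
// Single-node multi-GPU sketch: split one workload across all visible
// GPUs and launch the same kernel on each device asynchronously.
// One host thread, cudaSetDevice per device; sizes are illustrative.
#include <cuda_runtime.h>
#include <vector>

__global__ void scale(float* x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev == 0) return 1;
    const int N = 1 << 20;
    int chunk = (N + ndev - 1) / ndev;

    std::vector<float*> bufs(ndev);
    std::vector<cudaStream_t> streams(ndev);
    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);              // subsequent calls target GPU d
        cudaStreamCreate(&streams[d]);
        cudaMalloc(&bufs[d], chunk * sizeof(float));
        cudaMemsetAsync(bufs[d], 0, chunk * sizeof(float), streams[d]);
        int grid = (chunk + 255) / 256;
        scale<<<grid, 256, 0, streams[d]>>>(bufs[d], chunk, 2.0f);
    }
    for (int d = 0; d < ndev; ++d) {   // wait for every device to finish
        cudaSetDevice(d);
        cudaStreamSynchronize(streams[d]);
        cudaFree(bufs[d]);
        cudaStreamDestroy(streams[d]);
    }
    return 0;
}
```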

A CUDA learning journey based on the book "CUDA Programming: Basics and Practice" (《cuda编程-基础与实践》) by Fan Zheyong.

CUDA · 276 stars · 60 forks · Updated Jan 15, 2024

REEF is a GPU-accelerated DNN inference serving system that enables instant kernel preemption and biased concurrent execution in GPU scheduling.

CUDA · 90 stars · 9 forks · Updated Dec 24, 2022

A Flash Attention implementation using CuTe.

CUDA · 69 stars · 3 forks · Updated Dec 17, 2024

TileFusion is a highly efficient kernel template library designed to elevate the level of abstraction in CUDA C for processing tiles.

CUDA · 55 stars · 5 forks · Updated Feb 18, 2025

FP8 flash attention implemented with the cutlass library on the Ada architecture

CUDA · 53 stars · 3 forks · Updated Aug 12, 2024

A simple convolution implementation with both CPU-only and GPU-only (CUDA) versions

CUDA · 2 stars · Updated Jan 14, 2019
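The CPU-only/GPU-only split such a repo makes can be shown with a valid-mode 1D convolution written both ways; the sizes and names here are illustrative, not the repo's own.

```cuda
// Minimal CPU-vs-GPU 1D convolution: the same valid-mode convolution as
// a plain loop and as a CUDA kernel. Illustrative sketch.
#include <cuda_runtime.h>
#include <stdio.h>

void conv1d_cpu(const float* x, const float* w, float* y, int n, int k) {
    for (int i = 0; i + k <= n; ++i) {
        float s = 0.0f;
        for (int j = 0; j < k; ++j) s += x[i + j] * w[j];
        y[i] = s;
    }
}

__global__ void conv1d_gpu(const float* x, const float* w, float* y,
                           int n, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + k > n) return;          // valid mode: output length n - k + 1
    float s = 0.0f;
    for (int j = 0; j < k; ++j) s += x[i + j] * w[j];
    y[i] = s;
}

int main() {
    const int n = 1024, k = 5, m = n - k + 1;
    float *x, *w, *y;               // unified memory keeps the demo short
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&w, k * sizeof(float));
    cudaMallocManaged(&y, m * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;
    for (int j = 0; j < k; ++j) w[j] = 1.0f / k;
    conv1d_gpu<<<(m + 255) / 256, 256>>>(x, w, y, n, k);
    cudaDeviceSynchronize();
    printf("y[0] = %f (expected 1.0)\n", y[0]);
    cudaFree(x); cudaFree(w); cudaFree(y);
    return 0;
}
```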