Starred repositories

High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.

Python · 622 stars · 36 forks · Updated Mar 15, 2025

[ICLR 2025] DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference

Jupyter Notebook · 15 stars · 1 fork · Updated Mar 13, 2025

Analyze the inference of Large Language Models (LLMs), covering computation, storage, transmission, and the hardware roofline model in a user-friendly interface.

Python · 412 stars · 48 forks · Updated Sep 11, 2024

Kernel Tuner

Python · 324 stars · 53 forks · Updated Mar 13, 2025

A fast communication-overlapping library for tensor/expert parallelism on GPUs.

C++ · 745 stars · 47 forks · Updated Mar 14, 2025

A lightweight data processing framework built on DuckDB and 3FS.

Python · 4,224 stars · 354 forks · Updated Mar 5, 2025

A high-performance distributed file system designed to address the challenges of AI training and inference workloads.

C++ · 7,976 stars · 724 forks · Updated Mar 13, 2025

Expert Parallelism Load Balancer

Python · 1,071 stars · 158 forks · Updated Feb 27, 2025

A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.

Python · 2,597 stars · 261 forks · Updated Mar 10, 2025

SpargeAttention: a training-free sparse attention mechanism that can accelerate inference for any model.

Cuda · 294 stars · 11 forks · Updated Mar 14, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda · 4,969 stars · 498 forks · Updated Mar 14, 2025

DeepEP: an efficient expert-parallel communication library

Cuda · 7,181 stars · 632 forks · Updated Mar 14, 2025

FlashMLA: Efficient MLA decoding kernels

C++ · 11,290 stars · 791 forks · Updated Mar 1, 2025

Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation

6,759 stars · 201 forks · Updated Mar 4, 2025

KernelBench: Can LLMs Write GPU Kernels? A benchmark of Torch-to-CUDA problems.

Python · 230 stars · 18 forks · Updated Mar 13, 2025

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations

Python · 12,819 stars · 860 forks · Updated Mar 15, 2025

NCCL Tests

Cuda · 1,031 stars · 266 forks · Updated Mar 15, 2025

Optimized primitives for collective multi-GPU communication

C++ · 3,550 stars · 878 forks · Updated Mar 15, 2025

PyTorch library for cost-effective, fast, and easy serving of MoE models.

Python · 145 stars · 12 forks · Updated Mar 5, 2025

Janus-Series: Unified Multimodal Understanding and Generation Models

Python · 16,725 stars · 2,199 forks · Updated Feb 1, 2025

Fully open reproduction of DeepSeek-R1

Python · 22,804 stars · 2,054 forks · Updated Mar 15, 2025

Automatically split your PyTorch models across multiple GPUs for training and inference.

Python · 650 stars · 41 forks · Updated Jan 2, 2024

🌵 A responsive, clean, and simple theme for Hexo.

Stylus · 3,305 stars · 798 forks · Updated Aug 13, 2024

[CVPR 2025] Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

Python · 471 stars · 12 forks · Updated Mar 11, 2025

Sky-T1: Train your own o1-preview model for under $450.

Python · 3,120 stars · 315 forks · Updated Mar 12, 2025

[NeurIPS 2023 Spotlight] LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios (awesome MCTS)

Python · 1,303 stars · 143 forks · Updated Mar 13, 2025