Starred repositories

High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.

Python · 622 stars · 36 forks · Updated Mar 15, 2025

[ICLR 2025] DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference

Jupyter Notebook · 15 stars · 1 fork · Updated Mar 13, 2025

Analyze the inference of Large Language Models (LLMs), covering computation, storage, transmission, and the hardware roofline model in a user-friendly interface.

Python · 412 stars · 48 forks · Updated Sep 11, 2024

Kernel Tuner

Python · 324 stars · 53 forks · Updated Mar 13, 2025

A fast communication-overlapping library for tensor/expert parallelism on GPUs.

C++ · 745 stars · 47 forks · Updated Mar 14, 2025

A lightweight data processing framework built on DuckDB and 3FS.

Python · 4,224 stars · 354 forks · Updated Mar 5, 2025

A high-performance distributed file system designed to address the challenges of AI training and inference workloads.

C++ · 7,976 stars · 724 forks · Updated Mar 13, 2025

Expert Parallelism Load Balancer

Python · 1,071 stars · 158 forks · Updated Feb 27, 2025

A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.

Python · 2,597 stars · 261 forks · Updated Mar 10, 2025

SpargeAttention: a training-free sparse attention mechanism that can accelerate inference for any model.

Cuda · 294 stars · 11 forks · Updated Mar 14, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda · 4,969 stars · 498 forks · Updated Mar 14, 2025

DeepEP: an efficient expert-parallel communication library

Cuda · 7,181 stars · 632 forks · Updated Mar 14, 2025

FlashMLA: Efficient MLA decoding kernels

C++ · 11,290 stars · 791 forks · Updated Mar 1, 2025

Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation

6,759 stars · 201 forks · Updated Mar 4, 2025

KernelBench: Can LLMs Write GPU Kernels? A benchmark of Torch-to-CUDA problems.

Python · 230 stars · 18 forks · Updated Mar 13, 2025

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations

Python · 12,819 stars · 860 forks · Updated Mar 15, 2025

NCCL Tests

Cuda · 1,031 stars · 266 forks · Updated Mar 15, 2025

Optimized primitives for collective multi-GPU communication

C++ · 3,550 stars · 878 forks · Updated Mar 15, 2025

PyTorch library for cost-effective, fast, and easy serving of MoE models.

Python · 145 stars · 12 forks · Updated Mar 5, 2025

Janus-Series: Unified Multimodal Understanding and Generation Models

Python · 16,725 stars · 2,199 forks · Updated Feb 1, 2025

Fully open reproduction of DeepSeek-R1

Python · 22,804 stars · 2,054 forks · Updated Mar 15, 2025

Automatically split your PyTorch models across multiple GPUs for training and inference.

Python · 650 stars · 41 forks · Updated Jan 2, 2024

🌵 A responsive, clean, and simple theme for Hexo.

Stylus · 3,305 stars · 798 forks · Updated Aug 13, 2024

[CVPR 2025] Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

Python · 471 stars · 12 forks · Updated Mar 11, 2025

Sky-T1: Train your own o1-preview model for under $450.

Python · 3,120 stars · 315 forks · Updated Mar 12, 2025

[NeurIPS 2023 Spotlight] LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios (awesome MCTS)

Python · 1,303 stars · 143 forks · Updated Mar 13, 2025