
Starred repositories
High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.
[ICLR 2025] DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference
Analyze the inference of large language models (LLMs): computation, storage, transmission, and the hardware roofline model, all in a user-friendly interface.
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
A lightweight data processing framework built on DuckDB and 3FS.
A high-performance distributed file system designed to address the challenges of AI training and inference workloads.
A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
SpargeAttention: a training-free sparse attention method that can accelerate inference for any model.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
DeepEP: an efficient expert-parallel communication library
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
KernelBench: Can LLMs Write GPU Kernels? A benchmark of Torch -> CUDA problems.
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
Optimized primitives for collective multi-GPU communication
PyTorch library for cost-effective, fast and easy serving of MoE models.
Janus-Series: Unified Multimodal Understanding and Generation Models
Fully open reproduction of DeepSeek-R1
Automatically split your PyTorch models across multiple GPUs for training & inference
🌵 A responsive, clean and simple theme for Hexo.
[CVPR 2025] Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
Sky-T1: Train your own O1-preview model for under $450
[NeurIPS 2023 Spotlight] LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios (awesome MCTS)