Stars
Deformable DETR: Deformable Transformers for End-to-End Object Detection.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
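The "fine-grained scaling" idea is worth unpacking: instead of one scale factor per tensor, each small block of values gets its own scale before the FP8 cast, so outliers in one block don't crush the precision of the rest. Below is a minimal PyTorch sketch of per-128-block quantization with a slow dequantize-then-matmul reference; the block size, row-wise layout, and E4M3 format are assumptions for illustration, not DeepGEMM's actual kernels or scaling layout.

```python
# Toy illustration of fine-grained (per-block) FP8 scaling: quantize each
# 128-wide block of a row to the FP8-E4M3 range and keep one scale per
# block. Not DeepGEMM's kernel code; a conceptual sketch only.
import torch

BLOCK = 128
FP8_MAX = 448.0  # max representable magnitude of float8_e4m3fn

def quantize_per_block(x: torch.Tensor):
    """x: (M, K) with K divisible by BLOCK -> (fp8 tensor, per-block scales)."""
    M, K = x.shape
    blocks = x.view(M, K // BLOCK, BLOCK)
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scales = amax / FP8_MAX                        # one scale per 128-block
    q = (blocks / scales).to(torch.float8_e4m3fn)  # block-wise quantization
    return q.view(M, K), scales.squeeze(-1)

def dequant_matmul(q_a, s_a, q_b, s_b):
    """Reference (slow) check: dequantize back to fp32, then matmul."""
    M, K = q_a.shape
    N = q_b.shape[0]
    a = q_a.float().view(M, K // BLOCK, BLOCK) * s_a.unsqueeze(-1)
    b = q_b.float().view(N, K // BLOCK, BLOCK) * s_b.unsqueeze(-1)
    return a.view(M, K) @ b.view(N, K).t()

a, b = torch.randn(4, 256), torch.randn(8, 256)
qa, sa = quantize_per_block(a)
qb, sb = quantize_per_block(b)
print((dequant_matmul(qa, sa, qb, sb) - a @ b.t()).abs().max())  # small error
```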
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
verl: Volcano Engine Reinforcement Learning for LLMs
Examples of how to call collective operation functions in multi-GPU environments: simple demonstrations of the broadcast, reduce, allGather, reduceScatter and sendRecv operations.
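As a quick reminder of what those collectives do, here is a minimal sketch using torch.distributed's NCCL backend rather than the raw NCCL C API the repo demonstrates; the single-node torchrun launch and tensor shapes are assumptions for illustration.

```python
# Minimal sketch of the collectives listed above, via torch.distributed's
# NCCL backend (the repo itself shows the raw NCCL C API; this is only an
# illustration). Assumes a single node, launched with:
#   torchrun --nproc_per_node=<num_gpus> collectives_demo.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # torchrun supplies rank/world size
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank)              # single node: rank == GPU index
    dev = torch.device("cuda", rank)

    x = torch.full((4,), float(rank), device=dev)
    dist.broadcast(x, src=0)                 # rank 0's data reaches every rank

    y = torch.ones(4, device=dev)
    dist.reduce(y, dst=0, op=dist.ReduceOp.SUM)  # elementwise sum lands on rank 0

    gathered = [torch.empty(4, device=dev) for _ in range(world)]
    dist.all_gather(gathered, x)             # every rank gets every rank's tensor

    # reduce_scatter: sum across ranks, each rank keeps one shard of the result
    shard = torch.empty(4, device=dev)
    dist.reduce_scatter(shard, [torch.ones(4, device=dev) for _ in range(world)])

    if world > 1:                            # sendRecv: pass x around a ring
        nxt, prv = (rank + 1) % world, (rank - 1) % world
        buf = torch.empty(4, device=dev)
        if rank % 2 == 0:
            dist.send(x, dst=nxt); dist.recv(buf, src=prv)
        else:
            dist.recv(buf, src=prv); dist.send(x, dst=nxt)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```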
MSCCL++: A GPU-driven communication stack for scalable AI applications
The official repo of Pai-Megatron-Patch for LLM & VLM large scale training developed by Alibaba Cloud.
A library for calculating the FLOPs of a model's forward() pass, based on torch.fx
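The torch.fx approach is straightforward to sketch: symbolically trace the model, run shape propagation, then walk the graph and tally FLOPs for each recognized op. The snippet below is a hand-rolled illustration of that idea (counting only nn.Linear, bias ignored), not the library's actual API.

```python
# Rough sketch of torch.fx-based FLOP counting: trace the model, annotate
# node shapes with ShapeProp, then sum FLOPs over the ops we recognize.
import torch
import torch.nn as nn
from torch.fx import symbolic_trace
from torch.fx.passes.shape_prop import ShapeProp

def count_linear_flops(model: nn.Module, example_input: torch.Tensor) -> int:
    traced = symbolic_trace(model)
    ShapeProp(traced).propagate(example_input)  # stores shapes in node.meta
    modules = dict(traced.named_modules())
    flops = 0
    for node in traced.graph.nodes:
        if node.op == "call_module" and isinstance(modules[node.target], nn.Linear):
            lin = modules[node.target]
            out_shape = node.meta["tensor_meta"].shape
            batch = out_shape.numel() // lin.out_features
            # each output element costs in_features multiply-adds (2 FLOPs each)
            flops += 2 * batch * lin.in_features * lin.out_features
    return flops

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
print(count_linear_flops(model, torch.randn(8, 64)))  # 2*8*(64*128 + 128*10)
```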
xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism
The simplest, fastest repository for training/finetuning medium-sized GPTs.
Scalable data pre-processing and curation toolkit for LLMs
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal models, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
A suite of image and video neural tokenizers
Making large AI models cheaper, faster and more accessible
Explorations into some recent techniques surrounding speculative decoding
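The core loop those techniques build on is easy to state: a cheap draft model proposes k tokens, the target model verifies them, and each draft token is accepted with probability min(1, p/q), resampling from the residual distribution on the first rejection. Below is a toy sketch of that accept/reject scheme; the draft_dist/target_dist callables and the tiny vocabulary are stand-ins invented for illustration, not real models.

```python
# Toy sketch of the speculative-decoding accept/reject loop (the
# rejection-sampling scheme of Leviathan et al.). The "models" here are
# deterministic stand-in distributions over a tiny vocabulary.
import torch

VOCAB = 16

def _dist(seq, salt):
    g = torch.Generator().manual_seed(len(seq) * 1000 + salt)
    return torch.softmax(torch.randn(VOCAB, generator=g), dim=0)

def draft_dist(seq):   # small/fast model (stand-in)
    return _dist(seq, 0)

def target_dist(seq):  # large/slow model (stand-in)
    return _dist(seq, 1)

def speculative_step(seq, k=4):
    """Draft k tokens cheaply, then verify them against the target model."""
    drafted, q_probs = [], []
    for _ in range(k):
        q = draft_dist(seq + drafted)
        drafted.append(torch.multinomial(q, 1).item())
        q_probs.append(q)
    accepted = []
    for i, t in enumerate(drafted):
        # in practice all k target distributions come from one batched pass
        p = target_dist(seq + accepted)
        # accept drafted token t with probability min(1, p(t)/q(t))
        if torch.rand(()) < min(1.0, (p[t] / q_probs[i][t]).item()):
            accepted.append(t)
        else:
            # first rejection: resample from the residual max(p - q, 0)
            residual = torch.clamp(p - q_probs[i], min=0.0)
            if residual.sum() > 0:
                accepted.append(torch.multinomial(residual / residual.sum(), 1).item())
            else:  # p == q exactly: fall back to sampling from p
                accepted.append(torch.multinomial(p, 1).item())
            break
    # (the full scheme also samples a bonus token from p when all k drafts
    # are accepted; omitted here for brevity)
    return seq + accepted

print(speculative_step([1, 2, 3]))
```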
SGLang is a fast serving framework for large language models and vision language models.
Development repository for the Triton-Linalg conversion
Shared Middle-Layer for Triton Compilation
MINT-1T: A one trillion token multimodal interleaved dataset.
FlashInfer: Kernel Library for LLM Serving
BEVFormer inference on TensorRT, including INT8 Quantization and Custom TensorRT Plugins (float/half/half2/int8).
A nanoGPT pipeline packed in a spreadsheet
Minimalistic large language model 3D-parallelism training