Deformable DETR: Deformable Transformers for End-to-End Object Detection.

Python 3,449 551 Updated May 16, 2024

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 5,014 506 Updated Mar 16, 2025

Examples demonstrating available options to program multiple GPUs in a single node or a cluster

Cuda 657 119 Updated Feb 21, 2025

A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.

Python 2,607 263 Updated Mar 10, 2025

CUDA Library Samples

Cuda 1,828 368 Updated Mar 18, 2025

verl: Volcano Engine Reinforcement Learning for LLMs

Python 5,153 501 Updated Mar 19, 2025

Examples of how to call collective operation functions in multi-GPU environments, including simple uses of the broadcast, reduce, allGather, reduceScatter, and sendRecv operations.

32 7 Updated Aug 28, 2023

MSCCL++: A GPU-driven communication stack for scalable AI applications

C++ 311 45 Updated Mar 19, 2025

Microsoft Collective Communication Library

60 6 Updated Nov 23, 2024

CUTLASS and CuTe Examples

Cuda 41 4 Updated Jan 4, 2025

The official repo of Pai-Megatron-Patch for LLM & VLM large scale training developed by Alibaba Cloud.

Python 939 139 Updated Mar 19, 2025

C++ 447 60 Updated Feb 28, 2025

A library for calculating the FLOPs in the forward() process based on torch.fx

Python 99 3 Updated Sep 5, 2024

xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism

Python 1,585 154 Updated Mar 19, 2025

The simplest, fastest repository for training/finetuning medium-sized GPTs.

Python 40,168 6,608 Updated Dec 9, 2024

Scalable data preprocessing and curation toolkit for LLMs

Jupyter Notebook 833 113 Updated Mar 18, 2025

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)

Python 13,371 2,737 Updated Mar 19, 2025

A suite of image and video neural tokenizers

Jupyter Notebook 1,576 74 Updated Feb 11, 2025

Making large AI models cheaper, faster and more accessible

Python 40,625 4,486 Updated Mar 19, 2025

Explorations into some recent techniques surrounding speculative decoding

Python 248 19 Updated Dec 22, 2024

Fine-tuning for Qwen models

Python 92 9 Updated Mar 9, 2025

SGLang is a fast serving framework for large language models and vision language models.

Python 12,123 1,293 Updated Mar 19, 2025

Development repository for the Triton-Linalg conversion

C++ 179 18 Updated Feb 7, 2025

Shared Middle-Layer for Triton Compilation

MLIR 232 55 Updated Mar 11, 2025

MINT-1T: A one trillion token multimodal interleaved dataset.

804 20 Updated Jul 31, 2024

FlashInfer: Kernel Library for LLM Serving

Cuda 2,413 253 Updated Mar 18, 2025

BEVFormer inference on TensorRT, including INT8 Quantization and Custom TensorRT Plugins (float/half/half2/int8).

Python 466 76 Updated Nov 20, 2023

Sampling profiler for Python programs

Rust 13,399 450 Updated Feb 6, 2025

A nanoGPT pipeline packed in a spreadsheet

2,106 126 Updated Jun 17, 2024

Minimalistic large language model 3D-parallelism training

Python 1,694 165 Updated Mar 19, 2025