View kuozhang's full-sized avatar

260 results for source starred repositories

Deformable DETR: Deformable Transformers for End-to-End Object Detection.

Python 3,454 553 Updated May 16, 2024

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 5,051 518 Updated Mar 16, 2025

Examples demonstrating available options to program multiple GPUs in a single node or a cluster

Cuda 663 119 Updated Feb 21, 2025

A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.

Python 2,631 276 Updated Mar 10, 2025

CUDA Library Samples

Cuda 1,838 369 Updated Mar 21, 2025

verl: Volcano Engine Reinforcement Learning for LLMs

Python 5,417 520 Updated Mar 21, 2025

Sample examples of how to call collective operation functions on multi-GPU environments. A simple example of using broadcast, reduce, allGather, reduceScatter and sendRecv operations.

32 7 Updated Aug 28, 2023

MSCCL++: A GPU-driven communication stack for scalable AI applications

C++ 312 45 Updated Mar 21, 2025

Microsoft Collective Communication Library

60 6 Updated Nov 23, 2024

CUTLASS and CuTe Examples

Cuda 42 4 Updated Jan 4, 2025

The official repo of Pai-Megatron-Patch for LLM & VLM large scale training developed by Alibaba Cloud.

Python 948 139 Updated Mar 21, 2025

C++ 450 62 Updated Mar 20, 2025

A library for calculating the FLOPs in the forward() process based on torch.fx

Python 99 4 Updated Sep 5, 2024

xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism

Python 1,622 161 Updated Mar 20, 2025

The simplest, fastest repository for training/finetuning medium-sized GPTs.

Python 40,229 6,616 Updated Dec 9, 2024

Scalable data preprocessing and curation toolkit for LLMs

Jupyter Notebook 844 115 Updated Mar 21, 2025

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)

Python 13,419 2,740 Updated Mar 21, 2025

Making large AI models cheaper, faster and more accessible

Python 40,642 4,487 Updated Mar 21, 2025

Explorations into some recent techniques surrounding speculative decoding

Python 249 20 Updated Dec 22, 2024

qwen models finetuning

Python 93 9 Updated Mar 9, 2025

SGLang is a fast serving framework for large language models and vision language models.

Python 12,252 1,318 Updated Mar 21, 2025

Development repository for the Triton-Linalg conversion

C++ 180 18 Updated Feb 7, 2025

Shared Middle-Layer for Triton Compilation

MLIR 232 56 Updated Mar 11, 2025

MINT-1T: A one trillion token multimodal interleaved dataset.

804 20 Updated Jul 31, 2024

FlashInfer: Kernel Library for LLM Serving

Cuda 2,442 256 Updated Mar 19, 2025

BEVFormer inference on TensorRT, including INT8 Quantization and Custom TensorRT Plugins (float/half/half2/int8).

Python 466 76 Updated Nov 20, 2023

Sampling profiler for Python programs

Rust 13,408 450 Updated Feb 6, 2025

A nanoGPT pipeline packed in a spreadsheet

2,108 127 Updated Jun 17, 2024

Minimalistic large language model 3D-parallelism training

Python 1,704 165 Updated Mar 21, 2025

Lightning fast C++/CUDA neural network framework

C++ 3,932 483 Updated Jan 27, 2025