Stars
Deformable DETR: Deformable Transformers for End-to-End Object Detection.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
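The "fine-grained scaling" idea is worth unpacking: instead of one scale factor per tensor, each small block of values gets its own scale before the FP8 cast, so outliers in one block don't crush the precision of the rest. Below is a minimal PyTorch sketch of per-128-block quantization with a slow dequantize-then-matmul reference; the block size, row-wise layout, and E4M3 format are assumptions for illustration, not DeepGEMM's actual kernels or scaling layout.

```python
# Toy illustration of fine-grained (per-block) FP8 scaling: quantize each
# 128-wide block of a row to the FP8-E4M3 range and keep one scale per
# block. Not DeepGEMM's kernel code; a conceptual sketch only.
import torch

BLOCK = 128
FP8_MAX = 448.0  # max representable magnitude of float8_e4m3fn

def quantize_per_block(x: torch.Tensor):
    """x: (M, K) with K divisible by BLOCK -> (fp8 tensor, per-block scales)."""
    M, K = x.shape
    blocks = x.view(M, K // BLOCK, BLOCK)
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scales = amax / FP8_MAX                        # one scale per 128-block
    q = (blocks / scales).to(torch.float8_e4m3fn)  # block-wise quantization
    return q.view(M, K), scales.squeeze(-1)

def dequant_matmul(q_a, s_a, q_b, s_b):
    """Reference (slow) check: dequantize back to fp32, then matmul."""
    M, K = q_a.shape
    N = q_b.shape[0]
    a = q_a.float().view(M, K // BLOCK, BLOCK) * s_a.unsqueeze(-1)
    b = q_b.float().view(N, K // BLOCK, BLOCK) * s_b.unsqueeze(-1)
    return a.view(M, K) @ b.view(N, K).t()

a, b = torch.randn(4, 256), torch.randn(8, 256)
qa, sa = quantize_per_block(a)
qb, sb = quantize_per_block(b)
print((dequant_matmul(qa, sa, qb, sb) - a @ b.t()).abs().max())  # small error
```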
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
verl: Volcano Engine Reinforcement Learning for LLMs
Examples of how to call collective operation functions in multi-GPU environments: simple demonstrations of the broadcast, reduce, allGather, reduceScatter and sendRecv operations.
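As a quick reminder of what those collectives do, here is a minimal sketch using torch.distributed's NCCL backend rather than the raw NCCL C API the repo demonstrates; the single-node torchrun launch and tensor shapes are assumptions for illustration.

```python
# Minimal sketch of the collectives listed above, via torch.distributed's
# NCCL backend (the repo itself shows the raw NCCL C API; this is only an
# illustration). Assumes a single node, launched with:
#   torchrun --nproc_per_node=<num_gpus> collectives_demo.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # torchrun supplies rank/world size
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank)              # single node: rank == GPU index
    dev = torch.device("cuda", rank)

    x = torch.full((4,), float(rank), device=dev)
    dist.broadcast(x, src=0)                 # rank 0's data reaches every rank

    y = torch.ones(4, device=dev)
    dist.reduce(y, dst=0, op=dist.ReduceOp.SUM)  # elementwise sum lands on rank 0

    gathered = [torch.empty(4, device=dev) for _ in range(world)]
    dist.all_gather(gathered, x)             # every rank gets every rank's tensor

    # reduce_scatter: sum across ranks, each rank keeps one shard of the result
    shard = torch.empty(4, device=dev)
    dist.reduce_scatter(shard, [torch.ones(4, device=dev) for _ in range(world)])

    if world > 1:                            # sendRecv: pass x around a ring
        nxt, prv = (rank + 1) % world, (rank - 1) % world
        buf = torch.empty(4, device=dev)
        if rank % 2 == 0:
            dist.send(x, dst=nxt); dist.recv(buf, src=prv)
        else:
            dist.recv(buf, src=prv); dist.send(x, dst=nxt)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```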
MSCCL++: A GPU-driven communication stack for scalable AI applications
The official repo of Pai-Megatron-Patch for LLM & VLM large scale training developed by Alibaba Cloud.
A library for calculating the FLOPs of a model's forward() pass, based on torch.fx
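The torch.fx approach is straightforward to sketch: symbolically trace the model, run shape propagation, then walk the graph and tally FLOPs for each recognized op. The snippet below is a hand-rolled illustration of that idea (counting only nn.Linear, bias ignored), not the library's actual API.

```python
# Rough sketch of torch.fx-based FLOP counting: trace the model, annotate
# node shapes with ShapeProp, then sum FLOPs over the ops we recognize.
import torch
import torch.nn as nn
from torch.fx import symbolic_trace
from torch.fx.passes.shape_prop import ShapeProp

def count_linear_flops(model: nn.Module, example_input: torch.Tensor) -> int:
    traced = symbolic_trace(model)
    ShapeProp(traced).propagate(example_input)  # stores shapes in node.meta
    modules = dict(traced.named_modules())
    flops = 0
    for node in traced.graph.nodes:
        if node.op == "call_module" and isinstance(modules[node.target], nn.Linear):
            lin = modules[node.target]
            out_shape = node.meta["tensor_meta"].shape
            batch = out_shape.numel() // lin.out_features
            # each output element costs in_features multiply-adds (2 FLOPs each)
            flops += 2 * batch * lin.in_features * lin.out_features
    return flops

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
print(count_linear_flops(model, torch.randn(8, 64)))  # 2*8*(64*128 + 128*10)
```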
xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism
The simplest, fastest repository for training/finetuning medium-sized GPTs.
Scalable data pre-processing and curation toolkit for LLMs
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal models, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
A suite of image and video neural tokenizers
Making large AI models cheaper, faster and more accessible
Explorations into some recent techniques surrounding speculative decoding
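The core loop those techniques build on is easy to state: a cheap draft model proposes k tokens, the target model verifies them, and each draft token is accepted with probability min(1, p/q), resampling from the residual distribution on the first rejection. Below is a toy sketch of that accept/reject scheme; the draft_dist/target_dist callables and the tiny vocabulary are stand-ins invented for illustration, not real models.

```python
# Toy sketch of the speculative-decoding accept/reject loop (the
# rejection-sampling scheme of Leviathan et al.). The "models" here are
# deterministic stand-in distributions over a tiny vocabulary.
import torch

VOCAB = 16

def _dist(seq, salt):
    g = torch.Generator().manual_seed(len(seq) * 1000 + salt)
    return torch.softmax(torch.randn(VOCAB, generator=g), dim=0)

def draft_dist(seq):   # small/fast model (stand-in)
    return _dist(seq, 0)

def target_dist(seq):  # large/slow model (stand-in)
    return _dist(seq, 1)

def speculative_step(seq, k=4):
    """Draft k tokens cheaply, then verify them against the target model."""
    drafted, q_probs = [], []
    for _ in range(k):
        q = draft_dist(seq + drafted)
        drafted.append(torch.multinomial(q, 1).item())
        q_probs.append(q)
    accepted = []
    for i, t in enumerate(drafted):
        # in practice all k target distributions come from one batched pass
        p = target_dist(seq + accepted)
        # accept drafted token t with probability min(1, p(t)/q(t))
        if torch.rand(()) < min(1.0, (p[t] / q_probs[i][t]).item()):
            accepted.append(t)
        else:
            # first rejection: resample from the residual max(p - q, 0)
            residual = torch.clamp(p - q_probs[i], min=0.0)
            if residual.sum() > 0:
                accepted.append(torch.multinomial(residual / residual.sum(), 1).item())
            else:  # p == q exactly: fall back to sampling from p
                accepted.append(torch.multinomial(p, 1).item())
            break
    # (the full scheme also samples a bonus token from p when all k drafts
    # are accepted; omitted here for brevity)
    return seq + accepted

print(speculative_step([1, 2, 3]))
```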
SGLang is a fast serving framework for large language models and vision language models.
Development repository for the Triton-Linalg conversion
Shared Middle-Layer for Triton Compilation
MINT-1T: A one trillion token multimodal interleaved dataset.
FlashInfer: Kernel Library for LLM Serving
BEVFormer inference on TensorRT, including INT8 Quantization and Custom TensorRT Plugins (float/half/half2/int8).
A nanoGPT pipeline packed in a spreadsheet
Minimalistic large language model 3D-parallelism training