🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading
Updated Sep 7, 2024 - Python
InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.
Slicing a PyTorch Tensor Into Parallel Shards
LLM inference engine from scratch — paged KV cache, continuous batching, chunked prefill, prefix caching, speculative decoding, CUDA graph, tensor parallelism, OpenAI-compatible serving
Decentralized LLM fine-tuning and inference with offloading
Large-scale 4D parallelism pre-training for 🤗 transformers in Mixture of Experts *(still a work in progress)*
gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
JORA: JAX Tensor-Parallel LoRA Library (ACL 2024)
A distributed training framework for large language models powered by Lightning.
Fast and easy distributed model training examples.
Tensor Parallelism with JAX + Shard Map
GPU Memory Calculator for LLM Training - Calculate GPU memory requirements for training Large Language Models with support for multiple training engines including PyTorch DDP, DeepSpeed ZeRO, Megatron-LM, and FSDP.
Multi-GPU tensor/context parallel diffusion on AMD ROCm — with the patch that makes it actually work.
Trains a 7B-parameter GPT model using NVIDIA Megatron-LM with full 3D parallelism across a 64-GPU InfiniBand cluster. Communication is profiled at multiple levels: PyTorch Profiler traces, Nsight Systems captures, a dedicated NCCL C++ benchmark, and a Rust GPU memory monitor.
Enable multi-GPU tensor and context parallelism for diffusion models on AMD ROCm with patched torch code that fixes runtime crashes.
Communication-efficient Tensor Parallelism for GPT-2
TensorRT-LLM vs vLLM controlled head-to-head on H100 — 12 studies including a knob-by-knob waterfall reproducing NVIDIA's published 27.7k tok/s (100.3%) and attributing the gap to real serving, plus NVFP4 W4A4 serving on Blackwell sm_120.
Training Qwen3 to solve Wordle using SFT and GRPO
NCCL collective benchmarks on an 8×H100 NVSwitch host — busbw vs link budget, NVLS/Ring/Tree, small-message latency floors (eager vs CUDA Graph vs symmetric memory), and the TP-decode comms ceiling they imply. Includes a quiet-box rerun methodology for attribution.