Stars
yaof20 / verl
Forked from volcengine/verl (verl: Volcano Engine Reinforcement Learning for LLMs).
Implementation of FP8/INT8 rollout for RL training without performance drop.
HoliTom: Holistic Token Merging for Fast Video Large Language Models
FastVID: Dynamic Density Pruning for Fast Video Large Language Models
[CVPR 2025] DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
Official repo for "Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge" (ICLR 2025)
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLM inference, attention is calculated with approximate and dynamic sparsity, which reduces inference latency by up to 10x for pre-filling…
Code for paper: Unraveling the Shift of Visual Information Flow in MLLMs: From Phased Interaction to Efficient Inference
ArcticInference: vLLM plugin for high-throughput, low-latency inference
Distributed Compiler based on Triton for Parallel Systems
CUDA Python: Performance meets Productivity
This is a Chinese translation of the CUDA programming guide
Curated collection of papers in MoE model inference
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
[ICLR 2025] Train Small, Infer Large: Memory-Efficient LoRA Training for Large Language Models
Efficient Mixture of Experts for LLM Paper List
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
[EMNLP 2024] CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification
Bringing BERT into modernity via both architecture changes and scaling
Official Python version of CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation.
CCKS2023-PromptCBLUE: Code implementation for the TianChi competition