Stars
yaof20 / verl
Forked from volcengine/verl (verl: Volcano Engine Reinforcement Learning for LLMs).
Implementation of FP8/INT8 rollout for RL training without performance drop.
HoliTom: Holistic Token Merging for Fast Video Large Language Models
FastVID: Dynamic Density Pruning for Fast Video Large Language Models
[CVPR 2025] DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
Official repo for "Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge" (ICLR 2025)
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLM inference, attention is calculated with approximate and dynamic sparsity, which reduces inference latency by up to 10x for pre-filling…
Code for paper: Unraveling the Shift of Visual Information Flow in MLLMs: From Phased Interaction to Efficient Inference
ArcticInference: vLLM plugin for high-throughput, low-latency inference
Distributed Compiler based on Triton for Parallel Systems
CUDA Python: Performance meets Productivity
This is a Chinese translation of the CUDA programming guide
Curated collection of papers in MoE model inference
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
[ICLR 2025] Train Small, Infer Large: Memory-Efficient LoRA Training for Large Language Models
Efficient Mixture of Experts for LLM Paper List
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
[EMNLP 2024] CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification
Bringing BERT into modernity via both architecture changes and scaling
Official Python version of CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation.
CCKS2023-PromptCBLUE: Code implementation for the TianChi competition