As the Tang poet Liu Yuxi wrote, "The swallows that once graced the halls of the noble Wang and Xie families now fly into the homes of ordinary people." This is perhaps the true significance of efficient LLM inference: by optimizing large-model inference and pushing performance and efficiency to the limit, large models cease to be the exclusive domain of a privileged few (the wealthy and the tech giants). We strive tirelessly to bring simple, easy-to-use, efficient, and low-cost LLM inference services to everyone.
- MIT Song Han | Model Compression and Acceleration Techniques for AI Computing
- Washington University | CSE 5610: Large Language Models (2025 Fall)
- University of Pennsylvania | CIS 7000: Large Language Models
- California Institute of Technology | Large Language Models for Reasoning
- SGLang - A Fast Serving Framework for Large Language Models and Vision Language Models
- Nano-vLLM - A Lightweight vLLM Implementation Built from Scratch
- llm-d - A Kubernetes-Native High-Performance Distributed LLM Inference Framework
- Xinference - Deploy AI Models Fast and Seamlessly, Enterprise-Ready
- Tencent KsanaLLM - A High-Performance and Easy-to-Use Engine for LLM Inference and Serving
- LMDeploy - A Toolkit for Compressing, Deploying, and Serving LLMs
- NVIDIA TensorRT-LLM - A TensorRT Toolbox for Optimized Large Language Model Inference
- DeepSeek - The Path to Open-Sourcing the DeepSeek Inference Engine
- RTP-LLM - Alibaba's High-Performance LLM Inference Engine for Diverse Applications
- NVIDIA Dynamo - A Datacenter-Scale Distributed Inference Serving Framework
- MLC LLM - Universal LLM Deployment Engine with ML Compilation
- A One-Article Guide to LLM Inference Engines: llama.cpp/vLLM/SGLang/FastTransformer/TensorRT/TGI/MindIE (with diagrams)
- A One-Article Guide to LLM Inference Serving Platforms (Xinference/Ollama/GPUStack/KServe/Triton/LMDeploy)