As the Tang poet Liu Yuxi wrote, "The swallows that once graced the halls of the noble Wang and Xie families now fly into the homes of ordinary people." This is perhaps the true significance of efficient LLM inference: by optimizing large-model inference and pushing performance and efficiency to the limit, large models cease to be the exclusive domain of a privileged few (the wealthy and the tech giants). We strive tirelessly to bring simple, easy-to-use, efficient, and low-cost LLM inference services to everyone.
- MIT Song Han | Model Compression and Acceleration Techniques for AI Computing
- Washington University | CSE 5610: Large Language Models (2025 Fall)
- University of Pennsylvania | CIS 7000: Large Language Models
- California Institute of Technology | Large Language Models for Reasoning
- SGLang - A Fast Serving Framework for Large Language Models and Vision Language Models
- Nano-vLLM - A Lightweight vLLM Implementation Built from Scratch
- llm-d - A Kubernetes-Native High-Performance Distributed LLM Inference Framework
- Xinference - Deploy AI Models Fast and Seamlessly, Enterprise-Ready
- Tencent KsanaLLM - A High-Performance and Easy-to-Use Engine for LLM Inference and Serving
- LMDeploy - A Toolkit for Compressing, Deploying, and Serving LLMs
- NVIDIA TensorRT-LLM - A TensorRT Toolbox for Optimized Large Language Model Inference
- DeepSeek - The Path to Open-Sourcing the DeepSeek Inference Engine
- RTP-LLM - Alibaba's High-Performance LLM Inference Engine for Diverse Applications
- NVIDIA Dynamo - A Datacenter-Scale Distributed Inference Serving Framework
- MLC LLM - Universal LLM Deployment Engine with ML Compilation
- A One-Article Guide to LLM Inference Engines: llama.cpp/vLLM/SGLang/FastTransformer/TensorRT/TGI/MindIE (with diagrams)
- A One-Article Guide to LLM Inference Serving Platforms (Xinference/Ollama/GPUStack/KServe/Triton/LMDeploy)