-
Northwestern Polytechnical University
-
01:39
(UTC -12:00)
Stars
This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, s…
Model Compression Toolbox for Large Language Models and Diffusion Models
每个人都能看懂的大模型知识分享,LLMs春/秋招大模型面试前必看,让你和面试官侃侃而谈
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
My learning notes/codes for ML SYS.
This is a Chinese translation of the CUDA programming guide
Optimize softmax in triton in many cases
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
TinyChatEngine: On-Device LLM Inference Library
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
Video+code lecture on building nanoGPT from scratch
🚀 Awesome System for Machine Learning ⚡️ AI System Papers and Industry Practice. ⚡️ System for Machine Learning, LLM (Large Language Model), GenAI (Generative AI). 🍻 OSDI, NSDI, SIGCOMM, SoCC, MLSy…
TinySTL is a subset of STL(cut some containers and algorithms) and also a superset of STL(add some other containers and algorithms)
「C/C++学习+面试指南」一份涵盖大部分 C++ 程序员所需要掌握的知识。入门、进阶、深入、校招、社招,准备 C++ 学习& 面试,首选 CppGuide!
A curated list for Efficient Large Language Models
A list of papers, docs, codes about model quantization. This repo is aimed to provide the info for model quantization research, we are continuously improving the project. Welcome to PR the works (p…