Homepage: https://iscaconf.org/isca2024/
Paper list: https://www.iscaconf.org/isca2024/program/
- Splitwise: Efficient Generative LLM Inference Using Phase Splitting
- Microsoft
- Best Paper Award
- MECLA: Memory-Compute-Efficient LLM Accelerator with Scaling Sub-matrix Partition
- Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization
- LLMCompass: Enabling Efficient Hardware Design for Large Language Model Inference
- ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching
- Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
- MSRA
- A pre-gating function resolves expert selection ahead of time, alleviating the dynamic nature of sparse expert activation and thereby addressing the large memory footprint of MoE inference (see the sketch below).
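A minimal sketch of the pre-gating idea, assuming a decoder in which block i evaluates the gate for block i+1 so the next block's expert weights can be prefetched off the critical path. All names, the bootstrap step, and the expert-averaging details are illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_EXPERTS, TOP_K, N_BLOCKS = 16, 4, 1, 3

# Expert weights nominally live in slow (host) memory; only prefetched
# experts would be resident on the accelerator.
experts = [[rng.standard_normal((D, D)) for _ in range(N_EXPERTS)]
           for _ in range(N_BLOCKS)]
gates = [rng.standard_normal((D, N_EXPERTS)) for _ in range(N_BLOCKS)]

def topk_experts(x, gate_w, k=TOP_K):
    """Return indices of the k highest-scoring experts for activation x."""
    return np.argsort(x @ gate_w)[-k:]

x = rng.standard_normal(D)

# Conventional MoE decides expert choice at block i itself, putting the
# weight fetch on the critical path. Pre-gated MoE evaluates block i+1's
# gate during block i, so the next block's experts can be prefetched while
# block i computes.
active = topk_experts(x, gates[0])                   # bootstrap: gate block 0
for i in range(N_BLOCKS):
    if i + 1 < N_BLOCKS:
        next_active = topk_experts(x, gates[i + 1])  # pre-gate block i+1 now
        # ... an async prefetch of experts[i+1][e] would be issued here ...
    y = sum(x @ experts[i][e] for e in active) / len(active)
    x = x + y                                        # residual connection
    if i + 1 < N_BLOCKS:
        active = next_active
print("output norm:", float(np.linalg.norm(x)))
```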
- Heterogeneous Acceleration Pipeline for Recommendation System Training [arXiv]
- UBC & GaTech
- Hotline: a runtime framework.
- Utilizes CPU main memory for non-popular embeddings and GPUs' HBM for popular embeddings.
- Fragments each mini-batch into popular and non-popular micro-batches (μ-batches); see the sketch below.
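A rough sketch of the micro-batch fragmentation step, under the assumption that popularity is decided against a fixed hot-ID set; `HOT_IDS`, `split_minibatch`, and the sample layout are hypothetical, not Hotline's actual interface:

```python
# Hypothetical hot set: embedding IDs accessed frequently enough to be cached
# in GPU HBM; all other IDs remain in CPU main memory. Hotline derives this
# from the heavy skew of real access traces; it is hard-coded here.
HOT_IDS = {0, 1, 2, 3}

def split_minibatch(minibatch):
    """Fragment a mini-batch into popular / non-popular micro-batches.

    A sample is 'popular' when every embedding ID it touches is hot, so its
    lookups are served entirely from HBM; a sample touching any cold ID needs
    CPU-side gathers and goes to the non-popular micro-batch.
    """
    popular, non_popular = [], []
    for sample in minibatch:
        (popular if set(sample["ids"]) <= HOT_IDS else non_popular).append(sample)
    return popular, non_popular

minibatch = [
    {"ids": [0, 2], "label": 1},  # all-hot -> GPU-resident micro-batch
    {"ids": [1, 7], "label": 0},  # touches cold ID 7 -> CPU-assisted micro-batch
    {"ids": [3], "label": 1},
]
popular, non_popular = split_minibatch(minibatch)
print(len(popular), "popular /", len(non_popular), "non-popular samples")
```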
- Cambricon-D: Full-Network Differential Acceleration for Diffusion Models
- ICT, CAS
- The first processor design to address diffusion model acceleration.
- Mitigates the additional memory accesses incurred by differential computing while retaining its reduced computation (see the sketch below).
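A toy illustration of differential computing across adjacent denoising steps, using a single linear layer: because consecutive diffusion inputs are nearly identical, the layer can reuse the previous output and operate only on the small delta. The uniform quantization shown is an assumed stand-in for Cambricon-D's actual low-bit delta arithmetic:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64
W = rng.standard_normal((D, D))

x_prev = rng.standard_normal(D)
x_curr = x_prev + 0.01 * rng.standard_normal(D)  # adjacent steps differ slightly

y_prev = W @ x_prev  # step t-1: one full-precision pass

# Differential step t: compute only on the delta. Its dynamic range is tiny,
# so it tolerates a very low-bit representation (the crude uniform
# quantization below is purely illustrative).
delta = x_curr - x_prev
scale = np.abs(delta).max() / 7 + 1e-12
delta_q = np.round(delta / scale) * scale

y_diff = y_prev + W @ delta_q  # cheap update reusing the previous output
y_full = W @ x_curr            # reference: recompute from scratch

# The catch Cambricon-D targets: nonlinear layers still need full-precision
# activations, so naive differential computing adds the extra memory traffic
# that the accelerator is designed to mitigate.
print("relative error:",
      float(np.linalg.norm(y_diff - y_full) / np.linalg.norm(y_full)))
```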
- DaCapo: Accelerating Continuous Learning in Autonomous Systems for Video Analytics
- Intel Accelerator Ecosystem: An SoC-Oriented Perspective
- Intel
- Industry Session