Homepage: https://iscaconf.org/isca2024/
Paper list: https://www.iscaconf.org/isca2024/program/
- Splitwise: Efficient Generative LLM Inference Using Phase Splitting
- Microsoft
- Best Paper Award
- MECLA: Memory-Compute-Efficient LLM Accelerator with Scaling Sub-matrix Partition
- Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization
- LLMCompass: Enabling Efficient Hardware Design for Large Language Model Inference
- ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching
- Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
- MSRA
- A pre-gating function resolves expert selection ahead of time, alleviating the dynamic nature of sparse expert activation and thereby addressing the large memory footprint of MoE inference (see the sketch below).
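A minimal sketch of the pre-gating idea, assuming a decoder in which block i evaluates the gate for block i+1 so the next block's expert weights can be prefetched off the critical path. All names, the bootstrap step, and the expert-averaging details are illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_EXPERTS, TOP_K, N_BLOCKS = 16, 4, 1, 3

# Expert weights nominally live in slow (host) memory; only prefetched
# experts would be resident on the accelerator.
experts = [[rng.standard_normal((D, D)) for _ in range(N_EXPERTS)]
           for _ in range(N_BLOCKS)]
gates = [rng.standard_normal((D, N_EXPERTS)) for _ in range(N_BLOCKS)]

def topk_experts(x, gate_w, k=TOP_K):
    """Return indices of the k highest-scoring experts for activation x."""
    return np.argsort(x @ gate_w)[-k:]

x = rng.standard_normal(D)

# Conventional MoE decides expert choice at block i itself, putting the
# weight fetch on the critical path. Pre-gated MoE evaluates block i+1's
# gate during block i, so the next block's experts can be prefetched while
# block i computes.
active = topk_experts(x, gates[0])                   # bootstrap: gate block 0
for i in range(N_BLOCKS):
    if i + 1 < N_BLOCKS:
        next_active = topk_experts(x, gates[i + 1])  # pre-gate block i+1 now
        # ... an async prefetch of experts[i+1][e] would be issued here ...
    y = sum(x @ experts[i][e] for e in active) / len(active)
    x = x + y                                        # residual connection
    if i + 1 < N_BLOCKS:
        active = next_active
print("output norm:", float(np.linalg.norm(x)))
```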
- Heterogeneous Acceleration Pipeline for Recommendation System Training [arXiv]
- UBC & GaTech
- Hotline: a runtime framework.
- Utilizes CPU main memory for non-popular embeddings and GPUs' HBM for popular embeddings.
- Fragments each mini-batch into popular and non-popular micro-batches (μ-batches); see the sketch below.
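A rough sketch of the micro-batch fragmentation step, under the assumption that popularity is decided against a fixed hot-ID set; `HOT_IDS`, `split_minibatch`, and the sample layout are hypothetical, not Hotline's actual interface:

```python
# Hypothetical hot set: embedding IDs accessed frequently enough to be cached
# in GPU HBM; all other IDs remain in CPU main memory. Hotline derives this
# from the heavy skew of real access traces; it is hard-coded here.
HOT_IDS = {0, 1, 2, 3}

def split_minibatch(minibatch):
    """Fragment a mini-batch into popular / non-popular micro-batches.

    A sample is 'popular' when every embedding ID it touches is hot, so its
    lookups are served entirely from HBM; a sample touching any cold ID needs
    CPU-side gathers and goes to the non-popular micro-batch.
    """
    popular, non_popular = [], []
    for sample in minibatch:
        (popular if set(sample["ids"]) <= HOT_IDS else non_popular).append(sample)
    return popular, non_popular

minibatch = [
    {"ids": [0, 2], "label": 1},  # all-hot -> GPU-resident micro-batch
    {"ids": [1, 7], "label": 0},  # touches cold ID 7 -> CPU-assisted micro-batch
    {"ids": [3], "label": 1},
]
popular, non_popular = split_minibatch(minibatch)
print(len(popular), "popular /", len(non_popular), "non-popular samples")
```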
- Cambricon-D: Full-Network Differential Acceleration for Diffusion Models
- ICT, CAS
- The first processor design to address diffusion model acceleration.
- Mitigates the additional memory accesses incurred by differential computing while retaining its reduced computation (see the sketch below).
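A toy illustration of differential computing across adjacent denoising steps, using a single linear layer: because consecutive diffusion inputs are nearly identical, the layer can reuse the previous output and operate only on the small delta. The uniform quantization shown is an assumed stand-in for Cambricon-D's actual low-bit delta arithmetic:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64
W = rng.standard_normal((D, D))

x_prev = rng.standard_normal(D)
x_curr = x_prev + 0.01 * rng.standard_normal(D)  # adjacent steps differ slightly

y_prev = W @ x_prev  # step t-1: one full-precision pass

# Differential step t: compute only on the delta. Its dynamic range is tiny,
# so it tolerates a very low-bit representation (the crude uniform
# quantization below is purely illustrative).
delta = x_curr - x_prev
scale = np.abs(delta).max() / 7 + 1e-12
delta_q = np.round(delta / scale) * scale

y_diff = y_prev + W @ delta_q  # cheap update reusing the previous output
y_full = W @ x_curr            # reference: recompute from scratch

# The catch Cambricon-D targets: nonlinear layers still need full-precision
# activations, so naive differential computing adds the extra memory traffic
# that the accelerator is designed to mitigate.
print("relative error:",
      float(np.linalg.norm(y_diff - y_full) / np.linalg.norm(y_full)))
```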
- DaCapo: Accelerating Continuous Learning in Autonomous Systems for Video Analytics
- Intel Accelerator Ecosystem: An SoC-Oriented Perspective
- Intel
- Industry Session