UCM aims to accelerate long-sequence inference by replacing KV computation with KVCache table lookup in the Prefill phase, applying sparsification in the Decode phase, and providing a KVCache-centric PD (Prefill-Decode) disaggregated architecture for large-scale scenarios.
The first version of UCM achieved the basic goal of sparsification-based acceleration for long sequences and shipped a working heterogeneous PD Disaggregation example. In Q4 we will progressively release further long-sequence inference acceleration features to improve inference performance, reduce inference cost, and address the problems of long sequences that cannot be served at all or are served too slowly.
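The prefill-phase idea described above, looking up previously computed KV blocks instead of recomputing them, can be sketched as follows. This is a minimal illustration only; `KVCacheTable`, `prefill`, and `model.compute_kv` are assumed names for the sketch, not UCM's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

import torch


@dataclass
class KVCacheTable:
    """Hypothetical prefix-keyed table mapping a tuple of token ids to cached K/V tensors."""
    entries: Dict[Tuple[int, ...], Tuple[torch.Tensor, torch.Tensor]] = field(default_factory=dict)

    def lookup(self, tokens: Tuple[int, ...]) -> Optional[Tuple[torch.Tensor, torch.Tensor]]:
        # Return the KV of the longest cached prefix of `tokens`, if any.
        for end in range(len(tokens), 0, -1):
            hit = self.entries.get(tokens[:end])
            if hit is not None:
                return hit
        return None

    def insert(self, tokens: Tuple[int, ...], kv: Tuple[torch.Tensor, torch.Tensor]) -> None:
        self.entries[tokens] = kv


def prefill(model, tokens: Tuple[int, ...], table: KVCacheTable):
    """Compute KV only for the suffix that missed the cache, reusing the cached prefix."""
    hit = table.lookup(tokens)
    if hit is not None:
        k_cached, v_cached = hit
        start = k_cached.shape[0]          # prefix tokens already covered by the cache
        if start == len(tokens):           # full hit: skip KV computation entirely
            return k_cached, v_cached
    else:
        k_cached = v_cached = None
        start = 0
    # `model.compute_kv` stands in for the real prefill kernel over the uncached suffix.
    k_new, v_new = model.compute_kv(tokens[start:], past_kv=(k_cached, v_cached))
    k = torch.cat([k_cached, k_new]) if k_cached is not None else k_new
    v = torch.cat([v_cached, v_new]) if v_cached is not None else v_new
    table.insert(tuple(tokens), (k, v))
    return k, v
```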
Core
- CacheBlend
- Prefill KVCache Offload
- Model Window Extrapolation
- Sparse
  - Sparse Attention Framework Optimization
  - GSA Optimization
  - KVComp Optimization
  - KVStar Optimization
- PD Disaggregation
  - Heterogeneous Optimization
  - PD Scheduler
- Store
  - UCM Store V1 framework
  - CacheStore
  - PosixStore
  - PipelineStore
  - Scatter Gather IO
  - GPU Direct Storage
  - NPU Direct Storage
Others
- Docs Optimization
- Tools
  - Observability: metrics monitoring via Prometheus (see the sketch after this list)
  - KVStore tooling: bandwidth measurement tool
- Benchmark & Test
  - Mooncake Trace and more datasets for PD testing
  - Benchmarks for sparsification performance and accuracy
  - Support performance benchmarks (LLMPerf)
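For the observability item above, a minimal sketch of exporting inference metrics with the Python `prometheus_client` library is shown below. The metric names and the `timed_request` wrapper are assumptions for illustration, not UCM's actual instrumentation.

```python
import time

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; UCM's actual metric set is not defined here.
REQUESTS = Counter("ucm_requests_total", "Total inference requests served")
TTFT = Histogram("ucm_time_to_first_token_seconds", "Time to first token in seconds")


def timed_request(run_prefill, run_decode):
    """Record request count and time-to-first-token around an inference call."""
    REQUESTS.inc()
    start = time.monotonic()
    first_token = run_prefill()          # placeholder for the engine's prefill step
    TTFT.observe(time.monotonic() - start)
    return run_decode(first_token)       # placeholder for the decode loop


# Expose /metrics on port 8000 for Prometheus to scrape; the server runs in a
# background thread, so call this once at engine startup.
start_http_server(8000)
```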