Enhanced vLLM with advanced CUDA kernel optimizations for memory management, fragmentation reduction, and query-aware cache selection.
| Optimization | Improvement | Description |
|---|---|---|
| Query-Aware Cache Selection | 30-50% throughput | Intelligent cache management based on query characteristics |
| Fragmentation Reduction | 40-60% reduction | Advanced techniques to minimize memory fragmentation |
| Best-Fit Allocation | 20-30% space saved | Optimized block allocation strategies |
| Memory Utilization | >96% (vs 60-80%) | Efficient memory usage with continuous block allocation |
| FlashAttention Integration | 2-3x speedup | Combined FlashAttention with PagedAttention |
The fragmentation metric combines spatial and temporal fragmentation:

$$F = \alpha F_{\text{spatial}} + \beta F_{\text{temporal}} + \gamma F_{\text{access}}$$

where spatial fragmentation is:

$$F_{\text{spatial}} = \frac{1}{N} \sum_{i=1}^{N} \frac{|r_i - \bar{r}|}{\bar{r}} \exp\!\left(-\frac{d_i^2}{2\sigma^2}\right)$$

temporal fragmentation considers access patterns:

$$F_{\text{temporal}} = \sum_{t} w_t \operatorname{Var}_j\!\left(A_{t,j}\right)$$

and access-based fragmentation:

$$F_{\text{access}} = \sum_{k} p_k \, \mathbb{E}[L_k]$$

where:

- $r_i$: size of the $i$-th free block run, $\bar{r} = \frac{1}{N}\sum_{i=1}^{N} r_i$
- $d_i$: distance from the optimal location, $\sigma$: spatial decay parameter
- $A_{t,j}$: access frequency of block $j$ at time $t$, $w_t$: temporal weights
- $p_k$: probability of accessing cache level $k$, $L_k$: latency distribution of level $k$
- $\alpha, \beta, \gamma$: weighting coefficients
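A minimal NumPy sketch of this combined metric, assuming illustrative array shapes and default weights (`fragmentation_score` and all of its parameters are hypothetical names for illustration, not TinyServe's kernel API):

```python
import numpy as np

def fragmentation_score(free_runs, distances, access, level_probs, level_latency,
                        sigma=4.0, weights=(0.5, 0.3, 0.2)):
    """Illustrative combined fragmentation metric (see equations above).

    free_runs:     (N,) sizes r_i of contiguous free-block runs
    distances:     (N,) distance d_i of each run from its optimal location
    access:        (T, J) access frequencies A_{t,j} per timestep and block
    level_probs:   (K,) probability p_k of hitting cache level k
    level_latency: (K,) mean latency E[L_k] of each cache level
    """
    alpha, beta, gamma = weights
    r_bar = free_runs.mean()
    # Spatial term: dispersion of run sizes, damped by distance from optimal placement.
    f_spatial = np.mean(np.abs(free_runs - r_bar) / r_bar
                        * np.exp(-distances**2 / (2 * sigma**2)))
    # Temporal term: recency-weighted variance of per-block access frequencies.
    w_t = np.exp(-np.arange(access.shape[0])[::-1] / access.shape[0])
    f_temporal = np.sum(w_t * access.var(axis=1))
    # Access term: expected latency over cache levels.
    f_access = np.dot(level_probs, level_latency)
    return alpha * f_spatial + beta * f_temporal + gamma * f_access
```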
Multi-objective optimization for cache selection:

$$c^* = \arg\max_{c \in \mathcal{C}} U(q, c)$$

where the utility function is:

$$U(q, c) = \mathbf{w}^\top \phi(q, c) - \lambda_1 C(c) + \lambda_2 R(c)$$

with feature vector:

$$\phi(q, c) = \big[\phi_1(q, c), \ldots, \phi_m(q, c)\big]^\top$$

cost function:

$$C(c) = \sum_{i} \alpha_i \, c_i$$

and reward function:

$$R(c) = \sum_{i} \gamma^{i} r_i$$

where:

- $q$: query, $c$: cache candidate, $\mathcal{C}$: candidate set
- $\mathbf{w}$: learnable weights, $\lambda_i, \alpha_i$: trade-off parameters
- $\gamma$: discount factor, $r_i$: reward at step $i$
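A toy sketch of how such a utility-based selection could look at serving time, assuming dictionary-shaped candidates with pre-extracted features (`select_cache` and its inputs are illustrative names, not the repo's API):

```python
import numpy as np

def select_cache(query_feats, candidates, w, lambdas=(0.5, 0.1), gamma=0.9):
    """Pick the cache candidate c* maximizing U(q, c).

    query_feats: (d,) feature vector for the query q
    candidates:  list of dicts with 'feats' (d,), 'costs' (m,), 'rewards' (T,)
    w:           (2d,) learnable weights over the joint [query; candidate] features
    """
    lam_cost, lam_reward = lambdas
    best, best_u = None, -np.inf
    for c in candidates:
        phi = np.concatenate([query_feats, c["feats"]])  # phi(q, c)
        cost = c["costs"].sum()                          # C(c), alpha_i folded into 'costs'
        reward = sum(gamma**i * r for i, r in enumerate(c["rewards"]))  # R(c)
        u = w @ phi - lam_cost * cost + lam_reward * reward
        if u > best_u:
            best, best_u = c, u
    return best, best_u
```

In practice $\phi(q, c)$ might carry query statistics (length, attention sparsity) and cache statistics (hit rate, residency), with $\mathbf{w}$ fit offline.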
Multi-constraint optimization for block allocation:

$$b^* = \arg\min_{b \in \mathcal{B}} \sum_{i} w_i f_i(b) + \lambda \, \Omega(b)$$

subject to constraints:

$$g_k(b) \le 0, \quad k = 1, \ldots, K$$

with gradient penalty:

$$\Omega(b) = \sum_{j} e^{-\alpha t_j} \, d(b, j)^2$$

where:

- $b$: block allocation, $\mathcal{B}$: set of feasible allocations
- $w_i$: feature weights over allocation features $f_i(b)$, $\lambda$: regularization coefficient
- $d(b, j)$: distance between blocks, $\alpha$: temporal decay ($t_j$: time since block $j$ was last accessed)
- $g_k$: capacity and alignment constraints
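The best-fit strategy from the optimization table above is easy to sketch over a free list; the `(start, length)` run representation below is an assumption for illustration:

```python
def best_fit_allocate(free_runs, request_size):
    """Best-fit block allocation: choose the smallest free run that still
    fits the request, minimizing leftover space (and thus fragmentation).

    free_runs:    list of (start_block, run_length) tuples, contiguous free runs
    request_size: number of blocks requested
    Returns (start_block, leftover_run) or None if nothing fits.
    """
    best = None
    for start, length in free_runs:
        if length >= request_size and (best is None or length < best[1]):
            best = (start, length)
    if best is None:
        return None  # caller falls back to eviction / defragmentation
    start, length = best
    leftover = (start + request_size, length - request_size)
    return start, (leftover if leftover[1] > 0 else None)
```

For example, `best_fit_allocate([(0, 8), (12, 3), (20, 5)], 4)` picks the 5-block run at 20 and leaves a single-block remainder, whereas first-fit would have split the 8-block run and stranded more space.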
The TinyServe integration touches the following files:

- `csrc/tinyserve_cache_kernels.cu`
- `csrc/tinyserve_kernels.h`
- `csrc/cache.h`
- `csrc/cache_kernels.cu`
- `csrc/torch_bindings.cpp`
1. Copy the TinyServe files into vLLM:

   ```bash
   cd <vllm_root>
   cp csrc/tinyserve_cache_kernels.cu csrc/
   cp csrc/tinyserve_kernels.h csrc/
   ```

2. Apply the modifications to the vLLM files:

   - `csrc/cache.h`: add the content from `patches/cache.h.patch` to the end of the file
   - `csrc/cache_kernels.cu`: add the content from `patches/cache_kernels.cu.patch` to the end of the file
   - `csrc/torch_bindings.cpp`: add the content from `patches/torch_bindings.cpp.patch` inside the `TORCH_LIBRARY` block (after the `cp_gather_indexer_k_quant_cache` registration)

3. Rebuild vLLM:

   ```bash
   cd <vllm_root>
   pip install -e . --no-build-isolation
   ```
Note: The patch files contain only the modifications needed. Simply copy the content from each patch file to the corresponding location in the vLLM source code.
- CUDA 11.8+ and compatible GPU
- Python 3.8+
- PyTorch 2.0+
- vLLM installed with TinyServe modifications
After integration, TinyServe optimizations are automatically enabled. The cache operations will use TinyServe's optimized kernels when available.
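For example, standard vLLM offline inference runs unchanged (the model name below is only a placeholder):

```python
from vllm import LLM, SamplingParams

# Standard vLLM usage; no API changes are needed. With the TinyServe patches
# compiled in, KV-cache operations dispatch to the optimized kernels.
llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=1)
params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["Explain paged KV caches in one paragraph."], params)
print(outputs[0].outputs[0].text)
```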
To compare performance with baseline vLLM:
```bash
# Run baseline vLLM
python -m vllm.entrypoints.api_server \
    --model <model_name> \
    --tensor-parallel-size 1

# Run TinyServe-enhanced vLLM (same command, optimizations are automatic)
python -m vllm.entrypoints.api_server \
    --model <model_name> \
    --tensor-parallel-size 1
```

TinyServe provides enhanced cache metrics. Monitor memory utilization and fragmentation through vLLM's existing monitoring tools; the optimizations work transparently with vLLM's cache management system.
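For a quick spot check alongside `nvidia-smi`, PyTorch can report the same device-level utilization:

```python
import torch

# Report overall GPU memory utilization; compare a baseline run against a
# TinyServe-enhanced run to observe the utilization and fragmentation gains.
free, total = torch.cuda.mem_get_info()
used = total - free
print(f"GPU memory in use: {used / 2**30:.1f} / {total / 2**30:.1f} GiB "
      f"({100 * used / total:.1f}%)")
```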
For reproducible experiments:
1. Environment Setup:

   ```bash
   # Install dependencies
   pip install -r requirements.txt
   pip install -e . --no-build-isolation
   ```

2. Run Benchmarks:

   ```bash
   # Throughput benchmark
   python benchmarks/benchmark_throughput.py --model <model_name>

   # Latency benchmark
   python benchmarks/benchmark_latency.py --model <model_name>
   ```

3. Compare Results (a helper sketch follows this list):

   - Memory utilization: Check GPU memory usage with `nvidia-smi`
   - Throughput: Compare requests/second
   - Latency: Compare P50/P99 latencies
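A small helper like the sketch below (entirely illustrative, including the sample numbers) turns the two runs into a side-by-side summary:

```python
# Hypothetical helper: plug in the numbers reported by the two benchmark
# runs above to get a side-by-side comparison.
def summarize(baseline_rps, tinyserve_rps, baseline_p99_ms, tinyserve_p99_ms):
    print(f"Throughput:  {tinyserve_rps / baseline_rps:.2f}x "
          f"({baseline_rps:.1f} -> {tinyserve_rps:.1f} req/s)")
    print(f"P99 latency: {baseline_p99_ms / tinyserve_p99_ms:.2f}x lower "
          f"({baseline_p99_ms:.0f} ms -> {tinyserve_p99_ms:.0f} ms)")

# Example with made-up numbers:
summarize(baseline_rps=10.0, tinyserve_rps=14.0,
          baseline_p99_ms=900, tinyserve_p99_ms=650)
```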
| Metric | Baseline | TinyServe-vLLM | Improvement |
|---|---|---|---|
| Memory Utilization | 60-80% | >96% | +16-36 pts |
| Fragmentation | High | Low | -40-60% |
| Throughput | 1x | 1.3-1.5x | +30-50% |
| Allocation Efficiency | 70-80% | 90-95% | +20-25% |
```bibtex
@inproceedings{liu2025tinyserve,
  title={TinyServe: Query-Aware Cache Selection for Efficient LLM Serving},
  author={Liu, Dong and Yu, Yanxuan},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  pages={12529--12537},
  year={2025}
}
```

Developed by Dong Liu and Yanxuan Yu at FastLM.ai.