Releases: intel/llm-scaler
llm-scaler-vllm beta release 0.10.0-b2
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:0.10.0-b2 (see the run sketch below)
What’s new
- vLLM:
- Bug fix for sym_int4 online quantization on multi-modal models
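A minimal sketch of pulling and starting this image, assuming a host whose Intel GPUs are exposed through /dev/dri; the device passthrough, host networking, shared-memory size, and model mount path are typical single-node assumptions, not part of this release.

```bash
# Pull the release image listed above.
docker pull intel/llm-scaler-vllm:0.10.0-b2

# Start an interactive container. Adjust the device passthrough, shm size,
# and model mount to your environment.
docker run -it --rm \
  --device /dev/dri \
  --net=host \
  --shm-size=16g \
  -v /path/to/models:/models \
  intel/llm-scaler-vllm:0.10.0-b2
```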
llm-scaler-vllm beta release 0.10.0-b1
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:0.10.0-b1
What’s new
- vLLM:
- Upgrade vLLM to version 0.10.0
- Support async scheduling via the --async-scheduling option (see the launch sketch after this list)
- Move embedding/reranker model support to the V1 engine
- Support pipeline parallelism with the mp/ray backend
- Enable the InternVL3-8B model
- Enable the MiniCPM-V-4 model
- Enable InternVL3_5-8B
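A minimal launch sketch for the new options above, assuming the image exposes the standard vllm serve entry point; the model paths, parallel sizes, and ports are placeholders. --pipeline-parallel-size and --distributed-executor-backend are the upstream vLLM spellings, shown here as an assumption about how this build exposes PP.

```bash
# Async scheduling (new in this release) on a single instance.
vllm serve /models/Qwen2.5-7B-Instruct \
  --async-scheduling \
  --port 8000

# Pipeline parallelism with the Ray backend (mp is the alternative backend).
vllm serve /models/Qwen2.5-72B-Instruct \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2 \
  --distributed-executor-backend ray \
  --port 8001
```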
llm-scaler-vllm beta release 0.9.0-b3
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:0.9.0-b3
What’s new
- vLLM:
- Enable the Whisper model (see the transcription sketch after this list)
- Enable GLM-4.5-Air
- Optimize vLLM memory usage by updating profile_run logic
- Enable/Optimize pipeline parallelism with Ray backend
- Enable GLM-4.1V-9B-Thinking for image input
- Enable the dots.ocr model
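Assuming a Whisper checkpoint is being served and this build exposes upstream vLLM's OpenAI-compatible /v1/audio/transcriptions route, a transcription request might look like the following; the port, audio file, and model id are placeholders.

```bash
# Transcribe a local audio file against a served Whisper model.
curl -s http://localhost:8000/v1/audio/transcriptions \
  -F file=@sample.wav \
  -F model=openai/whisper-large-v3
```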
llm-scaler-vllm PV release 1.0
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:1.0
- Docker Image: intel/llm-scaler-platform:1.0
What’s new
- vLLM:
- Performance optimization of TPOT for long input lengths (>4K): up to 1.8x performance at 40K sequence length on the 32B KPI model, and 4.2x at 40K sequence length on the 70B KPI model.
- Performance optimizations with ~10% output-throughput improvement for 8B-32B KPI models compared to the last drop.
- New feature: By-layer online quantization to reduce the required GPU memory
- New feature: PP (pipeline parallelism) support in vLLM (experimental)
- New feature: torch.compile (experimental)
- New feature: speculative decoding (experimental)
- Support for embedding, rerank model
- Enhanced multi-modal model support
- Performance improvements
- Maximum length auto-detecting
- Data parallelism support
- Bug fixes
- OneCCL:
- OneCCL benchmark tool enablement
- XPU Manager (see the xpu-smi sketch after this list):
- GPU power
- GPU firmware update
- GPU diagnostics
- GPU memory bandwidth
- BKC:
- Implemented an offline installer to ensure a consistent environment and avoid slow downloads from the global Ubuntu PPA repository
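XPU Manager surfaces these areas through the xpu-smi CLI; a sketch of how they might be exercised is below. The device id and firmware path are placeholders, and the exact sub-commands and arguments should be confirmed against the installed xpu-smi version.

```bash
# Enumerate GPUs and their device ids.
xpu-smi discovery

# Live statistics for device 0, including GPU power.
xpu-smi stats -d 0

# Run a level-1 diagnostic on device 0.
xpu-smi diag -d 0 -l 1

# GFX firmware update on device 0; arguments are an assumption, check
# `xpu-smi updatefw -h` before use.
xpu-smi updatefw -d 0 -t GFX -f /path/to/firmware.bin
```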
llm-scaler-vllm beta release 0.2.0-b2
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:0.2.0-b2
What’s new
- llm-scaler-vllm: Developed a customized downstream version of vLLM with the following key features:
- int4/fp8 online quantization (see the launch sketch after this list)
- Support for embedding, rerank model
- Enhanced multi-modal model support
- Performance improvements
- Maximum length auto-detecting
- Data parallelism support
- Fixed performance degradation issue
- Fixed multi-modal OOM issue
- Fixed MiniCPM wrong output issue
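A launch sketch for the online-quantization path, assuming upstream vLLM's --quantization flag is the entry point; fp8 is the upstream spelling for dynamic FP8 quantization of an unquantized checkpoint, while the exact value for the int4 path is specific to this downstream and should be taken from its documentation. The model path and TP size are placeholders.

```bash
# FP8 online quantization applied at load time.
vllm serve /models/Qwen2.5-32B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 4 \
  --port 8000
```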
llm-scaler-vllm beta release 0.2.0-b1
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:0.2.0-b1
What’s new
- llm-scaler-vllm: Developed a customized downstream version of vLLM with the following key features:
- Support for encoder models such as BGE-M3
- Added Embedding and Rerank interfaces for enhanced downstream capabilities (see the request sketch after this list)
- Integrated Qwen2.5-VL with FP8/FP16 support for multi-modal generation
- Automatic detection of the maximum supported sequence length when a large max-context-length is specified
- Added support for Qwen3 series models, including our fix for Qwen3's RMSNorm
- Broader multi-modal model support
- Data parallelism with verified near-linear scaling
- Symmetric int4 online quantization
- FP8 online quantization on CPU
- Communication support for both SHM (shared memory) and P2P (peer-to-peer) modes
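A sketch of exercising the new interfaces through the OpenAI-compatible server, assuming BGE-M3 (embeddings) and a reranker model are being served; /v1/embeddings follows the upstream vLLM API, while the rerank route shown here is an assumption and may differ in this build.

```bash
# Embedding request against a served BGE-M3 model.
curl -s http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "BAAI/bge-m3", "input": ["what does llm-scaler provide?"]}'

# Rerank a small candidate set; the /v1/rerank route is an assumption.
curl -s http://localhost:8000/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{"model": "BAAI/bge-reranker-v2-m3",
       "query": "what does llm-scaler provide?",
       "documents": ["doc one", "doc two"]}'
```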
Verified Features
- Encoder and multi-modal models verified, including BGE-M3, Qwen2.5-VL, and Qwen3
- Data parallelism tested with near-linear scaling across multiple GPUs
- Verified FP8 and sym-int4 online quantization, including FP8 on CPU
- Validated Qwen3 RMSNorm fix in both encoder and decoder paths
- SHM and P2P support verified independently; automatic detection of SHM or P2P mode also confirmed
llm-scaler-vllm pre-production release 0.2.0
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:0.2.0-pre-release
What’s new
- oneCCL reduced the buffer size and published an official release on GitHub.
- The GQA kernel brings up to 30% improvement for models.
- Bug fixes for OOM issues exposed by stress testing (more tests are ongoing).
- Support for 70B FP8 TP4 in offline mode.
- DeepSeek-V2-Lite accuracy fix.
- Other bug fixes.
Verified Features
- Refreshed KPI functionality and performance on 4x and 8x BMG e211 systems. All KPI models now meet the goal. Added FP8 performance of the DS-Distilled-LLaMA 70B model measured on 4x BMG with TP4 in offline mode.
- FP8 functionality test for 32K/8K (ISL/OSL) on the DS-Distilled-Qwen-32B model on 4x BMG with TP4 (see the benchmark sketch below).
- Verified model list for FP8 functionality.
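A sketch of how the 32K/8K ISL/OSL point might be reproduced with vLLM's serving benchmark, assuming the upstream benchmarks/benchmark_serving.py script is available in the image and the FP8 TP4 model is already being served; the model path, prompt count, and request rate are placeholders, and the actual KPI methodology may differ.

```bash
# Random-length benchmark with ~32K input and 8K output tokens per request.
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model /models/DeepSeek-R1-Distill-Qwen-32B \
  --dataset-name random \
  --random-input-len 32768 \
  --random-output-len 8192 \
  --num-prompts 16 \
  --request-rate inf
```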