Releases: intel/llm-scaler
llm-scaler-vllm beta release 0.10.0-b2
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:0.10.0-b2 (see the run sketch below)
What’s new
- vLLM:
- Bug fix for sym_int4 online quantization on multi-modal models
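A minimal sketch of pulling and starting this image, assuming a host whose Intel GPUs are exposed through /dev/dri; the device passthrough, host networking, shared-memory size, and model mount path are typical single-node assumptions, not part of this release.

```bash
# Pull the release image listed above.
docker pull intel/llm-scaler-vllm:0.10.0-b2

# Start an interactive container. Adjust the device passthrough, shm size,
# and model mount to your environment.
docker run -it --rm \
  --device /dev/dri \
  --net=host \
  --shm-size=16g \
  -v /path/to/models:/models \
  intel/llm-scaler-vllm:0.10.0-b2
```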
llm-scaler-vllm beta release 0.10.0-b1
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:0.10.0-b1
What’s new
- vLLM:
- Upgrade vLLM to version 0.10.0
- Support async scheduling via the --async-scheduling option (see the launch sketch after this list)
- Move embedding/reranker model support to the V1 engine
- Support pipeline parallelism with the mp/ray backend
- Enable the InternVL3-8B model
- Enable the MiniCPM-V-4 model
- Enable InternVL3_5-8B
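A minimal launch sketch for the new options above, assuming the image exposes the standard vllm serve entry point; the model paths, parallel sizes, and ports are placeholders. --pipeline-parallel-size and --distributed-executor-backend are the upstream vLLM spellings, shown here as an assumption about how this build exposes PP.

```bash
# Async scheduling (new in this release) on a single instance.
vllm serve /models/Qwen2.5-7B-Instruct \
  --async-scheduling \
  --port 8000

# Pipeline parallelism with the Ray backend (mp is the alternative backend).
vllm serve /models/Qwen2.5-72B-Instruct \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2 \
  --distributed-executor-backend ray \
  --port 8001
```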
llm-scaler-vllm beta release 0.9.0-b3
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:0.9.0-b3
What’s new
- vLLM:
- Enable the Whisper model (see the transcription sketch after this list)
- Enable GLM-4.5-Air
- Optimize vLLM memory usage by updating profile_run logic
- Enable/Optimize pipeline parallelism with Ray backend
- Enable GLM-4.1V-9B-Thinking for image input
- Enable the dots.ocr model
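Assuming a Whisper checkpoint is being served and this build exposes upstream vLLM's OpenAI-compatible /v1/audio/transcriptions route, a transcription request might look like the following; the port, audio file, and model id are placeholders.

```bash
# Transcribe a local audio file against a served Whisper model.
curl -s http://localhost:8000/v1/audio/transcriptions \
  -F file=@sample.wav \
  -F model=openai/whisper-large-v3
```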
llm-scaler-vllm PV release 1.0
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:1.0
- Docker Image: intel/llm-scaler-platform:1.0
What’s new
- vLLM:
- Performance optimization of TPOT for long input lengths (>4K): up to 1.8x performance at 40K sequence length on the 32B KPI model, and 4.2x at 40K sequence length on the 70B KPI model.
- Performance optimizations with ~10% output-throughput improvement for 8B-32B KPI models compared to the last drop.
- New feature: By-layer online quantization to reduce the required GPU memory
- New feature: PP (pipeline parallelism) support in vLLM (experimental)
- New feature: torch.compile (experimental)
- New feature: speculative decoding (experimental)
- Support for embedding, rerank model
- Enhanced multi-modal model support
- Performance improvements
- Maximum length auto-detecting
- Data parallelism support
- Bug fixes
- OneCCL:
- OneCCL benchmark tool enablement
- XPU Manager (see the xpu-smi sketch after this list):
- GPU power
- GPU firmware update
- GPU diagnostics
- GPU memory bandwidth
- BKC:
- Implemented an offline installer to ensure a consistent environment and avoid slow downloads from the global Ubuntu PPA repository
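XPU Manager surfaces these areas through the xpu-smi CLI; a sketch of how they might be exercised is below. The device id and firmware path are placeholders, and the exact sub-commands and arguments should be confirmed against the installed xpu-smi version.

```bash
# Enumerate GPUs and their device ids.
xpu-smi discovery

# Live statistics for device 0, including GPU power.
xpu-smi stats -d 0

# Run a level-1 diagnostic on device 0.
xpu-smi diag -d 0 -l 1

# GFX firmware update on device 0; arguments are an assumption, check
# `xpu-smi updatefw -h` before use.
xpu-smi updatefw -d 0 -t GFX -f /path/to/firmware.bin
```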
llm-scaler-vllm beta release 0.2.0-b2
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:0.2.0-b2
What’s new
- llm-scaler-vllm: Developed a customized downstream version of vLLM with the following key features:
- int4/fp8 online quantization (see the launch sketch after this list)
- Support for embedding, rerank model
- Enhanced multi-modal model support
- Performance improvements
- Maximum length auto-detecting
- Data parallelism support
- Fixed performance degradation issue
- Fixed multi-modal OOM issue
- Fixed MiniCPM wrong output issue
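A launch sketch for the online-quantization path, assuming upstream vLLM's --quantization flag is the entry point; fp8 is the upstream spelling for dynamic FP8 quantization of an unquantized checkpoint, while the exact value for the int4 path is specific to this downstream and should be taken from its documentation. The model path and TP size are placeholders.

```bash
# FP8 online quantization applied at load time.
vllm serve /models/Qwen2.5-32B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 4 \
  --port 8000
```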
llm-scaler-vllm beta release 0.2.0-b1
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:0.2.0-b1
What’s new
- llm-scaler-vllm: Developed a customized downstream version of vLLM with the following key features:
- Support for encoder models such as BGE-M3
- Added Embedding and Rerank interfaces for enhanced downstream capabilities (see the request sketch after this list)
- Integrated Qwen2.5-VL with FP8/FP16 support for multi-modal generation
- Automatic detection of the maximum supported sequence length when a large max-context-length is specified
- Added support for Qwen3 series models, including our fix for Qwen3's RMSNorm
- Broader multi-modal model support
- Data parallelism with verified near-linear scaling
- Symmetric int4 online quantization
- FP8 online quantization on CPU
- Communication support for both SHM (shared memory) and P2P (peer-to-peer) modes
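A sketch of exercising the new interfaces through the OpenAI-compatible server, assuming BGE-M3 (embeddings) and a reranker model are being served; /v1/embeddings follows the upstream vLLM API, while the rerank route shown here is an assumption and may differ in this build.

```bash
# Embedding request against a served BGE-M3 model.
curl -s http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "BAAI/bge-m3", "input": ["what does llm-scaler provide?"]}'

# Rerank a small candidate set; the /v1/rerank route is an assumption.
curl -s http://localhost:8000/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{"model": "BAAI/bge-reranker-v2-m3",
       "query": "what does llm-scaler provide?",
       "documents": ["doc one", "doc two"]}'
```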
Verified Features
- Encoder and multi-modal models verified, including BGE-M3, Qwen2.5-VL, and Qwen3
- Data parallelism tested with near-linear scaling across multiple GPUs
- Verified FP8 and sym-int4 online quantization, including FP8 on CPU
- Validated Qwen3 RMSNorm fix in both encoder and decoder paths
- SHM and P2P support verified independently; automatic detection of SHM or P2P mode also confirmed
llm-scaler-vllm pre-production release 0.2.0
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:0.2.0-pre-release
What’s new
- oneCCL reduced the buffer size and published an official release on GitHub.
- The GQA kernel brings up to 30% improvement for models.
- Bug fixes for OOM issues exposed by stress testing (more tests are ongoing).
- Support for 70B FP8 TP4 in offline mode.
- DeepSeek-V2-Lite accuracy fix.
- Other bug fixes.
Verified Features
- Refreshed KPI functionality and performance on 4x and 8x BMG e211 systems. All KPI models now meet the goal. Added FP8 performance of the DS-Distilled-LLaMA 70B model measured on 4x BMG with TP4 in offline mode.
- FP8 functionality test for 32K/8K (ISL/OSL) on the DS-Distilled-Qwen-32B model on 4x BMG with TP4 (see the benchmark sketch below).
- Verified model list for FP8 functionality.
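A sketch of how the 32K/8K ISL/OSL point might be reproduced with vLLM's serving benchmark, assuming the upstream benchmarks/benchmark_serving.py script is available in the image and the FP8 TP4 model is already being served; the model path, prompt count, and request rate are placeholders, and the actual KPI methodology may differ.

```bash
# Random-length benchmark with ~32K input and 8K output tokens per request.
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model /models/DeepSeek-R1-Distill-Qwen-32B \
  --dataset-name random \
  --random-input-len 32768 \
  --random-output-len 8192 \
  --num-prompts 16 \
  --request-rate inf
```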