Releases: intel/llm-scaler

llm-scaler-vllm beta release 0.10.0-b2

21 Aug 15:02
Pre-release

What’s new

  • vLLM:

    • Bug fix for sym_int4 online quantization on multi-modal models

llm-scaler-vllm beta release 0.10.0-b1

05 Sep 01:50
c04b5f5
Pre-release

What’s new

  • vLLM:

    • Upgrade vLLM to version 0.10.0
    • Support async scheduling via the --async-scheduling option
    • Move embedding/reranker model support to the V1 engine
    • Support pipeline parallelism with the mp/ray backends (see the sketch after this list)
    • Enable the InternVL3-8B model
    • Enable the MiniCPM-V-4 model
    • Enable the InternVL3_5-8B model
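
A minimal sketch (not part of the release notes) of combining async scheduling with pipeline parallelism in offline mode, assuming this downstream build keeps upstream vLLM's engine arguments (async_scheduling, pipeline_parallel_size, distributed_executor_backend); the model name is a placeholder:

```python
# Hypothetical example: async scheduling plus 2-way pipeline parallelism.
# Argument names follow upstream vLLM 0.10 engine arguments and may differ in this build.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",       # placeholder model
    pipeline_parallel_size=2,               # split layers across two GPUs
    distributed_executor_backend="ray",     # "mp" is the other supported backend
    async_scheduling=True,                  # engine-arg counterpart of --async-scheduling
)

outputs = llm.generate(
    ["Explain pipeline parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```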

llm-scaler-vllm beta release 0.9.0-b3

05 Sep 01:42
8249baf
Pre-release

What’s new

  • vLLM:

    • Enable the Whisper model
    • Enable GLM-4.5-Air
    • Optimize vLLM memory usage by updating the profile_run logic
    • Enable and optimize pipeline parallelism with the Ray backend
    • Enable GLM-4.1V-9B-Thinking for image input (see the sketch after this list)
    • Enable the dots.ocr model
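
A minimal sketch of feeding an image to the newly enabled GLM-4.1V-9B-Thinking, assuming upstream vLLM's multi-modal offline API; the model id, prompt text, and file name are illustrative only, and a real run should apply the model's chat template:

```python
# Hypothetical example: single-image input through vLLM's multi_modal_data field.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="THUDM/GLM-4.1V-9B-Thinking", max_model_len=8192)  # illustrative model id

image = Image.open("example.jpg")             # illustrative local image
outputs = llm.generate(
    {
        "prompt": "Describe this image.",     # real usage should apply the model's chat template
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```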

llm-scaler-vllm PV release 1.0

09 Aug 03:14
84f3771

What’s new

  • vLLM:

    • Performance optimization of TPOP for long input lengths (>4K): up to 1.8x for a 40K sequence length on the 32B KPI model, and up to 4.2x for a 40K sequence length on the 70B KPI model.
    • Performance optimizations with ~10% output-throughput improvement for 8B-32B KPI models compared to the previous drop.
    • New feature: by-layer online quantization to reduce the required GPU memory
    • New feature: PP (pipeline parallelism) support in vLLM (experimental)
    • New feature: torch.compile support (experimental)
    • New feature: speculative decoding (experimental)
    • Support for embedding and rerank models (see the sketch after this list)
    • Enhanced multi-modal model support
    • Performance improvements
    • Maximum-length auto-detection
    • Data parallelism support
    • Bug fixes
  • OneCCL:

    • OneCCL benchmark tool enablement
  • XPU Manager:

    • GPU Power
    • GPU Firmware update
    • GPU Diagnostic
    • GPU Memory Bandwidth
  • BKC:

    • Implemented an offline installer to ensure a consistent environment and eliminate slow download speeds from the global Ubuntu PPA repository
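
A minimal sketch of the embedding/rerank support, assuming this build exposes upstream vLLM's scoring API for cross-encoder rerankers; the model id is a placeholder:

```python
# Hypothetical example: scoring (reranking) documents against a query.
from vllm import LLM

reranker = LLM(model="BAAI/bge-reranker-v2-m3", task="score")  # placeholder reranker model

query = "How does pipeline parallelism work?"
docs = [
    "Pipeline parallelism splits a model's layers across devices.",
    "FP8 is an 8-bit floating-point number format.",
]

# score() returns one relevance score per (query, document) pair.
for doc, out in zip(docs, reranker.score(query, docs)):
    print(f"{out.outputs.score:.3f}  {doc}")
```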

llm-scaler-vllm beta release 0.2.0-b2

25 Jul 07:05
5a58f3f
Pre-release

What’s new

  • llm-scaler-vllm: Developed a customized downstream version of vLLM with the following key features:
    • int4/fp8 online quantization
    • Support for embedding and rerank models
    • Enhanced multi-modal model support
    • Performance improvements
    • Maximum-length auto-detection
    • Data parallelism support
    • Fixed performance degradation issue
    • Fixed multi-modal OOM issue
    • Fixed MiniCPM wrong output issue

llm-scaler-vllm beta release 0.2.0-b1

11 Jul 02:31
fa68e67
Pre-release

What’s new

  • llm-scaler-vllm: Developed a customized downstream version of vLLM with the following key features:

    • Support for encoder models such as BGE-M3
    • Added Embedding and Rerank interfaces for enhanced downstream capabilities (see the sketch after this list)
    • Integrated Qwen2.5-VL with FP8/FP16 support for multi-modal generation
    • Automatic detection of maximum supported sequence length when a large max-context-length is specified
    • Added support for Qwen3 series models, including our fix on Qwen3's RMSNorm
    • Broader multi-modal model support
    • Data parallelism with verified near-linear scaling
    • Symmetric int4 online quantization
    • FP8 online quantization on CPU
    • Communication support for both SHM (shared memory) and P2P (peer-to-peer) modes
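
A minimal sketch of calling the Embedding interface through the OpenAI-compatible server, assuming BGE-M3 is being served locally on the default port; the request follows the standard /v1/embeddings API:

```python
# Hypothetical example: requesting embeddings from a running llm-scaler-vllm server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local server

resp = client.embeddings.create(
    model="BAAI/bge-m3",
    input=["llm-scaler-vllm runs on Intel GPUs", "BGE-M3 is an encoder model"],
)
print(len(resp.data), "embeddings of dimension", len(resp.data[0].embedding))
```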

Verified Features

  • Encoder and multi-modal models verified, including BGE-M3, Qwen2.5-VL, and Qwen3
  • Data parallelism tested with near-linear scaling across multiple GPUs
  • Verified FP8 and sym-int4 online quantization, including FP8 on CPU
  • Validated Qwen3 RMSNorm fix in both encoder and decoder paths
  • SHM and P2P support verified independently; automatic detection of SHM or P2P mode also confirmed

llm-scaler-vllm pre-production release 0.2.0

04 Jul 02:25

What’s new

  • oneCCL reduced its buffer size and published an official release on GitHub.
  • The GQA kernel brings up to 30% improvement for supported models.
  • Bug fixes for OOM issues exposed by stress testing (more tests are ongoing).
  • Support for 70B FP8 with TP4 in offline mode (see the sketch after this list).
  • DeepSeek-V2-Lite accuracy fix.
  • Other bug fixes.
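
A minimal sketch of the 70B FP8 TP4 offline configuration, assuming upstream vLLM engine arguments; the model id is a placeholder chosen to match the DS-Distilled-LLaMA 70B KPI model mentioned below:

```python
# Hypothetical example: 70B model with FP8 online quantization across four GPUs (TP4).
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",  # placeholder 70B model
    tensor_parallel_size=4,                             # TP4
    quantization="fp8",                                 # FP8 online quantization
)

outputs = llm.generate(
    ["Summarize the benefits of FP8 quantization."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```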

Verified Features

  • Refreshed KPI functionality and performance on 4x and 8x BMG e211 systems; all KPI models now meet the goal. Added FP8 performance of the DS-Distilled-LLaMA 70B model measured on 4x BMG with TP4 in offline mode.
  • FP8 functionality test for 32K/8K (ISL/OSL) on the DS-Distilled-Qwen32B model on 4x BMG with TP4.
  • Verified the model list for FP8 functionality.