Releases: vllm-project/aibrix

v0.4.1

19 Aug 05:31
f0c65c2

Automatically generated release for tag v0.4.1.

Full Changelog: v0.4.0...v0.4.1

v0.4.0

05 Aug 19:51
24eaefc

🚀 New Features Highlights

  • Prefill/Decode (P/D) Disaggregation Support: Introduces StormService and RoleSet CRDs to enable fine-grained orchestration of P/D roles, along with routing to unlock disaggregated inference at scale. (#1209, #1226, #1229, #1256, #1258, #1259, #1268, #1280, #1309, #1311, #1354, #1355, #1377, #1399, #1402)
  • KVCache V1 Connector Optimizations: Delivers a major refactor with v1 Connector integration, CUDA kernel separation from vllm downstream, a compact memory layout, connector integration for PrisDB and InfiniStore (with TCP), tunable block sizes, RDMA auto-detection support, and several performance optimizations to boost throughput and deployment density. (#1174, #1194, #1247, #1274, #1276, #1278, #1286, #1287, #1288, #1295, #1303, #1312, #1318)
  • KV Event Synchronization: Introduces remote tokenizer support to ensure tokenization consistency between client and server, and implements a comprehensive KV cache event synchronization system that shares KV cache state between vLLM instances and the AIBrix gateway for improved prefix caching efficiency. (#1307, #1328, #1349, #1362)
  • Multi-Engine Deployment Support: Adds unified regression test suites and Helm values to support heterogeneous backends including vLLM, SGLang, and Dynamo, enabling flexible model deployment across engines. (#1293, #1319, #1322, #1341, #1346)

📊 Feature Enhancements

🌐 Gateway Enhancements

  • Adds an SLO-aware router with profile support (#1192, #1305, #1368).
  • Adds custom inference port and metrics port support (#1140, #1313).
  • Makes the HTTPRoute timeout configurable and checks for a missing HTTPRoute before a request starts (#1212, #1344).
  • Adds metrics server support and a ready-to-use sample dashboard (#1211).

☁️ Control Plane Improvements

  • Enhances the CRD existence check and improves webhook support (#1170, #1187).
  • Ensures cache sync before starting controller reconciliation and resyncs objects on component restarts (#1146, #1219).
  • Uses worker pool management for periodic metrics updates (#1096).

📦 Installation & Tooling & CI

  • Adds Helm Chart support with helm standard labels and probes (#1323, #1331, #1343).
  • Supports multi-arch (AMD, ARM) Docker builds and refactors release pipelines (#1315, #1317, #1324, #1325).
  • Improves the kind development workflow, supports port-forwarding via the Makefile, supports overriding IMAGE_TAG, and disables the Docker push workflow in forked repos (#1210, #1274, #1301).

🐞 Bug Fixes

  • Fixes incorrect request counts, out-of-index errors, and race conditions in the AIBrix router (#1246, #1262, #1305).
  • Fixes the prefix cache chained hashing issue and optimizes hashing to O(N) via block hashes (#1218, #1262).
  • Fixes completion body parsing and complex content bugs (#1145, #1160).
  • Fixes legacy autoscaling annotation misconfigurations (#1173).
  • Fixes image replacement issues in Kustomize (#1165).
  • Fixes e2e test flakiness with wait.PollUntilContextTimeout (#1214).
  • Adds a read lock for h.histogram (#1147).
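
The chained block hashing mentioned in the prefix cache fix above can be illustrated with a minimal sketch. This is not the AIBrix implementation: the function name `chained_block_hashes`, the block size, and the use of SHA-256 are all assumptions made for illustration. The idea is that each block's hash is derived from the previous block's hash plus the current block's tokens, so a prompt of N tokens is hashed in O(N) total work while every hash still identifies the whole prefix up to that block.

```python
import hashlib
from typing import List, Sequence


def chained_block_hashes(tokens: Sequence[int], block_size: int = 4) -> List[str]:
    """Return one hash per full block; each hash covers the whole prefix so far.

    Hypothetical helper for illustration only. Trailing tokens that do not
    fill a complete block are ignored.
    """
    hashes: List[str] = []
    parent = b""  # digest of the previous block; empty for the first block
    for i in range(len(tokens) // block_size):
        block = tokens[i * block_size:(i + 1) * block_size]
        h = hashlib.sha256()
        h.update(parent)  # chain: fold in the previous block's digest
        h.update(",".join(map(str, block)).encode())
        parent = h.digest()
        hashes.append(parent.hex())
    return hashes


# Two prompts that share their first block but differ in the second.
a = chained_block_hashes([1, 2, 3, 4, 5, 6, 7, 8, 9])
b = chained_block_hashes([1, 2, 3, 4, 5, 6, 7, 0])
# Same second block as `a`, but a different first block.
c = chained_block_hashes([9, 9, 9, 9, 5, 6, 7, 8])
```

Because of the chaining, `a` and `b` agree on the first hash only, and `c` shares no hash with `a` even though its second block holds the same tokens.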

Full Changelog: v0.3.0...v0.4.0

v0.3.0

21 May 21:42
ecc3529

Automatically generated release for tag v0.3.0.

🚀 New Features Highlights

  • AIBrix KVCache Offloading Framework: Introduces a pluggable multi-tier KVCache architecture with support for DRAM and remote backends, enabling efficient offloading of KV states to reduce GPU memory pressure and increase deployment density. (#1057, #1061, #1062, #1063, #1064, #1068, #1069, #1080, #1107)
  • New KVCache orchestration API: Refactors the orchestration layer to support caching solutions based on distributed hashing. (#971, #984, #985, #1037, #1055, #1071, #1114)
  • Prefix Cache and Load-Aware Routing: Uses hash-based token prefix matching and load awareness to reduce latency by increasing the prefix cache hit rate and routing efficiency. (#838, #774, #933, #1067)
  • Preble Routing (ICLR’25): Implements Preble, which balances KV cache reuse and GPU load by comparing prefix lengths and computing prompt-aware cost scores for optimal routing. (#678, #719, #730, #1024)
  • Fairness-oriented Routing (OSDI’24 VTC): Introduces the vtc-basic router with Windowed Adaptive Fairness Routing, which dynamically tracks token usage and ensures fair load distribution across pods. (#964, #1011, #1065)
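
As a rough illustration of the prefix cache and load-aware routing described above, the sketch below picks the pod with the longest matching run of cached prefix blocks while skipping pods that are much busier than the least-loaded one. It is a simplified sketch under stated assumptions, not the AIBrix router: the `Pod` class, `select_pod`, and the `max_load_gap` threshold are hypothetical names and values.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Pod:
    name: str
    running_requests: int = 0
    cached_blocks: Set[str] = field(default_factory=set)  # block hashes held by this pod


def matched_prefix_len(block_hashes: List[str], pod: Pod) -> int:
    """Count how many leading blocks of the prompt are already cached on the pod."""
    n = 0
    for h in block_hashes:
        if h not in pod.cached_blocks:
            break
        n += 1
    return n


def select_pod(block_hashes: List[str], pods: List[Pod], max_load_gap: int = 8) -> Pod:
    """Prefer the longest prefix match among pods that are not overloaded."""
    if not pods:
        raise ValueError("no pods available")
    least_loaded = min(p.running_requests for p in pods)
    # Load awareness: ignore pods far busier than the least-loaded pod.
    candidates = [p for p in pods if p.running_requests - least_loaded <= max_load_gap]
    # Longest prefix match wins; ties go to the pod with fewer running requests.
    return max(
        candidates,
        key=lambda p: (matched_prefix_len(block_hashes, p), -p.running_requests),
    )


pods = [
    Pod("a", running_requests=2, cached_blocks={"h1", "h2"}),
    Pod("b", running_requests=0, cached_blocks={"h1"}),
    Pod("c", running_requests=20, cached_blocks={"h1", "h2", "h3"}),
]
chosen = select_pod(["h1", "h2", "h3"], pods)
```

Here pod `c` holds the longest prefix but is excluded as overloaded, so the request goes to pod `a`. With no cached prefix at all, the choice falls back to the least-loaded pod.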

📊 Feature Enhancements

Gateway Enhancements

  • Support for OpenAI-compatible APIs, including streaming responses, usage reporting, asynchronous handling, and standardized error responses for seamless end-to-end integration. (#703, #788, #799)
  • Introduced the /v1/models endpoint for compatibility with OpenAI-style API clients. (#802)
  • Refactored gateway-plugins with an extensible ext-proc server architecture, laying the foundation for pluggable policies. (#810)
  • Improved concurrency safety and routing stability through major cache and router redesigns (#878, #884)

Control Plane:

  • Added Kubernetes webhook validation for CRDs, providing early error feedback during resource creation (#748, #786).
  • Improved RayClusterFleet to fully support DeepSeek-R1/V3 models (#789, #826, #835, #914, #954).
  • Added the scale subresource to the RayClusterFleet CRD and enabled HPA support (#1082, #1109).

Installation Experiences:

  • Introduced Terraform modules for GCP and Kubernetes deployment (#823).
  • Added setup guides for Minikube on Lambda Cloud and AWS in the documentation (#1020).
  • Enabled standalone controller installation for simplified system bootstrapping. (#930, #931)
  • Streamlined upgrade workflows by introducing kubectl apply support. CRDs are now split and applied with --server-side, avoiding annotation size limits and enabling smooth incremental updates. (#793)
  • Enabled container image publishing to GitHub Container Registry (GHCR) (#1041).
  • Added support for ARM container images (#1090).

Observability & Stability:

  • Shipped prebuilt Grafana dashboards covering control plane, gateway, and KV cache components for out-of-the-box observability. (#1048)
  • Tuned Envoy proxy memory and buffer configurations for better performance under high concurrency. (#825, #967)
  • Added graceful shutdown, liveness, and readiness probes to improve service resilience (#962).
  • Delivered production-ready monitoring setups for all major system components (#1048).

Full Changelog: v0.2.0...v0.3.0

v0.3.0-rc.2

21 May 04:13
c3bb240
Pre-release

Automatically generated release for tag v0.3.0-rc.2.

Full Changelog: v0.3.0-rc.1...v0.3.0-rc.2

v0.3.0-rc.1

13 May 07:12
575aa5d
Pre-release

v0.2.1

09 Mar 13:25
858ec82

Automatically generated release for tag v0.2.1.

Full Changelog: v0.2.0...v0.2.1

v0.2.0

19 Feb 18:31
0a21d77

Automatically generated release for tag v0.2.0.

🚀 New Features Highlights

  • Distributed KV Cache: Implemented support for managing KV cache across multiple nodes, enhancing performance.
  • Cost-Driven Heterogeneous Serving: Improved scheduling and inference strategies for mixed GPU environments, optimizing cost and resource utilization. (#371, #430, #509, #554, #598)
  • Optimizer-Based Autoscaling: Leverages offline profiles of the inference server to calculate the number of replicas. (#430, #500, #692, #508)
  • Prefix Cache Aware Routing: Added support for routing decisions based on prefix cache hits, improving inference efficiency. (#641, #657)
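
To make the optimizer-based autoscaling and cost-driven heterogeneous serving items above more concrete, here is a minimal sketch of turning an offline profile into a replica count. It is not the AIBrix GPU optimizer: `GpuProfile`, `replicas_needed`, `cheapest_plan`, the GPU type names, and all numbers are hypothetical. It assumes a profile that records the request rate one replica can sustain under the target SLO.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class GpuProfile:
    max_rps_per_replica: float  # measured offline under the target SLO (assumed metric)
    cost_per_hour: float        # illustrative price, not a real quote


def replicas_needed(demand_rps: float, profile: GpuProfile) -> int:
    """Smallest replica count whose profiled capacity covers the demand."""
    if demand_rps <= 0:
        return 0
    return math.ceil(demand_rps / profile.max_rps_per_replica)


def cheapest_plan(demand_rps: float, profiles: Dict[str, GpuProfile]) -> Tuple[str, int, float]:
    """Pick the GPU type that serves the demand at the lowest hourly cost."""
    plans = []
    for gpu, profile in profiles.items():
        n = replicas_needed(demand_rps, profile)
        plans.append((n * profile.cost_per_hour, gpu, n))
    cost, gpu, n = min(plans)
    return gpu, n, cost


profiles = {
    "gpu-small": GpuProfile(max_rps_per_replica=4.0, cost_per_hour=1.0),
    "gpu-large": GpuProfile(max_rps_per_replica=20.0, cost_per_hour=4.0),
}
high_demand_plan = cheapest_plan(18.0, profiles)
low_demand_plan = cheapest_plan(3.0, profiles)
```

With these made-up numbers, a demand of 18 requests per second is cheapest on one large GPU, while 3 requests per second is cheapest on one small GPU.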

📊 Feature Enhancements

  • LoRA Scheduling Enhancements: Introduced multiple scheduling strategies, including bin packing, least latency, least throughput, and random. (#544)
  • Gateway Enhancements: Improved request handling efficiency by enabling streaming in the Envoy gateway (#377). Enhanced the handling of model registration and invalid cache scenarios (#542). Introduced fallback strategies to ensure robust request allocation (#445). Optimized cache store retrieval, reducing unnecessary overhead (#639). Fixed a missing Prometheus config that prevented gateway startup (#441).
  • PodAutoscaler Scaling Improvements: Improved scaling logic to handle edge cases more efficiently. (#508, #515)
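
The four LoRA scheduling strategies named above (bin packing, least latency, least throughput, and random) can be sketched as interchangeable selection functions. This is an illustrative sketch only: the `PodStats` fields, the `schedule` function, and the capacity filter are assumptions and do not reflect the actual AIBrix scheduler interfaces.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class PodStats:
    name: str
    loaded_adapters: int   # adapters currently loaded on the pod
    max_adapters: int      # capacity limit (hypothetical field)
    avg_latency_ms: float
    throughput_tps: float  # tokens per second currently served


def bin_pack(pods: List[PodStats]) -> PodStats:
    # Fill the fullest pod first so lightly used pods stay free.
    return max(pods, key=lambda p: p.loaded_adapters)


def least_latency(pods: List[PodStats]) -> PodStats:
    return min(pods, key=lambda p: p.avg_latency_ms)


def least_throughput(pods: List[PodStats]) -> PodStats:
    return min(pods, key=lambda p: p.throughput_tps)


STRATEGIES: Dict[str, Callable[[List[PodStats]], PodStats]] = {
    "bin_pack": bin_pack,
    "least_latency": least_latency,
    "least_throughput": least_throughput,
}


def schedule(strategy: str, pods: List[PodStats],
             rng: Optional[random.Random] = None) -> PodStats:
    """Choose a pod for a new adapter among pods that still have capacity."""
    candidates = [p for p in pods if p.loaded_adapters < p.max_adapters]
    if not candidates:
        raise RuntimeError("no pod has capacity for another adapter")
    if strategy == "random":
        return (rng or random).choice(candidates)
    return STRATEGIES[strategy](candidates)


pods = [
    PodStats("p1", loaded_adapters=3, max_adapters=4, avg_latency_ms=120.0, throughput_tps=900.0),
    PodStats("p2", loaded_adapters=1, max_adapters=4, avg_latency_ms=80.0, throughput_tps=1000.0),
    PodStats("p3", loaded_adapters=4, max_adapters=4, avg_latency_ms=50.0, throughput_tps=100.0),
]
```

Pod `p3` is full and is never selected. Among the remaining pods, bin packing picks `p1`, least latency picks `p2`, and least throughput picks `p1`.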

🛠 Infrastructure & CI/CD Upgrades

  • Parallelized Build Tasks: Improved CI efficiency by running builds in parallel. (#398)
  • CrashLoopBackOff Detection in CI: Added monitoring for pod failures in testing workflows. (#444)
  • Improved GitHub Actions Cost Efficiency: Optimized triggers and removed unnecessary nightly builds. (#411, #422)
  • Integration Tests for Core Components: Added integration tests for autoscalers, routing policies, and deployment configurations. (#616, #620)

What's Changed

  • Add envoy gateway streaming support by @varungup90 in #377
  • Add client traffic policy to increase per connection buffer size from 32kb to 256kb by @varungup90 in #395
  • Misc: add support to metricsSources property of podautoscaler by @zhangjyr in #371
  • [Misc] Update runtime server startup command in v0.1.0 by @brosoul in #396
  • [CI] improve the ci efficiency by parallelizing the build tasks by @nwangfw in #398
  • Fix the ticker interval by removing unnecessary ms by @Jeffwan in #415
  • [Misc] Disable specific endpoints logs by @Jeffwan in #418
  • [CI] Github Action trigger condition optimized for cost saving by @nwangfw in #411
  • [Misc] Fix the mocked app role permission issue by @Jeffwan in #416
  • [CI] Nightly tag removed for release branch by @nwangfw in #422
  • Enable setting PodAutoscaler configuration via YAML labels by @kr11 in #409
  • Update manifest to adopt v0.1.1 images by @Jeffwan in #429
  • [Bug]: duplicated http in rest metrics fetcher (#408) by @zhangjyr in #421
  • [MISC]: Improve Request Trace Granularity with Version Control by @zhangjyr in #431
  • Support histogram metrics from engine in cache by @Jeffwan in #424
  • Support fetching metrics from remote Prometheus server by @Jeffwan in #433
  • [CI] Add python wheel to release artifact by @Jeffwan in #434
  • Fix update cache pod issue and refactor updatePod handler by @Jeffwan in #439
  • Extract common metrics structure to types and utils by @Jeffwan in #438
  • Fix gateway startup issue due to missing prometheus config by @Jeffwan in #441
  • [feat]: GPU Optimizer and Simulator development app by @zhangjyr in #430
  • Add selectrandom fallback in routing and only scraping healthy pods by @Jeffwan in #445
  • AIBrix Workload Generator / Scenario Simulator by @happyandslow in #428
  • CrashLoopBackOff status detection in CI by @nwangfw in #444
  • Support installing individual controllers from giant controller-manager by @nwangfw in #442
  • Refactor Scaler: Resolve Issues with Metric Parameter Updates in Multiple KPAs by @kr11 in #437
  • Support metrics multi labels for different models by @brosoul in #450
  • Add health check api interface for runtime by @Jeffwan in #451
  • Fix the service name override issue in rolebindings by @Jeffwan in #453
  • Reorganize docs/development and docs/tutorial structure by @Jeffwan in #455
  • Move tools to separate folders and update mocked app README.md by @Jeffwan in #457
  • Fix multi models metric result in PromQL by @brosoul in #458
  • Support Azure LLM trace in workload generator by @happyandslow in #462
  • Fix autoscaler scalingstrategy switching logic by @nwangfw in #475
  • Fix missing handle of PromQL scope is PodMetricScope by @brosoul in #479
  • [Misc] Consolidate app and simulator by @zhangjyr in #477
  • [Bug] Avoid including sensitive info in Dockerfile ENV by @zhangjyr in #487
  • Refactor generator to generate time-based traces by @happyandslow in #478
  • [CI] Update deploy workload script in installation test by @nwangfw in #499
  • [Bug] handle metricKey creation with MetricsSources by @nwangfw in #498
  • Adding Client for Workload Generator Workload File by @happyandslow in #501
  • [Feat] Integrate deployment configurations and fix autoscaler/gpu optimizer connectivity by @zhangjyr in #500
  • Fix some simulator format issue and add some TODOs by @Jeffwan in #505
  • [Bug] Fix the way how podautoscaler handle 0 pods. by @zhangjyr in #508
  • [Misc] Improve gpu optimizer debugging on podautoscaler. by @zhangjyr in #509
  • Optimize kustomize overlay for volcano engine deployment by @Jeffwan in #512
  • [perf] Refact tos downloader in Runtime by @brosoul in #510
  • Refactor metric source for customized protocol, port and path by @kr11 in #511
  • [Bug] Fixed the yaml of deployments in heterogenous GPU settings to make KPA scaling work as expected. by @zhangjyr in #513
  • [Misc] Heterogeneous GPU Optimizer Logging Clean Up by @nwangfw in #514
  • Fix KPA bug, and an elaborate KPA test case by @kr11 in #515
  • Cut v0.2.0-rc.1 release by @Jeffwan in #516
  • [Bug] Accumulated bug fix on controller manager, mock app configuration, and gpu optimizer. by @zhangjyr in #522
  • [Misc] Reduced runtime's container image size by @nwangfw in #518
  • clean memory scaler object when pa crd is deleted by @kr11 in #520
  • Configure autoscaler http client to skip certificate check by @Jeffwan in #530
  • [Doc] Update aibrix documentation by @Jeffwan in #533
  • Refactor the gateway-plugin and metadata service manifests by @Jeffwan in #531
  • Fix the GITHUB_WORKSPACE artifact sharing issue in release workflow by @Jeffwan in #532
  • [Misc] Polish the benchmark scripts by @Jeffwan in #525
  • Fix APA bugs in creation, add test and demo yaml by @kr11 in #536
  • Add VKE IPv4 Testing Cluster Config by @nwangfw in #537
  • Support for request length internal trace by @happyandslow in #538
  • [Feat] Add download status into runtime downloader by @brosoul in #539
  • [Feat] Add runtime model management api by @brosoul in #540
  • [gateway] handle the wrong model name and cache inconsistency case by @Jeffwan in #542
  • [Docs] fix: update the parameters instruction in readme by @scarlet25151 in #548
  • add lora schedulers - bin pack, least latency, least throughput, random by @Aspirin96 in #544
  • add request routers - least kv cache, least expected latency by @Aspirin96 in #543
  • [Docs] heterogenous gpu docs added by ...

v0.2.0-rc.2

23 Jan 22:23
6ee2f11
Pre-release

Automatically generated release for tag v0.2.0-rc.2.

What's Changed

  • [Bug] Accumulated bug fix on controller manager, mock app configuration, and gpu optimizer. by @zhangjyr in #522
  • [Misc] Reduced runtime's container image size by @nwangfw in #518
  • clean memory scaler object when pa crd is deleted by @kr11 in #520
  • Configure autoscaler http client to skip certificate check by @Jeffwan in #530
  • [Doc] Update aibrix documentation by @Jeffwan in #533
  • Refactor the gateway-plugin and metadata service manifests by @Jeffwan in #531
  • Fix the GITHUB_WORKSPACE artifact sharing issue in release workflow by @Jeffwan in #532
  • [Misc] Polish the benchmark scripts by @Jeffwan in #525
  • Fix APA bugs in creation, add test and demo yaml by @kr11 in #536
  • Add VKE IPv4 Testing Cluster Config by @nwangfw in #537
  • Support for request length internal trace by @happyandslow in #538
  • [Feat] Add download status into runtime downloader by @brosoul in #539
  • [Feat] Add runtime model management api by @brosoul in #540
  • [gateway] handle the wrong model name and cache inconsistency case by @Jeffwan in #542
  • [Docs] fix: update the parameters instruction in readme by @scarlet25151 in #548
  • add lora schedulers - bin pack, least latency, least throughput, random by @Aspirin96 in #544
  • add request routers - least kv cache, least expected latency by @Aspirin96 in #543
  • [Docs] heterogenous gpu docs added by @nwangfw in #545
  • Fix race condition in cache by @varungup90 in #550
  • Fix pod internal cache delete handling by @varungup90 in #552
  • Handle terminating pod for request routing by @varungup90 in #549
  • Support absolute path as lora adapter artifact path by @Jeffwan in #556
  • Deadlock fix for cache by @varungup90 in #557
  • Mock app log fix for missing metrics warning by @varungup90 in #564
  • Add vllm graceful termination configuration by @nwangfw in #568
  • Enhance dynamic lora adapter support for auth enabled scenario by @Jeffwan in #571
  • Update pyproject.toml to support python 3.12 by @Jeffwan in #579
  • [Docs ]Update ai runtime management api and downloader docs by @Jeffwan in #577
  • Check the HPA ownerReference in request enqueue by @Jeffwan in #582
  • Add request length for traces by @happyandslow in #569
  • Support model registration flow using aibrix runtime api by @Jeffwan in #580
  • Gateway plugin report total incoming requests and pending requests by @zhangjyr in #554
  • Support distributed kv cache orchestration by @Jeffwan in #583
  • Grant workflow action permission to write packages by @Jeffwan in #586
  • Update routers to use GetPodModelMetric api and misc cleanup in metri… by @varungup90 in #590
  • Update upload/download artifact github actions version to v4 by @varungup90 in #591
  • Update version in aibrix/python to 0.2.0-rc.2 by @varungup90 in #594

Full Changelog: v0.2.0-rc.1...v0.2.0-rc.2

v0.1.2

09 Jan 06:44
b0766a9

Full Changelog: v0.1.1...v0.1.2

v0.2.0-rc.1

10 Dec 20:16
0d40fbd
Pre-release

What's Changed

  • Add envoy gateway streaming support by @varungup90 in #377
  • Add client traffic policy to increase per connection buffer size from 32kb to 256kb by @varungup90 in #395
  • Misc: add support to metricsSources property of podautoscaler by @zhangjyr in #371
  • [Misc] Update runtime server startup command in v0.1.0 by @brosoul in #396
  • [CI] improve the ci efficiency by parallelizing the build tasks by @nwangfw in #398
  • Fix the ticker interval by removing unnecessary ms by @Jeffwan in #415
  • [Misc] Disable specific endpoints logs by @Jeffwan in #418
  • [CI] Github Action trigger condition optimized for cost saving by @nwangfw in #411
  • [Misc] Fix the mocked app role permission issue by @Jeffwan in #416
  • [CI] Nightly tag removed for release branch by @nwangfw in #422
  • Enable setting PodAutoscaler configuration via YAML labels by @kr11 in #409
  • Update manifest to adopt v0.1.1 images by @Jeffwan in #429
  • [Bug]: duplicated http in rest metrics fetcher (#408) by @zhangjyr in #421
  • [MISC]: Improve Request Trace Granularity with Version Control by @zhangjyr in #431
  • Support histogram metrics from engine in cache by @Jeffwan in #424
  • Support fetching metrics from remote Prometheus server by @Jeffwan in #433
  • [CI] Add python wheel to release artifact by @Jeffwan in #434
  • Fix update cache pod issue and refactor updatePod handler by @Jeffwan in #439
  • Extract common metrics structure to types and utils by @Jeffwan in #438
  • Fix gateway startup issue due to missing prometheus config by @Jeffwan in #441
  • [feat]: GPU Optimizer and Simulator development app by @zhangjyr in #430
  • Add selectrandom fallback in routing and only scraping healthy pods by @Jeffwan in #445
  • AIBrix Workload Generator / Scenario Simulator by @happyandslow in #428
  • CrashLoopBackOff status detection in CI by @nwangfw in #444
  • Support installing individual controllers from giant controller-manager by @nwangfw in #442
  • Refactor Scaler: Resolve Issues with Metric Parameter Updates in Multiple KPAs by @kr11 in #437
  • Support metrics multi labels for different models by @brosoul in #450
  • Add health check api interface for runtime by @Jeffwan in #451
  • Fix the service name override issue in rolebindings by @Jeffwan in #453
  • Reorganize docs/development and docs/tutorial structure by @Jeffwan in #455
  • Move tools to separate folders and update mocked app README.md by @Jeffwan in #457
  • Fix multi models metric result in PromQL by @brosoul in #458
  • Support Azure LLM trace in workload generator by @happyandslow in #462
  • Fix autoscaler scalingstrategy switching logic by @nwangfw in #475
  • Fix missing handle of PromQL scope is PodMetricScope by @brosoul in #479
  • [Misc] Consolidate app and simulator by @zhangjyr in #477
  • [Bug] Avoid including sensitive info in Dockerfile ENV by @zhangjyr in #487
  • Refactor generator to generate time-based traces by @happyandslow in #478
  • [CI] Update deploy workload script in installation test by @nwangfw in #499
  • [Bug] handle metricKey creation with MetricsSources by @nwangfw in #498
  • Adding Client for Workload Generator Workload File by @happyandslow in #501
  • [Feat] Integrate deployment configurations and fix autoscaler/gpu optimizer connectivity by @zhangjyr in #500
  • Fix some simulator format issue and add some TODOs by @Jeffwan in #505
  • [Bug] Fix the way how podautoscaler handle 0 pods. by @zhangjyr in #508
  • [Misc] Improve gpu optimizer debugging on podautoscaler. by @zhangjyr in #509
  • Optimize kustomize overlay for volcano engine deployment by @Jeffwan in #512
  • [perf] Refact tos downloader in Runtime by @brosoul in #510
  • Refactor metric source for customized protocol, port and path by @kr11 in #511
  • [Bug] Fixed the yaml of deployments in heterogenous GPU settings to make KPA scaling work as expected. by @zhangjyr in #513
  • [Misc] Heterogeneous GPU Optimizer Logging Clean Up by @nwangfw in #514
  • Fix KPA bug, and an elaborate KPA test case by @kr11 in #515
  • Cut v0.2.0-rc.1 release by @Jeffwan in #516

Full Changelog: v0.1.1...v0.2.0-rc.1