Name		Name	Last commit message	Last commit date
parent directory ..
vtc		vtc
README.md		README.md
least_busy_time.go		least_busy_time.go
least_kv_cache.go		least_kv_cache.go
least_latency.go		least_latency.go
least_request.go		least_request.go
prefix_cache.go		prefix_cache.go
prefix_cache_and_load.go		prefix_cache_and_load.go
prefix_cache_test.go		prefix_cache_test.go
random.go		random.go
router.go		router.go
router_test.go		router_test.go
throughput.go		throughput.go
util.go		util.go
vtc.go		vtc.go

README.md

Routing Algorithms

Prefix Cache Aware

Below is the pseudo-code for prefix-cache aware routing.

func prefix_cache_routing(ready_pods []*v1.Pod) {
    if check_load_imbalance(ready_pods) {
        target_pod = select_pod_with_least_running_requests(ready_pods)
    } else {
        match_pods, prefix_hashes = match_prefix(ready_pods)
        if len(match_pod) > 0 {
            target_pod = select_least_loaded_match_pod(match_pods)
        }
    }

    // if no target pod is selected, fallback to select pod with least request
    if target_pod == nil {
        target_pod = select_pod_with_least_running_requests(ready_pods)
    }
}

func check_load_imbalance(ready_pods) {
    // filter pods with min and max number of running requests
    min_pod = select_pod_min_running_requests()
    max_pod = select_pod_max_running_requests()
    
    // if difference between max & min running requests count 
    // is more than configurable ABS_RUNNING_REQUEST_COUNT (default: 8)
    // then load is imbalanced
    if max_pod - min_pod > ABS_RUNNING_REQUEST_COUNT {
        return true
    }
    return false
}

func match_prefix(input_tokens, ready_pods) {
    // input_tokens are split based off configurable block_sizes and 
    // hash is calculated for each token_block
    hashes = calculate_hashes(input_tokens)

    // checks if token_block exists on ready_pods [prefix_match], 
    // if present calculate pod_name: prefix_match_percent
    match_pods_with_prefix_match_percent = check_hashes_on_ready_pods(hashes, ready_pods)
}

func select_least_loaded_match_pod(match_pods_with_prefix_match_percent, ready_pods) {
    mean = calculate_mean_running_request(ready_pods)   
    std_dev = calculate_std_dev_running_request(ready_pods)

    // sort match_pods in decreasing perfix_match_percent and 
    // for same prefix_match_percent, sort in increasing running_request count.
    sort(match_pods_with_prefix_match_percent)

    // select match pod with highest prefix and running_request < (mean + std_dev)
    for pod := range match_pods_with_prefix_match_percent {
        if pod.running_request < mean + load_factor*std_dev {
            return pod
        }
    }
}

// selects pod with minimum running requests, similar to least-request routing algorithm
func select_pod_with_least_running_requests(ready_pods) {
    return select_pod_min_running_requests()
}

Configurations

AIBRIX_PREFIX_CACHE_TOKENIZER_TYPE

AIBrix gateway implements two tokenizers character and tiktoken. Default tokenizer is character.

Tokenizer Type Details

character splits input text into characters

tiktoken open-source openai/tiktoken tokenizer
AIBRIX_PREFIX_CACHE_BLOCK_SIZE

Tokenized input request is split into blocks and hash value of the blocks is cached for future match. Size of the block (i.e. number of tokens per block) defines how effective prefix match will be. Default is character tokenizer and 128 block size (tokens per block).

Tokenizer Type Block Size Recommendation

character 128

tiktoken 16
AIBRIX_PREFIX_CACHE_BLOCK_NUMBER

Maximum number of prefix cache blocks. Default is 200000.
AIBRIX_PREFIX_CACHE_POD_RUNNING_REQUEST_IMBALANCE_ABS_COUNT

Before evaluating prefix cache match, router checks if there is imbalance of running requests across pods. Imbalance is measured using absolute difference between max & min running requests across pods, for example if imbalance_abs_count = 16 and running requests for pods are [p1: 1, p2: 2, p3:20] then current scenario is flagged as imbalanced. If flagged as imbalanced then prefix match is ignored and request is routed to pod with least running requests which in above example will to route to pod p1. Default is 16 and should be adjusted based on GPU hardware & prompt length.
AIBRIX_PREFIX_CACHE_STANDARD_DEVIATION_FACTOR

After evaluating prefix match, pods are selected with matching prefix cache. Selected pods are re-evaluated to prevent a hotspot scenario where bulk of prefix matching requests are routed to same pod. Imbalanced is checked as follows
```
  prefix_match_pod.running_requests <= mean + load_factor * standard_deviation
  
```
load_factor determines number of standard deviations. Default is 2

Tokenizer Type	Details
character	splits input text into characters
tiktoken	open-source openai/tiktoken tokenizer

Tokenizer Type	Block Size Recommendation
character	128
tiktoken	16

Virtual Token Counter (VTC)

The Virtual Token Counter (VTC) is a fair scheduling algorithm for LLM serving based on the paper "Fairness in Serving Large Language Models" (Sheng et al.). VTC aims to provide fairness among clients by tracking the service (weighted token count) each client has received and prioritizing those who have received less service. It integrates with continuous batching and handles challenges unique to LLM serving, like variable token costs and unknown output lengths. The research paper and reference implementation artifact can be found at Fairness in Serving Large Language Models (Sheng et al.) .

vtc-basic

The vtc-basic variant implements a simplified version of VTC. It routes requests by combining two scores: a fairness score based on the user’s token count (normalized against all users) within a time window, and a utilization score based on the pod’s current load. The pod with the lowest combined score is selected. This approach adapts dynamically to system load and balances fairness with efficient resource use.

End-to-end tests for this algorithm can be found in vtc_routing_test.go.

Environment Variables

Variable	Description	Default
`AIBRIX_ROUTING_ALGORITHM`	Set to `vtc-basic` to enable this routing strategy.	`prefix-aware`
`AIBRIX_ROUTER_VTC_TOKEN_TRACKER_TIME_UNIT`	Time unit of the sliding window for tracking user token usage.	`minutes`
`AIBRIX_ROUTER_VTC_TOKEN_TRACKER_WINDOW_SIZE`	Size of the sliding window for tracking user token usage.	`5`
`AIBRIX_ROUTER_VTC_TOKEN_TRACKER_MIN_TOKENS`	Sensible min default value for adaptive token tracking (see vtc_basic)	`1000`
`AIBRIX_ROUTER_VTC_TOKEN_TRACKER_MAX_TOKENS`	Sensible max default value for adaptive token tracking (see vtc_basic)	`8000`
`AIBRIX_ROUTER_VTC_BASIC_INPUT_TOKEN_WEIGHT`	Weight applied to input tokens in fairness calculations.	`1.0`
`AIBRIX_ROUTER_VTC_BASIC_OUTPUT_TOKEN_WEIGHT`	Weight applied to output tokens in fairness calculations.	`2.0`
`AIBRIX_ROUTER_VTC_BASIC_MAX_POD_LOAD`	Normalization factor for pod load in utilization score calculation.	`100.0`
`AIBRIX_ROUTER_VTC_BASIC_FAIRNESS_WEIGHT`	Weight applied to fairness score in combined score calculation.	`1.0`
`AIBRIX_ROUTER_VTC_BASIC_UTILIZATION_WEIGHT`	Weight applied to utilization score in combined score calculation.	`1.0`

Other VTC variants

The VTC paper and reference implementation (slora/server/router) describe other VTC variants like vtc-fair (pure fairness), vtc-max-fair, and vtc-pred (using length prediction). Adapting these variants for use within the Aibrix Kubernetes routing context, which involves selecting specific pods rather than managing a single request queue, is a potential area for future enhancement.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

algorithms

algorithms

README.md

Routing Algorithms

Prefix Cache Aware

Configurations

Virtual Token Counter (VTC)

vtc-basic

Environment Variables

Other VTC variants

Files

algorithms

Directory actions

More options

Directory actions

More options

Latest commit

History

algorithms

Folders and files

parent directory

README.md

Routing Algorithms

Prefix Cache Aware

Configurations

Virtual Token Counter (VTC)

vtc-basic

Environment Variables

Other VTC variants