Replies: 1 comment
- Related to #9601; this could be a hybrid implementation that considers RateLimit headers in some cases.
Feature request
A very common mode of LLM usage is through APIs rather than isolated/private deployments.
This means a client shares access to an LLM with other, unknown users.
Often the observed rate limit is lower than (or simply different from) the rate limit declared by the provider, and API-based services are not committed to an SLA.
However, with multiple API deployments available, we should be able to assign requests in a way that takes each deployment's ability to serve into account,
rather than relying on blind allocation.
Motivation
Performance
Proposal (If applicable)
Implement a probabilistic rate-limit balancer that, on each request, selects the deployment most likely to deliver a good token rate.
How to select the deployment?
For the duration of a sliding window, keep stats for each deployment:
```python
from typing import Dict, List

from pydantic import BaseModel

# LLMSettings and LLMClient are project types assumed to exist elsewhere.

EPSILON_RATE = 1e-10


class RequestLatencyStats(BaseModel):
    latency: float
    timestamp: float
    output_token_count: int = 1
    input_token_count: int = 0
    rate: float = 0


class RateStats(BaseModel):
    mean_latency: float = 0
    rate_stats: List[RequestLatencyStats] = []


class RateBalancedLLMClient(BaseModel):
    models_settings: List[LLMSettings]
    clients_to_rate_stats: Dict[LLMClient, RateStats] = {}
```
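To make the stats concrete, here is one way (an assumption, not part of the proposal) the per-request `rate` field could be computed, folding input tokens into output-token equivalents via the ratio discussed below:

```python
# Hypothetical helper (not in the original proposal): compute one request's
# effective rate in output-token equivalents, using the ~250:1 gpt-4o
# input-to-output token ratio mentioned in the proposal.
INPUT_TO_OUTPUT_TOKEN_RATIO = 250

def effective_rate(input_tokens: int, output_tokens: int, latency: float) -> float:
    # Convert input tokens into output-token equivalents so both directions
    # count toward a single throughput number (tokens per second).
    equivalent_tokens = output_tokens + input_tokens / INPUT_TO_OUTPUT_TOKEN_RATIO
    return equivalent_tokens / latency if latency > 0 else 0.0

# e.g. 1000 input tokens and 200 output tokens served in 2 s:
# (200 + 1000/250) / 2 = 102.0 output-token equivalents per second
```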
On each request, select the best deployment using the stats that fall within the sliding window, then update the stats:
```python
import math

from pydantic_settings import BaseSettings


class RateBalancedLLMClientSettings(BaseSettings):
    RB_LLM_CLIENT_SLIDING_WINDOW_SIZE: int = 60 * 2
    RB_LLM_CLIENT_INPUT_TO_OUTPUT_TOKEN_RATIO: int = 250
    RB_LLM_CLIENT_RATE_GREEDINESS_FACTOR: float = 0.5


rate_balanced_llm_client_settings = RateBalancedLLMClientSettings()


# Intended as a method of RateBalancedLLMClient:
@staticmethod
def _get_probability_to_meet_rate_sla(effective_rate_limits: list[float], rate_sla: float):
    mu = sum(effective_rate_limits) / len(effective_rate_limits)  # Mean
    sigma = (
        sum((x - mu) ** 2 for x in effective_rate_limits) / len(effective_rate_limits)
    ) ** 0.5  # Standard deviation
    # Probability that a sample is greater than the given value, assuming the
    # observed rates are roughly normal. (The original snippet ends at this
    # comment; the return below is a sketch of one way to finish it.)
    if sigma == 0:
        return 1.0 if mu > rate_sla else 0.0
    return 0.5 * (1 - math.erf((rate_sla - mu) / (sigma * math.sqrt(2))))
```
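The "within the sliding window" part above implies discarding stale samples before each selection. A minimal sketch of that pruning step (the helper name and the `(timestamp, rate)` sample shape are assumptions, not from the proposal):

```python
WINDOW_SIZE = 120  # seconds; mirrors the RB_LLM_CLIENT_SLIDING_WINDOW_SIZE default

def prune_window(samples: list[tuple[float, float]], now: float) -> list[tuple[float, float]]:
    """Keep only (timestamp, rate) samples recorded within the sliding window."""
    return [(ts, rate) for ts, rate in samples if now - ts <= WINDOW_SIZE]

now = 1000.0
samples = [(700.0, 50.0), (940.0, 80.0), (995.0, 95.0)]
recent = prune_window(samples, now)  # the 300-second-old sample is dropped
```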
Rationale
With multiple deployments available, we define each deployment's "effective rate limit" as the number of input and output tokens
it can process per second (each model's input/output token ratio differs, e.g. for gpt-4o, 250 input tokens are roughly equivalent to 1 output token).
We don't know the distribution from which these rate limits are sampled, and it is safe to assume it changes constantly due to
demand from other actors and other factors. So we would like to estimate it and, on each turn (request),
select the deployment (client) with the highest probability of answering quickly enough, while considering pragmatic latencies only.
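Putting the pieces together, the per-request selection could look like the following sketch (the function name and tie-breaking policy are assumptions; the proposal only says to pick the most probable deployment):

```python
import random

# Hypothetical selection step: given each client's probability of meeting the
# rate SLA (computed from its windowed stats), pick the most promising
# deployment; ties are broken at random so equally good clients share load.
def select_client(probabilities: dict[str, float]) -> str:
    best = max(probabilities.values())
    candidates = [name for name, p in probabilities.items() if p == best]
    return random.choice(candidates)

choice = select_client({"deploy-a": 0.7, "deploy-b": 0.3})  # → "deploy-a"
```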
Disclaimer/Limitations
It seems that the performance of the Rate Balancer heavily depends on the nature of the different deployments.
The greater the difference between the rate distributions of two deployments, the easier it is for the Rate Balancer to identify and select the "better" deployment.
However, it is recommended to use a small RB_LLM_CLIENT_SLIDING_WINDOW_SIZE, as the rate distributions of deployments tend to fluctuate over time
due to changes in demand from other actors and other dynamic factors. By design, the Rate Balancer cannot avoid deployments that have been
down/idle/latent for longer than RB_LLM_CLIENT_SLIDING_WINDOW_SIZE.
Behavior and Configuration
Unordered round robin: you can achieve this by setting either RB_LLM_CLIENT_SLIDING_WINDOW_SIZE=0 or RB_LLM_CLIENT_RATE_GREEDINESS_FACTOR=0.
If a client hasn't received any request in the last RB_LLM_CLIENT_SLIDING_WINDOW_SIZE seconds and therefore has no stats, then on the next request
all clients get an equal chance until all of them have stats.
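The cold-start behavior described above could be sketched as follows (an assumption about the intended semantics; the function and its fallthrough are hypothetical):

```python
import random

# Sketch: if any client is missing stats for the current window, fall back to
# a uniform choice over all clients so newcomers get an equal chance.
def pick(clients_with_stats: set[str], all_clients: list[str]) -> str:
    if any(c not in clients_with_stats for c in all_clients):
        return random.choice(all_clients)  # equal chance until all have stats
    return all_clients[0]  # placeholder: normally the rate balancer decides here
```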
Note
The performance/importance of the rate-balancer algorithm is highlighted when deployments are unstable
and differ in their ability to serve; having more deployments means a higher chance that one is different.
If all deployments have roughly the same rate limit, we could simply rotate each request to a new deployment
or balance request sizes, but this is not the common case.
Other ideas
What do you think?