Replies: 1 comment
- Related to #9601; this could be a hybrid implementation that considers RateLimit headers in some cases.
Feature request
A very common mode of LLM usage is through APIs rather than isolated/private deployments.
This means a client shares access to an LLM with other, unknown users.
Often the observed rate limit is lower than (or simply different from) the rate limit declared by the provider, and API-based services are not committed to an SLA.
However, with multiple API deployments available, we should be able to assign requests in a way that takes each deployment's ability to serve into account,
rather than relying on blind allocation.
Motivation
Performance
Proposal (If applicable)
Implement a probabilistic rate-limit balancer that, on each request, selects the deployment most likely to deliver a good token rate.
How to select the deployment?
For the duration of a sliding window, keep stats for each deployment:
```python
from typing import Dict, List

from pydantic import BaseModel

# LLMSettings and LLMClient are project types assumed to exist elsewhere.

EPSILON_RATE = 1e-10


class RequestLatencyStats(BaseModel):
    latency: float
    timestamp: float
    output_token_count: int = 1
    input_token_count: int = 0
    rate: float = 0


class RateStats(BaseModel):
    mean_latency: float = 0
    rate_stats: List[RequestLatencyStats] = []


class RateBalancedLLMClient(BaseModel):
    models_settings: List[LLMSettings]
    clients_to_rate_stats: Dict[LLMClient, RateStats] = {}
```
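To make the stats concrete, here is one way (an assumption, not part of the proposal) the per-request `rate` field could be computed, folding input tokens into output-token equivalents via the ratio discussed below:

```python
# Hypothetical helper (not in the original proposal): compute one request's
# effective rate in output-token equivalents, using the ~250:1 gpt-4o
# input-to-output token ratio mentioned in the proposal.
INPUT_TO_OUTPUT_TOKEN_RATIO = 250

def effective_rate(input_tokens: int, output_tokens: int, latency: float) -> float:
    # Convert input tokens into output-token equivalents so both directions
    # count toward a single throughput number (tokens per second).
    equivalent_tokens = output_tokens + input_tokens / INPUT_TO_OUTPUT_TOKEN_RATIO
    return equivalent_tokens / latency if latency > 0 else 0.0

# e.g. 1000 input tokens and 200 output tokens served in 2 s:
# (200 + 1000/250) / 2 = 102.0 output-token equivalents per second
```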
On each request, select the best deployment using the stats that fall within the sliding window, then update the stats:
```python
import math

from pydantic_settings import BaseSettings


class RateBalancedLLMClientSettings(BaseSettings):
    RB_LLM_CLIENT_SLIDING_WINDOW_SIZE: int = 60 * 2
    RB_LLM_CLIENT_INPUT_TO_OUTPUT_TOKEN_RATIO: int = 250
    RB_LLM_CLIENT_RATE_GREEDINESS_FACTOR: float = 0.5


rate_balanced_llm_client_settings = RateBalancedLLMClientSettings()


# Intended as a method of RateBalancedLLMClient:
@staticmethod
def _get_probability_to_meet_rate_sla(effective_rate_limits: list[float], rate_sla: float):
    mu = sum(effective_rate_limits) / len(effective_rate_limits)  # Mean
    sigma = (
        sum((x - mu) ** 2 for x in effective_rate_limits) / len(effective_rate_limits)
    ) ** 0.5  # Standard deviation
    # Probability that a sample is greater than the given value, assuming the
    # observed rates are roughly normal. (The original snippet ends at this
    # comment; the return below is a sketch of one way to finish it.)
    if sigma == 0:
        return 1.0 if mu > rate_sla else 0.0
    return 0.5 * (1 - math.erf((rate_sla - mu) / (sigma * math.sqrt(2))))
```
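The "within the sliding window" part above implies discarding stale samples before each selection. A minimal sketch of that pruning step (the helper name and the `(timestamp, rate)` sample shape are assumptions, not from the proposal):

```python
WINDOW_SIZE = 120  # seconds; mirrors the RB_LLM_CLIENT_SLIDING_WINDOW_SIZE default

def prune_window(samples: list[tuple[float, float]], now: float) -> list[tuple[float, float]]:
    """Keep only (timestamp, rate) samples recorded within the sliding window."""
    return [(ts, rate) for ts, rate in samples if now - ts <= WINDOW_SIZE]

now = 1000.0
samples = [(700.0, 50.0), (940.0, 80.0), (995.0, 95.0)]
recent = prune_window(samples, now)  # the 300-second-old sample is dropped
```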
Rationale
With multiple deployments available, we define each deployment's "effective rate limit" as the number of input and output tokens
it can process per second (each model's input/output token ratio differs, e.g. for gpt-4o, 250 input tokens are roughly equivalent to 1 output token).
We don't know the distribution from which these rate limits are sampled, and it is safe to assume it changes constantly due to
demand from other actors and other factors. So we would like to estimate it and, on each turn (request),
select the deployment (client) with the highest probability of answering quickly enough, while considering pragmatic latencies only.
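Putting the pieces together, the per-request selection could look like the following sketch (the function name and tie-breaking policy are assumptions; the proposal only says to pick the most probable deployment):

```python
import random

# Hypothetical selection step: given each client's probability of meeting the
# rate SLA (computed from its windowed stats), pick the most promising
# deployment; ties are broken at random so equally good clients share load.
def select_client(probabilities: dict[str, float]) -> str:
    best = max(probabilities.values())
    candidates = [name for name, p in probabilities.items() if p == best]
    return random.choice(candidates)

choice = select_client({"deploy-a": 0.7, "deploy-b": 0.3})  # → "deploy-a"
```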
Disclaimer/Limitations
It seems that the performance of the Rate Balancer heavily depends on the nature of the different deployments.
The greater the difference between the rate distributions of two deployments, the easier it is for the Rate Balancer to identify and select the "better" deployment.
However, it is recommended to use a small RB_LLM_CLIENT_SLIDING_WINDOW_SIZE, as the rate distributions of deployments tend to fluctuate over time
due to changes in demand from other actors and other dynamic factors. By design, the Rate Balancer cannot avoid deployments that have been
down/idle/latent for longer than RB_LLM_CLIENT_SLIDING_WINDOW_SIZE.
Behavior and Configuration
Unordered round robin: you can achieve this by setting either RB_LLM_CLIENT_SLIDING_WINDOW_SIZE=0 or RB_LLM_CLIENT_RATE_GREEDINESS_FACTOR=0.
If a client hasn't received any request in the last RB_LLM_CLIENT_SLIDING_WINDOW_SIZE seconds and therefore has no stats, then on the next request
all clients get an equal chance until all of them have stats.
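The cold-start behavior described above could be sketched as follows (an assumption about the intended semantics; the function and its fallthrough are hypothetical):

```python
import random

# Sketch: if any client is missing stats for the current window, fall back to
# a uniform choice over all clients so newcomers get an equal chance.
def pick(clients_with_stats: set[str], all_clients: list[str]) -> str:
    if any(c not in clients_with_stats for c in all_clients):
        return random.choice(all_clients)  # equal chance until all have stats
    return all_clients[0]  # placeholder: normally the rate balancer decides here
```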
Note
The performance/importance of the rate-balancer algorithm is highlighted when deployments are unstable
and differ in their ability to serve; having more deployments means a higher chance that one is different.
If all deployments have roughly the same rate limit, we could simply rotate each request to a new deployment
or balance request sizes, but this is not the common case.
Other ideas
What do you think?