Add rate limiter #121
Conversation
    self,
    task: ToolUsageTask,
    *,
    model: str = "gpt-3.5-turbo-16k",
ooc: why do we need a default here?
I can remove in a separate PR -- probably shouldn't be here
Some questions/comments, mainly around docs and communication. The algorithm looks fine to me. It may be good to add tests.
self.available_tokens += elapsed * self.requests_per_second
self.last = now

self.available_tokens = min(self.available_tokens, self.requests_per_second)
For noobs, could we add a comment saying something like "requests_per_second is also the maximum bucket size"?
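To make the "bucket size" point concrete, here is a minimal sketch of the refill step under discussion. The class and method names are illustrative, not the PR's actual code; the point is the final `min(...)` clamp, which means `requests_per_second` doubles as the maximum bucket capacity.

```python
import time


class TokenBucket:
    """Minimal sketch of the refill logic (illustrative names)."""

    def __init__(self, requests_per_second: float = 1) -> None:
        # requests_per_second is also the maximum bucket size: tokens
        # never accumulate beyond one second's worth of requests.
        self.requests_per_second = requests_per_second
        self.available_tokens = 0.0
        self.last = None

    def refill(self) -> None:
        now = time.monotonic()
        if self.last is None:
            self.last = now
        elapsed = now - self.last
        self.available_tokens += elapsed * self.requests_per_second
        self.last = now
        # Clamp: the bucket never holds more than requests_per_second tokens,
        # no matter how long the limiter sat idle.
        self.available_tokens = min(self.available_tokens, self.requests_per_second)
```

So after any idle period, at most one second's worth of burst capacity is available.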
    fractions of a second.
    """
    if requests_per_second < 1:
        raise ValueError("Rate must be at least 1 request per second")
For the lazy user who doesn't read the docstring, thoughts on adding a more instructional recommendation here? I would naively assume I could pass 1/4 requests per second, for instance.
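One way to address this review comment would be a more instructional error message. This is a hypothetical sketch of what that could look like, with the check factored into a standalone function for illustration; the wording is a suggestion, not the PR's actual text.

```python
def check_rate(requests_per_second: float) -> None:
    """Validate the rate, with a message aimed at users who would
    naively pass a fractional rate such as 0.25 (one request every 4s)."""
    if requests_per_second < 1:
        raise ValueError(
            "Rate must be at least 1 request per second. Fractional rates "
            "below 1 (e.g. 0.25 for one request every 4 seconds) are not "
            "supported by this limiter."
        )
```

The message names the exact mistake a reader of this thread anticipated, rather than just restating the constraint.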
    self,
    *,
    requests_per_second: float = 1,
    tokens_per_request: int = 1,
Given our domain, could we clarify that tokens are NOT LLM tokens? I could imagine someone thinking this is the number of ~words we can send to the LLM per request. Maybe link to the token bucket algorithm for complete novices as well.
This is especially the case given that some providers rate limit based on, e.g., prompt tokens or total tokens per minute/hour.
Similarly, should we call requests_per_second something like tokens_per_second to make it clearer?
self.last: Optional[time.time] = None
self.check_every_n_seconds = check_every_n_seconds

def consume(self) -> bool:
Should this be private?
# at a given time.
self._consume_lock = threading.Lock()
# tokens per request sets how many tokens
self.tokens_per_request = tokens_per_request
Do we also need to check that tokens_per_request <= requests_per_second?
Given that we are treating requests_per_second as the max bucket size, we'd never consume tokens if that condition doesn't hold, right?
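The starvation scenario in this comment can be demonstrated directly. Below is an illustrative reconstruction of the consume path (not the PR's exact code): because the refill step clamps the bucket to `requests_per_second` tokens, a `tokens_per_request` larger than that can never be satisfied, so `consume()` returns `False` forever.

```python
import time


class Bucket:
    """Sketch of the consume path, to show the starvation concern."""

    def __init__(self, requests_per_second: float, tokens_per_request: int) -> None:
        self.requests_per_second = requests_per_second
        self.tokens_per_request = tokens_per_request
        self.available_tokens = 0.0
        self.last = None

    def consume(self) -> bool:
        now = time.monotonic()
        if self.last is None:
            self.last = now
        # Refill, then clamp to the max bucket size (requests_per_second).
        self.available_tokens += (now - self.last) * self.requests_per_second
        self.last = now
        self.available_tokens = min(self.available_tokens, self.requests_per_second)
        # If tokens_per_request > requests_per_second, this branch is
        # unreachable: the bucket can never hold enough tokens.
        if self.available_tokens >= self.tokens_per_request:
            self.available_tokens -= self.tokens_per_request
            return True
        return False
```

This suggests validating `tokens_per_request <= requests_per_second` in `__init__` so misconfiguration fails loudly instead of deadlocking callers.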
    return input

return (
    RunnableLambda(_rate_limited_passthrough).with_config({"name": "Rate Limit"})
You could also just make the function name and docstring prettier so the default name is nice; the function is rendered in the metadata.
This PR adds a simple rate limiter based on a token bucket.
I would love to extend RunnableBinding with this; we just need to make sure there's no funny async-vs.-threading business.
This should be sufficient for benchmarking for now.
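Pulling the reviewed pieces together, here is a self-contained sketch of a blocking token-bucket limiter. Parameter names mirror the diff (`requests_per_second`, `tokens_per_request`, `check_every_n_seconds`, `_consume_lock`); the method bodies and the `wait()` helper are a reconstruction under those assumptions, not the merged implementation.

```python
import threading
import time


class RateLimiter:
    """Token-bucket rate limiter (illustrative reconstruction)."""

    def __init__(
        self,
        *,
        requests_per_second: float = 1,
        tokens_per_request: int = 1,
        check_every_n_seconds: float = 0.1,
    ) -> None:
        if requests_per_second < 1:
            raise ValueError("Rate must be at least 1 request per second")
        self.requests_per_second = requests_per_second
        self.tokens_per_request = tokens_per_request
        self.check_every_n_seconds = check_every_n_seconds
        self.available_tokens = 0.0
        self.last = None
        # Ensure only one thread updates the bucket at a given time.
        self._consume_lock = threading.Lock()

    def _consume(self) -> bool:
        """Try to take tokens_per_request tokens; never blocks."""
        with self._consume_lock:
            now = time.monotonic()
            if self.last is None:
                self.last = now
            self.available_tokens += (now - self.last) * self.requests_per_second
            self.last = now
            # requests_per_second is also the maximum bucket size.
            self.available_tokens = min(
                self.available_tokens, self.requests_per_second
            )
            if self.available_tokens >= self.tokens_per_request:
                self.available_tokens -= self.tokens_per_request
                return True
            return False

    def wait(self) -> None:
        """Block, polling every check_every_n_seconds, until a token frees up."""
        while not self._consume():
            time.sleep(self.check_every_n_seconds)
```

A caller would invoke `limiter.wait()` immediately before each request; polling (rather than a condition variable) keeps the threading story simple, at the cost of up to `check_every_n_seconds` of extra latency per acquisition.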