Rate limits outside of the CLI #952

---

Is there a way to set a rate limit outside of the CLI that doesn't just involve `max_tokens`? I'm using an Azure endpoint, so I am using LiteLLM settings. Would I add it there, or would I need to add it to the LiteLLM config using keys, etc.?

---

Our rate limit calculations (https://github.com/Future-House/ldp/blob/v0.27.0/packages/lmi/src/lmi/llms.py#L398) don't hinge on `max_tokens`.
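(To illustrate the distinction: a TPM limiter can charge each request for the tokens it actually counts, rather than reserving the full `max_tokens` ceiling. A minimal sketch, assuming the `limits` package; the quota string, key names, and `try_spend` helper are illustrative, not paper-qa's actual API:)

```python
from limits import parse, storage, strategies

# Moving-window limiter backed by in-memory storage.
limiter = strategies.MovingWindowRateLimiter(storage.MemoryStorage())
tpm = parse("150000 per 1 minute")  # illustrative TPM quota

def try_spend(model: str, n_tokens: int) -> bool:
    # Weight this hit by the request's actual token count, not max_tokens.
    return limiter.hit(tpm, "client", model, cost=n_tokens)
```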

---

Edit: Thanks for the response, and to clarify, this is the error I'm getting:

```
litellm.exceptions.RateLimitError: litellm.RateLimitError: AzureException RateLimitError - Requests to the Embeddings_Create Operation under Azure OpenAI API version 2025-01-01-preview have exceeded call rate limit of your current AIServices S0 pricing tier. Please retry after 53 seconds.
```

My quota is 300,000 TPM / 1,500 RPM on the OpenAI endpoint and 150,000 TPM / 900 RPM on the embedding endpoint. Right now I'm running the initial load of 10k documents through the embedding endpoint (perhaps I should use a local one, as it's already hit 1.6M tokens). From the documentation, if you use the CLI or a default LLM there's a `rate_limit` setting. I'm using settings emulating some of the examples I've seen:
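(The original snippet wasn't captured; below is a representative sketch of that style of settings, assuming paper-qa's `Settings` with a LiteLLM `model_list` for an Azure deployment. The deployment names, key, and API base are placeholders, not the actual config:)

```python
from paperqa import Settings

settings = Settings(
    llm="azure/gpt-4o",  # placeholder Azure deployment
    llm_config={
        "model_list": [
            {
                "model_name": "azure/gpt-4o",
                "litellm_params": {
                    "model": "azure/gpt-4o",
                    "api_base": "https://<your-resource>.openai.azure.com/",
                    "api_key": "<your-azure-key>",
                    "api_version": "2025-01-01-preview",
                },
            }
        ]
    },
    embedding="azure/text-embedding-3-small",  # placeholder embedding deployment
)
```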
I have tried adding `rate_limit` as part of the pass-through, but it's not accepted, and it's not listed in the Settings cheatsheet: https://github.com/Future-House/paper-qa?tab=readme-ov-file#settings-cheatsheet. My question is whether it's possible to add it to the model config above (perhaps I should switch away from a dict, but I was having a number of issues, so I stuck with it), or if I need to address it with the LiteLLM config or elsewhere.

---

You can configure rate limits outside of the CLI in paper-qa by editing the configuration files, for example:

```json
"llm_config": {
    "rate_limit": {
        "gpt-4o": "800000 per 1 minute",
        "gpt-4-turbo": "600000 per 1 minute"
    }
}
```

Currently, the built-in rate limiter in paper-qa only supports tokens per minute (TPM) and not requests per minute (RPM). If you need RPM-based rate limiting, you would need to implement a custom solution similar to the existing TPM logic.

Regarding LiteLLM and Azure endpoints:
In summary:

- If you need RPM-based or per-key rate limiting, you will need to implement or extend the current logic.
- If you want to see how LiteLLM parameters are set in configs, you can look at this example.
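(If you do need an RPM cap before such support exists, a minimal sketch of a client-side gate, assuming the `aiolimiter` package; `embed_fn` and the 900 RPM figure are placeholders based on the embedding quota mentioned above:)

```python
from aiolimiter import AsyncLimiter

# At most 900 acquisitions per rolling 60-second window (the embedding RPM quota).
embedding_rpm = AsyncLimiter(max_rate=900, time_period=60)

async def embed_with_rpm_cap(embed_fn, texts: list[str]):
    """Wrap an async embedding call so it waits when the RPM budget is exhausted."""
    async with embedding_rpm:  # blocks until a slot is free
        return await embed_fn(texts)
```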
Sort of. For now, just to test, I added it to the `RATE_CONFIG` in `rate_limiter.py`, and I reduced my docs to ~100 for testing (which, wow, still consumed a large number of embedding tokens). I have a number of other questions, but I will post them in different threads.
In case anyone else needs this, this is what I added to `RATE_CONFIG`:
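(The exact snippet wasn't captured here; a plausible shape for such an entry, assuming the `"<tokens> per 1 minute"` string format shown earlier in the thread. The model key and figure are illustrative, not the verbatim addition:)

```python
# Hypothetical entry in lmi's rate_limiter.py; match the key format of the
# existing RATE_CONFIG entries in your installed version.
RATE_CONFIG["azure/text-embedding-3-small"] = "150000 per 1 minute"
```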
Then added to …
That gave me an answer, so on to figuring out how to parse `AnswerResponse` and reduce my token usage!
Thank you for your…