Rate limits outside of the CLI #952

palamangelus · 2025-05-21T18:20:41Z

palamangelus
May 21, 2025

Is there a way to use the rate limit not in the CLI that doesn't just involve max_tokens? I'm using an azure endpoint, so I am using LiteLLM settings - would I add it there or would I need to add it to the LiteLLM config using keys, etc?

Answered by palamangelus

May 22, 2025

Sort of - for now just to test, I added it to the RATE_CONFIG in rate_limiter.py, and I reduced my docs to ~100 for testing (which wow, still consumed a large number of embedding tokens). I have a number of other questions, but will post them in different threads.

In case anyone else needs this, this is what I added to RATE_CONFIG:

AZURE_HOST = "[yourprojectendpointbase].openai.azure.com/"
AZURE_BASE_URL = f"https://{AZURE_HOST}"

Then added to

RATE_CONFIG: dict[tuple[str, str | MatchAllInputs], RateLimitItem] = {
    ("get", AZURE_HOST): RateLimitItemPerMinute(900, 1),

That gave me an answer, so on to figuring out how to parse AnswerResponse and reduce my token usage!

Thank you for you…

View full answer

jamesbraza · 2025-05-21T18:54:54Z

jamesbraza
May 21, 2025
Maintainer

Our rate limit calculations: https://github.com/Future-House/ldp/blob/v0.27.0/packages/lmi/src/lmi/llms.py#L398

Don't hinge on max_tokens, can you clarify what you mean? I am not following your question yet

0 replies

palamangelus · 2025-05-21T19:12:05Z

palamangelus
May 21, 2025
Author

Edit:

Thanks for the response, and to clarify:

This is the error I'm getting:

litellm.exceptions.RateLimitError: litellm.RateLimitError: AzureException RateLimitError - Requests to the Embeddings_Create Operation under Azure OpenAI API version 2025-01-01-preview have exceeded call rate limit of your current AIServices S0 pricing tier. Please retry after 53 seconds.

My rate limit for the endpoint is 300,000/1500 for the openai endpoint, and 150,000/900 for the embedding endpoint. Right now I'm running the initial load of 10k documents, using the embedding endpoint (perhaps I should use a local one as it's already hit 1.6m tokens).

From the documentation, If you use the CLI or a default LLM - there's a rate_limit setting. I'm using settings emulating some of the examples I've seen:

azure_llm_config = dict(
    model_list=[
        dict(
            model_name="model",
            litellm_params=dict(
                model=AZURE_MODEL,
                api_key=AZURE_API_KEY,
                api_base=AZURE_API_BASE,
                api_version=AZURE_API_VERSION,
            ),
        )
    ]
)

And have tried adding rate_limit as part of the pass through, but it's not accepted, and it's not listed in the Settings cheatsheet: https://github.com/Future-House/paper-qa?tab=readme-ov-file#settings-cheatsheet.

My question is if it's possible to add to the model above (perhaps I should switch out of a dict, but I was having a number of issues so just stuck with it), or if I need to address it with the LiteLLM config or elsewhere.

5 replies

jamesbraza May 21, 2025
Maintainer

The rate_limit key is actually not a PaperQA thing, it gets passed all the way to lmi: https://github.com/Future-House/ldp/blob/v0.27.0/packages/lmi/src/lmi/llms.py#L636-L643

My question is if it's possible to add to the model above

Yes the rate limits work with this config: https://github.com/Future-House/ldp/blob/v0.27.0/packages/lmi/src/lmi/rate_limiter.py#L57-L69

If you want to extend it, in your code (1) import lmi.rate_limiter.RATE_CONFIG then (2) add an entry for your model

And have tried adding rate_limit as part of the pass through

Can you elaborate on what you tried?

palamangelus May 21, 2025
Author

Not sure if you saw my edit above, but I broke it out of a dict and I'm not erroring out on what I'm passing. I am however, still hitting the embed rate limit, even though I've specified the rate limit in the llm embed config & settings object as such:


azure_embedding_config = {
    "model_list": [
        {
            "model_name": "model_embedding",
            "litellm_params": {
                "model": AZURE_EMBEDDING_MODEL,
                "api_base": AZURE_API_BASE,
		"api_key": AZURE_API_KEY,
		"api_version": AZURE_API_VERSION,
            },
	    "rate_limit": "100000 per 1 minute",
        }
    ]
}

azure_settings = Settings(
    llm=model,
    llm_config=azure_llm_config,
    summary_llm=model,
    summary_llm_config=azure_summary_config,
    embedding=model_embedding,
    embedding_config=azure_embedding_config,
    agent={
        "agent_llm": model,
        "agent_llm_config": azure_llm_config,
        "agent_type": "ToolSelector",
    },
    parsing=ParsingSettings(use_doc_details=False, chunk_size=2000),
    paper_directory=source_folder,
    index_directory=index_folder,
)

jamesbraza May 21, 2025
Maintainer

Yeah it's because Azure is not in https://github.com/Future-House/ldp/blob/v0.27.0/packages/lmi/src/lmi/rate_limiter.py#L57-L69

Did you do the part where you manually add Azure models into RATE_CONFIG at the top of your script

palamangelus May 22, 2025
Author

Sort of - for now just to test, I added it to the RATE_CONFIG in rate_limiter.py, and I reduced my docs to ~100 for testing (which wow, still consumed a large number of embedding tokens). I have a number of other questions, but will post them in different threads.

In case anyone else needs this, this is what I added to RATE_CONFIG:

AZURE_HOST = "[yourprojectendpointbase].openai.azure.com/"
AZURE_BASE_URL = f"https://{AZURE_HOST}"

Then added to

RATE_CONFIG: dict[tuple[str, str | MatchAllInputs], RateLimitItem] = {
    ("get", AZURE_HOST): RateLimitItemPerMinute(900, 1),

That gave me an answer, so on to figuring out how to parse AnswerResponse and reduce my token usage!

Thank you for your quick responses @jamesbraza. Much appreciated!

Answer selected by jamesbraza

jamesbraza May 22, 2025
Maintainer

Note lmi has an outstanding issue to make good docs on this: Future-House/ldp#239

We are a small org and our docs are lagging 😈

@dosu-bot · 2025-05-22T20:12:13Z

dosubot[bot]
bot May 22, 2025

You can configure rate limits outside of the CLI in paper-qa by editing the configuration files (such as tier1_limits.json, tier2_limits.json, etc.). These files allow you to set rate limits for different models under the llm_config and summary_llm_config sections, using the rate_limit key. The limits are typically specified as tokens per minute (TPM) or requests per minute (RPM), for example:

"llm_config": {
  "rate_limit": {
    "gpt-4o": "800000 per 1 minute",
    "gpt-4-turbo": "600000 per 1 minute"
  }
}

See example

Currently, the built-in rate limiter in paper-qa only supports tokens per minute (TPM) and not requests per minute (RPM). If you need RPM-based rate limiting, you would need to implement a custom solution similar to the existing rate_limiter decorator, but modified to track requests instead of tokens. This is discussed in this issue.

Regarding LiteLLM and Azure endpoints:

The configuration files support specifying LiteLLM parameters under litellm_params, but there is no specific documentation or example in the repo for Azure endpoint rate limiting via LiteLLM settings.
You can add rate limits in the config files as shown above, but if you need more granular control (such as per-key or per-endpoint limits), you would likely need to extend the configuration and possibly the codebase.

In summary:

You can set rate limits outside the CLI via the config files.
These limits are typically TPM, not RPM.
For Azure endpoints and LiteLLM, you can use the config files, but there is no built-in support for advanced rate limiting per key or endpoint—customization would be required.

If you need RPM-based or per-key rate limiting, you will need to implement or extend the current logic. If you want to see how LiteLLM parameters are set in configs, you can look at this example.

To continue talking to Dosu, mention @dosu-bot.

^{Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Bug Report | Other}

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Rate limits outside of the CLI #952

Uh oh!

{{title}}

Uh oh!

Replies: 3 comments 5 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Rate limits outside of the CLI #952

Uh oh!

palamangelus May 21, 2025

Replies: 3 comments · 5 replies

Uh oh!

jamesbraza May 21, 2025 Maintainer

Uh oh!

Uh oh!

palamangelus May 21, 2025 Author

Uh oh!

jamesbraza May 21, 2025 Maintainer

Uh oh!

palamangelus May 21, 2025 Author

Uh oh!

jamesbraza May 21, 2025 Maintainer

Uh oh!

palamangelus May 22, 2025 Author

Uh oh!

jamesbraza May 22, 2025 Maintainer

Uh oh!

dosubot[bot] bot May 22, 2025

palamangelus
May 21, 2025

Replies: 3 comments 5 replies

jamesbraza
May 21, 2025
Maintainer

palamangelus
May 21, 2025
Author

jamesbraza May 21, 2025
Maintainer

palamangelus May 21, 2025
Author

jamesbraza May 21, 2025
Maintainer

palamangelus May 22, 2025
Author

jamesbraza May 22, 2025
Maintainer

dosubot[bot]
bot May 22, 2025