Description
I’m using ART with a local Ollama server as the inference backend (for both the agent model and judge models). I’ve configured my Ollama model with a context window well above 8192 tokens (e.g. ctx: 16384) and adjusted num_predict accordingly.
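For reference, the Ollama side is configured roughly like this (a minimal sketch; the model tag is just an example, and num_ctx is the Modelfile name for the context-window setting I call ctx above):

```
# Modelfile sketch: raise the context window well above 8192
FROM qwen2.5:14b

# context window (the "ctx" value mentioned above)
PARAMETER num_ctx 16384
# cap on tokens generated per response
PARAMETER num_predict 2048
```

The model is built with ollama create <name> -f Modelfile and served by a plain local ollama serve.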
However, in many runs I still get errors like:
token count exceeded 8192
This happens even though:
- The Ollama model is configured with ctx > 8192 (for example 16384).
- I'm explicitly passing the correct base URL pointing to my local Ollama server (see the sketch below) for:
  - The agent model (used by init_chat_model)
  - The judge model (RULER)
  - Any other inference calls
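To make the setup concrete, this is roughly how both models are pointed at the local server (a minimal sketch: the model names, the localhost URL, and the extra_litellm_params keyword reflect my setup and my reading of the APIs, so treat them as assumptions rather than documented usage):

```python
# Sketch of how both the agent model and the RULER judge target local Ollama.
from langchain.chat_models import init_chat_model
from art.rewards import ruler_score_group

OLLAMA_OPENAI_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint

# Agent model: routed through the OpenAI-compatible provider with a custom base URL.
agent_llm = init_chat_model(
    "qwen2.5:14b",                 # example model tag
    model_provider="openai",
    base_url=OLLAMA_OPENAI_URL,
    api_key="ollama",              # Ollama ignores the key, but one must be set
)

# Judge model: RULER scores a finished trajectory group against the same backend.
# How the base URL is actually forwarded to the judge is exactly the part
# I'm unsure about (see the questions below).
async def score(group):
    return await ruler_score_group(
        group,
        "openai/qwen2.5:14b",
        extra_litellm_params={"api_base": OLLAMA_OPENAI_URL, "api_key": "ollama"},
    )
```

Rollouts run the LangGraph agent on agent_llm and hand the finished trajectory groups to score().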
This suggests there is a hardcoded or implicit max token limit of 8192 somewhere in ART/RULER, or in how token counts are computed, independent of the actual model’s context window.
What I expect
- ART should respect the context window of the underlying model, or the configured ctx, when running through Ollama.
- If a hard limit exists (e.g. 8192), it should be:
  - Documented and configurable; or
  - Derived from the model's metadata, not hard-coded (see the sketch below).
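On the "derived from the model's metadata" point: as far as I can tell, Ollama already reports a model's context length through its /api/show endpoint, so the limit could in principle be read from the backend instead of assumed. A rough sketch (the exact model_info key is architecture-prefixed, e.g. qwen2.context_length, so this just scans for the suffix):

```python
# Rough sketch: read a model's context length from Ollama's /api/show
# endpoint instead of assuming 8192.
import requests

def ollama_context_length(model: str, host: str = "http://localhost:11434") -> int | None:
    # Older Ollama versions may expect {"name": model} instead of {"model": model}.
    resp = requests.post(f"{host}/api/show", json={"model": model}, timeout=10)
    resp.raise_for_status()
    info = resp.json().get("model_info", {})
    for key, value in info.items():
        if key.endswith(".context_length"):
            return int(value)
    return None

print(ollama_context_length("qwen2.5:14b"))
```

If the num_ctx override from the Modelfile matters too, I believe it shows up in the "parameters" field of the same response.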
What actually happens
- Even with a model and server configured to support > 8k context, I regularly get "token count exceeded 8192" errors.
- This happens when:
  - Running rollouts with a LangGraph agent via init_chat_model
  - Running RULER scoring with the same Ollama backend
Environment
- Backend: Local Ollama server
- Model: Qwen / other Ollama-hosted model (with ctx > 8192)
- ART: latest version (as of date of issue)
- Using ART's LangGraph integration (init_chat_model) and RULER scoring
Questions / Requests
- Is there an internal default limit of 8192 tokens that’s applied regardless of the model’s context?
- Can you expose this limit via configuration, or derive it from the model / backend rather than hardcoding?
- Any guidance on how to set up ART/RULER so that it fully respects Ollama's larger ctx?