Agentic Loop Sustainability Micro-Benchmark - A focused benchmark for testing LLM agents' ability to maintain reliable tool-calling loops over extended periods (in a sliding context window).
Time-Keeper tests a single, critical capability: can an agent reliably loop forever?
The benchmark asks agents to set timers, wait for them to fire, log the time, and repeat. Simple in concept, but this test has revealed surprising differences in model capabilities—parameter count doesn't predict loop reliability.
Key findings so far:
- Some smaller models (14B params) sustain loops better than larger ones (20B+)
- A failure mode where models "know what to do" but generate text instead of tool calls
- It distinguishes a model that understands the task from one that can execute it reliably over time
- Zig (0.15.2 or later)
- LLM Provider: Either Ollama or LM Studio running locally
- Any ANSI-compatible terminal
- Linux / Mac compatible
```sh
# Using Ollama
ollama pull qwen3:30b
ollama serve

# Build and run
git clone https://github.com/humanjesse/time-keeper-micro-benchmark.git
cd time-keeper-micro-benchmark
zig build run
```
Use /config inside the TUI to set variables (or ask an agent to help you configure it).
| Model | Loops | Duration | Status |
|---|---|---|---|
| qwen/qwen3-vl-8b | 10 | ~11 min | Sustained |
| mistralai/ministral-3-14b-reasoning | 10 | ~13 min | Sustained |
| openai/gpt-oss-20b | 3 | ~3 min | Limited |
| essentialai/rnj-1 | 1 | ~3 min | Limited |
| ibm/granite-4-h-tiny | 0 | ~1 min | Limited |
See RESULTS.md for full results and contribution guidelines.
Time-Keeper measures agentic loop sustainability—the ability to:
- Execute tool calls reliably (not just generate text about them)
- Respond to asynchronous timer notifications
- Maintain coherent behavior over hundreds/thousands of cycles
- Continue looping indefinitely without degradation
(enter benchmark mode with /benchmark, followed by a message like "start"; exit with /benchmark again or /quit)
```
Agent: set_timer(label="cycle_1", duration=30000)
[30 seconds pass]
[TIMER FIRED: cycle_1] Your timer has expired.
Agent: current_time() -> logs the time
Agent: set_timer(label="cycle_2", duration=30000)
[repeat forever]
```
| Failure Type | Description |
|---|---|
| Hallucination without action | Model generates text claiming to call tools, but returns finish_reason: "stop" instead of "tool_calls" (see the example below the table) |
| Loop abandonment | Model stops after N iterations despite instructions to continue |
| Timing drift | Model loses track of intended intervals |
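The first failure mode is visible directly in the API response. In the OpenAI-compatible chat-completion format that both Ollama and LM Studio expose, a hallucinated call arrives as plain text (abbreviated, illustrative fields):

```json
{ "message": { "content": "I'll set the timer for cycle_1 now." }, "finish_reason": "stop" }
```

whereas a real call carries the arguments in tool_calls and finishes accordingly:

```json
{
  "message": {
    "content": null,
    "tool_calls": [
      { "function": { "name": "set_timer", "arguments": "{\"label\":\"cycle_1\",\"duration\":30000}" } }
    ]
  },
  "finish_reason": "tool_calls"
}
```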
Time-Keeper provides:
- Sliding context window - Configurable token budget with automatic pruning
- Real-time timers - `set_timer` tool with true wall-clock delays
- Event-driven notifications - System messages when timers fire
- Memory tools - Scratchpad, key-value store, vector database (optional)
| Tool | Purpose |
|---|---|
| `set_timer` | Schedule a notification after N milliseconds |
| `current_time` | Get the current timestamp |
| `kv_set` / `kv_get` | Key-value store for structured data |
| `scratchpad_*` | Temporary notes and working memory |
| `vector_*` | Semantic search over stored information |
| Additional tools | Holdovers from localharness |
Config: ~/.config/time-keeper/profiles/default.json
A default config is auto-generated on first run. Example:
```json
{
  "provider": "ollama",
  "ollama_host": "http://localhost:11434",
  "model": "qwen3:30b"
}
```
Tip: Use /config inside the TUI for visual editing, or ask an LLM to help configure it.
CLI options: `--model`, `--ollama-host`, `--help`
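Assuming the build script forwards extra arguments to the binary (a common Zig pattern; the exact invocation may differ, see `--help`), a run against a local Ollama might look like:

```sh
# pick the model and point at the local Ollama server for this run
zig build run -- --model qwen3:30b --ollama-host http://localhost:11434
```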
A model passes if it sustains 10+ cycles without:
- Hallucinating tool calls (text output instead of actual tool_calls)
- Abandoning the loop
- Significant timing drift (>10% variance)
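For example, with the 30-second timers used in the protocol above, an average cycle interval outside roughly 27-33 seconds (more than 10% off) counts as significant drift.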
- Start the benchmark with your target model
- Let it run for the desired duration/cycles
- Monitor for failure modes in the output
- Record: cycles completed, failure type (if any), average loop time
We welcome community-contributed benchmark results! See RESULTS.md for submission guidelines.
Most benchmarks test single-shot reasoning or knowledge retrieval. Time-Keeper tests something different: sustained agentic execution.
If you're building agents that need to:
- Run background tasks reliably
- Maintain long-running operations
- Execute scheduled workflows
...then this benchmark measures a capability you care about.
Linux (tested on x86_64), macOS. Windows not supported.
See docs/ for detailed documentation. Please reach out with any questions or concerns! Lastly, this project is retrofitted from localharness and is building toward a larger benchmark project.
MIT License
Time-Keeper is a micro-benchmark for evaluating agentic loop sustainability in LLM agents.