Time-Keeper

Agentic Loop Sustainability Micro-Benchmark - A focused benchmark for testing LLM agents' ability to maintain reliable tool-calling loops over extended periods (in a sliding context window).

Overview

Time-Keeper tests a single, critical capability: can an agent reliably loop forever?

The benchmark asks agents to set timers, wait for them to fire, log the time, and repeat. Simple in concept, but this test has revealed surprising differences in model capabilities—parameter count doesn't predict loop reliability.

Key findings so far:

  • Some smaller models (14B parameters) sustain loops better than larger ones (20B+)
  • A common failure mode: models "know what to do" but generate text about tool calls instead of making them
  • The benchmark distinguishes understanding the task from executing it reliably over time

Quick Start

Prerequisites

  • Zig (0.15.2 or later)
  • LLM Provider: Either Ollama or LM Studio running locally
  • Any ANSI-compatible terminal
  • Linux or macOS

Installation

# Using Ollama
ollama pull qwen3:30b
ollama serve

# Build and run
git clone https://github.com/humanjesse/time-keeper-micro-benchmark.git
cd time-keeper-micro-benchmark
zig build run

Use /config inside the TUI to set variables (or ask an agent to help you configure it).

Results

Model                                Loops  Duration  Status
qwen/qwen3-vl-8b                     10     ~11 min   Sustained
mistralai/ministral-3-14b-reasoning  10     ~13 min   Sustained
openai/gpt-oss-20b                   3      ~3 min    Limited
essentialai/rnj-1                    1      ~3 min    Limited
ibm/granite-4-h-tiny                 0      ~1 min    Limited

See RESULTS.md for full results and contribution guidelines.

What It Tests

Time-Keeper measures agentic loop sustainability—the ability to:

  1. Execute tool calls reliably (not just generate text about them)
  2. Respond to asynchronous timer notifications
  3. Maintain coherent behavior over hundreds/thousands of cycles
  4. Continue looping indefinitely without degradation

The Core Loop

(enter benchmark mode with /benchmark, followed by a message like "start"; exit with /benchmark again or /quit)

Agent: set_timer(label="cycle_1", duration=30000)
[30 seconds pass]
[TIMER FIRED: cycle_1] Your timer has expired.
Agent: current_time() -> logs the time
Agent: set_timer(label="cycle_2", duration=30000)
[repeat forever]

Potential Failure Modes

Failure Type                  Description
Hallucination without action  Model generates text claiming to call tools, but returns finish_reason: "stop" instead of "tool_calls"
Loop abandonment              Model stops after N iterations despite instructions to continue
Timing drift                  Model loses track of intended intervals
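
The first failure mode can be detected mechanically: both Ollama and LM Studio expose an OpenAI-compatible chat-completions API, so a turn either carries real tool_calls (with finish_reason "tool_calls") or only assistant text. The Python sketch below illustrates the distinction; it is not the harness's actual detection logic (Time-Keeper is written in Zig), and the content heuristic is only an example.

def classify_turn(choice: dict) -> str:
    """Classify one assistant turn from an OpenAI-compatible 'choices' entry."""
    message = choice.get("message", {})
    tool_calls = message.get("tool_calls") or []
    content = message.get("content") or ""

    if tool_calls and choice.get("finish_reason") == "tool_calls":
        return "tool_call"                 # real, executable tool call
    if "set_timer(" in content or "current_time(" in content:
        return "hallucinated_tool_call"    # talks about the tools but never calls them
    return "plain_text"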

Architecture

Time-Keeper provides:

  • Sliding context window - Configurable token budget with automatic pruning (see the sketch after this list)
  • Real-time timers - set_timer tool with true wall-clock delays
  • Event-driven notifications - System messages when timers fire
  • Memory tools - Scratchpad, key-value store, vector database (optional)
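
To illustrate the first point, sliding-window pruning simply evicts the oldest turns once the conversation exceeds the configured token budget. The Python code below is a minimal sketch of that idea, not the Zig implementation, and count_tokens is a crude stand-in for whatever tokenizer the harness actually uses.

def count_tokens(message: dict) -> int:
    # Crude stand-in: roughly 4 characters per token.
    return max(1, len(message.get("content") or "") // 4)

def prune_to_budget(messages: list[dict], budget: int) -> list[dict]:
    """Drop the oldest non-system turns until the conversation fits the token budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(map(count_tokens, system + rest)) > budget:
        rest.pop(0)  # evict the oldest turn first
    return system + rest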

Available Tools

Tool           Purpose
set_timer      Schedule a notification after N milliseconds
current_time   Get the current timestamp
kv_set/kv_get  Key-value store for structured data
scratchpad_*   Temporary notes and working memory
vector_*       Semantic search over stored information

Additional tools are carried over from localharness.
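
For reference, a set_timer declaration in the OpenAI-compatible tools format (accepted by both Ollama and LM Studio) might look like the sketch below. The parameter names and types are inferred from the set_timer(label=..., duration=...) call in the Core Loop example, not taken from the harness source, so treat the schema as an assumption.

# Hypothetical schema; field names inferred from the Core Loop example above.
SET_TIMER_TOOL = {
    "type": "function",
    "function": {
        "name": "set_timer",
        "description": "Schedule a notification after `duration` milliseconds.",
        "parameters": {
            "type": "object",
            "properties": {
                "label": {"type": "string", "description": "Name echoed back when the timer fires."},
                "duration": {"type": "integer", "description": "Delay in milliseconds."},
            },
            "required": ["label", "duration"],
        },
    },
}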

Configuration

Config: ~/.config/time-keeper/profiles/default.json

A default config is auto-generated on first run. Example:

{
  "provider": "ollama",
  "ollama_host": "http://localhost:11434",
  "model": "qwen3:30b"
}

Tip: Use /config inside the TUI for visual editing, or ask an LLM to help configure it.

CLI options: --model, --ollama-host, --help

Benchmarking Methodology

Pass/Fail Criteria

A model passes if it sustains 10+ cycles without:

  • Hallucinating tool calls (text output instead of actual tool_calls)
  • Abandoning the loop
  • Significant timing drift (>10% variance; one possible reading is sketched below)
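
The ">10% variance" criterion is not formally defined here; one plausible reading is the mean relative deviation of observed cycle intervals from the intended interval (30,000 ms in the Core Loop example). The Python sketch below uses that reading as an assumption, so the maintainers may compute drift differently.

def timing_drift(observed_ms: list[float], intended_ms: float = 30_000) -> float:
    """Mean relative deviation of observed cycle intervals from the intended interval."""
    if not observed_ms:
        return 0.0
    total_deviation = sum(abs(t - intended_ms) for t in observed_ms)
    return total_deviation / (len(observed_ms) * intended_ms)

# Cycles that fired after ~30.2 s, 31 s, and 34 s give a drift of about 5.8%,
# which stays under the 10% threshold.
print(f"{timing_drift([30_200, 31_000, 34_000]):.1%}")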

Running a Benchmark

  1. Start the benchmark with your target model
  2. Let it run for the desired duration/cycles
  3. Monitor for failure modes in the output
  4. Record: cycles completed, failure type (if any), average loop time

Contributing Results

We welcome community-contributed benchmark results! See RESULTS.md for submission guidelines.

Why This Matters

Most benchmarks test single-shot reasoning or knowledge retrieval. Time-Keeper tests something different: sustained agentic execution.

If you're building agents that need to:

  • Run background tasks reliably
  • Maintain long-running operations
  • Execute scheduled workflows

...then this benchmark measures a capability you care about.

Platform Support

Linux (tested on x86_64), macOS. Windows not supported.

Documentation

See docs/ for detailed documentation. Please reach out with any questions or concerns! Time-Keeper was retrofitted from localharness and is building toward a larger benchmark project.

License

MIT License


Time-Keeper is a micro-benchmark for evaluating agentic loop sustainability in LLM agents.
