Time-Keeper

Agentic Loop Sustainability Micro-Benchmark - A focused benchmark for testing LLM agents' ability to maintain reliable tool-calling loops over extended periods (in a sliding context window).

Overview

Time-Keeper tests a single, critical capability: can an agent reliably loop forever?

The benchmark asks agents to set timers, wait for them to fire, log the time, and repeat. Simple in concept, but this test has revealed surprising differences in model capabilities—parameter count doesn't predict loop reliability.

Key findings so far:

  • Some smaller models (14B parameters) sustain loops better than larger ones (20B+)
  • A common failure mode: models "know what to do" but generate text about tool calls instead of making them
  • The benchmark distinguishes understanding the task from executing it reliably over time

Quick Start

Prerequisites

  • Zig (0.15.2 or later)
  • LLM Provider: Either Ollama or LM Studio running locally
  • Any ANSI-compatible terminal
  • Linux or macOS

Installation

# Using Ollama
ollama pull qwen3:30b
ollama serve

# Build and run
git clone https://github.com/humanjesse/time-keeper-micro-benchmark.git
cd time-keeper-micro-benchmark
zig build run

Use /config inside the TUI to set variables (or ask an agent to help you configure it).

Results

Model                                Loops  Duration  Status
qwen/qwen3-vl-8b                     10     ~11 min   Sustained
mistralai/ministral-3-14b-reasoning  10     ~13 min   Sustained
openai/gpt-oss-20b                   3      ~3 min    Limited
essentialai/rnj-1                    1      ~3 min    Limited
ibm/granite-4-h-tiny                 0      ~1 min    Limited

See RESULTS.md for full results and contribution guidelines.

What It Tests

Time-Keeper measures agentic loop sustainability—the ability to:

  1. Execute tool calls reliably (not just generate text about them)
  2. Respond to asynchronous timer notifications
  3. Maintain coherent behavior over hundreds/thousands of cycles
  4. Continue looping indefinitely without degradation

The Core Loop

(enter benchmark mode with /benchmark, followed by a message like "start"; exit with /benchmark again or /quit)

Agent: set_timer(label="cycle_1", duration=30000)
[30 seconds pass]
[TIMER FIRED: cycle_1] Your timer has expired.
Agent: current_time() -> logs the time
Agent: set_timer(label="cycle_2", duration=30000)
[repeat forever]

Potential Failure Modes

Failure Type                  Description
Hallucination without action  Model generates text claiming to call tools, but returns finish_reason: "stop" instead of "tool_calls"
Loop abandonment              Model stops after N iterations despite instructions to continue
Timing drift                  Model loses track of intended intervals
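
The first failure mode can be detected mechanically: both Ollama and LM Studio expose an OpenAI-compatible chat-completions API, so a turn either carries real tool_calls (with finish_reason "tool_calls") or only assistant text. The Python sketch below illustrates the distinction; it is not the harness's actual detection logic (Time-Keeper is written in Zig), and the content heuristic is only an example.

def classify_turn(choice: dict) -> str:
    """Classify one assistant turn from an OpenAI-compatible 'choices' entry."""
    message = choice.get("message", {})
    tool_calls = message.get("tool_calls") or []
    content = message.get("content") or ""

    if tool_calls and choice.get("finish_reason") == "tool_calls":
        return "tool_call"                 # real, executable tool call
    if "set_timer(" in content or "current_time(" in content:
        return "hallucinated_tool_call"    # talks about the tools but never calls them
    return "plain_text"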

Architecture

Time-Keeper provides:

  • Sliding context window - Configurable token budget with automatic pruning (see the sketch after this list)
  • Real-time timers - set_timer tool with true wall-clock delays
  • Event-driven notifications - System messages when timers fire
  • Memory tools - Scratchpad, key-value store, vector database (optional)
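
To illustrate the first point, sliding-window pruning simply evicts the oldest turns once the conversation exceeds the configured token budget. The Python code below is a minimal sketch of that idea, not the Zig implementation, and count_tokens is a crude stand-in for whatever tokenizer the harness actually uses.

def count_tokens(message: dict) -> int:
    # Crude stand-in: roughly 4 characters per token.
    return max(1, len(message.get("content") or "") // 4)

def prune_to_budget(messages: list[dict], budget: int) -> list[dict]:
    """Drop the oldest non-system turns until the conversation fits the token budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(map(count_tokens, system + rest)) > budget:
        rest.pop(0)  # evict the oldest turn first
    return system + rest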

Available Tools

Tool           Purpose
set_timer      Schedule a notification after N milliseconds
current_time   Get the current timestamp
kv_set/kv_get  Key-value store for structured data
scratchpad_*   Temporary notes and working memory
vector_*       Semantic search over stored information

Additional tools are carried over from localharness.
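
For reference, a set_timer declaration in the OpenAI-compatible tools format (accepted by both Ollama and LM Studio) might look like the sketch below. The parameter names and types are inferred from the set_timer(label=..., duration=...) call in the Core Loop example, not taken from the harness source, so treat the schema as an assumption.

# Hypothetical schema; field names inferred from the Core Loop example above.
SET_TIMER_TOOL = {
    "type": "function",
    "function": {
        "name": "set_timer",
        "description": "Schedule a notification after `duration` milliseconds.",
        "parameters": {
            "type": "object",
            "properties": {
                "label": {"type": "string", "description": "Name echoed back when the timer fires."},
                "duration": {"type": "integer", "description": "Delay in milliseconds."},
            },
            "required": ["label", "duration"],
        },
    },
}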

Configuration

Config: ~/.config/time-keeper/profiles/default.json

A default config is auto-generated on first run. Example:

{
  "provider": "ollama",
  "ollama_host": "http://localhost:11434",
  "model": "qwen3:30b"
}

Tip: Use /config inside the TUI for visual editing, or ask an LLM to help configure it.

CLI options: --model, --ollama-host, --help

Benchmarking Methodology

Pass/Fail Criteria

A model passes if it sustains 10+ cycles without:

  • Hallucinating tool calls (text output instead of actual tool_calls)
  • Abandoning the loop
  • Significant timing drift (>10% variance; one possible reading is sketched below)
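
The ">10% variance" criterion is not formally defined here; one plausible reading is the mean relative deviation of observed cycle intervals from the intended interval (30,000 ms in the Core Loop example). The Python sketch below uses that reading as an assumption, so the maintainers may compute drift differently.

def timing_drift(observed_ms: list[float], intended_ms: float = 30_000) -> float:
    """Mean relative deviation of observed cycle intervals from the intended interval."""
    if not observed_ms:
        return 0.0
    total_deviation = sum(abs(t - intended_ms) for t in observed_ms)
    return total_deviation / (len(observed_ms) * intended_ms)

# Cycles that fired after ~30.2 s, 31 s, and 34 s give a drift of about 5.8%,
# which stays under the 10% threshold.
print(f"{timing_drift([30_200, 31_000, 34_000]):.1%}")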

Running a Benchmark

  1. Start the benchmark with your target model
  2. Let it run for the desired duration/cycles
  3. Monitor for failure modes in the output
  4. Record: cycles completed, failure type (if any), average loop time

Contributing Results

We welcome community-contributed benchmark results! See RESULTS.md for submission guidelines.

Why This Matters

Most benchmarks test single-shot reasoning or knowledge retrieval. Time-Keeper tests something different: sustained agentic execution.

If you're building agents that need to:

  • Run background tasks reliably
  • Maintain long-running operations
  • Execute scheduled workflows

...then this benchmark measures a capability you care about.

Platform Support

Linux (tested on x86_64), macOS. Windows not supported.

Documentation

See docs/ for detailed documentation. Please reach out with any questions or concerns! Time-Keeper was retrofitted from localharness and is building toward a larger benchmark project.

License

MIT License


Time-Keeper is a micro-benchmark for evaluating agentic loop sustainability in LLM agents.
