Give your agent institutional memory. Drop-in retrieval of validated execution traces for any LLM agent framework.
Your agent makes the same mistakes repeatedly because it has no memory of what worked before. behavioral-memory fixes this — it stores validated execution traces (task → tool chain mappings) and retrieves semantically similar ones at query time, so your agent learns from past successes instead of starting from scratch every time.
Based on: "Behavioral Memory for Tool Orchestration: Semantic Retrieval of Validated Execution Traces in MCP-Based Agent Systems" (IEEE, 2025)
pip install behavioral-memory

The library is framework-agnostic. You bring your own LLM and your own agent; behavioral-memory handles the memory layer.
from behavioral_memory import PlanEngine, ToolRegistry, InMemoryTraceStore
# 1. Choose your LLM (any LangChain-compatible model)
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
llm = ChatGoogleGenerativeAI(model="gemini-2.5-pro", temperature=0)
embeddings = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")
# 2. Create a memory store (no database needed)
store = InMemoryTraceStore(embeddings=embeddings)
# 3. Generate plans with behavioral memory
engine = PlanEngine(llm=llm, store=store)
plan = engine.generate(query="Get revenue data and email a report")

That's it. Your agent now has memory.
# Alternative: OpenAI models
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
llm = ChatOpenAI(model="gpt-4o", temperature=0)
store = InMemoryTraceStore(embeddings=OpenAIEmbeddings())
engine = PlanEngine(llm=llm, store=store)

# Alternative: local models via Ollama
from langchain_ollama import ChatOllama, OllamaEmbeddings
llm = ChatOllama(model="llama3")
store = InMemoryTraceStore(embeddings=OllamaEmbeddings(model="nomic-embed-text"))
engine = PlanEngine(llm=llm, store=store)

# Alternative: persistent storage in PostgreSQL (pgvector)
# Requires: pip install behavioral-memory[postgres]
from behavioral_memory import TraceStore
store = TraceStore(
    embeddings=embeddings,
    connection_url="postgresql+psycopg://user:pass@localhost/behavioral_memory",
)

Before behavioral memory, your agent sees only the task and tool schemas, so it has to figure out orchestration from scratch every time. With behavioral memory, it retrieves validated examples of similar tasks that worked before.
Your Agent's Query: "Build a revenue analysis pipeline"
             │
┌────────────┴────────────┐
│    BEHAVIORAL MEMORY    │
│                         │
│  1. Retrieve top-k      │ ← finds 3 similar validated traces
│     similar traces      │   from past successful executions
│                         │
│  2. Merge with tool     │ ← current MCP tool schemas
│     schemas             │
│                         │
│  3. Generate plan       │ ← LLM sees examples + schemas + query
└────────────┬────────────┘
             │
             ▼
   Better execution plan
(right tools, right params, right order)
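The retrieval step in the diagram can be pictured as a top-k search by cosine similarity. The following is a minimal sketch in plain Python, not the library's implementation: the function names `cosine_similarity` and `retrieve_top_k` and the three-dimensional toy vectors are made up for illustration, and a real store would use embeddings from the configured embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, stored, k=3):
    """Return the k (score, task) pairs most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, vec), task) for task, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "embeddings" (hypothetical values, chosen only to make the ranking visible).
stored_traces = [
    ("Calculate quarterly revenue", [0.9, 0.1, 0.0]),
    ("Email a weekly report", [0.1, 0.9, 0.0]),
    ("Search internal documentation", [0.0, 0.1, 0.9]),
    ("Summarize revenue by region", [0.8, 0.2, 0.1]),
]
query_embedding = [1.0, 0.0, 0.0]  # stands in for the embedded user query

top = retrieve_top_k(query_embedding, stored_traces, k=3)
for score, task in top:
    print(f"{score:.3f}  {task}")
```

The two revenue-related traces rank first because their vectors point in nearly the same direction as the query vector.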
from behavioral_memory import ExecutionTrace, ToolCall
trace = ExecutionTrace(
    task_description="Calculate quarterly revenue",
    tool_chain=[
        ToolCall(step_id="s1", tool_name="query_database",
                 parameters={"query": "SELECT SUM(quantity * unit_price) FROM order_items"}),
        ToolCall(step_id="s2", tool_name="generate_report",
                 parameters={"source_step": "s1", "format": "markdown_table"}),
    ],
    source="seed",
)
store.add(trace)

The PlanEngine needs to know what tools your agent has:
from behavioral_memory import ToolSchema, ToolRegistry
schema = ToolSchema(
    name="search_docs",
    description="Search internal documentation",
    parameters_schema={
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
)
registry = ToolRegistry()
registry.register(schema)
engine = PlanEngine(llm=llm, store=store, registry=registry)

Or load schemas dynamically from an MCP server:

from behavioral_memory.tools.mcp_client import fetch_mcp_schemas
# `await` only works inside an async function (or a REPL/notebook with top-level await)
schemas = await fetch_mcp_schemas("http://localhost:3000/sse")
registry.register_many(schemas)

Don't let bad traces into memory. The gatekeeper runs three checks before accepting a trace:
from behavioral_memory import GatekeeperPipeline
gatekeeper = GatekeeperPipeline(store=store, registry=registry)
result = gatekeeper.submit(trace) # schema check → sandbox → dedup → store
print(result.accepted)  # True if all gates passed

Traces logged to Langfuse can be reviewed by domain experts. Positively scored traces automatically flow back into memory through the gatekeeper:
from behavioral_memory import FeedbackPoller, AnnotationHandler
# `settings` is assumed to be a configuration object you have already created (not shown here)
poller = FeedbackPoller(settings=settings)
handler = AnnotationHandler(poller=poller, gatekeeper=gatekeeper)
handler.run_loop()  # continuously polls → validates → stores

If you don't use LangChain, you can use the lower-level primitives directly:
from behavioral_memory.planner.prompt import SYSTEM_PROMPT, build_prompt
from behavioral_memory.planner.postprocess import postprocess_plan
# Build the prompt yourself
prompt = build_prompt(query="Get revenue data", traces=my_traces, tool_schemas=my_schemas)
# Call your own LLM
raw_output = your_llm.chat(system=SYSTEM_PROMPT, user=prompt)
# Parse the JSON plan
steps = postprocess_plan(raw_output)  # returns list[ToolCall]

| Store | Persistence | Multi-user | Best for |
|---|---|---|---|
| InMemoryTraceStore | Process memory only | No | Dev, CI, demos |
| TraceStore (pgvector) | PostgreSQL, survives restarts | Shared DB, single collection | Production |
Current limitations:
- All traces share one collection (default: validated_traces). No per-user or per-session isolation.
- Langfuse is optional: the core framework (planning, retrieval, gatekeeper) works without it.
- The reference agent at agent/ is a planning demo with stub tool execution; bring your own tool runtime.
On a 30-task benchmark with 7 MCP tools (Gemini 2.5 Pro, temperature 0):
| Metric | Zero-Shot | Static Few-Shot | With Behavioral Memory |
|---|---|---|---|
| Tool Selection (TSA) | 63.3% | 70.0% | 83.3% |
| Parameter Validity (PV) | 72.2% | 79.6% | 84.0% |
| Plan Correctness (PCR) | 33.3% | 50.0% | 63.3% |
| Sequence Accuracy (ESA) | 63.3% | 70.0% | 83.3% |
McNemar's test: p = 0.004 vs zero-shot. Plan correctness nearly doubled.
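For readers unfamiliar with McNemar's test, here is a minimal sketch of the exact two-sided version, which compares two methods on the same tasks using only the tasks where they disagree. The discordant counts below are hypothetical: the text reports neither the per-task breakdown nor which metric the test was run on. If it was plan correctness, moving from 33.3% to 63.3% on 30 tasks is a net gain of 9 tasks, and "9 gained, 0 lost" is one split consistent with that, used here purely for illustration.

```python
from math import comb


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: tasks the baseline got wrong and the new method got right
    c: tasks the baseline got right and the new method got wrong
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) probability of a split at least as lopsided, one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts (not taken from the paper).
p = mcnemar_exact_p(b=9, c=0)
print(f"p = {p:.4f}")
```

Tasks that both methods solve, or both fail, do not enter the statistic at all, which is why the test suits small paired benchmarks like this one.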
Reproduced live run (May 2026)
| Metric | Paper | Live Run (pgvector) |
|---|---|---|
| TSA | 83.3% | 86.7% |
| PV | 84.0% | 82.2% |
| PCR | 63.3% | 80.0% |
| ESA | 83.3% | 86.7% |
| McNemar p | 0.004 | 0.039 |
All results within 95% bootstrap confidence intervals.
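A bootstrap confidence interval on a 30-task benchmark can be sketched as follows. This is a generic percentile bootstrap under stated assumptions, not necessarily the procedure used for the numbers above: the function name `bootstrap_ci`, the resample count, and the seed are illustrative choices. The input encodes 19 correct plans out of 30, which corresponds to the 63.3% plan correctness reported in the paper column.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 task outcomes."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    # Resample tasks with replacement and record each resample's accuracy.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


outcomes = [1] * 19 + [0] * 11  # 19 of 30 plans correct
low, high = bootstrap_ci(outcomes)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

With only 30 tasks the interval is wide, so differences of several percentage points between runs are expected.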
Three layers (from the paper):
| Layer | What it does | Key class |
|---|---|---|
| Behavioral | Store and retrieve validated execution traces via cosine similarity | InMemoryTraceStore / TraceStore |
| Tool | Load tool schemas dynamically via MCP | ToolRegistry / MCPClient |
| Executive | Assemble prompt (traces + schemas + query), call LLM, parse plan | PlanEngine |
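To make the Executive layer's prompt assembly concrete, here is a rough sketch of combining schemas, retrieved traces, and the query under a token budget. It is an assumption-laden illustration and not the library's `build_prompt`: `estimate_tokens`, `assemble_prompt`, the four-characters-per-token heuristic, and the prompt wording are all hypothetical.

```python
def estimate_tokens(text):
    """Crude estimate: about four characters per token (an assumption)."""
    return max(1, len(text) // 4)


def assemble_prompt(query, traces, schemas, max_tokens=3500):
    """Combine tool schemas, as many example traces as fit, and the query.

    `traces` is assumed to be ordered best match first, so trimming drops
    the least similar examples.
    """
    header = "Available tools:\n" + "\n".join(schemas)
    examples_title = "Validated examples:"
    footer = f"Task: {query}\nReturn the plan as JSON."
    used = sum(estimate_tokens(part) for part in (header, examples_title, footer))

    kept = []
    for trace in traces:
        cost = estimate_tokens(trace)
        if used + cost > max_tokens:
            break  # stop at the first trace that would exceed the budget
        kept.append(trace)
        used += cost

    parts = [header]
    if kept:
        parts.append(examples_title + "\n" + "\n".join(kept))
    parts.append(footer)
    return "\n\n".join(parts), len(kept)


schemas = ["query_database(query)", "generate_report(source_step, format)"]
traces = [
    "Calculate quarterly revenue -> query_database, generate_report",
    "Summarize revenue by region -> query_database, generate_report",
]
prompt, n_kept = assemble_prompt("Get revenue data and email a report", traces, schemas)
print(n_kept)
print(prompt)
```

Stopping at the first trace that does not fit, rather than skipping it and trying smaller ones, keeps the examples in similarity order at the cost of sometimes leaving budget unused.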
Gatekeeper Pipeline guards memory quality with three gates:
- Schema validation — tools exist, params valid, deps logical
- Sandboxed execution — runtime check with timeout
- Semantic deduplication — cosine > 0.95 rejected
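Two of the gates listed above can be sketched in a few lines of plain Python. This is illustrative only and does not reflect how GatekeeperPipeline is implemented: the dictionary-based trace format and the names `schema_gate` and `dedup_gate` are hypothetical, and the sandboxed execution gate is omitted because it depends on your tool runtime.

```python
import math

DEDUP_THRESHOLD = 0.95  # same value as the threshold quoted above


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def schema_gate(tool_chain, tool_schemas):
    """Reject unknown tools, missing required params, and bad step references."""
    seen_steps = set()
    for call in tool_chain:
        schema = tool_schemas.get(call["tool_name"])
        if schema is None:
            return False, f"unknown tool: {call['tool_name']}"
        missing = [p for p in schema["required"] if p not in call["parameters"]]
        if missing:
            return False, f"missing parameters: {missing}"
        source = call["parameters"].get("source_step")
        if source is not None and source not in seen_steps:
            return False, f"step {call['step_id']} depends on unknown step {source}"
        seen_steps.add(call["step_id"])
    return True, "ok"


def dedup_gate(candidate_vec, stored_vecs, threshold=DEDUP_THRESHOLD):
    """Reject a trace whose embedding is a near-duplicate of a stored one."""
    for vec in stored_vecs:
        if cosine(candidate_vec, vec) > threshold:
            return False, "near-duplicate of an existing trace"
    return True, "ok"


tool_schemas = {
    "query_database": {"required": ["query"]},
    "generate_report": {"required": ["source_step", "format"]},
}
chain = [
    {"step_id": "s1", "tool_name": "query_database",
     "parameters": {"query": "SELECT 1"}},
    {"step_id": "s2", "tool_name": "generate_report",
     "parameters": {"source_step": "s1", "format": "markdown_table"}},
]
print(schema_gate(chain, tool_schemas))
print(dedup_gate([1.0, 0.0], [[0.99, 0.05]]))  # toy 2-d embeddings
```

Running the cheap structural check first means a malformed trace is rejected before any embedding comparison takes place.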
git clone https://github.com/harsh-kr11/behavioral-memory.git
cd behavioral-memory
pip install -e ".[agent,eval]"
export GOOGLE_API_KEY=your-key
# Run the 30-task benchmark
python examples/run_live_benchmark.py
# Quick test (5 tasks)
python examples/run_live_benchmark.py --limit 5
# Exact paper reproduction (with pgvector)
pip install -e ".[postgres]"
docker compose up -d # or: podman-compose up -d
python examples/run_live_benchmark.py --postgres
# Gatekeeper ablation study (Section IV.D.5)
python examples/gatekeeper_ablation.py --verbose
# Validate pipeline offline (no API keys)
python examples/validate_pipeline.py

A reference LangGraph agent is included at agent/ for demo purposes.
This repo ships a Cursor Agent Skill for guided integration. Open this repo in Cursor and type /behavioral-memory in the Agent chat to invoke the skill — it walks through store setup, seed traces, feedback loops, Langfuse v4 wiring, and pgvector persistence.
# Verify your setup after following the skill
python .cursor/skills/behavioral-memory/scripts/verify_setup.py

See integration-examples.md for LangGraph, FastAPI, and production patterns.
pip install -e ".[dev,eval]"
make test # 104 tests
make lint # ruff check
make typecheck # mypy (strict)
make ci # all checks

All settings are configured via environment variables or a .env file:
| Variable | Default | Description |
|---|---|---|
| FEW_SHOT_K | 3 | Traces to retrieve per query |
| MAX_PROMPT_TOKENS | 3500 | Token budget for prompt |
| SIMILARITY_DEDUP_THRESHOLD | 0.95 | Dedup cosine threshold |
| SANDBOX_TIMEOUT_SECONDS | 30 | Gatekeeper sandbox timeout |
| VECTOR_STORE_URL | — | PostgreSQL connection (only for TraceStore) |
| LANGFUSE_SECRET_KEY | — | Langfuse secret (optional) |
| LANGFUSE_PUBLIC_KEY | — | Langfuse public key (optional) |
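As an illustration of how these variables and their defaults fit together, the sketch below reads them from a mapping and falls back to the defaults in the table. `MemorySettings` and `load_settings` are hypothetical names, not the library's own settings loader, and the sketch does not parse .env files.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MemorySettings:
    few_shot_k: int
    max_prompt_tokens: int
    similarity_dedup_threshold: float
    sandbox_timeout_seconds: int
    vector_store_url: Optional[str]


def load_settings(env: Optional[Mapping[str, str]] = None) -> MemorySettings:
    """Read settings from `env` (defaults to os.environ) with table defaults."""
    env = os.environ if env is None else env
    return MemorySettings(
        few_shot_k=int(env.get("FEW_SHOT_K", "3")),
        max_prompt_tokens=int(env.get("MAX_PROMPT_TOKENS", "3500")),
        similarity_dedup_threshold=float(env.get("SIMILARITY_DEDUP_THRESHOLD", "0.95")),
        sandbox_timeout_seconds=int(env.get("SANDBOX_TIMEOUT_SECONDS", "30")),
        vector_store_url=env.get("VECTOR_STORE_URL"),  # None when unset
    )


# Override one value and leave the rest at their defaults.
settings = load_settings({"FEW_SHOT_K": "5"})
print(settings)
```

Passing a plain dictionary instead of os.environ makes the loader easy to exercise in tests without touching the real environment.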
@inproceedings{khan2025behavioral,
  title={Behavioral Memory for Tool Orchestration: Semantic Retrieval of
         Validated Execution Traces in MCP-Based Agent Systems},
  author={Khan, Mehvash and Kumar, Harsh and Jangir, Rahul},
  booktitle={IEEE Conference Proceedings},
  year={2025}
}

Apache 2.0. See LICENSE.