57 commits
130724e
Added an agent demo
moedash Dec 30, 2025
dc4bc72
explanded the demo
moedash Dec 30, 2025
e05dccb
Improved example
moedash Dec 30, 2025
fb74c6d
Added a multi NS example
moedash Dec 30, 2025
5b8aa63
Improved nexus
moedash Dec 31, 2025
c24487f
Improved docs.
moedash Dec 31, 2025
9c018d0
Added a debug-loop examples
moedash Dec 31, 2025
14e355b
Improved the debug loop example
moedash Dec 31, 2025
dd168d1
Added a script to create fresh
moedash Dec 31, 2025
2afb260
Added experimental results
moedash Dec 31, 2025
6a7941f
Added mermaid output
moedash Dec 31, 2025
418426a
Added ai-research-agent plan
moedash Dec 31, 2025
03e03fd
Updated plan and added cursor rules
moedash Dec 31, 2025
819c5be
step 1: ai-research-agent-impl
moedash Dec 31, 2025
5d8da98
prompt 1.3: ai-research-agent-impl
moedash Dec 31, 2025
1475d1c
prompt 2.1: ai-research-agent-impl
moedash Dec 31, 2025
bd7bc2c
prompt 2.2: ai-research-agent-impl
moedash Dec 31, 2025
0c79c91
nitz.
moedash Dec 31, 2025
a6233cd
fixed the plan: ai-research-agent
moedash Dec 31, 2025
ef08ee7
prompt 2.3: ai-research-agent-impl
moedash Dec 31, 2025
c23e64b
fixed format flag
moedash Dec 31, 2025
a6042c1
improved: ai-research-agent-impl
moedash Dec 31, 2025
1f8315a
prompt 3.3: ai-research-agent-impl
moedash Dec 31, 2025
cffa53c
higher chance of success: ai-research-agent-impl
moedash Dec 31, 2025
39c9d88
prompt 3.3: ai-research-agent-impl
moedash Dec 31, 2025
aaf6100
prompt 4.2: ai-research-agent-impl
moedash Dec 31, 2025
8514135
prompt 5.1: ai-research-agent-impl
moedash Dec 31, 2025
94b147d
prompt 5.2: ai-research-agent-impl
moedash Dec 31, 2025
a9f88ff
Added ticket-drop example
moedash Dec 31, 2025
7edbbe2
Made cursorrules more generic
moedash Dec 31, 2025
b5fab4b
prompt 1.1: ticketdrop
moedash Dec 31, 2025
7aa4f90
Fixed follow-children flag
moedash Dec 31, 2025
557398a
prompt 1.3: ticketdrop
moedash Dec 31, 2025
4189024
prompt 2.1: ticketdrop
moedash Jan 1, 2026
a7e8c53
prompt 2.3: ticketdrop
moedash Jan 1, 2026
7ade1f7
Added inventory
moedash Jan 1, 2026
5a9ad48
Consolidated temporal agent with temporal workflow
moedash Jan 1, 2026
436048d
Fixed examples
moedash Jan 1, 2026
8078d31
nitz
moedash Jan 1, 2026
055275f
prompt 3.2: ticketdrop
moedash Jan 3, 2026
e354a7b
prompt 3.3: ticketdrop
moedash Jan 3, 2026
46e8889
prompt 4.1: ticketdrop
moedash Jan 3, 2026
6c72f26
prompt 5.1: ticketdrop
moedash Jan 5, 2026
151b936
prompt 5.2: ticketdrop
moedash Jan 5, 2026
59a5a1e
prompt 7.1: ticketdrop
moedash Jan 5, 2026
762cdd1
Added transcripts
moedash Jan 5, 2026
4a95cc4
Merge branch 'moe/temporal-agent-cli' into moe/temporal-agent-cli-exa…
moedash Jan 9, 2026
5128820
Merge branch 'moe/temporal-agent-cli' into moe/temporal-agent-cli-exa…
moedash Jan 9, 2026
7e31468
Merge branch 'moe/temporal-agent-cli' into moe/temporal-agent-cli-exa…
moedash Jan 9, 2026
70623e0
Merge branch 'moe/temporal-agent-cli' into moe/temporal-agent-cli-exa…
moedash Jan 9, 2026
2e2e089
Added demo workload
moedash Jan 9, 2026
635664c
Updated demo.
moedash Jan 9, 2026
c787a8b
Updated demo
moedash Jan 9, 2026
ea887a0
examples: "failures" and "diagnose" become flags for extra data on li…
moedash Jan 9, 2026
19b8c99
Merge branch 'moe/temporal-agent-cli' into moe/temporal-agent-cli-exa…
moedash Jan 9, 2026
2ab62b7
examples: replace --format with --output
moedash Jan 9, 2026
a7654f9
Merge branch 'moe/temporal-agent-cli' into moe/temporal-agent-cli-exa…
moedash Jan 9, 2026
231 changes: 231 additions & 0 deletions examples/EXPERIMENT.md
@@ -0,0 +1,231 @@
# Agent Debugging Experiment

This document describes experiments to validate that the `temporal workflow` CLI commands enable AI agents to debug Temporal workflow failures using structured output instead of logs.

## Hypothesis

AI agents can query failures, trace nested workflow chains across namespaces, and get compact timelines and state — without scraping logs or manually traversing workflows.

## Experiment 1: Basic Agent Commands

### Environment

| Setting | Value |
|---------|-------|
| Temporal Environment | Staging (`us-west-2.aws.api.tmprl-test.cloud:7233`) |
| Namespace | `moedash.temporal-dev` |
| CLI Version | Built from source |
| AI Agent | Claude Code (Cursor) |

### Failure Scenarios

| Scenario | Command | Expected Failure |
|----------|---------|------------------|
| Success | `go run ./starter -scenario success` | No failure (control) |
| Payment Fail | `go run ./starter -scenario payment-fail` | Activity fails with "payment gateway connection timeout" |
| Shipping Fail | `go run ./starter -scenario shipping-fail` | Activity fails with "warehouse inventory depleted" |
| Nested Fail | `go run ./starter -scenario nested-fail` | 3-level deep child workflow chain, leaf fails with "database connection refused" |
| Timeout | `go run ./starter -scenario timeout` | Activity times out (5s activity with 2s timeout) |
| Retry Exhaustion | `go run ./starter -scenario retry-exhaustion` | Activity fails 5 times then exhausts retries |
| Multi-Child | `go run ./starter -scenario multi-child` | 3 parallel children, only "validation" child fails |

### Results: 2025-12-29

| Test | Tool Used | Root Cause Found | Score | Notes |
|------|-----------|------------------|-------|-------|
| Test 1 | `workflow failures` | 6/6 failure types identified | 95/100 | All failures found with clear root causes |
| Test 2 | `workflow diagnose` | "database connection refused" at depth 3 | 100/100 | Perfect chain traversal |
| Test 3 | `workflow show --compact` | ValidationWorkflow failed with invalid SKU | 100/100 | Clear child workflow timeline |
| Test 4 | `workflow diagnose` | "activity StartToClose timeout" | 100/100 | Correctly identified timeout vs app error |
| Test 5 | `workflow failures --error-contains` | Found 2 timeout-related failures | 100/100 | Filter worked correctly |

> **Note:** The `workflow failures` and `workflow diagnose` names above are the subcommands as they existed on the test date. Later sections of this document use `temporal workflow list --failed` and `temporal workflow describe --trace-root-cause` for the same tasks.

**Overall Score:** 99/100

---

## Experiment 2: Multi-Namespace Nexus Traversal

### Environment

| Setting | Value |
|---------|-------|
| Temporal Environment | Staging (`us-west-2.aws.api.tmprl-test.cloud:7233`) |
| Namespaces | `moedash-commerce-ns.temporal-dev`, `moedash-finance-ns.temporal-dev`, `moedash-logistics-ns.temporal-dev` |
| Example | `examples/ecommerce-nexus/` |

### Scenarios Tested

| Scenario | Chain | Expected Failure |
|----------|-------|------------------|
| Nexus Payment Fail | commerce → finance (Nexus) | Fraud detection fails |
| Child Shipping Fail | commerce → logistics (child workflow) | Shipping carrier error |
| Deep Chain | commerce → finance → fraud-check | 3-level Nexus + child chain |

### Results: 2025-12-30

| Metric | Target | Result | Status |
|--------|--------|--------|--------|
| Time to first failure found | < 30 seconds | 3.1 seconds | ✅ PASS |
| Root cause accuracy | 100% | 100% (all failures correctly identified) | ✅ PASS |
| Chain depth accuracy | 100% | 100% (depth 2 for Nexus chains) | ✅ PASS |
| Cross-NS traversal success | 100% | 100% (commerce-ns → finance-ns) | ✅ PASS |
| Token efficiency | < 1000 bytes per failure | 685 bytes/failure | ✅ PASS |

### Key Findings

- Cross-namespace Nexus traversal correctly followed fraud workflows from commerce-ns to finance-ns
- `--compact-errors` effectively stripped verbose wrapper messages
- `--leaf-only` reduced results by 69%, eliminating duplicate parent/child entries
- Namespace-specific API keys worked seamlessly via `TEMPORAL_API_KEY_<NAMESPACE>` pattern

---

## Experiment 3: Blind AI Diagnosis (TOCTOU Race Condition)

### Environment

| Setting | Value |
|---------|-------|
| Temporal Environment | Local dev server |
| Namespace | `default` |
| Example | `examples/debug-loop-fresh/` (hint-free version) |
| AI Agent | Claude (separate LLM session) |

### The Challenge

The `debug-loop-fresh` example contains a TOCTOU race condition with all hints removed. The LLM was given only:

> "I've created a sample example under `examples/debug-loop-fresh`, and I want you to find and fix its issue with the use of temporal workflow CLI"

### LLM's Diagnosis Process

1. **Ran the scenario** - Started worker and triggered race condition
2. **Used `temporal workflow describe --trace-root-cause`** - Found `ReserveInventory` failed for KEYBOARD-03
3. **Used `temporal workflow show --compact`** - Analyzed timestamps of both workflows
4. **Built a race timeline** - Correlated events across both orders:

| Time | Main Order | Competing Order |
|------|------------|-----------------|
| 03:37:04.708 | CheckInventory (all 3) ✓ | |
| 03:37:04.711 | | CheckInventory ✓ |
| 03:37:05.723 | | **ReserveInventory ✓** (takes keyboard) |
| 03:37:05.730 | Reserve KEYBOARD **FAILED** | Completed ✓ |

5. **Proposed the fix** - Atomic `CheckAndReserveInventory` activity
6. **Verified the fix** - Both orders now behave deterministically

### Results

| Metric | Result |
|--------|--------|
| Root cause identified | ✅ TOCTOU race condition |
| Timeline analysis used | ✅ Cross-workflow timing correlation |
| Fix proposed | ✅ Atomic check-and-reserve |
| Fix verified | ✅ Deterministic behavior |
| Human intervention needed | ✅ None |

**This validates the core thesis:** An LLM can autonomously diagnose complex timing bugs using only `temporal workflow` CLI output.

---

## Features Implemented

Based on experiment findings, the following improvements were made:

### Phase 1: Core Commands

| Feature | Status | Command |
|---------|--------|---------|
| Find recent failures | ✅ Done | `temporal workflow list --failed` |
| Trace workflow chain | ✅ Done | `temporal workflow describe --trace-root-cause` |
| Workflow timeline | ✅ Done | `temporal workflow show --compact` |

### Phase 2: Filtering & Compaction

| Feature | Status | Flag/Command |
|---------|--------|--------------|
| Error message filter | ✅ Done | `--error-contains` |
| Multiple status values | ✅ Done | `--status Failed,TimedOut` |
| Leaf-only failures | ✅ Done | `--leaf-only` |
| Compact error messages | ✅ Done | `--compact-errors` |
| Follow child workflows | ✅ Done | `--follow-children` |

### Phase 3: State & Aggregation

| Feature | Status | Flag/Command |
|---------|--------|--------------|
| Workflow state | ✅ Done | `temporal workflow describe --pending` |
| Pending activities | ✅ Done | Included in state output |
| Pending Nexus operations | ✅ Done | Included in state output |
| Group failures by type | ✅ Done | `--group-by type\|namespace\|status\|error` |

### Phase 4: Cross-Namespace

| Feature | Status | Notes |
|---------|--------|-------|
| Nexus chain traversal | ✅ Done | Follows Nexus operations across namespaces |
| Namespace-specific API keys | ✅ Done | `TEMPORAL_API_KEY_<NAMESPACE>` env vars |
| Cross-NS documentation | ✅ Done | Added to README and examples |

### Phase 5: AI Tool Specs

| Feature | Status | Format |
|---------|--------|--------|
| OpenAI function spec | ✅ Done | `temporal tool-spec --format openai` |
| LangChain tool spec | ✅ Done | `temporal tool-spec --format langchain` |
| Claude tool spec | ✅ Done | `temporal tool-spec --format claude` |

### Phase 6: Visualization

| Feature | Status | Flag/Command |
|---------|--------|--------------|
| Trace flowchart | ✅ Done | `temporal workflow describe --trace-root-cause --output mermaid` |
| Timeline sequence diagram | ✅ Done | `temporal workflow show --compact --output mermaid` |
| State diagram | ✅ Done | `temporal workflow describe --pending --output mermaid` |
| Failures pie chart | ✅ Done | `temporal workflow list --failed --group-by error --output mermaid` |
| Failures flowchart | ✅ Done | `temporal workflow list --failed --output mermaid` |

---

## Comparison: Agent Commands vs Log-Based Debugging

| Aspect | Agent Commands | Log-Based (LogQL/grep, estimated) |
|--------|----------------|------------------------|
| Time to root cause | ~3-5 seconds | 5-30 minutes |
| Token consumption | ~500 tokens per query | ~5000+ tokens |
| Accuracy | 100% (structured data) | Variable |
| Domain knowledge required | Minimal | High |
| Manual steps | 1 command | 5+ steps |
| Cross-namespace correlation | Automatic | Manual |
| Race condition diagnosis | Timeline timestamps | Nearly impossible |

---

## Success Criteria Validation

| Criterion | Status | Evidence |
|-----------|--------|----------|
| AI finds failures without LogQL | ✅ | All experiments used `temporal workflow` only |
| Root cause accuracy | ✅ | 100% in all tests |
| Low token cost | ✅ | ~10x reduction vs logs |
| Cross-namespace traversal | ✅ | Nexus chains fully traced |
| Timing bug diagnosis | ✅ | Race condition identified from timeline |
| Autonomous fix proposal | ✅ | LLM proposed correct atomic operation fix |

---

## Conclusion

The `temporal workflow` CLI commands successfully achieve the goals:

1. **Agent-native feedback loop**: AI agents effectively debug Temporal workflow failures using structured output
2. **No logs required**: All debugging done via `temporal workflow` commands
3. **Automatic chain traversal**: Traces follow child workflows and Nexus operations across namespaces
4. **Root cause extraction**: Leaf failures clearly identified with `--leaf-only`
5. **Error compaction**: `--compact-errors` strips wrapper context for cleaner output
6. **Timing analysis**: Timeline timestamps enable race condition diagnosis
7. **Low token cost**: Structured JSON is ~10x more efficient than raw logs
8. **Autonomous debugging**: LLM successfully diagnosed and fixed a TOCTOU bug without hints
9. **Mermaid visualization**: `--output mermaid` generates visual diagrams for human-in-the-loop debugging

**Temporal's execution history + agent-optimized CLI = effective AI debugging feedback loop.**
166 changes: 166 additions & 0 deletions examples/agent-demo/README.md
@@ -0,0 +1,166 @@
# Temporal Agent Demo

This demo shows how to use the `temporal workflow` commands for AI-assisted debugging.

## Overview

The demo includes several workflow scenarios:

1. **SimpleSuccessWorkflow** - A basic successful workflow with one activity
2. **OrderWorkflow** - An order processing workflow with child workflows (PaymentWorkflow, ShippingWorkflow)
3. **NestedFailureWorkflow** - A deeply nested workflow chain that fails at the leaf level

## Setup

### Prerequisites

- Go 1.23+
- **Temporal Go SDK v1.37.0+** (required for API key authentication)

### Environment Variables

**For Temporal Cloud (Production):**
```bash
export TEMPORAL_ADDRESS="us-east-1.aws.api.temporal.io:7233"
export TEMPORAL_NAMESPACE="moedash-prod.a2dd6"
export TEMPORAL_API_KEY="$(cat ../../prod-temporal-api-key.txt)"
export TEMPORAL_TASK_QUEUE="agent-demo"
```

**For Temporal Cloud (Staging):**
```bash
export TEMPORAL_ADDRESS="us-west-2.aws.api.tmprl-test.cloud:7233"
export TEMPORAL_NAMESPACE="moedash.temporal-dev"
export TEMPORAL_API_KEY="$(cat ../../staging-temporal-api-key.txt)"
export TEMPORAL_TASK_QUEUE="agent-demo"
```
> **Note:** Staging uses a self-signed certificate. The worker/starter auto-detect staging URLs and skip TLS verification. For CLI commands, add `--tls-disable-host-verification`.

**For Local Dev Server:**
```bash
export TEMPORAL_ADDRESS="localhost:7233"
export TEMPORAL_NAMESPACE="default"
export TEMPORAL_TASK_QUEUE="agent-demo"
```

### Install Dependencies

```bash
go mod tidy
```

### SDK Version Note

This demo requires **Temporal Go SDK v1.37.0+** for proper API key authentication. Earlier SDK versions may fail with "Request unauthorized" errors even with valid credentials. The demo uses `go.temporal.io/sdk/contrib/envconfig` for client configuration, matching the CLI's approach.

## Running the Demo

### 1. Start the Worker

In one terminal:

```bash
go run ./worker
```

### 2. Start Workflows

In another terminal:

```bash
# Run all scenarios
go run ./starter -scenario all

# Or run individual scenarios:
go run ./starter -scenario success
go run ./starter -scenario payment-fail
go run ./starter -scenario shipping-fail
go run ./starter -scenario nested-fail
```

## Using Temporal Workflow Commands

After workflows have run, use the agent commands to analyze them.

> **For staging:** Add `--tls-disable-host-verification` to all commands.

### List Recent Failures

```bash
temporal workflow list --failed \
  --address $TEMPORAL_ADDRESS \
  --namespace $TEMPORAL_NAMESPACE \
  --api-key $TEMPORAL_API_KEY \
  --tls \
  --since 1h \
  --follow-children \
  --output json | jq
```

### Trace a Workflow Chain

```bash
# Find the deepest failure in an order workflow
temporal workflow describe --trace-root-cause \
  --address $TEMPORAL_ADDRESS \
  --namespace $TEMPORAL_NAMESPACE \
  --api-key $TEMPORAL_API_KEY \
  --tls \
  -w order-payment-fail-XXXXXX \
  --output json | jq

# Trace the nested failure workflow (3 levels deep)
temporal workflow describe --trace-root-cause \
  --address $TEMPORAL_ADDRESS \
  --namespace $TEMPORAL_NAMESPACE \
  --api-key $TEMPORAL_API_KEY \
  --tls \
  -w nested-failure-XXXXXX \
  --output json | jq
```

### Get Workflow Timeline

```bash
temporal workflow show --compact \
  --address $TEMPORAL_ADDRESS \
  --namespace $TEMPORAL_NAMESPACE \
  --api-key $TEMPORAL_API_KEY \
  --tls \
  -w order-success-XXXXXX \
  --output json | jq
```

## Workflow Scenarios

### Payment Failure Chain

```
OrderWorkflow (ORD-XXX-X)
└── PaymentWorkflow (payment-ORD-XXX-X)
    └── ProcessPaymentActivity → FAILS: "payment gateway connection timeout"
```

### Shipping Failure Chain

```
OrderWorkflow (ORD-XXX-Y)
├── PaymentWorkflow (payment-ORD-XXX-Y) → SUCCESS
└── ShippingWorkflow (shipping-ORD-XXX-Y)
    └── ShipOrderActivity → FAILS: "warehouse inventory depleted"
```

### Nested Failure Chain

```
NestedFailureWorkflow (depth=0)
└── NestedFailureWorkflow (depth=1)
    └── NestedFailureWorkflow (depth=2)
        └── NestedFailureWorkflow (depth=3)
            └── FailingActivity → FAILS: "database connection refused"
```

The `temporal workflow describe --trace-root-cause` command will automatically traverse this entire chain
and identify the leaf failure with its root cause.
