@jeremyeder
Owner

Adds scripted Langfuse deployment to local kind clusters using upstream Helm chart.

Phase 1 scope:

  • deploy-langfuse-kind.sh: Automated installation
  • cleanup-langfuse.sh: Cleanup script
  • Documentation (POC guide, SessionAffinity investigation)
  • langfuse-rosa-expert agent for future ROSA work

Tested:
Podman on macOS, single-node kind cluster

Phase 2 (future PR):
Instrument platform with Langfuse for LLM observability

Quick start:

cd e2e
./scripts/deploy-langfuse-kind.sh
# Access: http://langfuse.local:8080 (Podman) or http://langfuse.local (Docker)

jeremyeder and others added 12 commits November 9, 2025 16:46
- Created e2e/scripts/deploy-langfuse-kind.sh for automated deployment
- Added comprehensive documentation in docs/deployment/langfuse-helm-poc.md
- Added Makefile target: deploy-langfuse-kind
- Follows project conventions from existing e2e scripts
- Uses official Langfuse Helm chart (v1.5.9) with minimal customization
- Supports automatic secret generation and validation
- Includes troubleshooting guide and cleanup instructions
- Created e2e/scripts/cleanup-langfuse.sh following cleanup.sh conventions
- Deletes Langfuse namespace
- Removes langfuse.local from /etc/hosts (with backup)
- Cleans up .env.langfuse credentials file
- Supports --delete-cluster flag to also remove kind cluster
- Follows project emoji/status message style
- Move container engine detection before kind cluster check
- Set KIND_EXPERIMENTAL_PROVIDER before running kind commands
- Ensures Podman users can check for existing clusters correctly
- Use langfuse.nextauth.secret.value instead of langfuse.nextauth.secret
- Use langfuse.salt.value instead of langfuse.salt
- Fix password generation to use openssl instead of /dev/urandom
- Prevents hanging on password generation and Helm template errors
- Set clickhouse.replicaCount=1 (was 3 by default)
- Disable pod anti-affinity for ClickHouse, PostgreSQL, Redis, ZooKeeper
- Prevents pods from being stuck in Pending state on single-node clusters
- Uses podAntiAffinityPreset=none for all StatefulSets
After thorough investigation of the langfuse-k8s Helm chart and its Bitnami
dependencies, determined that:

- Headless services (clusterIP: None) correctly omit sessionAffinity
- Regular services only include sessionAffinity when explicitly configured
- Issue is in upstream Bitnami charts, not langfuse-k8s repository
- No PR needed for langfuse-k8s

Documented three options if SessionAffinity warnings occur:
1. Override values at deployment time
2. Report to Bitnami upstream charts
3. Verify warnings are actually occurring

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- ClickHouse: 3Gi → 512Mi requests, 1Gi limits
- ZooKeeper: Reduce to 1 replica (was 3), 256Mi requests, 512Mi limits
- Fixes 'Insufficient memory' scheduling errors on kind nodes
- Total memory footprint now fits within kind node capacity (~2Gi)
- Change zookeeper.replicaCount to zookeeper.replicas
- Bitnami ZooKeeper chart uses 'replicas' not 'replicaCount'
- Will now correctly deploy 1 ZooKeeper pod instead of 3
The Helm chart expects langfuse.ingress.* not just ingress.*
This was preventing the Ingress resource from being created.

Fixed URLs:
- ingress.enabled -> langfuse.ingress.enabled
- ingress.className -> langfuse.ingress.className
- ingress.hosts -> langfuse.ingress.hosts
- Deploy script now shows correct URL based on container engine
- Documentation updated to reflect port 8080 for Podman
- Docker uses standard port 80
This agent provides expertise in LangFuse deployment on ROSA:
- Installation and configuration
- Upgrading and troubleshooting
- Performance optimization
- Architecture design for scale
- Integration with OpenShift features
@github-actions

github-actions bot commented Nov 9, 2025

Pull Request Review: Langfuse Deployment for kind (Phase 1)

Summary

This PR adds a comprehensive Langfuse observability platform deployment for local kind clusters. The implementation is well-structured, properly documented, and follows established patterns from the existing e2e infrastructure. Overall, this is a solid foundation for Phase 1 with some recommendations for improvement.


✅ Strengths

1. Excellent Documentation

  • langfuse-helm-poc.md is comprehensive with architecture diagrams, prerequisites, and troubleshooting
  • Clear quick start instructions
  • Good use of visual ASCII diagrams
  • Proper attribution to upstream Langfuse project

2. Consistent with Project Patterns

  • Follows existing e2e/scripts/ conventions (setup-kind.sh pattern)
  • Container engine detection logic matches existing scripts
  • Proper use of Makefile targets for discoverability
  • Namespace isolation (langfuse namespace)

3. Security Best Practices

  • Auto-generates secure secrets using openssl rand -base64
  • Strips special characters from passwords (PostgreSQL, ClickHouse, Redis)
  • Saves credentials to .env.langfuse (should be gitignored)
  • No hardcoded passwords
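
A minimal sketch of the secret generation described above, assuming the script draws random bytes from openssl and strips everything except letters and digits. The function name generate_password and the 32-character default are illustrative and not taken from the PR.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build an alphanumeric password from openssl output.
generate_password() {
  local length="${1:-32}"
  local raw
  # Over-request bytes so enough characters remain after '+', '/', '=' and
  # newlines are stripped from the base64 output.
  raw="$(openssl rand -base64 $((length * 2)) | tr -dc 'A-Za-z0-9')"
  printf '%s\n' "${raw:0:length}"
}
```

Usage would look like POSTGRES_PASSWORD="$(generate_password 24)". Stripping special characters keeps the value safe to pass through --set flags and connection strings without extra quoting.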

4. Operational Excellence

  • Idempotent operations (checks if namespace/entries exist)
  • Proper error handling with set -euo pipefail
  • Graceful degradation (warns if /etc/hosts fails)
  • Wait conditions for pod readiness
  • Cleanup script mirrors deployment script structure

5. Thoughtful Agent Design

  • langfuse-rosa-expert.md has clear scope and competencies
  • Good SRE collaboration patterns
  • Comprehensive operational methodology

🔧 Recommendations

Priority 1: Critical Issues

1. Missing .gitignore Entry

Issue: .env.langfuse contains sensitive credentials but may not be gitignored.

Fix: Verify e2e/.gitignore includes:

.env.langfuse

Location: e2e/.gitignore

2. Unescaped Dot in sed Pattern

Issue: Line 47 in cleanup-langfuse.sh uses an unescaped dot in its sed pattern, so it matches more than the literal hostname.

# Current (line 47)
sudo sed -i.bak '/langfuse.local/d' /etc/hosts

# Better
sudo sed -i.bak '/langfuse\.local/d' /etc/hosts

Location: e2e/scripts/cleanup-langfuse.sh:47

Rationale: Escape the dot to match literal langfuse.local instead of langfuse<any-char>local.
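
The effect of escaping can be checked on a scratch file rather than /etc/hosts. The sketch below is an assumption about how the cleanup could be factored, not the script's actual code: remove_hosts_entry is a hypothetical helper that escapes the dots in a hostname and deletes only lines where it appears as a whole field.

```bash
#!/usr/bin/env bash
# Hypothetical helper: delete lines that map exactly HOSTNAME in a hosts-style file.
remove_hosts_entry() {
  local file="$1" host="$2"
  # Escape every dot so it matches a literal '.' rather than any character.
  local pattern="${host//./\\.}"
  # The attached suffix (-i.bak) is accepted by both GNU and BSD sed;
  # -E enables extended regular expressions on both.
  sed -E -i.bak "/(^|[[:space:]])${pattern}([[:space:]]|\$)/d" "$file"
}
```

Against the real hosts file this would be run with sudo. Anchoring on whitespace or end of line also avoids removing a longer name such as langfuse.local.example.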

3. StatefulSet Wait Condition Fragility

Issue: Lines 130-137 in deploy-langfuse-kind.sh use jsonpath='{.status.readyReplicas}'=1 which may not work for all StatefulSet states.

# Current (lines 132-136)
kubectl wait --namespace langfuse \
  --for=jsonpath='{.status.readyReplicas}'=1 \
  --timeout=300s \
  statefulset/$statefulset &>/dev/null || true

# More robust
kubectl wait --namespace langfuse \
  --for=jsonpath='{.status.readyReplicas}'=1 \
  --timeout=300s \
  statefulset/$statefulset 2>/dev/null || echo "   ⚠️ Warning: $statefulset may still be starting"

Location: e2e/scripts/deploy-langfuse-kind.sh:132-136

Rationale: Better error visibility when pods don't reach ready state.
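
An alternative worth considering is kubectl rollout status, which waits for a StatefulSet rollout and exits non-zero on timeout. This is a sketch under that assumption; the helper name wait_for_statefulset and the 300s default are illustrative.

```bash
#!/usr/bin/env bash
# Hypothetical helper: wait for a StatefulSet and surface a warning on timeout.
wait_for_statefulset() {
  local namespace="$1" name="$2" timeout="${3:-300s}"
  if kubectl rollout status "statefulset/${name}" \
      --namespace "$namespace" --timeout="$timeout" >/dev/null 2>&1; then
    echo "   ✅ ${name} is ready"
  else
    # Report the problem instead of silently swallowing it with '|| true'.
    echo "   ⚠️ Warning: ${name} may still be starting" >&2
    return 1
  fi
}
```

A caller that wants to continue despite a slow pod can use wait_for_statefulset langfuse "$statefulset" || true, which keeps the warning visible while not aborting under set -e.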


Priority 2: Enhancements

4. Resource Limits for Local Testing

Observation: The script configures significant resources:

  • ClickHouse: 512Mi-1Gi memory, 500m-1 CPU
  • ZooKeeper: 256Mi-512Mi memory, 250m-500m CPU
  • Langfuse web/worker: 1Gi-2Gi memory, 500m-1000m CPU

Suggestion: Document total resource requirements in the script header:

# Resource Requirements:
#   CPU: ~9 cores
#   Memory: ~19.5GB RAM
#   Disk: ~50GB
# For smaller environments, consider reducing replica counts

Location: e2e/scripts/deploy-langfuse-kind.sh:1-10

5. Helm Chart Version Pinning

Issue: Line 80 uses langfuse/langfuse without version pinning.

# Current (line 80)
helm upgrade --install langfuse langfuse/langfuse \

# Better (with version pin)
LANGFUSE_CHART_VERSION="1.5.9"  # Or make configurable
helm upgrade --install langfuse langfuse/langfuse \
  --version "$LANGFUSE_CHART_VERSION" \

Location: e2e/scripts/deploy-langfuse-kind.sh:80

Rationale: Reproducibility and avoiding unexpected breaking changes from chart updates.

6. Error Handling for Helm Failures

Issue: Line 80-111 helm install has --wait but errors are not explicitly caught.

Suggestion: Add explicit error handling:

# Keep the existing --set flags between the chart name and --wait
if ! helm upgrade --install langfuse langfuse/langfuse \
  --wait \
  --timeout=10m; then
  echo "❌ Helm installation failed. Check logs:"
  echo "   kubectl logs -n langfuse -l app.kubernetes.io/name=langfuse --tail=100"
  exit 1
fi

Location: e2e/scripts/deploy-langfuse-kind.sh:80-112
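
Version pinning and error handling can be combined in one wrapper. The following is a sketch, not the script's actual structure: install_langfuse, LANGFUSE_CHART_VERSION, and LANGFUSE_HELM_TIMEOUT are hypothetical names, and the existing --set flags would be passed through as arguments.

```bash
#!/usr/bin/env bash
# Assumed defaults; both can be overridden from the environment.
LANGFUSE_CHART_VERSION="${LANGFUSE_CHART_VERSION:-1.5.9}"
LANGFUSE_HELM_TIMEOUT="${LANGFUSE_HELM_TIMEOUT:-10m}"

# Hypothetical wrapper: extra arguments (for example --set flags) go straight to helm.
install_langfuse() {
  if ! helm upgrade --install langfuse langfuse/langfuse \
      --namespace langfuse \
      --version "$LANGFUSE_CHART_VERSION" \
      --wait \
      --timeout "$LANGFUSE_HELM_TIMEOUT" \
      "$@"; then
    echo "❌ Helm installation failed. Inspect the pods with:" >&2
    echo "   kubectl get pods -n langfuse" >&2
    return 1
  fi
}
```

For example, install_langfuse --set clickhouse.replicaCount=1. On a slow machine the timeout can be raised with LANGFUSE_HELM_TIMEOUT=15m without editing the script.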

7. Documentation: OpenShift Route Clarification

Issue: Documentation mentions OpenShift but only kind deployment is implemented.

Suggestion: In langfuse-helm-poc.md, add a note in the "OpenShift Deployment" section:

## OpenShift Deployment (Phase 2 - Not Yet Implemented)

OpenShift deployment script (`deploy-langfuse-openshift.sh`) is planned for a future PR with:
- Security Context Constraints (SCC) configuration
- OpenShift Route support
- ...

**Status**: Phase 2 work - not included in this PR.

Location: docs/deployment/langfuse-helm-poc.md:351


Priority 3: Nice-to-Haves

8. Add Validation for Required Ports

Suggestion: Check if ports 80/8080 are available before deployment:

# Add after line 36 in deploy-langfuse-kind.sh
echo ""
echo "Checking port availability..."
if [ "$CONTAINER_ENGINE" = "podman" ]; then
  PORT=8080
else
  PORT=80
fi

if lsof -i:$PORT >/dev/null 2>&1; then
  echo "   ⚠️ Warning: Port $PORT is already in use"
  echo "   Langfuse may not be accessible at expected URL"
fi

9. Add Smoke Test Target

Suggestion: Add a test-langfuse Makefile target:

test-langfuse: ## Test Langfuse deployment
	@cd e2e && ./scripts/test-langfuse.sh

With a simple test script:

#!/bin/bash
# e2e/scripts/test-langfuse.sh
set -euo pipefail

echo "Testing Langfuse deployment..."

# Check pods
kubectl get pods -n langfuse

# Test HTTP endpoint
URL="http://langfuse.local:8080"  # Adjust for Docker
if curl -s -o /dev/null -w "%{http_code}" "$URL" | grep -Eq '^(200|30[0-9])$'; then
  echo "✅ Langfuse is accessible at $URL"
else
  echo "❌ Langfuse is not responding at $URL"
  exit 1
fi

10. Agent Definition: Missing Examples

Observation: The langfuse-rosa-expert.md has excellent structure but the examples in the frontmatter description are ROSA-specific, while the implementation is kind-specific.

Suggestion: Add a note in the agent definition about kind deployment:

## Local Development
For local kind cluster deployments, use `make deploy-langfuse-kind` instead of this agent.
This agent is specialized for production ROSA deployments.

Location: agents/langfuse-rosa-expert.md:6


🔍 Code Quality Assessment

Bash Scripting

  • ✅ Proper shebang and set -euo pipefail
  • ✅ Consistent error handling
  • ✅ Good use of functions (could extract more)
  • ✅ Clear variable naming
  • ⚠️ Some opportunities for functions (e.g., check_prerequisites)

Documentation

  • ✅ Comprehensive and well-structured
  • ✅ Troubleshooting section is excellent
  • ✅ Architecture diagrams add clarity
  • ⚠️ Could benefit from a "Known Limitations" section

Security

  • ✅ No hardcoded secrets
  • ✅ Proper secret generation
  • ⚠️ Ensure .env.langfuse is gitignored (verify)
  • ✅ Credentials stored securely locally

Testing

  • ⚠️ No automated tests for these scripts
  • ⚠️ No smoke test to verify deployment
  • 📋 Consider adding basic smoke tests in future

🎯 Verdict

Recommendation: ✅ Approve with minor changes

This is a well-implemented Phase 1 foundation. The critical issues are minor and easily addressed. The code quality is high, follows project conventions, and the documentation is excellent.

Before Merge (Required):

  1. ✅ Verify .env.langfuse is in e2e/.gitignore
  2. ✅ Fix sed escaping in cleanup script (line 47)
  3. ✅ Add version pinning to Helm chart (reproducibility)

Post-Merge (Recommended):

  • Add smoke test script and Makefile target
  • Extract repeated logic into functions
  • Add port availability checks
  • Create Phase 2 issue for OpenShift deployment

📊 Alignment with CLAUDE.md

✅ Follows Project Standards:

  • Bash scripting patterns match existing e2e/ scripts
  • Makefile integration follows conventions
  • Documentation structure aligns with project style
  • Namespace isolation for multi-tenant considerations

✅ Security Practices:

  • Proper secret management
  • No hardcoded credentials
  • Secure defaults

⚠️ Testing Gap:

  • CLAUDE.md emphasizes test-driven development
  • No automated tests for deployment scripts
  • Recommend adding smoke tests in follow-up

📝 Final Notes

Great work on this PR! The implementation is clean, well-documented, and production-minded. The separation of Phase 1 (kind) and Phase 2 (OpenShift) is smart. The agent definition is thorough and will be valuable for future ROSA work.

Estimated Review Time: 45 minutes
Complexity: Medium
Risk: Low (isolated to e2e infrastructure)


Reviewed by: Claude Code (AI Code Review)
Date: 2025-11-09

Comprehensive planning document covering:
- Phase 1 completion status and deployment details
- Integration points (runner, backend, operator)
- SDK usage patterns for Python and Go
- Configuration strategies (3 options)
- Implementation plan (6 steps)
- Success criteria and testing strategy
@github-actions

github-actions bot commented Nov 9, 2025

PR Review: Add Langfuse deployment for kind (Phase 1)

Summary

This PR adds Langfuse (LLM observability platform) deployment capability to local kind clusters as Phase 1 of LLM observability integration. The implementation is well-structured, follows project conventions, and includes comprehensive documentation.


✅ Strengths

1. Excellent Documentation

  • Comprehensive POC guide (langfuse-helm-poc.md) with architecture diagrams, prerequisites, and troubleshooting
  • Phase 2 context document clearly outlines future integration plans
  • SessionAffinity investigation document demonstrates thorough technical analysis
  • Clear separation of Phase 1 (infrastructure) and Phase 2 (instrumentation)

2. Script Quality

  • deploy-langfuse-kind.sh: Well-structured deployment automation
    • Proper error handling with set -euo pipefail
    • Comprehensive prerequisite checks
    • Auto-generates secure secrets using openssl rand
    • Supports both Docker and Podman (aligns with existing e2e infrastructure)
    • Idempotent operations with existence checks
    • Clear, informative output with status indicators
  • cleanup-langfuse.sh: Complete cleanup automation
    • Handles /etc/hosts cleanup safely with backups
    • Optional cluster deletion with --delete-cluster flag
    • Graceful handling of non-existent resources

3. Security Practices

  • Secrets generated with cryptographically secure random values
  • Credentials saved to .env.langfuse (needs .gitignore entry - see below)
  • No hardcoded secrets in code
  • Appropriate resource limits configured

4. Integration with Existing Infrastructure

  • Uses existing vteam-e2e kind cluster from e2e setup
  • Follows existing Podman/Docker detection patterns
  • Makefile target follows project conventions
  • Namespace isolation (langfuse) separates from platform components

5. Langfuse ROSA Expert Agent

  • Comprehensive agent definition with clear competencies
  • Excellent SRE collaboration pattern emphasizing automation
  • Production-ready guidance for future OpenShift deployment

⚠️ Issues and Recommendations

CRITICAL: Security - .gitignore Missing

Issue: .env.langfuse contains sensitive credentials but is not in .gitignore

Current .gitignore entries:

.env
.env.uat
e2e/.env.test

Required fix:

# E2E testing
e2e/.env.test
+e2e/.env.langfuse
e2e/node_modules/

Impact: Without this, developers might accidentally commit database passwords and secrets.

Recommendation: Add this entry before merging.


Code Quality Issues

1. Shell Script - sed Portability (cleanup-langfuse.sh:47)

Issue: sed -i takes its backup suffix differently on macOS (BSD) and Linux (GNU), so in-place edits are easy to break across platforms.

# Current (line 47)
sudo sed -i.bak '/langfuse.local/d' /etc/hosts

Note: The attached-suffix form used here (-i.bak, no space) is accepted by both BSD and GNU sed. The forms that break are a space-separated suffix (-i .bak), which GNU sed reads as a bare -i followed by a script named .bak, and an empty attached suffix (-i''), which the shell reduces to a bare -i so that BSD sed consumes the script as the suffix.

Fix:

# Keep the attached suffix and escape the dot so it matches literally
sudo sed -i.bak '/langfuse\.local/d' /etc/hosts

Line 45 already creates a backup, so the .bak file written by sed can be deleted afterwards if it is not wanted.

2. Shell Script - Unused Variable (deploy-langfuse-kind.sh:30)

Issue: Container engine detected but output message doesn't reflect actual detection logic

# Line 30
echo "Using container runtime: $CONTAINER_ENGINE"

Observation: This message appears before the kind cluster check. If the user sets CONTAINER_ENGINE manually, the auto-detection is skipped, which is correct. However, consider clarifying if it was auto-detected vs manually set:

echo "Using container runtime: $CONTAINER_ENGINE (auto-detected)"
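
One way to report the source of the setting is to record it during detection. The sketch below assumes the script honors a preset CONTAINER_ENGINE and otherwise probes PATH; the function name, the ENGINE_SOURCE variable, and the docker-before-podman order are assumptions rather than details from the PR.

```bash
#!/usr/bin/env bash
# Hypothetical helper: respects a preset CONTAINER_ENGINE, otherwise probes PATH.
detect_container_engine() {
  if [ -n "${CONTAINER_ENGINE:-}" ]; then
    ENGINE_SOURCE="set manually"
  elif command -v docker >/dev/null 2>&1; then
    CONTAINER_ENGINE="docker"
    ENGINE_SOURCE="auto-detected"
  elif command -v podman >/dev/null 2>&1; then
    CONTAINER_ENGINE="podman"
    ENGINE_SOURCE="auto-detected"
  else
    echo "❌ Neither docker nor podman found in PATH" >&2
    return 1
  fi
  echo "Using container runtime: $CONTAINER_ENGINE ($ENGINE_SOURCE)"
}
```

Running CONTAINER_ENGINE=podman ./scripts/deploy-langfuse-kind.sh would then print the "set manually" variant, which makes it clear why auto-detection was skipped.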

3. Helm Values - ZooKeeper Replicas Mismatch

Issue: deploy-langfuse-kind.sh:104 sets zookeeper.replicas=1 for local dev, but production typically needs 3+ for quorum.

Current:

--set zookeeper.replicas=1 \

Consideration: This is appropriate for local dev, but the documentation should warn that Phase 2 (production ROSA deployment) will need to increase this. The langfuse-helm-poc.md mentions "Resource Requirements" but doesn't explicitly call out ZooKeeper quorum requirements.

Recommendation: Add to docs:

**Production Considerations:**
- ZooKeeper: Increase to 3 replicas minimum for proper quorum
- ClickHouse: Consider multiple shards for high-volume deployments

Documentation Suggestions

1. Add Troubleshooting for macOS Podman Port Conflicts

The docs mention http://langfuse.local:8080 for Podman but don't explain why port 8080 is needed. Add:

**Why port 8080 for Podman?**
Podman rootless mode cannot bind to privileged ports (<1024) without additional configuration. The kind cluster is created with port mappings 8080:80 and 8443:443 for rootless compatibility.
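
The port difference can also be captured in the script so messages and tests agree on one URL. A minimal sketch, assuming the two URLs from the quick start; langfuse_url is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the Langfuse URL for the given container engine.
langfuse_url() {
  case "${1:-docker}" in
    podman) echo "http://langfuse.local:8080" ;;  # rootless Podman maps 8080:80
    *)      echo "http://langfuse.local" ;;       # Docker binds port 80 directly
  esac
}
```

A smoke test could then call curl -f "$(langfuse_url "$CONTAINER_ENGINE")" instead of hardcoding either port.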

2. Credentials Management Best Practices

The Phase 2 context document mentions three configuration options but doesn't provide security guidance. Add:

**Security Best Practices:**
- Never commit `.env.langfuse` to version control
- In production, use external secret managers (HashiCorp Vault, AWS Secrets Manager)
- Rotate API keys regularly using Langfuse web UI
- Use RBAC to limit which ServiceAccounts can read langfuse-keys Secret

3. Resource Requirements Validation

The POC guide lists resource requirements but doesn't explain how to check if your system meets them. Add:

# Check available resources before deployment
docker system info | grep -E 'CPUs|Total Memory'
# or
podman system info | grep -E 'cpus|memTotal'

Performance Considerations

1. ClickHouse Resource Limits Too Low for Production

--set clickhouse.resources.limits.memory=1Gi \

Issue: ClickHouse documentation recommends minimum 2GB for production workloads with analytics queries.

Recommendation:

  • Current settings are fine for POC/dev
  • Add warning in langfuse-helm-poc.md under "Resource Requirements":
    **Note**: ClickHouse limits are set to 1Gi for local development. Production deployments should use minimum 2Gi memory and consider scaling based on trace volume.

2. No HPA or PDB Configuration

Observation: The deployment doesn't configure Horizontal Pod Autoscaling or Pod Disruption Budgets.

Recommendation: This is acceptable for Phase 1 (POC), but Phase 2 documentation should include HPA setup for langfuse-web and langfuse-worker based on trace ingestion volume.


Testing Coverage

Missing Test Validation

Issue: No automated test to verify the deployment succeeds.

Recommendation: Add optional smoke test to Makefile:

test-langfuse: ## Test Langfuse deployment
	@echo "Testing Langfuse deployment..."
	@kubectl wait --namespace langfuse --for=condition=available --timeout=300s deployment/langfuse-web
	@curl -f http://langfuse.local:8080 || curl -f http://langfuse.local
	@echo "✓ Langfuse is accessible"

Or add to existing e2e test suite if Langfuse becomes a core dependency.


📋 Pre-Merge Checklist

Must Fix Before Merge:

  • Add e2e/.env.langfuse to .gitignore
  • Verify sed -i portability in cleanup-langfuse.sh on both macOS and Linux

Should Consider:

  • Add ClickHouse memory warning to documentation
  • Add ZooKeeper quorum requirements to Phase 2 docs
  • Consider adding smoke test for deployment validation

Nice to Have:

  • Enhance troubleshooting section with macOS Podman port explanation
  • Add security best practices section to Phase 2 context
  • Add resource validation commands to POC guide

🎯 Overall Assessment

Recommendation: Approve with required changes

This is a well-executed Phase 1 implementation that:

  • ✅ Follows project conventions and patterns
  • ✅ Includes comprehensive documentation
  • ✅ Uses upstream Helm charts without unnecessary customization
  • ✅ Supports both Docker and Podman
  • ✅ Provides clear path to Phase 2 (instrumentation)

The only blocking issue is the missing .gitignore entry for secrets. The sed invocation in the cleanup script should also be verified on both macOS and Linux before merge.

All other recommendations are enhancements that would improve the production readiness but don't block Phase 1 POC deployment.


📚 References Reviewed

  • ✅ CLAUDE.md project conventions
  • ✅ Existing e2e infrastructure (setup-kind.sh, cleanup.sh)
  • ✅ Langfuse official documentation
  • ✅ Helm chart source (langfuse-k8s repository)
  • ✅ Project Makefile patterns

Great work on this foundational piece! The phase separation is smart, and the documentation will make Phase 2 integration much smoother.

Add automatic S3 credential patching to resolve trace storage issue
where langfuse-web and langfuse-worker deployments were missing
proper S3 secret references, preventing traces from being persisted
to storage.

Changes:
- Patch langfuse-web deployment with S3 credentials from langfuse-s3 secret
- Patch langfuse-worker deployment with S3 credentials from langfuse-s3 secret
- Add automatic rollout wait after patching
- Document S3 credential fix in troubleshooting section
- Set replica counts to 1 for kind deployments (already in script)

This ensures traces are successfully uploaded to S3 (web) and
downloaded for processing (worker), making them visible in the UI.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@jeremyeder
Owner Author

S3 Credential Fix Applied

Added automatic S3 credential configuration to resolve the trace storage issue discovered during testing.

Problem

The upstream Langfuse Helm chart deploys langfuse-web and langfuse-worker without proper S3 credentials configured by default. This causes a two-stage failure:

  1. Web pod: Missing S3 secret key → cannot upload OTEL traces to S3
  2. Worker pod: Missing S3 secret key → cannot download traces from S3 for processing into database

Result: Traces appear to send successfully (200 OK) but never show up in the UI.

Solution

The deployment script now automatically patches both deployments after Helm installation to reference credentials from the langfuse-s3 secret:

  • LANGFUSE_S3_EVENT_UPLOAD_ACCESS_KEY_ID → langfuse-s3/root-user
  • LANGFUSE_S3_EVENT_UPLOAD_SECRET_ACCESS_KEY → langfuse-s3/root-password
  • Same for BATCH_EXPORT and MEDIA_UPLOAD configurations
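
The mapping above can be sketched as a JSON Patch built in the shell. This is an illustration under several assumptions, not the script's actual patch: the helper names are hypothetical, the BATCH_EXPORT and MEDIA_UPLOAD variable names are inferred from the EVENT_UPLOAD pair, the target is assumed to be the first container, and its env list is assumed to exist already.

```bash
#!/usr/bin/env bash
# Emit one JSON Patch "add" operation that appends an env var backed by a Secret key.
# Assumes the target container is the first one in the pod template.
s3_env_op() {
  local env_name="$1" secret_key="$2"
  printf '{"op":"add","path":"/spec/template/spec/containers/0/env/-","value":{"name":"%s","valueFrom":{"secretKeyRef":{"name":"langfuse-s3","key":"%s"}}}}' \
    "$env_name" "$secret_key"
}

# Build the full patch for the three S3 configurations named above.
s3_patch_json() {
  local ops=() prefix
  for prefix in EVENT_UPLOAD BATCH_EXPORT MEDIA_UPLOAD; do
    ops+=("$(s3_env_op "LANGFUSE_S3_${prefix}_ACCESS_KEY_ID" root-user)")
    ops+=("$(s3_env_op "LANGFUSE_S3_${prefix}_SECRET_ACCESS_KEY" root-password)")
  done
  local IFS=,
  printf '[%s]\n' "${ops[*]}"
}

# Apply the patch to a deployment, then wait for the rollout to finish.
patch_s3_credentials() {
  local deployment="$1"
  kubectl patch deployment "$deployment" --namespace langfuse \
    --type=json -p "$(s3_patch_json)"
  kubectl rollout status "deployment/$deployment" --namespace langfuse --timeout=300s
}
```

It would be invoked once per deployment, for example patch_s3_credentials langfuse-web and patch_s3_credentials langfuse-worker. Because each operation appends to the env list, running it twice would add duplicate entries, so a real script would need a guard for repeat runs.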

Changes in this commit

  • ✅ Auto-patch langfuse-web deployment with S3 credentials
  • ✅ Auto-patch langfuse-worker deployment with S3 credentials
  • ✅ Wait for rollout to complete after patching
  • ✅ Document S3 fix in troubleshooting section
  • ✅ Replica counts already set to 1 for kind deployments

Validation

Tested with Python Langfuse SDK client - traces now successfully persist to S3 and appear in UI within ~5 seconds.

# Verify fix works
curl -s -u "pk-lf-xxx:sk-lf-xxx" http://localhost:3000/api/public/traces | jq '.meta.totalItems'
# Returns: 3 (traces successfully stored)

Phase 2 Changes:
- Focus on Claude Code Runner instrumentation only (removed Backend/Operator)
- Simplified to single global configuration (ConfigMap + Secret)
- Reduced metrics to MVP essentials (token usage, success/failure, basic latency)
- Streamlined testing to integration tests only
- Updated success criteria to match simplified scope
- Document reduced from 389 to 317 lines (19% reduction)

Phase 3 Ideas (New Document):
- Extracted advanced features to langfuse-phase3-ideas.md
- Includes: Backend/Operator instrumentation, multi-tenancy, feedback loops
- Includes: Prompt management, cost alerts, ROSA deployment
- Provides clear roadmap for post-Phase 2 enhancements

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@github-actions

Code Review - Langfuse Phase 1 Deployment

Overall Assessment

Excellent work! This PR delivers a well-structured, production-ready foundation for Langfuse observability. The implementation follows infrastructure-as-code best practices with thoughtful documentation and clear separation of concerns.


Strengths

1. Deployment Script Quality ⭐

The deploy-langfuse-kind.sh script demonstrates excellent DevOps practices:

  • ✅ Comprehensive prerequisite checks
  • ✅ Automatic secret generation using openssl rand
  • ✅ Wait conditions for pod readiness
  • ✅ Automatic S3 credential patching (solving known upstream issue)
  • ✅ Clear status output with actionable next steps
  • ✅ Proper error handling with set -euo pipefail

Highlight: The S3 credential fix (lines 141-254) shows deep understanding of the Helm chart limitations and proactive problem-solving.

2. Documentation Excellence 📚

The langfuse-helm-poc.md is comprehensive and well-organized:

  • ✅ Clear architecture diagrams
  • ✅ Resource requirements specified
  • ✅ Troubleshooting section with actual solutions
  • ✅ Multiple deployment scenarios covered
  • ✅ Integration with existing e2e/ infrastructure

3. Agent Definition 🤖

The langfuse-rosa-expert agent is well-designed with comprehensive competency mapping and clear operational methodology.

4. Security Practices 🔒

  • ✅ Secure credential generation
  • ✅ Credentials stored in .env.langfuse (must be gitignored; see the critical issue below)
  • ✅ No hardcoded secrets
  • ✅ Minimal permissions in Helm values

Issues & Concerns

🔴 Critical: Missing .gitignore Entry

File: e2e/.gitignore or root .gitignore

The script generates e2e/.env.langfuse with sensitive credentials, but I don't see this file added to .gitignore.

Required action: Add .env.langfuse to .gitignore

Risk: Without this, developers could accidentally commit sensitive credentials.


🟡 Medium Issues

1. Helm Timeout Configuration (e2e/scripts/deploy-langfuse-kind.sh:111)

  • Current: --timeout=10m
  • Issue: May be insufficient on resource-constrained systems
  • Recommendation: Increase to 15m or make configurable via env var

2. Cleanup Script Host File Management (e2e/scripts/cleanup-langfuse.sh:43-48)

  • Issues: Creates multiple backup files, could remove unintended entries
  • Recommendation: Use exact match pattern for safer removal

3. Resource Allocation Hardcoded (e2e/scripts/deploy-langfuse-kind.sh:92-109)

  • Issue: Different environments have different capacity needs
  • Recommendation: Create values override files (langfuse-values-kind.yaml, langfuse-values-rosa.yaml)

🟢 Minor Enhancements

  1. Add openssl prerequisite check
  2. Externalize S3 patch JSON to separate file
  3. Add HTTP health check validation after deployment
  4. Document Makefile target in CLAUDE.md
  5. Add shellcheck to CI pipeline
  6. Create automated smoke test script
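
The openssl prerequisite check from item 1 could follow the usual command -v pattern. A minimal sketch; require_commands is a hypothetical helper and the exact tool list is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report every missing command, then fail if any was missing.
require_commands() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "❌ Required command not found: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Near the top of the deploy script this might read require_commands kubectl helm kind openssl || exit 1. Checking all tools before failing lets the user install everything in one pass.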

Performance Considerations

Good: Resource-constrained optimizations for kind (ClickHouse replica: 1, ZooKeeper: 1, anti-affinity disabled)

⚠️ Warning: Current configuration is NOT suitable for production ROSA deployment. Production requires:

  • High-availability PostgreSQL (external RDS or CrunchyData)
  • ClickHouse sharding and replication
  • Redis Sentinel for HA
  • External S3 (AWS S3, not MinIO)
  • TLS/HTTPS with proper certificates
  • Network policies and security contexts

Code Quality

✅ Excellent

  • set -euo pipefail for safety
  • Descriptive variable names
  • Clear progress messages
  • Proper exit codes
  • Idempotent design

Recommendations Summary

Must Fix Before Merge 🔴

  1. Add e2e/.env.langfuse to .gitignore

Should Fix Before Merge 🟡

  1. Increase Helm timeout to 15 minutes
  2. Improve /etc/hosts cleanup safety
  3. Add openssl prerequisite check

Nice to Have 🟢

  1. Add HTTP health check validation
  2. Externalize S3 patch JSON
  3. Add Makefile target to CLAUDE.md
  4. Add production deployment warning comment
  5. Add shellcheck to CI
  6. Create smoke test script

Conclusion

This is high-quality infrastructure code that follows best practices. The only blocking issue is the missing .gitignore entry.

Approval: ✅ Approve after adding .gitignore entry


Next Steps for Phase 2

  1. Create langfuse_config.py for SDK initialization
  2. Add langfuse>=2.0.0 to runner requirements
  3. Create ConfigMap/Secret in ambient-code namespace
  4. Update backend operator to inject Langfuse env vars into runner Jobs
  5. Add trace/span wrappers around Claude Code CLI invocations

Happy to review Phase 2 PR when ready! 🚀

Created comprehensive, executable implementation guide for instrumenting
the Claude Code Runner's _run_claude_agent_sdk() method with Langfuse.

Prompt includes:
- Step-by-step implementation with exact file paths and line numbers
- Complete code examples (not just diffs) for copy-paste implementation
- Dependency setup (langfuse>=2.53.3 in pyproject.toml)
- Operator configuration for env var injection
- Kubernetes resource creation (ConfigMap + Secret)
- Comprehensive testing strategy with validation commands
- Troubleshooting guide for common issues
- Success criteria and rollback plan

Target: Single codepath instrumentation (wrapper.py:152-469)
Expected outcome: Session-level traces with token usage, cost, latency

Optimized for:
- AI agent consumption (Claude Code can execute directly)
- Human implementation (clear step-by-step guide)
- Production deployment (includes graceful degradation)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@github-actions

Pull Request Review: Langfuse Deployment for kind (Phase 1)

Overall Assessment

APPROVED - This is a well-structured Phase 1 implementation with excellent documentation, secure credential handling, and thoughtful script design. The code is production-ready with only minor suggestions for enhancement.


Strengths

1. Excellent Script Design ⭐

  • Robust error handling: set -euo pipefail prevents silent failures
  • Idempotent operations: Gracefully handles existing resources
  • Smart runtime detection: Auto-detects Docker/Podman with proper KIND provider configuration
  • Comprehensive validation: Checks prerequisites before proceeding
  • User-friendly output: Clear status messages with visual indicators (✅, ❌, ℹ️)

2. Security Best Practices ⭐

  • Strong credential generation: Uses openssl rand -base64 32 for all secrets
  • Gitignore coverage: .env is already ignored, but a plain .env pattern matches only a file named exactly .env, so .env.langfuse needs its own entry
  • Credential isolation: Saves to local file only, never commits to repo
  • Secret references: Patches deployments to use Kubernetes secrets properly

3. S3 Credential Fix ⭐

The automatic patching of langfuse-web and langfuse-worker deployments (lines 141-254) is excellent:

  • Addresses upstream Helm chart gap
  • Uses JSON patch for surgical updates
  • Waits for rollout completion
  • Well-documented in troubleshooting guide

4. Documentation Excellence ⭐

  • Comprehensive POC guide: 476 lines covering architecture, prerequisites, troubleshooting
  • Phase planning: Clear separation of Phase 1 (deployment) vs Phase 2 (instrumentation)
  • Agent definition: Well-designed langfuse-rosa-expert agent with SRE collaboration patterns
  • Troubleshooting section: Documents the S3 issue with root cause analysis

5. Cleanup Script ⭐

  • Optional cluster deletion with --delete-cluster flag
  • Creates /etc/hosts backup before modification
  • Removes credentials file to prevent accidental commits
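
The backup-then-edit behaviour for the hosts file can be sketched as below. The function name and the backup naming scheme are assumptions, and the demonstration runs against a scratch file; changing the real /etc/hosts would additionally require sudo.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Remove every line mentioning HOSTNAME from HOSTS_FILE, keeping a timestamped backup.
# Prints the backup path.
remove_host_entry() {
  local hosts_file="$1" hostname="$2" backup
  backup="${hosts_file}.backup.$(date +%Y%m%d%H%M%S)"
  cp "$hosts_file" "$backup"
  # grep exits 1 when it outputs no lines at all; that is not an error here.
  grep -v -w -F -- "$hostname" "$backup" > "$hosts_file" || true
  echo "$backup"
}

# Demonstration on a scratch copy, not on /etc/hosts itself.
scratch="$(mktemp)"
printf '127.0.0.1 localhost\n127.0.0.1 langfuse.local\n' > "$scratch"
backup_path="$(remove_host_entry "$scratch" "langfuse.local")"
cat "$scratch"
rm -f "$scratch" "$backup_path"
```

Because each run writes a new timestamped file, backups accumulate until they are removed by hand.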

Suggestions for Improvement

1. Gitignore Specificity (Low Priority)

A bare .env pattern matches only a file named exactly .env, not .env.langfuse, so unless a broader pattern is already present, add an explicit entry:

# Environments
.env
.env.uat
+.env.langfuse
.env.test

Location: .gitignore:82-83
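
Whether a given pattern really covers .env.langfuse can be checked with `git check-ignore`. The sketch below does so in a throwaway repository whose .gitignore contains only `.env`; the helper name `is_ignored` is an assumption for this example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Succeeds when the path in $2 is ignored by the repository at $1.
is_ignored() {
  git -C "$1" check-ignore -q -- "$2"
}

# Throwaway repository with a single ignore pattern.
repo="$(mktemp -d)"
git -C "$repo" init -q
printf '.env\n' > "$repo/.gitignore"

for candidate in .env .env.langfuse; do
  if is_ignored "$repo" "$candidate"; then
    echo "$candidate: ignored"
  else
    echo "$candidate: NOT ignored"
  fi
done
rm -rf "$repo"
```

Running the same check in the real repository shows whether an explicit `.env.langfuse` entry is still needed.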

2. Shellcheck Validation (Enhancement)

Consider adding shellcheck validation to CI pipeline for bash scripts:

# .github/workflows/shellcheck.yml
- name: Run shellcheck
  run: |
    shellcheck e2e/scripts/*.sh

This would catch potential issues early, similar to how golangci-lint works for Go code.

3. Script Error Messages (Minor)

In deploy-langfuse-kind.sh:36, consider adding the setup command to error message:

echo "❌ Kind cluster 'vteam-e2e' not found"
-echo "   Run './scripts/setup-kind.sh' first"
+echo "   Run 'cd e2e && ./scripts/setup-kind.sh' first"

Makes it easier for users unfamiliar with the repo structure.

4. Documentation Cross-References (Enhancement)

In docs/deployment/langfuse-helm-poc.md:84, consider adding reference to Makefile target:

### Step 2: Deploy Langfuse

+**Using Makefile** (from project root):
+```bash
+make deploy-langfuse-kind
+```
+
+**Direct script** (from e2e directory):
```bash
./scripts/deploy-langfuse-kind.sh
```

5. Agent Definition - Tool Access (Question)

In agents/langfuse-rosa-expert.md:3, the description appears in the Task tool examples, but the frontmatter doesn't specify tools. Should this agent have "(Tools: *)" in the description?

Current:

```yaml
name: langfuse-rosa-expert
description: Use this agent when working with LangFuse deployments...
model: sonnet
```

Consider:

description: ... (Tools: Read, Write, Edit, Bash, WebSearch, WebFetch)

This follows the pattern from other agents in CLAUDE.md.


Security Review ✅

  • ✅ No hardcoded credentials
  • ✅ Secure random generation for all secrets
  • ✅ Credentials saved locally, not committed
  • ✅ Proper Kubernetes secret references
  • ✅ /etc/hosts modifications use sudo appropriately
  • ✅ Cleanup script removes sensitive files

Code Quality ✅

Bash Scripts

  • ✅ Proper error handling (set -euo pipefail)
  • ✅ Consistent quoting and variable expansion
  • ✅ Heredoc usage for JSON patches (prevents escaping issues)
  • ✅ Idempotent operations throughout
  • ✅ Clear variable naming

Makefile

  • ✅ Proper .PHONY declaration
  • ✅ Help text follows project convention
  • ✅ Working directory change with @cd e2e

Documentation

  • ✅ Consistent markdown formatting
  • ✅ Clear code blocks with language tags
  • ✅ Proper heading hierarchy
  • ✅ Comprehensive troubleshooting section

Testing Recommendations

Pre-Merge Testing

# Test deployment
make deploy-langfuse-kind

# Verify all pods running
kubectl get pods -n langfuse

# Test accessibility
curl -I http://langfuse.local:8080  # Podman
# or
curl -I http://langfuse.local  # Docker

# Test cleanup (the subshell keeps the working directory at the project root)
(cd e2e && ./scripts/cleanup-langfuse.sh)

# Test cleanup with cluster deletion
(cd e2e && ./scripts/cleanup-langfuse.sh --delete-cluster)

Future Integration Tests (Phase 2)

Consider adding to e2e test suite:

  • Deploy Langfuse via script
  • Create test trace via API
  • Verify trace appears in UI
  • Cleanup

Architecture Alignment ✅

This PR follows project standards from CLAUDE.md:

  • Container Images: Uses quay.io registry pattern
  • Kubernetes/OpenShift: Proper namespace isolation
  • Git Workflow: Feature branch with conventional commits
  • Documentation: MkDocs structure under docs/deployment/
  • Development Commands: Makefile integration

Performance Considerations

Resource Limits (deploy-langfuse-kind.sh:92-109)

The script sets conservative limits appropriate for kind:

  • langfuse-web/worker: 1 CPU / 2Gi RAM
  • clickhouse: 1 CPU / 1Gi RAM
  • zookeeper: 500m CPU / 512Mi RAM

These are suitable for local development. For production ROSA deployments, the langfuse-rosa-expert agent should recommend higher limits based on workload.

Single Replicas

The script deploys single replicas for all components (replicaCount=1). This is correct for local kind clusters. Phase 2 should document HA requirements for production.


Phase 2 Readiness ✅

This PR sets excellent groundwork for Phase 2:

  • Working deployment: Platform team can deploy Langfuse locally
  • Documentation foundation: Clear architecture diagrams and troubleshooting
  • Agent ready: langfuse-rosa-expert defined for ROSA work
  • Context documents: Phase 2 implementation prompt available

Phase 2 can now focus purely on instrumentation without deployment concerns.


Final Verdict

APPROVE AND MERGE

This PR demonstrates:

  • High code quality and attention to detail
  • Strong security practices
  • Excellent documentation
  • Thoughtful script design with error handling
  • Clear phase separation

The minor suggestions above are enhancements, not blockers. This is production-ready code that follows all project standards.

Estimated Merge Risk: Low
Test Coverage: Manual testing recommended before merge (see Testing Recommendations)
Breaking Changes: None - purely additive


Next Steps Post-Merge

  1. Test deployment on both Docker and Podman environments
  2. Add .env.langfuse explicitly to gitignore (optional)
  3. Consider shellcheck CI integration (optional)
  4. Proceed with Phase 2 instrumentation using this foundation

Great work on this POC! 🎉

Changed from full method replacement (318 lines) to surgical insertions
at 4 targeted points (~81 new lines, 25% method growth).

Key improvements:
- Lower risk: Only modifies specific insertion points, preserves 75% of existing code
- Easier review: Focused diffs show exact instrumentation locations
- Better maintainability: Future SDK changes less likely to conflict
- Same capability: Captures all usage data, costs, and metadata

Implementation changes:
- Insertion Point 1 (line 308): Initialize Langfuse trace (22 lines)
- Insertion Point 2 (lines 410-413): Wrap queries with generation spans (12 lines)
- Insertion Point 3 (lines 385-402): Capture usage from ResultMessage (25 lines)
- Insertion Point 4 (line 456): Finalize trace and flush (22 lines)

Benefits:
✅ 75% of method stays untouched (vs 0% with full replacement)
✅ Git diffs show clear insertion points (vs entire method rewrite)
✅ Merge-friendly for future upstream changes
✅ Isolated testing (verify instrumentation independently)

Document structure:
- Added "Why surgical approach?" section with comparison table
- Each insertion point shows exact line numbers and surrounding context
- Clear before/after code snippets for accurate placement
- Summary showing total line changes and rationale

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@github-actions

Code Review: Langfuse Deployment for kind (Phase 1)

Thank you for this well-structured PR! This is a solid foundation for adding LLM observability to the platform.

✅ Strengths

Excellent Script Quality

  • Robust error handling with set -euo pipefail
  • Clear user feedback with emoji indicators
  • Idempotent design handling pre-existing resources
  • Auto-generated secrets using openssl rand
  • Comprehensive documentation

Smart Engineering Decisions

  • Using upstream Helm chart minimizes maintenance
  • Automatic Docker/Podman detection
  • Proactive S3 credential fix for upstream issue
  • Resource-optimized for kind

Comprehensive Documentation

  • Excellent langfuse-helm-poc.md with architecture diagrams
  • Clear Phase 2/3 roadmap
  • Well-structured langfuse-rosa-expert agent

🔍 Issues Found

🟡 Medium Priority

1. Credentials File Permissions (deploy-langfuse-kind.sh:273-282)
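The .env.langfuse file should be created with mode 600 (chmod 600) so that other local users cannot read the generated credentials.

2. S3 Patch Race Condition (deploy-langfuse-kind.sh:230-249)
Consider adding an explicit wait for the langfuse-web and langfuse-worker deployments before patching them.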

3. Hardcoded StatefulSet Names (deploy-langfuse-kind.sh:130-137)
Current approach works but dynamic discovery would be more robust.
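
For the credentials file permissions in item 1, one possible approach is sketched below. The helper name `write_env_file` and the dummy key are assumptions; the point is that a restrictive umask protects the file from the moment it is created, and `chmod 600` covers a file that already existed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write KEY=VALUE lines to FILE so that only the owner can read or write it.
write_env_file() {
  local file="$1"
  shift
  # The restrictive umask applies only inside this subshell, so a newly
  # created file is never readable by other users, even briefly.
  (
    umask 077
    printf '%s\n' "$@" > "$file"
  )
  # Also tighten a file that already existed with looser permissions.
  chmod 600 "$file"
}

# Demonstration in a scratch directory with a dummy value.
scratch_dir="$(mktemp -d)"
write_env_file "$scratch_dir/.env.langfuse" "EXAMPLE_KEY=example-value"
ls -l "$scratch_dir/.env.langfuse"
rm -rf "$scratch_dir"
```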

🟢 Low Priority

  • Verify .env.langfuse is in e2e/.gitignore
  • Enhance Makefile target documentation
  • Document /etc/hosts backup accumulation behavior

🔒 Security Considerations

✅ Good Practices:

  • Auto-generated secrets (32 bytes base64)
  • Helm values via --set
  • No hardcoded credentials

📋 For Phase 2/Production:

  • Consider Sealed Secrets or External Secrets Operator
  • Add NetworkPolicy resources
  • Enable ingress TLS
  • Document RBAC requirements

📊 Testing & Validation

Suggested additions:

  1. Add shellcheck validation to CI
  2. Add smoke test for Langfuse health endpoint
  3. Consider GitHub Actions workflow for deployment testing
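
A smoke test of the kind suggested in item 2 could be built on a small retry loop, sketched here. The helper names are hypothetical, and the `/api/public/health` path and the example URL are assumptions to verify against the deployed Langfuse version.

```shell
#!/usr/bin/env bash
set -euo pipefail

# retry ATTEMPTS DELAY_SECONDS COMMAND...: re-run COMMAND until it succeeds
# or the attempts are used up.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Hypothetical probe: succeeds when the health endpoint answers with a 2xx status.
langfuse_healthy() {
  curl -fsS --max-time 5 "$1/api/public/health" >/dev/null
}

# Against a live deployment (not run here):
#   retry 30 5 langfuse_healthy "http://langfuse.local:8080"
```

The same `retry` helper could also wrap a check that the deployments exist before they are patched.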

🎯 Alignment with CLAUDE.md Standards

✅ Follows Guidelines:

  • Bash best practices
  • Documentation standards
  • Security practices
  • Deployment validation

📋 Minor Gaps:

  • No CI/CD integration for testing
  • No shellcheck validation

🔢 Metrics

  • Lines changed: 2382 additions / 1 deletion
  • Files changed: 9 files
  • Documentation ratio: ~62%
  • Script size: 310 + 82 lines

🏆 Overall Assessment

Rating: Approve with Minor Suggestions ⭐⭐⭐⭐½

This is high-quality infrastructure code with excellent documentation. Scripts are well-written, deployment approach is sound, and Phase 2/3 planning shows strategic thinking.

Recommendation: Merge after addressing credential file permissions (chmod 600). Other suggestions can be addressed in follow-up PRs.

Great work on this foundational piece!

Pre-Merge Checklist

  • Address credential file permissions (chmod 600)
  • Verify .env.langfuse is in .gitignore
  • Consider adding shellcheck to CI
  • Optional: Add deployment smoke test


jeremyeder added a commit that referenced this pull request Nov 13, 2025
This implements complete LLM observability for the Claude Code Runner using
the surgical instrumentation approach (4 targeted insertion points vs full
method replacement).

## Changes Summary

### 1. Runner Dependencies (pyproject.toml)
- ✨ Add langfuse 3.9.1 (latest, Nov 6 2025)
- ⬆️  Update anthropic to 0.72.0 (from 0.68.0)
- ⬆️  Update claude-agent-sdk to 0.1.6 (from 0.1.4)
- All dependencies Python 3.13 compatible

### 2. Runner Instrumentation (wrapper.py)
**Import Changes:**
- Add Langfuse SDK imports (using 3.x API)
- Note: Langfuse 3.x changed API - no longer uses langfuse.decorators

**__init__ Changes (lines 38-51):**
- Initialize Langfuse client with env-based config
- Graceful degradation if LANGFUSE_ENABLED=false or keys missing
- Single client instance reused for all traces in session

**_run_claude_agent_sdk() Instrumentation (4 insertion points):**

Insertion Point 1 (lines 332-352): Session-level trace initialization
- Creates trace with session metadata (namespace, project, model, workspace)
- Links to Kubernetes session ID for cross-component correlation
- Initialize generation_span variable for per-query tracking

Insertion Point 2 (lines 455-472): Per-query generation spans
- Wraps each Claude query with generation span
- Captures prompt input and model name
- Uses nonlocal to update parent scope variable

Insertion Point 3 (lines 430-473): Usage data capture from ResultMessage
- Extracts token counts (input/output/total) from SDK result
- Records cost_usd, duration_ms, duration_api_ms
- Ends generation span and clears for next query

Insertion Point 4 (lines 540-561): Trace finalization and flush
- Updates trace with final session outcome (success, turns)
- Aggregates total cost and duration
- CRITICAL flush() call ensures data sent before pod exit

**Total Modification**: ~81 new lines across 4 insertions (~13.5% of method)

### 3. Operator Configuration (sessions.go)
**EnvFrom Changes (lines 575-609):**
- Add langfuse-keys Secret injection (Optional: true)
- Add langfuse-config ConfigMap injection (Optional: true)
- Maintain existing runnerSecretsName logic
- Optional flag ensures pods start even without Langfuse

### 4. Kubernetes Manifests (langfuse/langfuse-config.yaml)
**New ConfigMap:**
- LANGFUSE_HOST: cluster-internal URL (langfuse-web.langfuse.svc.cluster.local:3000)
- LANGFUSE_ENABLED: "true" (feature flag)

**New Secret:**
- LANGFUSE_PUBLIC_KEY: pk-lf-REPLACE-ME (placeholder)
- LANGFUSE_SECRET_KEY: sk-lf-REPLACE-ME (placeholder)

### 5. Documentation (langfuse-phase2-implementation-prompt.md)
- Complete step-by-step implementation guide (787 lines)
- Troubleshooting procedures
- Testing validation steps

## Breaking Changes

⚠️ **Langfuse 3.x API Migration**
The implementation uses Langfuse 3.9.1 which has breaking changes from 2.x:
- OLD: `from langfuse.decorators import langfuse_context, observe`
- NEW: `from langfuse import Langfuse, observe`
- `langfuse_context` no longer exists in 3.x

## Validation Completed

✅ Local Testing (Python 3.13 venv):
- Dependencies install successfully
- Langfuse 3.9.1 imports correctly
- API compatibility verified

⏭️  Cluster Testing (requires deployment):
- Kubernetes manifests apply correctly
- Traces appear in Langfuse UI
- Token usage data captured
- Cost tracking operational
- Interactive mode works

## Deployment Instructions

1. **Deploy Manifests** (creates the langfuse-keys Secret with placeholder values):
   ```bash
   kubectl apply -f components/manifests/langfuse/langfuse-config.yaml
   ```

2. **Update Langfuse Secret** (before deploying runner):
   ```bash
   # Get keys from Langfuse UI → Settings → API Keys
   kubectl edit secret langfuse-keys -n ambient-code
   # Replace pk-lf-REPLACE-ME and sk-lf-REPLACE-ME
   # (kubectl edit shows Secret values base64-encoded under data)
   ```

3. **Rebuild Runner Image**:
   ```bash
   cd components/runners/claude-code-runner
   make build CONTAINER_ENGINE=podman
   ```

4. **Test AgenticSession**:
   Create session and check logs for "Langfuse client initialized" message
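
When the placeholders are replaced through `kubectl edit`, Secret values appear base64-encoded under `data`, so the real keys need encoding first. A minimal sketch follows; the helper name `encode_secret_value` is an assumption for this example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Encode a value the way it must appear under `data:` in a Secret.
# printf avoids the trailing newline that echo would add; tr strips line wrapping.
encode_secret_value() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# The manifest's placeholder, used here only to show the encoding step.
encode_secret_value "pk-lf-REPLACE-ME"
echo
```

Paste the encoded output in place of the existing value for the matching key.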

## Next Steps (Phase 3)

Phase 3 enhancements documented in `langfuse-phase3-ideas.md`:
- Backend API instrumentation (Go)
- Operator instrumentation (Go)
- Multi-tenant project isolation
- Advanced metrics (prompt analysis, feedback loops)
- ROSA production deployment

## Related

- Phase 1 PR: #30 (Langfuse deployment and S3 fixes)
- Context Doc: docs/deployment/langfuse-phase2-context.md (reference only)
- Phase 3 Ideas: docs/deployment/langfuse-phase3-ideas.md (future work)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>