feat: Add waypoint mode with automatic gateway provisioning#259

Open
akram wants to merge 4 commits into kagenti:main from akram:feat/waypoint-default-mode

akram commented Apr 3, 2026

Summary

This PR implements waypoint mode using Istio ambient mesh as the new default deployment pattern for Kagenti agents, replacing the legacy sidecar mode.

Key changes:

  • NamespaceWaypointReconciler: Automatically provisions Istio Gateway resources and applies ambient mesh labels to agent namespaces
  • Centralized security: keycloak-admin-secret is now read exclusively from the operator namespace (kagenti-system), never from agent namespaces
  • Resource efficiency: Reduces container count by 75% (1 container per pod vs. 4 in sidecar mode) with shared L7 waypoint gateways
  • Comprehensive documentation: Design docs, migration guides, and Mermaid.js architecture diagrams

Architecture:

  • Single-container agent pods (no sidecars)
  • Shared waypoint gateways for L7 proxy (per namespace)
  • ztunnel DaemonSet for L4 mTLS (Istio ambient mesh)
  • Operator-managed OIDC client registration and credential distribution

Feature gates:

  • --enable-waypoint-provisioning: Enable automatic waypoint gateway creation (default: true)
  • --enable-operator-client-registration: Enable centralized Keycloak client registration (default: true)
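
A minimal Go sketch of how the two feature gates above could be registered with the standard `flag` package, both defaulting to true. The flag names and defaults come from this description; the `featureGates` struct and `parseFeatureGates` helper are hypothetical names used only for illustration and are not taken from the operator's code.

```go
package main

import (
	"flag"
	"fmt"
)

// featureGates holds the two switches named in the PR description.
// The struct itself is a hypothetical helper for this sketch.
type featureGates struct {
	WaypointProvisioning       bool
	OperatorClientRegistration bool
}

// parseFeatureGates registers both flags with a default of true and parses args.
// A dedicated FlagSet keeps the sketch independent of process-wide flag state.
func parseFeatureGates(args []string) (featureGates, error) {
	var g featureGates
	fs := flag.NewFlagSet("operator", flag.ContinueOnError)
	fs.BoolVar(&g.WaypointProvisioning, "enable-waypoint-provisioning", true,
		"Enable automatic waypoint gateway creation")
	fs.BoolVar(&g.OperatorClientRegistration, "enable-operator-client-registration", true,
		"Enable centralized Keycloak client registration")
	err := fs.Parse(args)
	return g, err
}

func main() {
	// With no arguments, both gates stay at their default (true).
	defaults, _ := parseFeatureGates(nil)
	fmt.Printf("defaults: %+v\n", defaults)

	// Rollback sketch: turn waypoint provisioning off explicitly.
	rolledBack, _ := parseFeatureGates([]string{"--enable-waypoint-provisioning=false"})
	fmt.Printf("rollback: %+v\n", rolledBack)
}
```

Passing `--enable-waypoint-provisioning=false` is the kind of switch the rollback path below would rely on, assuming the operator honors the flag at startup.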

Migration path:

  • Blue-green namespace-by-namespace migration from sidecar to waypoint mode
  • Full rollback support via feature gates
  • Backward compatible with existing sidecar deployments

Documentation

  • docs/waypoint-mode.md: Architecture, design principles, deployment guide
  • docs/migration-sidecar-to-waypoint.md: Step-by-step migration procedure
  • docs/diagrams/: Mermaid.js diagrams (architecture, flows, security model)

Test plan

  • E2E tests validated waypoint gateway auto-creation
  • E2E tests validated Keycloak client registration
  • E2E tests validated agent token acquisition and exchange
  • Security tests verified admin secret isolation
  • Review documentation for accuracy
  • Test migration procedure on staging cluster

🤖 Generated with Claude Code

…ioning

Implement automatic Istio waypoint gateway provisioning for namespaces
containing Kagenti agent or tool workloads, with fixes for controller
startup and centralized configuration support.

NamespaceWaypointReconciler Features:
- Watches namespaces and pods with kagenti.io/type=agent|tool labels
- Automatically applies Istio ambient mesh labels to namespaces
- Creates waypoint gateways using istio-waypoint GatewayClass
- Configures HBONE protocol listeners on port 15008
- Controlled by --enable-waypoint-provisioning flag (defaults to true)
- Triggers namespace reconciliation on pod create/update/delete events
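
The trigger condition described above can be sketched in plain Go as follows. This is an illustration under stated assumptions, not the reconciler's actual code: only the `kagenti.io/type` label and its `agent` and `tool` values come from this commit message, while `isKagentiWorkload` and `namespaceNeedsWaypoint` are hypothetical helper names, and pods are modeled as bare label maps instead of Kubernetes objects.

```go
package main

import "fmt"

// typeLabel is the pod label the reconciler watches, per the commit message.
const typeLabel = "kagenti.io/type"

// isKagentiWorkload reports whether a pod's labels mark it as an agent or tool.
// Reading from a nil map is safe in Go and yields the empty string.
func isKagentiWorkload(podLabels map[string]string) bool {
	switch podLabels[typeLabel] {
	case "agent", "tool":
		return true
	default:
		return false
	}
}

// namespaceNeedsWaypoint returns true when provisioning is enabled and at
// least one pod in the namespace is a Kagenti agent or tool.
func namespaceNeedsWaypoint(enabled bool, pods []map[string]string) bool {
	if !enabled {
		return false
	}
	for _, labels := range pods {
		if isKagentiWorkload(labels) {
			return true
		}
	}
	return false
}

func main() {
	pods := []map[string]string{
		{"app": "postgres"},
		{typeLabel: "agent"},
	}
	fmt.Println(namespaceNeedsWaypoint(true, pods))  // provisioning enabled
	fmt.Println(namespaceNeedsWaypoint(false, pods)) // provisioning disabled
}
```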

Istio Ambient Mesh Configuration:
- Namespace labels: istio-discovery=enabled, istio.io/dataplane-mode=ambient
- Waypoint reference: istio.io/use-waypoint=<namespace>-waypoint
- Gateway labels: istio.io/waypoint-for=all
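
To make the label scheme and the resulting Gateway concrete, here is a rough Go sketch that builds both as plain maps and prints the Gateway as JSON. The label keys and values, the `<namespace>-waypoint` naming, the `istio-waypoint` GatewayClass, and the HBONE listener on port 15008 come from this commit message. The `gateway.networking.k8s.io/v1` apiVersion and the listener name `mesh` are assumptions based on common Istio waypoint manifests, and the function names are hypothetical; the operator itself presumably uses typed Gateway API objects.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// waypointName derives the gateway name from the namespace name.
func waypointName(namespace string) string {
	return namespace + "-waypoint"
}

// namespaceLabels returns the ambient mesh labels applied to an agent namespace.
func namespaceLabels(namespace string) map[string]string {
	return map[string]string{
		"istio-discovery":         "enabled",
		"istio.io/dataplane-mode": "ambient",
		"istio.io/use-waypoint":   waypointName(namespace),
	}
}

// waypointGateway builds an untyped Gateway manifest for the namespace.
// apiVersion and the listener name "mesh" are assumptions (see lead-in).
func waypointGateway(namespace string) map[string]any {
	return map[string]any{
		"apiVersion": "gateway.networking.k8s.io/v1",
		"kind":       "Gateway",
		"metadata": map[string]any{
			"name":      waypointName(namespace),
			"namespace": namespace,
			"labels":    map[string]string{"istio.io/waypoint-for": "all"},
		},
		"spec": map[string]any{
			"gatewayClassName": "istio-waypoint",
			"listeners": []map[string]any{
				{"name": "mesh", "port": 15008, "protocol": "HBONE"},
			},
		},
	}
}

func main() {
	fmt.Println(namespaceLabels("test-ns-alpha"))

	out, err := json.MarshalIndent(waypointGateway("test-ns-alpha"), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```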

Cache Configuration Fixes:
- Removed DefaultNamespaces configuration (was nil, causing issues)
- Removed explicit ByObject entries for Namespace, Pod, Deployment, StatefulSet, Gateway
- Kept only ConfigMap in ByObject with label selectors for kagenti-relevant ConfigMaps
- All other resources now use default cluster-wide cache (controller-runtime defaults)
- Added detailed comments explaining cache configuration rationale

The root issue was that explicitly adding cluster-scoped resources (Namespace) or
workload resources to ByObject prevented controllers from starting properly. By
removing these entries and relying on controller-runtime defaults, all controllers
now start and reconcile correctly.

Client Registration Enhancements:
- Added support for centralized kagenti-operator-config in operator namespace
- First checks kagenti-system/kagenti-operator-config (preferred for waypoint mode)
- Falls back to per-namespace authbridge-config (backward compatibility for sidecar mode)
- Added OperatorNamespace field to ClientRegistrationReconciler
- Improved error messages to indicate which ConfigMap source is being used
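
The lookup order above (centralized ConfigMap first, per-namespace fallback second) can be sketched as below. This is a simplified illustration, not the reconciler's code: the ConfigMap names and the `kagenti-system` namespace come from this commit message, while `configStore`, `resolveConfig`, and the `mode` data key are hypothetical, and an in-memory map stands in for the Kubernetes API.

```go
package main

import (
	"errors"
	"fmt"
)

// configKey identifies a ConfigMap by namespace and name.
type configKey struct {
	Namespace string
	Name      string
}

// configStore is a hypothetical in-memory stand-in for the Kubernetes API.
type configStore map[configKey]map[string]string

var errNoConfig = errors.New("no client registration config found")

// resolveConfig prefers the centralized ConfigMap in the operator namespace and
// falls back to the per-namespace authbridge-config. It also returns the source
// that was used so error messages and logs can name it.
func resolveConfig(store configStore, operatorNS, workloadNS string) (map[string]string, string, error) {
	central := configKey{Namespace: operatorNS, Name: "kagenti-operator-config"}
	if data, ok := store[central]; ok {
		return data, central.Namespace + "/" + central.Name, nil
	}
	legacy := configKey{Namespace: workloadNS, Name: "authbridge-config"}
	if data, ok := store[legacy]; ok {
		return data, legacy.Namespace + "/" + legacy.Name, nil
	}
	return nil, "", fmt.Errorf("%w: tried %s/%s and %s/%s", errNoConfig,
		central.Namespace, central.Name, legacy.Namespace, legacy.Name)
}

func main() {
	// Only the legacy per-namespace ConfigMap exists: the fallback is used.
	store := configStore{
		{Namespace: "legacy-ns", Name: "authbridge-config"}: {"mode": "sidecar"},
	}
	_, source, err := resolveConfig(store, "kagenti-system", "legacy-ns")
	fmt.Println(source, err)

	// Once the centralized ConfigMap exists, it takes precedence.
	store[configKey{Namespace: "kagenti-system", Name: "kagenti-operator-config"}] =
		map[string]string{"mode": "waypoint"}
	_, source, err = resolveConfig(store, "kagenti-system", "legacy-ns")
	fmt.Println(source, err)
}
```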

Debug Logging:
- Added comprehensive debug logging to NamespaceWaypointReconciler
- Logs controller setup (success/failure) at startup
- Logs every reconcile invocation with namespace and enabled status
- Helps troubleshoot controller startup and reconciliation issues

Dependencies:
- Added sigs.k8s.io/gateway-api v1.2.1 for Gateway resource support

Testing Validated:
- Automatic waypoint gateway provisioning works end-to-end
- Created test-ns-alpha and test-ns-beta namespaces with agent pods
- Waypoint gateways auto-created within 19 seconds
- Istio labels automatically applied to namespaces
- Operator-managed client registration in Keycloak working
- OAuth 2.0 token exchange between agents validated
- All controllers starting and reconciling properly
- Single-container agent pods confirmed (waypoint mode active)

RBAC Requirements:
- Added ClusterRole permissions for gateways.gateway.networking.k8s.io
- Added permissions for namespaces (get, list, watch, update, patch)
- Required for waypoint gateway creation and namespace label management

This implements Phase 3 (Operator Modifications) of the waypoint implementation
plan, enabling zero-touch namespace configuration for Istio ambient mesh with
centralized L7 authentication via waypoint gateways.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Akram <akram.benaissi@gmail.com>
akram force-pushed the feat/waypoint-default-mode branch from 54ec33f to 2e2b17d on April 3, 2026 17:24
akram added 3 commits April 3, 2026 19:34
…espace

Changed the ClientRegistrationReconciler to read the keycloak-admin-secret
from the operator namespace (kagenti-system) instead of agent namespaces.
This improves security by centralizing access to Keycloak admin credentials.

Security Benefits:
- Admin credentials only exist in operator namespace (kagenti-system)
- Agent namespaces never have access to Keycloak admin username/password
- Reduces attack surface - compromised agent namespace cannot access admin API
- Follows principle of least privilege
- Aligns with centralized configuration pattern (kagenti-operator-config)

Changes to ClientRegistrationReconciler:
- Read keycloak-admin-secret from r.OperatorNamespace instead of workload namespace
- Updated error messages to indicate operator namespace location
- Added comments explaining security model and secret location
- Updated APIReader comment to reflect new secret location

Documentation Updates (operator-managed-client-registration.md):
- Clarified that keycloak-admin-secret lives in operator namespace only
- Updated requirements section to specify operator namespace for admin secret
- Updated reconcile flow to show admin secret read from operator namespace
- Updated RBAC section to reflect split configuration placement
- Updated migration guide to specify operator namespace setup
- Added security note that agent namespaces should NOT have this secret

Installation Impact:
- Installation scripts must create keycloak-admin-secret in kagenti-system
- Agent namespaces do not need this secret (simplified namespace setup)
- Existing deployments: delete keycloak-admin-secret from agent namespaces
- Operator will automatically start using the centralized secret

Testing:
- Verified with test-ns-alpha and test-ns-beta namespaces
- Confirmed client registration works with centralized secret
- Confirmed operator logs show correct namespace lookup

This completes the centralized configuration pattern started with
kagenti-operator-config, providing a consistent security model for
waypoint mode deployments.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Akram <akram.benaissi@gmail.com>
Add comprehensive documentation for waypoint mode feature:

- docs/waypoint-mode.md: Complete design and user guide
  - Architecture and design principles
  - Key components (NamespaceWaypointReconciler, ClientRegistrationReconciler)
  - Step-by-step deployment instructions
  - Code examples (YAML, Python, Bash)
  - Security model and performance characteristics
  - Troubleshooting guide and FAQ

- docs/migration-sidecar-to-waypoint.md: Migration guide
  - Blue-green migration strategy
  - Phase-by-phase migration procedure
  - Rollback procedures
  - Validation checklist and automated script
  - Troubleshooting migration issues
  - Best practices

Total: 51KB of production-ready documentation covering design,
deployment, migration, security, and operations.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Akram <akram.benaissi@gmail.com>
Created visual diagrams to complement waypoint-mode.md:
- waypoint-architecture.mmd: High-level system architecture
- agent-communication-flow.mmd: Token exchange sequence diagram
- operator-reconciliation.mmd: Controller reconciliation flows
- security-architecture.mmd: Centralized secret management model
- waypoint-vs-sidecar.mmd: Resource comparison between modes

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Akram <akram.benaissi@gmail.com>
akram force-pushed the feat/waypoint-default-mode branch from 2e2b17d to 5f88ed7 on April 3, 2026 17:34
Alan-Cha left a comment

Summary

This PR implements Istio waypoint mode as the default deployment pattern for Kagenti, moving from per-agent sidecar envoys to centralized waypoint gateways. The implementation is solid, but there is a critical documentation inconsistency that must be fixed before merge.

Areas reviewed: Go code, Kubernetes controllers, Helm charts, documentation, security model, commit conventions

Commits: 4 commits, all include DCO sign-off ✅

CI status: All checks passing ✅


Critical Issue: Feature Flag Default Inconsistency

The feature flag --enable-waypoint-provisioning has conflicting defaults between code and documentation:

  • Actual code (cmd/main.go:44): flag.BoolVar(&enableWaypointProvisioning, "enable-waypoint-provisioning", true, ...)
  • PR description: "Enable automatic waypoint gateway creation (default: true)"
  • Documentation (docs/waypoint-mode.md lines ~1515, ~1817): Shows examples with default: false

Impact: Operators reading the documentation will have incorrect expectations about the default behavior. Since waypoint mode is described as the "new default deployment pattern", defaulting to true is correct, but the documentation must match.

Required fix: Update the documentation examples to show default: true consistently.


Positive Findings

Security Hardening ✅

The keycloak-admin-secret read was moved from agent namespaces to operator namespace only (internal/controller/clientregistration_controller.go). This prevents credential exposure and follows the principle of least privilege. Excellent security improvement.

Optional enhancement: Consider adding a validation check that warns if keycloak-admin-secret exists in agent namespaces (legacy deployments), suggesting migration to the centralized model.
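
One possible shape for that check, as a hedged Go sketch: given a listing of secrets, report every namespace other than the operator namespace that still holds `keycloak-admin-secret`. The secret name and the `kagenti-system` namespace come from the PR; `secretRef` and `legacyAdminSecretNamespaces` are hypothetical names, and a slice stands in for whatever the operator would list from the API server.

```go
package main

import (
	"fmt"
	"sort"
)

// secretRef is a hypothetical, minimal view of a Secret: where it lives and its name.
type secretRef struct {
	Namespace string
	Name      string
}

const adminSecretName = "keycloak-admin-secret"

// legacyAdminSecretNamespaces returns, sorted and de-duplicated, the namespaces
// other than the operator namespace that still contain the admin secret.
func legacyAdminSecretNamespaces(secrets []secretRef, operatorNS string) []string {
	seen := map[string]bool{}
	var out []string
	for _, s := range secrets {
		if s.Name != adminSecretName || s.Namespace == operatorNS || seen[s.Namespace] {
			continue
		}
		seen[s.Namespace] = true
		out = append(out, s.Namespace)
	}
	sort.Strings(out)
	return out
}

func main() {
	secrets := []secretRef{
		{Namespace: "kagenti-system", Name: adminSecretName},
		{Namespace: "test-ns-beta", Name: adminSecretName},
		{Namespace: "test-ns-alpha", Name: "unrelated-secret"},
	}
	for _, ns := range legacyAdminSecretNamespaces(secrets, "kagenti-system") {
		fmt.Printf("warning: %s found in agent namespace %q; migrate to the centralized secret\n",
			adminSecretName, ns)
	}
}
```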

Controller Implementation ✅

NamespaceWaypointReconciler has clean, idempotent reconciliation logic with:

  • Proper status conditions
  • Comprehensive error handling
  • Waypoint label propagation to NetworkAttachmentDefinition resources (shows good understanding of Istio ambient mesh requirements)

Documentation ✅

The security architecture diagram (docs/diagrams/security-architecture.mmd) clearly illustrates the isolation boundary between operator namespace (admin credentials) and agent namespaces (per-client credentials).


Summary of Changes

  1. New controller: NamespaceWaypointReconciler automatically provisions waypoint gateways for agent namespaces
  2. Security: Centralized keycloak-admin-secret access (operator namespace only)
  3. Feature flags: Both --enable-waypoint-provisioning and --enable-operator-client-registration default to true
  4. Migration path: Blue-green namespace migration strategy documented

Verdict: REQUEST_CHANGES - Fix the documentation inconsistency, then this is ready to merge.

pdettori commented Apr 8, 2026

@akram I'm wondering whether we should hold off on merging this PR until we decide on the approach (e.g., the Istio Ambient dependency).
