A remote AI agent running on-premises with local LLM inference, designed to execute infrastructure management tasks across Cisco network environments. Communicates with external orchestration platforms via the Agent-to-Agent (A2A) protocol, secured through Cloudflare Zero Trust tunnels.
Built for the NVIDIA booth at Cisco Live 2026.
Presidio AI Studio is the central AI agent platform in the cloud: it coordinates agents and workflows. The P.A.T.H. Lab is not "local" in the sense of a customer site. It is centralized lab infrastructure, where a Cisco AI Rally Kit with NVIDIA GPUs provides LLM inference that AI Studio can use for shared or lab-hosted workloads. Separately, edge agents such as this Factory Edge Agent run a local LLM on their own on-site GPUs (e.g. Ollama on T4s at the deployment site) so sensitive tool calls and data stay at that site. The AI Studio Network Ops agent delegates infrastructure and network tasks to those edge / remote agents over the Agent-to-Agent (A2A) protocol.
flowchart TB
subgraph Cloud["Cloud — AI Agent Platform"]
Platform["AI Studio\nCentral agent orchestration"]
NetOps["AI Studio\nNetwork Ops agent"]
Platform --> NetOps
end
subgraph PATH["P.A.T.H. Lab — centralized LLM capacity"]
Rally["Cisco AI Rally Kit"]
CentralLLM["Centralized LLM inference\n(shared NVIDIA GPUs in lab)"]
Rally --> CentralLLM
end
subgraph Edge["Edge / remote agents — on-site inference"]
Factory["Factory Edge Agent\nlocal LLM on edge GPUs + NAT + LangGraph"]
Peers["Other A2A agents\neach with its own inference stack"]
end
Platform -.->|"Uses lab for centralized inference"| Rally
NetOps -->|"A2A — delegate tasks"| Factory
NetOps -->|"A2A — delegate tasks"| Peers
The diagram below zooms into the Factory Edge Agent stack (tunnels, NAT, LangGraph, on-site Ollama + MCP).
Act as a remote autonomous agent with:
- Local LLM inference on NVIDIA GPUs — sensitive infrastructure data never leaves the site
- MCP tool integration for Cisco infrastructure management (CML, IOS-XE, Splunk, ThousandEyes)
- Anthropic Agent Skills for complex multi-step workflows (digital twin creation, incident response)
- A2A protocol exposure so Presidio AI Studio can delegate tasks to this agent remotely
- NVIDIA-branded UI for direct human interaction
flowchart TB
AIStudio["Presidio AI Studio"] -->|"A2A Protocol"| CF["Cloudflare Tunnel"]
CF -->|"a2a.presidio.tech"| A2A["NAT A2A Server :10000"]
A2A --> NATServe["NAT API Server :8000"]
Browser["Operator Browser"] -->|"agent.presidio.tech"| CF2["Cloudflare Tunnel"]
CF2 --> NATUI["NAT UI :3000"]
NATUI --> NATServe
NATServe --> Graph["LangGraph Multi-Agent Router"]
Graph -->|"cml"| CMLAgent["CML Specialist Agent"]
Graph -->|"network"| NetAgent["Network Admin Agent"]
Graph -->|"general"| GenAgent["General Response"]
CMLAgent -->|"MCP"| CMLMCP["CML MCP :3010"]
CMLAgent -->|"Skills"| CMLSkill["cml-digital-twin"]
NetAgent -->|"MCP"| IOSXE["IOS-XE MCP :3011"]
NetAgent -->|"Skills"| NetSkill["cisco-iosxe-mcp"]
CMLAgent --> Ollama["Ollama + nemotron-3-nano:4b-128k"]
NetAgent --> Ollama
GenAgent --> Ollama
Graph --> Ollama
Ollama --> GPU["2x NVIDIA Tesla T4"]
| Component | Details |
|---|---|
| Server | Cisco UCS C220 M5 |
| Hypervisor | Proxmox with LXC container |
| GPUs | 2x NVIDIA Tesla T4 (16GB VRAM each, 32GB total) |
| GPU Passthrough | Via Proxmox to LXC container |
| OS | Debian (inside LXC) |
| Container Runtime | Docker CE with NVIDIA Container Toolkit |
| Server IP | Private LAN (behind Cloudflare tunnel) |
| Component | Role |
|---|---|
| NVIDIA NeMo Agent Toolkit (NAT) | API server (nat serve), A2A gateway (nat a2a serve), wraps the LangGraph agent |
| nvidia-nat-langchain | Plugin enabling NAT to wrap existing LangGraph agents via langgraph_wrapper |
| NeMo Agent Toolkit UI | NVIDIA-branded Next.js chat frontend with intermediate step visualization |
| NVIDIA Nemotron-3-Nano (4B) | Production LLM — NVIDIA's own model running locally via Ollama with 128K context |
| NVIDIA Tesla T4 GPUs | Local LLM inference via Ollama — all data stays on-premises |
| LangSmith/LangGraph | Agent tracing, visualization, and evaluation |
A single agent with all tools (20+ MCP tool definitions) creates a massive prompt that takes 20+ seconds to process on the T4 GPUs. The multi-agent routing architecture keeps each LLM call's context small and fast.
flowchart TD
Query["User Query"] --> Router["Router Node\n~300 tokens, ~1s"]
Router -->|"cml"| CML["CML Agent\n~1-2K tokens, ~5-8s"]
Router -->|"network"| Net["Network Admin Agent\n~1-2K tokens, ~5-8s"]
Router -->|"general"| Gen["General Response\n~300 tokens, ~2s"]
CML --> Done[Response]
Net --> Done
Gen --> Done
- Router: Zero tools. Tiny prompt with agent names and descriptions. Classifies the request in ~1 second.
- CML Specialist: Only CML MCP tools + CML digital twin skill. Handles lab management, topology creation, digital twin workflows.
- Network Admin Specialist: Only IOS-XE MCP tools + cisco-iosxe-mcp skill. Handles device status via RESTCONF, interface inspection, and read-only network operations. System prompt enforces RESTCONF-only access (no SSH) and provides device-to-port mapping via the skill's reference files.
- General: No tools. Answers meta-questions about available capabilities using the agent registry.
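The routing pattern described above can be sketched without any agent framework. This is an illustrative, offline sketch and not the code in `agent.py`: the `AGENTS` descriptions are made up, `keyword_classifier` is a stand-in for the router's LLM call, and the fallback to `general` on an unexpected label is an assumption.

```python
import re
from typing import Callable, Dict

# Hypothetical registry: route names mirror the ones above, descriptions are illustrative.
AGENTS: Dict[str, str] = {
    "cml": "Cisco Modeling Labs: labs, topologies, digital twins",
    "network": "IOS-XE devices: status, interfaces, read-only RESTCONF checks",
    "general": "Questions about the agent's own capabilities",
}


def build_router_prompt(agents: Dict[str, str]) -> str:
    """Build the small, tool-free prompt from agent names and descriptions only."""
    lines = ["Classify the request. Reply with exactly one agent name.", ""]
    lines += [f"- {name}: {desc}" for name, desc in agents.items()]
    return "\n".join(lines)


def keyword_classifier(prompt: str, query: str) -> str:
    """Stand-in for the LLM call so the sketch runs offline (ignores the prompt)."""
    words = set(re.findall(r"[a-z0-9-]+", query.lower()))
    if words & {"cml", "lab", "labs", "topology"}:
        return "cml"
    if words & {"hostname", "interface", "interfaces", "restconf"}:
        return "network"
    return "unsure"


def route(query: str, classify: Callable[[str, str], str],
          agents: Dict[str, str] = AGENTS) -> str:
    """Return the chosen agent name, falling back to 'general' on unknown labels."""
    label = classify(build_router_prompt(agents), query).strip().lower()
    return label if label in agents else "general"


if __name__ == "__main__":
    for q in ["List my CML labs", "Get the hostname of WAN-01", "What tools are available?"]:
        print(f"{q!r} -> {route(q, keyword_classifier)}")
```

Because the router prompt is built only from names and descriptions, its size grows with the number of agents rather than with the number of tools.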
Skills follow the agentskills.io three-level progressive loading:
- Level 1 (Discovery): Only `name` + `description` from the SKILL.md frontmatter are loaded at startup (~100 tokens per skill)
- Level 2 (Activation): The agent calls the `activate_skill` tool to load the full SKILL.md body when a complex task matches a skill
- Level 3 (Execution): The agent calls the `read_skill_reference` tool to load files from the skill's `references/` directory on demand
This keeps prompt context minimal for simple queries while providing full procedural knowledge for complex workflows.
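As a rough illustration of the three levels, the sketch below loads a skill from disk in stages. The function names `activate_skill` and `read_skill_reference` are borrowed from the tool names above, but the bodies are assumptions, not the agent's actual implementation. The frontmatter parser handles only flat `key: value` pairs, and `read_frontmatter` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def read_frontmatter(skill_md: Path) -> dict:
    """Level 1: parse only the flat `key: value` lines between the --- markers."""
    lines = skill_md.read_text(encoding="utf-8").splitlines()
    meta = {}
    if lines and lines[0].strip() == "---":
        for line in lines[1:]:
            if line.strip() == "---":
                break
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta


def activate_skill(skill_dir: Path) -> str:
    """Level 2: return the SKILL.md body, i.e. everything after the frontmatter."""
    text = (skill_dir / "SKILL.md").read_text(encoding="utf-8")
    parts = text.split("---", 2)
    return parts[2].strip() if text.startswith("---") and len(parts) == 3 else text


def read_skill_reference(skill_dir: Path, filename: str) -> str:
    """Level 3: load one file from references/, refusing paths that escape it."""
    root = (skill_dir / "references").resolve()
    target = (root / filename).resolve()
    if root not in target.parents:
        raise ValueError(f"{filename!r} is outside {root}")
    return target.read_text(encoding="utf-8")


# Demo on a throwaway skill directory.
with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp) / "my-new-skill"
    (skill / "references").mkdir(parents=True)
    (skill / "SKILL.md").write_text(
        "---\nname: my-new-skill\ndescription: Demo skill\n---\n# My New Skill\nStep one.\n",
        encoding="utf-8",
    )
    (skill / "references" / "api-guide.md").write_text("API notes\n", encoding="utf-8")

    meta = read_frontmatter(skill / "SKILL.md")       # cheap, done at startup
    body = activate_skill(skill)                      # only when a task matches
    ref = read_skill_reference(skill, "api-guide.md") # only when the body asks for it
```

Only `meta` needs to sit in the prompt for every request. `body` and `ref` are pulled in by tool calls when a workflow needs them.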
Testing was performed on the multi-agent routing architecture with the following models:
NVIDIA's Nemotron-3-Nano is a 4B parameter model with native 256K context support. Ollama defaults to 4K context on GPUs with less than 24GB VRAM (each T4 is 16GB), so a custom Modelfile was required to set num_ctx explicitly. The -128k variant was created with 128K context configured, balancing context capacity with VRAM usage.
Creating the custom context variant:
cat > /tmp/Modelfile << EOF
FROM nemotron-3-nano:4b
PARAMETER num_ctx 131072
EOF
ollama create nemotron-3-nano:4b-128k -f /tmp/Modelfile

| Query | Tokens | Latency |
|---|---|---|
| "What tools are available?" (general route) | ~600 | ~6s |
| "Get the hostname of WAN-01" (network route, MCP RESTCONF call) | ~2.7K | ~12s |
| "List my CML labs" (CML route, MCP tool call) | ~14K | ~17s |
| Model | Size | Context | Tool Calling | Result |
|---|---|---|---|---|
| nemotron-3-nano:4b-128k | ~2.5GB | 128K (custom) | Reliable | NVIDIA model. Requires clear system prompt instructions for optional parameter handling. Fast prefill. Selected as production model. |
| qwen2.5:7b | ~4.5GB | 128K | Reliable | Follows instructions well, no thinking overhead. Previous production model — replaced by nemotron for NVIDIA alignment. |
| qwen3:8b | ~5GB | 128K | Yes | Thinking mode generates hundreds of hidden tokens. 30-40s per call. Rejected. |
| phi4-mini:3.8b | ~2.5GB | 128K | Yes | Fast inference but ignores system prompt constraints. Hallucinated capabilities instead of listing actual tools. Rejected. |
| nemotron-mini:4b | ~2.5GB | 4K | Unreliable (~50%) | Previous generation NVIDIA model. 4K context too small for tool definitions. Tool calling outputs raw XML instead of proper function calls ~50% of the time. Rejected. |
- Ollama `num_ctx` defaults are model- and GPU-dependent. Ollama silently reduces context windows on GPUs with less than 24GB VRAM. nemotron-3-nano:4b supports 256K natively but Ollama defaulted to 4K, causing tool call XML to truncate and produce parse errors. Always configure `num_ctx` explicitly via a custom Modelfile.
- Prompt size is the bottleneck, not GPU power. T4s process ~250-350 tokens/sec for prefill. A 5K token prompt = ~15-20s just for prefill.
- `create_deep_agent` was the wrong starting point. It creates a planner that makes 3-5 sequential LLM calls per request, each processing the full prompt. Replaced with multi-agent routing using `create_react_agent` per specialist.
- Model routing reduces per-call context by 3-4x. Router sees ~300 tokens. Specialists see ~1-2K tokens (only their own tools) vs. a monolithic agent seeing 5.7K+ tokens.
- Smaller models need explicit operational context in the system prompt. MCP tool descriptions that say "parameters are optional if configured via environment variables" are not specific enough for 4B models. The specialist system prompt must explicitly state which parameters to omit and which tools to avoid. Larger models (7B+) infer this more reliably from tool descriptions alone.
- Qwen3's "thinking mode" generates invisible tokens that waste inference time. Even with `reasoning_effort: none`, the model was unreliable. Qwen2.5:7b and nemotron-3-nano avoid this entirely.
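The prefill arithmetic behind these observations is simple enough to check by hand. The sketch below assumes prefill time is just prompt tokens divided by throughput, using the ~250-350 tokens/sec range quoted above. It ignores generation time and tool latency, so it is a lower bound rather than a model of end-to-end latency.

```python
def prefill_seconds(prompt_tokens: int, tokens_per_second: float) -> float:
    """Time to process the prompt before the first output token."""
    return prompt_tokens / tokens_per_second


def prefill_range(prompt_tokens: int, slow: float = 250.0, fast: float = 350.0):
    """Return (best_case, worst_case) seconds for the quoted T4 throughput range."""
    return (prefill_seconds(prompt_tokens, fast), prefill_seconds(prompt_tokens, slow))


if __name__ == "__main__":
    # Token counts taken from the routing figures in this section.
    for label, tokens in [("router", 300), ("specialist", 2000), ("monolithic", 5700)]:
        best, worst = prefill_range(tokens)
        print(f"{label:>10}: {tokens:>5} tokens -> {best:.1f}-{worst:.1f}s prefill")
```

For a 5K-token prompt this gives roughly 14 to 20 seconds, in line with the estimate in the list above.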
The agent is exposed to the internet through Cloudflare Tunnels with Zero Trust access policies. No ports are opened on the server's firewall.
flowchart LR
Internet["Internet"] --> CF["Cloudflare Edge"]
CF -->|"agent.presidio.tech"| Tunnel1["Cloudflare Tunnel"]
CF -->|"a2a.presidio.tech"| Tunnel2["Cloudflare Tunnel"]
CF -->|"auth.presidio.tech"| Tunnel3["Cloudflare Tunnel"]
Tunnel1 -->|"localhost:3000"| NATUI["NAT UI"]
Tunnel2 -->|"localhost:10000"| A2A["NAT A2A Server"]
Tunnel3 -->|"localhost:8080"| KC["Keycloak"]
| Endpoint | URL | Backend | Auth |
|---|---|---|---|
| NAT UI | https://agent.presidio.tech | localhost:3000 | Cloudflare Access (SSO) |
| A2A Gateway | https://a2a.presidio.tech | localhost:10000 | OAuth2 Bearer token (JWT) |
| Keycloak IdP | https://auth.presidio.tech | localhost:8080 | Cloudflare Access Bypass for /realms/* |
The A2A endpoint accepts JSON-RPC requests following the A2A protocol specification. The agent card is available at https://a2a.presidio.tech/.well-known/agent-card.json.
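For readers unfamiliar with JSON-RPC 2.0, the sketch below builds a request envelope and unpacks a reply. It covers only the generic JSON-RPC framing. The helper names are hypothetical, and the method name `agent/discover` is simply the one used in the curl smoke test elsewhere in this document.

```python
import json
from typing import Any, Optional


def jsonrpc_request(method: str, params: Optional[dict] = None, request_id: str = "1") -> str:
    """Serialize a JSON-RPC 2.0 request envelope."""
    return json.dumps(
        {"jsonrpc": "2.0", "method": method, "id": request_id, "params": params or {}}
    )


def jsonrpc_result(raw: str, request_id: str = "1") -> Any:
    """Return the `result` member, or raise if the reply is an error or mismatched."""
    reply = json.loads(raw)
    if reply.get("jsonrpc") != "2.0" or reply.get("id") != request_id:
        raise ValueError("not a matching JSON-RPC 2.0 response")
    if "error" in reply:
        err = reply["error"]
        raise RuntimeError(f"JSON-RPC error {err.get('code')}: {err.get('message')}")
    return reply["result"]


if __name__ == "__main__":
    print(jsonrpc_request("agent/discover"))
```

The serialized string is what goes in the POST body, alongside the `Authorization: Bearer` header described in the next section.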
The A2A endpoint is protected by OAuth2 with JWT token validation. Keycloak serves as the identity provider (IdP), and NAT validates tokens server-side using JWKS.
sequenceDiagram
participant Client as A2A Client<br/>(AI Studio)
participant KC as Keycloak<br/>(auth.presidio.tech)
participant Agent as Factory Edge Agent<br/>(a2a.presidio.tech)
Client->>Agent: GET /.well-known/agent-card.json (public, no auth)
Agent-->>Client: Agent card with securitySchemes<br/>(authorizationCode + clientCredentials flows)
Client->>KC: POST /protocol/openid-connect/token<br/>(client_credentials grant, scope: agent_execute)
KC-->>Client: JWT access token (aud: a2a.presidio.tech)
Client->>Agent: POST / (JSON-RPC)<br/>Authorization: Bearer <JWT>
Agent->>KC: Fetch JWKS (cached)
Agent->>Agent: Validate signature, issuer, audience, scopes, expiry
Agent-->>Client: A2A response
- Keycloak runs as a Docker container on the same server, exposed via Cloudflare Tunnel at `auth.presidio.tech`
- NAT A2A Server validates incoming JWT tokens using the JWKS endpoint (fetched locally via `localhost:8080` to avoid hairpin through Cloudflare)
- Agent card advertises both `authorizationCode` and `clientCredentials` OAuth2 flows so clients can authenticate using whichever method they support
- Cloudflare terminates TLS and passes Authorization headers through to the origin without modification
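The claim checks in the sequence above (issuer, audience, scopes, expiry) can be illustrated on an already-decoded payload. This sketch is not NAT's middleware, and it deliberately omits signature verification against the JWKS, which a real validator must do first. It also assumes scopes arrive as a space-delimited `scope` claim. The function name `check_claims` is hypothetical.

```python
import time
from typing import Iterable, List, Mapping, Optional


def check_claims(claims: Mapping, issuer: str, audience: str,
                 required_scopes: Iterable[str], now: Optional[float] = None) -> List[str]:
    """Return a list of problems; an empty list means the claims pass.

    NOTE: signature verification is intentionally out of scope for this sketch.
    """
    now = time.time() if now is None else now
    problems = []
    if claims.get("iss") != issuer:
        problems.append("issuer mismatch")
    aud = claims.get("aud", [])
    aud_values = [aud] if isinstance(aud, str) else list(aud)  # aud may be str or list
    if audience not in aud_values:
        problems.append("audience mismatch")
    granted = set(str(claims.get("scope", "")).split())
    missing = sorted(set(required_scopes) - granted)
    if missing:
        problems.append("missing scopes: " + ", ".join(missing))
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= now:
        problems.append("token expired or exp missing")
    return problems


if __name__ == "__main__":
    sample = {  # illustrative payload, not a real token
        "iss": "https://auth.presidio.tech/realms/agent-auth",
        "aud": "https://a2a.presidio.tech",
        "scope": "agent_execute",
        "exp": time.time() + 300,
    }
    print(check_claims(sample, sample["iss"], sample["aud"], ["agent_execute"]))
```

Returning a list of problems instead of a boolean makes a rejected token easier to diagnose, for example when the audience mapper is missing in Keycloak.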
| Resource | Value |
|---|---|
| Realm | agent-auth |
| Client | Confidential client with client_credentials grant and standardFlowEnabled |
| Client Scope | agent_execute (assigned as default scope) |
| Audience Mapper | Injects the A2A public URL into the aud claim of access tokens |
| User | Optional — only needed if using authorization code flow for browser-based connections |
general:
  front_end:
    _type: a2a
    name: "Factory Edge Agent"
    description: "Remote infrastructure management agent"
    host: 0.0.0.0
    port: 10000
    public_base_url: "https://a2a.presidio.tech"
    server_auth:
      issuer_url: https://auth.presidio.tech/realms/agent-auth
      discovery_url: https://auth.presidio.tech/realms/agent-auth/.well-known/openid-configuration
      jwks_uri: http://localhost:8080/realms/agent-auth/protocol/openid-connect/certs
      scopes:
        - agent_execute
      audience: https://a2a.presidio.tech

Key points:

- `discovery_url` is required for NAT to resolve the correct OAuth endpoints in the agent card. Without it, NAT falls back to generic `/oauth/authorize` and `/oauth/token` paths that don't match Keycloak's OIDC paths.
- `jwks_uri` points to `localhost:8080` so NAT can validate token signatures without routing through Cloudflare (avoids hairpin and Cloudflare Access interference).
- `issuer_url` must match the `iss` claim in tokens exactly. This is the public Keycloak URL, since that is what Keycloak stamps into tokens.
- `audience` must match the `aud` claim injected by the Keycloak audience mapper.
By default, NAT only generates an authorizationCode flow in the agent card's securitySchemes. For platforms like AI Studio that use client credentials for machine-to-machine A2A calls, the agent card must also advertise a clientCredentials flow with the token URL. This was added via a source modification to front_end_plugin_worker.py (adding ClientCredentialsOAuthFlow alongside AuthorizationCodeOAuthFlow).
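A client can use the advertised flow to discover where to request a token. The sketch below assumes the agent card's `securitySchemes` follow the OpenAPI-style OAuth2 layout (`flows.clientCredentials.tokenUrl`). The scheme key `oauth2`, the sample card, and the helper name are hypothetical. The Keycloak URLs are derived from the realm settings in this document.

```python
from typing import Optional

REALM = "https://auth.presidio.tech/realms/agent-auth"

# Hypothetical agent card fragment for illustration only.
SAMPLE_CARD = {
    "name": "Factory Edge Agent",
    "securitySchemes": {
        "oauth2": {
            "type": "oauth2",
            "flows": {
                "authorizationCode": {
                    "authorizationUrl": f"{REALM}/protocol/openid-connect/auth",
                    "tokenUrl": f"{REALM}/protocol/openid-connect/token",
                    "scopes": {"agent_execute": "Execute agent tasks"},
                },
                "clientCredentials": {
                    "tokenUrl": f"{REALM}/protocol/openid-connect/token",
                    "scopes": {"agent_execute": "Execute agent tasks"},
                },
            },
        }
    },
}


def client_credentials_token_url(card: dict) -> Optional[str]:
    """Return the token URL of the first OAuth2 scheme offering clientCredentials."""
    for scheme in card.get("securitySchemes", {}).values():
        if scheme.get("type") != "oauth2":
            continue
        flow = scheme.get("flows", {}).get("clientCredentials")
        if flow and flow.get("tokenUrl"):
            return flow["tokenUrl"]
    return None  # card only supports interactive flows (or none at all)


if __name__ == "__main__":
    print(client_credentials_token_url(SAMPLE_CARD))
```

A card generated without the source modification would contain only the `authorizationCode` entry, and this lookup would return `None`.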
- `auth.presidio.tech` requires a Bypass policy (not Allow) for `/realms/*` paths. Cloudflare Access "Allow" policies require a browser session cookie, so machine clients (NAT's JWKS fetcher, AI Studio's token requests) get 302-redirected to a login page even if their source IP is whitelisted.
- `a2a.presidio.tech` does not need a Cloudflare Access policy because authentication is handled by the NAT OAuth middleware. Optionally, restrict by IP for defense-in-depth.
# Get a token via client credentials
TOKEN=$(curl -s -X POST http://localhost:8080/realms/agent-auth/protocol/openid-connect/token \
-d "grant_type=client_credentials" \
-d "client_id=<client-id>" \
-d "client_secret=<client-secret>" \
-d "scope=agent_execute" | python3 -c 'import sys,json; print(json.load(sys.stdin)["access_token"])')
# Test against NAT directly
curl -s -X POST http://localhost:10000/ \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","method":"agent/discover","id":"1","params":{}}' | python3 -m json.tool
# Test through Cloudflare tunnel
curl -s -X POST https://a2a.presidio.tech/ \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","method":"agent/discover","id":"1","params":{}}' | python3 -m json.tool

All services are managed via systemd.
sudo systemctl start demo-agent.target # Start all
sudo systemctl stop demo-agent.target    # Stop all

| Service | Command | Port |
|---|---|---|
| NAT API Server | sudo systemctl restart nat-serve | 8000 |
| NAT A2A Server | sudo systemctl restart nat-a2a | 10000 |
| NAT UI | sudo systemctl restart nat-ui | 3000 |
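The listeners in the table above can also be probed from a script, for example as part of a pre-demo checklist. This is a minimal sketch using only the standard library. The `SERVICE_PORTS` mapping copies the ports from the table, and a successful TCP connect only shows that something is listening, not that the service is healthy.

```python
import socket
from typing import Dict

# Ports copied from the service table; adjust if the deployment differs.
SERVICE_PORTS: Dict[str, int] = {"nat-ui": 3000, "nat-serve": 8000, "nat-a2a": 10000}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "127.0.0.1",
                   ports: Dict[str, int] = SERVICE_PORTS) -> Dict[str, bool]:
    """Probe every named port and report which ones accept connections."""
    return {name: port_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, ok in check_services().items():
        print(f"{name:<10} {'listening' if ok else 'DOWN'}")
```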
# Service status
systemctl status nat-serve nat-a2a nat-ui
# Port verification
ss -tlnp | grep -E ':(3000|8000|10000) '
# NAT API health
curl http://localhost:8000/health
# View logs
sudo journalctl -u nat-serve -f --no-pager -n 50
# GPU utilization
nvidia-smi
# Ollama model status
docker exec ollama ollama ps

- Start the MCP server as a Docker container:
docker run -d \
-p <PORT>:<PORT> \
-e <CONFIG_VARS> \
--name <server-name> \
--restart unless-stopped \
<image>

- Add the MCP server to the agent registry in `/opt/demo-agent/agent.py`:

AGENT_REGISTRY = {
    # ... existing agents ...
    "new_agent_name": {
        "description": "Description of what this agent does",
        "mcp": {
            "server_name": {
                "transport": "streamable_http",
                "url": os.getenv("NEW_MCP_URL", "http://localhost:<PORT>/mcp"),
            },
        },
        "skill_dirs": ["skill-directory-name"],  # optional
    },
}

- Add the specialist node in the `build_agent()` function:
async def new_agent_node(state):
    result = await specialists["new_agent_name"].ainvoke({"messages": state["messages"]})
    return {"messages": result["messages"]}

graph.add_node("new_agent_name", new_agent_node)

- Add the routing edge: update the `conditional_edges` dict to include the new agent name.
- Add routing examples to `ROUTER_PROMPT`:

- "relevant query example" -> new_agent_name

- Restart:

sudo systemctl restart nat-serve

Skills follow the Anthropic Agent Skills specification.
- Create the skill directory under `/opt/demo-agent/skills/`:

/opt/demo-agent/skills/my-new-skill/
├── SKILL.md              # Required: metadata + instructions
└── references/           # Optional: documentation loaded on demand
    ├── api-guide.md
    └── examples.md
- Write the SKILL.md with YAML frontmatter:
---
name: my-new-skill
description: What this skill does and when to use it. Include keywords for agent discovery.
---
# My New Skill
## Overview
What this skill accomplishes.
## Procedure
1. Step one — call tool X with parameters Y
2. Step two — read reference file Z for configuration details
3. Step three — execute the workflow
## Reference Files
- `api-guide.md`: API documentation for the target system
- `examples.md`: Example configurations and expected outputs

- Map the skill to an agent in `AGENT_REGISTRY`:

"agent_name": {
    "skill_dirs": ["my-new-skill"],
    ...
}

- Restart:

sudo systemctl restart nat-serve

The agent will log `[AGENT_NAME] Loaded 1 skills: ['my-new-skill']` on startup.
/opt/demo-agent/
├── agent.py                 # Multi-agent LangGraph graph (router + specialists)
├── nat-config.yml           # NAT configuration (LLM + langgraph_wrapper)
├── .env                     # Environment variables (model, API keys, tokens)
├── langgraph.json           # LangGraph dev server config (visualization only)
├── skills/
│   ├── cml-digital-twin/
│   │   ├── SKILL.md
│   │   └── references/
│   └── cisco-iosxe-mcp/
│       ├── SKILL.md
│       └── references/
└── nat-ui/                  # NVIDIA NeMo Agent Toolkit UI (Next.js)
    └── .env                 # UI configuration (branding, backend URL)

/etc/systemd/system/
├── nat-serve.service        # NAT API Server (wraps LangGraph agent)
├── nat-a2a.service          # NAT A2A Server (A2A protocol gateway)
├── nat-ui.service           # NVIDIA NAT UI (Next.js frontend)
└── demo-agent.target        # Group target for all services
| Container | Image | Port | Purpose |
|---|---|---|---|
| ollama | ollama/ollama:latest | 11434 | LLM inference with GPU acceleration |
| cml-mcp-server | ghcr.io/presidio-federal/cml-mcp:latest | 3010 | Cisco Modeling Labs MCP tools |
| iosxe-mcp-server | ghcr.io/presidio-federal/cisco-ios-xe-mcp:latest | 3011 | Cisco IOS-XE RESTCONF/SSH MCP tools |
| keycloak | quay.io/keycloak/keycloak | 8080 | OAuth2 IdP for A2A JWT validation |
For visual graph debugging, run the LangGraph dev server on a separate port:
cd /opt/demo-agent && /home/.venv/bin/langgraph dev --port 8123 --host 0.0.0.0

Then connect LangGraph Studio at https://smith.langchain.com/studio pointing to http://<SERVER_IP>:8123.
This runs a second instance of the agent for visualization only. Production traffic goes through nat-serve on port 8000.
# Pull the new model
docker exec ollama ollama pull <model-name>
# If the model needs a custom context window (required for GPUs < 24GB VRAM):
cat > /tmp/Modelfile << EOF
FROM <model-name>
PARAMETER num_ctx 131072
EOF
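# Copy the Modelfile into the container first: docker exec resolves -f inside the container
docker cp /tmp/Modelfile ollama:/tmp/Modelfile
docker exec ollama ollama create <model-name>-128k -f /tmp/Modelfile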
# Update environment
sed -i 's/^OLLAMA_MODEL=.*/OLLAMA_MODEL=<model-name>/' /opt/demo-agent/.env
sed -i 's/model: .*/model: <model-name>/' /opt/demo-agent/nat-config.yml
# Unload old model and restart
docker exec ollama ollama stop <old-model>
sudo systemctl restart nat-serve
# Verify
docker exec ollama ollama ps

Current production model: nemotron-3-nano:4b-128k (NVIDIA Nemotron-3-Nano with 128K context configured via custom Modelfile)