Factory Edge Agent — NVIDIA + Cisco + Presidio

A remote AI agent running on-premises with local LLM inference, designed to execute infrastructure management tasks across Cisco network environments. Communicates with external orchestration platforms via the Agent-to-Agent (A2A) protocol, secured through Cloudflare Zero Trust tunnels.

Built for the NVIDIA booth at Cisco Live 2026.

AI Studio orchestration and A2A delegation

Presidio AI Studio is the central AI Agent Platform in the cloud: it coordinates agents and workflows. The P.A.T.H. Lab is not “local” in the sense of a customer site—it is centralized lab infrastructure: Cisco AI Rally Kit with NVIDIA GPUs provides centralized LLM inference that AI Studio can use for shared or lab-hosted workloads. Separately, edge agents such as this Factory Edge Agent run a local LLM on their own on-site GPUs (e.g. Ollama on T4s at the deployment site) so sensitive tool calls and data stay at that site. The AI Studio Network Ops agent delegates infrastructure and network tasks to those edge / remote agents over the Agent-to-Agent (A2A) protocol.

flowchart TB
    subgraph Cloud["Cloud — AI Agent Platform"]
        Platform["AI Studio\nCentral agent orchestration"]
        NetOps["AI Studio\nNetwork Ops agent"]
        Platform --> NetOps
    end

    subgraph PATH["P.A.T.H. Lab — centralized LLM capacity"]
        Rally["Cisco AI Rally Kit"]
        CentralLLM["Centralized LLM inference\n(shared NVIDIA GPUs in lab)"]
        Rally --> CentralLLM
    end

    subgraph Edge["Edge / remote agents — on-site inference"]
        Factory["Factory Edge Agent\nlocal LLM on edge GPUs + NAT + LangGraph"]
        Peers["Other A2A agents\neach with its own inference stack"]
    end

    Platform -.->|"Uses lab for centralized inference"| Rally
    NetOps -->|"A2A — delegate tasks"| Factory
    NetOps -->|"A2A — delegate tasks"| Peers

The diagram below zooms into the Factory Edge Agent stack (tunnels, NAT, LangGraph, on-site Ollama + MCP).

End Goal

Act as a remote autonomous agent with:

Local LLM inference on NVIDIA GPUs — sensitive infrastructure data never leaves the site
MCP tool integration for Cisco infrastructure management (CML, IOS-XE, Splunk, ThousandEyes)
Anthropic Agent Skills for complex multi-step workflows (digital twin creation, incident response)
A2A protocol exposure so Presidio AI Studio can delegate tasks to this agent remotely
NVIDIA-branded UI for direct human interaction

flowchart TB
    AIStudio["Presidio AI Studio"] -->|"A2A Protocol"| CF["Cloudflare Tunnel"]
    CF -->|"a2a.presidio.tech"| A2A["NAT A2A Server :10000"]
    A2A --> NATServe["NAT API Server :8000"]

    Browser["Operator Browser"] -->|"agent.presidio.tech"| CF2["Cloudflare Tunnel"]
    CF2 --> NATUI["NAT UI :3000"]
    NATUI --> NATServe

    NATServe --> Graph["LangGraph Multi-Agent Router"]

    Graph -->|"cml"| CMLAgent["CML Specialist Agent"]
    Graph -->|"network"| NetAgent["Network Admin Agent"]
    Graph -->|"general"| GenAgent["General Response"]

    CMLAgent -->|"MCP"| CMLMCP["CML MCP :3010"]
    CMLAgent -->|"Skills"| CMLSkill["cml-digital-twin"]
    NetAgent -->|"MCP"| IOSXE["IOS-XE MCP :3011"]
    NetAgent -->|"Skills"| NetSkill["cisco-iosxe-mcp"]

    CMLAgent --> Ollama["Ollama + nemotron-3-nano:4b-128k"]
    NetAgent --> Ollama
    GenAgent --> Ollama
    Graph --> Ollama

    Ollama --> GPU["2x NVIDIA Tesla T4"]

Hardware Profile

Component	Details
Server	Cisco UCS C220 M5
Hypervisor	Proxmox with LXC container
GPUs	2x NVIDIA Tesla T4 (16GB VRAM each, 32GB total)
GPU Passthrough	Via Proxmox to LXC container
OS	Debian (inside LXC)
Container Runtime	Docker CE with NVIDIA Container Toolkit
Server IP	Private LAN (behind Cloudflare tunnel)

NVIDIA Components

Component	Role
NVIDIA NeMo Agent Toolkit (NAT)	API server (`nat serve`), A2A gateway (`nat a2a serve`), wraps the LangGraph agent
nvidia-nat-langchain	Plugin enabling NAT to wrap existing LangGraph agents via `langgraph_wrapper`
NeMo Agent Toolkit UI	NVIDIA-branded Next.js chat frontend with intermediate step visualization
NVIDIA Nemotron-3-Nano (4B)	Production LLM — NVIDIA's own model running locally via Ollama with 128K context
NVIDIA Tesla T4 GPUs	Local LLM inference via Ollama — all data stays on-premises
LangSmith/LangGraph	Agent tracing, visualization, and evaluation

Agent Architecture

Why Multi-Agent Routing

A single agent with all tools (20+ MCP tool definitions) creates a massive prompt that takes 20+ seconds to process on the T4 GPUs. The multi-agent routing architecture keeps each LLM call's context small and fast.

flowchart TD
    Query["User Query"] --> Router["Router Node\n~300 tokens, ~1s"]
    Router -->|"cml"| CML["CML Agent\n~1-2K tokens, ~5-8s"]
    Router -->|"network"| Net["Network Admin Agent\n~1-2K tokens, ~5-8s"]
    Router -->|"general"| Gen["General Response\n~300 tokens, ~2s"]
    CML --> Done[Response]
    Net --> Done
    Gen --> Done

Router: Zero tools. Tiny prompt with agent names and descriptions. Classifies the request in ~1 second.
CML Specialist: Only CML MCP tools + CML digital twin skill. Handles lab management, topology creation, digital twin workflows.
Network Admin Specialist: Only IOS-XE MCP tools + cisco-iosxe-mcp skill. Handles device status via RESTCONF, interface inspection, and read-only network operations. System prompt enforces RESTCONF-only access (no SSH) and provides device-to-port mapping via the skill's reference files.
General: No tools. Answers meta-questions about available capabilities using the agent registry.

Skills (Anthropic Agent Skills Specification)

Skills follow the agentskills.io three-level progressive loading:

Level 1 — Discovery: Only name + description from SKILL.md frontmatter loaded at startup (~100 tokens per skill)
Level 2 — Activation: Agent calls activate_skill tool to load full SKILL.md body when a complex task matches a skill
Level 3 — Execution: Agent calls read_skill_reference tool to load files from the skill's references/ directory on demand

This keeps prompt context minimal for simple queries while providing full procedural knowledge for complex workflows.

LLM Benchmark Data

Testing was performed on the multi-agent routing architecture with the following models:

nemotron-3-nano:4b-128k (current production model)

NVIDIA's Nemotron-3-Nano is a 4B parameter model with native 256K context support. Ollama defaults to 4K context on GPUs with less than 24GB VRAM (each T4 is 16GB), so a custom Modelfile was required to set num_ctx explicitly. The -128k variant was created with 128K context configured, balancing context capacity with VRAM usage.

Creating the custom context variant:

cat > /tmp/Modelfile << EOF
FROM nemotron-3-nano:4b
PARAMETER num_ctx 131072
EOF
ollama create nemotron-3-nano:4b-128k -f /tmp/Modelfile

Query	Tokens	Latency
"What tools are available?" (general route)	~600	~6s
"Get the hostname of WAN-01" (network route, MCP RESTCONF call)	~2.7K	~12s
"List my CML labs" (CML route, MCP tool call)	~14K	~17s

Models Evaluated

Model	Size	Context	Tool Calling	Result
nemotron-3-nano:4b-128k	~2.5GB	128K (custom)	Reliable	NVIDIA model. Requires clear system prompt instructions for optional parameter handling. Fast prefill. Selected as production model.
qwen2.5:7b	~4.5GB	128K	Reliable	Follows instructions well, no thinking overhead. Previous production model — replaced by nemotron for NVIDIA alignment.
qwen3:8b	~5GB	128K	Yes	Thinking mode generates hundreds of hidden tokens. 30-40s per call. Rejected.
phi4-mini:3.8b	~2.5GB	128K	Yes	Fast inference but ignores system prompt constraints. Hallucinated capabilities instead of listing actual tools. Rejected.
nemotron-mini:4b	~2.5GB	4K	Unreliable (~50%)	Previous generation NVIDIA model. 4K context too small for tool definitions. Tool calling outputs raw XML instead of proper function calls ~50% of the time. Rejected.

Key Lessons Learned

Ollama num_ctx defaults are model- and GPU-dependent. Ollama silently reduces context windows on GPUs with less than 24GB VRAM. nemotron-3-nano:4b supports 256K natively but Ollama defaulted to 4K, causing tool call XML to truncate and produce parse errors. Always configure num_ctx explicitly via a custom Modelfile.
Prompt size is the bottleneck, not GPU power. T4s process ~250-350 tokens/sec for prefill. A 5K token prompt = ~15-20s just for prefill.
create_deep_agent was the wrong starting point. It creates a planner that makes 3-5 sequential LLM calls per request, each processing the full prompt. Replaced with multi-agent routing using create_react_agent per specialist.
Model routing reduces per-call context by 3-4x. Router sees ~300 tokens. Specialists see ~1-2K tokens (only their own tools) vs. a monolithic agent seeing 5.7K+ tokens.
Smaller models need explicit operational context in the system prompt. MCP tool descriptions that say "parameters are optional if configured via environment variables" are not specific enough for 4B models. The specialist system prompt must explicitly state which parameters to omit and which tools to avoid. Larger models (7B+) infer this more reliably from tool descriptions alone.
Qwen3's "thinking mode" generates invisible tokens that waste inference time. Even with reasoning_effort: none, the model was unreliable. Qwen2.5:7b and nemotron-3-nano avoid this entirely.

Secure External Access (Cloudflare Zero Trust)

The agent is exposed to the internet through Cloudflare Tunnels with Zero Trust access policies. No ports are opened on the server's firewall.

flowchart LR
    Internet["Internet"] --> CF["Cloudflare Edge"]
    CF -->|"agent.presidio.tech"| Tunnel1["Cloudflare Tunnel"]
    CF -->|"a2a.presidio.tech"| Tunnel2["Cloudflare Tunnel"]
    CF -->|"auth.presidio.tech"| Tunnel3["Cloudflare Tunnel"]
    Tunnel1 -->|"localhost:3000"| NATUI["NAT UI"]
    Tunnel2 -->|"localhost:10000"| A2A["NAT A2A Server"]
    Tunnel3 -->|"localhost:8080"| KC["Keycloak"]

Endpoint	URL	Backend	Auth
NAT UI	`https://agent.presidio.tech`	`localhost:3000`	Cloudflare Access (SSO)
A2A Gateway	`https://a2a.presidio.tech`	`localhost:10000`	OAuth2 Bearer token (JWT)
Keycloak IdP	`https://auth.presidio.tech`	`localhost:8080`	Cloudflare Access Bypass for `/realms/*`

The A2A endpoint accepts JSON-RPC requests following the A2A protocol specification. The agent card is available at https://a2a.presidio.tech/.well-known/agent-card.json.

OAuth2 Authentication (Keycloak + A2A)

The A2A endpoint is protected by OAuth2 with JWT token validation. Keycloak serves as the identity provider (IdP), and NAT validates tokens server-side using JWKS.

Authentication Flow

sequenceDiagram
    participant Client as A2A Client<br/>(AI Studio)
    participant KC as Keycloak<br/>(auth.presidio.tech)
    participant Agent as Factory Edge Agent<br/>(a2a.presidio.tech)

    Client->>Agent: GET /.well-known/agent-card.json (public, no auth)
    Agent-->>Client: Agent card with securitySchemes<br/>(authorizationCode + clientCredentials flows)

    Client->>KC: POST /protocol/openid-connect/token<br/>(client_credentials grant, scope: agent_execute)
    KC-->>Client: JWT access token (aud: a2a.presidio.tech)

    Client->>Agent: POST / (JSON-RPC)<br/>Authorization: Bearer <JWT>
    Agent->>KC: Fetch JWKS (cached)
    Agent->>Agent: Validate signature, issuer, audience, scopes, expiry
    Agent-->>Client: A2A response

Architecture

Keycloak runs as a Docker container on the same server, exposed via Cloudflare Tunnel at auth.presidio.tech
NAT A2A Server validates incoming JWT tokens using the JWKS endpoint (fetched locally via localhost:8080 to avoid hairpin through Cloudflare)
Agent card advertises both authorizationCode and clientCredentials OAuth2 flows so clients can authenticate using whichever method they support
Cloudflare terminates TLS and passes Authorization headers through to the origin without modification

Keycloak Configuration

Resource	Value
Realm	`agent-auth`
Client	Confidential client with `client_credentials` grant and `standardFlowEnabled`
Client Scope	`agent_execute` (assigned as default scope)
Audience Mapper	Injects the A2A public URL into the `aud` claim of access tokens
User	Optional — only needed if using authorization code flow for browser-based connections

NAT `server_auth` Configuration

general:
  front_end:
    _type: a2a
    name: "Factory Edge Agent"
    description: "Remote infrastructure management agent"
    host: 0.0.0.0
    port: 10000
    public_base_url: "https://a2a.presidio.tech"
    server_auth:
      issuer_url: https://auth.presidio.tech/realms/agent-auth
      discovery_url: https://auth.presidio.tech/realms/agent-auth/.well-known/openid-configuration
      jwks_uri: http://localhost:8080/realms/agent-auth/protocol/openid-connect/certs
      scopes:
        - agent_execute
      audience: https://a2a.presidio.tech

Key points:

discovery_url is required for NAT to resolve the correct OAuth endpoints in the agent card. Without it, NAT falls back to generic /oauth/authorize and /oauth/token paths that don't match Keycloak's OIDC paths.
jwks_uri points to localhost:8080 so NAT can validate token signatures without routing through Cloudflare (avoids hairpin and Cloudflare Access interference).
issuer_url must match the iss claim in tokens exactly — this is the public Keycloak URL since that's what Keycloak stamps into tokens.
audience must match the aud claim injected by the Keycloak audience mapper.

Agent Card `clientCredentials` Flow

By default, NAT only generates an authorizationCode flow in the agent card's securitySchemes. For platforms like AI Studio that use client credentials for machine-to-machine A2A calls, the agent card must also advertise a clientCredentials flow with the token URL. This was added via a source modification to front_end_plugin_worker.py (adding ClientCredentialsOAuthFlow alongside AuthorizationCodeOAuthFlow).

Cloudflare Access Considerations

auth.presidio.tech requires a Bypass policy (not Allow) for /realms/* paths. Cloudflare Access "Allow" policies require a browser session cookie — machine clients (NAT's JWKS fetcher, AI Studio's token requests) get 302-redirected to a login page even if their source IP is whitelisted.
a2a.presidio.tech does not need a Cloudflare Access policy — authentication is handled by the NAT OAuth middleware. Optionally, restrict by IP for defense-in-depth.

Testing Token Validation Locally

# Get a token via client credentials
TOKEN=$(curl -s -X POST http://localhost:8080/realms/agent-auth/protocol/openid-connect/token \
  -d "grant_type=client_credentials" \
  -d "client_id=<client-id>" \
  -d "client_secret=<client-secret>" \
  -d "scope=agent_execute" | python3 -c 'import sys,json; print(json.load(sys.stdin)["access_token"])')

# Test against NAT directly
curl -s -X POST http://localhost:10000/ \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"agent/discover","id":"1","params":{}}' | python3 -m json.tool

# Test through Cloudflare tunnel
curl -s -X POST https://a2a.presidio.tech/ \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"agent/discover","id":"1","params":{}}' | python3 -m json.tool

Service Management

All services are managed via systemd.

Start/Stop All Services

sudo systemctl start demo-agent.target    # Start all
sudo systemctl stop demo-agent.target     # Stop all

Individual Services

Service	Command	Port
NAT API Server	`sudo systemctl restart nat-serve`	8000
NAT A2A Server	`sudo systemctl restart nat-a2a`	10000
NAT UI	`sudo systemctl restart nat-ui`	3000

Health Checks

# Service status
systemctl status nat-serve nat-a2a nat-ui

# Port verification
ss -tlnp | grep -E ':(3000|8000|10000) '

# NAT API health
curl http://localhost:8000/health

# View logs
sudo journalctl -u nat-serve -f --no-pager -n 50

# GPU utilization
nvidia-smi

# Ollama model status
docker exec ollama ollama ps

How to Add a New MCP Tool Server

Start the MCP server as a Docker container:

docker run -d \
  -p <PORT>:<PORT> \
  -e <CONFIG_VARS> \
  --name <server-name> \
  --restart unless-stopped \
  <image>

Add the MCP server to the agent registry in /opt/demo-agent/agent.py:

AGENT_REGISTRY = {
    # ... existing agents ...
    "new_agent_name": {
        "description": "Description of what this agent does",
        "mcp": {
            "server_name": {
                "transport": "streamable_http",
                "url": os.getenv("NEW_MCP_URL", "http://localhost:<PORT>/mcp"),
            },
        },
        "skill_dirs": ["skill-directory-name"],  # optional
    },
}

Add the specialist node in the build_agent() function:

async def new_agent_node(state):
    result = await specialists["new_agent_name"].ainvoke({"messages": state["messages"]})
    return {"messages": result["messages"]}

graph.add_node("new_agent_name", new_agent_node)

Add the routing edge — update the conditional_edges dict to include the new agent name.
Add routing examples to ROUTER_PROMPT:

- "relevant query example" -> new_agent_name

Restart:

sudo systemctl restart nat-serve

How to Add a New Skill

Skills follow the Anthropic Agent Skills specification.

Create the skill directory under /opt/demo-agent/skills/:

/opt/demo-agent/skills/my-new-skill/
├── SKILL.md              # Required: metadata + instructions
└── references/           # Optional: documentation loaded on demand
    ├── api-guide.md
    └── examples.md

Write the SKILL.md with YAML frontmatter:

---
name: my-new-skill
description: What this skill does and when to use it. Include keywords for agent discovery.
---

# My New Skill

## Overview
What this skill accomplishes.

## Procedure
1. Step one — call tool X with parameters Y
2. Step two — read reference file Z for configuration details
3. Step three — execute the workflow

## Reference Files
- `api-guide.md` — API documentation for the target system
- `examples.md` — Example configurations and expected outputs

Map the skill to an agent in AGENT_REGISTRY:

"agent_name": {
    "skill_dirs": ["my-new-skill"],
    ...
}

Restart:

sudo systemctl restart nat-serve

The agent will log [AGENT_NAME] Loaded 1 skills: ['my-new-skill'] on startup.

File Structure

/opt/demo-agent/
├── agent.py              # Multi-agent LangGraph graph (router + specialists)
├── nat-config.yml        # NAT configuration (LLM + langgraph_wrapper)
├── .env                  # Environment variables (model, API keys, tokens)
├── langgraph.json        # LangGraph dev server config (visualization only)
├── skills/
│   ├── cml-digital-twin/
│   │   ├── SKILL.md
│   │   └── references/
│   └── cisco-iosxe-mcp/
│       ├── SKILL.md
│       └── references/
└── nat-ui/               # NVIDIA NeMo Agent Toolkit UI (Next.js)
    └── .env              # UI configuration (branding, backend URL)

Systemd Units

/etc/systemd/system/
├── nat-serve.service     # NAT API Server (wraps LangGraph agent)
├── nat-a2a.service       # NAT A2A Server (A2A protocol gateway)
├── nat-ui.service        # NVIDIA NAT UI (Next.js frontend)
└── demo-agent.target     # Group target for all services

Docker Containers

Container	Image	Port	Purpose
ollama	`ollama/ollama:latest`	11434	LLM inference with GPU acceleration
cml-mcp-server	`ghcr.io/presidio-federal/cml-mcp:latest`	3010	Cisco Modeling Labs MCP tools
iosxe-mcp-server	`ghcr.io/presidio-federal/cisco-ios-xe-mcp:latest`	3011	Cisco IOS-XE RESTCONF/SSH MCP tools
keycloak	`quay.io/keycloak/keycloak`	8080	OAuth2 IdP for A2A JWT validation

LangGraph Studio (Visualization)

For visual graph debugging, run the LangGraph dev server on a separate port:

cd /opt/demo-agent && /home/.venv/bin/langgraph dev --port 8123 --host 0.0.0.0

Then connect LangGraph Studio at https://smith.langchain.com/studio pointing to http://<SERVER_IP>:8123.

This runs a second instance of the agent for visualization only. Production traffic goes through nat-serve on port 8000.

Changing the LLM Model

# Pull the new model
docker exec ollama ollama pull <model-name>

# If the model needs a custom context window (required for GPUs < 24GB VRAM):
cat > /tmp/Modelfile << EOF
FROM <model-name>
PARAMETER num_ctx 131072
EOF
docker exec ollama ollama create <model-name>-128k -f /tmp/Modelfile

# Update environment
sed -i 's/^OLLAMA_MODEL=.*/OLLAMA_MODEL=<model-name>/' /opt/demo-agent/.env
sed -i 's/model: .*/model: <model-name>/' /opt/demo-agent/nat-config.yml

# Unload old model and restart
docker exec ollama ollama stop <old-model>
sudo systemctl restart nat-serve

# Verify
docker exec ollama ollama ps

Current production model: nemotron-3-nano:4b-128k (NVIDIA Nemotron-3-Nano with 128K context configured via custom Modelfile)

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
docs		docs
README.md		README.md

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Factory Edge Agent — NVIDIA + Cisco + Presidio

AI Studio orchestration and A2A delegation

End Goal

Hardware Profile

NVIDIA Components

Agent Architecture

Why Multi-Agent Routing

Skills (Anthropic Agent Skills Specification)

LLM Benchmark Data

nemotron-3-nano:4b-128k (current production model)

Models Evaluated

Key Lessons Learned

Secure External Access (Cloudflare Zero Trust)

OAuth2 Authentication (Keycloak + A2A)

Authentication Flow

Architecture

Keycloak Configuration

NAT server_auth Configuration

Agent Card clientCredentials Flow

Cloudflare Access Considerations

Testing Token Validation Locally

Service Management

Start/Stop All Services

Individual Services

Health Checks

How to Add a New MCP Tool Server

How to Add a New Skill

File Structure

Systemd Units

Docker Containers

LangGraph Studio (Visualization)

Changing the LLM Model

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

NAT `server_auth` Configuration

Agent Card `clientCredentials` Flow

Packages