Presidio-Federal/nvidia-a2a-agent

Factory Edge Agent — NVIDIA + Cisco + Presidio

The Factory Edge Agent is a remote AI agent that runs on-premises with local LLM inference and executes infrastructure management tasks across Cisco network environments. It communicates with external orchestration platforms via the Agent-to-Agent (A2A) protocol, secured through Cloudflare Zero Trust tunnels.

Built for the NVIDIA booth at Cisco Live 2026.


AI Studio orchestration and A2A delegation

Presidio AI Studio is the central AI agent platform in the cloud; it coordinates agents and workflows. The P.A.T.H. Lab is not “local” in the sense of a customer site. It is centralized lab infrastructure: a Cisco AI Rally Kit with NVIDIA GPUs provides LLM inference that AI Studio can use for shared or lab-hosted workloads. Separately, edge agents such as this Factory Edge Agent run a local LLM on their own on-site GPUs (e.g. Ollama on T4s at the deployment site) so that sensitive tool calls and data stay at that site. The AI Studio Network Ops agent delegates infrastructure and network tasks to those edge and remote agents over the A2A protocol.

flowchart TB
    subgraph Cloud["Cloud — AI Agent Platform"]
        Platform["AI Studio\nCentral agent orchestration"]
        NetOps["AI Studio\nNetwork Ops agent"]
        Platform --> NetOps
    end

    subgraph PATH["P.A.T.H. Lab — centralized LLM capacity"]
        Rally["Cisco AI Rally Kit"]
        CentralLLM["Centralized LLM inference\n(shared NVIDIA GPUs in lab)"]
        Rally --> CentralLLM
    end

    subgraph Edge["Edge / remote agents — on-site inference"]
        Factory["Factory Edge Agent\nlocal LLM on edge GPUs + NAT + LangGraph"]
        Peers["Other A2A agents\neach with its own inference stack"]
    end

    Platform -.->|"Uses lab for centralized inference"| Rally
    NetOps -->|"A2A — delegate tasks"| Factory
    NetOps -->|"A2A — delegate tasks"| Peers

The diagram under End Goal below zooms into the Factory Edge Agent stack (tunnels, NAT, LangGraph, on-site Ollama + MCP).


End Goal

Act as a remote autonomous agent with:

  • Local LLM inference on NVIDIA GPUs — sensitive infrastructure data never leaves the site
  • MCP tool integration for Cisco infrastructure management (CML, IOS-XE, Splunk, ThousandEyes)
  • Anthropic Agent Skills for complex multi-step workflows (digital twin creation, incident response)
  • A2A protocol exposure so Presidio AI Studio can delegate tasks to this agent remotely
  • NVIDIA-branded UI for direct human interaction
flowchart TB
    AIStudio["Presidio AI Studio"] -->|"A2A Protocol"| CF["Cloudflare Tunnel"]
    CF -->|"a2a.presidio.tech"| A2A["NAT A2A Server :10000"]
    A2A --> NATServe["NAT API Server :8000"]

    Browser["Operator Browser"] -->|"agent.presidio.tech"| CF2["Cloudflare Tunnel"]
    CF2 --> NATUI["NAT UI :3000"]
    NATUI --> NATServe

    NATServe --> Graph["LangGraph Multi-Agent Router"]

    Graph -->|"cml"| CMLAgent["CML Specialist Agent"]
    Graph -->|"network"| NetAgent["Network Admin Agent"]
    Graph -->|"general"| GenAgent["General Response"]

    CMLAgent -->|"MCP"| CMLMCP["CML MCP :3010"]
    CMLAgent -->|"Skills"| CMLSkill["cml-digital-twin"]
    NetAgent -->|"MCP"| IOSXE["IOS-XE MCP :3011"]
    NetAgent -->|"Skills"| NetSkill["cisco-iosxe-mcp"]

    CMLAgent --> Ollama["Ollama + nemotron-3-nano:4b-128k"]
    NetAgent --> Ollama
    GenAgent --> Ollama
    Graph --> Ollama

    Ollama --> GPU["2x NVIDIA Tesla T4"]

Hardware Profile

| Component | Details |
| --- | --- |
| Server | Cisco UCS C220 M5 |
| Hypervisor | Proxmox with LXC container |
| GPUs | 2x NVIDIA Tesla T4 (16GB VRAM each, 32GB total) |
| GPU Passthrough | Via Proxmox to LXC container |
| OS | Debian (inside LXC) |
| Container Runtime | Docker CE with NVIDIA Container Toolkit |
| Server IP | Private LAN (behind Cloudflare tunnel) |

NVIDIA Components

| Component | Role |
| --- | --- |
| NVIDIA NeMo Agent Toolkit (NAT) | API server (nat serve), A2A gateway (nat a2a serve), wraps the LangGraph agent |
| nvidia-nat-langchain | Plugin enabling NAT to wrap existing LangGraph agents via langgraph_wrapper |
| NeMo Agent Toolkit UI | NVIDIA-branded Next.js chat frontend with intermediate step visualization |
| NVIDIA Nemotron-3-Nano (4B) | Production LLM: NVIDIA's own model running locally via Ollama with 128K context |
| NVIDIA Tesla T4 GPUs | Local LLM inference via Ollama; all data stays on-premises |
| LangSmith/LangGraph | Agent tracing, visualization, and evaluation |

Agent Architecture

Why Multi-Agent Routing

A single agent with all tools (20+ MCP tool definitions) produces a prompt of roughly 5.7K+ tokens, which takes 20+ seconds to process on the T4 GPUs. Multi-agent routing keeps each LLM call's context small and fast.

flowchart TD
    Query["User Query"] --> Router["Router Node\n~300 tokens, ~1s"]
    Router -->|"cml"| CML["CML Agent\n~1-2K tokens, ~5-8s"]
    Router -->|"network"| Net["Network Admin Agent\n~1-2K tokens, ~5-8s"]
    Router -->|"general"| Gen["General Response\n~300 tokens, ~2s"]
    CML --> Done[Response]
    Net --> Done
    Gen --> Done
  • Router: Zero tools. Tiny prompt with agent names and descriptions. Classifies the request in ~1 second.
  • CML Specialist: Only CML MCP tools + CML digital twin skill. Handles lab management, topology creation, digital twin workflows.
  • Network Admin Specialist: Only IOS-XE MCP tools + cisco-iosxe-mcp skill. Handles device status via RESTCONF, interface inspection, and read-only network operations. System prompt enforces RESTCONF-only access (no SSH) and provides device-to-port mapping via the skill's reference files.
  • General: No tools. Answers meta-questions about available capabilities using the agent registry.
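
The router step above can be sketched as a small normalization function. This is a minimal illustration, not the repository's code: it assumes the router LLM is prompted to answer with a single label, and the names ROUTES, FALLBACK, and parse_route are hypothetical. Unknown or empty replies fall back to the tool-free general route, which is an assumed design choice.

```python
import re

# Route labels taken from the routing diagram; the fallback choice is an assumption.
ROUTES = ("cml", "network", "general")
FALLBACK = "general"


def parse_route(raw_reply: str) -> str:
    """Map a raw router reply to a known route label."""
    # Small models often wrap the label in quotes, punctuation, or a short sentence,
    # so scan the lowercase words and take the first one that is a known route.
    for word in re.findall(r"[a-z_]+", raw_reply.lower()):
        if word in ROUTES:
            return word
    return FALLBACK
```

Falling back to general keeps a misclassified request away from any tool-bearing specialist, at the cost of a less useful answer.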

Skills (Anthropic Agent Skills Specification)

Skills follow the agentskills.io three-level progressive loading:

  1. Level 1 — Discovery: Only name + description from SKILL.md frontmatter loaded at startup (~100 tokens per skill)
  2. Level 2 — Activation: Agent calls activate_skill tool to load full SKILL.md body when a complex task matches a skill
  3. Level 3 — Execution: Agent calls read_skill_reference tool to load files from the skill's references/ directory on demand

This keeps prompt context minimal for simple queries while providing full procedural knowledge for complex workflows.
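
The three loading levels can be sketched with the standard library alone. The tool names activate_skill and read_skill_reference come from the list above; everything else here is an assumption for illustration: the helper split_frontmatter, the flat "key: value" frontmatter parsing, and the idea that each skill's directory name equals its name field.

```python
from pathlib import Path


def split_frontmatter(text: str) -> tuple:
    """Split SKILL.md into (frontmatter fields, body). Handles flat 'key: value' lines only."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        return {}, text
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta, "\n".join(lines[end + 1:]).strip()


def discover_skills(skills_root: Path) -> dict:
    """Level 1: return {name: description} for every skill; the body is not kept."""
    found = {}
    for skill_md in sorted(skills_root.glob("*/SKILL.md")):
        meta, _body = split_frontmatter(skill_md.read_text(encoding="utf-8"))
        if "name" in meta:
            found[meta["name"]] = meta.get("description", "")
    return found


def activate_skill(skills_root: Path, name: str) -> str:
    """Level 2: return the full SKILL.md body for one skill."""
    text = (skills_root / name / "SKILL.md").read_text(encoding="utf-8")
    _meta, body = split_frontmatter(text)
    return body


def read_skill_reference(skills_root: Path, name: str, filename: str) -> str:
    """Level 3: return one file from the skill's references/ directory."""
    ref_dir = (skills_root / name / "references").resolve()
    target = (ref_dir / filename).resolve()
    # Refuse paths that escape references/ (for example "../../.env").
    if not target.is_relative_to(ref_dir):
        raise ValueError(f"{filename!r} is outside {ref_dir}")
    return target.read_text(encoding="utf-8")
```

Only the output of discover_skills would need to appear in a specialist's system prompt at startup; the other two functions run when the model calls the corresponding tool.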


LLM Benchmark Data

Testing was performed on the multi-agent routing architecture with the following models:

nemotron-3-nano:4b-128k (current production model)

NVIDIA's Nemotron-3-Nano is a 4B parameter model with native 256K context support. Ollama defaults to 4K context on GPUs with less than 24GB VRAM (each T4 is 16GB), so a custom Modelfile was required to set num_ctx explicitly. The -128k variant was created with 128K context configured, balancing context capacity with VRAM usage.

Creating the custom context variant:

cat > /tmp/Modelfile << EOF
FROM nemotron-3-nano:4b
PARAMETER num_ctx 131072
EOF
ollama create nemotron-3-nano:4b-128k -f /tmp/Modelfile

| Query | Tokens | Latency |
| --- | --- | --- |
| "What tools are available?" (general route) | ~600 | ~6s |
| "Get the hostname of WAN-01" (network route, MCP RESTCONF call) | ~2.7K | ~12s |
| "List my CML labs" (CML route, MCP tool call) | ~14K | ~17s |
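
Because a too-small context window fails silently, it can help to check the configured num_ctx before trusting a model variant. The sketch below only parses text; it assumes the parameter listing is a set of whitespace-separated "name value" lines, which is how Ollama's parameter output is expected to look but should be confirmed on your version. The function names are hypothetical.

```python
from typing import Optional


def parse_num_ctx(parameters_text: str) -> Optional[int]:
    """Return num_ctx from 'name value' parameter lines, or None if it is not set."""
    for line in parameters_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "num_ctx":
            return int(fields[1])
    return None


def context_is_sufficient(parameters_text: str, required_tokens: int) -> bool:
    """True only when num_ctx is set explicitly and covers the required prompt size."""
    num_ctx = parse_num_ctx(parameters_text)
    return num_ctx is not None and num_ctx >= required_tokens
```

The input text could come from something like docker exec ollama ollama show nemotron-3-nano:4b-128k --parameters, if your Ollama version supports that flag. With the ~14K-token CML query from the table as the requirement, a 4096-token window fails the check and 131072 passes.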

Models Evaluated

| Model | Size | Context | Tool Calling | Result |
| --- | --- | --- | --- | --- |
| nemotron-3-nano:4b-128k | ~2.5GB | 128K (custom) | Reliable | NVIDIA model. Requires clear system prompt instructions for optional parameter handling. Fast prefill. Selected as production model. |
| qwen2.5:7b | ~4.5GB | 128K | Reliable | Follows instructions well, no thinking overhead. Previous production model, replaced by nemotron for NVIDIA alignment. |
| qwen3:8b | ~5GB | 128K | Yes | Thinking mode generates hundreds of hidden tokens. 30-40s per call. Rejected. |
| phi4-mini:3.8b | ~2.5GB | 128K | Yes | Fast inference but ignores system prompt constraints. Hallucinated capabilities instead of listing actual tools. Rejected. |
| nemotron-mini:4b | ~2.5GB | 4K | Unreliable (~50%) | Previous generation NVIDIA model. 4K context too small for tool definitions. Tool calling outputs raw XML instead of proper function calls ~50% of the time. Rejected. |

Key Lessons Learned

  • Ollama num_ctx defaults are model- and GPU-dependent. Ollama silently reduces context windows on GPUs with less than 24GB VRAM. nemotron-3-nano:4b supports 256K natively but Ollama defaulted to 4K, causing tool call XML to truncate and produce parse errors. Always configure num_ctx explicitly via a custom Modelfile.
  • Prompt size is the bottleneck, not GPU power. T4s process ~250-350 tokens/sec for prefill. A 5K token prompt = ~15-20s just for prefill.
  • create_deep_agent was the wrong starting point. It creates a planner that makes 3-5 sequential LLM calls per request, each processing the full prompt. Replaced with multi-agent routing using create_react_agent per specialist.
  • Multi-agent routing reduces per-call context by 3-4x. The router sees ~300 tokens and each specialist sees ~1-2K tokens (only its own tools), versus 5.7K+ tokens for a monolithic agent.
  • Smaller models need explicit operational context in the system prompt. MCP tool descriptions that say "parameters are optional if configured via environment variables" are not specific enough for 4B models. The specialist system prompt must explicitly state which parameters to omit and which tools to avoid. Larger models (7B+) infer this more reliably from tool descriptions alone.
  • Qwen3's "thinking mode" generates invisible tokens that waste inference time. Even with reasoning_effort: none, the model was unreliable. Qwen2.5:7b and nemotron-3-nano avoid this entirely.
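
The prefill arithmetic behind these lessons is simple division: prompt tokens over prefill rate. The sketch below uses the 250-350 tokens/sec range and the token counts quoted above. It covers prefill only and assumes one router call plus one specialist call per request; token generation, tool latency, and any extra LLM turns in a tool-calling loop are ignored.

```python
# Prefill rates quoted above for the T4s (tokens per second).
SLOW_RATE = 250.0
FAST_RATE = 350.0


def prefill_range(prompt_tokens: int) -> tuple:
    """Return (best_case_seconds, worst_case_seconds) for prompt prefill only."""
    return prompt_tokens / FAST_RATE, prompt_tokens / SLOW_RATE


def routed_prefill_range(router_tokens: int, specialist_tokens: int) -> tuple:
    """A routed request pays for two prefills: the router call, then one specialist call."""
    router = prefill_range(router_tokens)
    specialist = prefill_range(specialist_tokens)
    return router[0] + specialist[0], router[1] + specialist[1]


for label, (best, worst) in {
    "5K-token prompt": prefill_range(5000),
    "monolithic agent (5.7K)": prefill_range(5700),
    "router (300) + specialist (2K)": routed_prefill_range(300, 2000),
}.items():
    print(f"{label}: {best:.1f}s to {worst:.1f}s")
```

Under these assumptions, a routed request spends well under half the prefill time of the monolithic prompt, even though it makes two LLM calls instead of one.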

Secure External Access (Cloudflare Zero Trust)

The agent is exposed to the internet through Cloudflare Tunnels with Zero Trust access policies. No ports are opened on the server's firewall.

flowchart LR
    Internet["Internet"] --> CF["Cloudflare Edge"]
    CF -->|"agent.presidio.tech"| Tunnel1["Cloudflare Tunnel"]
    CF -->|"a2a.presidio.tech"| Tunnel2["Cloudflare Tunnel"]
    CF -->|"auth.presidio.tech"| Tunnel3["Cloudflare Tunnel"]
    Tunnel1 -->|"localhost:3000"| NATUI["NAT UI"]
    Tunnel2 -->|"localhost:10000"| A2A["NAT A2A Server"]
    Tunnel3 -->|"localhost:8080"| KC["Keycloak"]

| Endpoint | URL | Backend | Auth |
| --- | --- | --- | --- |
| NAT UI | https://agent.presidio.tech | localhost:3000 | Cloudflare Access (SSO) |
| A2A Gateway | https://a2a.presidio.tech | localhost:10000 | OAuth2 Bearer token (JWT) |
| Keycloak IdP | https://auth.presidio.tech | localhost:8080 | Cloudflare Access Bypass for /realms/* |

The A2A endpoint accepts JSON-RPC requests following the A2A protocol specification. The agent card is available at https://a2a.presidio.tech/.well-known/agent-card.json.


OAuth2 Authentication (Keycloak + A2A)

The A2A endpoint is protected by OAuth2 with JWT token validation. Keycloak serves as the identity provider (IdP), and NAT validates tokens server-side using JWKS.

Authentication Flow

sequenceDiagram
    participant Client as A2A Client<br/>(AI Studio)
    participant KC as Keycloak<br/>(auth.presidio.tech)
    participant Agent as Factory Edge Agent<br/>(a2a.presidio.tech)

    Client->>Agent: GET /.well-known/agent-card.json (public, no auth)
    Agent-->>Client: Agent card with securitySchemes<br/>(authorizationCode + clientCredentials flows)

    Client->>KC: POST /protocol/openid-connect/token<br/>(client_credentials grant, scope: agent_execute)
    KC-->>Client: JWT access token (aud: a2a.presidio.tech)

    Client->>Agent: POST / (JSON-RPC)<br/>Authorization: Bearer <JWT>
    Agent->>KC: Fetch JWKS (cached)
    Agent->>Agent: Validate signature, issuer, audience, scopes, expiry
    Agent-->>Client: A2A response

Architecture

  • Keycloak runs as a Docker container on the same server, exposed via Cloudflare Tunnel at auth.presidio.tech
  • NAT A2A Server validates incoming JWT tokens using the JWKS endpoint (fetched locally via localhost:8080 to avoid hairpinning through Cloudflare)
  • Agent card advertises both authorizationCode and clientCredentials OAuth2 flows so clients can authenticate using whichever method they support
  • Cloudflare terminates TLS and passes Authorization headers through to the origin without modification

Keycloak Configuration

| Resource | Value |
| --- | --- |
| Realm | agent-auth |
| Client | Confidential client with client_credentials grant and standardFlowEnabled |
| Client Scope | agent_execute (assigned as default scope) |
| Audience Mapper | Injects the A2A public URL into the aud claim of access tokens |
| User | Optional; only needed if using the authorization code flow for browser-based connections |

NAT server_auth Configuration

general:
  front_end:
    _type: a2a
    name: "Factory Edge Agent"
    description: "Remote infrastructure management agent"
    host: 0.0.0.0
    port: 10000
    public_base_url: "https://a2a.presidio.tech"
    server_auth:
      issuer_url: https://auth.presidio.tech/realms/agent-auth
      discovery_url: https://auth.presidio.tech/realms/agent-auth/.well-known/openid-configuration
      jwks_uri: http://localhost:8080/realms/agent-auth/protocol/openid-connect/certs
      scopes:
        - agent_execute
      audience: https://a2a.presidio.tech

Key points:

  • discovery_url is required for NAT to resolve the correct OAuth endpoints in the agent card. Without it, NAT falls back to generic /oauth/authorize and /oauth/token paths that don't match Keycloak's OIDC paths.
  • jwks_uri points to localhost:8080 so NAT can validate token signatures without routing through Cloudflare (this avoids hairpinning and interference from Cloudflare Access).
  • issuer_url must match the iss claim in tokens exactly — this is the public Keycloak URL since that's what Keycloak stamps into tokens.
  • audience must match the aud claim injected by the Keycloak audience mapper.
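
When a token is rejected, it is often quickest to decode its payload and compare the claims against the server_auth values above. The sketch below is a debugging aid only: it does NOT verify the signature, which NAT does against the JWKS. It assumes granted scopes arrive as a space-separated scope claim, and the function names are hypothetical.

```python
import base64
import json
import time
from typing import Optional


def decode_payload(token: str) -> dict:
    """Decode the payload segment of a JWT. This does NOT verify the signature."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))


def claim_problems(claims: dict, issuer_url: str, audience: str,
                   scopes: list, now: Optional[float] = None) -> list:
    """Return human-readable mismatches; an empty list means the claims line up."""
    now = time.time() if now is None else now
    problems = []
    if claims.get("iss") != issuer_url:
        problems.append(f"iss is {claims.get('iss')!r}, expected {issuer_url!r}")
    aud = claims.get("aud", [])
    aud_values = [aud] if isinstance(aud, str) else list(aud)  # aud may be a string or a list
    if audience not in aud_values:
        problems.append(f"aud {aud_values!r} does not contain {audience!r}")
    granted = str(claims.get("scope", "")).split()  # assumes a space-separated "scope" claim
    missing = [s for s in scopes if s not in granted]
    if missing:
        problems.append(f"missing scopes: {missing!r}")
    if claims.get("exp", 0) <= now:
        problems.append("token is expired or has no exp claim")
    return problems
```

Calling claim_problems with the issuer_url, audience, and scopes from the YAML above shows which of the four key points a given token violates.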

Agent Card clientCredentials Flow

By default, NAT only generates an authorizationCode flow in the agent card's securitySchemes. For platforms like AI Studio that use client credentials for machine-to-machine A2A calls, the agent card must also advertise a clientCredentials flow with the token URL. This was added via a source modification to front_end_plugin_worker.py (adding ClientCredentialsOAuthFlow alongside AuthorizationCodeOAuthFlow).

Cloudflare Access Considerations

  • auth.presidio.tech requires a Bypass policy (not Allow) for /realms/* paths. Cloudflare Access "Allow" policies require a browser session cookie, so machine clients (NAT's JWKS fetcher, AI Studio's token requests) get 302-redirected to a login page even if their source IP is allowlisted.
  • a2a.presidio.tech does not need a Cloudflare Access policy — authentication is handled by the NAT OAuth middleware. Optionally, restrict by IP for defense-in-depth.

Testing Token Validation Locally

# Get a token via client credentials
TOKEN=$(curl -s -X POST http://localhost:8080/realms/agent-auth/protocol/openid-connect/token \
  -d "grant_type=client_credentials" \
  -d "client_id=<client-id>" \
  -d "client_secret=<client-secret>" \
  -d "scope=agent_execute" | python3 -c 'import sys,json; print(json.load(sys.stdin)["access_token"])')

# Test against NAT directly
curl -s -X POST http://localhost:10000/ \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"agent/discover","id":"1","params":{}}' | python3 -m json.tool

# Test through Cloudflare tunnel
curl -s -X POST https://a2a.presidio.tech/ \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"agent/discover","id":"1","params":{}}' | python3 -m json.tool
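
The same two-step check can be scripted in Python with only the standard library. This sketch mirrors the curl commands above, including the agent/discover method they use. The function names are hypothetical, and the request builders are kept separate from the network call so they can be inspected without a running server.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional


def build_token_request(token_url: str, client_id: str, client_secret: str,
                        scope: str = "agent_execute") -> urllib.request.Request:
    """Form-encoded client_credentials request, mirroring the first curl command."""
    form = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,
    }).encode()
    return urllib.request.Request(
        token_url, data=form, method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def build_rpc_request(agent_url: str, token: str, method: str = "agent/discover",
                      params: Optional[dict] = None,
                      request_id: str = "1") -> urllib.request.Request:
    """JSON-RPC 2.0 request with a Bearer token, mirroring the second curl command."""
    body = json.dumps({
        "jsonrpc": "2.0",
        "method": method,
        "id": request_id,
        "params": params or {},
    }).encode()
    return urllib.request.Request(
        agent_url, data=body, method="POST",
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    )


def send_json(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and parse the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

To use it, pass the result of build_token_request to send_json, read access_token from the returned dict, then send build_rpc_request with that token. Note that urlopen raises urllib.error.HTTPError on a 401, so a rejected token surfaces as an exception rather than a JSON body.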

Service Management

All services are managed via systemd.

Start/Stop All Services

sudo systemctl start demo-agent.target    # Start all
sudo systemctl stop demo-agent.target     # Stop all

Individual Services

| Service | Command | Port |
| --- | --- | --- |
| NAT API Server | sudo systemctl restart nat-serve | 8000 |
| NAT A2A Server | sudo systemctl restart nat-a2a | 10000 |
| NAT UI | sudo systemctl restart nat-ui | 3000 |

Health Checks

# Service status
systemctl status nat-serve nat-a2a nat-ui

# Port verification
ss -tlnp | grep -E ':(3000|8000|10000) '

# NAT API health
curl http://localhost:8000/health

# View logs
sudo journalctl -u nat-serve -f --no-pager -n 50

# GPU utilization
nvidia-smi

# Ollama model status
docker exec ollama ollama ps
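
For a scripted version of the port verification, a TCP connect test per service is enough. This is a minimal sketch: the port numbers come from the Individual Services table, while the dictionary keys and function names are hypothetical. A successful connect only shows that something is listening, not that the service is healthy.

```python
import socket

# Ports from the Individual Services table; adjust to your deployment.
SERVICE_PORTS = {"nat-ui": 3000, "nat-serve": 8000, "nat-a2a": 10000}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections and timeouts
        return False


def check_services(host: str = "127.0.0.1", ports: dict = SERVICE_PORTS) -> dict:
    """Return {service_name: is_listening} for each configured port."""
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

Calling check_services() on the server returns one boolean per service, which is easy to feed into a monitoring script or a pre-demo checklist.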

How to Add a New MCP Tool Server

  1. Start the MCP server as a Docker container:
docker run -d \
  -p <PORT>:<PORT> \
  -e <CONFIG_VARS> \
  --name <server-name> \
  --restart unless-stopped \
  <image>
  2. Add the MCP server to the agent registry in /opt/demo-agent/agent.py:
AGENT_REGISTRY = {
    # ... existing agents ...
    "new_agent_name": {
        "description": "Description of what this agent does",
        "mcp": {
            "server_name": {
                "transport": "streamable_http",
                "url": os.getenv("NEW_MCP_URL", "http://localhost:<PORT>/mcp"),
            },
        },
        "skill_dirs": ["skill-directory-name"],  # optional
    },
}
  3. Add the specialist node in the build_agent() function:
async def new_agent_node(state):
    result = await specialists["new_agent_name"].ainvoke({"messages": state["messages"]})
    return {"messages": result["messages"]}

graph.add_node("new_agent_name", new_agent_node)
  4. Add the routing edge: update the conditional_edges dict to include the new agent name.

  5. Add routing examples to ROUTER_PROMPT:

- "relevant query example" -> new_agent_name
  6. Restart:
sudo systemctl restart nat-serve
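
The routing edge and router prompt steps are easy to forget because they repeat information already in AGENT_REGISTRY. One way to keep them in sync is to derive both from the registry. The sketch below is an assumption about how that could look, not the repository's code: build_router_prompt, build_edge_map, the prompt wording, and the treatment of general as a built-in route outside the registry are all hypothetical. The example line format follows the ROUTER_PROMPT snippet above.

```python
def build_router_prompt(registry: dict, examples: dict) -> str:
    """Build a router prompt listing each agent's name and description plus examples."""
    lines = ["Classify the request. Reply with exactly one agent name.", "", "Agents:"]
    for name, entry in registry.items():
        lines.append(f"- {name}: {entry['description']}")
    lines.append("- general: anything else, including questions about capabilities")
    lines += ["", "Examples:"]
    for query, agent in examples.items():
        # Catch examples that point at an agent nobody registered.
        if agent not in registry and agent != "general":
            raise ValueError(f"example routes to unknown agent {agent!r}")
        lines.append(f'- "{query}" -> {agent}')
    return "\n".join(lines)


def build_edge_map(registry: dict) -> dict:
    """Map each router label to the graph node of the same name."""
    edge_map = {name: name for name in registry}
    edge_map["general"] = "general"
    return edge_map
```

The edge map could then be passed as the path map argument of LangGraph's add_conditional_edges, assuming each specialist node is registered under its registry key as in the add_node call above.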

How to Add a New Skill

Skills follow the Anthropic Agent Skills specification.

  1. Create the skill directory under /opt/demo-agent/skills/:
/opt/demo-agent/skills/my-new-skill/
├── SKILL.md              # Required: metadata + instructions
└── references/           # Optional: documentation loaded on demand
    ├── api-guide.md
    └── examples.md
  2. Write the SKILL.md with YAML frontmatter:
---
name: my-new-skill
description: What this skill does and when to use it. Include keywords for agent discovery.
---

# My New Skill

## Overview
What this skill accomplishes.

## Procedure
1. Step one — call tool X with parameters Y
2. Step two — read reference file Z for configuration details
3. Step three — execute the workflow

## Reference Files
- `api-guide.md` — API documentation for the target system
- `examples.md` — Example configurations and expected outputs
  3. Map the skill to an agent in AGENT_REGISTRY:
"agent_name": {
    "skill_dirs": ["my-new-skill"],
    ...
}
  4. Restart:
sudo systemctl restart nat-serve

The agent will log [AGENT_NAME] Loaded 1 skills: ['my-new-skill'] on startup.


File Structure

/opt/demo-agent/
├── agent.py              # Multi-agent LangGraph graph (router + specialists)
├── nat-config.yml        # NAT configuration (LLM + langgraph_wrapper)
├── .env                  # Environment variables (model, API keys, tokens)
├── langgraph.json        # LangGraph dev server config (visualization only)
├── skills/
│   ├── cml-digital-twin/
│   │   ├── SKILL.md
│   │   └── references/
│   └── cisco-iosxe-mcp/
│       ├── SKILL.md
│       └── references/
└── nat-ui/               # NVIDIA NeMo Agent Toolkit UI (Next.js)
    └── .env              # UI configuration (branding, backend URL)

Systemd Units

/etc/systemd/system/
├── nat-serve.service     # NAT API Server (wraps LangGraph agent)
├── nat-a2a.service       # NAT A2A Server (A2A protocol gateway)
├── nat-ui.service        # NVIDIA NAT UI (Next.js frontend)
└── demo-agent.target     # Group target for all services

Docker Containers

| Container | Image | Port | Purpose |
| --- | --- | --- | --- |
| ollama | ollama/ollama:latest | 11434 | LLM inference with GPU acceleration |
| cml-mcp-server | ghcr.io/presidio-federal/cml-mcp:latest | 3010 | Cisco Modeling Labs MCP tools |
| iosxe-mcp-server | ghcr.io/presidio-federal/cisco-ios-xe-mcp:latest | 3011 | Cisco IOS-XE RESTCONF/SSH MCP tools |
| keycloak | quay.io/keycloak/keycloak | 8080 | OAuth2 IdP for A2A JWT validation |

LangGraph Studio (Visualization)

For visual graph debugging, run the LangGraph dev server on a separate port:

cd /opt/demo-agent && /home/.venv/bin/langgraph dev --port 8123 --host 0.0.0.0

Then connect LangGraph Studio at https://smith.langchain.com/studio pointing to http://<SERVER_IP>:8123.

This runs a second instance of the agent for visualization only. Production traffic goes through nat-serve on port 8000.


Changing the LLM Model

# Pull the new model
docker exec ollama ollama pull <model-name>

# If the model needs a custom context window (required for GPUs < 24GB VRAM):
cat > /tmp/Modelfile << EOF
FROM <model-name>
PARAMETER num_ctx 131072
EOF
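# The Modelfile was written on the host; copy it into the container first
# (skip this if /tmp is already bind-mounted into the ollama container)
docker cp /tmp/Modelfile ollama:/tmp/Modelfile
docker exec ollama ollama create <model-name>-128k -f /tmp/Modelfile

# Update environment (use <model-name>-128k below if you created the custom variant)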
sed -i 's/^OLLAMA_MODEL=.*/OLLAMA_MODEL=<model-name>/' /opt/demo-agent/.env
sed -i 's/model: .*/model: <model-name>/' /opt/demo-agent/nat-config.yml

# Unload old model and restart
docker exec ollama ollama stop <old-model>
sudo systemctl restart nat-serve

# Verify
docker exec ollama ollama ps

Current production model: nemotron-3-nano:4b-128k (NVIDIA Nemotron-3-Nano with 128K context configured via custom Modelfile)
