An LLM-evaluation auditor agent built on Google Cloud Agent Builder (ADK), Gemini 2.5, and the Arize Phoenix MCP server.
Live demo: https://gemini-eval-agent-1029931682737.us-central1.run.app
Demo video: https://youtu.be/3q9SFoMsAhE (1:41)
License: Apache 2.0
You ask "is checkout-rag-v2 hallucinating?" or "show me the slowest 10 traces from yesterday." The agent uses the Arize Phoenix MCP tools to inspect projects, traces, experiments, and datasets, then returns a structured verdict (PASS / FAIL / NEEDS REVIEW) with cited trace IDs, evaluator scores, and a concrete next step.
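To make the shape of that verdict concrete, here is a minimal sketch of how it could be modelled in Python. The class and field names (`Status`, `Verdict`, `trace_ids`, `evaluator_scores`, `next_step`) are assumptions chosen for illustration; the agent's real output schema may differ.

```python
from dataclasses import dataclass
from enum import Enum


class Status(str, Enum):
    """The three verdict labels named above."""
    PASS = "PASS"
    FAIL = "FAIL"
    NEEDS_REVIEW = "NEEDS REVIEW"


@dataclass
class Verdict:
    # All field names are hypothetical, chosen for this sketch.
    status: Status
    trace_ids: list[str]                 # traces cited as evidence
    evaluator_scores: dict[str, float]   # evaluator name -> score
    next_step: str                       # concrete follow-up action

    def render(self) -> str:
        """Format the verdict as a single summary line."""
        scores = ", ".join(
            f"{name}={score:.2f}"
            for name, score in sorted(self.evaluator_scores.items())
        )
        cites = ", ".join(self.trace_ids) or "none"
        return (
            f"{self.status.value} | traces: {cites} | "
            f"scores: {scores} | next: {self.next_step}"
        )
```

Keeping the cited trace IDs and scores as structured fields, rather than free text, makes it possible to check a verdict against the underlying traces.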
The agent uses the standard Phoenix MCP tool surface
(list_projects, list_traces, get_trace_detail, list_experiments,
list_datasets, run_evaluation), the same surface as the official
@arizeai/phoenix-mcp npm package.
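A question such as "show me the slowest 10 traces" comes down to ranking the records that list_traces returns. The sketch below illustrates that ranking step only. The record shape (`trace_id` and `latency_ms` keys) is an assumption for illustration, not the documented Phoenix response format.

```python
import heapq


def slowest_traces(traces: list[dict], n: int = 10) -> list[dict]:
    """Return the n highest-latency traces, slowest first.

    Assumes each record carries a numeric "latency_ms" key
    (a hypothetical field name used for this sketch).
    """
    return heapq.nlargest(n, traces, key=lambda t: t["latency_ms"])


# Made-up sample records in the assumed shape.
sample = [
    {"trace_id": "a", "latency_ms": 120},
    {"trace_id": "b", "latency_ms": 950},
    {"trace_id": "c", "latency_ms": 430},
]
top_two = slowest_traces(sample, n=2)
```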
A local stub MCP server ships with the repo so demos run without a
Phoenix tenant; flip one flag and the same agent code targets a real
tenant.
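The stub-versus-tenant switch can be pictured as choosing which stdio server command the agent spawns. The following is a minimal sketch under stated assumptions, not the repository's actual wiring: the stub module name `phoenix_stub_server` is hypothetical, and the `--baseUrl` / `--apiKey` flags are assumed from the package's published usage and should be checked against its README.

```python
import os
import sys
from typing import Mapping, Optional


def mcp_server_command(
    use_stub: bool, env: Optional[Mapping[str, str]] = None
) -> list[str]:
    """Build the argv for the MCP server process to spawn over stdio."""
    env = os.environ if env is None else env
    if use_stub:
        # Hypothetical module name for the bundled stub server.
        return [sys.executable, "-m", "phoenix_stub_server"]
    # Real tenant: credentials come from the documented environment variables.
    # A missing variable raises KeyError rather than starting a misconfigured server.
    base_url = env["PHOENIX_BASE_URL"]
    api_key = env["PHOENIX_API_KEY"]
    # Flag names are an assumption; verify against the package documentation.
    return [
        "npx", "-y", "@arizeai/phoenix-mcp",
        "--baseUrl", base_url,
        "--apiKey", api_key,
    ]
```

Because both branches return a plain command list, the agent code that consumes it does not need to know which server it is talking to.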
┌─────────────┐    user question     ┌─────────────────────────────┐
│  Streamlit  │ ───────────────────▶ │  ADK LlmAgent (Gemini 2.5)  │
│  dashboard  │                      │  on Vertex AI               │
└─────────────┘ ◀── verdict + cites ─└────┬────────────────────────┘
                                          │ MCPToolset / stdio
                                          ▼
                             ┌─────────────────────────┐
                             │   Arize Phoenix MCP     │
                             │   (stub by default,     │
                             │   real tenant via flag) │
                             └─────────────────────────┘
git clone https://github.com/MukundaKatta/gemini-eval-agent
cd gemini-eval-agent
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
gcloud auth application-default login
export GOOGLE_CLOUD_PROJECT=your-project
export GOOGLE_GENAI_USE_VERTEXAI=true
export GOOGLE_CLOUD_LOCATION=us-central1
PYTHONPATH=src streamlit run app/dashboard.py

To target a real Phoenix tenant, export the tenant URL and API key in the same shell before launching the dashboard:

export PHOENIX_BASE_URL=https://your-tenant.phoenix.arize.com
export PHOENIX_API_KEY=...

In the dashboard sidebar, untick "Use stub Phoenix MCP". The agent now
spawns the official @arizeai/phoenix-mcp npm package via npx.

To run the tests:

PYTHONPATH=src pytest -q

11 tests cover the stub server and agent wiring.
Apache 2.0. Mukunda Katta, independent developer.