gemini-eval-agent

An LLM-evaluation auditor agent built on Google Cloud Agent Builder (ADK), Gemini 2.5, and the Arize Phoenix MCP server.

Live demo: https://gemini-eval-agent-1029931682737.us-central1.run.app
Demo video: https://youtu.be/3q9SFoMsAhE (1:41)
License: Apache 2.0

What it does

You ask "is checkout-rag-v2 hallucinating?" or "show me the slowest 10 traces from yesterday." The agent uses the Arize Phoenix MCP tools to inspect projects, traces, experiments, and datasets, then returns a structured verdict (PASS / FAIL / NEEDS REVIEW) with cited trace IDs, evaluator scores, and a concrete next step.
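The verdict shape described above could be modeled roughly as follows. This is a minimal sketch, not the repo's actual schema: the `Status` and `Verdict` names, the field names, and the sample trace ID and score are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    NEEDS_REVIEW = "NEEDS REVIEW"


@dataclass
class Verdict:
    """Hypothetical container for the agent's structured answer."""
    status: Status
    trace_ids: list[str] = field(default_factory=list)  # cited evidence
    evaluator_scores: dict[str, float] = field(default_factory=dict)
    next_step: str = ""

    def render(self) -> str:
        """Format the verdict as a short plain-text report."""
        scores = ", ".join(
            f"{name}={value:.2f}"
            for name, value in sorted(self.evaluator_scores.items())
        )
        cites = ", ".join(self.trace_ids)
        return (
            f"Verdict: {self.status.value}\n"
            f"Traces: {cites or 'none'}\n"
            f"Scores: {scores or 'none'}\n"
            f"Next step: {self.next_step}"
        )


example = Verdict(
    status=Status.NEEDS_REVIEW,
    trace_ids=["trace-001"],                   # made-up ID
    evaluator_scores={"hallucination": 0.42},  # made-up score
    next_step="Re-run the evaluator on a larger sample.",
)
print(example.render())
```

Keeping the status as an enum rather than a free-form string means a dashboard can branch on exactly three values, while the cited trace IDs stay separate from the prose so they can be linked back to Phoenix.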

The agent uses the standard Phoenix MCP tool surface (list_projects, list_traces, get_trace_detail, list_experiments, list_datasets, run_evaluation), the same tools exposed by the official @arizeai/phoenix-mcp package. A local stub MCP server ships with the repo so demos run without a Phoenix tenant; flip one flag ("Use stub Phoenix MCP" in the dashboard sidebar) and the same agent code targets a real tenant.

Architecture

┌─────────────┐  user question      ┌─────────────────────────────┐
│  Streamlit  │ ──────────────────▶ │  ADK LlmAgent (Gemini 2.5)  │
│  dashboard  │                     │  on Vertex AI               │
└─────────────┘ ◀── verdict + cites └────┬────────────────────────┘
                                         │ MCPToolset / stdio
                                         ▼
                            ┌─────────────────────────┐
                            │  Arize Phoenix MCP      │
                            │  (stub by default,      │
                            │  real tenant via flag)  │
                            └─────────────────────────┘

Try it locally (no Phoenix tenant needed)

git clone https://github.com/MukundaKatta/gemini-eval-agent
cd gemini-eval-agent
python3 -m venv .venv && source .venv/bin/activate
pip install -e .

gcloud auth application-default login
export GOOGLE_CLOUD_PROJECT=your-project
export GOOGLE_GENAI_USE_VERTEXAI=true
export GOOGLE_CLOUD_LOCATION=us-central1

PYTHONPATH=src streamlit run app/dashboard.py
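If the dashboard fails to reach Vertex AI, a quick preflight check of the three variables exported above can narrow things down. The script below is a hypothetical helper, not part of the repo; `missing_vars` is a name chosen for this sketch.

```python
import os

# The three variables exported in the setup steps above.
REQUIRED_VARS = (
    "GOOGLE_CLOUD_PROJECT",
    "GOOGLE_GENAI_USE_VERTEXAI",
    "GOOGLE_CLOUD_LOCATION",
)


def missing_vars(env=None):
    """Return the names from REQUIRED_VARS that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Vertex AI environment looks complete.")
```

This only checks that the variables are present; it does not verify that the application-default credentials from `gcloud auth application-default login` are valid.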

Try it against a real Arize Phoenix tenant

export PHOENIX_BASE_URL=https://your-tenant.phoenix.arize.com
export PHOENIX_API_KEY=...

In the dashboard sidebar, untick "Use stub Phoenix MCP". The agent now spawns the official @arizeai/phoenix-mcp npm package via npx.
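One way the stub/real switch could translate into a stdio launch command is sketched below. This is an illustration under assumptions, not the repo's code: the function name `mcp_launch_command` and the stub module path `gemini_eval_agent.stub_server` are hypothetical, and the `--baseUrl` / `--apiKey` flags are assumed from the phoenix-mcp package's documented usage, so check them against the version you install.

```python
import os
import sys


def mcp_launch_command(use_stub, env=None):
    """Return (command, args) for the MCP server to spawn over stdio."""
    env = os.environ if env is None else env
    if use_stub:
        # Hypothetical module path for the bundled stub server.
        return sys.executable, ["-m", "gemini_eval_agent.stub_server"]

    base_url = env.get("PHOENIX_BASE_URL")
    api_key = env.get("PHOENIX_API_KEY")
    if not base_url or not api_key:
        raise RuntimeError("Set PHOENIX_BASE_URL and PHOENIX_API_KEY first")

    # Assumed flags; verify against the phoenix-mcp documentation.
    return "npx", [
        "-y",
        "@arizeai/phoenix-mcp",
        "--baseUrl", base_url,
        "--apiKey", api_key,
    ]


if __name__ == "__main__":
    command, args = mcp_launch_command(use_stub=True)
    print(command, *args)
```

Failing early when either variable is missing avoids spawning an npx process that would only error out later, inside the agent's tool call.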

Tests

PYTHONPATH=src pytest -q

11 tests cover the stub server and agent wiring.

License

Apache 2.0. Mukunda Katta, independent developer.
