How we use 3 AI agents as a full engineering team.
Claude Code as the boss. Codex + Gemini as the team. Real patterns from production.
We run an animal sanctuary in Japan. 28 cats and dogs. No engineering team.
We built an API marketplace with 30+ integrations, token economy, multi-language support, Docker deployment — using 3 AI coding agents working as a team.
This document is exactly how we orchestrate them.
┌─────────────────────────────────────────────────┐
│              Claude Code (Boss)                 │
│  "The Old Man" — thinks, plans, decides, codes  │
│            Model: Claude Sonnet/Opus            │
└──────────┬──────────────────┬───────────────────┘
           │                  │
     ┌─────▼──────┐     ┌─────▼──────┐
     │   Codex    │     │   Gemini   │
     │  "Muscle"  │     │   "Eyes"   │
     │ Bulk work  │     │  Vision +  │
     │ Background │     │  Research  │
     └────────────┘     └────────────┘
| Agent | Codename | Best at | Weak at |
|---|---|---|---|
| Claude Code | The Boss | Planning, architecture, complex logic, multi-file edits | Can't see images, limited web search |
| Codex CLI | Muscle | Bulk refactoring 10+ files, long background jobs, parallel tests | No vision, can't browse web |
| Gemini CLI | Eyes | Reading images/PDFs/videos, web search, analyzing 1M+ token codebases | High hallucination rate, overclaims solutions |
When a task comes in, the Boss decides who handles it:
| Task | Who | Why |
|---|---|---|
| Read screenshot / PDF / video | Gemini | Only agent with vision |
| "What's trending in AI this week?" | Gemini | Real-time web search |
| Analyze a 500K-line codebase | Gemini | Handles massive context windows |
| Refactor 10+ files | Codex --full-auto | Bulk parallel work, won't get distracted |
| Run test suite for 30 minutes | Codex (background) | Long-running, no interaction needed |
| Batch rename across project | Codex --full-auto | Mechanical, parallelizable |
| Design new feature architecture | Boss (Claude) | Needs judgment and planning |
| Debug a subtle logic bug | Boss (Claude) | Needs deep reasoning |
| Write API endpoint + tests | Boss (Claude) | Multi-step, needs context |
| Scheduled monitoring | Nebula (cron) | Runs on its own schedule |
| Cross-platform integration | Nebula | Bridges external services |
| Everything else | Boss (Claude) | Default — handles 80% of work |
The human says what they want. The Boss decides who does it. The human never needs to think about which agent to use.
Human: "Fix the login bug and update the README"
Boss: (thinks) Login bug = complex reasoning → I'll do it
(thinks) README update = mechanical → dispatch to Codex
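In practice, a dispatch is just a shell command issued from the Boss's session. A minimal sketch of that second step, using the `codex exec` syntax from the setup section below (the task text is illustrative):

```bash
# Boss keeps the login bug; the mechanical README task goes to Codex.
codex exec "Update README.md to reflect the current API routes" \
  --full-auto --skip-git-repo-check
```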
| Risk | Action | Example |
|---|---|---|
| 🟢 Low | Use directly | Code formatting, README updates |
| 🟡 Medium | Spot-check | Codex refactoring, Gemini research |
| 🔴 High | Full verification | Anything touching money, database, or deployment |
Gemini hallucination rule: The more confident Gemini sounds, the more you should verify. Gemini will say "I've fixed everything!" when it hasn't even found the right file.
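A cheap way to ground-truth such claims: inspect the working tree yourself before accepting the summary. Plain git is enough:

```bash
# When Gemini says "I've fixed everything!", check what actually changed.
git diff --stat   # empty output means no files were touched at all
git diff          # read the real edits, not the agent's summary of them
```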
The Boss follows this decision tree for every action:
1. Can I decide this myself? → Just do it (rename a variable, fix a typo)
2. Is there risk? → Give options (2-3 approaches, let the human pick)
3. Does it touch money/DB/deploy? → Must ask (never proceed without confirmation)
When the Boss dispatches work, the output is structured so the Boss can immediately use it:
Dispatch to Codex:
"Refactor all API routes to use the new error handler.
Files: src/routes/*.ts
Pattern: replace try/catch with handleError wrapper
Expected: ~15 files changed, no logic changes"
Codex returns:
"Done. 14 files changed. 1 file skipped (already used handleError).
Files: [list]
Tests: all passing"
Boss verifies → merges
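Written out as the actual dispatch call (same `codex exec` syntax as the setup section below; the final "report back" line is our addition to force the structured reply):

```bash
codex exec "Refactor all API routes to use the new error handler.
Files: src/routes/*.ts
Pattern: replace try/catch with handleError wrapper
Expected: ~15 files changed, no logic changes.
Report back: files changed, files skipped, test status." \
  --full-auto --skip-git-repo-check
```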
The most powerful pattern: running 3 Claude Code terminals simultaneously.
Terminal 1 (Boss): Feature development
Terminal 2 (Worktree): Bug fixes on separate branch
Terminal 3 (Review): Code review + testing
# Terminal 1: Main development
claude code # Working on feature branch
# Terminal 2: Isolated bug fix (separate git worktree)
# Claude Code creates a worktree automatically
# Changes don't conflict with Terminal 1
# Terminal 3: QA and testing
claude code # Run tests, review code from other terminals
The human acts as the message router:
Terminal 1: "I've finished the API endpoint. Tell the reviewer."
Human: (copy-pastes summary to Terminal 3)
Terminal 3: "Reviewing... Found 2 issues. Tell the developer."
Human: (copy-pastes issues to Terminal 1)
Tip: Each terminal maintains its own context. Keep messages between them structured and concise — don't paste entire conversations.
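Terminal 2's isolation comes from a git worktree: a second checkout of the same repository on its own branch, so edits can't collide with Terminal 1. If your setup doesn't create one automatically, a minimal manual version (paths and branch names are illustrative):

```bash
# Second working directory, same repo, separate branch
git worktree add ../myproject-bugfix -b fix/login-bug
cd ../myproject-bugfix && claude code   # this becomes Terminal 2

# Clean up once the fix is merged
git worktree remove ../myproject-bugfix
```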
Here's how a typical task flows through the system:
Human: "Add a new /api/translate endpoint that supports 5 providers"
Boss (Claude Code):
1. Plans architecture (provider interface, fallback logic, pricing)
2. Writes the main endpoint + provider interface
3. Dispatches to Codex: "Implement 5 provider adapters using this interface"
4. Dispatches to Gemini: "Research current pricing for DeepL, Google, AWS Translate"
5. Integrates Codex's adapters + Gemini's pricing data
6. Writes tests
7. Commits
Time: 25 minutes (vs 2-3 hours solo coding)
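Steps 3 and 4 are fire-and-forget shell calls. A sketch using the dispatch syntax from the setup section (prompts abbreviated; the `TranslateProvider` interface name is hypothetical):

```bash
# Step 3: bulk adapter work runs in the background on Codex
codex exec "Implement 5 provider adapters against the TranslateProvider interface" \
  --full-auto --skip-git-repo-check &

# Step 4: live pricing research goes to Gemini (web search)
gemini -m gemini-3-flash -p "Current pricing for DeepL, Google Translate, AWS Translate" \
  > pricing-notes.md

wait   # Boss integrates adapters + pricing once both finish
```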
# Claude Code (the boss)
# Install: https://docs.anthropic.com/en/docs/claude-code
# Codex CLI (optional — for bulk work)
npm install -g @openai/codex
# Gemini CLI (optional — for vision + research)
npm install -g @google/gemini-cli
Add to your CLAUDE.md (project-level instructions):
## Agent Dispatch Rules
| Task | Agent |
|------|-------|
| Images/PDFs/videos, web search, large codebase analysis | Gemini |
| Bulk refactoring 10+ files, background tests, mechanical tasks | Codex --full-auto |
| Scheduled monitoring, cross-platform integration | Nebula |
| Everything else | Claude Code (self) |
## Dispatch Commands
- Codex: `codex exec "..." --full-auto --skip-git-repo-check`
- Gemini: `gemini -m gemini-3-flash -p "..."`
## Verification by Risk
- 🟢 Low → use directly
- 🟡 Medium → spot-check
- 🔴 High (money/DB/deploy) → full verification
After 6 months of multi-agent development, here's what worked:
- Parallel terminals — 3x throughput on independent tasks
- Codex for bulk — Renaming 20 files in 2 minutes instead of 20
- Gemini for research — Reading a 50-page PDF and summarizing in 30 seconds
- Clear dispatch rules — No time wasted deciding "which agent should do this?"
And what didn't:
- Gemini for code — It hallucinates function signatures and claims success when it has failed
- Codex for architecture — It follows instructions literally but can't make judgment calls
- Too many agents — 3 is the sweet spot. More than that = more coordination overhead than value
- Trusting Gemini's confidence — "I've fixed everything!" → nothing was fixed
In practice, Claude Code handles 80% of all work. Codex handles 15% (bulk tasks). Gemini handles 5% (vision + research). Don't over-engineer the orchestration — most tasks are best handled by one smart agent, not a committee.
| Project | Description |
|---|---|
| 112 Claude Code Skills | Everything we learned, extracted as reusable skills |
| AI API Benchmark | Monthly tests of 30+ AI APIs from Tokyo |
| AI Prompt Mastery | One prompt to make any AI respond like an expert |
Built at Washin Village — an animal sanctuary in Japan where 28 cats & dogs and 3 AI agents work together.