Skip to content

feat: Switch AI benchmark to google.genai Batch API for 50% cost reduction #1

@sangicook

Description

@sangicook

Context

The AI benchmark system (run_live_benchmark.py + consult_ai_gaps.py) currently uses OpenRouter as a proxy to access Gemini Flash. For overnight/batch runs, we should switch to Google's native google.genai package which offers:

  • Batch API (client.batches.create()) with 50% cost reduction for non-interactive workloads
  • Native structured output via response_schema parameter (replaces JSON fence stripping)
  • Direct API access without proxy latency

Proposed Changes

  1. Add google-genai to [project.optional-dependencies.ai] in pyproject.toml
  2. Add make_google_caller() factory in consult_ai_gaps.py alongside existing make_openrouter_caller()
  3. Support GOOGLE_API_KEY environment variable for authentication
  4. Implement batch mode for overnight runs:
    • Collect all prompts upfront
    • Submit as a single batch via client.batches.create()
    • Poll for completion
    • Parse results
  5. Keep OpenRouter as fallback when GOOGLE_API_KEY is not set
  6. Update MODEL_REGISTRY to include native Gemini model IDs

Batch API Usage Pattern

from google import genai

client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])

# Submit batch
batch = client.batches.create(
    model="gemini-2.5-flash",
    requests=[
        genai.types.BatchRequest(
            custom_id=gap_key,
            request=genai.types.GenerateContentRequest(
                contents=prompt,
                config=genai.types.GenerateContentConfig(
                    response_schema=TypedActionSchema,
                ),
            ),
        )
        for gap_key, prompt in gap_prompts
    ],
)

# Poll for completion
while batch.state == "PENDING":
    time.sleep(30)
    batch = client.batches.get(name=batch.name)

# Parse results
for result in client.batches.list_results(name=batch.name):
    responses[result.custom_id] = result.response.text

Benefits

  • ~50% cost reduction on batch workloads
  • Native structured output (no JSON parsing errors)
  • Lower latency (no proxy hop)
  • Better rate limit handling (Google's native quotas)

Notes

  • OpenRouter caller remains for interactive/debugging use
  • backend field on BenchmarkConfig already supports documenting which mode was used

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions