Desktop: remove client-side API keys, route STT + Gemini through backend #5393

@beastoin

Problem

The desktop app (macOS) ships vendor API keys (DEEPGRAM_API_KEY, GEMINI_API_KEY) in the app bundle's .env file and calls external APIs directly from the client:

  • Deepgram STT: TranscriptionService.swift connects directly to wss://api.deepgram.com/v1/listen with the API key in the WebSocket auth header
  • Gemini: GeminiClient.swift and EmbeddingService.swift call Google APIs with the key in URL query parameters (?key=<KEY>)

Security risks:

  • Keys are extractable from the app bundle (Contents/Resources/.env — plain text)
  • Keys are visible in network traffic (auth headers, URL params)
  • No per-user attribution, rate limiting, or revocation granularity
  • Blast radius = full vendor account billing

Architectural inconsistency:

  • Mobile app routes ALL audio through the Python backend's /v4/listen WebSocket — API keys stay server-side
  • Desktop app bypasses the backend entirely for STT — keys ship in the client
  • Desktop misses backend features: VAD gate (~75% Deepgram cost savings), speech profiles, speaker identification, unified billing/monitoring

Proposed Solution

Phase 1: Route desktop STT through /v4/listen

The Python backend already has a fully-featured /v4/listen WebSocket endpoint with Firebase auth, used by all mobile clients. Desktop should use it too.

Swift changes:

  • Replace direct Deepgram WebSocket connection in TranscriptionService.swift with a WebSocket connection to the backend's /v4/listen (or /v4/web/listen which supports first-message token auth)
  • Remove DEEPGRAM_API_KEY from client-side .env
  • Desktop gets VAD gate, speech profiles, speaker ID for free
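
To make the handshake shape concrete, here is a minimal sketch of what the client would send instead of a Deepgram key. It is written in Python for brevity rather than Swift. The query parameter names (`sample_rate`, `codec`, `source`), the codec value `pcm16`, and the `{"type": "auth", "token": ...}` message layout are assumptions for illustration, not the confirmed contract of /v4/web/listen.

```python
import json
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_listen_url(base_url, sample_rate=16000, codec="pcm16", source="desktop"):
    """Build the backend WebSocket URL. No vendor key or user token goes in the URL.

    Parameter names and values are hypothetical; check backend/routers/transcribe.py.
    """
    parts = urlsplit(base_url)
    # Map http(s) to ws(s); leave ws/wss untouched.
    scheme = {"https": "wss", "http": "ws"}.get(parts.scheme, parts.scheme)
    query = urlencode({"sample_rate": sample_rate, "codec": codec, "source": source})
    return urlunsplit((scheme, parts.netloc, "/v4/web/listen", query, ""))


def first_auth_message(id_token):
    """First frame sent after connecting: the user's Firebase ID token.

    The message layout is an assumption, not the backend's documented format.
    """
    return json.dumps({"type": "auth", "token": id_token})
```

The point of the sketch is that the only credential leaving the client is a per-user token sent inside the socket, so it can be attributed and revoked per user, unlike a shared vendor key.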

Backend changes:

  • May need minor adjustments to handle desktop audio format (16kHz stereo PCM vs mobile's opus/pcm8)
  • Add source=desktop parameter for monitoring/billing segmentation
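
One possible adjustment for the audio format, assuming the server-side STT path expects mono 16-bit PCM, is to downmix the desktop's interleaved stereo frames before forwarding them. The function name is hypothetical, and whether the downmix belongs on the backend or in the client is an open design choice.

```python
import sys
from array import array


def downmix_stereo_pcm16(frame):
    """Average interleaved L/R 16-bit little-endian samples into mono PCM16."""
    if len(frame) % 4 != 0:
        raise ValueError("stereo PCM16 data must be a multiple of 4 bytes")
    samples = array("h")  # signed 16-bit integers
    samples.frombytes(frame)
    if sys.byteorder == "big":
        samples.byteswap()  # input is little-endian on the wire
    mono = array(
        "h",
        ((samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples), 2)),
    )
    if sys.byteorder == "big":
        mono.byteswap()
    return mono.tobytes()
```

Averaging the two channels halves the bandwidth sent to the STT provider and cannot overflow, since the mean of two 16-bit values always fits in 16 bits.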

Phase 2: Route Gemini through backend endpoints

  • Add backend API endpoints for the proactive assistant operations currently calling Gemini directly (embeddings, generation)
  • Remove GEMINI_API_KEY from client-side .env
  • Enables server-side rate limiting, cost tracking, prompt governance
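
As an illustration of the server-side rate limiting this enables, here is a minimal in-memory token bucket keyed by user ID. The class name and limits are hypothetical, and a real deployment with multiple backend instances would likely need shared state (for example a cache or database) rather than a per-process dict.

```python
import time
from typing import Dict, Optional, Tuple


class PerUserRateLimiter:
    """Token bucket per user: `capacity` burst, refilled at `refill_per_sec`."""

    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = float(capacity)
        self.refill_per_sec = refill_per_sec
        # uid -> (tokens remaining, timestamp of last update)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, uid: str, now: Optional[float] = None) -> bool:
        """Return True and consume one token if the user is under the limit."""
        now = time.monotonic() if now is None else now
        tokens, last = self._buckets.get(uid, (self.capacity, now))
        # Refill for the elapsed time, capped at the bucket capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_sec)
        if tokens < 1.0:
            self._buckets[uid] = (tokens, now)
            return False
        self._buckets[uid] = (tokens - 1.0, now)
        return True
```

A Gemini proxy endpoint could call `allow(uid)` with the authenticated user ID before forwarding a request, and return HTTP 429 when it is False.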

Phase 3: Decommission direct API paths

  • Remove direct Deepgram/Gemini code paths from desktop app
  • Remove .env bundling of vendor keys from build pipeline
  • Add CI check to block shipping vendor API keys in app bundles
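
The CI check could be as simple as scanning the built app bundle for non-empty assignments to the two key names mentioned in this issue. The sketch below is one way to do that; the function names are hypothetical, and matching only on variable names is an assumption (a stricter check might also match vendor-specific key formats).

```python
import os
import re

# Matches e.g. "DEEPGRAM_API_KEY=abc123" but not an empty "DEEPGRAM_API_KEY=".
FORBIDDEN = re.compile(rb"\b(DEEPGRAM_API_KEY|GEMINI_API_KEY)[ \t]*=[ \t]*\S+")


def find_bundled_keys(bundle_root):
    """Return (file path, key name) pairs for every vendor key found under bundle_root."""
    findings = []
    for dirpath, _dirnames, filenames in os.walk(bundle_root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                data = handle.read()
            for match in FORBIDDEN.finditer(data):
                findings.append((path, match.group(1).decode()))
    return findings


def main(argv):
    """CI entry point: exit code 1 if any vendor key is bundled, else 0."""
    hits = find_bundled_keys(argv[1])
    for path, key_name in hits:
        print(f"{path}: bundled value for {key_name}")
    return 1 if hits else 0
```

In a pipeline this would run against the built .app directory after packaging, for example via `sys.exit(main(sys.argv))`, so a stray Contents/Resources/.env fails the build.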

Benefits

| | Current (direct) | Proposed (backend proxy) |
|---|---|---|
| API key exposure | Client-side, extractable | Server-side only |
| Cost visibility | Invisible to backend | Unified monitoring |
| VAD gate savings | Not available | ~75% Deepgram cost reduction |
| Speech profiles | Not available | Speaker identification |
| Rate limiting | None | Per-user/device/session |
| Key rotation | Requires app update | Server-side, instant |
| Provider flexibility | Hardcoded Deepgram | Backend can switch STT providers |

Latency Consideration

Adding a backend hop puts one extra network leg between the client and the STT provider. With persistent WebSocket connections and region colocation, the increase should be modest relative to STT model inference and endpointing delays. It can be further mitigated with dedicated streaming workers and autoscaling, the same infrastructure mobile already uses.

References

  • desktop/Desktop/Sources/TranscriptionService.swift — direct Deepgram connection
  • desktop/Desktop/Sources/ProactiveAssistants/Core/GeminiClient.swift — direct Gemini calls
  • backend/routers/transcribe.py — existing /v4/listen endpoint
  • backend/utils/stt/streaming.py — server-side STT providers
  • backend/utils/stt/vad_gate.py — VAD gate (active on mobile)

by AI for @beastoin
