Desktop: move proactive AI to /v4/listen, remove GEMINI_API_KEY #5396

@beastoin

Description

Problem

The desktop macOS app bundles GEMINI_API_KEY in a plain-text .env file and calls the Google Gemini API directly from the client for all proactive AI features:

  • GeminiClient.swift (1,450 lines) — 9 callers across ProactiveAssistants + LiveNotes. Calls generativelanguage.googleapis.com/v1beta/models/{model}:generateContent?key=<KEY>. Uses structured JSON output, tool-calling loops, image+text, streaming SSE.
  • EmbeddingService.swift (315 lines) — Calls embedContent and batchEmbedContents with key in URL. Used by OCREmbeddingService + TaskAssistant.
  • Local SQLite stores all results (tasks, memories, focus sessions, embeddings) — should use Firestore/Pinecone like mobile.

Security risks: Same as #5393 Phase 1 (extractable keys, no per-user attribution, blast radius = full vendor billing).

Architectural inconsistency: Mobile routes ALL AI through backend. Desktop bypasses backend entirely, duplicating server-side capabilities that already exist in production.

Proposed Solution

Extend /v4/listen WebSocket to handle desktop's proactive AI needs. Desktop becomes a thin client — same pattern as mobile.

Why /v4/listen (not new endpoints)

New WebSocket Message Types

Client → Server:

| Message Type | Purpose | Payload |
|---|---|---|
| `screen_frame` | Screenshot for analysis | `{frame_id, image_b64, app_name, window_title, ocr_text?, analyze: ["focus","tasks","memories","advice"]}` |
| `live_notes_text` | Transcript → note | `{text, session_context}` |
| `profile_request` | Generate user profile | `{}` |
| `task_rerank` | Re-prioritize tasks | `{}` |
| `task_dedup` | Deduplicate tasks | `{}` |

Server → Client:

| Message Type | Purpose | Payload |
|---|---|---|
| `focus_result` | Focus detection | `{frame_id, status, app_or_site, description, message}` |
| `tasks_extracted` | Tasks from screenshot | `{frame_id, tasks: [{id, description, priority, confidence, source_app, due_at}]}` |
| `memories_extracted` | Memories from screenshot | `{frame_id, memories: [{id, content, category, confidence}]}` |
| `advice_extracted` | Proactive advice | `{frame_id, advice: {id, content, category, confidence}}` |
| `live_note` | Generated note | `{text}` |
| `profile_updated` | User profile | `{profile_text}` |
| `rerank_complete` | Tasks re-ranked | `{updated_tasks: [{id, new_position}]}` |
| `dedup_complete` | Duplicates removed | `{deleted_ids, reason}` |
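Routing these server → client messages can be a flat table lookup on the message type rather than a growing if/elif chain. A minimal sketch, assuming each frame is JSON text with a `type` field (the envelope field is an assumption, as above):

```python
import json
from typing import Any, Callable


def dispatch(raw: str, handlers: dict[str, Callable[[dict], Any]]) -> Any:
    """Route an incoming WebSocket text frame to its handler by message type.

    `handlers` maps a message-type string (e.g. "focus_result") to a
    callable that receives the decoded payload dict.
    """
    msg = json.loads(raw)
    handler = handlers.get(msg.get("type"))
    if handler is None:
        raise ValueError(f"unknown message type: {msg.get('type')!r}")
    return handler(msg)
```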

Storage Migration

| Desktop SQLite | Cloud Storage | Status |
|---|---|---|
| action_items | users/{uid}/action_items (Firestore) | EXISTS |
| memories (incl. advice) | users/{uid}/memories (Firestore) | EXISTS |
| conversations | users/{uid}/conversations (Firestore) | EXISTS |
| goals | users/{uid}/goals (Firestore) | EXISTS |
| focus_sessions | users/{uid}/focus_sessions (Firestore) | NEW |
| action_items.embedding | Pinecone vectors | REUSE existing infra |
| screenshots.embedding | Pinecone ns3 | REUSE (already syncs) |
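The migration for the one NEW collection is mostly a row-to-document mapping. A sketch of that mapping for `focus_sessions`; the column and field names here are illustrative, since the issue does not specify the SQLite schema:

```python
def focus_session_to_doc(row: dict) -> tuple[str, dict]:
    """Map a desktop SQLite focus_sessions row to a Firestore document
    destined for users/{uid}/focus_sessions.

    The collection path comes from the storage table above; the column
    names (id, started_at, ...) are hypothetical placeholders.
    """
    doc_id = str(row["id"])
    doc = {
        "started_at": row["started_at"],
        "ended_at": row.get("ended_at"),   # None while the session is open
        "app_or_site": row["app_or_site"],
        "status": row["status"],
    }
    return doc_id, doc
```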

Backend Reuse

| Desktop Feature | Backend Equivalent (PRODUCTION) |
|---|---|
| Memory extraction | `new_memories_extractor()` in utils/llm/memories.py |
| Action item extraction + dedup | `extract_action_items()` in utils/llm/conversation_processing.py |
| Goal progress detection | `extract_and_update_goal_progress()` in utils/llm/goals.py |
| User profile | Persona generation in utils/llm/persona.py |
| Data protection | AES-256-GCM encryption in utils/encryption.py |
| Vector search | Pinecone via database/vector_db.py |

New backend work: Vision LLM handlers for screenshot analysis (focus, task extraction, memory extraction, advice).

Subtasks

Backend (Python)

  • Add message dispatcher for new types in _stream_handler() (transcribe.py)
  • Implement handle_screen_frame() — routes to analysis handlers in parallel
  • Implement focus analysis (vision LLM → focus_result)
  • Implement task extraction (vision LLM + Firestore dedup + Pinecone similarity → tasks_extracted)
  • Implement memory extraction from screenshots (vision LLM → memories_extracted)
  • Implement advice extraction (vision LLM → advice_extracted)
  • Implement live notes handler (text LLM → live_note)
  • Implement task re-ranking handler (Firestore fetch + LLM → rerank_complete)
  • Implement task dedup handler (Firestore + Pinecone + LLM → dedup_complete)
  • Implement profile generation handler (multi-source fetch + LLM → profile_updated)
  • Add focus_sessions Firestore collection with data protection decorators
  • Add frame_id + idempotency for duplicate frame handling
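The first two backend subtasks ("routes to analysis handlers in parallel" plus "frame_id + idempotency") can be sketched together. This is illustrative, not the actual `_stream_handler()` code; the in-memory `seen_frames` set stands in for whatever dedup store the implementation chooses:

```python
import asyncio
from typing import Awaitable, Callable

Analyzer = Callable[[dict], Awaitable[dict]]


async def handle_screen_frame(msg: dict,
                              analyzers: dict[str, Analyzer],
                              seen_frames: set[str]) -> list[dict]:
    """Fan a screen_frame out to the requested analyzers in parallel.

    Duplicate frame_ids are skipped (idempotency), so a client retry
    after a dropped connection does not re-run the vision LLMs.
    """
    frame_id = msg["frame_id"]
    if frame_id in seen_frames:
        return []
    seen_frames.add(frame_id)
    tasks = [analyzers[name](msg)
             for name in msg.get("analyze", []) if name in analyzers]
    return await asyncio.gather(*tasks)
```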

Desktop (Swift)

  • Add sendJSON() method to BackendTranscriptionService for text messages
  • Add response handlers for all new server→client message types
  • Rewrite 9 assistants as thin WebSocket message senders
  • Remove GeminiClient.swift (1,450 lines)
  • Remove EmbeddingService.swift (315 lines)
  • Remove GEMINI_API_KEY from .env.example and loadEnvironment()
  • Replace local SQLite reads with Firestore-cached data where applicable

Testing

  • End-to-end test per analysis type (focus, tasks, memories, advice, notes)
  • Latency benchmarks (focus detection target: <3s including network hop)
  • Load test screenshot bandwidth (adaptive quality/cadence)

Codex Review Summary

Scores: Correctness 6/10, Simplicity 3/10, Completeness 5/10

Key gaps to address during implementation:

  1. Protocol versioning and typed schemas per message type
  2. Backpressure — audio and vision on same WS need priority lanes
  3. Bandwidth strategy — adaptive screenshot quality/cadence, skip unchanged context
  4. Failure modes — partial outages, retries, idempotency
  5. Monolith risk — refactor transcribe.py into message dispatcher + per-capability handlers
  6. Local cache for offline mode (SQLite stays as cache, Firestore is source of truth)
  7. Privacy controls — PII/sensitive window filtering, user consent for screenshot upload
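Gap 2 (backpressure) reduces to a scheduling decision on the shared socket. A toy sketch of the priority-lane idea, with plain lists standing in for the real send queues: audio frames are always drained before vision frames so screenshots can never starve transcription:

```python
def drain_in_priority(audio: list, vision: list, budget: int) -> list:
    """Select up to `budget` outbound frames, always preferring the
    audio lane over the vision lane.

    Illustrative only; a real implementation would use per-lane
    asyncio queues behind the single WebSocket writer.
    """
    out = []
    while len(out) < budget and (audio or vision):
        out.append(audio.pop(0) if audio else vision.pop(0))
    return out
```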

