Problem
The desktop macOS app bundles `GEMINI_API_KEY` in a plain-text `.env` and calls Google Gemini APIs directly from the client for all proactive AI features:
- `GeminiClient.swift` (1,450 lines) — 9 callers across ProactiveAssistants + LiveNotes. Calls `generativelanguage.googleapis.com/v1beta/models/{model}:generateContent?key=<KEY>`. Uses structured JSON output, tool-calling loops, image+text, streaming SSE.
- `EmbeddingService.swift` (315 lines) — calls `embedContent` and `batchEmbedContents` with the key in the URL. Used by OCREmbeddingService + TaskAssistant.
- Local SQLite stores all results (tasks, memories, focus sessions, embeddings) — should use Firestore/Pinecone like mobile.
Security risks: Same as #5393 Phase 1 (extractable keys, no per-user attribution, blast radius = full vendor billing).
Architectural inconsistency: Mobile routes ALL AI through backend. Desktop bypasses backend entirely, duplicating server-side capabilities that already exist in production.
Proposed Solution
Extend the `/v4/listen` WebSocket to handle desktop's proactive AI needs. Desktop becomes a thin client — same pattern as mobile.
Why /v4/listen (not new endpoints)
- Desktop already connects to `/v4/listen` for STT (PR #5395: Desktop: route STT through backend /v4/listen, remove DEEPGRAM_API_KEY)
- Auth, heartbeat, reconnection, and data protection are already handled
- Image chunk protocol already exists
- Bidirectional — backend can push results back asynchronously
- Single connection for all desktop↔backend communication
New WebSocket Message Types
Client → Server:
| Message Type | Purpose | Payload |
|---|---|---|
| `screen_frame` | Screenshot for analysis | `{frame_id, image_b64, app_name, window_title, ocr_text?, analyze: ["focus","tasks","memories","advice"]}` |
| `live_notes_text` | Transcript → note | `{text, session_context}` |
| `profile_request` | Generate user profile | `{}` |
| `task_rerank` | Re-prioritize tasks | `{}` |
| `task_dedup` | Deduplicate tasks | `{}` |
Server → Client:
| Message Type | Purpose | Payload |
|---|---|---|
| `focus_result` | Focus detection | `{frame_id, status, app_or_site, description, message}` |
| `tasks_extracted` | Tasks from screenshot | `{frame_id, tasks: [{id, description, priority, confidence, source_app, due_at}]}` |
| `memories_extracted` | Memories from screenshot | `{frame_id, memories: [{id, content, category, confidence}]}` |
| `advice_extracted` | Proactive advice | `{frame_id, advice: {id, content, category, confidence}}` |
| `live_note` | Generated note | `{text}` |
| `profile_updated` | User profile | `{profile_text}` |
| `rerank_complete` | Tasks re-ranked | `{updated_tasks: [{id, new_position}]}` |
| `dedup_complete` | Duplicates removed | `{deleted_ids, reason}` |
Storage Migration
| Desktop SQLite | → Cloud Storage | Status |
|---|---|---|
| `action_items` | `users/{uid}/action_items` (Firestore) | EXISTS |
| `memories` (incl. advice) | `users/{uid}/memories` (Firestore) | EXISTS |
| `conversations` | `users/{uid}/conversations` (Firestore) | EXISTS |
| `goals` | `users/{uid}/goals` (Firestore) | EXISTS |
| `focus_sessions` | `users/{uid}/focus_sessions` (Firestore) | NEW |
| `action_items.embedding` | Pinecone vectors | REUSE existing infra |
| `screenshots.embedding` | Pinecone ns3 | REUSE (already syncs) |
Backend Reuse
| Desktop Feature | Backend Equivalent (PRODUCTION) |
|---|---|
| Memory extraction | `new_memories_extractor()` in `utils/llm/memories.py` |
| Action item extraction + dedup | `extract_action_items()` in `utils/llm/conversation_processing.py` |
| Goal progress detection | `extract_and_update_goal_progress()` in `utils/llm/goals.py` |
| User profile | Persona generation in `utils/llm/persona.py` |
| Data protection | AES-256-GCM encryption in `utils/encryption.py` |
| Vector search | Pinecone via `database/vector_db.py` |
New backend work: Vision LLM handlers for screenshot analysis (focus, task extraction, memory extraction, advice).
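One of these vision handlers could be sketched as follows. This is an assumption-laden outline, not the real implementation: the vision-LLM call is abstracted behind an injected async callable (since no concrete model client is specified here), and the output fields follow the `focus_result` schema in the table above.

```python
import asyncio
import base64
from typing import Awaitable, Callable

# Hypothetical signature: takes image bytes + a prompt, returns a parsed dict.
VisionLLM = Callable[[bytes, str], Awaitable[dict]]

FOCUS_PROMPT = (
    "Given this screenshot, decide whether the user is focused or distracted. "
    'Return JSON: {"status": ..., "app_or_site": ..., "description": ..., "message": ...}'
)


async def handle_focus(frame: dict, vision_llm: VisionLLM) -> dict:
    """Analyze one screen_frame and build a focus_result message."""
    image = base64.b64decode(frame["image_b64"])
    analysis = await vision_llm(image, FOCUS_PROMPT)
    return {
        "type": "focus_result",
        "frame_id": frame["frame_id"],  # echoed so the client can correlate
        **analysis,
    }
```

Injecting the LLM callable keeps the handler unit-testable with a stub, which matters given the end-to-end tests listed under Testing below.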
Subtasks
Backend (Python)
- Add message dispatcher for new types in `_stream_handler()` (transcribe.py)
- Implement `handle_screen_frame()` — routes to analysis handlers in parallel
- Implement focus analysis (vision LLM → `focus_result`)
- Implement task extraction (vision LLM + Firestore dedup + Pinecone similarity → `tasks_extracted`)
- Implement memory extraction from screenshots (vision LLM → `memories_extracted`)
- Implement advice extraction (vision LLM → `advice_extracted`)
- Implement live notes handler (text LLM → `live_note`)
- Implement task re-ranking handler (Firestore fetch + LLM → `rerank_complete`)
- Implement task dedup handler (Firestore + Pinecone + LLM → `dedup_complete`)
- Implement profile generation handler (multi-source fetch + LLM → `profile_updated`)
- Add `focus_sessions` Firestore collection with data protection decorators
- Add `frame_id` + idempotency for duplicate frame handling
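The dispatcher-plus-handlers split above could look roughly like this. Handler names and the registry shape are illustrative, not actual transcribe.py code, and an in-memory seen-set stands in for a real `frame_id` idempotency store:

```python
import asyncio
from typing import Awaitable, Callable

Handler = Callable[[dict], Awaitable[list]]  # one request -> zero or more result messages


class MessageDispatcher:
    """Route incoming WebSocket JSON messages to per-capability handlers."""

    def __init__(self) -> None:
        self._handlers: dict[str, Handler] = {}
        self._seen_frames: set[str] = set()  # idempotency: drop duplicate frames

    def register(self, msg_type: str, handler: Handler) -> None:
        self._handlers[msg_type] = handler

    async def dispatch(self, message: dict) -> list:
        frame_id = message.get("frame_id")
        if frame_id is not None:
            if frame_id in self._seen_frames:
                return []  # duplicate delivery, already processed
            self._seen_frames.add(frame_id)
        handler = self._handlers.get(message.get("type", ""))
        if handler is None:
            return [{"type": "error", "reason": f"unknown type {message.get('type')}"}]
        return await handler(message)


dispatcher = MessageDispatcher()


async def handle_task_rerank(msg: dict) -> list:
    # Placeholder: a real handler would fetch tasks from Firestore and call the LLM.
    return [{"type": "rerank_complete", "updated_tasks": []}]


dispatcher.register("task_rerank", handle_task_rerank)
```

Keeping one handler per capability is also what addresses the "monolith risk" flagged in the review summary below: each capability registers itself rather than growing `_stream_handler()`.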
Desktop (Swift)
- Add `sendJSON()` method to BackendTranscriptionService for text messages
- Add response handlers for all new server→client message types
- Rewrite 9 assistants as thin WebSocket message senders
- Remove `GeminiClient.swift` (1,450 lines)
- Remove `EmbeddingService.swift` (315 lines)
- Remove `GEMINI_API_KEY` from `.env.example` and `loadEnvironment()`
- Replace local SQLite reads with Firestore-cached data where applicable
Testing
- End-to-end test per analysis type (focus, tasks, memories, advice, notes)
- Latency benchmarks (focus detection target: <3s including network hop)
- Load test screenshot bandwidth (adaptive quality/cadence)
Codex Review Summary
Scores: Correctness 6/10, Simplicity 3/10, Completeness 5/10
Key gaps to address during implementation:
- Protocol versioning and typed schemas per message type
- Backpressure — audio and vision on same WS need priority lanes
- Bandwidth strategy — adaptive screenshot quality/cadence, skip unchanged context
- Failure modes — partial outages, retries, idempotency
- Monolith risk — refactor transcribe.py into message dispatcher + per-capability handlers
- Local cache for offline mode (SQLite stays as cache, Firestore is source of truth)
- Privacy controls — PII/sensitive window filtering, user consent for screenshot upload
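For the backpressure gap, one common pattern is two outbound lanes over the single WebSocket, where audio always wins and stale screenshots are dropped rather than queued. A minimal sketch (queue sizes and the drop policy are assumptions, not a specified design):

```python
import asyncio


class PriorityLanes:
    """Two outbound lanes over one WebSocket: audio first, vision best-effort."""

    def __init__(self, max_vision_backlog: int = 2) -> None:
        self.audio: asyncio.Queue = asyncio.Queue()
        self.vision: asyncio.Queue = asyncio.Queue(maxsize=max_vision_backlog)

    def submit_audio(self, chunk: bytes) -> None:
        self.audio.put_nowait(chunk)  # STT audio is never dropped

    def submit_frame(self, frame: dict) -> bool:
        """Queue a screenshot; drop it if the vision lane is already backed up."""
        try:
            self.vision.put_nowait(frame)
            return True
        except asyncio.QueueFull:
            return False  # stale frame dropped; a fresher one will follow

    async def next_outgoing(self):
        """Always drain audio before vision so STT latency never pays for screenshots."""
        while True:
            if not self.audio.empty():
                return self.audio.get_nowait()
            if not self.vision.empty():
                return self.vision.get_nowait()
            await asyncio.sleep(0.005)  # nothing pending; poll briefly
```

Dropping frames at the sender doubles as the adaptive-cadence strategy: under load the effective screenshot rate falls automatically instead of queueing ever-staler frames.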
References
- Issue #5393: Desktop: remove client-side API keys, route STT + Gemini through backend (Phase 1: STT migration, PR #5395: Desktop: route STT through backend /v4/listen, remove DEEPGRAM_API_KEY)
- `desktop/Desktop/Sources/ProactiveAssistants/Core/GeminiClient.swift`
- `desktop/Desktop/Sources/ProactiveAssistants/Services/EmbeddingService.swift`
- `backend/routers/transcribe.py` — `/v4/listen` handler
- `backend/utils/llm/` — existing LLM infrastructure
- `backend/utils/encryption.py` — data protection
by AI for @beastoin