578 changes: 578 additions & 0 deletions .agents/ARCHITECTURE.md

Large diffs are not rendered by default.

857 changes: 857 additions & 0 deletions .agents/ARCHITECTURE_INTEGRATION_OVERVIEW.md

Large diffs are not rendered by default.

631 changes: 631 additions & 0 deletions .agents/FALLBACK_STRATEGIES.md

Large diffs are not rendered by default.

613 changes: 613 additions & 0 deletions .agents/GAPS_ANALYSIS.md

Large diffs are not rendered by default.

436 changes: 436 additions & 0 deletions .agents/IMPLEMENTATION_PLAN_WITH_TESTS.md

Large diffs are not rendered by default.

598 changes: 598 additions & 0 deletions .agents/IMPLEMENTATION_ROADMAP.md

Large diffs are not rendered by default.

698 changes: 698 additions & 0 deletions .agents/OPTIMAL_WEBCHAT2API_ARCHITECTURE.md

Large diffs are not rendered by default.

1,820 changes: 1,820 additions & 0 deletions .agents/RELEVANT_REPOS.md

Large diffs are not rendered by default.

396 changes: 396 additions & 0 deletions .agents/REQUIREMENTS.md
@@ -0,0 +1,396 @@
# Universal Dynamic Web Chat Automation Framework - Requirements

## 🎯 **Core Mission**

Build a **vision-driven, fully dynamic web chat automation gateway** that can:
- Work with ANY web chat interface (existing and future)
- Auto-discover UI elements using multimodal AI
- Detect and adapt to different response streaming methods
- Provide OpenAI-compatible API for universal integration
- Cache discoveries for performance while maintaining adaptability

---

## 📋 **Functional Requirements**

### **FR1: Universal Provider Support**

**FR1.1: Dynamic Provider Registration**
- Accept URL + optional credentials (email/password)
- Automatically navigate to chat interface
- No hardcoded provider-specific logic
- Support for both authenticated and unauthenticated chats

**FR1.2: Target Providers (Examples, Not Exhaustive)**
- ✅ Z.AI (https://chat.z.ai)
- ✅ ChatGPT (https://chat.openai.com)
- ✅ Claude (https://claude.ai)
- ✅ Mistral (https://chat.mistral.ai)
- ✅ DeepSeek (https://chat.deepseek.com)
- ✅ Gemini (https://gemini.google.com)
- ✅ AI Studio (https://aistudio.google.com)
- ✅ Qwen (https://qwen.ai)
- ✅ Any future chat interface

**FR1.3: Provider Lifecycle**
```
1. Registration → 2. Discovery → 3. Validation → 4. Caching → 5. Active Use
```

---

### **FR2: Vision-Based UI Discovery**

**FR2.1: Element Detection**
Using GLM-4.5V or a compatible vision model, automatically detect:

**Primary Elements (Required):**
- Chat input field (textarea, contenteditable, input)
- Submit button (send, enter, arrow icon)
- Response area (message container, output div)
- New chat button (start new conversation)

**Secondary Elements (Optional):**
- Model selector dropdown
- Temperature/parameter controls
- System prompt input
- File upload button
- Image generation controls
- Plugin/skill/MCP selectors
- Settings panel

**Tertiary Elements (Advanced):**
- File tree structure (AI Studio example)
- Code editor contents
- Chat history sidebar
- Context window indicator
- Token counter
- Export/share buttons

**FR2.2: CAPTCHA Handling**
- Automatic detection of CAPTCHA challenges
- Integration with 2Captcha API for solving
- Support for: reCAPTCHA v2/v3, hCaptcha, Cloudflare Turnstile
- Fallback: Pause and log for manual intervention

**FR2.3: Login Flow Automation**
- Vision-based detection of login forms
- Email/password field identification
- OAuth button detection (Google, GitHub, etc.)
- 2FA/MFA handling (pause and wait for code)
- Session cookie persistence

---

### **FR3: Response Capture & Streaming**

**FR3.1: Auto-Detect Streaming Method**

Analyze network traffic and DOM to detect:

**Method A: Server-Sent Events (SSE)**
- Monitor for `text/event-stream` content-type
- Intercept SSE connections
- Parse `data:` fields and detect `[DONE]` markers
- Example: ChatGPT, many OpenAI-compatible APIs

**Method B: WebSocket**
- Detect WebSocket upgrade requests
- Intercept `ws://` or `wss://` connections
- Capture bidirectional messages
- Example: Claude, some real-time chats

**Method C: XHR Polling**
- Monitor repeated XHR requests to same endpoint
- Detect polling patterns (intervals)
- Aggregate responses
- Example: Older chat interfaces

**Method D: DOM Mutation Observation**
- Set up MutationObserver on response container
- Detect text node additions/changes
- Fallback for client-side rendering
- Example: SPA frameworks with no network streams

**Method E: Hybrid Detection**
- Use multiple methods simultaneously
- Choose most reliable signal
- Graceful degradation
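
One way the detection order above might be expressed is a simple priority check over observed signals. This is only a sketch: the `TrafficSample` fields, the `polling_threshold` default, and the returned labels are hypothetical names chosen for illustration, and a real implementation would populate them from the browser automation layer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrafficSample:
    """Hypothetical summary of what was observed while a reply rendered."""
    content_type: str = ""        # Content-Type of the main response
    upgrade_header: str = ""      # value of the Upgrade request header, if any
    repeated_xhr_count: int = 0   # XHR requests seen against the same endpoint
    dom_mutations: int = 0        # mutations seen on the response container


def detect_streaming_method(sample: TrafficSample,
                            polling_threshold: int = 3) -> Optional[str]:
    """Return the first matching method, checked in the order A to D."""
    if sample.content_type.lower().startswith("text/event-stream"):
        return "sse"
    if sample.upgrade_header.lower() == "websocket":
        return "websocket"
    if sample.repeated_xhr_count >= polling_threshold:
        return "xhr_polling"
    if sample.dom_mutations > 0:
        return "dom_mutation"
    return None  # nothing detected; the caller decides how to degrade
```

Checking network signals before DOM mutations reflects the assumption that a network stream, when present, is a more reliable source than rendered text.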

**FR3.2: Streaming Response Assembly**
- Capture partial responses as they arrive
- Detect completion signals:
  - `[DONE]` marker (SSE)
  - Connection close (WebSocket)
  - Button re-enable (DOM)
  - Typing indicator disappear (visual)
- Handle incomplete chunks (buffer and reassemble)
- Deduplicate overlapping content
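
The buffering and deduplication steps in FR3.2 could look roughly like the sketch below for the SSE case. `SSEAssembler` and `merge_overlap` are illustrative names, not part of any existing codebase, and the parser handles only `data:` fields separated by blank lines.

```python
from typing import List


class SSEAssembler:
    """Buffers raw SSE text and yields complete `data:` payloads."""

    def __init__(self) -> None:
        self._buffer = ""
        self.payloads: List[str] = []
        self.done = False

    def feed(self, chunk: str) -> None:
        """Accept a network chunk that may end in the middle of an event."""
        if self.done:
            return
        self._buffer = (self._buffer + chunk).replace("\r\n", "\n")
        # A blank line terminates each event; anything after it stays buffered.
        while "\n\n" in self._buffer:
            event, self._buffer = self._buffer.split("\n\n", 1)
            data_lines = []
            for line in event.split("\n"):
                if line.startswith("data:"):
                    value = line[5:]
                    # A single leading space after the colon is not content.
                    data_lines.append(value[1:] if value.startswith(" ") else value)
            if not data_lines:
                continue
            data = "\n".join(data_lines)
            if data == "[DONE]":
                self.done = True
                self._buffer = ""
                return
            self.payloads.append(data)


def merge_overlap(existing: str, incoming: str) -> str:
    """Append `incoming`, dropping the longest prefix that repeats the tail of `existing`."""
    for size in range(min(len(existing), len(incoming)), 0, -1):
        if existing.endswith(incoming[:size]):
            return existing + incoming[size:]
    return existing + incoming
```

`merge_overlap` is aimed at sources such as DOM snapshots, where successive reads may repeat text that was already captured.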

---

### **FR4: Selector Caching & Stability**

**FR4.1: Selector Storage**
```json
{
  "domain": "chat.z.ai",
  "discovered_at": "2024-12-05T20:00:00Z",
  "last_validated": "2024-12-05T21:30:00Z",
  "validation_count": 150,
  "failure_count": 2,
  "stability_score": 0.987,
  "selectors": {
    "input": {
      "css": "textarea[data-testid='chat-input']",
      "xpath": "//textarea[@placeholder='Message']",
      "stability": 0.95,
      "fallbacks": ["textarea.chat-input", "#message-input"]
    },
    "submit": {
      "css": "button[aria-label='Send message']",
      "xpath": "//button[contains(@class, 'send')]",
      "stability": 0.90,
      "fallbacks": ["button[type='submit']"]
    }
  }
}
```
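
A cached entry in this shape might be consumed as sketched below. The lookup order (CSS, then XPath, then the listed fallbacks) is an assumption, and `exists` is a hypothetical callback standing in for whatever the browser layer uses to test a selector against the live page.

```python
from typing import Callable, Dict, List, Optional


def candidate_selectors(entry: Dict) -> List[str]:
    """Order candidates: primary CSS, then XPath, then fallbacks (assumed order)."""
    ordered = [entry.get("css"), entry.get("xpath"), *entry.get("fallbacks", [])]
    return [selector for selector in ordered if selector]


def resolve_selector(entry: Dict, exists: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate that the page reports as present, else None."""
    for selector in candidate_selectors(entry):
        if exists(selector):
            return selector
    return None
```

Returning `None` leaves the decision to the caller, for example to trigger a fresh vision-based discovery.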

**FR4.2: Cache Invalidation Strategy**
- TTL: 7 days by default
- Validate on every 10th request
- Auto-invalidate on 3 consecutive failures
- Manual invalidation via API
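
The three automatic rules above can be sketched as small predicates. The constant and function names are hypothetical, and the sketch assumes timezone-aware timestamps and a per-domain request counter maintained elsewhere.

```python
from datetime import datetime, timedelta, timezone

CACHE_TTL = timedelta(days=7)      # default TTL from FR4.2
VALIDATE_EVERY = 10                # re-validate on every 10th request
MAX_CONSECUTIVE_FAILURES = 3       # auto-invalidate threshold


def should_validate(request_count: int) -> bool:
    """True on the 10th, 20th, 30th... request for a cached domain."""
    return request_count > 0 and request_count % VALIDATE_EVERY == 0


def should_invalidate(discovered_at: datetime,
                      consecutive_failures: int,
                      now: datetime) -> bool:
    """True when the entry has outlived its TTL or failed too often in a row."""
    expired = now - discovered_at > CACHE_TTL
    return expired or consecutive_failures >= MAX_CONSECUTIVE_FAILURES
```

Manual invalidation via the API would bypass both checks and simply delete the entry.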

**FR4.3: Selector Stability Scoring**
Based on Samelogic research:
- ID selectors: 95% stability
- data-test attributes: 90%
- Unique class combinations: 65-85%
- Position-based (nth-child): 40%
- Basic tags: 30%

**Scoring Formula:**
```
stability_score = (successful_validations / total_attempts) * selector_type_weight
```
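
As a worked instance of the formula, the sketch below treats the listed stability percentages as the `selector_type_weight` values. That mapping is an assumption, as are the dictionary keys and the 0.75 midpoint chosen for the 65-85% class-combination range.

```python
# Weights taken from the stability percentages listed in FR4.3 (assumed mapping).
SELECTOR_TYPE_WEIGHTS = {
    "id": 0.95,
    "data_test": 0.90,
    "class_combo": 0.75,   # midpoint of the 65-85% range, chosen for illustration
    "nth_child": 0.40,
    "tag": 0.30,
}


def stability_score(successful_validations: int,
                    total_attempts: int,
                    selector_type: str) -> float:
    """Observed success ratio scaled by the weight for the selector type."""
    if total_attempts <= 0:
        return 0.0  # no evidence yet
    ratio = successful_validations / total_attempts
    return ratio * SELECTOR_TYPE_WEIGHTS[selector_type]
```

With 148 successes out of 150 attempts, an ID selector scores about 0.937. The unweighted ratio 148/150 is about 0.987, which matches the top-level `stability_score` in the FR4.1 example, so that field may be intended as an unweighted ratio; the source does not say either way.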

---

### **FR5: OpenAI API Compatibility**

**FR5.1: Supported Endpoints**
- `POST /v1/chat/completions` - Primary chat endpoint
- `GET /v1/models` - List available models (discovered)
- `POST /admin/providers` - Register new provider
- `GET /admin/providers` - List registered providers
- `DELETE /admin/providers/{id}` - Remove provider

**FR5.2: Request Format**
```json
{
  "model": "gpt-4",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "stream": true,
  "temperature": 0.7,
  "max_tokens": 2000
}
```

**FR5.3: Response Format (Streaming)**
```
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1702000000,"model":"gpt-4","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1702000000,"model":"gpt-4","choices":[{"index":0,"delta":{"content":" there"},"finish_reason":null}]}

data: [DONE]
```
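
Producing this wire format from captured text pieces might look like the following sketch. The function name and parameters are hypothetical; the field layout mirrors the example above.

```python
import json
from typing import Iterable, Iterator


def sse_chunks(pieces: Iterable[str],
               completion_id: str,
               model: str,
               created: int) -> Iterator[str]:
    """Yield one SSE event per text piece, then the terminating [DONE] event."""
    for piece in pieces:
        chunk = {
            "id": completion_id,
            "object": "chat.completion.chunk",
            "created": created,
            "model": model,
            "choices": [
                {"index": 0, "delta": {"content": piece}, "finish_reason": None}
            ],
        }
        # Compact separators keep each event on one line, as in the example.
        yield "data: " + json.dumps(chunk, separators=(",", ":")) + "\n\n"
    yield "data: [DONE]\n\n"
```

Calling `list(sse_chunks(["Hello", " there"], "chatcmpl-123", "gpt-4", 1702000000))` yields the three events shown in FR5.3. Clients built on the OpenAI SDKs may also expect a final chunk with an empty delta and `"finish_reason": "stop"` before `[DONE]`; this sketch follows the example as written and omits it.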

**FR5.4: Response Format (Non-Streaming)**
```json
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1702000000,
  "model": "gpt-4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello there! How can I help you?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 15,
    "total_tokens": 25
  }
}
```

---

### **FR6: Session Management**

**FR6.1: Multi-Session Support**
- Concurrent sessions per provider
- Session isolation (separate browser contexts)
- Session pooling (reuse idle sessions)
- Max sessions per provider (configurable)

**FR6.2: Session Lifecycle**
```
Created → Authenticated → Active → Idle → Expired → Destroyed
```
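
The lifecycle could be enforced with an explicit transition table, as in this sketch. The states come from the diagram above; the allowed transitions beyond the straight-line path (for example Idle back to Active for pooled reuse, or early expiry) are assumptions, and all names are illustrative.

```python
from enum import Enum


class SessionState(Enum):
    CREATED = "created"
    AUTHENTICATED = "authenticated"
    ACTIVE = "active"
    IDLE = "idle"
    EXPIRED = "expired"
    DESTROYED = "destroyed"


# Assumed transition table: the linear path plus reuse and early-expiry edges.
ALLOWED_TRANSITIONS = {
    SessionState.CREATED: {SessionState.AUTHENTICATED, SessionState.DESTROYED},
    SessionState.AUTHENTICATED: {SessionState.ACTIVE, SessionState.EXPIRED},
    SessionState.ACTIVE: {SessionState.IDLE, SessionState.EXPIRED},
    SessionState.IDLE: {SessionState.ACTIVE, SessionState.EXPIRED},
    SessionState.EXPIRED: {SessionState.DESTROYED},
    SessionState.DESTROYED: set(),
}


def transition(current: SessionState, target: SessionState) -> SessionState:
    """Return the new state, or raise ValueError for a disallowed move."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```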

**FR6.3: Session Persistence**
- Save cookies to SQLite
- Store localStorage/sessionStorage data
- Persist IndexedDB (if needed)
- Session health checks (periodic validation)

**FR6.4: New Chat Functionality**
- Detect "new chat" button
- Click to start fresh conversation
- Clear context window
- Maintain session authentication

---

### **FR7: Error Handling & Recovery**

**FR7.1: Error Categories**

**Category A: Network Errors**
- Timeout (30s default)
- Connection refused
- DNS resolution failed
- SSL certificate invalid
- **Recovery:** Retry with exponential backoff (3 attempts)
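
A minimal sketch of the Category A recovery rule follows. It assumes "3 attempts" means three tries in total, doubles a hypothetical one-second base delay between tries, and accepts an injectable `sleep` so the delays can be observed in tests.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(operation: Callable[[], T],
                       attempts: int = 3,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep,
                       retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
                       ) -> T:
    """Run `operation`, retrying on network errors with doubling delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("unreachable")
```

Errors outside `retry_on`, such as invalid credentials, propagate immediately so they can be routed to the recovery path for their own category.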

**Category B: Authentication Errors**
- Invalid credentials
- Session expired
- CAPTCHA required
- Rate limited
- **Recovery:** Re-authenticate, solve CAPTCHA, wait for rate limit

**Category C: Discovery Errors**
- Vision API timeout
- No elements found
- Ambiguous elements (multiple matches)
- Selector invalid
- **Recovery:** Re-run discovery with refined prompts, use fallback selectors

**Category D: Automation Errors**
- Element not interactable
- Element not visible
- Click intercepted
- Navigation failed
- **Recovery:** Wait and retry, scroll into view, use JavaScript click

**Category E: Response Errors**
- No response detected
- Partial response
- Malformed response
- Stream interrupted
- **Recovery:** Re-send message, use fallback detection method

---

## 🔧 **Non-Functional Requirements**

### **NFR1: Performance**
- First token latency: <3 seconds when vision-based discovery is required
- First token latency: <500ms when cached selectors are used
- Selector cache hit rate: >90%
- Vision API calls: <10% of requests
- Concurrent sessions: 100+ per instance

### **NFR2: Reliability**
- Uptime: 99.5%
- Error recovery success rate: >95%
- Selector stability: >85%
- Auto-heal from failures: <30 seconds

### **NFR3: Scalability**
- Horizontal scaling via browser context pooling
- Stateless API (sessions in database)
- Support 1000+ concurrent chat conversations
- Provider registration: unlimited

### **NFR4: Security**
- Credentials encrypted at rest (AES-256)
- HTTPS only for external communication
- User messages are not logged unless logging is explicitly enabled (opt-in)
- Sandbox browser processes
- Regular security audits

### **NFR5: Maintainability**
- Modular architecture (easy to add providers)
- Comprehensive logging (structured JSON)
- Metrics and monitoring (Prometheus)
- Documentation (inline + external)
- Self-healing capabilities

---

## 🚀 **Success Criteria**

### **MVP Success:**
- ✅ Register 3 different providers (Z.AI, ChatGPT, Claude)
- ✅ Auto-discover UI elements with >90% accuracy
- ✅ Capture streaming responses correctly
- ✅ OpenAI SDK works transparently
- ✅ Handle authentication flows
- ✅ Cache selectors for performance

### **Production Success:**
- ✅ Support 10+ providers without code changes
- ✅ 95% selector cache hit rate
- ✅ <2s average response time
- ✅ Handle CAPTCHA automatically
- ✅ 99.5% uptime
- ✅ Self-heal from 95% of errors

---

## 📦 **Out of Scope (Future Work)**

- ❌ Voice input/output
- ❌ Video chat automation
- ❌ Mobile app automation (iOS/Android)
- ❌ Desktop app automation (Electron, etc.)
- ❌ Multi-user collaboration features
- ❌ Fine-tuning provider models
- ❌ Custom plugin development UI

---

## 🔗 **Integration Points**

### **Upstream Dependencies:**
- Playwright (browser automation)
- GLM-4.5V API (vision-based UI and CAPTCHA detection)
- 2Captcha API (CAPTCHA solving)
- SQLite (session storage)

### **Downstream Consumers:**
- OpenAI Python SDK
- OpenAI Node.js SDK
- Any HTTP client supporting SSE
- cURL, Postman, etc.

---

**Version:** 1.0
**Last Updated:** 2024-12-05
**Status:** Draft - Awaiting Implementation
