
@olesho olesho (Collaborator) commented Sep 23, 2025

Local Dockerised Eval Server

Summary by CodeRabbit

  • New Features

    • Centralized LLM configuration with programmatic panel API, per-request overrides, persistent settings, multi-tier models (main/mini/nano), and per-client application during evaluations.
    • Dynamic configuration via JSON-RPC (configure_llm) for runtime/provider updates.
  • Documentation

    • Added comprehensive LLM configuration docs, provider guides, protocol examples, usage snippets, and an environment template.
  • Chores

    • Default ports updated for evaluation/HTTP endpoints and examples adjusted for automated mode (auth disabled by default).

olesho and others added 13 commits September 16, 2025 04:52
2. ✓ Fixed multi-provider credential preservation in saveProviderSpecificSettings()
3. ✓ Secured logging in both handleStorageChange() and getDebugInfo()
- AgentNodes.ts: Keep LLMConfigurationManager approach for centralized config handling
- EvaluationAgent.ts: Keep configManager scope for proper cleanup in finally block
- AgentService.ts: Remove duplicate static method/property definitions

The conflicts were between centralized configuration management (HEAD) and
individual parameter passing (upstream/main). The centralized approach is
kept for better override support during evaluations.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@olesho olesho requested a review from tysonthomas9 September 23, 2025 17:07

coderabbitai bot commented Sep 23, 2025

Walkthrough

Adds a centralized LLMConfigurationManager and JSON‑RPC configure_llm, introduces nested main/mini/nano model schema and per‑request overrides plus persistent storage, updates eval server and frontend evaluation flows and examples, and adds automated‑mode auth/validation shortcuts across agent and UI components.

Changes

  • Documentation & Docs (MODEL-CONFIGS.md, eval-server/nodejs/CLAUDE.md): New docs describing LLM configuration architecture, model schema, JSON‑RPC configure_llm, multi‑provider support, model config examples, and protocol details.
  • Environment & Server Config (eval-server/nodejs/.env.example, eval-server/nodejs/src/config.js, eval-server/nodejs/README.md): Adds example env keys for multiple LLM providers and defaults; restructures CONFIG with providers, defaults, security, clients, evals; README updated to show the nested model schema.
  • Eval Server Core & RPC (eval-server/nodejs/src/api-server.js, eval-server/nodejs/src/lib/EvalServer.js, eval-server/nodejs/src/client-manager.js): Adds nested model config processing and helpers (processNestedModelConfig, createDefaultModelConfig, createDynamicEvaluationNested); adds per‑connection JSON‑RPC handling (configure_llm) and per‑client LLM merging; adjusts model precedence behavior.
  • Eval Server Examples & Ports (eval-server/nodejs/examples/library-usage.js, eval-server/nodejs/examples/multiple-evals.js, eval-server/nodejs/examples/with-http-wrapper.js): Examples import CONFIG, demonstrate multi‑provider model selection and configure_llm usage; adjusted ports/status messages (8080→8082) and automated‑mode auth messaging.
  • Frontend Build (front_end/panels/ai_chat/BUILD.gn): Adds core/LLMConfigurationManager.ts to build sources and metadata lists.
  • Frontend LLM Configuration Manager (front_end/panels/ai_chat/core/LLMConfigurationManager.ts): New singleton LLMConfigurationManager and LLMConfig interface: localStorage persistence, per‑request overrides (set/clear), validation, change listeners, provider‑specific persistence, debug utilities, and redaction/logging.
  • Agent Core Integration (front_end/panels/ai_chat/core/AgentService.ts, front_end/panels/ai_chat/core/AgentNodes.ts, front_end/panels/ai_chat/LLM/OpenAIProvider.ts): AgentService and AgentNodes now source provider/apiKey/models from LLMConfigurationManager; AgentService subscribes to config changes and adds refreshCredentials; OpenAIProvider short‑circuits credential checks in automated mode.
  • Evaluation Protocol & Agent (front_end/panels/ai_chat/evaluation/remote/EvaluationProtocol.ts, front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts, front_end/panels/ai_chat/evaluation/EvaluationAgent.ts): Adds configure_llm JSON‑RPC request/response types and helpers; evaluation params support per‑request api_key/endpoint; EvaluationAgent applies early overrides, clears them after runs, and handles configure_llm requests; automated‑mode auth skips added.
  • UI & Evaluation Config (front_end/panels/ai_chat/ui/AIChatPanel.ts, front_end/panels/ai_chat/common/EvaluationConfig.ts): AIChatPanel integrates the config manager, adds a setLLMConfiguration API, and centralizes model/provider getters with defined fallback logic; EvaluationConfig default endpoint changed to ws://localhost:8082 and automated‑mode persistence/secret handling adjusted.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant EvalServerWS as EvalServer (WS)
  participant FrontendAgent as Frontend EvaluationAgent
  participant ConfigMgr as LLMConfigurationManager

  rect rgba(200,230,255,0.25)
    note over Client,EvalServerWS: Dynamic LLM configuration via JSON‑RPC
    Client->>EvalServerWS: {"jsonrpc":"2.0","id":"1","method":"configure_llm","params":{provider, apiKey?, endpoint?, models{main,mini?,nano?}, partial?}}
    EvalServerWS->>EvalServerWS: validate and merge per‑connection LLM config
    EvalServerWS-->>Client: {"jsonrpc":"2.0","id":"1","result":{status:"success",appliedConfig}}
  end

  rect rgba(220,255,220,0.25)
    note over Client,FrontendAgent: Per‑request evaluation with nested/flat model config
    Client->>FrontendAgent: start_evaluation { model: {...} }
    FrontendAgent->>ConfigMgr: setOverride(derived from request)
    FrontendAgent->>FrontendAgent: run chat/tools using configMgr values
    FrontendAgent->>ConfigMgr: clearOverride()
  end
sequenceDiagram
  autonumber
  participant HTTPClient
  participant APIServer
  participant EvalEngine
  participant Provider

  HTTPClient->>APIServer: POST /responses { model: nested|flat, input }
  APIServer->>APIServer: processNestedModelConfig() -> nestedModelConfig
  APIServer->>EvalEngine: createDynamicEvaluationNested(input, nestedModelConfig)
  EvalEngine->>Provider: execute with selected provider/models/keys
  Provider-->>EvalEngine: result
  EvalEngine-->>APIServer: evaluation summary
  APIServer-->>HTTPClient: OpenAI‑like response with nested model metadata
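For reference, a hedged sketch of the second flow above: a POST /responses request carrying the nested per-tier model configuration. The field names (main_model/mini_model/nano_model, each with provider, model, api_key, and an optional endpoint for LiteLLM) follow the walkthrough and README examples in this PR; the port, path, and response shape are assumptions and may differ from the actual server.

```javascript
// Illustrative sketch only, not the project's actual client code.
// Sends a nested model config to the HTTP wrapper; omitted tiers are expected
// to fall back to the server defaults (createDefaultModelConfig).
async function runNestedEvaluation() {
  const res = await fetch('http://localhost:8080/responses', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      input: 'Summarize the current page',
      model: {
        main_model: { provider: 'openai', model: 'gpt-4', api_key: process.env.OPENAI_API_KEY },
        mini_model: {
          provider: 'litellm',
          model: 'gpt-4o-mini',
          api_key: process.env.LITELLM_API_KEY,
          endpoint: 'http://localhost:4000',
        },
        // nano_model omitted: the server should fill in its default nano config
      },
    }),
  });
  console.log(await res.json()); // OpenAI-like response with nested model metadata
}
```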

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

I twitch my whiskers, toggle keys,
New providers humming in the breeze. 🐇
Overrides hop in, then gently go,
Models nest like clover rows. 🌿
JSON whispers, configs flow — hop! 🚀

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title "Local Dockerised Eval Server" is concise and clearly relates to the eval-server and local-server configuration changes in the diff (new .env example, API/eval server ports, examples and README updates), but it omits other substantial work in this PR such as the new front-end LLMConfigurationManager, protocol extensions, and evaluation integration, so it only partially summarizes the full scope.
  • Docstring Coverage: ✅ Passed. No functions found in the changes; docstring coverage check skipped.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/local-eval-api-try3-w-main

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


claude bot commented Sep 23, 2025

Claude finished @olesho's task.


PR Review: Local Dockerised Eval Server

I've completed a comprehensive review of this PR implementing a local dockerized evaluation server. Here's my analysis:

Todo List:

  • Analyze overall architecture and changes
  • Review Node.js evaluation server backend code
  • Review frontend TypeScript integration code
  • Check configuration management and security
  • Evaluate documentation and examples
  • Assess test coverage
  • Provide comprehensive feedback and recommendations

Strengths

Architecture & Design

  • Solid WebSocket-based RPC architecture with JSON-RPC 2.0 protocol implementation
  • Excellent separation of concerns - clear distinction between EvalServer library, API server, and client management
  • Flexible multi-provider LLM support (OpenAI, Groq, OpenRouter, LiteLLM) with proper abstraction
  • Well-designed configuration override system enabling both manual and automated evaluation modes
  • Event-driven architecture with proper EventEmitter usage and tracing support

Code Quality

  • Comprehensive error handling with structured logging using Winston
  • Good TypeScript integration with proper type definitions and interfaces
  • Clean API design for the EvalServer library with intuitive ClientProxy pattern
  • Proper resource cleanup - WebSocket connections, RPC clients, and intervals are cleaned up
  • Well-structured configuration management with validation and environment variable support

Documentation

  • Excellent CLAUDE.md files providing clear architecture overview and usage instructions
  • Comprehensive MODEL-CONFIGS.md documenting the configuration system thoroughly
  • Good inline documentation and JSDoc comments throughout the codebase

⚠️ Areas for Improvement

Security Concerns

  1. API Key Exposure Risk (eval-server/nodejs/src/api-server.js:349)

    api_key: process.env.OPENAI_API_KEY
    • API keys are passed directly in response payloads and could be logged or exposed
    • Recommendation: Sanitize logs and consider token-based authentication
  2. CORS Configuration (eval-server/nodejs/src/api-server.js:61)

    res.setHeader('Access-Control-Allow-Origin', '*');
    • Wildcard CORS allows any origin to access the API
    • Recommendation: Configure specific allowed origins for production (see the allowlist sketch after this list)
  3. Weak Default Authentication (eval-server/nodejs/examples/library-usage.js:14)

    authKey: 'hello',
    • Examples use weak default secret keys
    • Recommendation: Generate cryptographically secure keys and document secure generation
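As a concrete illustration of recommendation 2, a minimal origin-allowlist sketch for the plain Node HTTP server; ALLOWED_ORIGINS and its environment variable are assumptions, not existing project configuration.

```javascript
// Sketch only: replace the wildcard CORS header with an explicit allowlist.
const ALLOWED_ORIGINS = (process.env.ALLOWED_ORIGINS || 'http://localhost:8080').split(',');

function setCorsHeaders(req, res) {
  const origin = req.headers.origin;
  if (origin && ALLOWED_ORIGINS.includes(origin)) {
    res.setHeader('Access-Control-Allow-Origin', origin);
    res.setHeader('Vary', 'Origin'); // avoid cached responses leaking across origins
  }
  // No header is set for unknown origins, so browsers block cross-origin reads.
}
```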

Performance & Reliability

  1. Potential Memory Leaks (eval-server/nodejs/src/lib/EvalServer.js:59-60)

    this.connectedClients = new Map();
    this.activeEvaluations = new Map();
    • No cleanup mechanism for completed evaluations or stale connections
    • Recommendation: Implement periodic cleanup and connection heartbeat monitoring
  2. High Timeout Values (eval-server/nodejs/src/api-server.js:415)

    timeout: 1500000, // 25 minutes
    • Very long timeouts could lead to resource exhaustion
    • Recommendation: Use more reasonable defaults (5-10 minutes) with configurable limits
  3. Synchronous File Operations (eval-server/nodejs/src/api-server.js:30)

    const configContent = fs.readFileSync(configPath, 'utf8');
    • Blocking operations can impact server performance
    • Recommendation: Use async file operations (fs.promises.readFile)
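A small sketch of the non-blocking variant suggested in item 3; configPath and the parsing step are placeholders for whatever the existing loader actually does.

```javascript
// Sketch only: non-blocking config load with fs.promises.
import { readFile } from 'node:fs/promises';

async function loadConfigFile(configPath) {
  const configContent = await readFile(configPath, 'utf8');
  return configContent; // parse as JSON/YAML exactly as the current loader does
}
```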

Error Handling

  1. Generic Error Responses (eval-server/nodejs/src/lib/EvalServer.js:370-378)

    • RPC errors return generic messages that could expose internal details
    • Recommendation: Implement proper error categorization and sanitization
  2. Missing Validation (eval-server/nodejs/src/api-server.js:260-305)

    • Request body validation is minimal in the responses endpoint
    • Recommendation: Add comprehensive input validation and sanitization

Code Organization

  1. Inconsistent Model Configuration Format

    • Mix of flat and nested model configuration patterns across the codebase
    • Recommendation: Standardize on the nested format and provide migration utilities
  2. Large Files

    • EvalServer.js (905 lines) and api-server.js (566 lines) are quite large
    • Recommendation: Extract message handlers and validation logic into separate modules

🧪 Test Coverage Issues

  • No unit tests found for the evaluation server components
  • No integration tests for WebSocket communication
  • No security tests for authentication and authorization flows
  • No performance tests for concurrent evaluation handling

Recommendation: Add comprehensive test suite covering:

  • Unit tests for core server functionality
  • Integration tests for WebSocket RPC communication
  • Security tests for authentication flows
  • Performance tests for concurrent evaluations

🚀 Performance Considerations

Positive Aspects

  • Efficient WebSocket connection reuse
  • Proper event-driven architecture minimizing blocking operations
  • Good resource management with cleanup handlers

Areas for Optimization

  • Consider connection pooling for high-throughput scenarios
  • Implement request rate limiting to prevent abuse
  • Add metrics collection for monitoring performance

📋 Recommendations Summary

High Priority

  1. Security hardening: Fix CORS configuration, secure API key handling, improve authentication
  2. Add comprehensive test suite: Unit, integration, and security tests
  3. Error handling improvements: Better validation and error sanitization

Medium Priority

  1. Performance optimizations: Async file operations, connection cleanup, reasonable timeouts
  2. Code organization: Break down large files, standardize configuration patterns

Low Priority

  1. Documentation: Add API documentation and deployment guides
  2. Monitoring: Add metrics and observability features

🎯 Overall Assessment

This is a well-architected and feature-rich implementation of a local evaluation server. The code demonstrates solid engineering practices with good separation of concerns, comprehensive configuration management, and excellent documentation.

The main concerns are around security hardening and test coverage, which are critical for production readiness. The architecture is sound and the codebase provides a solid foundation for a robust evaluation system.

Recommendation: Address the security issues and add test coverage before merging to production environments.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 19

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
front_end/panels/ai_chat/LLM/OpenAIProvider.ts (1)

421-512: Consider gating speculative model IDs (e.g., GPT‑5) behind an “experimental” flag.

Hardcoding forward-looking IDs can mislead users if unsupported. Mark as experimental or hide unless explicitly enabled.

eval-server/nodejs/src/config.js (1)

65-78: Require at least one LLM provider when requireLLM is true; enforce OPENAI_API_KEY if defaults.provider === 'openai'

validateConfig currently only checks CONFIG.llm.apiKey; expand validation to ensure at least one provider is configured (openai, LiteLLM endpoint+key, GROQ, or OpenRouter) when requireLLM is true, and explicitly require CONFIG.providers.openai.apiKey when CONFIG.defaults.provider === 'openai'. File: eval-server/nodejs/src/config.js (validateConfig). validateConfig is already invoked with requireLLM in judge flows (eval-server/nodejs/src/lib/EvalServer.js — validateConfig(!!this.judge) and validateConfig(true)).

Apply this change:

 export function validateConfig(requireLLM = false) {
   const errors = [];
   
-  // Only require OpenAI API key if LLM judge is explicitly needed
-  if (requireLLM && !CONFIG.llm.apiKey) {
-    errors.push('OPENAI_API_KEY is required when using LLM judge');
-  }
+  // Require at least one provider when LLM is needed
+  if (requireLLM) {
+    const hasAnyProvider =
+      !!CONFIG.providers.openai.apiKey ||
+      (!!CONFIG.providers.litellm.endpoint && !!CONFIG.providers.litellm.apiKey) ||
+      !!CONFIG.providers.groq.apiKey ||
+      !!CONFIG.providers.openrouter.apiKey;
+    if (!hasAnyProvider) {
+      errors.push('No LLM provider configured. Set OPENAI_API_KEY, or another provider (LiteLLM endpoint+key, GROQ_API_KEY, or OPENROUTER_API_KEY).');
+    }
+    // If default provider is openai, require OPENAI_API_KEY explicitly
+    if (CONFIG.defaults.provider === 'openai' && !CONFIG.providers.openai.apiKey) {
+      errors.push('OPENAI_API_KEY is required when default provider is openai.');
+    }
+  }
   
   if (CONFIG.server.port < 1 || CONFIG.server.port > 65535) {
     errors.push('PORT must be between 1 and 65535');
   }
front_end/panels/ai_chat/core/AgentService.ts (1)

771-805: Use config manager for LiteLLM endpoint check (avoid localStorage)

#doesCurrentConfigRequireApiKey should respect overrides and centralized storage. Reading localStorage bypasses overrides and can misclassify requirements.

Apply this diff:

       // For LiteLLM, only require API key if no endpoint is configured
       if (selectedProvider === 'litellm') {
-        const hasLiteLLMEndpoint = Boolean(localStorage.getItem('ai_chat_litellm_endpoint'));
+        const hasLiteLLMEndpoint = !!this.#configManager.getConfiguration().endpoint;
         // If we have an endpoint, API key is optional
         return !hasLiteLLMEndpoint;
       }
🧹 Nitpick comments (14)
front_end/panels/ai_chat/LLM/OpenAIProvider.ts (1)

297-337: Add request timeout to fetch to avoid hanging UI on network stalls.

Wrap fetch with AbortController and a configurable timeout.

eval-server/nodejs/examples/multiple-evals.js (1)

12-12: Provider-aware examples look good; minor consistency nit with auth.

Examples correctly source provider creds from CONFIG. Consider aligning auth usage with other examples (either all on or all off) to reduce confusion.

Also applies to: 17-35, 37-52, 57-68, 79-86, 94-96, 132-141

front_end/panels/ai_chat/common/EvaluationConfig.ts (2)

141-148: Use crypto-based UUIDs.

Math.random isn’t suitable for UUIDs. Prefer crypto.getRandomValues when available.

-  private generateUUID(): string {
-    // Generate UUID v4
-    return 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function(c) {
-      const r = Math.random() * 16 | 0;
-      const v = c === 'x' ? r : (r & 0x3 | 0x8);
-      return v.toString(16);
-    });
-  }
+  private generateUUID(): string {
+    if (typeof crypto !== 'undefined' && 'getRandomValues' in crypto) {
+      const bytes = new Uint8Array(16);
+      crypto.getRandomValues(bytes);
+      // Per RFC 4122
+      bytes[6] = (bytes[6] & 0x0f) | 0x40;
+      bytes[8] = (bytes[8] & 0x3f) | 0x80;
+      const toHex = (n: number) => n.toString(16).padStart(2, '0');
+      const b = Array.from(bytes, toHex).join('');
+      return `${b.slice(0,8)}-${b.slice(8,12)}-${b.slice(12,16)}-${b.slice(16,20)}-${b.slice(20)}`;
+    }
+    // Fallback
+    return 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, c => {
+      const r = Math.random() * 16 | 0;
+      const v = c === 'x' ? r : (r & 0x3 | 0x8);
+      return v.toString(16);
+    });
+  }

33-33: Centralize evaluation WebSocket URL and align default to ws://localhost:8080

EvaluationConfig.ts defaults to ws://localhost:8082 while most server docs/examples use ws://localhost:8080 — introduce a single DEFAULT_EVAL_WS (e.g. 'ws://localhost:8080') and replace hardcoded occurrences to avoid connection failures.

  • Update front_end/panels/ai_chat/common/EvaluationConfig.ts (endpoint default, localStorage set/load/fallback).
  • Update front_end/panels/ai_chat/ui/SettingsDialog.ts (placeholder/value/save/load fallbacks).
  • Scan repo for other hardcoded ws://localhost:8080 / ws://localhost:8082 instances and use the centralized constant (note: with-http-wrapper example intentionally runs WS on 8082 when an HTTP wrapper uses 8080).
eval-server/nodejs/CLAUDE.md (1)

25-34: Add security guidance for credentials.

Recommend not logging API keys; prefer env vars for providers; document AUTH_SECRET_KEY usage and how to disable/enable auth in local Docker setups.

Also applies to: 165-172

front_end/panels/ai_chat/ui/AIChatPanel.ts (2)

753-755: Listen for LLM config changes to keep UI/state in sync.

When config changes from another tab or via configure_llm, update selections and re-init AgentService.

   // Initialize configuration manager
   this.#configManager = LLMConfigurationManager.getInstance();
+  this.#configManager.addChangeListener(() => {
+    // Refresh local selections from storage
+    this.#loadModelSelections();
+    this.#validateAndFixModelSelections();
+    // Reinitialize backend clients
+    this.#initializeAgentService();
+    this.performUpdate();
+  });

1122-1137: Update local selections after saving configuration.

Ensure #selectedModel/#miniModel/#nanoModel reflect the newly saved config before reinitialization.

   // Save configuration to localStorage
   this.#configManager.saveConfiguration(config);
 
   // Refresh the agent service with new configuration
-  this.refreshCredentials();
+  this.#loadModelSelections();
+  this.#validateAndFixModelSelections();
+  this.refreshCredentials();
eval-server/nodejs/examples/library-usage.js (4)

95-149: Add timeouts to example evaluations.

Prevent hanging examples by setting an explicit timeout (if supported by client.evaluate API).

       const response = await client.evaluate({
         id: `model_test_${test.model.replace(/[^a-z0-9]/g, '_')}`,
         name: `${test.model} ${test.task}`,
         tool: "chat",
+        timeout: 60000,
         input: {
           message: test.message
         },
         model: {

Please confirm evaluate(params) supports a top-level timeout (per EvaluationParams). If not, I’ll adjust the suggestion.


35-54: Provider detection: tolerate missing keys and avoid implying readiness.

Consider logging which providers are active vs. partially configured, and skip passing undefined endpoints except for LiteLLM. This keeps example output less confusing.
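One possible shape for that readiness logging, assuming the CONFIG.providers layout referenced elsewhere in this review (a per-provider apiKey, plus an endpoint for LiteLLM) and the examples' relative import path:

```javascript
// Sketch only: report which providers are fully configured vs. missing credentials.
import { CONFIG } from '../src/config.js';

for (const [name, provider] of Object.entries(CONFIG.providers)) {
  const ready = name === 'litellm'
      ? Boolean(provider.endpoint && provider.apiKey)
      : Boolean(provider.apiKey);
  console.log(`${name}: ${ready ? 'configured' : 'not configured'}`);
}
```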


152-237: Demonstrate persistent configuration via configure_llm

Server implements the RPC (eval-server/nodejs/src/lib/EvalServer.js:346–384) and provider defaults live in eval-server/nodejs/src/config.js (~line 17). Add a short example in eval-server/nodejs/examples/library-usage.js that calls configure_llm once (JSON‑RPC or the client helper if one exists) and then runs multiple evaluate calls to show the persistent configuration taking effect.
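A rough sketch of what such an example could look like, sending configure_llm once over a raw WebSocket; the JSON-RPC shape mirrors the sequence diagram in the walkthrough, while the port, model names, and the absence of a client helper are assumptions.

```javascript
// Sketch only: apply a persistent LLM configuration once, then reuse it.
import WebSocket from 'ws';

const ws = new WebSocket('ws://127.0.0.1:8080');

ws.on('open', () => {
  ws.send(JSON.stringify({
    jsonrpc: '2.0',
    id: 'cfg-1',
    method: 'configure_llm',
    params: {
      provider: 'openai',
      apiKey: process.env.OPENAI_API_KEY,
      models: { main: 'gpt-4', mini: 'gpt-4-mini', nano: 'gpt-3.5-turbo' },
      partial: false,
    },
  }));
});

ws.on('message', data => {
  const msg = JSON.parse(data.toString());
  if (msg.id === 'cfg-1') {
    console.log('configure_llm applied:', msg.result);
    // Subsequent client.evaluate(...) calls (as in the examples above) should
    // now run against this configuration without repeating the model block.
  }
});
```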


14-17: Use CONFIG.server.host/port (or env) to avoid front/back-end port drift

Many examples, tests and docs hardcode ws://127.0.0.1:8080 while the front-end defaults to ws://localhost:8082 — centralize host/port via eval-server/nodejs/src/config.js (CONFIG.server.host / CONFIG.server.port or process.env) and reference that in examples, logs and client scripts so they stay in sync.

Key places to update (examples/docs/clients + front-end mismatch):

  • eval-server/nodejs/examples/library-usage.js (server instantiation + logs)
  • eval-server/nodejs/examples/multiple-evals.js
  • eval-server/nodejs/examples/with-http-wrapper.js (make both WS/HTTP URLs config-driven)
  • eval-server/python/test_client.py, eval-server/python/quick_test.py
  • eval-server/nodejs/docs/CLIENT_SETUP.md, README files
  • front_end/panels/ai_chat/common/EvaluationConfig.ts (default ws://localhost:8082) vs front_end/panels/ai_chat/ui/SettingsDialog.ts (placeholders/values use ws://localhost:8080) — align to read the configured endpoint

Use CONFIG.server.host and CONFIG.server.port (or process.env) when constructing connection URLs and console logs.

front_end/panels/ai_chat/core/LLMConfigurationManager.ts (3)

144-153: Return undefined for absent optional fields in getConfiguration.

Prevents leaking empty strings into optional fields and simplifies downstream checks.

Apply this diff:

   getConfiguration(): LLMConfig {
     return {
       provider: this.getProvider(),
       apiKey: this.getApiKey(),
       endpoint: this.getEndpoint(),
       mainModel: this.getMainModel(),
-      miniModel: this.getMiniModel(),
-      nanoModel: this.getNanoModel(),
+      miniModel: this.getMiniModel() || undefined,
+      nanoModel: this.getNanoModel() || undefined,
     };
   }

341-351: Redact endpoints in storage-change logs.

Endpoints may include internal hosts or query params; treat them as sensitive.

Apply this diff:

   private handleStorageChange(event: StorageEvent): void {
     if (event.key && Object.values(STORAGE_KEYS).includes(event.key as any)) {
       const sensitiveKeys = new Set([
         STORAGE_KEYS.OPENAI_API_KEY,
         STORAGE_KEYS.LITELLM_API_KEY,
         STORAGE_KEYS.GROQ_API_KEY,
         STORAGE_KEYS.OPENROUTER_API_KEY,
+        STORAGE_KEYS.LITELLM_ENDPOINT,
       ]);

26-36: Avoid duplicating storage key sources of truth.

Keys here overlap with core/Constants.ts and provider classes (getCredentialStorageKeys). Centralize to prevent drift.

Consider importing keys from a single module (e.g., core/Constants.ts) and/or using provider.getCredentialStorageKeys() for API key names instead of hardcoding them here.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c561c79 and c585f1d.

⛔ Files ignored due to path filters (1)
  • eval-server/nodejs/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (19)
  • MODEL-CONFIGS.md (1 hunks)
  • eval-server/nodejs/.env.example (1 hunks)
  • eval-server/nodejs/CLAUDE.md (4 hunks)
  • eval-server/nodejs/examples/library-usage.js (3 hunks)
  • eval-server/nodejs/examples/multiple-evals.js (4 hunks)
  • eval-server/nodejs/examples/with-http-wrapper.js (1 hunks)
  • eval-server/nodejs/src/api-server.js (6 hunks)
  • eval-server/nodejs/src/config.js (1 hunks)
  • eval-server/nodejs/src/lib/EvalServer.js (5 hunks)
  • front_end/panels/ai_chat/BUILD.gn (2 hunks)
  • front_end/panels/ai_chat/LLM/OpenAIProvider.ts (2 hunks)
  • front_end/panels/ai_chat/common/EvaluationConfig.ts (3 hunks)
  • front_end/panels/ai_chat/core/AgentNodes.ts (2 hunks)
  • front_end/panels/ai_chat/core/AgentService.ts (10 hunks)
  • front_end/panels/ai_chat/core/LLMConfigurationManager.ts (1 hunks)
  • front_end/panels/ai_chat/evaluation/EvaluationAgent.ts (3 hunks)
  • front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts (12 hunks)
  • front_end/panels/ai_chat/evaluation/remote/EvaluationProtocol.ts (2 hunks)
  • front_end/panels/ai_chat/ui/AIChatPanel.ts (5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
eval-server/nodejs/src/config.js

📄 CodeRabbit inference engine (eval-server/nodejs/CLAUDE.md)

eval-server/nodejs/src/config.js: Centralize configuration via environment variables in src/config.js
Require OPENAI_API_KEY and PORT (default 8080) from environment configuration

Files:

  • eval-server/nodejs/src/config.js
🧠 Learnings (9)
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/server.js : WebSocket server must accept agent connections, manage agent lifecycle (connect/ready/disconnect), orchestrate evaluation sessions, and handle bidirectional RPC communication

Applied to files:

  • front_end/panels/ai_chat/evaluation/EvaluationAgent.ts
  • eval-server/nodejs/examples/with-http-wrapper.js
  • eval-server/nodejs/src/lib/EvalServer.js
  • eval-server/nodejs/CLAUDE.md
  • front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/config.js : Centralize configuration via environment variables in src/config.js

Applied to files:

  • front_end/panels/ai_chat/evaluation/EvaluationAgent.ts
  • eval-server/nodejs/.env.example
  • eval-server/nodejs/examples/library-usage.js
  • eval-server/nodejs/src/config.js
  • eval-server/nodejs/examples/multiple-evals.js
  • front_end/panels/ai_chat/common/EvaluationConfig.ts
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/config.js : Require OPENAI_API_KEY and PORT (default 8080) from environment configuration

Applied to files:

  • eval-server/nodejs/.env.example
  • eval-server/nodejs/examples/with-http-wrapper.js
  • eval-server/nodejs/examples/library-usage.js
  • eval-server/nodejs/src/config.js
  • front_end/panels/ai_chat/core/AgentService.ts
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/evaluator.js : LLM evaluator must integrate with the OpenAI API to judge agent responses

Applied to files:

  • eval-server/nodejs/examples/library-usage.js
  • eval-server/nodejs/examples/multiple-evals.js
  • eval-server/nodejs/CLAUDE.md
  • front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/rpc-client.js : RPC client must call Evaluate(request: string) -> string on connected agents

Applied to files:

  • eval-server/nodejs/src/api-server.js
  • eval-server/nodejs/CLAUDE.md
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/rpc-client.js : RPC client must implement JSON-RPC 2.0 protocol

Applied to files:

  • eval-server/nodejs/src/lib/EvalServer.js
  • eval-server/nodejs/CLAUDE.md
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/evaluator.js : LLM evaluator must return structured JSON with scores and reasoning

Applied to files:

  • eval-server/nodejs/examples/multiple-evals.js
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/rpc-client.js : RPC client must correlate requests and responses using unique IDs

Applied to files:

  • eval-server/nodejs/CLAUDE.md
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/rpc-client.js : RPC client must handle timeouts and error conditions

Applied to files:

  • eval-server/nodejs/CLAUDE.md
🧬 Code graph analysis (11)
front_end/panels/ai_chat/core/AgentNodes.ts (1)
front_end/panels/ai_chat/core/LLMConfigurationManager.ts (1)
  • LLMConfigurationManager (42-390)
front_end/panels/ai_chat/evaluation/EvaluationAgent.ts (1)
front_end/panels/ai_chat/core/BuildConfig.ts (1)
  • BUILD_CONFIG (11-20)
eval-server/nodejs/examples/library-usage.js (1)
eval-server/nodejs/src/config.js (2)
  • CONFIG (5-63)
  • CONFIG (5-63)
eval-server/nodejs/src/lib/EvalServer.js (1)
eval-server/nodejs/src/config.js (2)
  • CONFIG (5-63)
  • CONFIG (5-63)
eval-server/nodejs/examples/multiple-evals.js (2)
eval-server/nodejs/src/config.js (2)
  • CONFIG (5-63)
  • CONFIG (5-63)
eval-server/nodejs/examples/library-usage.js (3)
  • response (63-71)
  • response (121-135)
  • response (195-223)
front_end/panels/ai_chat/core/AgentService.ts (3)
front_end/panels/ai_chat/core/LLMConfigurationManager.ts (1)
  • LLMConfigurationManager (42-390)
front_end/panels/ai_chat/core/BuildConfig.ts (1)
  • BUILD_CONFIG (11-20)
front_end/panels/ai_chat/core/Graph.ts (1)
  • createAgentGraph (24-42)
front_end/panels/ai_chat/LLM/OpenAIProvider.ts (1)
front_end/panels/ai_chat/core/BuildConfig.ts (1)
  • BUILD_CONFIG (11-20)
front_end/panels/ai_chat/ui/AIChatPanel.ts (2)
front_end/panels/ai_chat/core/LLMConfigurationManager.ts (1)
  • LLMConfigurationManager (42-390)
front_end/panels/ai_chat/core/AgentService.ts (1)
  • logger (334-345)
front_end/panels/ai_chat/common/EvaluationConfig.ts (1)
front_end/panels/ai_chat/core/BuildConfig.ts (1)
  • BUILD_CONFIG (11-20)
front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts (5)
front_end/panels/ai_chat/evaluation/remote/EvaluationProtocol.ts (5)
  • isLLMConfigurationRequest (297-299)
  • LLMConfigurationRequest (259-264)
  • createErrorResponse (240-255)
  • ErrorCodes (141-155)
  • createLLMConfigurationResponse (314-334)
front_end/panels/ai_chat/core/BuildConfig.ts (1)
  • BUILD_CONFIG (11-20)
front_end/panels/ai_chat/core/AgentService.ts (1)
  • AgentService (45-910)
front_end/panels/ai_chat/core/LLMConfigurationManager.ts (2)
  • LLMConfigurationManager (42-390)
  • hasOverride (184-186)
eval-server/nodejs/examples/library-usage.js (3)
  • response (63-71)
  • response (121-135)
  • response (195-223)
front_end/panels/ai_chat/core/LLMConfigurationManager.ts (4)
front_end/panels/ai_chat/core/AgentService.ts (1)
  • logger (334-345)
front_end/panels/ai_chat/core/Logger.ts (1)
  • createLogger (316-318)
front_end/panels/ai_chat/LLM/LLMTypes.ts (1)
  • LLMProvider (142-142)
front_end/panels/ai_chat/core/Constants.ts (1)
  • STORAGE_KEYS (39-45)
🪛 dotenv-linter (3.3.0)
eval-server/nodejs/.env.example

[warning] 6-6: [UnorderedKey] The HOST key should go before the PORT key

(UnorderedKey)


[warning] 16-16: [UnorderedKey] The LITELLM_API_KEY key should go before the LITELLM_ENDPOINT key

(UnorderedKey)


[warning] 27-27: [UnorderedKey] The DEFAULT_MAIN_MODEL key should go before the DEFAULT_PROVIDER key

(UnorderedKey)


[warning] 28-28: [UnorderedKey] The DEFAULT_MINI_MODEL key should go before the DEFAULT_PROVIDER key

(UnorderedKey)


[warning] 29-29: [UnorderedKey] The DEFAULT_NANO_MODEL key should go before the DEFAULT_PROVIDER key

(UnorderedKey)


[warning] 33-33: [UnorderedKey] The LOG_DIR key should go before the LOG_LEVEL key

(UnorderedKey)


[warning] 45-45: [EndingBlankLine] No blank line at the end of the file

(EndingBlankLine)

🪛 markdownlint-cli2 (0.18.1)
MODEL-CONFIGS.md

56-56: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


62-62: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


384-384: Multiple headings with the same content

(MD024, no-duplicate-heading)

🔇 Additional comments (23)
front_end/panels/ai_chat/BUILD.gn (1)

61-61: LLMConfigurationManager is correctly added to sources and metadata.

Inclusion looks consistent across module and metadata lists.

Also applies to: 218-218

front_end/panels/ai_chat/evaluation/EvaluationAgent.ts (1)

277-281: Log redaction and semantics look good.

The verification log avoids printing secrets and includes booleans only.

eval-server/nodejs/examples/with-http-wrapper.js (1)

14-17: Ports swapped: verify README and compose files reflect ws:8082/http:8080.

Ensure docs/scripts point to the updated URLs so users don’t connect to stale ports.

Also applies to: 21-22, 28-38

eval-server/nodejs/src/config.js (2)

17-32: Providers block is a solid centralization of credentials.

Clear separation by provider with env-driven values. LGTM.


34-41: Defaults section is helpful; ensure these align with frontend defaults.

No action needed if synced via docs; otherwise add a note in README.

front_end/panels/ai_chat/core/AgentNodes.ts (2)

18-18: Good: central manager import for unified LLM config.

Keeps provider/model/apiKey sourcing consistent.


817-827: Nice integration: tool execution sourced from LLMConfigurationManager.

Supports per‑request overrides and automated mode without leaking provider creds.

If createAgentNode’s model/provider are also driven by the same manager upstream, everything stays in sync. Please confirm.

front_end/panels/ai_chat/LLM/OpenAIProvider.ts (1)

525-533: Automated-mode skip is safe — runtime API key is supplied via config override.

EvaluationAgent sets LLMConfigurationManager.setOverride(...) with the per-request api_key before calling AgentService.sendMessage; AgentService.#initializeLLMClient reads config.apiKey and passes it into LLMClient.initialize, which constructs OpenAIProvider with that apiKey (no 401 risk if override is provided). See front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts and front_end/panels/ai_chat/core/AgentService.ts.

eval-server/nodejs/CLAUDE.md (1)

21-24: Document real default port to match the frontend.

This doc states default PORT 8080. Frontend defaults to 8082. Clarify which is canonical (or note Docker defaults) to avoid confusion.

front_end/panels/ai_chat/ui/AIChatPanel.ts (4)

335-341: Centralized mini-model fallback looks good.


343-351: Nano-model fallback hierarchy reads correctly.


353-362: Provider derivation via LLMConfigurationManager is appropriate.


365-373: Mini-model/provider pairing via manager is consistent.

front_end/panels/ai_chat/common/EvaluationConfig.ts (1)

69-70: Automated-mode safe — server skips secret validation; confirm front-end sends auth_verify or enable server auth

EvalServer.handleAuthVerification calls clientManager.registerClient(..., true) (skipSecretValidation=true), so a missing client secret will not block registration; CONFIG.security.authSecretKey is defined but not referenced by EvalServer. Ensure the front-end emits the 'auth_verify' message to complete registration, or enable server-side auth (AUTH_SECRET_KEY / EvalServer auth) if you require strict secret checks.

Likely an incorrect or invalid review comment.

eval-server/nodejs/src/lib/EvalServer.js (2)

199-205: Public accessor looks good

Exposing the client manager via a getter is appropriate and scoped.


275-280: Good JSON‑RPC request dispatch hook

Recognizes client→server JSON‑RPC requests early and delegates cleanly.

front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts (1)

282-288: Automated mode auth bypass acknowledged

Skipping secret-key verification in AUTOMATED_MODE is acceptable for Docker/CI. Ensure this cannot be enabled in user builds.

Confirm BUILD_CONFIG.AUTOMATED_MODE is only flipped at build time for Docker images and not in user releases.

eval-server/nodejs/src/api-server.js (2)

310-334: Nested/legacy model processing LGTM

Clear precedence and sane defaults.


339-351: Default provider key usage is acceptable; avoid logging later

Using env OPENAI_API_KEY as default is fine for local flows. Ensure it’s never logged (see previous comment).

front_end/panels/ai_chat/evaluation/remote/EvaluationProtocol.ts (2)

93-95: Per-request api_key/endpoint fields added appropriately

Matches nested model override needs and LiteLLM routing.


257-334: LLM configure_llm JSON‑RPC additions look consistent

Types, guard, and helpers align with server method name and JSON‑RPC 2.0.

front_end/panels/ai_chat/core/AgentService.ts (1)

151-209: Centralized config + provider registration logic LGTM

Validation flow with AUTOMATED_MODE bypass and provider list assembly is sound.

front_end/panels/ai_chat/core/LLMConfigurationManager.ts (1)

106-124: No action needed — credential storage keys match provider implementations.

OpenAI, LiteLLM, Groq and OpenRouter providers' getCredentialStorageKeys() return the same ai_chat_* keys used by LLMConfigurationManager, so AgentService will see consistent credential names.

Comment on lines +5 to +6
PORT=8080
HOST=127.0.0.1

🧹 Nitpick

Address dotenv-linter warnings (ordering + trailing newline).

Reorder keys and add final newline for cleanliness. Functionality unaffected.

Apply this diff:

-PORT=8080
-HOST=127.0.0.1
+HOST=127.0.0.1
+PORT=8080
@@
-LITELLM_ENDPOINT=http://localhost:4000
-LITELLM_API_KEY=your-litellm-api-key-here
+LITELLM_API_KEY=your-litellm-api-key-here
+LITELLM_ENDPOINT=http://localhost:4000
@@
-DEFAULT_PROVIDER=openai
-DEFAULT_MAIN_MODEL=gpt-4
-DEFAULT_MINI_MODEL=gpt-4-mini
-DEFAULT_NANO_MODEL=gpt-3.5-turbo
+DEFAULT_MAIN_MODEL=gpt-4
+DEFAULT_MINI_MODEL=gpt-4-mini
+DEFAULT_NANO_MODEL=gpt-3.5-turbo
+DEFAULT_PROVIDER=openai
@@
-LOG_LEVEL=info
-LOG_DIR=./logs
+LOG_DIR=./logs
+LOG_LEVEL=info
@@
-AUTH_SECRET_KEY=
+AUTH_SECRET_KEY=
+

Also applies to: 15-16, 27-29, 33-34, 45-45

🧰 Tools
🪛 dotenv-linter (3.3.0)

[warning] 6-6: [UnorderedKey] The HOST key should go before the PORT key

(UnorderedKey)

🤖 Prompt for AI Agents
In eval-server/nodejs/.env.example around lines 5-6 (and similarly at 15-16,
27-29, 33-34, 45-45), the .env entries are out of the linter's expected ordering
and the file lacks a trailing newline; reorder the environment keys into the
expected (alphabetical) order within each block/section and ensure the file ends
with a single final newline character so dotenv-linter warnings are resolved.

Comment on lines +23 to +33
### Current localStorage Keys
- `ai_chat_provider`: Selected provider ('openai', 'litellm', 'groq', 'openrouter')
- `ai_chat_model_selection`: Main model name
- `ai_chat_mini_model`: Mini model name
- `ai_chat_nano_model`: Nano model name
- `ai_chat_api_key`: OpenAI API key
- `ai_chat_litellm_endpoint`: LiteLLM server endpoint
- `ai_chat_litellm_api_key`: LiteLLM API key
- `ai_chat_groq_api_key`: Groq API key
- `ai_chat_openrouter_api_key`: OpenRouter API key


⚠️ Potential issue

Avoid persisting API keys in localStorage.

Storing provider keys in localStorage is high risk (readable by any script in origin; survives XSS). Prefer:

  • session or memory-only overrides for automation;
  • tighter-scoped storage (e.g., chrome.storage.session or in-memory singleton);
  • if persistence is required, encrypt at rest with a user-secret not available to page JS.

At minimum, document the security trade-offs and recommend non-persistent modes for CI.

Also applies to: 34-41, 360-377
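For automation, a hedged sketch of the memory-only alternative using the manager's own override path; setOverride/clearOverride are described in this PR, but the import path, the override fields, and runtimeApiKey are placeholders.

```javascript
// Sketch only: keep per-run credentials out of localStorage entirely.
import { LLMConfigurationManager } from './core/LLMConfigurationManager.js';

async function runWithEphemeralCredentials(runtimeApiKey, runEvaluation) {
  const configManager = LLMConfigurationManager.getInstance();
  configManager.setOverride({
    provider: 'openai',
    apiKey: runtimeApiKey,   // supplied with the evaluation request, never persisted
    mainModel: 'gpt-4',
  });
  try {
    await runEvaluation();   // evaluation reads config through the manager
  } finally {
    configManager.clearOverride(); // nothing survives the run
  }
}
```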

🤖 Prompt for AI Agents
In MODEL-CONFIGS.md around lines 23 to 33 (and also apply same guidance to lines
34-41 and 360-377), the document currently lists API key localStorage keys which
encourages insecure persistence; update the file to remove/replace any
recommendation to store provider API keys in localStorage and instead state
explicit secure alternatives: recommend only session- or memory-only storage
(e.g., in-memory singleton or session-scoped browser APIs), recommend
platform-scoped secure storage (e.g., chrome.storage.session or
OS/keychain-backed storage) for short-lived keys, and if persistence is required
document encrypt-at-rest with a user-provided secret not available to page JS,
plus explicitly call out the XSS risk and recommend non-persistent modes for
CI/automation; make these changes as concise replacements of the listed keys and
add a brief note pointing to secure implementation patterns rather than code
examples.

Comment on lines +56 to +62
**Per-Request Override (Temporary)**
- Configuration override for individual evaluation requests
- No localStorage changes
- Each tab can have different overrides simultaneously
- Clean up after request completion

**Persistent Configuration (New)**

🧹 Nitpick

Fix MD036: use headings instead of emphasis.

Replace emphasized labels with proper headings, e.g., “#### Per-Request Override (Temporary)” and “#### Persistent Configuration (New)”.

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

56-56: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


62-62: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

🤖 Prompt for AI Agents
In MODEL-CONFIGS.md around lines 56 to 62, the section labels use emphasis
instead of proper headings; replace the emphasized labels like "**Per-Request
Override (Temporary)**" and "**Persistent Configuration (New)**" with
appropriate Markdown headings (for example "#### Per-Request Override
(Temporary)" and "#### Persistent Configuration (New)"), leaving the bullet
lists unchanged so the structure uses real headings for accessibility and
linting compliance.

Comment on lines +100 to +104
// Persistence (manual mode & persistent automated mode)
saveConfiguration(config: LLMConfig): void;
loadConfiguration(): LLMConfig;
}
```

🛠️ Refactor suggestion

Support partial persistent updates or remove the partial flag.

LLMConfigurationManager only exposes saveConfiguration(config: LLMConfig) which implies full writes, while the protocol adds partial?: boolean. Implement a partial-merge path to avoid unintentionally clearing fields on partial updates.

Suggested addition (outside this doc, in LLMConfigurationManager.ts):

applyPartialConfiguration(partial: Partial<LLMConfig>): void {
  const current = this.loadConfiguration();
  const next: LLMConfig = {
    ...current,
    ...partial,
    mainModel: partial.mainModel ?? current.mainModel,
    miniModel: partial.miniModel ?? current.miniModel,
    nanoModel: partial.nanoModel ?? current.nanoModel,
    provider: partial.provider ?? current.provider,
    apiKey: partial.apiKey ?? current.apiKey,
    endpoint: partial.endpoint ?? current.endpoint,
  };
  this.saveConfiguration(next);
}

Also applies to: 133-134

Comment on lines +166 to +187
export interface EvaluationParams {
evaluationId: string;
name: string;
url: string;
tool: string;
input: any;
model?: {
main_model?: string;
mini_model?: string;
nano_model?: string;
provider?: string;
api_key?: string; // New: per-request API key
endpoint?: string; // New: per-request endpoint (LiteLLM)
};
timeout: number;
metadata: {
tags: string[];
retries: number;
priority?: 'low' | 'normal' | 'high';
};
}
```

⚠️ Potential issue

Unify per-request model schema (flat vs nested).

Docs show two incompatible shapes:

  • Flat override: model: { main_model?: string; mini_model?: string; provider?: string; api_key?: string; endpoint?: string }
  • Nested objects: model: { main_model: { provider, model, api_key, endpoint }, mini_model: {…}, nano_model: {…} }

Pick one and update all examples and types to match, or clearly version the protocols. This ambiguity will cause client/server decode failures.

Also applies to: 133-135, 137-161

🤖 Prompt for AI Agents
In MODEL-CONFIGS.md around lines 166-187 (and also update sections 133-135 and
137-161), the model field is ambiguous between a flat per-request override shape
and a nested-per-model object shape; choose one schema and make all types and
examples consistent: either (A) flatten model to a single object with optional
keys main_model/mini_model/nano_model as strings plus provider/api_key/endpoint
at the same level, or (B) switch to the nested shape where
main_model/mini_model/nano_model are objects containing provider, model,
api_key, endpoint; then update the EvaluationParams interface and every example
block to match the chosen shape, bump or version the protocol section header if
you need to preserve the old shape, and run a quick scan to ensure there are no
remaining mixed examples or type declarations.

Comment on lines +384 to +386
## Usage Examples

### Manual Mode (Settings UI)

🧹 Nitpick

Fix MD024: duplicate “Usage Examples” heading.

There are two “## Usage Examples” sections. Rename or merge to avoid duplication.

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

384-384: Multiple headings with the same content

(MD024, no-duplicate-heading)

🤖 Prompt for AI Agents
In MODEL-CONFIGS.md around lines 384-386, there are duplicate "## Usage
Examples" headings; remove or rename one to avoid the MD024 duplicate-heading
error—either merge the content under a single "## Usage Examples" section or
rename the second heading to something specific like "### Manual Mode (Settings
UI)" (or another distinct heading) and update any cross-references accordingly.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (2)
eval-server/nodejs/README.md (1)

148-164: Doc example: include LiteLLM endpoint when demonstrating multi‑provider config.

If any tier uses LiteLLM, document the optional endpoint field in that tier’s object so readers know how to route requests.

Example snippet:

"mini_model": {
  "provider": "litellm",
  "model": "gpt-4o-mini",
  "api_key": "sk-...",
  "endpoint": "http://localhost:4000"
}
eval-server/nodejs/src/api-server.js (1)

312-322: Accept legacy/flat model shapes and wrap into nested tiers.

Current code assumes request.model is already nested. Add handling for string/flat objects to avoid 400s with legacy clients.

Apply this diff:

-    // If nested format is provided, use it directly with fallbacks
-    if (requestBody.model) {
-      return {
-        main_model: requestBody.model.main_model || this.createDefaultModelConfig('main', defaults),
-        mini_model: requestBody.model.mini_model || this.createDefaultModelConfig('mini', defaults),
-        nano_model: requestBody.model.nano_model || this.createDefaultModelConfig('nano', defaults)
-      };
-    }
+    // If model is provided, support both nested and flat/legacy formats
+    if (requestBody.model) {
+      const m = requestBody.model;
+      // Already nested
+      if (m.main_model || m.mini_model || m.nano_model) {
+        return {
+          main_model: m.main_model || this.createDefaultModelConfig('main', defaults),
+          mini_model: m.mini_model || this.createDefaultModelConfig('mini', defaults),
+          nano_model: m.nano_model || this.createDefaultModelConfig('nano', defaults),
+        };
+      }
+      // Flat string -> main only
+      if (typeof m === 'string') {
+        return {
+          main_model: { provider: defaults.provider || 'openai', model: m, api_key: this.resolveApiKey(defaults.provider || 'openai') },
+          mini_model: this.createDefaultModelConfig('mini', defaults),
+          nano_model: this.createDefaultModelConfig('nano', defaults),
+        };
+      }
+      // Flat object { provider?, model?, api_key?, endpoint? } -> main only
+      if (typeof m === 'object') {
+        return {
+          main_model: {
+            provider: m.provider || (defaults.provider || 'openai'),
+            model: m.model || (defaults.main_model || 'gpt-4'),
+            api_key: m.api_key ?? this.resolveApiKey(m.provider || defaults.provider || 'openai'),
+            endpoint: m.endpoint, // optional for LiteLLM
+          },
+          mini_model: this.createDefaultModelConfig('mini', defaults),
+          nano_model: this.createDefaultModelConfig('nano', defaults),
+        };
+      }
+    }

Add this helper to the class to support provider‑specific keys:

resolveApiKey(provider) {
  switch ((provider || '').toLowerCase()) {
    case 'openai': return process.env.OPENAI_API_KEY;
    case 'groq': return process.env.GROQ_API_KEY;
    case 'openrouter': return process.env.OPENROUTER_API_KEY;
    case 'litellm': return process.env.LITELLM_API_KEY;
    default: return undefined;
  }
}
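
If that helper is adopted, it is also straightforward to unit test with Node's built-in test runner. A sketch (assuming ESM and Node 18+; the standalone copy of the mapping with an injected `env` parameter is purely for illustration):

```js
// resolve-api-key.test.js: sketch; the helper below mirrors the one proposed above,
// with env injected so the test does not mutate process.env.
import { test } from 'node:test';
import assert from 'node:assert/strict';

function resolveApiKey(provider, env = process.env) {
  switch ((provider || '').toLowerCase()) {
    case 'openai': return env.OPENAI_API_KEY;
    case 'groq': return env.GROQ_API_KEY;
    case 'openrouter': return env.OPENROUTER_API_KEY;
    case 'litellm': return env.LITELLM_API_KEY;
    default: return undefined;
  }
}

test('resolveApiKey maps each provider to its own env key', () => {
  const env = { OPENAI_API_KEY: 'sk-openai', GROQ_API_KEY: 'gsk-groq' };
  assert.equal(resolveApiKey('openai', env), 'sk-openai');
  assert.equal(resolveApiKey('Groq', env), 'gsk-groq');
  assert.equal(resolveApiKey('unknown', env), undefined);
});
```
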
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c585f1d and ff99a7d.

📒 Files selected for processing (4)
  • eval-server/nodejs/CLAUDE.md (4 hunks)
  • eval-server/nodejs/README.md (1 hunks)
  • eval-server/nodejs/src/api-server.js (6 hunks)
  • eval-server/nodejs/src/client-manager.js (1 hunks)
🧰 Additional context used
🧠 Learnings (7)
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/rpc-client.js : RPC client must call Evaluate(request: string) -> string on connected agents

Applied to files:

  • eval-server/nodejs/src/api-server.js
  • eval-server/nodejs/CLAUDE.md
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/evaluator.js : LLM evaluator must integrate with the OpenAI API to judge agent responses

Applied to files:

  • eval-server/nodejs/CLAUDE.md
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/rpc-client.js : RPC client must implement JSON-RPC 2.0 protocol

Applied to files:

  • eval-server/nodejs/CLAUDE.md
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/server.js : WebSocket server must accept agent connections, manage agent lifecycle (connect/ready/disconnect), orchestrate evaluation sessions, and handle bidirectional RPC communication

Applied to files:

  • eval-server/nodejs/CLAUDE.md
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/rpc-client.js : RPC client must correlate requests and responses using unique IDs

Applied to files:

  • eval-server/nodejs/CLAUDE.md
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/rpc-client.js : RPC client must handle timeouts and error conditions

Applied to files:

  • eval-server/nodejs/CLAUDE.md
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/config.js : Require OPENAI_API_KEY and PORT (default 8080) from environment configuration

Applied to files:

  • eval-server/nodejs/CLAUDE.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: claude-review
🔇 Additional comments (3)
eval-server/nodejs/CLAUDE.md (2)

110-158: Protocol shape drift: nested vs flat model config across docs.

MODEL-CONFIGS.md shows flat override; this file shows nested per‑tier. Pick one canonical schema and reference the other as legacy/compat with clear conversion rules.


46-51: configure_llm RPC handler present — front-end protocol wired; no server->agent calls found

EvalServer implements handleConfigureLLM (eval-server/nodejs/src/lib/EvalServer.js) and the front-end protocol declares configure_llm (front_end/panels/ai_chat/evaluation/remote/EvaluationProtocol.ts); RpcClient.callMethod is available, but I did not find any server-side outgoing call to 'configure_llm' (only 'evaluate'). If the server is meant to push LLM configs to agents, add a connection.rpcClient.callMethod(..., 'configure_llm', ...) or update the docs to reflect client-initiated configuration.

eval-server/nodejs/src/api-server.js (1)

270-273: Mask API keys before logging nestedModelConfig.

You’re logging plaintext api_key values.

Apply this diff:

-      logger.info('Processing responses request:', {
-        input: requestBody.input,
-        modelConfig: nestedModelConfig
-      });
+      const redactTier = (t) => t ? { ...t, api_key: t.api_key ? `${String(t.api_key).slice(0,4)}…REDACTED` : undefined } : t;
+      logger.info('Processing responses request:', {
+        input: requestBody.input,
+        modelConfig: {
+          main_model: redactTier(nestedModelConfig.main_model),
+          mini_model: redactTier(nestedModelConfig.mini_model),
+          nano_model: redactTier(nestedModelConfig.nano_model),
+        }
+      });

Comment on lines +160 to +181
### Dynamic LLM Configuration

The server supports runtime LLM configuration via the `configure_llm` JSON-RPC method:

```json
{
  "jsonrpc": "2.0",
  "method": "configure_llm",
  "params": {
    "provider": "openai|groq|openrouter|litellm",
    "apiKey": "your-api-key",
    "endpoint": "endpoint-url-for-litellm",
    "models": {
      "main": "main-model-name",
      "mini": "mini-model-name",
      "nano": "nano-model-name"
    },
    "partial": false
  },
  "id": "config-request-id"
}
```

⚠️ Potential issue

🧩 Analysis chain

Clarify partial update semantics for configure_llm.

Document which fields are merge‑patched when partial: true and server‑side validation/error behavior for missing credentials per provider.


🏁 Script executed:

#!/bin/bash
rg -n -C2 --type=js --type=ts 'handleConfigureLLM|partial\s*[:=]\s*true|validateConfiguration' eval-server/nodejs

Length of output: 811


🏁 Script executed:

#!/bin/bash
sed -n '320,460p' eval-server/nodejs/src/lib/EvalServer.js

Length of output: 3888


Fix partial-update semantics and document configure_llm behavior.

  • In handleConfigureLLM (eval-server/nodejs/src/lib/EvalServer.js — starts at line 384) the code validates provider and models.main before merging; this rejects legitimate partial updates that omit those fields. Change to merge params into the existing config when partial: true, then validate the resulting config (ensure merged.provider ∈ supportedProviders and merged.models.main exists).
  • If a partial update changes provider, validate that required credentials for the new provider are present after merge (use CONFIG.providers[provider] as fallback) and return a clear error listing any missing credentials.
  • Update eval-server/nodejs/CLAUDE.md (configure_llm section) to explicitly state which fields are merge-patched (provider, apiKey, endpoint, models.*) and the server-side validation/error rules for missing credentials per provider.
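
A minimal sketch of that merge-then-validate flow (illustrative only: the standalone function shape, parameter names, and error messages are assumptions, with the contents of CONFIG.providers passed in as plain arguments rather than read from the real config):

```js
// Sketch: merge a configure_llm request into the existing per-connection config,
// then validate the merged result rather than the raw params.
function mergeLlmConfig(current = {}, params, supportedProviders, providerDefaults = {}) {
  const { provider, apiKey, endpoint, models, partial } = params;

  // A partial update only overwrites the fields that were actually sent.
  const merged = partial
    ? {
        ...current,
        ...(provider !== undefined && { provider }),
        ...(apiKey !== undefined && { apiKey }),
        ...(endpoint !== undefined && { endpoint }),
        models: { ...(current.models || {}), ...(models || {}) },
      }
    : { provider, apiKey, endpoint, models: models || {} };

  // Validate the merged config, not the incoming params.
  if (!supportedProviders.includes(merged.provider)) {
    throw new Error(`Unsupported provider: ${merged.provider}`);
  }
  if (!merged.models.main) {
    throw new Error('models.main is required');
  }
  if (!merged.apiKey && !providerDefaults[merged.provider]?.apiKey) {
    throw new Error(`Missing credentials for provider: ${merged.provider}`);
  }
  return merged;
}
```
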
🤖 Prompt for AI Agents
In eval-server/nodejs/src/lib/EvalServer.js around line 384 and
eval-server/nodejs/CLAUDE.md lines 160–181, change handleConfigureLLM so that
when params.partial === true it first merges the incoming params into the
current saved LLM config (provider, apiKey, endpoint, models.*) and then
validates the merged config (ensure merged.provider is in supportedProviders and
merged.models.main exists); if the merged provider differs from the previous
provider, validate that all required credentials for the new provider are
present using CONFIG.providers[provider] as fallbacks and return a clear error
listing any missing credential keys, and ensure non-partial updates still
require provider and models.main; also update CLAUDE.md configure_llm section to
document that provider, apiKey, endpoint, and models.* are merge-patched on
partial updates and to list the server-side validation and per-provider
missing-credential error rules.

Comment on lines +338 to +350
  createDefaultModelConfig(tier, defaults) {
    const defaultModels = {
      main: defaults.main_model || 'gpt-4',
      mini: defaults.mini_model || 'gpt-4-mini',
      nano: defaults.nano_model || 'gpt-3.5-turbo'
    };

    return {
      provider: defaults.provider || 'openai',
      model: defaultModels[tier],
      api_key: process.env.OPENAI_API_KEY
    };
  }

⚠️ Potential issue

Provider‑specific API key resolution (current code hardcodes OPENAI key).

Using OPENAI_API_KEY for non‑OpenAI providers will fail authentication.

Apply this diff:

-    return {
-      provider: defaults.provider || 'openai',
-      model: defaultModels[tier],
-      api_key: process.env.OPENAI_API_KEY
-    };
+    const provider = (defaults.provider || 'openai').toLowerCase();
+    const apiKey = this.resolveApiKey(provider);
+    const tierConfig = {
+      provider,
+      model: defaultModels[tier],
+      api_key: apiKey,
+    };
+    // Include endpoint for LiteLLM if available
+    if (provider === 'litellm' && process.env.LITELLM_ENDPOINT) {
+      tierConfig.endpoint = process.env.LITELLM_ENDPOINT;
+    }
+    return tierConfig;

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In eval-server/nodejs/src/api-server.js around lines 338 to 350, the function
always uses process.env.OPENAI_API_KEY which breaks authentication for
non‑OpenAI providers; update the function to prefer an explicit defaults.api_key
if provided, otherwise resolve the key from an environment variable mapped to
the chosen provider (e.g. for provider 'openai' use process.env.OPENAI_API_KEY,
for others use process.env[provider.toUpperCase() + '_API_KEY'] or a small
explicit provider→env map), and return that resolved api_key instead of the
hardcoded OPENAI variable so provider-specific keys are used correctly.

Comment on lines 61 to 64
      // API model override takes precedence over everything
-     return {
-       ...(this.configDefaults?.model || {}), // Use config as base
-       ...apiModelOverride // API overrides everything
-     };
+     // Ensure nested format is used
+     return apiModelOverride;
    }

⚠️ Potential issue

Normalize to the nested model schema before returning API override.

Downstream now expects nested main_model/mini_model/nano_model objects. Returning the override verbatim risks schema mismatches when callers pass flat/legacy shapes.

Apply this diff:

-      // Ensure nested format is used
-      return apiModelOverride;
+      // Normalize to nested format
+      return this.ensureNestedModelConfig(apiModelOverride);

Add this helper to the class (outside this method):

// Normalizes legacy shapes into the canonical nested format.
ensureNestedModelConfig(model) {
  if (!model) return {};
  // Already nested
  if (model.main_model || model.mini_model || model.nano_model) {
    return {
      main_model: model.main_model || {},
      mini_model: model.mini_model || {},
      nano_model: model.nano_model || {},
    };
  }
  // Flat string -> main only
  if (typeof model === 'string') {
    return { main_model: { provider: 'openai', model, api_key: process.env.OPENAI_API_KEY }, mini_model: {}, nano_model: {} };
  }
  // Flat object { provider?, model?, api_key? } -> main only
  if (typeof model === 'object' && (model.model || model.provider || model.api_key)) {
    return {
      main_model: {
        provider: model.provider || 'openai',
        model: model.model,
        api_key: model.api_key,
        endpoint: model.endpoint, // if present
      },
      mini_model: {},
      nano_model: {},
    };
  }
  return {};
}
🤖 Prompt for AI Agents
In eval-server/nodejs/src/client-manager.js around lines 61 to 64, the method
currently returns apiModelOverride verbatim which can be a flat/legacy shape;
normalize overrides to the nested main_model/mini_model/nano_model schema before
returning. Add the provided ensureNestedModelConfig helper to the class (outside
the method) and call it on apiModelOverride so that strings, flat objects, and
already-nested objects are transformed into the canonical nested structure,
preserving fields like provider, model, api_key, and endpoint as described.

Comment on lines +69 to 72
      // Test YAML model takes precedence
      // Ensure nested format is returned
      return testModel;
    }

🛠️ Refactor suggestion

Normalize test YAML model to nested schema.

Same issue as API override: return a normalized nested shape to avoid downstream breakage.

Apply this diff:

-      // Ensure nested format is returned
-      return testModel;
+      // Normalize to nested format
+      return this.ensureNestedModelConfig(testModel);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-      // Test YAML model takes precedence
-      // Ensure nested format is returned
-      return testModel;
-    }
+      // Test YAML model takes precedence
+      // Normalize to nested format
+      return this.ensureNestedModelConfig(testModel);
+    }
🤖 Prompt for AI Agents
In eval-server/nodejs/src/client-manager.js around lines 69 to 72, the function
currently returns testModel directly which may be in a non-nested shape; update
the code to convert/normalize testModel into the expected nested schema before
returning to avoid downstream breakage. Call the existing normalization utility
(or add a small helper) that transforms the flat/override shape into the nested
structure the API expects, validate required nested keys exist, and return the
normalized object instead of the raw testModel. Ensure you import or define the
normalizer in this file and keep immutability by returning a new object.

…ial=true, making partial updates unusable. Validate only fields provided for partial updates; require full fields only for full updates.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
eval-server/nodejs/src/api-server.js (1)

258-266: Broaden input validation to accept Responses API shapes (string | message array | content-part arrays)

The Responses API accepts a top-level string OR an array of message objects; each message.content may be a string or an array of typed content-part objects. The current string-only check in eval-server/nodejs/src/api-server.js (lines 258–266) will reject valid inputs — update validation to accept typeof input === 'string' OR Array.isArray(input) OR (typeof input === 'object' && input !== null), and ensure message.content is either a string or an array of content parts.
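
A small validation sketch along those lines (the function name and placement are assumptions, not the existing code):

```js
// Sketch: accept the Responses API input shapes described above.
function isValidResponsesInput(input) {
  if (typeof input === 'string') return true;

  const isMessage = m =>
    m !== null && typeof m === 'object' &&
    (typeof m.content === 'string' || Array.isArray(m.content));

  if (Array.isArray(input)) return input.every(isMessage);
  return isMessage(input);
}
```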

eval-server/nodejs/src/lib/EvalServer.js (1)

262-279: Fix JSON‑RPC detection: don’t use truthiness for result/error/id

Responses with result=0/'' or id=0 will be misclassified. Use property existence checks and validate method type.

Apply this diff:

-      // Handle RPC responses
-      if (data.jsonrpc === '2.0' && (data.result || data.error) && data.id) {
+      // Handle RPC responses
+      const has = (k) => Object.prototype.hasOwnProperty.call(data, k);
+      if (data.jsonrpc === '2.0' && (has('result') || has('error')) && has('id')) {
         if (connection.rpcClient.handleResponse(message)) {
           return;
         }
         logger.debug('RPC response could not be handled', {
           connectionId: connection.id,
           clientId: connection.clientId,
           id: data.id
         });
         return;
       }
 
-      // Handle RPC requests from client to server
-      if (data.jsonrpc === '2.0' && data.method && data.id) {
+      // Handle RPC requests from client to server (requests have an id; notifications do not)
+      if (data.jsonrpc === '2.0' && typeof data.method === 'string' && has('id')) {
         await this.handleRpcRequest(connection, data);
         return;
       }
🧹 Nitpick comments (4)
eval-server/nodejs/src/api-server.js (1)

300-302: Avoid logging full evaluation results; log shape/keys instead

Reduce risk of logging large/PII payloads.

Apply this diff:

-      logger.debug('executeEvaluation result:', result);
+      logger.debug('executeEvaluation result keys:', typeof result === 'object' && result ? Object.keys(result) : typeof result);
eval-server/nodejs/src/lib/EvalServer.js (3)

369-387: Return appropriate JSON‑RPC error codes (not always -32603)

Map validation errors to -32602 and auth errors to -32001; reserve -32603 for true internal errors.

Apply this diff:

     } catch (error) {
       logger.error('RPC request failed', {
         connectionId: connection.id,
         clientId: connection.clientId,
         method: request.method,
         requestId: request.id,
         error: error.message
       });
 
       // Send error response
       this.sendMessage(connection.ws, {
         jsonrpc: '2.0',
-        error: {
-          code: -32603, // Internal error
-          message: error.message
-        },
+        error: {
+          // Prefer provided error.code if present; otherwise classify
+          code:
+            (Number.isInteger(error?.code) && error.code) ||
+            (/invalid param|unsupported|required|missing/i.test(error.message) ? -32602 : // Invalid params
+             /unauthorized|forbidden|auth/i.test(error.message) ? -32001 :                // Unauthorized
+             -32603),                                                                     // Internal error
+          message: error.message
+        },
         id: request.id
       });
     }

401-402: Avoid hard‑coding providers; derive from config

Use keys of CONFIG.providers to prevent drift.

Apply this diff:

-    const supportedProviders = ['openai', 'litellm', 'groq', 'openrouter'];
+    const supportedProviders = Object.keys(CONFIG.providers || {});

719-747: Add fallback to CONFIG.defaults when no per‑connection config is present

Comment says “otherwise defaults” but code doesn’t apply them.

Apply this diff:

-      let modelConfig = evaluation.model || {};
+      let modelConfig = evaluation.model || {};
 
-      if (connection.llmConfig) {
+      if (connection.llmConfig) {
         // New nested format: separate config objects for each model tier
         modelConfig = {
           main_model: {
             provider: connection.llmConfig.provider,
             model: connection.llmConfig.models.main,
             api_key: connection.llmConfig.apiKey,
             endpoint: connection.llmConfig.endpoint
           },
           mini_model: {
             provider: connection.llmConfig.provider,
             model: connection.llmConfig.models.mini,
             api_key: connection.llmConfig.apiKey,
             endpoint: connection.llmConfig.endpoint
           },
           nano_model: {
             provider: connection.llmConfig.provider,
             model: connection.llmConfig.models.nano,
             api_key: connection.llmConfig.apiKey,
             endpoint: connection.llmConfig.endpoint
           },
           // Include any evaluation-specific overrides
           ...modelConfig
         };
+      } else if (!modelConfig || Object.keys(modelConfig).length === 0) {
+        // Fallback to server defaults
+        const d = CONFIG.defaults || {};
+        const provider = d.provider;
+        modelConfig = {
+          main_model: { provider, model: d.mainModel },
+          mini_model: { provider, model: d.miniModel },
+          nano_model: { provider, model: d.nanoModel }
+        };
       }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ff99a7d and e8b37a3.

📒 Files selected for processing (2)
  • eval-server/nodejs/src/api-server.js (6 hunks)
  • eval-server/nodejs/src/lib/EvalServer.js (5 hunks)
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/rpc-client.js : RPC client must implement JSON-RPC 2.0 protocol

Applied to files:

  • eval-server/nodejs/src/lib/EvalServer.js
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/rpc-client.js : RPC client must call Evaluate(request: string) -> string on connected agents

Applied to files:

  • eval-server/nodejs/src/lib/EvalServer.js
  • eval-server/nodejs/src/api-server.js
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/server.js : WebSocket server must accept agent connections, manage agent lifecycle (connect/ready/disconnect), orchestrate evaluation sessions, and handle bidirectional RPC communication

Applied to files:

  • eval-server/nodejs/src/lib/EvalServer.js
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/rpc-client.js : RPC client must handle timeouts and error conditions

Applied to files:

  • eval-server/nodejs/src/lib/EvalServer.js
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/config.js : Require OPENAI_API_KEY and PORT (default 8080) from environment configuration

Applied to files:

  • eval-server/nodejs/src/api-server.js
🧬 Code graph analysis (1)
eval-server/nodejs/src/lib/EvalServer.js (1)
eval-server/nodejs/src/config.js (2)
  • CONFIG (5-63)
  • CONFIG (5-63)
🔇 Additional comments (7)
eval-server/nodejs/src/api-server.js (2)

270-281: Good: API keys are now redacted before logging

Sanitized per-tier configs are logged; avoids secret leakage. Matches prior review guidance.


346-358: Use provider‑specific API keys and consistent default model ids

Hardcoding OPENAI_API_KEY breaks non‑OpenAI providers. Also align fallbacks with the 4.1 defaults used elsewhere.

Apply this diff:

   createDefaultModelConfig(tier, defaults) {
-    const defaultModels = {
-      main: defaults.main_model || 'gpt-4',
-      mini: defaults.mini_model || 'gpt-4-mini',
-      nano: defaults.nano_model || 'gpt-3.5-turbo'
-    };
-
-    return {
-      provider: defaults.provider || 'openai',
-      model: defaultModels[tier],
-      api_key: process.env.OPENAI_API_KEY
-    };
+    const defaultModels = {
+      main: defaults.main_model || 'gpt-4.1',
+      mini: defaults.mini_model || 'gpt-4.1-mini',
+      nano: defaults.nano_model || 'gpt-4.1-nano',
+    };
+    const provider = (defaults.provider || 'openai').toLowerCase();
+    const envMap = {
+      openai: 'OPENAI_API_KEY',
+      anthropic: 'ANTHROPIC_API_KEY',
+      google: 'GOOGLE_API_KEY',
+      gemini: 'GEMINI_API_KEY',
+      litellm: 'LITELLM_API_KEY',
+    };
+    const envVar = envMap[provider] || `${provider.toUpperCase()}_API_KEY`;
+    const apiKey = defaults.api_key || process.env[envVar] || process.env.OPENAI_API_KEY;
+    const tierConfig = {
+      provider,
+      model: defaultModels[tier],
+      api_key: apiKey,
+    };
+    if (provider === 'litellm' && process.env.LITELLM_ENDPOINT) {
+      tierConfig.endpoint = process.env.LITELLM_ENDPOINT;
+    }
+    return tierConfig;
   }
eval-server/nodejs/src/lib/EvalServer.js (5)

199-205: Accessor looks good

Public getClientManager() is a useful addition for external orchestration.


345-360: Unknown method now uses JSON‑RPC -32601 (Method not found)

Spec‑compliant handling and early return look good.


725-742: Confirm sending API keys to clients is acceptable

api_key is included in the RPC payload to the client. Ensure clients are fully trusted and this aligns with your threat model.


758-774: Timeout default mismatch (params 30s vs RPC wait 45s); unify to CONFIG.rpc.timeout

Use a single source of truth to avoid premature timeouts.

Apply this diff:

-          model: modelConfig,
-          timeout: evaluation.timeout || 30000,
+          model: modelConfig,
+          timeout: evaluation.timeout ?? CONFIG.rpc.timeout,
-        evaluation.timeout || 45000
+        evaluation.timeout ?? CONFIG.rpc.timeout

398-418: Partial updates no longer require provider/models — good fix

Validation now respects partial=true semantics.

Comment on lines +320 to 337
  processNestedModelConfig(requestBody) {
    const defaults = this.configDefaults?.model || {};

    // If nested format is provided, use it directly with fallbacks
    if (requestBody.model) {
      return {
        main_model: requestBody.model.main_model || this.createDefaultModelConfig('main', defaults),
        mini_model: requestBody.model.mini_model || this.createDefaultModelConfig('mini', defaults),
        nano_model: requestBody.model.nano_model || this.createDefaultModelConfig('nano', defaults)
      };
    }

    // No model config provided, use defaults
    return {
-     main_model: requestBody.main_model || defaults.main_model || 'gpt-4.1',
-     mini_model: requestBody.mini_model || defaults.mini_model || 'gpt-4.1-mini',
-     nano_model: requestBody.nano_model || defaults.nano_model || 'gpt-4.1-nano',
-     provider: requestBody.provider || defaults.provider || 'openai'
+     main_model: this.createDefaultModelConfig('main', defaults),
+     mini_model: this.createDefaultModelConfig('mini', defaults),
+     nano_model: this.createDefaultModelConfig('nano', defaults)
    };

⚠️ Potential issue

Normalize and validate nested model tiers; handle string tier specs

If a tier is provided as a string (model id), the current code passes the string through, violating expected object shape and risking downstream breakage. Normalize strings to objects and merge partial objects with defaults.

Apply this diff:

-  processNestedModelConfig(requestBody) {
-    const defaults = this.configDefaults?.model || {};
-
-    // If nested format is provided, use it directly with fallbacks
-    if (requestBody.model) {
-      return {
-        main_model: requestBody.model.main_model || this.createDefaultModelConfig('main', defaults),
-        mini_model: requestBody.model.mini_model || this.createDefaultModelConfig('mini', defaults),
-        nano_model: requestBody.model.nano_model || this.createDefaultModelConfig('nano', defaults)
-      };
-    }
-
-    // No model config provided, use defaults
-    return {
-      main_model: this.createDefaultModelConfig('main', defaults),
-      mini_model: this.createDefaultModelConfig('mini', defaults),
-      nano_model: this.createDefaultModelConfig('nano', defaults)
-    };
-  }
+  processNestedModelConfig(requestBody) {
+    const defaults = this.configDefaults?.model || {};
+
+    const normalizeTier = (tierValue, tierName) => {
+      if (!tierValue) return this.createDefaultModelConfig(tierName, defaults);
+      // String => treat as model id; merge with defaults
+      if (typeof tierValue === 'string') {
+        const base = this.createDefaultModelConfig(tierName, defaults);
+        return { ...base, model: tierValue, provider: (base.provider || 'openai').toLowerCase() };
+      }
+      // Object => merge with defaults; normalize provider
+      if (typeof tierValue === 'object') {
+        const base = this.createDefaultModelConfig(tierName, defaults);
+        const provider = (tierValue.provider || base.provider || defaults.provider || 'openai').toLowerCase();
+        return { ...base, ...tierValue, provider };
+      }
+      return this.createDefaultModelConfig(tierName, defaults);
+    };
+
+    if (requestBody.model) {
+      return {
+        main_model: normalizeTier(requestBody.model.main_model, 'main'),
+        mini_model: normalizeTier(requestBody.model.mini_model, 'mini'),
+        nano_model: normalizeTier(requestBody.model.nano_model, 'nano'),
+      };
+    }
+
+    return {
+      main_model: this.createDefaultModelConfig('main', defaults),
+      mini_model: this.createDefaultModelConfig('mini', defaults),
+      nano_model: this.createDefaultModelConfig('nano', defaults),
+    };
+  }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-  processNestedModelConfig(requestBody) {
-    const defaults = this.configDefaults?.model || {};
-    // If nested format is provided, use it directly with fallbacks
-    if (requestBody.model) {
-      return {
-        main_model: requestBody.model.main_model || this.createDefaultModelConfig('main', defaults),
-        mini_model: requestBody.model.mini_model || this.createDefaultModelConfig('mini', defaults),
-        nano_model: requestBody.model.nano_model || this.createDefaultModelConfig('nano', defaults)
-      };
-    }
-    // No model config provided, use defaults
-    return {
-      main_model: requestBody.main_model || defaults.main_model || 'gpt-4.1',
-      mini_model: requestBody.mini_model || defaults.mini_model || 'gpt-4.1-mini',
-      nano_model: requestBody.nano_model || defaults.nano_model || 'gpt-4.1-nano',
-      provider: requestBody.provider || defaults.provider || 'openai'
-      main_model: this.createDefaultModelConfig('main', defaults),
-      mini_model: this.createDefaultModelConfig('mini', defaults),
-      nano_model: this.createDefaultModelConfig('nano', defaults)
-    };
+  processNestedModelConfig(requestBody) {
+    const defaults = this.configDefaults?.model || {};
+
+    const normalizeTier = (tierValue, tierName) => {
+      if (!tierValue) return this.createDefaultModelConfig(tierName, defaults);
+      // String => treat as model id; merge with defaults
+      if (typeof tierValue === 'string') {
+        const base = this.createDefaultModelConfig(tierName, defaults);
+        return { ...base, model: tierValue, provider: (base.provider || 'openai').toLowerCase() };
+      }
+      // Object => merge with defaults; normalize provider
+      if (typeof tierValue === 'object') {
+        const base = this.createDefaultModelConfig(tierName, defaults);
+        const provider = (tierValue.provider || base.provider || defaults.provider || 'openai').toLowerCase();
+        return { ...base, ...tierValue, provider };
+      }
+      return this.createDefaultModelConfig(tierName, defaults);
+    };
+
+    if (requestBody.model) {
+      return {
+        main_model: normalizeTier(requestBody.model.main_model, 'main'),
+        mini_model: normalizeTier(requestBody.model.mini_model, 'mini'),
+        nano_model: normalizeTier(requestBody.model.nano_model, 'nano'),
+      };
+    }
+
+    return {
+      main_model: this.createDefaultModelConfig('main', defaults),
+      mini_model: this.createDefaultModelConfig('mini', defaults),
+      nano_model: this.createDefaultModelConfig('nano', defaults),
+    };
+  }

Comment on lines +379 to 401
  createDynamicEvaluationNested(input, nestedModelConfig) {
    const evaluationId = `api-eval-${Date.now()}-${Math.random().toString(36).substring(2, 9)}`;

    return {
      id: evaluationId,
-     name: 'OpenAI API Request',
-     description: 'Dynamic evaluation created from OpenAI Responses API request',
+     name: 'API Request',
+     description: 'Dynamic evaluation created from API request',
      enabled: true,
      tool: 'chat',
      timeout: 1500000, // 25 minutes
      input: {
-       message: input,
-       reasoning: 'API request processing'
+       message: input
      },
-     model: modelConfig,
+     model: nestedModelConfig,
      validation: {
        type: 'none' // No validation needed for API responses
      },
      metadata: {
-       tags: ['api', 'dynamic', 'openai-responses'],
+       tags: ['api', 'dynamic'],
        priority: 'high',
-       source: 'openai-api'
+       source: 'api'
      }
    };

⚠️ Potential issue

🧩 Analysis chain

Avoid persisting secrets in evaluation.model

Ensure api_key values in nestedModelConfig aren’t persisted to disk or long‑term stores. If they must be included for dispatch, strip them before storage or mask in any persisted representation.


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "Searching for api_key usage and potential persistence..."
rg -n -C2 '\bapi_key\b'
echo "Searching for file writes near evaluation data..."
rg -n -C3 -P '(writeFile|writeFileSync|appendFile|appendFileSync|fs\.write)' | sed -n '1,200p'
echo "Searching for evaluation serialization..."
rg -n -C3 -P 'JSON\.stringify\(.{0,120}(evaluation|evaluations)' 

Length of output: 33386


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "=== Showing eval-server/api-server.js (head -> 520) ==="
sed -n '1,520p' eval-server/nodejs/src/api-server.js || true

echo
echo "=== Searching for createDynamicEvaluationNested references ==="
rg -n "createDynamicEvaluationNested" || true

echo
echo "=== Search for evaluation endpoints and handlers in eval-server ==="
rg -n --hidden -S --no-ignore --pc '\/evaluate|\/evaluations|\bevaluate\b|\bevaluations\b' eval-server/nodejs/src || true

echo
echo "=== Searching for persistence (file/db/JSON.stringify) in eval-server/src ==="
rg -n --hidden -S --no-ignore 'writeFile|writeFileSync|appendFile|appendFileSync|fs\.write|JSON\.stringify|db|database|lowdb|sqlite|level|nedb|redis' eval-server/nodejs/src || true

Length of output: 16667


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "=== client-manager.js (around write) ==="
sed -n '1,420p' eval-server/nodejs/src/client-manager.js || true

echo
echo "=== EvalServer.js (around executeEvaluation / send to client) ==="
sed -n '1,980p' eval-server/nodejs/src/lib/EvalServer.js || true

echo
echo "=== Search for where evaluations are persisted/saved in client-manager or eval-server ==="
rg -n "save|persist|write.*evaluation|evaluations|yaml" eval-server/nodejs/src || true

Length of output: 56744


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "=== logger.js ==="
sed -n '1,240p' eval-server/nodejs/src/logger.js || true

echo
echo "=== rpc-client.js ==="
sed -n '1,260p' eval-server/nodejs/src/rpc-client.js || true

echo
echo "=== config.js ==="
sed -n '1,240p' eval-server/nodejs/src/config.js || true

echo
echo "=== Search logger for evaluations.jsonl writes or logEvaluation implementation ==="
rg -n "evaluations.jsonl|logEvaluation|evaluation" eval-server/nodejs/src/logger.js || true

Length of output: 9025


Do not persist API keys in evaluation.model — strip or redact before any logging/storage

createDynamicEvaluationNested can include nestedModelConfig.api_key and the server currently logs/saves RPC params and evaluation entries unredacted.

  • eval-server/nodejs/src/api-server.js — strip model.*.api_key (or set to undefined) before returning/dispatching the dynamic evaluation (createDynamicEvaluationNested / processNestedModelConfig).
  • eval-server/nodejs/src/lib/EvalServer.js — when assembling modelConfig, avoid merging raw api keys into objects that are logged or retained; keep keys only in-memory for dispatch and never attach them to evaluation objects.
  • eval-server/nodejs/src/rpc-client.js — sanitize/mask sensitive fields in params before calling logRpcCall (and any other log calls).
  • eval-server/nodejs/src/logger.js — redact sensitive keys in logEvaluation / evaluationLogger.info so evaluations.jsonl and combined logs never contain full api keys.
  • Add a single sanitize utility (keys: api_key, apiKey, token, secret, authorization, password) and use consistent masking (e.g. first 4 chars + "...") across the above locations.
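
One possible shape for that shared utility, following the key list and masking format in the bullet above (a sketch, not the project's actual helper):

```js
// Sketch: one sanitize helper reused by api-server, EvalServer, rpc-client and logger.
const SENSITIVE_KEYS = new Set(['api_key', 'apikey', 'token', 'secret', 'authorization', 'password']);

function sanitize(value) {
  if (Array.isArray(value)) return value.map(sanitize);
  if (value && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([k, v]) =>
        SENSITIVE_KEYS.has(k.toLowerCase())
          ? [k, typeof v === 'string' && v ? `${v.slice(0, 4)}...` : v]
          : [k, sanitize(v)]
      )
    );
  }
  return value;
}

// Example: sanitize({ model: { api_key: 'sk-abc123456' } })
// -> { model: { api_key: 'sk-a...' } }
```
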
🤖 Prompt for AI Agents
In eval-server/nodejs/src/api-server.js around lines 379 to 401,
createDynamicEvaluationNested currently can return nestedModelConfig that
includes sensitive keys (e.g. api_key) which may be logged/stored; strip or
redact any sensitive fields (api_key, apiKey, token, secret, authorization,
password) from nestedModelConfig before returning or attaching it to the
evaluation object (set to undefined or replace with masked value like first 4
chars + "..."). Also add a shared sanitize utility used across
eval-server/nodejs/src/lib/EvalServer.js, eval-server/nodejs/src/rpc-client.js
and eval-server/nodejs/src/logger.js to mask those keys consistently (keep raw
keys in-memory only for dispatching, never merge them into persisted evaluation
objects or log entries), ensure rpc-client sanitizes params before logRpcCall,
and ensure logger.redacts sensitive fields before writing evaluations.jsonl or
other logs.

Comment on lines +426 to +437
    if (partial && connection.llmConfig) {
      // Partial update - merge with existing config
      connection.llmConfig = {
        ...connection.llmConfig,
        provider: provider || connection.llmConfig.provider,
        apiKey: apiKey || connection.llmConfig.apiKey,
        endpoint: endpoint || connection.llmConfig.endpoint,
        models: {
          ...connection.llmConfig.models,
          ...models
        }
      };

⚠️ Potential issue

Partial update: spreading possibly undefined models will throw; also allow explicit clearing of secrets

Guard the spread and support null to clear apiKey/endpoint. Current || prevents clearing.

Apply this diff:

-    if (partial && connection.llmConfig) {
+    if (partial && connection.llmConfig) {
       // Partial update - merge with existing config
       connection.llmConfig = {
         ...connection.llmConfig,
-        provider: provider || connection.llmConfig.provider,
-        apiKey: apiKey || connection.llmConfig.apiKey,
-        endpoint: endpoint || connection.llmConfig.endpoint,
+        provider: provider ?? connection.llmConfig.provider,
+        // Allow explicit clearing with null; keep existing when undefined
+        apiKey: apiKey === null ? undefined : (apiKey ?? connection.llmConfig.apiKey),
+        endpoint: endpoint === null ? undefined : (endpoint ?? connection.llmConfig.endpoint),
         models: {
-          ...connection.llmConfig.models,
-          ...models
+          ...(connection.llmConfig.models || {}),
+          ...(models || {})
         }
       };
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    if (partial && connection.llmConfig) {
-      // Partial update - merge with existing config
-      connection.llmConfig = {
-        ...connection.llmConfig,
-        provider: provider || connection.llmConfig.provider,
-        apiKey: apiKey || connection.llmConfig.apiKey,
-        endpoint: endpoint || connection.llmConfig.endpoint,
-        models: {
-          ...connection.llmConfig.models,
-          ...models
-        }
-      };
+    if (partial && connection.llmConfig) {
+      // Partial update - merge with existing config
+      connection.llmConfig = {
+        ...connection.llmConfig,
+        provider: provider ?? connection.llmConfig.provider,
+        // Allow explicit clearing with null; keep existing when undefined
+        apiKey: apiKey === null ? undefined : (apiKey ?? connection.llmConfig.apiKey),
+        endpoint: endpoint === null ? undefined : (endpoint ?? connection.llmConfig.endpoint),
+        models: {
+          ...(connection.llmConfig.models || {}),
+          ...(models || {})
+        }
+      };
🤖 Prompt for AI Agents
In eval-server/nodejs/src/lib/EvalServer.js around lines 426 to 437, the
partial-update code unconditionally spreads models (which may be undefined) and
uses falsy checks for apiKey/endpoint/provider preventing explicit clearing;
change the merge to only spread models when models is not undefined, treat
models === null as an explicit clear (set to {}), and replace the current "||"
checks with explicit undefined checks so that apiKey/endpoint/provider are
retained only when the incoming value is undefined (allowing null to clear
secrets). Build the new connection.llmConfig by copying existing fields, setting
provider = provider === undefined ? connection.llmConfig.provider : provider,
apiKey = apiKey === undefined ? connection.llmConfig.apiKey : apiKey, endpoint =
endpoint === undefined ? connection.llmConfig.endpoint : endpoint, and for
models: if models === undefined use connection.llmConfig.models, else if models
=== null set models to {}, else merge {...connection.llmConfig.models,
...models}.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts (1)

749-766: Timeout race can double‑update tracing and produce inconsistent results.

If the timeout fires then the tool resolves later, we still update the observation/result. Guard with a settled flag.

@@
-    return new Promise((resolve, reject) => {
-      const timer = setTimeout(() => {
+    return new Promise((resolve, reject) => {
+      let settled = false;
+      const timer = setTimeout(() => {
+        settled = true;
         // Update span with timeout error
@@
-      executePromise
+      executePromise
         .then((result: any) => {
-          clearTimeout(timer);
+          if (settled) return;
+          clearTimeout(timer);
@@
-          resolve(result);
+          settled = true;
+          resolve(result);
         })
         .catch((error: Error) => {
-          clearTimeout(timer);
+          if (settled) return;
+          clearTimeout(timer);
@@
-          reject(error);
+          settled = true;
+          reject(error);
         });

Also applies to: 772-811
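
The same guard can be factored into a reusable helper; this is a sketch under assumed names, not the PR's code:

// Wrap a promise with a timeout that cannot double-settle (mirrors the settled flag above).
function withTimeout<T>(work: Promise<T>, ms: number, onTimeout: () => void): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    let settled = false;
    const timer = setTimeout(() => {
      if (settled) return;
      settled = true;
      onTimeout();  // e.g. mark the tracing span as timed out
      reject(new Error(`Timed out after ${ms}ms`));
    }, ms);

    work.then(result => {
      if (settled) return;     // timeout already won the race
      settled = true;
      clearTimeout(timer);
      resolve(result);
    }).catch(error => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      reject(error);
    });
  });
}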

🧹 Nitpick comments (1)
front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts (1)

549-552: Call hasOverride() directly; optional chaining not needed.

Simplifies and avoids accidental undefined.

-        const hasOverride = configManager.hasOverride?.() || false;
+        const hasOverride = configManager.hasOverride();
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e8b37a3 and 3aaef1e.

📒 Files selected for processing (2)
  • front_end/panels/ai_chat/core/LLMConfigurationManager.ts (1 hunks)
  • front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts (12 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/evaluator.js : LLM evaluator must integrate with the OpenAI API to judge agent responses

Applied to files:

  • front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts
🧬 Code graph analysis (2)
front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts (4)
front_end/panels/ai_chat/evaluation/remote/EvaluationProtocol.ts (6)
  • isLLMConfigurationRequest (297-299)
  • createAuthVerifyMessage (197-203)
  • LLMConfigurationRequest (259-264)
  • createErrorResponse (240-255)
  • ErrorCodes (141-155)
  • createLLMConfigurationResponse (314-334)
front_end/panels/ai_chat/core/BuildConfig.ts (1)
  • BUILD_CONFIG (11-20)
front_end/panels/ai_chat/core/AgentService.ts (1)
  • AgentService (45-910)
front_end/panels/ai_chat/core/LLMConfigurationManager.ts (3)
  • LLMConfigurationManager (42-428)
  • hasOverride (184-186)
  • LLMConfig (14-21)
front_end/panels/ai_chat/core/LLMConfigurationManager.ts (4)
front_end/panels/ai_chat/core/AgentService.ts (2)
  • logger (334-345)
  • event (810-846)
front_end/panels/ai_chat/ui/AIChatPanel.ts (14)
  • logger (971-1064)
  • logger (1372-1442)
  • logger (1592-1605)
  • logger (1649-1652)
  • window (1069-1116)
  • localStorage (836-859)
  • provider (1492-1555)
  • event (1703-1710)
  • event (1715-1719)
  • event (1724-1728)
  • event (1733-1737)
  • event (1742-1746)
  • event (1751-1755)
  • error (1961-1974)
front_end/panels/ai_chat/core/Logger.ts (1)
  • createLogger (316-318)
front_end/panels/ai_chat/LLM/LLMTypes.ts (1)
  • LLMProvider (142-142)
🔇 Additional comments (7)
front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts (3)

339-346: Stop logging secrets (model config and overrides leak API keys).

  • Line 339 logs params.model verbatim. It may contain api_key.
  • Lines 370-380 log mainModel which can be an object containing api_key.
  • Lines 574-583 log overrideConfig verbatim. This leaks secrets.

Apply these diffs to redact:

@@
-    logger.info('Received evaluation request', {
-      evaluationId: params.evaluationId,
-      tool: params.tool,
-      url: params.url,
-      isChat: params.tool === 'chat',
-      modelOverride: params.input?.ai_chat_model,
-      modelConfig: params.model
-    });
+    logger.info('Received evaluation request', {
+      evaluationId: params.evaluationId,
+      tool: params.tool,
+      url: params.url,
+      isChat: params.tool === 'chat',
+      hasModelConfig: !!params.model,
+      hasModelApiKey: !!(((params.model?.main_model as any)?.api_key) || ((params.model as any)?.api_key))
+    });
@@
-      logger.info('DEBUG: Extracted model configuration', {
-        evaluationId: params.evaluationId,
-        mainModel,
-        provider,
-        hasApiKey: !!apiKey,
-        apiKeyLength: apiKey?.length,
-        modelName,
-        miniModel,
-        nanoModel
-      });
+      logger.info('DEBUG: Extracted model configuration', {
+        evaluationId: params.evaluationId,
+        provider,
+        hasApiKey: !!apiKey,
+        modelName,
+        miniModel,
+        nanoModel
+      });
@@
-        logger.info('DEBUG: Created merged input for chat evaluation', {
-          evaluationId: params.evaluationId,
-          hasOverrideConfig: !!overrideConfig,
-          overrideConfig,
-          mergedInput: {
-            ...mergedInput,
-            api_key: mergedInput.api_key ? `${mergedInput.api_key.substring(0, 10)}...` : undefined
-          }
-        });
+        logger.info('DEBUG: Created merged input for chat evaluation', {
+          evaluationId: params.evaluationId,
+          hasOverrideConfig: !!overrideConfig,
+          mergedInput: {
+            ...mergedInput,
+            api_key: mergedInput.api_key ? '(redacted)' : undefined
+          }
+        });

Also applies to: 370-380, 574-583
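
Beyond the spot fixes above, a small redaction helper keeps future log sites safe by construction. This is a sketch rather than code from the PR, and the list of secret-looking keys is an assumption:

// Recursively replace likely secret fields before passing objects to the logger.
const SECRET_KEYS = new Set(['api_key', 'apiKey', 'authorization', 'token', 'secret']);

function redactSecrets<T>(value: T): T {
  if (Array.isArray(value)) {
    return value.map(redactSecrets) as unknown as T;
  }
  if (value && typeof value === 'object') {
    const out: Record<string, unknown> = {};
    for (const [key, val] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SECRET_KEYS.has(key) && val ? '(redacted)' : redactSecrets(val);
    }
    return out as unknown as T;
  }
  return value;
}

// Usage: logger.info('Received evaluation request', redactSecrets(params));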


1055-1112: Validate before persisting; avoid save‑then‑rollback.

Pre‑validate the merged config (respecting AUTOMATED_MODE) before calling save/apply to prevent inconsistent state and unnecessary listener churn.

Apply:

@@
-    try {
-      // Get configuration manager
-      const configManager = LLMConfigurationManager.getInstance();
-
-      // Store current config for potential rollback
-      const currentConfig = configManager.loadConfiguration();
-
-      // Handle configuration update based on partial flag
-      if (params.partial) {
-        // Use the new partial configuration method
-        const partialConfig: Partial<LLMConfig> = {};
-        if (params.provider !== undefined) partialConfig.provider = params.provider;
-        if (params.apiKey !== undefined) partialConfig.apiKey = params.apiKey;
-        if (params.endpoint !== undefined) partialConfig.endpoint = params.endpoint;
-        if (params.models?.main !== undefined) partialConfig.mainModel = params.models.main;
-        if (params.models?.mini !== undefined) partialConfig.miniModel = params.models.mini;
-        if (params.models?.nano !== undefined) partialConfig.nanoModel = params.models.nano;
-
-        configManager.applyPartialConfiguration(partialConfig);
-      } else {
-        // Full configuration update
-        const fullConfig: LLMConfig = {
-          provider: params.provider,
-          apiKey: params.apiKey,
-          endpoint: params.endpoint,
-          mainModel: params.models.main,
-          miniModel: params.models?.mini,
-          nanoModel: params.models?.nano
-        };
-        configManager.saveConfiguration(fullConfig);
-      }
-
-      // Validate the saved configuration
-      const postSaveValidation = configManager.validateConfiguration();
-
-      if (!postSaveValidation.isValid) {
-        // Restore the original config if validation fails
-        configManager.saveConfiguration(currentConfig);
-
-        // Send error response with validation errors
-        const errorResponse = createErrorResponse(
-          id,
-          ErrorCodes.INVALID_PARAMS,
-          'Invalid configuration',
-          { errors: postSaveValidation.errors }
-        );
-
-        if (this.client) {
-          this.client.send(errorResponse);
-        }
-
-        logger.error('Configuration validation failed', {
-          errors: postSaveValidation.errors
-        });
-
-        return;
-      }
+    try {
+      const configManager = LLMConfigurationManager.getInstance();
+      const currentConfig = configManager.loadConfiguration();
+
+      // Build merged config without persisting
+      const mergedConfig: LLMConfig = {
+        provider: params.partial ? (params.provider ?? currentConfig.provider) : params.provider,
+        apiKey: params.partial ? (params.apiKey ?? currentConfig.apiKey) : params.apiKey,
+        endpoint: params.partial ? (params.endpoint ?? currentConfig.endpoint) : params.endpoint,
+        mainModel: params.partial ? (params.models?.main ?? currentConfig.mainModel) : params.models.main,
+        miniModel: params.partial ? (params.models?.mini ?? currentConfig.miniModel) : params.models?.mini,
+        nanoModel: params.partial ? (params.models?.nano ?? currentConfig.nanoModel) : params.models?.nano
+      };
+
+      // Pre-validate (skip credential checks in automated mode)
+      const errors: string[] = [];
+      const provider = mergedConfig.provider as string | undefined;
+      const requiresKey = provider === 'openai' || provider === 'groq' || provider === 'openrouter';
+      if (!provider) errors.push('Provider is required');
+      if (!mergedConfig.mainModel) errors.push('Main model is required');
+      if (provider === 'litellm' && !mergedConfig.endpoint) errors.push('LiteLLM endpoint is required');
+      if (!BUILD_CONFIG.AUTOMATED_MODE && requiresKey && !mergedConfig.apiKey) {
+        errors.push(`${provider} API key is required`);
+      }
+      if (errors.length) {
+        const errorResponse = createErrorResponse(
+          id,
+          ErrorCodes.INVALID_PARAMS,
+          'Invalid configuration',
+          { errors }
+        );
+        this.client?.send(errorResponse);
+        logger.error('Configuration validation failed', { errors });
+        return;
+      }
+
+      // Persist after validation
+      configManager.saveConfiguration(mergedConfig);

282-288: Automated-mode auth bypass: confirm server-side expectations.

Bypassing auth in automated mode can admit unintended clients if misconfigured. Ensure the server enforces environment gating (e.g., local-only, Docker network) before relying on this.

Run to locate server-side gating:

front_end/panels/ai_chat/core/LLMConfigurationManager.ts (4)

299-325: Add provider validation in validateConfiguration.

Currently only checks presence; reject unsupported values early.

@@
-    // Check provider
-    if (!config.provider) {
-      errors.push('Provider is required');
-    }
+    // Check provider
+    if (!config.provider) {
+      errors.push('Provider is required');
+    } else if (!ALLOWED_PROVIDERS.includes(config.provider)) {
+      errors.push(`Unsupported provider: ${config.provider}`);
+    }

28-33: Fix storage key mismatches (MODEL_SELECTION, LITELLM_ENDPOINT).

These diverge from existing keys and will desync reads/writes.

   PROVIDER: 'ai_chat_provider',
-  MODEL_SELECTION: 'ai_chat_model_selection',
+  MODEL_SELECTION: 'selected_model',
   MINI_MODEL: 'ai_chat_mini_model',
   NANO_MODEL: 'ai_chat_nano_model',
   OPENAI_API_KEY: 'ai_chat_api_key',
-  LITELLM_ENDPOINT: 'ai_chat_litellm_endpoint',
+  LITELLM_ENDPOINT: 'litellm_endpoint',
   LITELLM_API_KEY: 'ai_chat_litellm_api_key',

26-36: Guard against invalid provider strings.

Casting stored strings straight to LLMProvider allows arbitrary values; validate against the list of allowed providers.

@@
 const STORAGE_KEYS = {
@@
 } as const;
+
+const ALLOWED_PROVIDERS: ReadonlyArray<LLMProvider> =
+  ['openai', 'litellm', 'groq', 'openrouter'] as const;
@@
-  getProvider(): LLMProvider {
-    if (this.overrideConfig?.provider) {
-      return this.overrideConfig.provider;
-    }
-    const stored = localStorage.getItem(STORAGE_KEYS.PROVIDER);
-    return (stored as LLMProvider) || 'openai';
-  }
+  getProvider(): LLMProvider {
+    const override = this.overrideConfig?.provider;
+    if (override && ALLOWED_PROVIDERS.includes(override)) {
+      return override;
+    }
+    const stored = localStorage.getItem(STORAGE_KEYS.PROVIDER) as LLMProvider | null;
+    if (stored && ALLOWED_PROVIDERS.includes(stored)) {
+      return stored;
+    }
+    return 'openai';
+  }

Also applies to: 65-71


191-221: Gate credential persistence behind an explicit option (avoid storing API keys by default).

Persisting secrets in localStorage is a security risk. Make persistence opt‑in and thread the option through the provider‑specific saves.

-  saveConfiguration(config: LLMConfig): void {
+  saveConfiguration(config: LLMConfig, options?: { persistSecrets?: boolean }): void {
     logger.info('Saving configuration to localStorage', {
       provider: config.provider,
       mainModel: config.mainModel,
       hasApiKey: !!config.apiKey,
       hasEndpoint: !!config.endpoint
     });
 
+    const persistSecrets = options?.persistSecrets ?? true;
@@
-    // Save provider-specific settings
-    this.saveProviderSpecificSettings(config);
+    // Save provider-specific settings
+    this.saveProviderSpecificSettings(config, persistSecrets);
@@
-  private saveProviderSpecificSettings(config: LLMConfig): void {
+  private saveProviderSpecificSettings(config: LLMConfig, persistSecrets: boolean): void {
@@
       case 'openai':
-        if (config.apiKey) {
-          localStorage.setItem(STORAGE_KEYS.OPENAI_API_KEY, config.apiKey);
-        } else {
-          localStorage.removeItem(STORAGE_KEYS.OPENAI_API_KEY);
-        }
+        if (persistSecrets) {
+          if (config.apiKey) {
+            localStorage.setItem(STORAGE_KEYS.OPENAI_API_KEY, config.apiKey);
+          } else {
+            localStorage.removeItem(STORAGE_KEYS.OPENAI_API_KEY);
+          }
+        }
         break;
       case 'litellm':
         if (config.endpoint) {
           localStorage.setItem(STORAGE_KEYS.LITELLM_ENDPOINT, config.endpoint);
         } else {
           localStorage.removeItem(STORAGE_KEYS.LITELLM_ENDPOINT);
         }
-        if (config.apiKey) {
-          localStorage.setItem(STORAGE_KEYS.LITELLM_API_KEY, config.apiKey);
-        } else {
-          localStorage.removeItem(STORAGE_KEYS.LITELLM_API_KEY);
-        }
+        if (persistSecrets) {
+          if (config.apiKey) {
+            localStorage.setItem(STORAGE_KEYS.LITELLM_API_KEY, config.apiKey);
+          } else {
+            localStorage.removeItem(STORAGE_KEYS.LITELLM_API_KEY);
+          }
+        }
         break;
       case 'groq':
-        if (config.apiKey) {
-          localStorage.setItem(STORAGE_KEYS.GROQ_API_KEY, config.apiKey);
-        } else {
-          localStorage.removeItem(STORAGE_KEYS.GROQ_API_KEY);
-        }
+        if (persistSecrets) {
+          if (config.apiKey) {
+            localStorage.setItem(STORAGE_KEYS.GROQ_API_KEY, config.apiKey);
+          } else {
+            localStorage.removeItem(STORAGE_KEYS.GROQ_API_KEY);
+          }
+        }
         break;
       case 'openrouter':
-        if (config.apiKey) {
-          localStorage.setItem(STORAGE_KEYS.OPENROUTER_API_KEY, config.apiKey);
-        } else {
-          localStorage.removeItem(STORAGE_KEYS.OPENROUTER_API_KEY);
-        }
+        if (persistSecrets) {
+          if (config.apiKey) {
+            localStorage.setItem(STORAGE_KEYS.OPENROUTER_API_KEY, config.apiKey);
+          } else {
+            localStorage.removeItem(STORAGE_KEYS.OPENROUTER_API_KEY);
+          }
+        }
         break;

Also applies to: 337-374
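
If this opt-in is adopted, call sites that receive per-evaluation credentials would pass persistSecrets: false so keys stay in memory only. The snippet below is a usage sketch against the proposed signature (it does not exist in the merged code); the import path, key, and model name are placeholders.

import { LLMConfigurationManager } from './LLMConfigurationManager.js';

const configManager = LLMConfigurationManager.getInstance();
const evaluationApiKey = '(key taken from the evaluation request)';  // placeholder value

// Persist provider/model choices, but keep the credential out of localStorage.
configManager.saveConfiguration(
  { provider: 'openai', apiKey: evaluationApiKey, mainModel: 'gpt-4.1-mini' },
  { persistSecrets: false },
);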

Comment on lines +26 to +36
const STORAGE_KEYS = {
  PROVIDER: 'ai_chat_provider',
  MODEL_SELECTION: 'ai_chat_model_selection',
  MINI_MODEL: 'ai_chat_mini_model',
  NANO_MODEL: 'ai_chat_nano_model',
  OPENAI_API_KEY: 'ai_chat_api_key',
  LITELLM_ENDPOINT: 'ai_chat_litellm_endpoint',
  LITELLM_API_KEY: 'ai_chat_litellm_api_key',
  GROQ_API_KEY: 'ai_chat_groq_api_key',
  OPENROUTER_API_KEY: 'ai_chat_openrouter_api_key',
} as const;

🛠️ Refactor suggestion

🧩 Analysis chain

Search the codebase for old keys to update call sites.

Ensure no lingering references to ai_chat_model_selection or ai_chat_litellm_endpoint.


🏁 Script executed:

#!/bin/bash
rg -n -C2 -g '!**/node_modules/**' -P 'ai_chat_model_selection|ai_chat_litellm_endpoint'

Length of output: 148


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "PWD: $(pwd)"
echo "Top-level files (first 200 lines):"
ls -la | sed -n '1,200p'

echo
echo "File count (excluding node_modules):"
find . -path './node_modules' -prune -o -type f -print | wc -l

echo
echo "Searching for keys with ripgrep (no ignores, include hidden):"
rg -n -C2 --hidden -uu -S -g '!**/node_modules/**' -e 'ai_chat_model_selection' -e 'ai_chat_litellm_endpoint' || echo "rg: no matches"

echo
echo "Fallback search with grep (excluding node_modules):"
grep -RIn --line-number --exclude-dir=node_modules -e 'ai_chat_model_selection' -e 'ai_chat_litellm_endpoint' || echo "grep: no matches"

Length of output: 6377


Replace hard-coded storage keys with STORAGE_KEYS (fix lingering 'ai_chat_model_selection' and 'ai_chat_litellm_endpoint')

Replace raw uses of 'ai_chat_model_selection' and 'ai_chat_litellm_endpoint' with STORAGE_KEYS.MODEL_SELECTION and STORAGE_KEYS.LITELLM_ENDPOINT. Found literal occurrences to update:

  • front_end/panels/ai_chat/LLM/LiteLLMProvider.ts:52, 415
  • front_end/panels/ai_chat/core/AgentService.ts:793
  • front_end/panels/ai_chat/evaluation/runner/EvaluationRunner.ts:107
  • front_end/panels/ai_chat/ui/SettingsDialog.ts:30
  • front_end/panels/ai_chat/ui/AIChatPanel.ts:145, 151
  • front_end/panels/ai_chat/ui/__tests__/AIChatPanel.test.ts:40, 161, 173, 224, 260, 283, 296, 342
  • MODEL-CONFIGS.md:18, 25, 29

Key definitions are in front_end/panels/ai_chat/core/LLMConfigurationManager.ts (MODEL_SELECTION, LITELLM_ENDPOINT); update the above call sites, tests, and docs to reference those constants.
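
A representative call-site migration could look like the sketch below; it assumes STORAGE_KEYS is exported from LLMConfigurationManager.ts (it is not exported in the snippet shown above), and the relative path and default model name are placeholders.

import { STORAGE_KEYS } from '../core/LLMConfigurationManager.js';

// Before: localStorage.getItem('ai_chat_model_selection')
const selectedModel = localStorage.getItem(STORAGE_KEYS.MODEL_SELECTION) ?? 'gpt-4.1-mini';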

🤖 Prompt for AI Agents
In front_end/panels/ai_chat/core/LLMConfigurationManager.ts around lines 26 to
36, STORAGE_KEYS now centralizes storage key constants but several files still
use the hard-coded strings 'ai_chat_model_selection' and
'ai_chat_litellm_endpoint'; replace those literal usages with
STORAGE_KEYS.MODEL_SELECTION and STORAGE_KEYS.LITELLM_ENDPOINT respectively
across the codebase (update front_end/panels/ai_chat/LLM/LiteLLMProvider.ts
lines ~52 and ~415, front_end/panels/ai_chat/core/AgentService.ts line ~793,
front_end/panels/ai_chat/evaluation/runner/EvaluationRunner.ts line ~107,
front_end/panels/ai_chat/ui/SettingsDialog.ts line ~30,
front_end/panels/ai_chat/ui/AIChatPanel.ts lines ~145 and ~151, all occurrences
in front_end/panels/ai_chat/ui/__tests__/AIChatPanel.test.ts at the listed line
numbers, and references in MODEL-CONFIGS.md at the listed lines); ensure imports
reference STORAGE_KEYS (or LLMConfigurationManager export) and update tests/docs
to use the constant names so all runtime and test references are consistent.

Comment on lines +155 to +169
  /**
   * Set override configuration (for automated mode per-request overrides)
   */
  setOverride(config: Partial<LLMConfig>): void {
    logger.info('Setting configuration override', {
      provider: config.provider,
      mainModel: config.mainModel,
      hasApiKey: !!config.apiKey,
      hasEndpoint: !!config.endpoint
    });

    this.overrideConfig = { ...config };
    this.notifyListeners();
  }

🛠️ Refactor suggestion

Provide a scoped override API to support concurrent evaluations.

Consider adding a withOverride helper to ensure overrides are applied and cleared atomically per operation rather than mutating global state.

Example:

public async withOverride<T>(override: Partial<LLMConfig>, fn: () => Promise<T>): Promise<T> {
  const prev = this.overrideConfig;
  this.overrideConfig = { ...override };
  this.notifyListeners();
  try {
    return await fn();
  } finally {
    this.overrideConfig = prev;
    this.notifyListeners();
  }
}

Also applies to: 181-187

Comment on lines +348 to +416
    // CRITICAL FIX: Set configuration override early for any model configuration provided
    // This allows API keys from request body to be available for UI-level validation
    logger.info('DEBUG: Checking for model configuration override', {
      evaluationId: params.evaluationId,
      hasModel: !!params.model,
      model: params.model
    });

    if (params.model && (params.model.main_model || params.model.provider || params.model.api_key)) {
      const configManager = LLMConfigurationManager.getInstance();

      // Extract configuration from nested model structure - handle both flat and nested formats
      const mainModel = params.model.main_model as any;

      // For nested format: main_model: { provider: "openai", model: "gpt-4", api_key: "key" }
      // For flat format: { provider: "openai", main_model: "gpt-4", api_key: "key" }
      const provider = mainModel?.provider || params.model.provider || 'openai';
      const apiKey = mainModel?.api_key || params.model.api_key;
      const modelName = mainModel?.model || mainModel || params.model.main_model;
      const miniModel = (params.model.mini_model as any)?.model || params.model.mini_model;
      const nanoModel = (params.model.nano_model as any)?.model || params.model.nano_model;

      logger.info('DEBUG: Extracted model configuration', {
        evaluationId: params.evaluationId,
        mainModel,
        provider,
        hasApiKey: !!apiKey,
        apiKeyLength: apiKey?.length,
        modelName,
        miniModel,
        nanoModel
      });

      if (apiKey) {
        logger.info('Setting early configuration override for evaluation', {
          evaluationId: params.evaluationId,
          provider,
          hasApiKey: !!apiKey,
          modelName
        });

        configManager.setOverride({
          provider,
          apiKey,
          mainModel: modelName,
          miniModel: miniModel || modelName,
          nanoModel: nanoModel || miniModel || modelName
        });
        hasSetEarlyOverride = true;

        logger.info('DEBUG: Early override set successfully', {
          evaluationId: params.evaluationId
        });
      } else {
        logger.warn('DEBUG: No API key found in model configuration', {
          evaluationId: params.evaluationId,
          mainModel,
          provider
        });
      }
    } else {
      logger.warn('DEBUG: No model configuration found for override', {
        evaluationId: params.evaluationId,
        hasModel: !!params.model,
        hasMainModel: !!(params.model?.main_model),
        hasProvider: !!(params.model?.provider),
        hasApiKey: !!(params.model?.api_key)
      });
    }

⚠️ Potential issue

Global override is not concurrency‑safe; race/cross‑talk between evaluations.

LLMConfigurationManager holds a single global override. With maxConcurrency: 3, parallel evaluations can clobber each other's provider/apiKey/models, causing misrouting and potential credential/data leakage. Immediate mitigation: set maxConcurrency to 1. Follow-up: adopt scoped overrides keyed by evaluationId, or pass the config directly to AgentService.

Immediate mitigation:

@@
-        maxConcurrency: 3,
+        maxConcurrency: 1,

Recommended follow-up (manager API sketch to avoid global mutable state):

// In LLMConfigurationManager (new API)
withOverride<T>(override: Partial<LLMConfig>, fn: () => Promise<T>): Promise<T> {
  const prev = this.overrideConfig;
  this.overrideConfig = { ...override };
  this.notifyListeners();
  return Promise.resolve(fn()).finally(() => {
    this.overrideConfig = prev;
    this.notifyListeners();
  });
}

Then use:

await configManager.withOverride({ provider, apiKey, mainModel, miniModel, nanoModel, endpoint }, async () => {
  // call AgentService.sendMessage(...)
});

Also applies to: 912-921, 697-709, 1036-1038, 220-221

🤖 Prompt for AI Agents
front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts lines 348-416: the
current code calls LLMConfigurationManager.setOverride which mutates a single
global override and causes race/cross-talk between parallel evaluations;
immediately set maxConcurrency to 1 for safety, and then refactor to use a
scoped override or pass config directly to the agent instead of mutating global
state. Implement a withOverride-like API on LLMConfigurationManager that
temporarily sets the override, calls the async evaluation work, and restores the
previous override in finally, or change AgentService/Agent call sites to accept
a per-evaluation config object and use that instead of reading the global
override; update all other occurrences noted in the comment (lines 912-921,
697-709, 1036-1038, 220-221) to follow the same pattern so overrides are
per-evaluation and not global.

@olesho olesho merged commit 65494a3 into main Sep 24, 2025
7 checks passed