
@olesho olesho (Collaborator) commented Sep 23, 2025

Local Dockerised Eval Server

Summary by CodeRabbit

  • New Features

    • Centralized LLM configuration with programmatic panel API, per-request overrides, persistent settings, multi-tier models (main/mini/nano), and per-client application during evaluations.
    • Dynamic configuration via JSON-RPC (configure_llm) for runtime/provider updates.
  • Documentation

    • Added comprehensive LLM configuration docs, provider guides, protocol examples, usage snippets, and an environment template.
  • Chores

    • Default ports updated for evaluation/HTTP endpoints and examples adjusted for automated mode (auth disabled by default).

olesho and others added 13 commits September 16, 2025 04:52
2. ✓ Fixed multi-provider credential preservation in saveProviderSpecificSettings()
3. ✓ Secured logging in both handleStorageChange() and getDebugInfo()
- AgentNodes.ts: Keep LLMConfigurationManager approach for centralized config handling
- EvaluationAgent.ts: Keep configManager scope for proper cleanup in finally block
- AgentService.ts: Remove duplicate static method/property definitions

The conflicts were between centralized configuration management (HEAD) and
individual parameter passing (upstream/main). The centralized approach is
kept for better override support during evaluations.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@olesho olesho requested a review from tysonthomas9 September 23, 2025 17:07

coderabbitai bot commented Sep 23, 2025

Walkthrough

Adds a centralized LLMConfigurationManager and JSON‑RPC configure_llm, introduces nested main/mini/nano model schema and per‑request overrides plus persistent storage, updates eval server and frontend evaluation flows and examples, and adds automated‑mode auth/validation shortcuts across agent and UI components.

Changes

  • Documentation & Docs (MODEL-CONFIGS.md, eval-server/nodejs/CLAUDE.md): New docs describing LLM configuration architecture, model schema, JSON‑RPC configure_llm, multi‑provider support, model config examples, and protocol details.
  • Environment & Server Config (eval-server/nodejs/.env.example, eval-server/nodejs/src/config.js, eval-server/nodejs/README.md): Adds example env keys for multiple LLM providers and defaults; restructures CONFIG with providers, defaults, security, clients, evals; README updated to show the nested model schema.
  • Eval Server Core & RPC (eval-server/nodejs/src/api-server.js, eval-server/nodejs/src/lib/EvalServer.js, eval-server/nodejs/src/client-manager.js): Adds nested model config processing and helpers (processNestedModelConfig, createDefaultModelConfig, createDynamicEvaluationNested); adds per‑connection JSON‑RPC handling (configure_llm) and per‑client LLM merging; adjusts model precedence behavior.
  • Eval Server Examples & Ports (eval-server/nodejs/examples/library-usage.js, eval-server/nodejs/examples/multiple-evals.js, eval-server/nodejs/examples/with-http-wrapper.js): Examples import CONFIG, demonstrate multi‑provider model selection and configure_llm usage; adjusted ports/status messages (8080→8082) and automated‑mode auth messaging.
  • Frontend Build (front_end/panels/ai_chat/BUILD.gn): Adds core/LLMConfigurationManager.ts to build sources and metadata lists.
  • Frontend LLM Configuration Manager (front_end/panels/ai_chat/core/LLMConfigurationManager.ts): New singleton LLMConfigurationManager and LLMConfig interface: localStorage persistence, per‑request overrides (set/clear), validation, change listeners, provider‑specific persistence, debug utilities, and redaction/logging.
  • Agent Core Integration (front_end/panels/ai_chat/core/AgentService.ts, front_end/panels/ai_chat/core/AgentNodes.ts, front_end/panels/ai_chat/LLM/OpenAIProvider.ts): AgentService and AgentNodes now source provider/apiKey/models from LLMConfigurationManager; AgentService subscribes to config changes and adds refreshCredentials; OpenAIProvider short‑circuits credential checks in automated mode.
  • Evaluation Protocol & Agent (front_end/panels/ai_chat/evaluation/remote/EvaluationProtocol.ts, front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts, front_end/panels/ai_chat/evaluation/EvaluationAgent.ts): Adds configure_llm JSON‑RPC request/response types and helpers; evaluation params support per‑request api_key/endpoint; EvaluationAgent applies early overrides, clears them after runs, and handles configure_llm requests; automated‑mode auth skips added.
  • UI & Evaluation Config (front_end/panels/ai_chat/ui/AIChatPanel.ts, front_end/panels/ai_chat/common/EvaluationConfig.ts): AIChatPanel integrates the config manager, adds a setLLMConfiguration API, and centralizes model/provider getters with defined fallback logic; EvaluationConfig default endpoint changed to ws://localhost:8082 and automated‑mode persistence/secret handling adjusted.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant EvalServerWS as EvalServer (WS)
  participant FrontendAgent as Frontend EvaluationAgent
  participant ConfigMgr as LLMConfigurationManager

  rect rgba(200,230,255,0.25)
    note over Client,EvalServerWS: Dynamic LLM configuration via JSON‑RPC
    Client->>EvalServerWS: {"jsonrpc":"2.0","id":"1","method":"configure_llm","params":{provider, apiKey?, endpoint?, models{main,mini?,nano?}, partial?}}
    EvalServerWS->>EvalServerWS: validate and merge per‑connection LLM config
    EvalServerWS-->>Client: {"jsonrpc":"2.0","id":"1","result":{status:"success",appliedConfig}}
  end

  rect rgba(220,255,220,0.25)
    note over Client,FrontendAgent: Per‑request evaluation with nested/flat model config
    Client->>FrontendAgent: start_evaluation { model: {...} }
    FrontendAgent->>ConfigMgr: setOverride(derived from request)
    FrontendAgent->>FrontendAgent: run chat/tools using configMgr values
    FrontendAgent->>ConfigMgr: clearOverride()
  end
sequenceDiagram
  autonumber
  participant HTTPClient
  participant APIServer
  participant EvalEngine
  participant Provider

  HTTPClient->>APIServer: POST /responses { model: nested|flat, input }
  APIServer->>APIServer: processNestedModelConfig() -> nestedModelConfig
  APIServer->>EvalEngine: createDynamicEvaluationNested(input, nestedModelConfig)
  EvalEngine->>Provider: execute with selected provider/models/keys
  Provider-->>EvalEngine: result
  EvalEngine-->>APIServer: evaluation summary
  APIServer-->>HTTPClient: OpenAI‑like response with nested model metadata
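For reference, a hedged sketch of the second flow above: a POST /responses request carrying the nested per-tier model configuration. The field names (main_model/mini_model/nano_model, each with provider, model, api_key, and an optional endpoint for LiteLLM) follow the walkthrough and README examples in this PR; the port, path, and response shape are assumptions and may differ from the actual server.

```javascript
// Illustrative sketch only, not the project's actual client code.
// Sends a nested model config to the HTTP wrapper; omitted tiers are expected
// to fall back to the server defaults (createDefaultModelConfig).
async function runNestedEvaluation() {
  const res = await fetch('http://localhost:8080/responses', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      input: 'Summarize the current page',
      model: {
        main_model: { provider: 'openai', model: 'gpt-4', api_key: process.env.OPENAI_API_KEY },
        mini_model: {
          provider: 'litellm',
          model: 'gpt-4o-mini',
          api_key: process.env.LITELLM_API_KEY,
          endpoint: 'http://localhost:4000',
        },
        // nano_model omitted: the server should fill in its default nano config
      },
    }),
  });
  console.log(await res.json()); // OpenAI-like response with nested model metadata
}
```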

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

I twitch my whiskers, toggle keys,
New providers humming in the breeze. 🐇
Overrides hop in, then gently go,
Models nest like clover rows. 🌿
JSON whispers, configs flow — hop! 🚀

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title "Local Dockerised Eval Server" is concise and clearly relates to the eval-server and local-server configuration changes in the diff (new .env example, API/eval server ports, examples and README updates), but it omits other substantial work in this PR such as the new front-end LLMConfigurationManager, protocol extensions, and evaluation integration, so it only partially summarizes the full scope.
  • Docstring Coverage: ✅ Passed. No functions found in the changes; docstring coverage check skipped.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/local-eval-api-try3-w-main

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


claude bot commented Sep 23, 2025

Claude finished @olesho's task.


PR Review: Local Dockerised Eval Server

I've completed a comprehensive review of this PR implementing a local dockerized evaluation server. Here's my analysis:

Todo List:

  • Analyze overall architecture and changes
  • Review Node.js evaluation server backend code
  • Review frontend TypeScript integration code
  • Check configuration management and security
  • Evaluate documentation and examples
  • Assess test coverage
  • Provide comprehensive feedback and recommendations

Strengths

Architecture & Design

  • Solid WebSocket-based RPC architecture with JSON-RPC 2.0 protocol implementation
  • Excellent separation of concerns - clear distinction between EvalServer library, API server, and client management
  • Flexible multi-provider LLM support (OpenAI, Groq, OpenRouter, LiteLLM) with proper abstraction
  • Well-designed configuration override system enabling both manual and automated evaluation modes
  • Event-driven architecture with proper EventEmitter usage and tracing support

Code Quality

  • Comprehensive error handling with structured logging using Winston
  • Good TypeScript integration with proper type definitions and interfaces
  • Clean API design for the EvalServer library with intuitive ClientProxy pattern
  • Proper resource cleanup - WebSocket connections, RPC clients, and intervals are cleaned up
  • Well-structured configuration management with validation and environment variable support

Documentation

  • Excellent CLAUDE.md files providing clear architecture overview and usage instructions
  • Comprehensive MODEL-CONFIGS.md documenting the configuration system thoroughly
  • Good inline documentation and JSDoc comments throughout the codebase

⚠️ Areas for Improvement

Security Concerns

  1. API Key Exposure Risk (eval-server/nodejs/src/api-server.js:349)

    api_key: process.env.OPENAI_API_KEY
    • API keys are passed directly in response payloads and could be logged or exposed
    • Recommendation: Sanitize logs and consider token-based authentication
  2. CORS Configuration (eval-server/nodejs/src/api-server.js:61)

    res.setHeader('Access-Control-Allow-Origin', '*');
    • Wildcard CORS allows any origin to access the API
    • Recommendation: Configure specific allowed origins for production (see the allowlist sketch after this list)
  3. Weak Default Authentication (eval-server/nodejs/examples/library-usage.js:14)

    authKey: 'hello',
    • Examples use weak default secret keys
    • Recommendation: Generate cryptographically secure keys and document secure generation
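As a concrete illustration of recommendation 2, a minimal origin-allowlist sketch for the plain Node HTTP server; ALLOWED_ORIGINS and its environment variable are assumptions, not existing project configuration.

```javascript
// Sketch only: replace the wildcard CORS header with an explicit allowlist.
const ALLOWED_ORIGINS = (process.env.ALLOWED_ORIGINS || 'http://localhost:8080').split(',');

function setCorsHeaders(req, res) {
  const origin = req.headers.origin;
  if (origin && ALLOWED_ORIGINS.includes(origin)) {
    res.setHeader('Access-Control-Allow-Origin', origin);
    res.setHeader('Vary', 'Origin'); // avoid cached responses leaking across origins
  }
  // No header is set for unknown origins, so browsers block cross-origin reads.
}
```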

Performance & Reliability

  1. Potential Memory Leaks (eval-server/nodejs/src/lib/EvalServer.js:59-60)

    this.connectedClients = new Map();
    this.activeEvaluations = new Map();
    • No cleanup mechanism for completed evaluations or stale connections
    • Recommendation: Implement periodic cleanup and connection heartbeat monitoring
  2. High Timeout Values (eval-server/nodejs/src/api-server.js:415)

    timeout: 1500000, // 25 minutes
    • Very long timeouts could lead to resource exhaustion
    • Recommendation: Use more reasonable defaults (5-10 minutes) with configurable limits
  3. Synchronous File Operations (eval-server/nodejs/src/api-server.js:30)

    const configContent = fs.readFileSync(configPath, 'utf8');
    • Blocking operations can impact server performance
    • Recommendation: Use async file operations (fs.promises.readFile)
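A small sketch of the non-blocking variant suggested in item 3; configPath and the parsing step are placeholders for whatever the existing loader actually does.

```javascript
// Sketch only: non-blocking config load with fs.promises.
import { readFile } from 'node:fs/promises';

async function loadConfigFile(configPath) {
  const configContent = await readFile(configPath, 'utf8');
  return configContent; // parse as JSON/YAML exactly as the current loader does
}
```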

Error Handling

  1. Generic Error Responses (eval-server/nodejs/src/lib/EvalServer.js:370-378)

    • RPC errors return generic messages that could expose internal details
    • Recommendation: Implement proper error categorization and sanitization
  2. Missing Validation (eval-server/nodejs/src/api-server.js:260-305)

    • Request body validation is minimal in the responses endpoint
    • Recommendation: Add comprehensive input validation and sanitization

Code Organization

  1. Inconsistent Model Configuration Format

    • Mix of flat and nested model configuration patterns across the codebase
    • Recommendation: Standardize on the nested format and provide migration utilities
  2. Large Files

    • EvalServer.js (905 lines) and api-server.js (566 lines) are quite large
    • Recommendation: Extract message handlers and validation logic into separate modules

🧪 Test Coverage Issues

  • No unit tests found for the evaluation server components
  • No integration tests for WebSocket communication
  • No security tests for authentication and authorization flows
  • No performance tests for concurrent evaluation handling

Recommendation: Add comprehensive test suite covering:

  • Unit tests for core server functionality
  • Integration tests for WebSocket RPC communication
  • Security tests for authentication flows
  • Performance tests for concurrent evaluations

🚀 Performance Considerations

Positive Aspects

  • Efficient WebSocket connection reuse
  • Proper event-driven architecture minimizing blocking operations
  • Good resource management with cleanup handlers

Areas for Optimization

  • Consider connection pooling for high-throughput scenarios
  • Implement request rate limiting to prevent abuse
  • Add metrics collection for monitoring performance

📋 Recommendations Summary

High Priority

  1. Security hardening: Fix CORS configuration, secure API key handling, improve authentication
  2. Add comprehensive test suite: Unit, integration, and security tests
  3. Error handling improvements: Better validation and error sanitization

Medium Priority

  1. Performance optimizations: Async file operations, connection cleanup, reasonable timeouts
  2. Code organization: Break down large files, standardize configuration patterns

Low Priority

  1. Documentation: Add API documentation and deployment guides
  2. Monitoring: Add metrics and observability features

🎯 Overall Assessment

This is a well-architected and feature-rich implementation of a local evaluation server. The code demonstrates solid engineering practices with good separation of concerns, comprehensive configuration management, and excellent documentation.

The main concerns are around security hardening and test coverage, which are critical for production readiness. The architecture is sound and the codebase provides a solid foundation for a robust evaluation system.

Recommendation: Address the security issues and add test coverage before merging to production environments.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 19

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
front_end/panels/ai_chat/LLM/OpenAIProvider.ts (1)

421-512: Consider gating speculative model IDs (e.g., GPT‑5) behind an “experimental” flag.

Hardcoding forward-looking IDs can mislead users if unsupported. Mark as experimental or hide unless explicitly enabled.

eval-server/nodejs/src/config.js (1)

65-78: Require at least one LLM provider when requireLLM is true; enforce OPENAI_API_KEY if defaults.provider === 'openai'

validateConfig currently only checks CONFIG.llm.apiKey; expand validation to ensure at least one provider is configured (openai, LiteLLM endpoint+key, GROQ, or OpenRouter) when requireLLM is true, and explicitly require CONFIG.providers.openai.apiKey when CONFIG.defaults.provider === 'openai'. File: eval-server/nodejs/src/config.js (validateConfig). validateConfig is already invoked with requireLLM in judge flows (eval-server/nodejs/src/lib/EvalServer.js — validateConfig(!!this.judge) and validateConfig(true)).

Apply this change:

 export function validateConfig(requireLLM = false) {
   const errors = [];
   
-  // Only require OpenAI API key if LLM judge is explicitly needed
-  if (requireLLM && !CONFIG.llm.apiKey) {
-    errors.push('OPENAI_API_KEY is required when using LLM judge');
-  }
+  // Require at least one provider when LLM is needed
+  if (requireLLM) {
+    const hasAnyProvider =
+      !!CONFIG.providers.openai.apiKey ||
+      (!!CONFIG.providers.litellm.endpoint && !!CONFIG.providers.litellm.apiKey) ||
+      !!CONFIG.providers.groq.apiKey ||
+      !!CONFIG.providers.openrouter.apiKey;
+    if (!hasAnyProvider) {
+      errors.push('No LLM provider configured. Set OPENAI_API_KEY, or another provider (LiteLLM endpoint+key, GROQ_API_KEY, or OPENROUTER_API_KEY).');
+    }
+    // If default provider is openai, require OPENAI_API_KEY explicitly
+    if (CONFIG.defaults.provider === 'openai' && !CONFIG.providers.openai.apiKey) {
+      errors.push('OPENAI_API_KEY is required when default provider is openai.');
+    }
+  }
   
   if (CONFIG.server.port < 1 || CONFIG.server.port > 65535) {
     errors.push('PORT must be between 1 and 65535');
   }
front_end/panels/ai_chat/core/AgentService.ts (1)

771-805: Use config manager for LiteLLM endpoint check (avoid localStorage)

#doesCurrentConfigRequireApiKey should respect overrides and centralized storage. Reading localStorage bypasses overrides and can misclassify requirements.

Apply this diff:

       // For LiteLLM, only require API key if no endpoint is configured
       if (selectedProvider === 'litellm') {
-        const hasLiteLLMEndpoint = Boolean(localStorage.getItem('ai_chat_litellm_endpoint'));
+        const hasLiteLLMEndpoint = !!this.#configManager.getConfiguration().endpoint;
         // If we have an endpoint, API key is optional
         return !hasLiteLLMEndpoint;
       }
🧹 Nitpick comments (14)
front_end/panels/ai_chat/LLM/OpenAIProvider.ts (1)

297-337: Add request timeout to fetch to avoid hanging UI on network stalls.

Wrap fetch with AbortController and a configurable timeout.

eval-server/nodejs/examples/multiple-evals.js (1)

12-12: Provider-aware examples look good; minor consistency nit with auth.

Examples correctly source provider creds from CONFIG. Consider aligning auth usage with other examples (either all on or all off) to reduce confusion.

Also applies to: 17-35, 37-52, 57-68, 79-86, 94-96, 132-141

front_end/panels/ai_chat/common/EvaluationConfig.ts (2)

141-148: Use crypto-based UUIDs.

Math.random isn’t suitable for UUIDs. Prefer crypto.getRandomValues when available.

-  private generateUUID(): string {
-    // Generate UUID v4
-    return 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function(c) {
-      const r = Math.random() * 16 | 0;
-      const v = c === 'x' ? r : (r & 0x3 | 0x8);
-      return v.toString(16);
-    });
-  }
+  private generateUUID(): string {
+    if (typeof crypto !== 'undefined' && 'getRandomValues' in crypto) {
+      const bytes = new Uint8Array(16);
+      crypto.getRandomValues(bytes);
+      // Per RFC 4122
+      bytes[6] = (bytes[6] & 0x0f) | 0x40;
+      bytes[8] = (bytes[8] & 0x3f) | 0x80;
+      const toHex = (n: number) => n.toString(16).padStart(2, '0');
+      const b = Array.from(bytes, toHex).join('');
+      return `${b.slice(0,8)}-${b.slice(8,12)}-${b.slice(12,16)}-${b.slice(16,20)}-${b.slice(20)}`;
+    }
+    // Fallback
+    return 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, c => {
+      const r = Math.random() * 16 | 0;
+      const v = c === 'x' ? r : (r & 0x3 | 0x8);
+      return v.toString(16);
+    });
+  }

33-33: Centralize evaluation WebSocket URL and align default to ws://localhost:8080

EvaluationConfig.ts defaults to ws://localhost:8082 while most server docs/examples use ws://localhost:8080 — introduce a single DEFAULT_EVAL_WS (e.g. 'ws://localhost:8080') and replace hardcoded occurrences to avoid connection failures.

  • Update front_end/panels/ai_chat/common/EvaluationConfig.ts (endpoint default, localStorage set/load/fallback).
  • Update front_end/panels/ai_chat/ui/SettingsDialog.ts (placeholder/value/save/load fallbacks).
  • Scan repo for other hardcoded ws://localhost:8080 / ws://localhost:8082 instances and use the centralized constant (note: with-http-wrapper example intentionally runs WS on 8082 when an HTTP wrapper uses 8080).
eval-server/nodejs/CLAUDE.md (1)

25-34: Add security guidance for credentials.

Recommend not logging API keys; prefer env vars for providers; document AUTH_SECRET_KEY usage and how to disable/enable auth in local Docker setups.

Also applies to: 165-172

front_end/panels/ai_chat/ui/AIChatPanel.ts (2)

753-755: Listen for LLM config changes to keep UI/state in sync.

When config changes from another tab or via configure_llm, update selections and re-init AgentService.

   // Initialize configuration manager
   this.#configManager = LLMConfigurationManager.getInstance();
+  this.#configManager.addChangeListener(() => {
+    // Refresh local selections from storage
+    this.#loadModelSelections();
+    this.#validateAndFixModelSelections();
+    // Reinitialize backend clients
+    this.#initializeAgentService();
+    this.performUpdate();
+  });

1122-1137: Update local selections after saving configuration.

Ensure #selectedModel/#miniModel/#nanoModel reflect the newly saved config before reinitialization.

   // Save configuration to localStorage
   this.#configManager.saveConfiguration(config);
 
   // Refresh the agent service with new configuration
-  this.refreshCredentials();
+  this.#loadModelSelections();
+  this.#validateAndFixModelSelections();
+  this.refreshCredentials();
eval-server/nodejs/examples/library-usage.js (4)

95-149: Add timeouts to example evaluations.

Prevent hanging examples by setting an explicit timeout (if supported by client.evaluate API).

       const response = await client.evaluate({
         id: `model_test_${test.model.replace(/[^a-z0-9]/g, '_')}`,
         name: `${test.model} ${test.task}`,
         tool: "chat",
+        timeout: 60000,
         input: {
           message: test.message
         },
         model: {

Please confirm evaluate(params) supports a top-level timeout (per EvaluationParams). If not, I’ll adjust the suggestion.


35-54: Provider detection: tolerate missing keys and avoid implying readiness.

Consider logging which providers are active vs. partially configured, and skip passing undefined endpoints except for LiteLLM. This keeps example output less confusing.
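One possible shape for that readiness logging, assuming the CONFIG.providers layout referenced elsewhere in this review (a per-provider apiKey, plus an endpoint for LiteLLM) and the examples' relative import path:

```javascript
// Sketch only: report which providers are fully configured vs. missing credentials.
import { CONFIG } from '../src/config.js';

for (const [name, provider] of Object.entries(CONFIG.providers)) {
  const ready = name === 'litellm'
      ? Boolean(provider.endpoint && provider.apiKey)
      : Boolean(provider.apiKey);
  console.log(`${name}: ${ready ? 'configured' : 'not configured'}`);
}
```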


152-237: Demonstrate persistent configuration via configure_llm

Server implements the RPC (eval-server/nodejs/src/lib/EvalServer.js:346–384) and provider defaults live in eval-server/nodejs/src/config.js (~line 17). Add a short example in eval-server/nodejs/examples/library-usage.js that calls configure_llm once (JSON‑RPC or the client helper if one exists) and then runs multiple evaluate calls to show the persistent configuration taking effect.
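A rough sketch of what such an example could look like, sending configure_llm once over a raw WebSocket; the JSON-RPC shape mirrors the sequence diagram in the walkthrough, while the port, model names, and the absence of a client helper are assumptions.

```javascript
// Sketch only: apply a persistent LLM configuration once, then reuse it.
import WebSocket from 'ws';

const ws = new WebSocket('ws://127.0.0.1:8080');

ws.on('open', () => {
  ws.send(JSON.stringify({
    jsonrpc: '2.0',
    id: 'cfg-1',
    method: 'configure_llm',
    params: {
      provider: 'openai',
      apiKey: process.env.OPENAI_API_KEY,
      models: { main: 'gpt-4', mini: 'gpt-4-mini', nano: 'gpt-3.5-turbo' },
      partial: false,
    },
  }));
});

ws.on('message', data => {
  const msg = JSON.parse(data.toString());
  if (msg.id === 'cfg-1') {
    console.log('configure_llm applied:', msg.result);
    // Subsequent client.evaluate(...) calls (as in the examples above) should
    // now run against this configuration without repeating the model block.
  }
});
```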


14-17: Use CONFIG.server.host/port (or env) to avoid front/back-end port drift

Many examples, tests and docs hardcode ws://127.0.0.1:8080 while the front-end defaults to ws://localhost:8082 — centralize host/port via eval-server/nodejs/src/config.js (CONFIG.server.host / CONFIG.server.port or process.env) and reference that in examples, logs and client scripts so they stay in sync.

Key places to update (examples/docs/clients + front-end mismatch):

  • eval-server/nodejs/examples/library-usage.js (server instantiation + logs)
  • eval-server/nodejs/examples/multiple-evals.js
  • eval-server/nodejs/examples/with-http-wrapper.js (make both WS/HTTP URLs config-driven)
  • eval-server/python/test_client.py, eval-server/python/quick_test.py
  • eval-server/nodejs/docs/CLIENT_SETUP.md, README files
  • front_end/panels/ai_chat/common/EvaluationConfig.ts (default ws://localhost:8082) vs front_end/panels/ai_chat/ui/SettingsDialog.ts (placeholders/values use ws://localhost:8080) — align to read the configured endpoint

Use CONFIG.server.host and CONFIG.server.port (or process.env) when constructing connection URLs and console logs.

front_end/panels/ai_chat/core/LLMConfigurationManager.ts (3)

144-153: Return undefined for absent optional fields in getConfiguration.

Prevents leaking empty strings into optional fields and simplifies downstream checks.

Apply this diff:

   getConfiguration(): LLMConfig {
     return {
       provider: this.getProvider(),
       apiKey: this.getApiKey(),
       endpoint: this.getEndpoint(),
       mainModel: this.getMainModel(),
-      miniModel: this.getMiniModel(),
-      nanoModel: this.getNanoModel(),
+      miniModel: this.getMiniModel() || undefined,
+      nanoModel: this.getNanoModel() || undefined,
     };
   }

341-351: Redact endpoints in storage-change logs.

Endpoints may include internal hosts or query params; treat them as sensitive.

Apply this diff:

   private handleStorageChange(event: StorageEvent): void {
     if (event.key && Object.values(STORAGE_KEYS).includes(event.key as any)) {
       const sensitiveKeys = new Set([
         STORAGE_KEYS.OPENAI_API_KEY,
         STORAGE_KEYS.LITELLM_API_KEY,
         STORAGE_KEYS.GROQ_API_KEY,
         STORAGE_KEYS.OPENROUTER_API_KEY,
+        STORAGE_KEYS.LITELLM_ENDPOINT,
       ]);

26-36: Avoid duplicating storage key sources of truth.

Keys here overlap with core/Constants.ts and provider classes (getCredentialStorageKeys). Centralize to prevent drift.

Consider importing keys from a single module (e.g., core/Constants.ts) and/or using provider.getCredentialStorageKeys() for API key names instead of hardcoding them here.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c561c79 and c585f1d.

⛔ Files ignored due to path filters (1)
  • eval-server/nodejs/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (19)
  • MODEL-CONFIGS.md (1 hunks)
  • eval-server/nodejs/.env.example (1 hunks)
  • eval-server/nodejs/CLAUDE.md (4 hunks)
  • eval-server/nodejs/examples/library-usage.js (3 hunks)
  • eval-server/nodejs/examples/multiple-evals.js (4 hunks)
  • eval-server/nodejs/examples/with-http-wrapper.js (1 hunks)
  • eval-server/nodejs/src/api-server.js (6 hunks)
  • eval-server/nodejs/src/config.js (1 hunks)
  • eval-server/nodejs/src/lib/EvalServer.js (5 hunks)
  • front_end/panels/ai_chat/BUILD.gn (2 hunks)
  • front_end/panels/ai_chat/LLM/OpenAIProvider.ts (2 hunks)
  • front_end/panels/ai_chat/common/EvaluationConfig.ts (3 hunks)
  • front_end/panels/ai_chat/core/AgentNodes.ts (2 hunks)
  • front_end/panels/ai_chat/core/AgentService.ts (10 hunks)
  • front_end/panels/ai_chat/core/LLMConfigurationManager.ts (1 hunks)
  • front_end/panels/ai_chat/evaluation/EvaluationAgent.ts (3 hunks)
  • front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts (12 hunks)
  • front_end/panels/ai_chat/evaluation/remote/EvaluationProtocol.ts (2 hunks)
  • front_end/panels/ai_chat/ui/AIChatPanel.ts (5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
eval-server/nodejs/src/config.js

📄 CodeRabbit inference engine (eval-server/nodejs/CLAUDE.md)

eval-server/nodejs/src/config.js: Centralize configuration via environment variables in src/config.js
Require OPENAI_API_KEY and PORT (default 8080) from environment configuration

Files:

  • eval-server/nodejs/src/config.js
🧠 Learnings (9)
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/server.js : WebSocket server must accept agent connections, manage agent lifecycle (connect/ready/disconnect), orchestrate evaluation sessions, and handle bidirectional RPC communication

Applied to files:

  • front_end/panels/ai_chat/evaluation/EvaluationAgent.ts
  • eval-server/nodejs/examples/with-http-wrapper.js
  • eval-server/nodejs/src/lib/EvalServer.js
  • eval-server/nodejs/CLAUDE.md
  • front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/config.js : Centralize configuration via environment variables in src/config.js

Applied to files:

  • front_end/panels/ai_chat/evaluation/EvaluationAgent.ts
  • eval-server/nodejs/.env.example
  • eval-server/nodejs/examples/library-usage.js
  • eval-server/nodejs/src/config.js
  • eval-server/nodejs/examples/multiple-evals.js
  • front_end/panels/ai_chat/common/EvaluationConfig.ts
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/config.js : Require OPENAI_API_KEY and PORT (default 8080) from environment configuration

Applied to files:

  • eval-server/nodejs/.env.example
  • eval-server/nodejs/examples/with-http-wrapper.js
  • eval-server/nodejs/examples/library-usage.js
  • eval-server/nodejs/src/config.js
  • front_end/panels/ai_chat/core/AgentService.ts
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/evaluator.js : LLM evaluator must integrate with the OpenAI API to judge agent responses

Applied to files:

  • eval-server/nodejs/examples/library-usage.js
  • eval-server/nodejs/examples/multiple-evals.js
  • eval-server/nodejs/CLAUDE.md
  • front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/rpc-client.js : RPC client must call Evaluate(request: string) -> string on connected agents

Applied to files:

  • eval-server/nodejs/src/api-server.js
  • eval-server/nodejs/CLAUDE.md
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/rpc-client.js : RPC client must implement JSON-RPC 2.0 protocol

Applied to files:

  • eval-server/nodejs/src/lib/EvalServer.js
  • eval-server/nodejs/CLAUDE.md
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/evaluator.js : LLM evaluator must return structured JSON with scores and reasoning

Applied to files:

  • eval-server/nodejs/examples/multiple-evals.js
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/rpc-client.js : RPC client must correlate requests and responses using unique IDs

Applied to files:

  • eval-server/nodejs/CLAUDE.md
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/rpc-client.js : RPC client must handle timeouts and error conditions

Applied to files:

  • eval-server/nodejs/CLAUDE.md
🧬 Code graph analysis (11)
front_end/panels/ai_chat/core/AgentNodes.ts (1)
front_end/panels/ai_chat/core/LLMConfigurationManager.ts (1)
  • LLMConfigurationManager (42-390)
front_end/panels/ai_chat/evaluation/EvaluationAgent.ts (1)
front_end/panels/ai_chat/core/BuildConfig.ts (1)
  • BUILD_CONFIG (11-20)
eval-server/nodejs/examples/library-usage.js (1)
eval-server/nodejs/src/config.js (2)
  • CONFIG (5-63)
  • CONFIG (5-63)
eval-server/nodejs/src/lib/EvalServer.js (1)
eval-server/nodejs/src/config.js (2)
  • CONFIG (5-63)
  • CONFIG (5-63)
eval-server/nodejs/examples/multiple-evals.js (2)
eval-server/nodejs/src/config.js (2)
  • CONFIG (5-63)
  • CONFIG (5-63)
eval-server/nodejs/examples/library-usage.js (3)
  • response (63-71)
  • response (121-135)
  • response (195-223)
front_end/panels/ai_chat/core/AgentService.ts (3)
front_end/panels/ai_chat/core/LLMConfigurationManager.ts (1)
  • LLMConfigurationManager (42-390)
front_end/panels/ai_chat/core/BuildConfig.ts (1)
  • BUILD_CONFIG (11-20)
front_end/panels/ai_chat/core/Graph.ts (1)
  • createAgentGraph (24-42)
front_end/panels/ai_chat/LLM/OpenAIProvider.ts (1)
front_end/panels/ai_chat/core/BuildConfig.ts (1)
  • BUILD_CONFIG (11-20)
front_end/panels/ai_chat/ui/AIChatPanel.ts (2)
front_end/panels/ai_chat/core/LLMConfigurationManager.ts (1)
  • LLMConfigurationManager (42-390)
front_end/panels/ai_chat/core/AgentService.ts (1)
  • logger (334-345)
front_end/panels/ai_chat/common/EvaluationConfig.ts (1)
front_end/panels/ai_chat/core/BuildConfig.ts (1)
  • BUILD_CONFIG (11-20)
front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts (5)
front_end/panels/ai_chat/evaluation/remote/EvaluationProtocol.ts (5)
  • isLLMConfigurationRequest (297-299)
  • LLMConfigurationRequest (259-264)
  • createErrorResponse (240-255)
  • ErrorCodes (141-155)
  • createLLMConfigurationResponse (314-334)
front_end/panels/ai_chat/core/BuildConfig.ts (1)
  • BUILD_CONFIG (11-20)
front_end/panels/ai_chat/core/AgentService.ts (1)
  • AgentService (45-910)
front_end/panels/ai_chat/core/LLMConfigurationManager.ts (2)
  • LLMConfigurationManager (42-390)
  • hasOverride (184-186)
eval-server/nodejs/examples/library-usage.js (3)
  • response (63-71)
  • response (121-135)
  • response (195-223)
front_end/panels/ai_chat/core/LLMConfigurationManager.ts (4)
front_end/panels/ai_chat/core/AgentService.ts (1)
  • logger (334-345)
front_end/panels/ai_chat/core/Logger.ts (1)
  • createLogger (316-318)
front_end/panels/ai_chat/LLM/LLMTypes.ts (1)
  • LLMProvider (142-142)
front_end/panels/ai_chat/core/Constants.ts (1)
  • STORAGE_KEYS (39-45)
🪛 dotenv-linter (3.3.0)
eval-server/nodejs/.env.example

[warning] 6-6: [UnorderedKey] The HOST key should go before the PORT key

(UnorderedKey)


[warning] 16-16: [UnorderedKey] The LITELLM_API_KEY key should go before the LITELLM_ENDPOINT key

(UnorderedKey)


[warning] 27-27: [UnorderedKey] The DEFAULT_MAIN_MODEL key should go before the DEFAULT_PROVIDER key

(UnorderedKey)


[warning] 28-28: [UnorderedKey] The DEFAULT_MINI_MODEL key should go before the DEFAULT_PROVIDER key

(UnorderedKey)


[warning] 29-29: [UnorderedKey] The DEFAULT_NANO_MODEL key should go before the DEFAULT_PROVIDER key

(UnorderedKey)


[warning] 33-33: [UnorderedKey] The LOG_DIR key should go before the LOG_LEVEL key

(UnorderedKey)


[warning] 45-45: [EndingBlankLine] No blank line at the end of the file

(EndingBlankLine)

🪛 markdownlint-cli2 (0.18.1)
MODEL-CONFIGS.md

56-56: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


62-62: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


384-384: Multiple headings with the same content

(MD024, no-duplicate-heading)

🔇 Additional comments (23)
front_end/panels/ai_chat/BUILD.gn (1)

61-61: LLMConfigurationManager is correctly added to sources and metadata.

Inclusion looks consistent across module and metadata lists.

Also applies to: 218-218

front_end/panels/ai_chat/evaluation/EvaluationAgent.ts (1)

277-281: Log redaction and semantics look good.

The verification log avoids printing secrets and includes booleans only.

eval-server/nodejs/examples/with-http-wrapper.js (1)

14-17: Ports swapped: verify README and compose files reflect ws:8082/http:8080.

Ensure docs/scripts point to the updated URLs so users don’t connect to stale ports.

Also applies to: 21-22, 28-38

eval-server/nodejs/src/config.js (2)

17-32: Providers block is a solid centralization of credentials.

Clear separation by provider with env-driven values. LGTM.


34-41: Defaults section is helpful; ensure these align with frontend defaults.

No action needed if synced via docs; otherwise add a note in README.

front_end/panels/ai_chat/core/AgentNodes.ts (2)

18-18: Good: central manager import for unified LLM config.

Keeps provider/model/apiKey sourcing consistent.


817-827: Nice integration: tool execution sourced from LLMConfigurationManager.

Supports per‑request overrides and automated mode without leaking provider creds.

If createAgentNode’s model/provider are also driven by the same manager upstream, everything stays in sync. Please confirm.

front_end/panels/ai_chat/LLM/OpenAIProvider.ts (1)

525-533: Automated-mode skip is safe — runtime API key is supplied via config override.

EvaluationAgent sets LLMConfigurationManager.setOverride(...) with the per-request api_key before calling AgentService.sendMessage; AgentService.#initializeLLMClient reads config.apiKey and passes it into LLMClient.initialize, which constructs OpenAIProvider with that apiKey (no 401 risk if override is provided). See front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts and front_end/panels/ai_chat/core/AgentService.ts.

eval-server/nodejs/CLAUDE.md (1)

21-24: Document real default port to match the frontend.

This doc states default PORT 8080. Frontend defaults to 8082. Clarify which is canonical (or note Docker defaults) to avoid confusion.

front_end/panels/ai_chat/ui/AIChatPanel.ts (4)

335-341: Centralized mini-model fallback looks good.


343-351: Nano-model fallback hierarchy reads correctly.


353-362: Provider derivation via LLMConfigurationManager is appropriate.


365-373: Mini-model/provider pairing via manager is consistent.

front_end/panels/ai_chat/common/EvaluationConfig.ts (1)

69-70: Automated-mode safe — server skips secret validation; confirm front-end sends auth_verify or enable server auth

EvalServer.handleAuthVerification calls clientManager.registerClient(..., true) (skipSecretValidation=true), so a missing client secret will not block registration; CONFIG.security.authSecretKey is defined but not referenced by EvalServer. Ensure the front-end emits the 'auth_verify' message to complete registration, or enable server-side auth (AUTH_SECRET_KEY / EvalServer auth) if you require strict secret checks.

Likely an incorrect or invalid review comment.

eval-server/nodejs/src/lib/EvalServer.js (2)

199-205: Public accessor looks good

Exposing the client manager via a getter is appropriate and scoped.


275-280: Good JSON‑RPC request dispatch hook

Recognizes client→server JSON‑RPC requests early and delegates cleanly.

front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts (1)

282-288: Automated mode auth bypass acknowledged

Skipping secret-key verification in AUTOMATED_MODE is acceptable for Docker/CI. Ensure this cannot be enabled in user builds.

Confirm BUILD_CONFIG.AUTOMATED_MODE is only flipped at build time for Docker images and not in user releases.

eval-server/nodejs/src/api-server.js (2)

310-334: Nested/legacy model processing LGTM

Clear precedence and sane defaults.


339-351: Default provider key usage is acceptable; avoid logging later

Using env OPENAI_API_KEY as default is fine for local flows. Ensure it’s never logged (see previous comment).

front_end/panels/ai_chat/evaluation/remote/EvaluationProtocol.ts (2)

93-95: Per-request api_key/endpoint fields added appropriately

Matches nested model override needs and LiteLLM routing.


257-334: LLM configure_llm JSON‑RPC additions look consistent

Types, guard, and helpers align with server method name and JSON‑RPC 2.0.

front_end/panels/ai_chat/core/AgentService.ts (1)

151-209: Centralized config + provider registration logic LGTM

Validation flow with AUTOMATED_MODE bypass and provider list assembly is sound.

front_end/panels/ai_chat/core/LLMConfigurationManager.ts (1)

106-124: No action needed — credential storage keys match provider implementations.

OpenAI, LiteLLM, Groq and OpenRouter providers' getCredentialStorageKeys() return the same ai_chat_* keys used by LLMConfigurationManager, so AgentService will see consistent credential names.

Comment on lines +5 to +6
PORT=8080
HOST=127.0.0.1

🧹 Nitpick

Address dotenv-linter warnings (ordering + trailing newline).

Reorder keys and add final newline for cleanliness. Functionality unaffected.

Apply this diff:

-PORT=8080
-HOST=127.0.0.1
+HOST=127.0.0.1
+PORT=8080
@@
-LITELLM_ENDPOINT=http://localhost:4000
-LITELLM_API_KEY=your-litellm-api-key-here
+LITELLM_API_KEY=your-litellm-api-key-here
+LITELLM_ENDPOINT=http://localhost:4000
@@
-DEFAULT_PROVIDER=openai
-DEFAULT_MAIN_MODEL=gpt-4
-DEFAULT_MINI_MODEL=gpt-4-mini
-DEFAULT_NANO_MODEL=gpt-3.5-turbo
+DEFAULT_MAIN_MODEL=gpt-4
+DEFAULT_MINI_MODEL=gpt-4-mini
+DEFAULT_NANO_MODEL=gpt-3.5-turbo
+DEFAULT_PROVIDER=openai
@@
-LOG_LEVEL=info
-LOG_DIR=./logs
+LOG_DIR=./logs
+LOG_LEVEL=info
@@
-AUTH_SECRET_KEY=
+AUTH_SECRET_KEY=
+

Also applies to: 15-16, 27-29, 33-34, 45-45

🧰 Tools
🪛 dotenv-linter (3.3.0)

[warning] 6-6: [UnorderedKey] The HOST key should go before the PORT key

(UnorderedKey)

🤖 Prompt for AI Agents
In eval-server/nodejs/.env.example around lines 5-6 (and similarly at 15-16,
27-29, 33-34, 45-45), the .env entries are out of the linter's expected ordering
and the file lacks a trailing newline; reorder the environment keys into the
expected (alphabetical) order within each block/section and ensure the file ends
with a single final newline character so dotenv-linter warnings are resolved.

Comment on lines +23 to +33
### Current localStorage Keys
- `ai_chat_provider`: Selected provider ('openai', 'litellm', 'groq', 'openrouter')
- `ai_chat_model_selection`: Main model name
- `ai_chat_mini_model`: Mini model name
- `ai_chat_nano_model`: Nano model name
- `ai_chat_api_key`: OpenAI API key
- `ai_chat_litellm_endpoint`: LiteLLM server endpoint
- `ai_chat_litellm_api_key`: LiteLLM API key
- `ai_chat_groq_api_key`: Groq API key
- `ai_chat_openrouter_api_key`: OpenRouter API key


⚠️ Potential issue

Avoid persisting API keys in localStorage.

Storing provider keys in localStorage is high risk (readable by any script in origin; survives XSS). Prefer:

  • session or memory-only overrides for automation;
  • tighter-scoped storage (e.g., chrome.storage.session or in-memory singleton);
  • if persistence is required, encrypt at rest with a user-secret not available to page JS.

At minimum, document the security trade-offs and recommend non-persistent modes for CI.

Also applies to: 34-41, 360-377
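For automation, a hedged sketch of the memory-only alternative using the manager's own override path; setOverride/clearOverride are described in this PR, but the import path, the override fields, and runtimeApiKey are placeholders.

```javascript
// Sketch only: keep per-run credentials out of localStorage entirely.
import { LLMConfigurationManager } from './core/LLMConfigurationManager.js';

async function runWithEphemeralCredentials(runtimeApiKey, runEvaluation) {
  const configManager = LLMConfigurationManager.getInstance();
  configManager.setOverride({
    provider: 'openai',
    apiKey: runtimeApiKey,   // supplied with the evaluation request, never persisted
    mainModel: 'gpt-4',
  });
  try {
    await runEvaluation();   // evaluation reads config through the manager
  } finally {
    configManager.clearOverride(); // nothing survives the run
  }
}
```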

🤖 Prompt for AI Agents
In MODEL-CONFIGS.md around lines 23 to 33 (and also apply same guidance to lines
34-41 and 360-377), the document currently lists API key localStorage keys which
encourages insecure persistence; update the file to remove/replace any
recommendation to store provider API keys in localStorage and instead state
explicit secure alternatives: recommend only session- or memory-only storage
(e.g., in-memory singleton or session-scoped browser APIs), recommend
platform-scoped secure storage (e.g., chrome.storage.session or
OS/keychain-backed storage) for short-lived keys, and if persistence is required
document encrypt-at-rest with a user-provided secret not available to page JS,
plus explicitly call out the XSS risk and recommend non-persistent modes for
CI/automation; make these changes as concise replacements of the listed keys and
add a brief note pointing to secure implementation patterns rather than code
examples.

Comment on lines +56 to +62
**Per-Request Override (Temporary)**
- Configuration override for individual evaluation requests
- No localStorage changes
- Each tab can have different overrides simultaneously
- Clean up after request completion

**Persistent Configuration (New)**

🧹 Nitpick

Fix MD036: use headings instead of emphasis.

Replace emphasized labels with proper headings, e.g., “#### Per-Request Override (Temporary)” and “#### Persistent Configuration (New)”.

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

56-56: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


62-62: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

🤖 Prompt for AI Agents
In MODEL-CONFIGS.md around lines 56 to 62, the section labels use emphasis
instead of proper headings; replace the emphasized labels like "**Per-Request
Override (Temporary)**" and "**Persistent Configuration (New)**" with
appropriate Markdown headings (for example "#### Per-Request Override
(Temporary)" and "#### Persistent Configuration (New)"), leaving the bullet
lists unchanged so the structure uses real headings for accessibility and
linting compliance.

Comment on lines +100 to +104
// Persistence (manual mode & persistent automated mode)
saveConfiguration(config: LLMConfig): void;
loadConfiguration(): LLMConfig;
}
```

🛠️ Refactor suggestion

Support partial persistent updates or remove the partial flag.

LLMConfigurationManager only exposes saveConfiguration(config: LLMConfig) which implies full writes, while the protocol adds partial?: boolean. Implement a partial-merge path to avoid unintentionally clearing fields on partial updates.

Suggested addition (outside this doc, in LLMConfigurationManager.ts):

applyPartialConfiguration(partial: Partial<LLMConfig>): void {
  const current = this.loadConfiguration();
  const next: LLMConfig = {
    ...current,
    ...partial,
    mainModel: partial.mainModel ?? current.mainModel,
    miniModel: partial.miniModel ?? current.miniModel,
    nanoModel: partial.nanoModel ?? current.nanoModel,
    provider: partial.provider ?? current.provider,
    apiKey: partial.apiKey ?? current.apiKey,
    endpoint: partial.endpoint ?? current.endpoint,
  };
  this.saveConfiguration(next);
}

Also applies to: 133-134

Comment on lines +166 to +187
export interface EvaluationParams {
evaluationId: string;
name: string;
url: string;
tool: string;
input: any;
model?: {
main_model?: string;
mini_model?: string;
nano_model?: string;
provider?: string;
api_key?: string; // New: per-request API key
endpoint?: string; // New: per-request endpoint (LiteLLM)
};
timeout: number;
metadata: {
tags: string[];
retries: number;
priority?: 'low' | 'normal' | 'high';
};
}
```

⚠️ Potential issue

Unify per-request model schema (flat vs nested).

Docs show two incompatible shapes:

  • Flat override: model: { main_model?: string; mini_model?: string; provider?: string; api_key?: string; endpoint?: string }
  • Nested objects: model: { main_model: { provider, model, api_key, endpoint }, mini_model: {…}, nano_model: {…} }

Pick one and update all examples and types to match, or clearly version the protocols. This ambiguity will cause client/server decode failures.

Also applies to: 133-135, 137-161

🤖 Prompt for AI Agents
In MODEL-CONFIGS.md around lines 166-187 (and also update sections 133-135 and
137-161), the model field is ambiguous between a flat per-request override shape
and a nested-per-model object shape; choose one schema and make all types and
examples consistent: either (A) flatten model to a single object with optional
keys main_model/mini_model/nano_model as strings plus provider/api_key/endpoint
at the same level, or (B) switch to the nested shape where
main_model/mini_model/nano_model are objects containing provider, model,
api_key, endpoint; then update the EvaluationParams interface and every example
block to match the chosen shape, bump or version the protocol section header if
you need to preserve the old shape, and run a quick scan to ensure there are no
remaining mixed examples or type declarations.

Comment on lines +384 to +386
## Usage Examples

### Manual Mode (Settings UI)

🧹 Nitpick

Fix MD024: duplicate “Usage Examples” heading.

There are two “## Usage Examples” sections. Rename or merge to avoid duplication.

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

384-384: Multiple headings with the same content

(MD024, no-duplicate-heading)

🤖 Prompt for AI Agents
In MODEL-CONFIGS.md around lines 384-386, there are duplicate "## Usage
Examples" headings; remove or rename one to avoid the MD024 duplicate-heading
error—either merge the content under a single "## Usage Examples" section or
rename the second heading to something specific like "### Manual Mode (Settings
UI)" (or another distinct heading) and update any cross-references accordingly.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (2)
eval-server/nodejs/README.md (1)

148-164: Doc example: include LiteLLM endpoint when demonstrating multi‑provider config.

If any tier uses LiteLLM, document the optional endpoint field in that tier’s object so readers know how to route requests.

Example snippet:

"mini_model": {
  "provider": "litellm",
  "model": "gpt-4o-mini",
  "api_key": "sk-...",
  "endpoint": "http://localhost:4000"
}
eval-server/nodejs/src/api-server.js (1)

312-322: Accept legacy/flat model shapes and wrap into nested tiers.

Current code assumes request.model is already nested. Add handling for string/flat objects to avoid 400s with legacy clients.

Apply this diff:

-    // If nested format is provided, use it directly with fallbacks
-    if (requestBody.model) {
-      return {
-        main_model: requestBody.model.main_model || this.createDefaultModelConfig('main', defaults),
-        mini_model: requestBody.model.mini_model || this.createDefaultModelConfig('mini', defaults),
-        nano_model: requestBody.model.nano_model || this.createDefaultModelConfig('nano', defaults)
-      };
-    }
+    // If model is provided, support both nested and flat/legacy formats
+    if (requestBody.model) {
+      const m = requestBody.model;
+      // Already nested
+      if (m.main_model || m.mini_model || m.nano_model) {
+        return {
+          main_model: m.main_model || this.createDefaultModelConfig('main', defaults),
+          mini_model: m.mini_model || this.createDefaultModelConfig('mini', defaults),
+          nano_model: m.nano_model || this.createDefaultModelConfig('nano', defaults),
+        };
+      }
+      // Flat string -> main only
+      if (typeof m === 'string') {
+        return {
+          main_model: { provider: defaults.provider || 'openai', model: m, api_key: this.resolveApiKey(defaults.provider || 'openai') },
+          mini_model: this.createDefaultModelConfig('mini', defaults),
+          nano_model: this.createDefaultModelConfig('nano', defaults),
+        };
+      }
+      // Flat object { provider?, model?, api_key?, endpoint? } -> main only
+      if (typeof m === 'object') {
+        return {
+          main_model: {
+            provider: m.provider || (defaults.provider || 'openai'),
+            model: m.model || (defaults.main_model || 'gpt-4'),
+            api_key: m.api_key ?? this.resolveApiKey(m.provider || defaults.provider || 'openai'),
+            endpoint: m.endpoint, // optional for LiteLLM
+          },
+          mini_model: this.createDefaultModelConfig('mini', defaults),
+          nano_model: this.createDefaultModelConfig('nano', defaults),
+        };
+      }
+    }

Add this helper to the class to support provider‑specific keys:

resolveApiKey(provider) {
  switch ((provider || '').toLowerCase()) {
    case 'openai': return process.env.OPENAI_API_KEY;
    case 'groq': return process.env.GROQ_API_KEY;
    case 'openrouter': return process.env.OPENROUTER_API_KEY;
    case 'litellm': return process.env.LITELLM_API_KEY;
    default: return undefined;
  }
}
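
If that helper is adopted, it is also straightforward to unit test with Node's built-in test runner. A sketch (assuming ESM and Node 18+; the standalone copy of the mapping with an injected `env` parameter is purely for illustration):

```js
// resolve-api-key.test.js: sketch; the helper below mirrors the one proposed above,
// with env injected so the test does not mutate process.env.
import { test } from 'node:test';
import assert from 'node:assert/strict';

function resolveApiKey(provider, env = process.env) {
  switch ((provider || '').toLowerCase()) {
    case 'openai': return env.OPENAI_API_KEY;
    case 'groq': return env.GROQ_API_KEY;
    case 'openrouter': return env.OPENROUTER_API_KEY;
    case 'litellm': return env.LITELLM_API_KEY;
    default: return undefined;
  }
}

test('resolveApiKey maps each provider to its own env key', () => {
  const env = { OPENAI_API_KEY: 'sk-openai', GROQ_API_KEY: 'gsk-groq' };
  assert.equal(resolveApiKey('openai', env), 'sk-openai');
  assert.equal(resolveApiKey('Groq', env), 'gsk-groq');
  assert.equal(resolveApiKey('unknown', env), undefined);
});
```
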
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c585f1d and ff99a7d.

📒 Files selected for processing (4)
  • eval-server/nodejs/CLAUDE.md (4 hunks)
  • eval-server/nodejs/README.md (1 hunks)
  • eval-server/nodejs/src/api-server.js (6 hunks)
  • eval-server/nodejs/src/client-manager.js (1 hunks)
🧰 Additional context used
🧠 Learnings (7)
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/rpc-client.js : RPC client must call Evaluate(request: string) -> string on connected agents

Applied to files:

  • eval-server/nodejs/src/api-server.js
  • eval-server/nodejs/CLAUDE.md
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/evaluator.js : LLM evaluator must integrate with the OpenAI API to judge agent responses

Applied to files:

  • eval-server/nodejs/CLAUDE.md
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/rpc-client.js : RPC client must implement JSON-RPC 2.0 protocol

Applied to files:

  • eval-server/nodejs/CLAUDE.md
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/server.js : WebSocket server must accept agent connections, manage agent lifecycle (connect/ready/disconnect), orchestrate evaluation sessions, and handle bidirectional RPC communication

Applied to files:

  • eval-server/nodejs/CLAUDE.md
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/rpc-client.js : RPC client must correlate requests and responses using unique IDs

Applied to files:

  • eval-server/nodejs/CLAUDE.md
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/rpc-client.js : RPC client must handle timeouts and error conditions

Applied to files:

  • eval-server/nodejs/CLAUDE.md
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/config.js : Require OPENAI_API_KEY and PORT (default 8080) from environment configuration

Applied to files:

  • eval-server/nodejs/CLAUDE.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: claude-review
🔇 Additional comments (3)
eval-server/nodejs/CLAUDE.md (2)

110-158: Protocol shape drift: nested vs flat model config across docs.

MODEL-CONFIGS.md shows flat override; this file shows nested per‑tier. Pick one canonical schema and reference the other as legacy/compat with clear conversion rules.


46-51: configure_llm RPC handler present — front-end protocol wired; no server->agent calls found

EvalServer implements handleConfigureLLM (eval-server/nodejs/src/lib/EvalServer.js) and the front-end protocol declares configure_llm (front_end/panels/ai_chat/evaluation/remote/EvaluationProtocol.ts); RpcClient.callMethod is available, but I did not find any server-side outgoing call to 'configure_llm' (only 'evaluate'). If the server is meant to push LLM configs to agents, add a connection.rpcClient.callMethod(..., 'configure_llm', ...) or update the docs to reflect client-initiated configuration.

eval-server/nodejs/src/api-server.js (1)

270-273: Mask API keys before logging nestedModelConfig.

You’re logging plaintext api_key values.

Apply this diff:

-      logger.info('Processing responses request:', {
-        input: requestBody.input,
-        modelConfig: nestedModelConfig
-      });
+      const redactTier = (t) => t ? { ...t, api_key: t.api_key ? `${String(t.api_key).slice(0,4)}…REDACTED` : undefined } : t;
+      logger.info('Processing responses request:', {
+        input: requestBody.input,
+        modelConfig: {
+          main_model: redactTier(nestedModelConfig.main_model),
+          mini_model: redactTier(nestedModelConfig.mini_model),
+          nano_model: redactTier(nestedModelConfig.nano_model),
+        }
+      });

Comment on lines +160 to +181
### Dynamic LLM Configuration

The server supports runtime LLM configuration via the `configure_llm` JSON-RPC method:

```json
{
  "jsonrpc": "2.0",
  "method": "configure_llm",
  "params": {
    "provider": "openai|groq|openrouter|litellm",
    "apiKey": "your-api-key",
    "endpoint": "endpoint-url-for-litellm",
    "models": {
      "main": "main-model-name",
      "mini": "mini-model-name",
      "nano": "nano-model-name"
    },
    "partial": false
  },
  "id": "config-request-id"
}
```

⚠️ Potential issue

🧩 Analysis chain

Clarify partial update semantics for configure_llm.

Document which fields are merge‑patched when partial: true and server‑side validation/error behavior for missing credentials per provider.


🏁 Script executed:

#!/bin/bash
rg -n -C2 --type=js --type=ts 'handleConfigureLLM|partial\s*[:=]\s*true|validateConfiguration' eval-server/nodejs

Length of output: 811


🏁 Script executed:

#!/bin/bash
sed -n '320,460p' eval-server/nodejs/src/lib/EvalServer.js

Length of output: 3888


Fix partial-update semantics and document configure_llm behavior.

  • In handleConfigureLLM (eval-server/nodejs/src/lib/EvalServer.js — starts at line 384) the code validates provider and models.main before merging; this rejects legitimate partial updates that omit those fields. Change to merge params into the existing config when partial: true, then validate the resulting config (ensure merged.provider ∈ supportedProviders and merged.models.main exists).
  • If a partial update changes provider, validate that required credentials for the new provider are present after merge (use CONFIG.providers[provider] as fallback) and return a clear error listing any missing credentials.
  • Update eval-server/nodejs/CLAUDE.md (configure_llm section) to explicitly state which fields are merge-patched (provider, apiKey, endpoint, models.*) and the server-side validation/error rules for missing credentials per provider.
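
A minimal sketch of that merge-then-validate flow (illustrative only: the standalone function shape, parameter names, and error messages are assumptions, with the contents of CONFIG.providers passed in as plain arguments rather than read from the real config):

```js
// Sketch: merge a configure_llm request into the existing per-connection config,
// then validate the merged result rather than the raw params.
function mergeLlmConfig(current = {}, params, supportedProviders, providerDefaults = {}) {
  const { provider, apiKey, endpoint, models, partial } = params;

  // A partial update only overwrites the fields that were actually sent.
  const merged = partial
    ? {
        ...current,
        ...(provider !== undefined && { provider }),
        ...(apiKey !== undefined && { apiKey }),
        ...(endpoint !== undefined && { endpoint }),
        models: { ...(current.models || {}), ...(models || {}) },
      }
    : { provider, apiKey, endpoint, models: models || {} };

  // Validate the merged config, not the incoming params.
  if (!supportedProviders.includes(merged.provider)) {
    throw new Error(`Unsupported provider: ${merged.provider}`);
  }
  if (!merged.models.main) {
    throw new Error('models.main is required');
  }
  if (!merged.apiKey && !providerDefaults[merged.provider]?.apiKey) {
    throw new Error(`Missing credentials for provider: ${merged.provider}`);
  }
  return merged;
}
```
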
🤖 Prompt for AI Agents
In eval-server/nodejs/src/lib/EvalServer.js around line 384 and
eval-server/nodejs/CLAUDE.md lines 160–181, change handleConfigureLLM so that
when params.partial === true it first merges the incoming params into the
current saved LLM config (provider, apiKey, endpoint, models.*) and then
validates the merged config (ensure merged.provider is in supportedProviders and
merged.models.main exists); if the merged provider differs from the previous
provider, validate that all required credentials for the new provider are
present using CONFIG.providers[provider] as fallbacks and return a clear error
listing any missing credential keys, and ensure non-partial updates still
require provider and models.main; also update CLAUDE.md configure_llm section to
document that provider, apiKey, endpoint, and models.* are merge-patched on
partial updates and to list the server-side validation and per-provider
missing-credential error rules.

Comment on lines +338 to +350
  createDefaultModelConfig(tier, defaults) {
    const defaultModels = {
      main: defaults.main_model || 'gpt-4',
      mini: defaults.mini_model || 'gpt-4-mini',
      nano: defaults.nano_model || 'gpt-3.5-turbo'
    };

    return {
      provider: defaults.provider || 'openai',
      model: defaultModels[tier],
      api_key: process.env.OPENAI_API_KEY
    };
  }

⚠️ Potential issue

Provider‑specific API key resolution (current code hardcodes OPENAI key).

Using OPENAI_API_KEY for non‑OpenAI providers will fail authentication.

Apply this diff:

-    return {
-      provider: defaults.provider || 'openai',
-      model: defaultModels[tier],
-      api_key: process.env.OPENAI_API_KEY
-    };
+    const provider = (defaults.provider || 'openai').toLowerCase();
+    const apiKey = this.resolveApiKey(provider);
+    const tierConfig = {
+      provider,
+      model: defaultModels[tier],
+      api_key: apiKey,
+    };
+    // Include endpoint for LiteLLM if available
+    if (provider === 'litellm' && process.env.LITELLM_ENDPOINT) {
+      tierConfig.endpoint = process.env.LITELLM_ENDPOINT;
+    }
+    return tierConfig;

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In eval-server/nodejs/src/api-server.js around lines 338 to 350, the function
always uses process.env.OPENAI_API_KEY which breaks authentication for
non‑OpenAI providers; update the function to prefer an explicit defaults.api_key
if provided, otherwise resolve the key from an environment variable mapped to
the chosen provider (e.g. for provider 'openai' use process.env.OPENAI_API_KEY,
for others use process.env[provider.toUpperCase() + '_API_KEY'] or a small
explicit provider→env map), and return that resolved api_key instead of the
hardcoded OPENAI variable so provider-specific keys are used correctly.

Comment on lines 61 to 64
      // API model override takes precedence over everything
-     return {
-       ...(this.configDefaults?.model || {}), // Use config as base
-       ...apiModelOverride // API overrides everything
-     };
+     // Ensure nested format is used
+     return apiModelOverride;
    }

⚠️ Potential issue

Normalize to the nested model schema before returning API override.

Downstream now expects nested main_model/mini_model/nano_model objects. Returning the override verbatim risks schema mismatches when callers pass flat/legacy shapes.

Apply this diff:

-      // Ensure nested format is used
-      return apiModelOverride;
+      // Normalize to nested format
+      return this.ensureNestedModelConfig(apiModelOverride);

Add this helper to the class (outside this method):

// Normalizes legacy shapes into the canonical nested format.
ensureNestedModelConfig(model) {
  if (!model) return {};
  // Already nested
  if (model.main_model || model.mini_model || model.nano_model) {
    return {
      main_model: model.main_model || {},
      mini_model: model.mini_model || {},
      nano_model: model.nano_model || {},
    };
  }
  // Flat string -> main only
  if (typeof model === 'string') {
    return { main_model: { provider: 'openai', model, api_key: process.env.OPENAI_API_KEY }, mini_model: {}, nano_model: {} };
  }
  // Flat object { provider?, model?, api_key? } -> main only
  if (typeof model === 'object' && (model.model || model.provider || model.api_key)) {
    return {
      main_model: {
        provider: model.provider || 'openai',
        model: model.model,
        api_key: model.api_key,
        endpoint: model.endpoint, // if present
      },
      mini_model: {},
      nano_model: {},
    };
  }
  return {};
}
🤖 Prompt for AI Agents
In eval-server/nodejs/src/client-manager.js around lines 61 to 64, the method
currently returns apiModelOverride verbatim which can be a flat/legacy shape;
normalize overrides to the nested main_model/mini_model/nano_model schema before
returning. Add the provided ensureNestedModelConfig helper to the class (outside
the method) and call it on apiModelOverride so that strings, flat objects, and
already-nested objects are transformed into the canonical nested structure,
preserving fields like provider, model, api_key, and endpoint as described.

Comment on lines +69 to 72
      // Test YAML model takes precedence
      // Ensure nested format is returned
      return testModel;
    }

🛠️ Refactor suggestion

Normalize test YAML model to nested schema.

Same issue as API override: return a normalized nested shape to avoid downstream breakage.

Apply this diff:

-      // Ensure nested format is returned
-      return testModel;
+      // Normalize to nested format
+      return this.ensureNestedModelConfig(testModel);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-      // Test YAML model takes precedence
-      // Ensure nested format is returned
-      return testModel;
-    }
+      // Test YAML model takes precedence
+      // Normalize to nested format
+      return this.ensureNestedModelConfig(testModel);
+    }
🤖 Prompt for AI Agents
In eval-server/nodejs/src/client-manager.js around lines 69 to 72, the function
currently returns testModel directly which may be in a non-nested shape; update
the code to convert/normalize testModel into the expected nested schema before
returning to avoid downstream breakage. Call the existing normalization utility
(or add a small helper) that transforms the flat/override shape into the nested
structure the API expects, validate required nested keys exist, and return the
normalized object instead of the raw testModel. Ensure you import or define the
normalizer in this file and keep immutability by returning a new object.

…ial=true, making partial updates unusable. Validate only fields provided for partial updates; require full fields only for full updates.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
eval-server/nodejs/src/api-server.js (1)

258-266: Broaden input validation to accept Responses API shapes (string | message array | content-part arrays)

The Responses API accepts a top-level string OR an array of message objects; each message.content may be a string or an array of typed content-part objects. The current string-only check in eval-server/nodejs/src/api-server.js (lines 258–266) will reject valid inputs — update validation to accept typeof input === 'string' OR Array.isArray(input) OR (typeof input === 'object' && input !== null), and ensure message.content is either a string or an array of content parts.
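
A small validation sketch along those lines (the function name and placement are assumptions, not the existing code):

```js
// Sketch: accept the Responses API input shapes described above.
function isValidResponsesInput(input) {
  if (typeof input === 'string') return true;

  const isMessage = m =>
    m !== null && typeof m === 'object' &&
    (typeof m.content === 'string' || Array.isArray(m.content));

  if (Array.isArray(input)) return input.every(isMessage);
  return isMessage(input);
}
```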

eval-server/nodejs/src/lib/EvalServer.js (1)

262-279: Fix JSON‑RPC detection: don’t use truthiness for result/error/id

Responses with result=0/'' or id=0 will be misclassified. Use property existence checks and validate method type.

Apply this diff:

-      // Handle RPC responses
-      if (data.jsonrpc === '2.0' && (data.result || data.error) && data.id) {
+      // Handle RPC responses
+      const has = (k) => Object.prototype.hasOwnProperty.call(data, k);
+      if (data.jsonrpc === '2.0' && (has('result') || has('error')) && has('id')) {
         if (connection.rpcClient.handleResponse(message)) {
           return;
         }
         logger.debug('RPC response could not be handled', {
           connectionId: connection.id,
           clientId: connection.clientId,
           id: data.id
         });
         return;
       }
 
-      // Handle RPC requests from client to server
-      if (data.jsonrpc === '2.0' && data.method && data.id) {
+      // Handle RPC requests from client to server (requests have an id; notifications do not)
+      if (data.jsonrpc === '2.0' && typeof data.method === 'string' && has('id')) {
         await this.handleRpcRequest(connection, data);
         return;
       }
🧹 Nitpick comments (4)
eval-server/nodejs/src/api-server.js (1)

300-302: Avoid logging full evaluation results; log shape/keys instead

Reduce risk of logging large/PII payloads.

Apply this diff:

-      logger.debug('executeEvaluation result:', result);
+      logger.debug('executeEvaluation result keys:', typeof result === 'object' && result ? Object.keys(result) : typeof result);
eval-server/nodejs/src/lib/EvalServer.js (3)

369-387: Return appropriate JSON‑RPC error codes (not always -32603)

Map validation errors to -32602 and auth errors to -32001; reserve -32603 for true internal errors.

Apply this diff:

     } catch (error) {
       logger.error('RPC request failed', {
         connectionId: connection.id,
         clientId: connection.clientId,
         method: request.method,
         requestId: request.id,
         error: error.message
       });
 
       // Send error response
       this.sendMessage(connection.ws, {
         jsonrpc: '2.0',
-        error: {
-          code: -32603, // Internal error
-          message: error.message
-        },
+        error: {
+          // Prefer provided error.code if present; otherwise classify
+          code:
+            (Number.isInteger(error?.code) && error.code) ||
+            (/invalid param|unsupported|required|missing/i.test(error.message) ? -32602 : // Invalid params
+             /unauthorized|forbidden|auth/i.test(error.message) ? -32001 :                // Unauthorized
+             -32603),                                                                     // Internal error
+          message: error.message
+        },
         id: request.id
       });
     }

401-402: Avoid hard‑coding providers; derive from config

Use keys of CONFIG.providers to prevent drift.

Apply this diff:

-    const supportedProviders = ['openai', 'litellm', 'groq', 'openrouter'];
+    const supportedProviders = Object.keys(CONFIG.providers || {});

719-747: Add fallback to CONFIG.defaults when no per‑connection config is present

Comment says “otherwise defaults” but code doesn’t apply them.

Apply this diff:

-      let modelConfig = evaluation.model || {};
+      let modelConfig = evaluation.model || {};
 
-      if (connection.llmConfig) {
+      if (connection.llmConfig) {
         // New nested format: separate config objects for each model tier
         modelConfig = {
           main_model: {
             provider: connection.llmConfig.provider,
             model: connection.llmConfig.models.main,
             api_key: connection.llmConfig.apiKey,
             endpoint: connection.llmConfig.endpoint
           },
           mini_model: {
             provider: connection.llmConfig.provider,
             model: connection.llmConfig.models.mini,
             api_key: connection.llmConfig.apiKey,
             endpoint: connection.llmConfig.endpoint
           },
           nano_model: {
             provider: connection.llmConfig.provider,
             model: connection.llmConfig.models.nano,
             api_key: connection.llmConfig.apiKey,
             endpoint: connection.llmConfig.endpoint
           },
           // Include any evaluation-specific overrides
           ...modelConfig
         };
+      } else if (!modelConfig || Object.keys(modelConfig).length === 0) {
+        // Fallback to server defaults
+        const d = CONFIG.defaults || {};
+        const provider = d.provider;
+        modelConfig = {
+          main_model: { provider, model: d.mainModel },
+          mini_model: { provider, model: d.miniModel },
+          nano_model: { provider, model: d.nanoModel }
+        };
       }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ff99a7d and e8b37a3.

📒 Files selected for processing (2)
  • eval-server/nodejs/src/api-server.js (6 hunks)
  • eval-server/nodejs/src/lib/EvalServer.js (5 hunks)
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/rpc-client.js : RPC client must implement JSON-RPC 2.0 protocol

Applied to files:

  • eval-server/nodejs/src/lib/EvalServer.js
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/rpc-client.js : RPC client must call Evaluate(request: string) -> string on connected agents

Applied to files:

  • eval-server/nodejs/src/lib/EvalServer.js
  • eval-server/nodejs/src/api-server.js
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/server.js : WebSocket server must accept agent connections, manage agent lifecycle (connect/ready/disconnect), orchestrate evaluation sessions, and handle bidirectional RPC communication

Applied to files:

  • eval-server/nodejs/src/lib/EvalServer.js
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/rpc-client.js : RPC client must handle timeouts and error conditions

Applied to files:

  • eval-server/nodejs/src/lib/EvalServer.js
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/config.js : Require OPENAI_API_KEY and PORT (default 8080) from environment configuration

Applied to files:

  • eval-server/nodejs/src/api-server.js
🧬 Code graph analysis (1)
eval-server/nodejs/src/lib/EvalServer.js (1)
eval-server/nodejs/src/config.js (2)
  • CONFIG (5-63)
  • CONFIG (5-63)
🔇 Additional comments (7)
eval-server/nodejs/src/api-server.js (2)

270-281: Good: API keys are now redacted before logging

Sanitized per-tier configs are logged; avoids secret leakage. Matches prior review guidance.


346-358: Use provider‑specific API keys and consistent default model ids

Hardcoding OPENAI_API_KEY breaks non‑OpenAI providers. Also align fallbacks with the 4.1 defaults used elsewhere.

Apply this diff:

   createDefaultModelConfig(tier, defaults) {
-    const defaultModels = {
-      main: defaults.main_model || 'gpt-4',
-      mini: defaults.mini_model || 'gpt-4-mini',
-      nano: defaults.nano_model || 'gpt-3.5-turbo'
-    };
-
-    return {
-      provider: defaults.provider || 'openai',
-      model: defaultModels[tier],
-      api_key: process.env.OPENAI_API_KEY
-    };
+    const defaultModels = {
+      main: defaults.main_model || 'gpt-4.1',
+      mini: defaults.mini_model || 'gpt-4.1-mini',
+      nano: defaults.nano_model || 'gpt-4.1-nano',
+    };
+    const provider = (defaults.provider || 'openai').toLowerCase();
+    const envMap = {
+      openai: 'OPENAI_API_KEY',
+      anthropic: 'ANTHROPIC_API_KEY',
+      google: 'GOOGLE_API_KEY',
+      gemini: 'GEMINI_API_KEY',
+      litellm: 'LITELLM_API_KEY',
+    };
+    const envVar = envMap[provider] || `${provider.toUpperCase()}_API_KEY`;
+    const apiKey = defaults.api_key || process.env[envVar] || process.env.OPENAI_API_KEY;
+    const tierConfig = {
+      provider,
+      model: defaultModels[tier],
+      api_key: apiKey,
+    };
+    if (provider === 'litellm' && process.env.LITELLM_ENDPOINT) {
+      tierConfig.endpoint = process.env.LITELLM_ENDPOINT;
+    }
+    return tierConfig;
   }
eval-server/nodejs/src/lib/EvalServer.js (5)

199-205: Accessor looks good

Public getClientManager() is a useful addition for external orchestration.


345-360: Unknown method now uses JSON‑RPC -32601 (Method not found)

Spec‑compliant handling and early return look good.


725-742: Confirm sending API keys to clients is acceptable

api_key is included in the RPC payload to the client. Ensure clients are fully trusted and this aligns with your threat model.


758-774: Timeout default mismatch (params 30s vs RPC wait 45s); unify to CONFIG.rpc.timeout

Use a single source of truth to avoid premature timeouts.

Apply this diff:

-          model: modelConfig,
-          timeout: evaluation.timeout || 30000,
+          model: modelConfig,
+          timeout: evaluation.timeout ?? CONFIG.rpc.timeout,
-        evaluation.timeout || 45000
+        evaluation.timeout ?? CONFIG.rpc.timeout

398-418: Partial updates no longer require provider/models — good fix

Validation now respects partial=true semantics.

Comment on lines +320 to 337
  processNestedModelConfig(requestBody) {
    const defaults = this.configDefaults?.model || {};

    // If nested format is provided, use it directly with fallbacks
    if (requestBody.model) {
      return {
        main_model: requestBody.model.main_model || this.createDefaultModelConfig('main', defaults),
        mini_model: requestBody.model.mini_model || this.createDefaultModelConfig('mini', defaults),
        nano_model: requestBody.model.nano_model || this.createDefaultModelConfig('nano', defaults)
      };
    }

    // No model config provided, use defaults
    return {
-     main_model: requestBody.main_model || defaults.main_model || 'gpt-4.1',
-     mini_model: requestBody.mini_model || defaults.mini_model || 'gpt-4.1-mini',
-     nano_model: requestBody.nano_model || defaults.nano_model || 'gpt-4.1-nano',
-     provider: requestBody.provider || defaults.provider || 'openai'
+     main_model: this.createDefaultModelConfig('main', defaults),
+     mini_model: this.createDefaultModelConfig('mini', defaults),
+     nano_model: this.createDefaultModelConfig('nano', defaults)
    };

⚠️ Potential issue

Normalize and validate nested model tiers; handle string tier specs

If a tier is provided as a string (model id), the current code passes the string through, violating expected object shape and risking downstream breakage. Normalize strings to objects and merge partial objects with defaults.

Apply this diff:

-  processNestedModelConfig(requestBody) {
-    const defaults = this.configDefaults?.model || {};
-
-    // If nested format is provided, use it directly with fallbacks
-    if (requestBody.model) {
-      return {
-        main_model: requestBody.model.main_model || this.createDefaultModelConfig('main', defaults),
-        mini_model: requestBody.model.mini_model || this.createDefaultModelConfig('mini', defaults),
-        nano_model: requestBody.model.nano_model || this.createDefaultModelConfig('nano', defaults)
-      };
-    }
-
-    // No model config provided, use defaults
-    return {
-      main_model: this.createDefaultModelConfig('main', defaults),
-      mini_model: this.createDefaultModelConfig('mini', defaults),
-      nano_model: this.createDefaultModelConfig('nano', defaults)
-    };
-  }
+  processNestedModelConfig(requestBody) {
+    const defaults = this.configDefaults?.model || {};
+
+    const normalizeTier = (tierValue, tierName) => {
+      if (!tierValue) return this.createDefaultModelConfig(tierName, defaults);
+      // String => treat as model id; merge with defaults
+      if (typeof tierValue === 'string') {
+        const base = this.createDefaultModelConfig(tierName, defaults);
+        return { ...base, model: tierValue, provider: (base.provider || 'openai').toLowerCase() };
+      }
+      // Object => merge with defaults; normalize provider
+      if (typeof tierValue === 'object') {
+        const base = this.createDefaultModelConfig(tierName, defaults);
+        const provider = (tierValue.provider || base.provider || defaults.provider || 'openai').toLowerCase();
+        return { ...base, ...tierValue, provider };
+      }
+      return this.createDefaultModelConfig(tierName, defaults);
+    };
+
+    if (requestBody.model) {
+      return {
+        main_model: normalizeTier(requestBody.model.main_model, 'main'),
+        mini_model: normalizeTier(requestBody.model.mini_model, 'mini'),
+        nano_model: normalizeTier(requestBody.model.nano_model, 'nano'),
+      };
+    }
+
+    return {
+      main_model: this.createDefaultModelConfig('main', defaults),
+      mini_model: this.createDefaultModelConfig('mini', defaults),
+      nano_model: this.createDefaultModelConfig('nano', defaults),
+    };
+  }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-  processNestedModelConfig(requestBody) {
-    const defaults = this.configDefaults?.model || {};
-    // If nested format is provided, use it directly with fallbacks
-    if (requestBody.model) {
-      return {
-        main_model: requestBody.model.main_model || this.createDefaultModelConfig('main', defaults),
-        mini_model: requestBody.model.mini_model || this.createDefaultModelConfig('mini', defaults),
-        nano_model: requestBody.model.nano_model || this.createDefaultModelConfig('nano', defaults)
-      };
-    }
-    // No model config provided, use defaults
-    return {
-      main_model: requestBody.main_model || defaults.main_model || 'gpt-4.1',
-      mini_model: requestBody.mini_model || defaults.mini_model || 'gpt-4.1-mini',
-      nano_model: requestBody.nano_model || defaults.nano_model || 'gpt-4.1-nano',
-      provider: requestBody.provider || defaults.provider || 'openai'
-      main_model: this.createDefaultModelConfig('main', defaults),
-      mini_model: this.createDefaultModelConfig('mini', defaults),
-      nano_model: this.createDefaultModelConfig('nano', defaults)
-    };
+  processNestedModelConfig(requestBody) {
+    const defaults = this.configDefaults?.model || {};
+
+    const normalizeTier = (tierValue, tierName) => {
+      if (!tierValue) return this.createDefaultModelConfig(tierName, defaults);
+      // String => treat as model id; merge with defaults
+      if (typeof tierValue === 'string') {
+        const base = this.createDefaultModelConfig(tierName, defaults);
+        return { ...base, model: tierValue, provider: (base.provider || 'openai').toLowerCase() };
+      }
+      // Object => merge with defaults; normalize provider
+      if (typeof tierValue === 'object') {
+        const base = this.createDefaultModelConfig(tierName, defaults);
+        const provider = (tierValue.provider || base.provider || defaults.provider || 'openai').toLowerCase();
+        return { ...base, ...tierValue, provider };
+      }
+      return this.createDefaultModelConfig(tierName, defaults);
+    };
+
+    if (requestBody.model) {
+      return {
+        main_model: normalizeTier(requestBody.model.main_model, 'main'),
+        mini_model: normalizeTier(requestBody.model.mini_model, 'mini'),
+        nano_model: normalizeTier(requestBody.model.nano_model, 'nano'),
+      };
+    }
+
+    return {
+      main_model: this.createDefaultModelConfig('main', defaults),
+      mini_model: this.createDefaultModelConfig('mini', defaults),
+      nano_model: this.createDefaultModelConfig('nano', defaults),
+    };
+  }

Comment on lines +379 to 401
  createDynamicEvaluationNested(input, nestedModelConfig) {
    const evaluationId = `api-eval-${Date.now()}-${Math.random().toString(36).substring(2, 9)}`;

    return {
      id: evaluationId,
-     name: 'OpenAI API Request',
-     description: 'Dynamic evaluation created from OpenAI Responses API request',
+     name: 'API Request',
+     description: 'Dynamic evaluation created from API request',
      enabled: true,
      tool: 'chat',
      timeout: 1500000, // 25 minutes
      input: {
-       message: input,
-       reasoning: 'API request processing'
+       message: input
      },
-     model: modelConfig,
+     model: nestedModelConfig,
      validation: {
        type: 'none' // No validation needed for API responses
      },
      metadata: {
-       tags: ['api', 'dynamic', 'openai-responses'],
+       tags: ['api', 'dynamic'],
        priority: 'high',
-       source: 'openai-api'
+       source: 'api'
      }
    };

⚠️ Potential issue

🧩 Analysis chain

Avoid persisting secrets in evaluation.model

Ensure api_key values in nestedModelConfig aren’t persisted to disk or long‑term stores. If they must be included for dispatch, strip them before storage or mask in any persisted representation.


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "Searching for api_key usage and potential persistence..."
rg -n -C2 '\bapi_key\b'
echo "Searching for file writes near evaluation data..."
rg -n -C3 -P '(writeFile|writeFileSync|appendFile|appendFileSync|fs\.write)' | sed -n '1,200p'
echo "Searching for evaluation serialization..."
rg -n -C3 -P 'JSON\.stringify\(.{0,120}(evaluation|evaluations)' 

Length of output: 33386


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "=== Showing eval-server/api-server.js (head -> 520) ==="
sed -n '1,520p' eval-server/nodejs/src/api-server.js || true

echo
echo "=== Searching for createDynamicEvaluationNested references ==="
rg -n "createDynamicEvaluationNested" || true

echo
echo "=== Search for evaluation endpoints and handlers in eval-server ==="
rg -n --hidden -S --no-ignore --pc '\/evaluate|\/evaluations|\bevaluate\b|\bevaluations\b' eval-server/nodejs/src || true

echo
echo "=== Searching for persistence (file/db/JSON.stringify) in eval-server/src ==="
rg -n --hidden -S --no-ignore 'writeFile|writeFileSync|appendFile|appendFileSync|fs\.write|JSON\.stringify|db|database|lowdb|sqlite|level|nedb|redis' eval-server/nodejs/src || true

Length of output: 16667


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "=== client-manager.js (around write) ==="
sed -n '1,420p' eval-server/nodejs/src/client-manager.js || true

echo
echo "=== EvalServer.js (around executeEvaluation / send to client) ==="
sed -n '1,980p' eval-server/nodejs/src/lib/EvalServer.js || true

echo
echo "=== Search for where evaluations are persisted/saved in client-manager or eval-server ==="
rg -n "save|persist|write.*evaluation|evaluations|yaml" eval-server/nodejs/src || true

Length of output: 56744


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "=== logger.js ==="
sed -n '1,240p' eval-server/nodejs/src/logger.js || true

echo
echo "=== rpc-client.js ==="
sed -n '1,260p' eval-server/nodejs/src/rpc-client.js || true

echo
echo "=== config.js ==="
sed -n '1,240p' eval-server/nodejs/src/config.js || true

echo
echo "=== Search logger for evaluations.jsonl writes or logEvaluation implementation ==="
rg -n "evaluations.jsonl|logEvaluation|evaluation" eval-server/nodejs/src/logger.js || true

Length of output: 9025


Do not persist API keys in evaluation.model — strip or redact before any logging/storage

createDynamicEvaluationNested can include nestedModelConfig.api_key and the server currently logs/saves RPC params and evaluation entries unredacted.

  • eval-server/nodejs/src/api-server.js — strip model.*.api_key (or set to undefined) before returning/dispatching the dynamic evaluation (createDynamicEvaluationNested / processNestedModelConfig).
  • eval-server/nodejs/src/lib/EvalServer.js — when assembling modelConfig, avoid merging raw api keys into objects that are logged or retained; keep keys only in-memory for dispatch and never attach them to evaluation objects.
  • eval-server/nodejs/src/rpc-client.js — sanitize/mask sensitive fields in params before calling logRpcCall (and any other log calls).
  • eval-server/nodejs/src/logger.js — redact sensitive keys in logEvaluation / evaluationLogger.info so evaluations.jsonl and combined logs never contain full api keys.
  • Add a single sanitize utility (keys: api_key, apiKey, token, secret, authorization, password) and use consistent masking (e.g. first 4 chars + "...") across the above locations.
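
One possible shape for that shared utility, following the key list and masking format in the bullet above (a sketch, not the project's actual helper):

```js
// Sketch: one sanitize helper reused by api-server, EvalServer, rpc-client and logger.
const SENSITIVE_KEYS = new Set(['api_key', 'apikey', 'token', 'secret', 'authorization', 'password']);

function sanitize(value) {
  if (Array.isArray(value)) return value.map(sanitize);
  if (value && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([k, v]) =>
        SENSITIVE_KEYS.has(k.toLowerCase())
          ? [k, typeof v === 'string' && v ? `${v.slice(0, 4)}...` : v]
          : [k, sanitize(v)]
      )
    );
  }
  return value;
}

// Example: sanitize({ model: { api_key: 'sk-abc123456' } })
// -> { model: { api_key: 'sk-a...' } }
```
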
🤖 Prompt for AI Agents
In eval-server/nodejs/src/api-server.js around lines 379 to 401,
createDynamicEvaluationNested currently can return nestedModelConfig that
includes sensitive keys (e.g. api_key) which may be logged/stored; strip or
redact any sensitive fields (api_key, apiKey, token, secret, authorization,
password) from nestedModelConfig before returning or attaching it to the
evaluation object (set to undefined or replace with masked value like first 4
chars + "..."). Also add a shared sanitize utility used across
eval-server/nodejs/src/lib/EvalServer.js, eval-server/nodejs/src/rpc-client.js
and eval-server/nodejs/src/logger.js to mask those keys consistently (keep raw
keys in-memory only for dispatching, never merge them into persisted evaluation
objects or log entries), ensure rpc-client sanitizes params before logRpcCall,
and ensure logger.redacts sensitive fields before writing evaluations.jsonl or
other logs.

Comment on lines +426 to +437
    if (partial && connection.llmConfig) {
      // Partial update - merge with existing config
      connection.llmConfig = {
        ...connection.llmConfig,
        provider: provider || connection.llmConfig.provider,
        apiKey: apiKey || connection.llmConfig.apiKey,
        endpoint: endpoint || connection.llmConfig.endpoint,
        models: {
          ...connection.llmConfig.models,
          ...models
        }
      };

⚠️ Potential issue

Partial update: spreading possibly undefined models will throw; also allow explicit clearing of secrets

Guard the spread and support null to clear apiKey/endpoint. Current || prevents clearing.

Apply this diff:

-    if (partial && connection.llmConfig) {
+    if (partial && connection.llmConfig) {
       // Partial update - merge with existing config
       connection.llmConfig = {
         ...connection.llmConfig,
-        provider: provider || connection.llmConfig.provider,
-        apiKey: apiKey || connection.llmConfig.apiKey,
-        endpoint: endpoint || connection.llmConfig.endpoint,
+        provider: provider ?? connection.llmConfig.provider,
+        // Allow explicit clearing with null; keep existing when undefined
+        apiKey: apiKey === null ? undefined : (apiKey ?? connection.llmConfig.apiKey),
+        endpoint: endpoint === null ? undefined : (endpoint ?? connection.llmConfig.endpoint),
         models: {
-          ...connection.llmConfig.models,
-          ...models
+          ...(connection.llmConfig.models || {}),
+          ...(models || {})
         }
       };
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    if (partial && connection.llmConfig) {
-      // Partial update - merge with existing config
-      connection.llmConfig = {
-        ...connection.llmConfig,
-        provider: provider || connection.llmConfig.provider,
-        apiKey: apiKey || connection.llmConfig.apiKey,
-        endpoint: endpoint || connection.llmConfig.endpoint,
-        models: {
-          ...connection.llmConfig.models,
-          ...models
-        }
-      };
+    if (partial && connection.llmConfig) {
+      // Partial update - merge with existing config
+      connection.llmConfig = {
+        ...connection.llmConfig,
+        provider: provider ?? connection.llmConfig.provider,
+        // Allow explicit clearing with null; keep existing when undefined
+        apiKey: apiKey === null ? undefined : (apiKey ?? connection.llmConfig.apiKey),
+        endpoint: endpoint === null ? undefined : (endpoint ?? connection.llmConfig.endpoint),
+        models: {
+          ...(connection.llmConfig.models || {}),
+          ...(models || {})
+        }
+      };
🤖 Prompt for AI Agents
In eval-server/nodejs/src/lib/EvalServer.js around lines 426 to 437, the
partial-update code unconditionally spreads models (which may be undefined) and
uses falsy checks for apiKey/endpoint/provider preventing explicit clearing;
change the merge to only spread models when models is not undefined, treat
models === null as an explicit clear (set to {}), and replace the current "||"
checks with explicit undefined checks so that apiKey/endpoint/provider are
retained only when the incoming value is undefined (allowing null to clear
secrets). Build the new connection.llmConfig by copying existing fields, setting
provider = provider === undefined ? connection.llmConfig.provider : provider,
apiKey = apiKey === undefined ? connection.llmConfig.apiKey : apiKey, endpoint =
endpoint === undefined ? connection.llmConfig.endpoint : endpoint, and for
models: if models === undefined use connection.llmConfig.models, else if models
=== null set models to {}, else merge {...connection.llmConfig.models,
...models}.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts (1)

749-766: Timeout race can double‑update tracing and produce inconsistent results.

If the timeout fires then the tool resolves later, we still update the observation/result. Guard with a settled flag.

@@
-    return new Promise((resolve, reject) => {
-      const timer = setTimeout(() => {
+    return new Promise((resolve, reject) => {
+      let settled = false;
+      const timer = setTimeout(() => {
+        settled = true;
         // Update span with timeout error
@@
-      executePromise
+      executePromise
         .then((result: any) => {
-          clearTimeout(timer);
+          if (settled) return;
+          clearTimeout(timer);
@@
-          resolve(result);
+          settled = true;
+          resolve(result);
         })
         .catch((error: Error) => {
-          clearTimeout(timer);
+          if (settled) return;
+          clearTimeout(timer);
@@
-          reject(error);
+          settled = true;
+          reject(error);
         });

Also applies to: 772-811
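
The same guard can be factored into a reusable helper; this is a sketch under assumed names, not the PR's code:

// Wrap a promise with a timeout that cannot double-settle (mirrors the settled flag above).
function withTimeout<T>(work: Promise<T>, ms: number, onTimeout: () => void): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    let settled = false;
    const timer = setTimeout(() => {
      if (settled) return;
      settled = true;
      onTimeout();  // e.g. mark the tracing span as timed out
      reject(new Error(`Timed out after ${ms}ms`));
    }, ms);

    work.then(result => {
      if (settled) return;     // timeout already won the race
      settled = true;
      clearTimeout(timer);
      resolve(result);
    }).catch(error => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      reject(error);
    });
  });
}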

🧹 Nitpick comments (1)
front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts (1)

549-552: Call hasOverride() directly; optional chaining not needed.

Simplifies and avoids accidental undefined.

-        const hasOverride = configManager.hasOverride?.() || false;
+        const hasOverride = configManager.hasOverride();
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e8b37a3 and 3aaef1e.

📒 Files selected for processing (2)
  • front_end/panels/ai_chat/core/LLMConfigurationManager.ts (1 hunks)
  • front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts (12 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-29T23:08:07.342Z
Learnt from: CR
PR: BrowserOperator/browser-operator-core#0
File: eval-server/nodejs/CLAUDE.md:0-0
Timestamp: 2025-08-29T23:08:07.342Z
Learning: Applies to eval-server/nodejs/src/evaluator.js : LLM evaluator must integrate with the OpenAI API to judge agent responses

Applied to files:

  • front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts
🧬 Code graph analysis (2)
front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts (4)
front_end/panels/ai_chat/evaluation/remote/EvaluationProtocol.ts (6)
  • isLLMConfigurationRequest (297-299)
  • createAuthVerifyMessage (197-203)
  • LLMConfigurationRequest (259-264)
  • createErrorResponse (240-255)
  • ErrorCodes (141-155)
  • createLLMConfigurationResponse (314-334)
front_end/panels/ai_chat/core/BuildConfig.ts (1)
  • BUILD_CONFIG (11-20)
front_end/panels/ai_chat/core/AgentService.ts (1)
  • AgentService (45-910)
front_end/panels/ai_chat/core/LLMConfigurationManager.ts (3)
  • LLMConfigurationManager (42-428)
  • hasOverride (184-186)
  • LLMConfig (14-21)
front_end/panels/ai_chat/core/LLMConfigurationManager.ts (4)
front_end/panels/ai_chat/core/AgentService.ts (2)
  • logger (334-345)
  • event (810-846)
front_end/panels/ai_chat/ui/AIChatPanel.ts (14)
  • logger (971-1064)
  • logger (1372-1442)
  • logger (1592-1605)
  • logger (1649-1652)
  • window (1069-1116)
  • localStorage (836-859)
  • provider (1492-1555)
  • event (1703-1710)
  • event (1715-1719)
  • event (1724-1728)
  • event (1733-1737)
  • event (1742-1746)
  • event (1751-1755)
  • error (1961-1974)
front_end/panels/ai_chat/core/Logger.ts (1)
  • createLogger (316-318)
front_end/panels/ai_chat/LLM/LLMTypes.ts (1)
  • LLMProvider (142-142)
🔇 Additional comments (7)
front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts (3)

339-346: Stop logging secrets (model config and overrides leak API keys).

  • Line 339 logs params.model verbatim. It may contain api_key.
  • Lines 370-380 log mainModel which can be an object containing api_key.
  • Lines 574-583 log overrideConfig verbatim. This leaks secrets.

Apply these diffs to redact:

@@
-    logger.info('Received evaluation request', {
-      evaluationId: params.evaluationId,
-      tool: params.tool,
-      url: params.url,
-      isChat: params.tool === 'chat',
-      modelOverride: params.input?.ai_chat_model,
-      modelConfig: params.model
-    });
+    logger.info('Received evaluation request', {
+      evaluationId: params.evaluationId,
+      tool: params.tool,
+      url: params.url,
+      isChat: params.tool === 'chat',
+      hasModelConfig: !!params.model,
+      hasModelApiKey: !!(((params.model?.main_model as any)?.api_key) || ((params.model as any)?.api_key))
+    });
@@
-      logger.info('DEBUG: Extracted model configuration', {
-        evaluationId: params.evaluationId,
-        mainModel,
-        provider,
-        hasApiKey: !!apiKey,
-        apiKeyLength: apiKey?.length,
-        modelName,
-        miniModel,
-        nanoModel
-      });
+      logger.info('DEBUG: Extracted model configuration', {
+        evaluationId: params.evaluationId,
+        provider,
+        hasApiKey: !!apiKey,
+        modelName,
+        miniModel,
+        nanoModel
+      });
@@
-        logger.info('DEBUG: Created merged input for chat evaluation', {
-          evaluationId: params.evaluationId,
-          hasOverrideConfig: !!overrideConfig,
-          overrideConfig,
-          mergedInput: {
-            ...mergedInput,
-            api_key: mergedInput.api_key ? `${mergedInput.api_key.substring(0, 10)}...` : undefined
-          }
-        });
+        logger.info('DEBUG: Created merged input for chat evaluation', {
+          evaluationId: params.evaluationId,
+          hasOverrideConfig: !!overrideConfig,
+          mergedInput: {
+            ...mergedInput,
+            api_key: mergedInput.api_key ? '(redacted)' : undefined
+          }
+        });

Also applies to: 370-380, 574-583
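
Beyond the spot fixes above, a small redaction helper keeps future log sites safe by construction. This is a sketch rather than code from the PR, and the list of secret-looking keys is an assumption:

// Recursively replace likely secret fields before passing objects to the logger.
const SECRET_KEYS = new Set(['api_key', 'apiKey', 'authorization', 'token', 'secret']);

function redactSecrets<T>(value: T): T {
  if (Array.isArray(value)) {
    return value.map(redactSecrets) as unknown as T;
  }
  if (value && typeof value === 'object') {
    const out: Record<string, unknown> = {};
    for (const [key, val] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SECRET_KEYS.has(key) && val ? '(redacted)' : redactSecrets(val);
    }
    return out as unknown as T;
  }
  return value;
}

// Usage: logger.info('Received evaluation request', redactSecrets(params));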


1055-1112: Validate before persisting; avoid save‑then‑rollback.

Pre‑validate the merged config (respecting AUTOMATED_MODE) before calling save/apply to prevent inconsistent state and unnecessary listener churn.

Apply:

@@
-    try {
-      // Get configuration manager
-      const configManager = LLMConfigurationManager.getInstance();
-
-      // Store current config for potential rollback
-      const currentConfig = configManager.loadConfiguration();
-
-      // Handle configuration update based on partial flag
-      if (params.partial) {
-        // Use the new partial configuration method
-        const partialConfig: Partial<LLMConfig> = {};
-        if (params.provider !== undefined) partialConfig.provider = params.provider;
-        if (params.apiKey !== undefined) partialConfig.apiKey = params.apiKey;
-        if (params.endpoint !== undefined) partialConfig.endpoint = params.endpoint;
-        if (params.models?.main !== undefined) partialConfig.mainModel = params.models.main;
-        if (params.models?.mini !== undefined) partialConfig.miniModel = params.models.mini;
-        if (params.models?.nano !== undefined) partialConfig.nanoModel = params.models.nano;
-
-        configManager.applyPartialConfiguration(partialConfig);
-      } else {
-        // Full configuration update
-        const fullConfig: LLMConfig = {
-          provider: params.provider,
-          apiKey: params.apiKey,
-          endpoint: params.endpoint,
-          mainModel: params.models.main,
-          miniModel: params.models?.mini,
-          nanoModel: params.models?.nano
-        };
-        configManager.saveConfiguration(fullConfig);
-      }
-
-      // Validate the saved configuration
-      const postSaveValidation = configManager.validateConfiguration();
-
-      if (!postSaveValidation.isValid) {
-        // Restore the original config if validation fails
-        configManager.saveConfiguration(currentConfig);
-
-        // Send error response with validation errors
-        const errorResponse = createErrorResponse(
-          id,
-          ErrorCodes.INVALID_PARAMS,
-          'Invalid configuration',
-          { errors: postSaveValidation.errors }
-        );
-
-        if (this.client) {
-          this.client.send(errorResponse);
-        }
-
-        logger.error('Configuration validation failed', {
-          errors: postSaveValidation.errors
-        });
-
-        return;
-      }
+    try {
+      const configManager = LLMConfigurationManager.getInstance();
+      const currentConfig = configManager.loadConfiguration();
+
+      // Build merged config without persisting
+      const mergedConfig: LLMConfig = {
+        provider: params.partial ? (params.provider ?? currentConfig.provider) : params.provider,
+        apiKey: params.partial ? (params.apiKey ?? currentConfig.apiKey) : params.apiKey,
+        endpoint: params.partial ? (params.endpoint ?? currentConfig.endpoint) : params.endpoint,
+        mainModel: params.partial ? (params.models?.main ?? currentConfig.mainModel) : params.models.main,
+        miniModel: params.partial ? (params.models?.mini ?? currentConfig.miniModel) : params.models?.mini,
+        nanoModel: params.partial ? (params.models?.nano ?? currentConfig.nanoModel) : params.models?.nano
+      };
+
+      // Pre-validate (skip credential checks in automated mode)
+      const errors: string[] = [];
+      const provider = mergedConfig.provider as string | undefined;
+      const requiresKey = provider === 'openai' || provider === 'groq' || provider === 'openrouter';
+      if (!provider) errors.push('Provider is required');
+      if (!mergedConfig.mainModel) errors.push('Main model is required');
+      if (provider === 'litellm' && !mergedConfig.endpoint) errors.push('LiteLLM endpoint is required');
+      if (!BUILD_CONFIG.AUTOMATED_MODE && requiresKey && !mergedConfig.apiKey) {
+        errors.push(`${provider} API key is required`);
+      }
+      if (errors.length) {
+        const errorResponse = createErrorResponse(
+          id,
+          ErrorCodes.INVALID_PARAMS,
+          'Invalid configuration',
+          { errors }
+        );
+        this.client?.send(errorResponse);
+        logger.error('Configuration validation failed', { errors });
+        return;
+      }
+
+      // Persist after validation
+      configManager.saveConfiguration(mergedConfig);

282-288: Automated-mode auth bypass: confirm server-side expectations.

Bypassing auth in automated mode can admit unintended clients if misconfigured. Ensure the server enforces environment gating (e.g., local-only, Docker network) before relying on this.

Run to locate server-side gating:

front_end/panels/ai_chat/core/LLMConfigurationManager.ts (4)

299-325: Add provider validation in validateConfiguration.

Currently only checks presence; reject unsupported values early.

@@
-    // Check provider
-    if (!config.provider) {
-      errors.push('Provider is required');
-    }
+    // Check provider
+    if (!config.provider) {
+      errors.push('Provider is required');
+    } else if (!ALLOWED_PROVIDERS.includes(config.provider)) {
+      errors.push(`Unsupported provider: ${config.provider}`);
+    }

28-33: Fix storage key mismatches (MODEL_SELECTION, LITELLM_ENDPOINT).

These diverge from existing keys and will desync reads/writes.

   PROVIDER: 'ai_chat_provider',
-  MODEL_SELECTION: 'ai_chat_model_selection',
+  MODEL_SELECTION: 'selected_model',
   MINI_MODEL: 'ai_chat_mini_model',
   NANO_MODEL: 'ai_chat_nano_model',
   OPENAI_API_KEY: 'ai_chat_api_key',
-  LITELLM_ENDPOINT: 'ai_chat_litellm_endpoint',
+  LITELLM_ENDPOINT: 'litellm_endpoint',
   LITELLM_API_KEY: 'ai_chat_litellm_api_key',

26-36: Guard against invalid provider strings.

Casting stored strings straight to LLMProvider allows arbitrary values; validate against the list of allowed providers.

@@
 const STORAGE_KEYS = {
@@
 } as const;
+
+const ALLOWED_PROVIDERS: ReadonlyArray<LLMProvider> =
+  ['openai', 'litellm', 'groq', 'openrouter'] as const;
@@
-  getProvider(): LLMProvider {
-    if (this.overrideConfig?.provider) {
-      return this.overrideConfig.provider;
-    }
-    const stored = localStorage.getItem(STORAGE_KEYS.PROVIDER);
-    return (stored as LLMProvider) || 'openai';
-  }
+  getProvider(): LLMProvider {
+    const override = this.overrideConfig?.provider;
+    if (override && ALLOWED_PROVIDERS.includes(override)) {
+      return override;
+    }
+    const stored = localStorage.getItem(STORAGE_KEYS.PROVIDER) as LLMProvider | null;
+    if (stored && ALLOWED_PROVIDERS.includes(stored)) {
+      return stored;
+    }
+    return 'openai';
+  }

Also applies to: 65-71


191-221: Gate credential persistence behind an explicit option (avoid storing API keys by default).

Persisting secrets in localStorage is a security risk. Make persistence opt‑in and thread the option through the provider‑specific saves.

-  saveConfiguration(config: LLMConfig): void {
+  saveConfiguration(config: LLMConfig, options?: { persistSecrets?: boolean }): void {
     logger.info('Saving configuration to localStorage', {
       provider: config.provider,
       mainModel: config.mainModel,
       hasApiKey: !!config.apiKey,
       hasEndpoint: !!config.endpoint
     });
 
+    const persistSecrets = options?.persistSecrets ?? true;
@@
-    // Save provider-specific settings
-    this.saveProviderSpecificSettings(config);
+    // Save provider-specific settings
+    this.saveProviderSpecificSettings(config, persistSecrets);
@@
-  private saveProviderSpecificSettings(config: LLMConfig): void {
+  private saveProviderSpecificSettings(config: LLMConfig, persistSecrets: boolean): void {
@@
       case 'openai':
-        if (config.apiKey) {
-          localStorage.setItem(STORAGE_KEYS.OPENAI_API_KEY, config.apiKey);
-        } else {
-          localStorage.removeItem(STORAGE_KEYS.OPENAI_API_KEY);
-        }
+        if (persistSecrets) {
+          if (config.apiKey) {
+            localStorage.setItem(STORAGE_KEYS.OPENAI_API_KEY, config.apiKey);
+          } else {
+            localStorage.removeItem(STORAGE_KEYS.OPENAI_API_KEY);
+          }
+        }
         break;
       case 'litellm':
         if (config.endpoint) {
           localStorage.setItem(STORAGE_KEYS.LITELLM_ENDPOINT, config.endpoint);
         } else {
           localStorage.removeItem(STORAGE_KEYS.LITELLM_ENDPOINT);
         }
-        if (config.apiKey) {
-          localStorage.setItem(STORAGE_KEYS.LITELLM_API_KEY, config.apiKey);
-        } else {
-          localStorage.removeItem(STORAGE_KEYS.LITELLM_API_KEY);
-        }
+        if (persistSecrets) {
+          if (config.apiKey) {
+            localStorage.setItem(STORAGE_KEYS.LITELLM_API_KEY, config.apiKey);
+          } else {
+            localStorage.removeItem(STORAGE_KEYS.LITELLM_API_KEY);
+          }
+        }
         break;
       case 'groq':
-        if (config.apiKey) {
-          localStorage.setItem(STORAGE_KEYS.GROQ_API_KEY, config.apiKey);
-        } else {
-          localStorage.removeItem(STORAGE_KEYS.GROQ_API_KEY);
-        }
+        if (persistSecrets) {
+          if (config.apiKey) {
+            localStorage.setItem(STORAGE_KEYS.GROQ_API_KEY, config.apiKey);
+          } else {
+            localStorage.removeItem(STORAGE_KEYS.GROQ_API_KEY);
+          }
+        }
         break;
       case 'openrouter':
-        if (config.apiKey) {
-          localStorage.setItem(STORAGE_KEYS.OPENROUTER_API_KEY, config.apiKey);
-        } else {
-          localStorage.removeItem(STORAGE_KEYS.OPENROUTER_API_KEY);
-        }
+        if (persistSecrets) {
+          if (config.apiKey) {
+            localStorage.setItem(STORAGE_KEYS.OPENROUTER_API_KEY, config.apiKey);
+          } else {
+            localStorage.removeItem(STORAGE_KEYS.OPENROUTER_API_KEY);
+          }
+        }
         break;

Also applies to: 337-374
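
If this opt-in is adopted, call sites that receive per-evaluation credentials would pass persistSecrets: false so keys stay in memory only. The snippet below is a usage sketch against the proposed signature (it does not exist in the merged code); the import path, key, and model name are placeholders.

import { LLMConfigurationManager } from './LLMConfigurationManager.js';

const configManager = LLMConfigurationManager.getInstance();
const evaluationApiKey = '(key taken from the evaluation request)';  // placeholder value

// Persist provider/model choices, but keep the credential out of localStorage.
configManager.saveConfiguration(
  { provider: 'openai', apiKey: evaluationApiKey, mainModel: 'gpt-4.1-mini' },
  { persistSecrets: false },
);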

Comment on lines +26 to +36
const STORAGE_KEYS = {
  PROVIDER: 'ai_chat_provider',
  MODEL_SELECTION: 'ai_chat_model_selection',
  MINI_MODEL: 'ai_chat_mini_model',
  NANO_MODEL: 'ai_chat_nano_model',
  OPENAI_API_KEY: 'ai_chat_api_key',
  LITELLM_ENDPOINT: 'ai_chat_litellm_endpoint',
  LITELLM_API_KEY: 'ai_chat_litellm_api_key',
  GROQ_API_KEY: 'ai_chat_groq_api_key',
  OPENROUTER_API_KEY: 'ai_chat_openrouter_api_key',
} as const;

🛠️ Refactor suggestion

🧩 Analysis chain

Search the codebase for old keys to update call sites.

Ensure no lingering references to ai_chat_model_selection or ai_chat_litellm_endpoint.


🏁 Script executed:

#!/bin/bash
rg -n -C2 -g '!**/node_modules/**' -P 'ai_chat_model_selection|ai_chat_litellm_endpoint'

Length of output: 148


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "PWD: $(pwd)"
echo "Top-level files (first 200 lines):"
ls -la | sed -n '1,200p'

echo
echo "File count (excluding node_modules):"
find . -path './node_modules' -prune -o -type f -print | wc -l

echo
echo "Searching for keys with ripgrep (no ignores, include hidden):"
rg -n -C2 --hidden -uu -S -g '!**/node_modules/**' -e 'ai_chat_model_selection' -e 'ai_chat_litellm_endpoint' || echo "rg: no matches"

echo
echo "Fallback search with grep (excluding node_modules):"
grep -RIn --line-number --exclude-dir=node_modules -e 'ai_chat_model_selection' -e 'ai_chat_litellm_endpoint' || echo "grep: no matches"

Length of output: 6377


Replace hard-coded storage keys with STORAGE_KEYS (fix lingering 'ai_chat_model_selection' and 'ai_chat_litellm_endpoint')

Replace raw uses of 'ai_chat_model_selection' and 'ai_chat_litellm_endpoint' with STORAGE_KEYS.MODEL_SELECTION and STORAGE_KEYS.LITELLM_ENDPOINT. Found literal occurrences to update:

  • front_end/panels/ai_chat/LLM/LiteLLMProvider.ts:52, 415
  • front_end/panels/ai_chat/core/AgentService.ts:793
  • front_end/panels/ai_chat/evaluation/runner/EvaluationRunner.ts:107
  • front_end/panels/ai_chat/ui/SettingsDialog.ts:30
  • front_end/panels/ai_chat/ui/AIChatPanel.ts:145, 151
  • front_end/panels/ai_chat/ui/__tests__/AIChatPanel.test.ts:40, 161, 173, 224, 260, 283, 296, 342
  • MODEL-CONFIGS.md:18, 25, 29

Key definitions are in front_end/panels/ai_chat/core/LLMConfigurationManager.ts (MODEL_SELECTION, LITELLM_ENDPOINT); update the above call sites, tests, and docs to reference those constants.
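
A representative call-site migration could look like the sketch below; it assumes STORAGE_KEYS is exported from LLMConfigurationManager.ts (it is not exported in the snippet shown above), and the relative path and default model name are placeholders.

import { STORAGE_KEYS } from '../core/LLMConfigurationManager.js';

// Before: localStorage.getItem('ai_chat_model_selection')
const selectedModel = localStorage.getItem(STORAGE_KEYS.MODEL_SELECTION) ?? 'gpt-4.1-mini';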

🤖 Prompt for AI Agents
In front_end/panels/ai_chat/core/LLMConfigurationManager.ts around lines 26 to
36, STORAGE_KEYS now centralizes storage key constants but several files still
use the hard-coded strings 'ai_chat_model_selection' and
'ai_chat_litellm_endpoint'; replace those literal usages with
STORAGE_KEYS.MODEL_SELECTION and STORAGE_KEYS.LITELLM_ENDPOINT respectively
across the codebase (update front_end/panels/ai_chat/LLM/LiteLLMProvider.ts
lines ~52 and ~415, front_end/panels/ai_chat/core/AgentService.ts line ~793,
front_end/panels/ai_chat/evaluation/runner/EvaluationRunner.ts line ~107,
front_end/panels/ai_chat/ui/SettingsDialog.ts line ~30,
front_end/panels/ai_chat/ui/AIChatPanel.ts lines ~145 and ~151, all occurrences
in front_end/panels/ai_chat/ui/__tests__/AIChatPanel.test.ts at the listed line
numbers, and references in MODEL-CONFIGS.md at the listed lines); ensure imports
reference STORAGE_KEYS (or LLMConfigurationManager export) and update tests/docs
to use the constant names so all runtime and test references are consistent.

Comment on lines +155 to +169
  /**
   * Set override configuration (for automated mode per-request overrides)
   */
  setOverride(config: Partial<LLMConfig>): void {
    logger.info('Setting configuration override', {
      provider: config.provider,
      mainModel: config.mainModel,
      hasApiKey: !!config.apiKey,
      hasEndpoint: !!config.endpoint
    });

    this.overrideConfig = { ...config };
    this.notifyListeners();
  }

🛠️ Refactor suggestion

Provide a scoped override API to support concurrent evaluations.

Consider adding a withOverride helper to ensure overrides are applied and cleared atomically per operation rather than mutating global state.

Example:

public async withOverride<T>(override: Partial<LLMConfig>, fn: () => Promise<T>): Promise<T> {
  const prev = this.overrideConfig;
  this.overrideConfig = { ...override };
  this.notifyListeners();
  try {
    return await fn();
  } finally {
    this.overrideConfig = prev;
    this.notifyListeners();
  }
}

Also applies to: 181-187

Comment on lines +348 to +416
    // CRITICAL FIX: Set configuration override early for any model configuration provided
    // This allows API keys from request body to be available for UI-level validation
    logger.info('DEBUG: Checking for model configuration override', {
      evaluationId: params.evaluationId,
      hasModel: !!params.model,
      model: params.model
    });

    if (params.model && (params.model.main_model || params.model.provider || params.model.api_key)) {
      const configManager = LLMConfigurationManager.getInstance();

      // Extract configuration from nested model structure - handle both flat and nested formats
      const mainModel = params.model.main_model as any;

      // For nested format: main_model: { provider: "openai", model: "gpt-4", api_key: "key" }
      // For flat format: { provider: "openai", main_model: "gpt-4", api_key: "key" }
      const provider = mainModel?.provider || params.model.provider || 'openai';
      const apiKey = mainModel?.api_key || params.model.api_key;
      const modelName = mainModel?.model || mainModel || params.model.main_model;
      const miniModel = (params.model.mini_model as any)?.model || params.model.mini_model;
      const nanoModel = (params.model.nano_model as any)?.model || params.model.nano_model;

      logger.info('DEBUG: Extracted model configuration', {
        evaluationId: params.evaluationId,
        mainModel,
        provider,
        hasApiKey: !!apiKey,
        apiKeyLength: apiKey?.length,
        modelName,
        miniModel,
        nanoModel
      });

      if (apiKey) {
        logger.info('Setting early configuration override for evaluation', {
          evaluationId: params.evaluationId,
          provider,
          hasApiKey: !!apiKey,
          modelName
        });

        configManager.setOverride({
          provider,
          apiKey,
          mainModel: modelName,
          miniModel: miniModel || modelName,
          nanoModel: nanoModel || miniModel || modelName
        });
        hasSetEarlyOverride = true;

        logger.info('DEBUG: Early override set successfully', {
          evaluationId: params.evaluationId
        });
      } else {
        logger.warn('DEBUG: No API key found in model configuration', {
          evaluationId: params.evaluationId,
          mainModel,
          provider
        });
      }
    } else {
      logger.warn('DEBUG: No model configuration found for override', {
        evaluationId: params.evaluationId,
        hasModel: !!params.model,
        hasMainModel: !!(params.model?.main_model),
        hasProvider: !!(params.model?.provider),
        hasApiKey: !!(params.model?.api_key)
      });
    }

⚠️ Potential issue

Global override is not concurrency‑safe; race/cross‑talk between evaluations.

LLMConfigurationManager holds a single global override. With maxConcurrency: 3, parallel evaluations can clobber each other's provider/apiKey/models, causing misrouting and potential credential/data leakage. Immediate mitigation: set maxConcurrency to 1. Follow-up: adopt scoped overrides keyed by evaluationId, or pass the config directly to AgentService.

Immediate mitigation:

@@
-        maxConcurrency: 3,
+        maxConcurrency: 1,

Recommended follow-up (manager API sketch to avoid global mutable state):

// In LLMConfigurationManager (new API)
withOverride<T>(override: Partial<LLMConfig>, fn: () => Promise<T>): Promise<T> {
  const prev = this.overrideConfig;
  this.overrideConfig = { ...override };
  this.notifyListeners();
  return Promise.resolve(fn()).finally(() => {
    this.overrideConfig = prev;
    this.notifyListeners();
  });
}

Then use:

await configManager.withOverride({ provider, apiKey, mainModel, miniModel, nanoModel, endpoint }, async () => {
  // call AgentService.sendMessage(...)
});

Also applies to: 912-921, 697-709, 1036-1038, 220-221

🤖 Prompt for AI Agents
front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.ts lines 348-416: the
current code calls LLMConfigurationManager.setOverride which mutates a single
global override and causes race/cross-talk between parallel evaluations;
immediately set maxConcurrency to 1 for safety, and then refactor to use a
scoped override or pass config directly to the agent instead of mutating global
state. Implement a withOverride-like API on LLMConfigurationManager that
temporarily sets the override, calls the async evaluation work, and restores the
previous override in finally, or change AgentService/Agent call sites to accept
a per-evaluation config object and use that instead of reading the global
override; update all other occurrences noted in the comment (lines 912-921,
697-709, 1036-1038, 220-221) to follow the same pattern so overrides are
per-evaluation and not global.

@olesho olesho merged commit 65494a3 into main Sep 24, 2025
7 checks passed