
Env setup/cleanup ability and verify through script #62

Merged
tisnik merged 8 commits into lightspeed-core:main from asamal4:script-handler
Sep 29, 2025

Conversation

@asamal4 asamal4 (Collaborator) commented Sep 26, 2025

  • Ability to run environment setup and cleanup scripts at the conversation-group level.
  • Script execution applies only when the API is enabled.
  • Deterministic actions can be verified through a per-turn script.
  • Added mock/sample data as a reference for the expected structure (see the sketch below).
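
For orientation, a sketch of how such an entry might be shaped, written as a Python dict that mirrors the YAML. The script fields (setup_script, cleanup_script, verify_script) and the script:action_eval metric come from this PR; the remaining keys and values are illustrative assumptions, not the exact schema.

# Hypothetical conversation entry; only the script fields and metric name are taken from this PR.
sample_conversation = {
    "conversation_group": "script_eval_example",
    "setup_script": "sample_scripts/setup.sh",      # runs once before the conversation (API mode only)
    "cleanup_script": "sample_scripts/cleanup.sh",  # always attempted after the conversation
    "turns": [                                      # key name assumed
        {
            "query": "Create the test namespace",
            "verify_script": "sample_scripts/verify.sh",  # exit code 0 -> pass, non-zero -> fail
            "turn_metrics": ["script:action_eval"],
        }
    ],
}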

Summary by CodeRabbit

  • New Features

    • Script-based evaluations: per-turn verify scripts and optional conversation-level setup/cleanup scripts; new script metric (script:action_eval) with pass/fail scoring and API toggle.
    • Script execution manager and ScriptExecutionError exposed; sample setup/verify/cleanup scripts included.
  • Refactor

    • Evaluation flow now runs setup → per-turn evaluation → cleanup with centralized components, error short-circuiting, and metric routing.
    • Evaluation data schema extended to include script fields and structured metric entries.
  • Documentation

    • Added guide, YAML examples, script path resolution rules, and KUBECONFIG note.

@coderabbitai coderabbitai bot (Contributor) commented Sep 26, 2025

Walkthrough

Adds script-based evaluation support: new ScriptExecutionManager and ScriptExecutionError; ScriptEvalMetrics; model fields for setup/cleanup/verify scripts; validator script-path checks (YAML-dir resolution, executability); pipeline/evaluator/processor wiring to run setup scripts, per-turn verify scripts, and cleanup; README and sample scripts added.
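
As a rough sketch of the runner behaviour described above (run the script without a shell, enforce a timeout, capture output, report pass/fail by exit code, and raise ScriptExecutionError when execution itself fails). The class layout here is simplified, and anything not stated in the walkthrough, such as the default timeout value, is an assumption.

import logging
import subprocess
from pathlib import Path

logger = logging.getLogger(__name__)


class ScriptExecutionError(Exception):
    """Simplified stand-in; the real exception lives in core/system/exceptions.py."""


class ScriptExecutionManager:
    """Sketch of the subprocess-based script runner."""

    def __init__(self, timeout: int = 300):
        self.timeout = timeout

    def run_script(self, script_path) -> bool:
        """Run the script without a shell; True on exit code 0, False otherwise."""
        path = Path(script_path).resolve()
        if not path.is_file():
            raise ScriptExecutionError(f"Script not found: {path}")
        try:
            result = subprocess.run(
                [str(path)],
                text=True,
                capture_output=True,
                cwd=str(path.parent),
                timeout=self.timeout,
                check=False,
            )
        except (subprocess.TimeoutExpired, OSError) as exc:
            raise ScriptExecutionError(f"Script execution failed: {path}") from exc
        if result.stdout:
            logger.debug("Script stdout: %s", result.stdout.strip())
        if result.stderr:
            logger.debug("Script stderr: %s", result.stderr.strip())
        return result.returncode == 0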

Changes

Cohort / File(s) | Summary of changes
Docs
README.md
Added Script-Based Evaluations section, YAML examples, setup/cleanup/verify guidance, KUBECONFIG note, and updated metrics/data-field docs.
Config data & metadata
config/evaluation_data.yaml, config/system.yaml
Reworked evaluation_data entries (map-style metric entries, new script_eval_example and tool_eval_example), added conversation/turn script fields (setup_script, cleanup_script, verify_script); added script:action_eval metric metadata.
Public exports
src/lightspeed_evaluation/__init__.py, src/lightspeed_evaluation/core/script/__init__.py
Exported ScriptExecutionManager and ScriptExecutionError on package surfaces.
Script execution subsystem
src/lightspeed_evaluation/core/script/manager.py, src/lightspeed_evaluation/core/system/exceptions.py
New ScriptExecutionManager (resolves/validates script paths, checks executability, runs subprocess with timeout, captures output, returns success/fail, raises ScriptExecutionError); new ScriptExecutionError class.
Models
src/lightspeed_evaluation/core/models/data.py
Added setup_script and cleanup_script to EvaluationData; added verify_script to TurnData (accepts str or Path).
Validation
src/lightspeed_evaluation/core/system/validator.py
Added _validate_scripts and _validate_single_script; YAML-dir-relative path resolution, existence/type/executable checks (attempt chmod), contextual error messages; integrated into load_evaluation_data; added script:action_eval to METRIC_REQUIREMENTS.
Metrics
src/lightspeed_evaluation/core/metrics/script_eval.py
Added ScriptEvalMetrics to evaluate script:action_eval by invoking ScriptExecutionManager; enforces turn-level scope, respects API-enabled gating, returns numeric pass/fail and messages, handles ScriptExecutionError.
Pipeline integration
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py, src/lightspeed_evaluation/pipeline/evaluation/pipeline.py, src/lightspeed_evaluation/pipeline/evaluation/processor.py
Wired ScriptExecutionManager into pipeline and ProcessorComponents; MetricsEvaluator accepts script_manager; Processor runs setup before amendment/evaluation, invokes per-turn verify scripts via metrics, then runs cleanup (cleanup errors logged; setup failures short-circuit). Signatures updated to pass script_manager through components.
Sample scripts
config/sample_scripts/*
Added setup.sh, verify.sh, cleanup.sh demonstrating setup/verify/cleanup behavior for script-based evaluations.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Pipeline
  participant Validator
  participant Processor
  participant ScriptMgr as ScriptExecutionManager
  participant Metrics as MetricsEvaluator

  User->>Pipeline: start(evaluation_data)
  Pipeline->>Validator: validate(data)  %% includes script path checks
  Pipeline->>Processor: process(conversation)

  Note over Processor: Setup phase (if provided)
  Processor->>ScriptMgr: run_script(setup_script)?
  alt Setup fails
    ScriptMgr-->>Processor: error / raises ScriptExecutionError
    Processor-->>Pipeline: return ERROR (short-circuit)
  else Setup OK or absent
    ScriptMgr-->>Processor: success/skip
    loop For each turn
      Processor->>Metrics: evaluate(turn_metrics)
      alt metric == script:action_eval
        Metrics->>ScriptMgr: run_script(verify_script)
        ScriptMgr-->>Metrics: pass / fail
      end
    end
    Processor->>Metrics: evaluate(conversation_metrics)
    Processor-->>Pipeline: return RESULTS
  end

  Note over Processor: Cleanup (always attempt)
  Processor->>ScriptMgr: run_script(cleanup_script)?
  ScriptMgr-->>Processor: success / warn
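
A hedged Python sketch of the processor flow in the diagram above, reusing the runner sketched earlier. Function, attribute, and helper names are illustrative (the real processor wires these through ProcessorComponents); the points it captures are the short-circuit on setup failure and the cleanup attempt in a finally block.

# Illustrative flow only - not the real ConversationProcessor API.
def process_conversation(conv_data, script_manager, metrics_evaluator, error_handler):
    try:
        if conv_data.setup_script and not _run_setup(script_manager, conv_data.setup_script):
            # Setup failed: short-circuit and mark all metrics for this conversation as ERROR.
            return error_handler.mark_all_metrics_as_error(conv_data, "setup script failed")
        results = []
        for turn in conv_data.turns:  # attribute name assumed
            results.extend(metrics_evaluator.evaluate_turn(turn))
        results.extend(metrics_evaluator.evaluate_conversation(conv_data))
        return results
    finally:
        # Cleanup is always attempted, even after a setup failure; errors are only logged.
        if conv_data.cleanup_script:
            try:
                script_manager.run_script(conv_data.cleanup_script)
            except ScriptExecutionError as exc:
                logger.warning("Cleanup script failed: %s", exc)


def _run_setup(script_manager, setup_script) -> bool:
    try:
        return script_manager.run_script(setup_script)
    except ScriptExecutionError as exc:
        logger.error("Setup script error: %s", exc)
        return False
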
sequenceDiagram
  autonumber
  participant ScriptEval as ScriptEvalMetrics
  participant ScriptMgr as ScriptExecutionManager
  participant OS as Subprocess

  ScriptEval->>ScriptEval: validate scope & api_enabled
  alt API disabled
    ScriptEval-->>Caller: (None, "API disabled")
  else no verify_script
    ScriptEval-->>Caller: (None, "no verify script")
  else run verify script
    ScriptEval->>ScriptMgr: run_script(path)
    ScriptMgr->>OS: subprocess.run(script, env, timeout)
    alt exit code 0
      OS-->>ScriptMgr: success (code 0)
      ScriptMgr-->>ScriptEval: True
      ScriptEval-->>Caller: (1.0, "passed")
    else non-zero / error
      OS-->>ScriptMgr: failure / stderr
      ScriptMgr-->>ScriptEval: False or raises ScriptExecutionError
      ScriptEval-->>Caller: (0.0, "failed" / error message)
    end
  end
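
The metric-side behaviour from the diagram above can be sketched as a single function returning the (score, reason) tuples described (None when skipped, 1.0 on pass, 0.0 on fail); the signature is an assumption, and the real implementation is the ScriptEvalMetrics class.

from typing import Optional, Tuple


def evaluate_action_script(turn_data, script_manager, api_enabled: bool) -> Tuple[Optional[float], str]:
    """Sketch of script:action_eval scoring."""
    if not api_enabled:
        return None, "Script evaluation skipped: API is disabled"
    if not turn_data.verify_script:
        return None, "No verify script provided for this turn"
    try:
        passed = script_manager.run_script(turn_data.verify_script)
    except ScriptExecutionError as exc:
        return 0.0, f"Verify script execution error: {exc}"
    return (1.0, "Verify script passed") if passed else (0.0, "Verify script failed (non-zero exit code)")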

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • Turn metric override #55 — Modifies pipeline/component wiring (ProcessorComponents, MetricsEvaluator) that overlaps with script_manager wiring and signature changes.
  • agent eval: multi-turn & refactoring #22 — Implements per-conversation setup/cleanup and per-turn script execution; overlaps on script validation and run semantics.
  • feat: add agent e2e eval #19 — Adds script-based evaluation features (script runner, ScriptExecutionError, script-based metrics) with strong code-level overlap.

Suggested reviewers

  • VladimirKadlec
  • tisnik

Poem

I hop to the scripts, I scurry and test,
Setup at sunrise, verify each quest.
Cleanup at dusk when the checks are all done,
Carrots for success — crunchy, warm fun. 🥕🐇

Pre-merge checks

✅ Passed checks (3 passed)
  • Title Check ✅ Passed: The title succinctly describes the primary enhancements introduced by the changeset (environment setup and cleanup scripts plus script-based verification) and clearly conveys the main intent of the pull request.
  • Docstring Coverage ✅ Passed: Docstring coverage is 100.00%, above the required threshold of 80.00%.
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.


@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

120-122: Fix potential AttributeError when turn_data is None (conversation-level).

Operator precedence causes request.turn_data.response to be accessed even when turn_data is None.

Apply this diff in both locations:

-                response=request.turn_data.response or "" if request.turn_data else "",
+                response=(request.turn_data.response or "") if request.turn_data else "",

And similarly in _create_error_result:

-            response=request.turn_data.response or "" if request.turn_data else "",
+            response=(request.turn_data.response or "") if request.turn_data else "",

Also applies to: 144-146

config/evaluation_data.yaml (1)

57-69: Ensure referenced scripts exist and are executable
The following files are referenced in config/evaluation_data.yaml but are missing:

  • scripts/verify_infrastructure.sh
  • scripts/setup_test_env.sh
  • scripts/cleanup_test_env.sh
    Add each script (with a valid #! shebang) and mark them executable, or correct their paths.
🧹 Nitpick comments (10)
config/system.yaml (1)

76-79: Consider setting an explicit threshold for binary script metrics.

Without an explicit threshold, MetricsEvaluator defaults to 0.5; for a binary 0/1 verify script, it's clearer to require success (1.0).

Apply this diff:

-    "script:action_eval":
-      description: "Script-based evaluation for infrastructure/environment validation"
+    "script:action_eval":
+      threshold: 1.0
+      description: "Script-based evaluation for infrastructure/environment validation"
src/lightspeed_evaluation/core/system/exceptions.py (1)

28-41: Add str override for richer error context.

Include script_path and return_code in the string form to improve logs and surfaced messages.

 class ScriptExecutionError(EvaluationError):
     """Exception raised when script execution fails."""
 
     def __init__(self, message: str, script_path: str = "", return_code: int = -1):
         """Initialize script execution error.
@@
         super().__init__(message)
         self.script_path = script_path
         self.return_code = return_code
+
+    def __str__(self) -> str:
+        base = super().__str__()
+        extras = []
+        if self.script_path:
+            extras.append(f"script={self.script_path}")
+        if self.return_code != -1:
+            extras.append(f"return_code={self.return_code}")
+        return f"{base} ({', '.join(extras)})" if extras else base
README.md (1)

295-301: Tighten wording for the API-enabled scripts note.

Minor grammar fix.

-The framework supports script-based evaluations.
-**Note: Scripts only execute when API is enabled** - they're designed to test the actual with actual environment changes.
+The framework supports script-based evaluations.
+**Note: Scripts only execute when API is enabled** — they're designed to test with actual environment changes.
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

30-31: Update the constructor docstring for accuracy.

Now initializes multiple managers and handlers, not just LLM.

-        """Initialize with LLM manager."""
+        """Initialize evaluator with system config, metric manager, and script manager."""
src/lightspeed_evaluation/core/models/data.py (2)

76-80: Normalize/validate script paths at the model layer (optional).

Consider a field_validator to strip whitespace and coerce to Path for verify_script to reduce downstream checks.


164-173: Mirror normalization for setup/cleanup scripts.

Same recommendation for setup_script and cleanup_script for consistency and early feedback.
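
A minimal sketch of the suggested normalization, assuming pydantic v2 (the field name matches data.py; everything else is simplified). The same validator could be reused for setup_script and cleanup_script on EvaluationData.

from pathlib import Path
from typing import Optional, Union

from pydantic import BaseModel, field_validator


class TurnData(BaseModel):  # heavily simplified; the real model has many more fields
    verify_script: Optional[Union[str, Path]] = None

    @field_validator("verify_script", mode="before")
    @classmethod
    def _normalize_script(cls, value):
        """Strip whitespace and coerce to Path early so downstream code sees one type."""
        if value is None:
            return None
        value = str(value).strip()
        return Path(value) if value else None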

src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (1)

61-67: Make script timeout configurable.

Expose a script execution timeout in SystemConfig and pass it here to avoid hardcoding 300s.

Example (once config is added):

-        script_manager = ScriptExecutionManager()
+        script_manager = ScriptExecutionManager(timeout=config.script.timeout)
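
For illustration, one way the timeout could be surfaced in config, assuming pydantic-based settings; the ScriptConfig class and the config.script attribute are hypothetical, not the project's current schema.

from pydantic import BaseModel, Field


class ScriptConfig(BaseModel):
    # Seconds before a setup/verify/cleanup script is killed; 300 matches the current hardcoded default.
    timeout: int = Field(default=300, ge=1)


class SystemConfig(BaseModel):  # simplified
    script: ScriptConfig = Field(default_factory=ScriptConfig)


# Wiring, as in the diff above:
# script_manager = ScriptExecutionManager(timeout=config.script.timeout)
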
src/lightspeed_evaluation/core/system/validator.py (1)

288-295: Avoid chmod in validator; fail fast instead

Silently mutating file permissions in a validator introduces side effects and potential security posture issues (e.g., chmod on shared files). Prefer failing with a clear error and let the caller fix permissions (or perform opt‑in chmod at execution time).

Apply this diff:

-        # Check if script is executable or can be made executable
-        if not os.access(script_file, os.X_OK):
-            try:
-                script_file.chmod(0o755)
-            except (OSError, PermissionError) as exc:
-                raise DataValidationError(
-                    f"Conversation {context}: {script_type} script is not executable: {script_file}"
-                ) from exc
+        # Check if script is executable
+        if not os.access(script_file, os.X_OK):
+            raise DataValidationError(
+                f"Conversation {context}: {script_type} script is not executable: {script_file}"
+            )
src/lightspeed_evaluation/core/script/manager.py (2)

62-69: Run scripts with cwd set to the script’s directory

Many scripts assume relative paths to neighbors; setting cwd prevents surprises and improves isolation.

Apply this diff:

-            result = subprocess.run(
-                [str(script_path)],
-                text=True,
-                capture_output=True,
-                env=env,
-                timeout=self.timeout,
-                check=False,
-            )
+            result = subprocess.run(
+                [str(script_path)],
+                text=True,
+                capture_output=True,
+                env=env,
+                cwd=str(script_path.parent),
+                timeout=self.timeout,
+                check=False,
+            )

71-79: Truncate stdout/stderr in logs to avoid log flooding

Large outputs can overwhelm logs. Truncate at a sensible size while keeping full capture in memory if needed later.

Apply this diff:

-            # Log output
-            if result.stdout:
-                logger.debug("Script stdout: %s", result.stdout.strip())
-            if result.stderr:
-                if result.returncode != 0:
-                    logger.error("Script stderr: %s", result.stderr.strip())
-                else:
-                    logger.debug("Script stderr: %s", result.stderr.strip())
+            # Log output (truncate to avoid flooding)
+            def _truncate(s: str, limit: int = 8192) -> str:
+                s = s.strip()
+                return (s[:limit] + "... (truncated)") if len(s) > limit else s
+            if result.stdout:
+                logger.debug("Script stdout: %s", _truncate(result.stdout))
+            if result.stderr:
+                msg = _truncate(result.stderr)
+                if result.returncode != 0:
+                    logger.error("Script stderr: %s", msg)
+                else:
+                    logger.debug("Script stderr: %s", msg)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f75eb6c and b838b10.

📒 Files selected for processing (13)
  • README.md (2 hunks)
  • config/evaluation_data.yaml (1 hunks)
  • config/system.yaml (1 hunks)
  • src/lightspeed_evaluation/__init__.py (2 hunks)
  • src/lightspeed_evaluation/core/metrics/script_eval.py (1 hunks)
  • src/lightspeed_evaluation/core/models/data.py (3 hunks)
  • src/lightspeed_evaluation/core/script/__init__.py (1 hunks)
  • src/lightspeed_evaluation/core/script/manager.py (1 hunks)
  • src/lightspeed_evaluation/core/system/exceptions.py (1 hunks)
  • src/lightspeed_evaluation/core/system/validator.py (5 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py (6 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.

Applied to files:

  • config/evaluation_data.yaml
📚 Learning: 2025-09-09T08:08:40.654Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/core/api/__init__.py:3-3
Timestamp: 2025-09-09T08:08:40.654Z
Learning: API_KEY and OPENAI_API_KEY are distinct environment variables with different purposes in the lightspeed-evaluation codebase - they should not be treated as interchangeable fallbacks.

Applied to files:

  • README.md
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • config/system.yaml
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/core/system/validator.py
🧬 Code graph analysis (8)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
src/lightspeed_evaluation/core/metrics/script_eval.py (1)
  • ScriptEvalMetrics (13-60)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-101)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-101)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • MetricsEvaluator (21-156)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (1)
  • ProcessorComponents (19-26)
src/lightspeed_evaluation/core/metrics/script_eval.py (2)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationScope (227-238)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-101)
src/lightspeed_evaluation/__init__.py (2)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-101)
src/lightspeed_evaluation/core/system/exceptions.py (3)
  • APIError (8-9)
  • DataValidationError (16-17)
  • EvaluationError (4-5)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (5)
src/lightspeed_evaluation/core/metrics/manager.py (3)
  • MetricLevel (10-14)
  • resolve_metrics (24-48)
  • count_metrics_for_conversation (115-136)
src/lightspeed_evaluation/core/models/data.py (3)
  • EvaluationData (135-182)
  • EvaluationResult (185-224)
  • TurnData (35-132)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-101)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
  • mark_all_metrics_as_error (17-77)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
  • amend_conversation_data (20-69)
src/lightspeed_evaluation/core/script/__init__.py (2)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • ScriptExecutionError (28-41)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-101)
src/lightspeed_evaluation/core/script/manager.py (1)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • ScriptExecutionError (28-41)
src/lightspeed_evaluation/core/system/validator.py (2)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationData (135-182)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • DataValidationError (16-17)
🪛 GitHub Check: Bandit
src/lightspeed_evaluation/core/script/manager.py

[notice] 5-5:
Consider possible security implications associated with the subprocess module.


[notice] 62-63:
subprocess call - check for execution of untrusted input.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: mypy
  • GitHub Check: tests (3.13)
🔇 Additional comments (16)
src/lightspeed_evaluation/core/script/__init__.py (1)

1-9: Public API exposure looks good.

Clean exports for ScriptExecutionManager and ScriptExecutionError.

src/lightspeed_evaluation/__init__.py (1)

27-36: Expose script execution types at top-level — LGTM.

Re-exporting ScriptExecutionManager and ScriptExecutionError is consistent and convenient.

Also applies to: 58-60

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

44-47: Script metric wiring — LGTM.

Properly instantiates ScriptEvalMetrics with API-enabled flag and routes under "script".

Also applies to: 53-54

src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (2)

64-67: Pipeline wiring — LGTM.

Creating and passing ScriptExecutionManager into MetricsEvaluator and ProcessorComponents is clean.


69-76: Setup and cleanup scripts correctly gated by api.enabled.

src/lightspeed_evaluation/core/system/validator.py (3)

51-54: Registering script metric requirements looks good

The script:action_eval requirement for verify_script is correctly captured and aligns with the new metric.


123-126: Correctly gating script validation by API flag

Running script path validation only when API is enabled matches the intended behavior and avoids unnecessary failures offline.


202-205: Good: skipping script metric checks when API is disabled

This prevents false positives in non-API runs.

src/lightspeed_evaluation/core/script/manager.py (1)

5-5: Bandit: subprocess usage is acceptable here (no shell), ensure sources are trusted

You correctly avoid shell=True. Inputs are validated earlier (existence, file type). Keep script paths restricted to validated config.

Please confirm that all script paths originate from validated YAML (and not any runtime/user input).

src/lightspeed_evaluation/core/metrics/script_eval.py (2)

35-38: Metric routing is consistent

Handling "action_eval" under the script namespace matches the registry pattern.


47-51: Graceful no-op when API disabled

Skipping execution when API is off aligns with the validator’s behavior.

src/lightspeed_evaluation/pipeline/evaluation/processor.py (4)

22-27: Components container extension looks good

Adding script_manager keeps wiring centralized and clean.


80-101: Happy path + API error short-circuit logic reads well

Short-circuiting to ERROR on API amendment failure, with cleanup in finally, is solid.


164-187: Setup runner is robust

Good logging and error propagation via string reason; integrates with manager exceptions cleanly.


188-215: Cleanup runner behavior is appropriate

Non-blocking on failure and always-attempt semantics are good.

config/evaluation_data.yaml (1)

3-29: YAML structure aligns with model and validator

Fields and nulls (verify_script/setup_script/cleanup_script) match the new schema.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

119-121: Fix potential AttributeError when turn_data is None (bad conditional expression precedence).
Current expression can evaluate request.turn_data.response even when turn_data is None. Apply:

-                query=request.turn_data.query if request.turn_data else "",
-                response=request.turn_data.response or "" if request.turn_data else "",
+                query=request.turn_data.query if request.turn_data else "",
+                response=(request.turn_data.response or "") if request.turn_data else "",

And similarly in _create_error_result:

-            query=request.turn_data.query if request.turn_data else "",
-            response=request.turn_data.response or "" if request.turn_data else "",
+            query=request.turn_data.query if request.turn_data else "",
+            response=(request.turn_data.response or "") if request.turn_data else "",

Also applies to: 143-145

src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (1)

128-136: Propagate original_data_path to DataValidator and validate scripts for programmatic inputs.
Currently _validate_scripts runs only in load_evaluation_data; pipeline path using validate_evaluation_data skips script validation, risking wrong CWD resolution and missing exec checks.

Apply:

         self.original_data_path = original_data_path
         logger.info("Starting evaluation")
         results: list[EvaluationResult] = []
 
         # Step 1: Validate data
         logger.info("Validating data")
+        # Ensure validator can resolve relative script paths against the YAML file dir
+        self.data_validator.original_data_path = self.original_data_path
         if not self.validate_data(evaluation_data):
             raise ValueError("Data validation failed. Cannot proceed with evaluation.")
+        # Also validate/normalize scripts when API is enabled and data was provided programmatically
+        config = self.config_loader.system_config
+        if config and config.api.enabled:
+            # Internal method is acceptable here until a public hook is exposed
+            self.data_validator._validate_scripts(evaluation_data)  # pylint: disable=protected-access

Alternatively, expose a public validate_scripts() on DataValidator and call it here.

🧹 Nitpick comments (6)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

73-79: Optional: guard split on malformed metric identifiers.
Avoid ValueError on missing ":".

-            framework, metric_name = request.metric_identifier.split(":", 1)
+            if ":" not in request.metric_identifier:
+                execution_time = time.time() - start_time
+                return self._create_error_result(
+                    request, f"Invalid metric identifier: {request.metric_identifier}", execution_time
+                )
+            framework, metric_name = request.metric_identifier.split(":", 1)
src/lightspeed_evaluation/core/system/validator.py (3)

123-126: Scripts validated only in load_evaluation_data; also invoke during validate_evaluation_data.
To cover programmatic inputs, mirror this check in validate_evaluation_data when api_enabled.

You can add at the end of validate_evaluation_data:

         if self.validation_errors:
             ...
         validation_msg = "✅ All data validation passed"
         if self.api_enabled:
             validation_msg += " (API mode - data will be enhanced via API)"
         print(validation_msg)
+        # Normalize/validate script paths in API mode even when data is pre-loaded
+        if self.api_enabled:
+            self._validate_scripts(evaluation_data)
         return True

Based on learnings


202-205: Use the method parameter api_enabled consistently.
Currently mixes self.api_enabled and api_enabled; prefer the parameter for clarity.

-                if metric.startswith("script:") and not self.api_enabled:
+                if metric.startswith("script:") and not api_enabled:
                     continue

288-294: Be cautious about chmod in a validator.
Changing file modes may be undesirable in read-only or shared repos; consider warning instead of mutating, or make this opt-in.

src/lightspeed_evaluation/pipeline/evaluation/processor.py (2)

84-87: Prefer fail-safe handling over raising when config is missing

Raising here will bubble an exception and skip error classification. Either validate system_config in __init__ (fail-fast) or mark all metrics as ERROR here and return.

Apply this diff to convert to ERROR results (Option B):

-            if self.config is None:
-                raise ValueError("SystemConfig must be loaded")
+            if self.config is None:
+                logger.error("SystemConfig not loaded - marking all metrics as ERROR")
+                error_results = self.components.error_handler.mark_all_metrics_as_error(
+                    conv_data,
+                    "SystemConfig not loaded",
+                    resolved_turn_metrics=resolved_turn_metrics,
+                    resolved_conversation_metrics=resolved_conversation_metrics,
+                )
+                return error_results

172-176: Don’t couple script execution to API.enabled

Gating setup/cleanup scripts on api.enabled is surprising; script eval may be useful even without API calls. Suggest always running scripts when provided (or introduce a dedicated scripts.enabled flag in config).

Apply this diff to remove the API gate:

-        # Skip script execution if API is disabled
-        if self.config and not self.config.api.enabled:
-            logger.debug("Skipping setup script (API disabled): %s", setup_script)
-            return None
-        # Skip script execution if API is disabled
-        if self.config and not self.config.api.enabled:
-            logger.debug("Skipping cleanup script (API disabled): %s", cleanup_script)
-            return

If the intent was to avoid running placeholder scripts in sample data, consider a dedicated flag instead. To verify intent, please confirm expected behavior when api.enabled=false.

Also applies to: 199-203

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b838b10 and 572d145.

📒 Files selected for processing (10)
  • README.md (2 hunks)
  • config/evaluation_data.yaml (1 hunks)
  • config/system.yaml (1 hunks)
  • src/lightspeed_evaluation/__init__.py (2 hunks)
  • src/lightspeed_evaluation/core/metrics/script_eval.py (1 hunks)
  • src/lightspeed_evaluation/core/models/data.py (3 hunks)
  • src/lightspeed_evaluation/core/system/validator.py (5 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • config/system.yaml
  • README.md
  • src/lightspeed_evaluation/__init__.py
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.

Applied to files:

  • config/evaluation_data.yaml
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/core/system/validator.py
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.

Applied to files:

  • src/lightspeed_evaluation/core/system/validator.py
📚 Learning: 2025-07-16T10:41:09.399Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:274-297
Timestamp: 2025-07-16T10:41:09.399Z
Learning: In the lsc_agent_eval package, the team prefers to focus on core functionality testing first and considers testing cleanup script execution after setup failure as early optimization, noting that there's no guarantee cleanup scripts will run successfully anyway.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/processor.py
🧬 Code graph analysis (5)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-101)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • MetricsEvaluator (21-156)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (1)
  • ProcessorComponents (19-26)
src/lightspeed_evaluation/core/metrics/script_eval.py (2)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationScope (227-238)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-101)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
src/lightspeed_evaluation/core/metrics/script_eval.py (1)
  • ScriptEvalMetrics (13-60)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-101)
src/lightspeed_evaluation/core/system/validator.py (2)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationData (135-182)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • DataValidationError (16-17)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (6)
src/lightspeed_evaluation/core/metrics/manager.py (4)
  • MetricLevel (10-14)
  • MetricManager (17-136)
  • resolve_metrics (24-48)
  • count_metrics_for_conversation (115-136)
src/lightspeed_evaluation/core/models/data.py (4)
  • EvaluationData (135-182)
  • EvaluationRequest (241-294)
  • EvaluationResult (185-224)
  • TurnData (35-132)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-101)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
  • mark_all_metrics_as_error (17-77)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
  • amend_conversation_data (20-69)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • evaluate_metric (58-129)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: mypy
🔇 Additional comments (21)
src/lightspeed_evaluation/core/models/data.py (3)

4-5: LGTM on imports and typing.
Path/Union imports align with new fields.


76-80: Turn-level verify_script field looks good.
Optional Path/str is appropriate; validator normalizes.


164-173: Conversation-level setup/cleanup scripts added correctly.
Matches downstream ScriptExecutionManager usage.

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

13-16: Correctly wires ScriptEvalMetrics and ScriptExecutionManager.
Keeps handler map cohesive.


28-31: All MetricsEvaluator instantiations updated
Found only one call in src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (line 65) using three arguments; no legacy two‐arg usages remain.

src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3)

10-10: Import of ScriptExecutionManager is correct.


61-67: Good: centralized ScriptExecutionManager and passed into evaluator.


75-76: Passing script_manager through ProcessorComponents is appropriate.

src/lightspeed_evaluation/core/system/validator.py (3)

3-6: LGTM on imports and API gating setup.


51-55: Requirement mapping for script:action_eval is correct.
Ensures verify_script presence only when applicable.


242-297: Resolve relative script paths against the YAML file directory, not CWD.
As-is, Path.resolve() will use process CWD; breaks when running from other dirs.

Apply:

         if isinstance(script_file, str):
             script_file = Path(script_file)
 
-        script_file = script_file.resolve()
+        # Resolve relative paths against the YAML file directory (fallback to CWD)
+        if not script_file.is_absolute():
+            base_dir = (
+                Path(self.original_data_path).resolve().parent
+                if self.original_data_path
+                else Path.cwd()
+            )
+            script_file = (base_dir / script_file).resolve()
+        else:
+            script_file = script_file.resolve()

Based on learnings

src/lightspeed_evaluation/core/metrics/script_eval.py (3)

13-22: Constructor and API gating look good.
Aligns with system_config.api.enabled wiring.


23-39: Scope checks and metric routing are appropriate.
Turn-only restriction and unknown-metric messaging are clear.


40-61: Execution path and error handling are solid.
return (1.0/0.0/None) semantics fit existing evaluator contract.

src/lightspeed_evaluation/pipeline/evaluation/processor.py (4)

67-80: Good fix: attempt cleanup even when setup fails

Cleanup is now attempted before returning on setup failure. This addresses prior feedback.


93-103: Cleanup still runs on API error path

Returning inside try is fine; finally ensures cleanup runs. LGTM.

Also applies to: 130-133


148-151: Evaluator wiring looks correct

Turn-level evaluation now delegates through components.metrics_evaluator. LGTM.


218-221: Summary delegation is consistent

Delegating to metric_manager.count_metrics_for_conversation is appropriate. LGTM.

config/evaluation_data.yaml (3)

20-22: Metrics supported: ragas:faithfulness, ragas:response_relevancy, and ragas:context_precision_without_reference are registered in system config and handled by the RAGAS pipeline.


61-68: Add missing scripts or gate their execution
config/evaluation_data.yaml references verify_infrastructure.sh, setup_test_env.sh, and cleanup_test_env.sh, but none exist in the repo. Either include these scripts under scripts/ or wrap their invocation behind a feature flag.

⛔ Skipped due to learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/data/script/eval3/setup.sh:1-3
Timestamp: 2025-07-16T09:42:00.691Z
Learning: Scripts in the lsc_agent_eval/data directory are meant to be simple examples/samples for teams to customize according to their needs, not production-ready code.

45-50: Confirmed deepeval conversation metrics registration and handler implementation

Both deepeval:conversation_completeness and deepeval:conversation_relevancy are defined under conversation_level in config/system.yaml, and DeepEvalMetrics is implemented and wired under the “deepeval” key in src/lightspeed_evaluation/pipeline/evaluation/evaluator.py.

@asamal4 asamal4 changed the title from "Script handler" to "Env setup/cleanup ability and verify through script" on Sep 28, 2025
@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/system/validator.py (1)

271-279: Handle user home shortcuts in script paths

Path("~…") stays literal right now, so a config that points to ~/scripts/setup.sh ends up looking for <yaml_dir>/~/scripts/setup.sh, which is almost certainly wrong. Expanding the home shortcut before the absolute/relative logic preserves existing behavior but also makes those configs work.

         if isinstance(script_file, str):
             script_file = Path(script_file)

+        # Normalize home directory shortcuts before resolving
+        script_file = script_file.expanduser()
+
         # Resolve relative paths against the YAML file directory, not CWD
         if not script_file.is_absolute() and self.original_data_path:
             yaml_dir = Path(self.original_data_path).parent
             script_file = (yaml_dir / script_file).resolve()
         else:
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 572d145 and fd97031.

📒 Files selected for processing (16)
  • README.md (7 hunks)
  • config/evaluation_data.yaml (1 hunks)
  • config/sample_scripts/cleanup.sh (1 hunks)
  • config/sample_scripts/setup.sh (1 hunks)
  • config/sample_scripts/verify.sh (1 hunks)
  • config/system.yaml (1 hunks)
  • src/lightspeed_evaluation/__init__.py (2 hunks)
  • src/lightspeed_evaluation/core/metrics/script_eval.py (1 hunks)
  • src/lightspeed_evaluation/core/models/data.py (3 hunks)
  • src/lightspeed_evaluation/core/script/__init__.py (1 hunks)
  • src/lightspeed_evaluation/core/script/manager.py (1 hunks)
  • src/lightspeed_evaluation/core/system/exceptions.py (1 hunks)
  • src/lightspeed_evaluation/core/system/validator.py (5 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/script/__init__.py
  • src/lightspeed_evaluation/__init__.py
  • src/lightspeed_evaluation/core/metrics/script_eval.py
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/core/system/validator.py
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.

Applied to files:

  • src/lightspeed_evaluation/core/system/validator.py
📚 Learning: 2025-07-16T10:41:09.399Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:274-297
Timestamp: 2025-07-16T10:41:09.399Z
Learning: In the lsc_agent_eval package, the team prefers to focus on core functionality testing first and considers testing cleanup script execution after setup failure as early optimization, noting that there's no guarantee cleanup scripts will run successfully anyway.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/processor.py
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.

Applied to files:

  • README.md
📚 Learning: 2025-09-09T08:08:40.654Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/core/api/__init__.py:3-3
Timestamp: 2025-09-09T08:08:40.654Z
Learning: API_KEY and OPENAI_API_KEY are distinct environment variables with different purposes in the lightspeed-evaluation codebase - they should not be treated as interchangeable fallbacks.

Applied to files:

  • README.md
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • config/system.yaml
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.

Applied to files:

  • config/evaluation_data.yaml
🧬 Code graph analysis (4)
src/lightspeed_evaluation/core/system/validator.py (2)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationData (135-182)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • DataValidationError (16-17)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (5)
src/lightspeed_evaluation/core/metrics/manager.py (4)
  • MetricLevel (10-14)
  • MetricManager (17-136)
  • resolve_metrics (24-48)
  • count_metrics_for_conversation (115-136)
src/lightspeed_evaluation/core/models/data.py (3)
  • EvaluationData (135-182)
  • EvaluationResult (185-224)
  • TurnData (35-132)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-101)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
  • mark_all_metrics_as_error (17-77)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
  • amend_conversation_data (20-69)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-101)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • MetricsEvaluator (21-156)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (1)
  • ProcessorComponents (19-26)
src/lightspeed_evaluation/core/script/manager.py (1)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • ScriptExecutionError (28-41)
🪛 markdownlint-cli2 (0.18.1)
README.md

334-334: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: mypy
🔇 Additional comments (1)
config/evaluation_data.yaml (1)

95-105: Script sample aligns with the new handler.
The example wires setup, verify, and cleanup scripts together, which should help exercise the ScriptExecutionManager path end-to-end.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
README.md (1)

334-341: Promote “Script Path Resolution” to a proper heading.

Markdownlint flags the bold text used as a faux heading; switching to an H4 keeps the hierarchy consistent and quiets the lint warning.

-**Script Path Resolution**
+#### Script Path Resolution
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fd97031 and b212474.

📒 Files selected for processing (7)
  • README.md (7 hunks)
  • config/evaluation_data.yaml (1 hunks)
  • config/sample_scripts/cleanup.sh (1 hunks)
  • config/sample_scripts/setup.sh (1 hunks)
  • config/sample_scripts/verify.sh (1 hunks)
  • src/lightspeed_evaluation/core/script/manager.py (1 hunks)
  • src/lightspeed_evaluation/core/system/validator.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • config/sample_scripts/cleanup.sh
  • config/sample_scripts/verify.sh
  • config/evaluation_data.yaml
  • config/sample_scripts/setup.sh
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:274-297
Timestamp: 2025-07-16T10:41:09.399Z
Learning: In the lsc_agent_eval package, the team prefers to focus on core functionality testing first and considers testing cleanup script execution after setup failure as early optimization, noting that there's no guarantee cleanup scripts will run successfully anyway.
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/core/system/validator.py
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.

Applied to files:

  • src/lightspeed_evaluation/core/system/validator.py
  • src/lightspeed_evaluation/core/script/manager.py
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.

Applied to files:

  • README.md
📚 Learning: 2025-09-09T08:08:40.654Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/core/api/__init__.py:3-3
Timestamp: 2025-09-09T08:08:40.654Z
Learning: API_KEY and OPENAI_API_KEY are distinct environment variables with different purposes in the lightspeed-evaluation codebase - they should not be treated as interchangeable fallbacks.

Applied to files:

  • README.md
🧬 Code graph analysis (2)
src/lightspeed_evaluation/core/system/validator.py (2)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationData (135-182)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • DataValidationError (16-17)
src/lightspeed_evaluation/core/script/manager.py (1)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • ScriptExecutionError (28-41)
🪛 markdownlint-cli2 (0.18.1)
README.md

334-334: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

🪛 GitHub Check: Bandit
src/lightspeed_evaluation/core/script/manager.py

[notice] 62-63:
subprocess call - check for execution of untrusted input.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: mypy
🔇 Additional comments (1)
src/lightspeed_evaluation/core/script/manager.py (1)

62-69: Working directory fix keeps relative script assets working.

Setting cwd=script_path.parent closes the gap where scripts sourcing sibling helpers would fail when invoked from another root. Thanks for wiring this in.

Based on learnings

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b212474 and 2e2a764.

📒 Files selected for processing (11)
  • README.md (7 hunks)
  • config/evaluation_data.yaml (1 hunks)
  • config/sample_scripts/cleanup.sh (1 hunks)
  • config/sample_scripts/setup.sh (1 hunks)
  • config/sample_scripts/verify.sh (1 hunks)
  • src/lightspeed_evaluation/core/metrics/script_eval.py (1 hunks)
  • src/lightspeed_evaluation/core/models/data.py (3 hunks)
  • src/lightspeed_evaluation/core/script/manager.py (1 hunks)
  • src/lightspeed_evaluation/core/system/validator.py (5 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/metrics/script_eval.py
  • config/sample_scripts/cleanup.sh
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:274-297
Timestamp: 2025-07-16T10:41:09.399Z
Learning: In the lsc_agent_eval package, the team prefers to focus on core functionality testing first and considers testing cleanup script execution after setup failure as early optimization, noting that there's no guarantee cleanup scripts will run successfully anyway.
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/core/system/validator.py
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.

Applied to files:

  • src/lightspeed_evaluation/core/system/validator.py
  • src/lightspeed_evaluation/core/script/manager.py
📚 Learning: 2025-09-28T01:11:08.828Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#62
File: config/sample_scripts/setup.sh:7-8
Timestamp: 2025-09-28T01:11:08.828Z
Learning: Sample scripts in config/sample_scripts/ are templates that individual teams will customize for their specific environments and requirements, not production-ready scripts.

Applied to files:

  • config/sample_scripts/setup.sh
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.

Applied to files:

  • config/evaluation_data.yaml
📚 Learning: 2025-07-16T10:41:09.399Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:274-297
Timestamp: 2025-07-16T10:41:09.399Z
Learning: In the lsc_agent_eval package, the team prefers to focus on core functionality testing first and considers testing cleanup script execution after setup failure as early optimization, noting that there's no guarantee cleanup scripts will run successfully anyway.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/processor.py
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.

Applied to files:

  • README.md
📚 Learning: 2025-09-09T08:08:40.654Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/core/api/__init__.py:3-3
Timestamp: 2025-09-09T08:08:40.654Z
Learning: API_KEY and OPENAI_API_KEY are distinct environment variables with different purposes in the lightspeed-evaluation codebase - they should not be treated as interchangeable fallbacks.

Applied to files:

  • README.md
🧬 Code graph analysis (3)
src/lightspeed_evaluation/core/system/validator.py (2)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationData (135-182)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • DataValidationError (16-17)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (6)
src/lightspeed_evaluation/core/metrics/manager.py (4)
  • MetricLevel (10-14)
  • MetricManager (17-136)
  • resolve_metrics (24-48)
  • count_metrics_for_conversation (115-136)
src/lightspeed_evaluation/core/models/data.py (4)
  • EvaluationData (135-182)
  • EvaluationRequest (241-294)
  • EvaluationResult (185-224)
  • TurnData (35-132)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-109)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
  • mark_all_metrics_as_error (17-77)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
  • amend_conversation_data (20-69)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • evaluate_metric (58-129)
src/lightspeed_evaluation/core/script/manager.py (1)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • ScriptExecutionError (28-41)
🪛 GitHub Check: Bandit
src/lightspeed_evaluation/core/script/manager.py

[notice] 69-70:
subprocess call - check for execution of untrusted input.

🪛 GitHub Actions: Python linter
src/lightspeed_evaluation/core/script/manager.py

[error] 28-28: Pylint: Too many branches (13/12).

🪛 markdownlint-cli2 (0.18.1)
README.md

334-334: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: mypy
🔇 Additional comments (4)
config/sample_scripts/setup.sh (1)

1-10: Good baseline for the sample setup script

The strict shell options and clear logging make this a solid template. No changes needed.

config/sample_scripts/verify.sh (1)

1-13: Sample verify script looks solid

The success/failure messaging tied to exit codes is exactly what downstream tooling needs. All good here.

src/lightspeed_evaluation/core/models/data.py (1)

76-172: Model field additions are correct

The new setup_script, cleanup_script, and verify_script fields integrate cleanly with existing validation patterns and give downstream components the data they need. Looks good.

README.md (1)

330-332: Quote the script:action_eval example to avoid YAML ambiguity

Line 331 leaves script:action_eval unquoted. YAML treats a colon with no trailing space as part of a plain scalar, but some parsers, linters, and editors still read it as a script: action_eval mapping, and adding a space later silently changes the data shape. Please quote the string in the example so readers can copy/paste it without surprises.

       turn_metrics:
-        - script:action_eval
+        - "script:action_eval"

(symfony.com)
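
A quick check with PyYAML (used here purely for illustration; the project's own loader may differ) shows where the ambiguity lives and why quoting settles it:

```python
import yaml  # PyYAML, used only to demonstrate how the metric name is parsed

unquoted = "turn_metrics:\n  - script:action_eval\n"    # no space after the colon
spaced = "turn_metrics:\n  - script: action_eval\n"     # space after the colon
quoted = 'turn_metrics:\n  - "script:action_eval"\n'    # explicitly a string

print(yaml.safe_load(unquoted))  # {'turn_metrics': ['script:action_eval']} -> plain scalar
print(yaml.safe_load(spaced))    # {'turn_metrics': [{'script': 'action_eval'}]} -> a mapping
print(yaml.safe_load(quoted))    # {'turn_metrics': ['script:action_eval']} -> unambiguous string
```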

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aad0c7f and 675af1c.

📒 Files selected for processing (2)
  • src/lightspeed_evaluation/core/metrics/script_eval.py (1 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lightspeed_evaluation/core/metrics/script_eval.py
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:274-297
Timestamp: 2025-07-16T10:41:09.399Z
Learning: In the lsc_agent_eval package, the team prefers to focus on core functionality testing first and considers testing cleanup script execution after setup failure as early optimization, noting that there's no guarantee cleanup scripts will run successfully anyway.
🧬 Code graph analysis (1)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (3)
src/lightspeed_evaluation/core/metrics/script_eval.py (1)
  • ScriptEvalMetrics (13-52)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationRequest (241-294)
  • EvaluationScope (227-238)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-117)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: mypy
  • GitHub Check: tests (3.12)

@asamal4
Collaborator Author

asamal4 commented Sep 29, 2025

@tisnik @VladimirKadlec PTAL

Contributor

@VladimirKadlec left a comment

LGTM
Do we have a Jira story for this?

@asamal4
Collaborator Author

asamal4 commented Sep 29, 2025

Do we have a Jira story for this?

@VladimirKadlec yes, LCORE-692

Contributor

@tisnik left a comment

LGTM

@tisnik tisnik merged commit 5a2ed1c into lightspeed-core:main Sep 29, 2025
15 checks passed
@coderabbitai bot mentioned this pull request Oct 6, 2025