
Env setup/cleanup ability and verify through script #62

Merged
tisnik merged 8 commits into lightspeed-core:main from asamal4:script-handler
Sep 29, 2025

Conversation

@asamal4 asamal4 (Collaborator) commented Sep 26, 2025

  • Ability to run environment setup and cleanup scripts at the conversation-group level.
  • Script execution applies only when the API is enabled.
  • Deterministic actions can be verified through a per-turn script.
  • Added mock/sample data as a reference for the expected structure (see the sketch below).
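
For orientation, a sketch of how such an entry might be shaped, written as a Python dict that mirrors the YAML. The script fields (setup_script, cleanup_script, verify_script) and the script:action_eval metric come from this PR; the remaining keys and values are illustrative assumptions, not the exact schema.

# Hypothetical conversation entry; only the script fields and metric name are taken from this PR.
sample_conversation = {
    "conversation_group": "script_eval_example",
    "setup_script": "sample_scripts/setup.sh",      # runs once before the conversation (API mode only)
    "cleanup_script": "sample_scripts/cleanup.sh",  # always attempted after the conversation
    "turns": [                                      # key name assumed
        {
            "query": "Create the test namespace",
            "verify_script": "sample_scripts/verify.sh",  # exit code 0 -> pass, non-zero -> fail
            "turn_metrics": ["script:action_eval"],
        }
    ],
}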

Summary by CodeRabbit

  • New Features

    • Script-based evaluations: per-turn verify scripts and optional conversation-level setup/cleanup scripts; new script metric (script:action_eval) with pass/fail scoring and API toggle.
    • Script execution manager and ScriptExecutionError exposed; sample setup/verify/cleanup scripts included.
  • Refactor

    • Evaluation flow now runs setup → per-turn evaluation → cleanup with centralized components, error short-circuiting, and metric routing.
    • Evaluation data schema extended to include script fields and structured metric entries.
  • Documentation

    • Added guide, YAML examples, script path resolution rules, and KUBECONFIG note.

@coderabbitai coderabbitai bot (Contributor) commented Sep 26, 2025

Walkthrough

Adds script-based evaluation support: new ScriptExecutionManager and ScriptExecutionError; ScriptEvalMetrics; model fields for setup/cleanup/verify scripts; validator script-path checks (YAML-dir resolution, executability); pipeline/evaluator/processor wiring to run setup scripts, per-turn verify scripts, and cleanup; README and sample scripts added.
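
As a rough sketch of the runner behaviour described above (run the script without a shell, enforce a timeout, capture output, report pass/fail by exit code, and raise ScriptExecutionError when execution itself fails). The class layout here is simplified, and anything not stated in the walkthrough, such as the default timeout value, is an assumption.

import logging
import subprocess
from pathlib import Path

logger = logging.getLogger(__name__)


class ScriptExecutionError(Exception):
    """Simplified stand-in; the real exception lives in core/system/exceptions.py."""


class ScriptExecutionManager:
    """Sketch of the subprocess-based script runner."""

    def __init__(self, timeout: int = 300):
        self.timeout = timeout

    def run_script(self, script_path) -> bool:
        """Run the script without a shell; True on exit code 0, False otherwise."""
        path = Path(script_path).resolve()
        if not path.is_file():
            raise ScriptExecutionError(f"Script not found: {path}")
        try:
            result = subprocess.run(
                [str(path)],
                text=True,
                capture_output=True,
                cwd=str(path.parent),
                timeout=self.timeout,
                check=False,
            )
        except (subprocess.TimeoutExpired, OSError) as exc:
            raise ScriptExecutionError(f"Script execution failed: {path}") from exc
        if result.stdout:
            logger.debug("Script stdout: %s", result.stdout.strip())
        if result.stderr:
            logger.debug("Script stderr: %s", result.stderr.strip())
        return result.returncode == 0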

Changes

Cohort / File(s) | Summary of changes
Docs
README.md
Added Script-Based Evaluations section, YAML examples, setup/cleanup/verify guidance, KUBECONFIG note, and updated metrics/data-field docs.
Config data & metadata
config/evaluation_data.yaml, config/system.yaml
Reworked evaluation_data entries (map-style metric entries, new script_eval_example and tool_eval_example), added conversation/turn script fields (setup_script, cleanup_script, verify_script); added script:action_eval metric metadata.
Public exports
src/lightspeed_evaluation/__init__.py, src/lightspeed_evaluation/core/script/__init__.py
Exported ScriptExecutionManager and ScriptExecutionError on package surfaces.
Script execution subsystem
src/lightspeed_evaluation/core/script/manager.py, src/lightspeed_evaluation/core/system/exceptions.py
New ScriptExecutionManager (resolves/validates script paths, checks executability, runs subprocess with timeout, captures output, returns success/fail, raises ScriptExecutionError); new ScriptExecutionError class.
Models
src/lightspeed_evaluation/core/models/data.py
Added setup_script and cleanup_script to EvaluationData; added verify_script to TurnData (accepts str or Path).
Validation
src/lightspeed_evaluation/core/system/validator.py
Added _validate_scripts and _validate_single_script; YAML-dir-relative path resolution, existence/type/executable checks (attempt chmod), contextual error messages; integrated into load_evaluation_data; added script:action_eval to METRIC_REQUIREMENTS.
Metrics
src/lightspeed_evaluation/core/metrics/script_eval.py
Added ScriptEvalMetrics to evaluate script:action_eval by invoking ScriptExecutionManager; enforces turn-level scope, respects API-enabled gating, returns numeric pass/fail and messages, handles ScriptExecutionError.
Pipeline integration
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py, src/lightspeed_evaluation/pipeline/evaluation/pipeline.py, src/lightspeed_evaluation/pipeline/evaluation/processor.py
Wired ScriptExecutionManager into pipeline and ProcessorComponents; MetricsEvaluator accepts script_manager; Processor runs setup before amendment/evaluation, invokes per-turn verify scripts via metrics, then runs cleanup (cleanup errors logged; setup failures short-circuit). Signatures updated to pass script_manager through components.
Sample scripts
config/sample_scripts/*
Added setup.sh, verify.sh, cleanup.sh demonstrating setup/verify/cleanup behavior for script-based evaluations.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Pipeline
  participant Validator
  participant Processor
  participant ScriptMgr as ScriptExecutionManager
  participant Metrics as MetricsEvaluator

  User->>Pipeline: start(evaluation_data)
  Pipeline->>Validator: validate(data)  %% includes script path checks
  Pipeline->>Processor: process(conversation)

  Note over Processor: Setup phase (if provided)
  Processor->>ScriptMgr: run_script(setup_script)?
  alt Setup fails
    ScriptMgr-->>Processor: error / raises ScriptExecutionError
    Processor-->>Pipeline: return ERROR (short-circuit)
  else Setup OK or absent
    ScriptMgr-->>Processor: success/skip
    loop For each turn
      Processor->>Metrics: evaluate(turn_metrics)
      alt metric == script:action_eval
        Metrics->>ScriptMgr: run_script(verify_script)
        ScriptMgr-->>Metrics: pass / fail
      end
    end
    Processor->>Metrics: evaluate(conversation_metrics)
    Processor-->>Pipeline: return RESULTS
  end

  Note over Processor: Cleanup (always attempt)
  Processor->>ScriptMgr: run_script(cleanup_script)?
  ScriptMgr-->>Processor: success / warn
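
A hedged Python sketch of the processor flow in the diagram above, reusing the runner sketched earlier. Function, attribute, and helper names are illustrative (the real processor wires these through ProcessorComponents); the points it captures are the short-circuit on setup failure and the cleanup attempt in a finally block.

# Illustrative flow only - not the real ConversationProcessor API.
def process_conversation(conv_data, script_manager, metrics_evaluator, error_handler):
    try:
        if conv_data.setup_script and not _run_setup(script_manager, conv_data.setup_script):
            # Setup failed: short-circuit and mark all metrics for this conversation as ERROR.
            return error_handler.mark_all_metrics_as_error(conv_data, "setup script failed")
        results = []
        for turn in conv_data.turns:  # attribute name assumed
            results.extend(metrics_evaluator.evaluate_turn(turn))
        results.extend(metrics_evaluator.evaluate_conversation(conv_data))
        return results
    finally:
        # Cleanup is always attempted, even after a setup failure; errors are only logged.
        if conv_data.cleanup_script:
            try:
                script_manager.run_script(conv_data.cleanup_script)
            except ScriptExecutionError as exc:
                logger.warning("Cleanup script failed: %s", exc)


def _run_setup(script_manager, setup_script) -> bool:
    try:
        return script_manager.run_script(setup_script)
    except ScriptExecutionError as exc:
        logger.error("Setup script error: %s", exc)
        return False
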
sequenceDiagram
  autonumber
  participant ScriptEval as ScriptEvalMetrics
  participant ScriptMgr as ScriptExecutionManager
  participant OS as Subprocess

  ScriptEval->>ScriptEval: validate scope & api_enabled
  alt API disabled
    ScriptEval-->>Caller: (None, "API disabled")
  else no verify_script
    ScriptEval-->>Caller: (None, "no verify script")
  else run verify script
    ScriptEval->>ScriptMgr: run_script(path)
    ScriptMgr->>OS: subprocess.run(script, env, timeout)
    alt exit code 0
      OS-->>ScriptMgr: success (code 0)
      ScriptMgr-->>ScriptEval: True
      ScriptEval-->>Caller: (1.0, "passed")
    else non-zero / error
      OS-->>ScriptMgr: failure / stderr
      ScriptMgr-->>ScriptEval: False or raises ScriptExecutionError
      ScriptEval-->>Caller: (0.0, "failed" / error message)
    end
  end
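
The metric-side behaviour from the diagram above can be sketched as a single function returning the (score, reason) tuples described (None when skipped, 1.0 on pass, 0.0 on fail); the signature is an assumption, and the real implementation is the ScriptEvalMetrics class.

from typing import Optional, Tuple


def evaluate_action_script(turn_data, script_manager, api_enabled: bool) -> Tuple[Optional[float], str]:
    """Sketch of script:action_eval scoring."""
    if not api_enabled:
        return None, "Script evaluation skipped: API is disabled"
    if not turn_data.verify_script:
        return None, "No verify script provided for this turn"
    try:
        passed = script_manager.run_script(turn_data.verify_script)
    except ScriptExecutionError as exc:
        return 0.0, f"Verify script execution error: {exc}"
    return (1.0, "Verify script passed") if passed else (0.0, "Verify script failed (non-zero exit code)")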

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • Turn metric override #55 — Modifies pipeline/component wiring (ProcessorComponents, MetricsEvaluator) that overlaps with script_manager wiring and signature changes.
  • agent eval: multi-turn & refactoring #22 — Implements per-conversation setup/cleanup and per-turn script execution; overlaps on script validation and run semantics.
  • feat: add agent e2e eval #19 — Adds script-based evaluation features (script runner, ScriptExecutionError, script-based metrics) with strong code-level overlap.

Suggested reviewers

  • VladimirKadlec
  • tisnik

Poem

I hop to the scripts, I scurry and test,
Setup at sunrise, verify each quest.
Cleanup at dusk when the checks are all done,
Carrots for success — crunchy, warm fun. 🥕🐇

Pre-merge checks

✅ Passed checks (3 passed)
  • Title Check ✅ Passed: The title succinctly describes the primary enhancements introduced by the changeset (environment setup and cleanup scripts plus script-based verification) and clearly conveys the main intent of the pull request.
  • Docstring Coverage ✅ Passed: Docstring coverage is 100.00%, above the required threshold of 80.00%.
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.


@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

120-122: Fix potential AttributeError when turn_data is None (conversation-level).

Operator precedence causes request.turn_data.response to be accessed even when turn_data is None.

Apply this diff in both locations:

-                response=request.turn_data.response or "" if request.turn_data else "",
+                response=(request.turn_data.response or "") if request.turn_data else "",

And similarly in _create_error_result:

-            response=request.turn_data.response or "" if request.turn_data else "",
+            response=(request.turn_data.response or "") if request.turn_data else "",

Also applies to: 144-146

config/evaluation_data.yaml (1)

57-69: Ensure referenced scripts exist and are executable
The following files are referenced in config/evaluation_data.yaml but are missing:

  • scripts/verify_infrastructure.sh
  • scripts/setup_test_env.sh
  • scripts/cleanup_test_env.sh
    Add each script (with a valid #! shebang) and mark them executable, or correct their paths.
🧹 Nitpick comments (10)
config/system.yaml (1)

76-79: Consider setting an explicit threshold for binary script metrics.

Without an explicit threshold, MetricsEvaluator defaults to 0.5; for a binary 0/1 verify script, it's clearer to require success (1.0).

Apply this diff:

-    "script:action_eval":
-      description: "Script-based evaluation for infrastructure/environment validation"
+    "script:action_eval":
+      threshold: 1.0
+      description: "Script-based evaluation for infrastructure/environment validation"
src/lightspeed_evaluation/core/system/exceptions.py (1)

28-41: Add str override for richer error context.

Include script_path and return_code in the string form to improve logs and surfaced messages.

 class ScriptExecutionError(EvaluationError):
     """Exception raised when script execution fails."""
 
     def __init__(self, message: str, script_path: str = "", return_code: int = -1):
         """Initialize script execution error.
@@
         super().__init__(message)
         self.script_path = script_path
         self.return_code = return_code
+
+    def __str__(self) -> str:
+        base = super().__str__()
+        extras = []
+        if self.script_path:
+            extras.append(f"script={self.script_path}")
+        if self.return_code != -1:
+            extras.append(f"return_code={self.return_code}")
+        return f"{base} ({', '.join(extras)})" if extras else base
README.md (1)

295-301: Tighten wording for the API-enabled scripts note.

Minor grammar fix.

-The framework supports script-based evaluations.
-**Note: Scripts only execute when API is enabled** - they're designed to test the actual with actual environment changes.
+The framework supports script-based evaluations.
+**Note: Scripts only execute when API is enabled** — they're designed to test with actual environment changes.
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

30-31: Update the constructor docstring for accuracy.

Now initializes multiple managers and handlers, not just LLM.

-        """Initialize with LLM manager."""
+        """Initialize evaluator with system config, metric manager, and script manager."""
src/lightspeed_evaluation/core/models/data.py (2)

76-80: Normalize/validate script paths at the model layer (optional).

Consider a field_validator to strip whitespace and coerce to Path for verify_script to reduce downstream checks.


164-173: Mirror normalization for setup/cleanup scripts.

Same recommendation for setup_script and cleanup_script for consistency and early feedback.
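
A minimal sketch of the suggested normalization, assuming pydantic v2 (the field name matches data.py; everything else is simplified). The same validator could be reused for setup_script and cleanup_script on EvaluationData.

from pathlib import Path
from typing import Optional, Union

from pydantic import BaseModel, field_validator


class TurnData(BaseModel):  # heavily simplified; the real model has many more fields
    verify_script: Optional[Union[str, Path]] = None

    @field_validator("verify_script", mode="before")
    @classmethod
    def _normalize_script(cls, value):
        """Strip whitespace and coerce to Path early so downstream code sees one type."""
        if value is None:
            return None
        value = str(value).strip()
        return Path(value) if value else None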

src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (1)

61-67: Make script timeout configurable.

Expose a script execution timeout in SystemConfig and pass it here to avoid hardcoding 300s.

Example (once config is added):

-        script_manager = ScriptExecutionManager()
+        script_manager = ScriptExecutionManager(timeout=config.script.timeout)
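
For illustration, one way the timeout could be surfaced in config, assuming pydantic-based settings; the ScriptConfig class and the config.script attribute are hypothetical, not the project's current schema.

from pydantic import BaseModel, Field


class ScriptConfig(BaseModel):
    # Seconds before a setup/verify/cleanup script is killed; 300 matches the current hardcoded default.
    timeout: int = Field(default=300, ge=1)


class SystemConfig(BaseModel):  # simplified
    script: ScriptConfig = Field(default_factory=ScriptConfig)


# Wiring, as in the diff above:
# script_manager = ScriptExecutionManager(timeout=config.script.timeout)
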
src/lightspeed_evaluation/core/system/validator.py (1)

288-295: Avoid chmod in validator; fail fast instead

Silently mutating file permissions in a validator introduces side effects and potential security posture issues (e.g., chmod on shared files). Prefer failing with a clear error and let the caller fix permissions (or perform opt‑in chmod at execution time).

Apply this diff:

-        # Check if script is executable or can be made executable
-        if not os.access(script_file, os.X_OK):
-            try:
-                script_file.chmod(0o755)
-            except (OSError, PermissionError) as exc:
-                raise DataValidationError(
-                    f"Conversation {context}: {script_type} script is not executable: {script_file}"
-                ) from exc
+        # Check if script is executable
+        if not os.access(script_file, os.X_OK):
+            raise DataValidationError(
+                f"Conversation {context}: {script_type} script is not executable: {script_file}"
+            )
src/lightspeed_evaluation/core/script/manager.py (2)

62-69: Run scripts with cwd set to the script’s directory

Many scripts assume relative paths to neighbors; setting cwd prevents surprises and improves isolation.

Apply this diff:

-            result = subprocess.run(
-                [str(script_path)],
-                text=True,
-                capture_output=True,
-                env=env,
-                timeout=self.timeout,
-                check=False,
-            )
+            result = subprocess.run(
+                [str(script_path)],
+                text=True,
+                capture_output=True,
+                env=env,
+                cwd=str(script_path.parent),
+                timeout=self.timeout,
+                check=False,
+            )

71-79: Truncate stdout/stderr in logs to avoid log flooding

Large outputs can overwhelm logs. Truncate at a sensible size while keeping full capture in memory if needed later.

Apply this diff:

-            # Log output
-            if result.stdout:
-                logger.debug("Script stdout: %s", result.stdout.strip())
-            if result.stderr:
-                if result.returncode != 0:
-                    logger.error("Script stderr: %s", result.stderr.strip())
-                else:
-                    logger.debug("Script stderr: %s", result.stderr.strip())
+            # Log output (truncate to avoid flooding)
+            def _truncate(s: str, limit: int = 8192) -> str:
+                s = s.strip()
+                return (s[:limit] + "... (truncated)") if len(s) > limit else s
+            if result.stdout:
+                logger.debug("Script stdout: %s", _truncate(result.stdout))
+            if result.stderr:
+                msg = _truncate(result.stderr)
+                if result.returncode != 0:
+                    logger.error("Script stderr: %s", msg)
+                else:
+                    logger.debug("Script stderr: %s", msg)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f75eb6c and b838b10.

📒 Files selected for processing (13)
  • README.md (2 hunks)
  • config/evaluation_data.yaml (1 hunks)
  • config/system.yaml (1 hunks)
  • src/lightspeed_evaluation/__init__.py (2 hunks)
  • src/lightspeed_evaluation/core/metrics/script_eval.py (1 hunks)
  • src/lightspeed_evaluation/core/models/data.py (3 hunks)
  • src/lightspeed_evaluation/core/script/__init__.py (1 hunks)
  • src/lightspeed_evaluation/core/script/manager.py (1 hunks)
  • src/lightspeed_evaluation/core/system/exceptions.py (1 hunks)
  • src/lightspeed_evaluation/core/system/validator.py (5 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py (6 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.

Applied to files:

  • config/evaluation_data.yaml
📚 Learning: 2025-09-09T08:08:40.654Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/core/api/__init__.py:3-3
Timestamp: 2025-09-09T08:08:40.654Z
Learning: API_KEY and OPENAI_API_KEY are distinct environment variables with different purposes in the lightspeed-evaluation codebase - they should not be treated as interchangeable fallbacks.

Applied to files:

  • README.md
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • config/system.yaml
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/core/system/validator.py
🧬 Code graph analysis (8)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
src/lightspeed_evaluation/core/metrics/script_eval.py (1)
  • ScriptEvalMetrics (13-60)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-101)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-101)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • MetricsEvaluator (21-156)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (1)
  • ProcessorComponents (19-26)
src/lightspeed_evaluation/core/metrics/script_eval.py (2)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationScope (227-238)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-101)
src/lightspeed_evaluation/__init__.py (2)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-101)
src/lightspeed_evaluation/core/system/exceptions.py (3)
  • APIError (8-9)
  • DataValidationError (16-17)
  • EvaluationError (4-5)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (5)
src/lightspeed_evaluation/core/metrics/manager.py (3)
  • MetricLevel (10-14)
  • resolve_metrics (24-48)
  • count_metrics_for_conversation (115-136)
src/lightspeed_evaluation/core/models/data.py (3)
  • EvaluationData (135-182)
  • EvaluationResult (185-224)
  • TurnData (35-132)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-101)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
  • mark_all_metrics_as_error (17-77)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
  • amend_conversation_data (20-69)
src/lightspeed_evaluation/core/script/__init__.py (2)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • ScriptExecutionError (28-41)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-101)
src/lightspeed_evaluation/core/script/manager.py (1)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • ScriptExecutionError (28-41)
src/lightspeed_evaluation/core/system/validator.py (2)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationData (135-182)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • DataValidationError (16-17)
🪛 GitHub Check: Bandit
src/lightspeed_evaluation/core/script/manager.py

[notice] 5-5:
Consider possible security implications associated with the subprocess module.


[notice] 62-63:
subprocess call - check for execution of untrusted input.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: mypy
  • GitHub Check: tests (3.13)
🔇 Additional comments (16)
src/lightspeed_evaluation/core/script/__init__.py (1)

1-9: Public API exposure looks good.

Clean exports for ScriptExecutionManager and ScriptExecutionError.

src/lightspeed_evaluation/__init__.py (1)

27-36: Expose script execution types at top-level — LGTM.

Re-exporting ScriptExecutionManager and ScriptExecutionError is consistent and convenient.

Also applies to: 58-60

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

44-47: Script metric wiring — LGTM.

Properly instantiates ScriptEvalMetrics with API-enabled flag and routes under "script".

Also applies to: 53-54

src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (2)

64-67: Pipeline wiring — LGTM.

Creating and passing ScriptExecutionManager into MetricsEvaluator and ProcessorComponents is clean.


69-76: Setup and cleanup scripts correctly gated by api.enabled.

src/lightspeed_evaluation/core/system/validator.py (3)

51-54: Registering script metric requirements looks good

The script:action_eval requirement for verify_script is correctly captured and aligns with the new metric.


123-126: Correctly gating script validation by API flag

Running script path validation only when API is enabled matches the intended behavior and avoids unnecessary failures offline.


202-205: Good: skipping script metric checks when API is disabled

This prevents false positives in non-API runs.

src/lightspeed_evaluation/core/script/manager.py (1)

5-5: Bandit: subprocess usage is acceptable here (no shell), ensure sources are trusted

You correctly avoid shell=True. Inputs are validated earlier (existence, file type). Keep script paths restricted to validated config.

Please confirm that all script paths originate from validated YAML (and not any runtime/user input).

src/lightspeed_evaluation/core/metrics/script_eval.py (2)

35-38: Metric routing is consistent

Handling "action_eval" under the script namespace matches the registry pattern.


47-51: Graceful no-op when API disabled

Skipping execution when API is off aligns with the validator’s behavior.

src/lightspeed_evaluation/pipeline/evaluation/processor.py (4)

22-27: Components container extension looks good

Adding script_manager keeps wiring centralized and clean.


80-101: Happy path + API error short-circuit logic reads well

Short-circuiting to ERROR on API amendment failure, with cleanup in finally, is solid.


164-187: Setup runner is robust

Good logging and error propagation via string reason; integrates with manager exceptions cleanly.


188-215: Cleanup runner behavior is appropriate

Non-blocking on failure and always-attempt semantics are good.

config/evaluation_data.yaml (1)

3-29: YAML structure aligns with model and validator

Fields and nulls (verify_script/setup_script/cleanup_script) match the new schema.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

119-121: Fix potential AttributeError when turn_data is None (bad conditional expression precedence).
Current expression can evaluate request.turn_data.response even when turn_data is None. Apply:

-                query=request.turn_data.query if request.turn_data else "",
-                response=request.turn_data.response or "" if request.turn_data else "",
+                query=request.turn_data.query if request.turn_data else "",
+                response=(request.turn_data.response or "") if request.turn_data else "",

And similarly in _create_error_result:

-            query=request.turn_data.query if request.turn_data else "",
-            response=request.turn_data.response or "" if request.turn_data else "",
+            query=request.turn_data.query if request.turn_data else "",
+            response=(request.turn_data.response or "") if request.turn_data else "",

Also applies to: 143-145

src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (1)

128-136: Propagate original_data_path to DataValidator and validate scripts for programmatic inputs.
Currently _validate_scripts runs only in load_evaluation_data; pipeline path using validate_evaluation_data skips script validation, risking wrong CWD resolution and missing exec checks.

Apply:

         self.original_data_path = original_data_path
         logger.info("Starting evaluation")
         results: list[EvaluationResult] = []
 
         # Step 1: Validate data
         logger.info("Validating data")
+        # Ensure validator can resolve relative script paths against the YAML file dir
+        self.data_validator.original_data_path = self.original_data_path
         if not self.validate_data(evaluation_data):
             raise ValueError("Data validation failed. Cannot proceed with evaluation.")
+        # Also validate/normalize scripts when API is enabled and data was provided programmatically
+        config = self.config_loader.system_config
+        if config and config.api.enabled:
+            # Internal method is acceptable here until a public hook is exposed
+            self.data_validator._validate_scripts(evaluation_data)  # pylint: disable=protected-access

Alternatively, expose a public validate_scripts() on DataValidator and call it here.

🧹 Nitpick comments (6)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

73-79: Optional: guard split on malformed metric identifiers.
Avoid ValueError on missing ":".

-            framework, metric_name = request.metric_identifier.split(":", 1)
+            if ":" not in request.metric_identifier:
+                execution_time = time.time() - start_time
+                return self._create_error_result(
+                    request, f"Invalid metric identifier: {request.metric_identifier}", execution_time
+                )
+            framework, metric_name = request.metric_identifier.split(":", 1)
src/lightspeed_evaluation/core/system/validator.py (3)

123-126: Scripts validated only in load_evaluation_data; also invoke during validate_evaluation_data.
To cover programmatic inputs, mirror this check in validate_evaluation_data when api_enabled.

You can add at the end of validate_evaluation_data:

         if self.validation_errors:
             ...
         validation_msg = "✅ All data validation passed"
         if self.api_enabled:
             validation_msg += " (API mode - data will be enhanced via API)"
         print(validation_msg)
+        # Normalize/validate script paths in API mode even when data is pre-loaded
+        if self.api_enabled:
+            self._validate_scripts(evaluation_data)
         return True

Based on learnings


202-205: Use the method parameter api_enabled consistently.
Currently mixes self.api_enabled and api_enabled; prefer the parameter for clarity.

-                if metric.startswith("script:") and not self.api_enabled:
+                if metric.startswith("script:") and not api_enabled:
                     continue

288-294: Be cautious about chmod in a validator.
Changing file modes may be undesirable in read-only or shared repos; consider warning instead of mutating, or make this opt-in.

src/lightspeed_evaluation/pipeline/evaluation/processor.py (2)

84-87: Prefer fail-safe handling over raising when config is missing

Raising here will bubble an exception and skip error classification. Either validate system_config in __init__ (fail-fast) or mark all metrics as ERROR here and return.

Apply this diff to convert to ERROR results (Option B):

-            if self.config is None:
-                raise ValueError("SystemConfig must be loaded")
+            if self.config is None:
+                logger.error("SystemConfig not loaded - marking all metrics as ERROR")
+                error_results = self.components.error_handler.mark_all_metrics_as_error(
+                    conv_data,
+                    "SystemConfig not loaded",
+                    resolved_turn_metrics=resolved_turn_metrics,
+                    resolved_conversation_metrics=resolved_conversation_metrics,
+                )
+                return error_results

172-176: Don’t couple script execution to API.enabled

Gating setup/cleanup scripts on api.enabled is surprising; script eval may be useful even without API calls. Suggest always running scripts when provided (or introduce a dedicated scripts.enabled flag in config).

Apply this diff to remove the API gate:

-        # Skip script execution if API is disabled
-        if self.config and not self.config.api.enabled:
-            logger.debug("Skipping setup script (API disabled): %s", setup_script)
-            return None
-        # Skip script execution if API is disabled
-        if self.config and not self.config.api.enabled:
-            logger.debug("Skipping cleanup script (API disabled): %s", cleanup_script)
-            return

If the intent was to avoid running placeholder scripts in sample data, consider a dedicated flag instead. To verify intent, please confirm expected behavior when api.enabled=false.

Also applies to: 199-203

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b838b10 and 572d145.

📒 Files selected for processing (10)
  • README.md (2 hunks)
  • config/evaluation_data.yaml (1 hunks)
  • config/system.yaml (1 hunks)
  • src/lightspeed_evaluation/__init__.py (2 hunks)
  • src/lightspeed_evaluation/core/metrics/script_eval.py (1 hunks)
  • src/lightspeed_evaluation/core/models/data.py (3 hunks)
  • src/lightspeed_evaluation/core/system/validator.py (5 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • config/system.yaml
  • README.md
  • src/lightspeed_evaluation/__init__.py
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.

Applied to files:

  • config/evaluation_data.yaml
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/core/system/validator.py
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.

Applied to files:

  • src/lightspeed_evaluation/core/system/validator.py
📚 Learning: 2025-07-16T10:41:09.399Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:274-297
Timestamp: 2025-07-16T10:41:09.399Z
Learning: In the lsc_agent_eval package, the team prefers to focus on core functionality testing first and considers testing cleanup script execution after setup failure as early optimization, noting that there's no guarantee cleanup scripts will run successfully anyway.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/processor.py
🧬 Code graph analysis (5)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-101)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • MetricsEvaluator (21-156)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (1)
  • ProcessorComponents (19-26)
src/lightspeed_evaluation/core/metrics/script_eval.py (2)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationScope (227-238)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-101)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
src/lightspeed_evaluation/core/metrics/script_eval.py (1)
  • ScriptEvalMetrics (13-60)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-101)
src/lightspeed_evaluation/core/system/validator.py (2)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationData (135-182)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • DataValidationError (16-17)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (6)
src/lightspeed_evaluation/core/metrics/manager.py (4)
  • MetricLevel (10-14)
  • MetricManager (17-136)
  • resolve_metrics (24-48)
  • count_metrics_for_conversation (115-136)
src/lightspeed_evaluation/core/models/data.py (4)
  • EvaluationData (135-182)
  • EvaluationRequest (241-294)
  • EvaluationResult (185-224)
  • TurnData (35-132)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-101)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
  • mark_all_metrics_as_error (17-77)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
  • amend_conversation_data (20-69)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • evaluate_metric (58-129)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: mypy
🔇 Additional comments (21)
src/lightspeed_evaluation/core/models/data.py (3)

4-5: LGTM on imports and typing.
Path/Union imports align with new fields.


76-80: Turn-level verify_script field looks good.
Optional Path/str is appropriate; validator normalizes.


164-173: Conversation-level setup/cleanup scripts added correctly.
Matches downstream ScriptExecutionManager usage.

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

13-16: Correctly wires ScriptEvalMetrics and ScriptExecutionManager.
Keeps handler map cohesive.


28-31: All MetricsEvaluator instantiations updated
Found only one call in src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (line 65) using three arguments; no legacy two‐arg usages remain.

src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3)

10-10: Import of ScriptExecutionManager is correct.


61-67: Good: centralized ScriptExecutionManager and passed into evaluator.


75-76: Passing script_manager through ProcessorComponents is appropriate.

src/lightspeed_evaluation/core/system/validator.py (3)

3-6: LGTM on imports and API gating setup.


51-55: Requirement mapping for script:action_eval is correct.
Ensures verify_script presence only when applicable.


242-297: Resolve relative script paths against the YAML file directory, not CWD.
As-is, Path.resolve() will use process CWD; breaks when running from other dirs.

Apply:

         if isinstance(script_file, str):
             script_file = Path(script_file)
 
-        script_file = script_file.resolve()
+        # Resolve relative paths against the YAML file directory (fallback to CWD)
+        if not script_file.is_absolute():
+            base_dir = (
+                Path(self.original_data_path).resolve().parent
+                if self.original_data_path
+                else Path.cwd()
+            )
+            script_file = (base_dir / script_file).resolve()
+        else:
+            script_file = script_file.resolve()

Based on learnings

src/lightspeed_evaluation/core/metrics/script_eval.py (3)

13-22: Constructor and API gating look good.
Aligns with system_config.api.enabled wiring.


23-39: Scope checks and metric routing are appropriate.
Turn-only restriction and unknown-metric messaging are clear.


40-61: Execution path and error handling are solid.
return (1.0/0.0/None) semantics fit existing evaluator contract.

src/lightspeed_evaluation/pipeline/evaluation/processor.py (4)

67-80: Good fix: attempt cleanup even when setup fails

Cleanup is now attempted before returning on setup failure. This addresses prior feedback.


93-103: Cleanup still runs on API error path

Returning inside try is fine; finally ensures cleanup runs. LGTM.

Also applies to: 130-133


148-151: Evaluator wiring looks correct

Turn-level evaluation now delegates through components.metrics_evaluator. LGTM.


218-221: Summary delegation is consistent

Delegating to metric_manager.count_metrics_for_conversation is appropriate. LGTM.

config/evaluation_data.yaml (3)

20-22: Metrics supported: ragas:faithfulness, ragas:response_relevancy, and ragas:context_precision_without_reference are registered in system config and handled by the RAGAS pipeline.


61-68: Add missing scripts or gate their execution
config/evaluation_data.yaml references verify_infrastructure.sh, setup_test_env.sh, and cleanup_test_env.sh, but none exist in the repo. Either include these scripts under scripts/ or wrap their invocation behind a feature flag.

⛔ Skipped due to learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/data/script/eval3/setup.sh:1-3
Timestamp: 2025-07-16T09:42:00.691Z
Learning: Scripts in the lsc_agent_eval/data directory are meant to be simple examples/samples for teams to customize according to their needs, not production-ready code.

45-50: Confirmed deepeval conversation metrics registration and handler implementation

Both deepeval:conversation_completeness and deepeval:conversation_relevancy are defined under conversation_level in config/system.yaml, and DeepEvalMetrics is implemented and wired under the “deepeval” key in src/lightspeed_evaluation/pipeline/evaluation/evaluator.py.

@asamal4 asamal4 changed the title from "Script handler" to "Env setup/cleanup ability and verify through script" on Sep 28, 2025
@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/system/validator.py (1)

271-279: Handle user home shortcuts in script paths

Path("~…") stays literal right now, so a config that points to ~/scripts/setup.sh ends up looking for <yaml_dir>/~/scripts/setup.sh, which is almost certainly wrong. Expanding the home shortcut before the absolute/relative logic preserves existing behavior but also makes those configs work.

         if isinstance(script_file, str):
             script_file = Path(script_file)

+        # Normalize home directory shortcuts before resolving
+        script_file = script_file.expanduser()
+
         # Resolve relative paths against the YAML file directory, not CWD
         if not script_file.is_absolute() and self.original_data_path:
             yaml_dir = Path(self.original_data_path).parent
             script_file = (yaml_dir / script_file).resolve()
         else:
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 572d145 and fd97031.

📒 Files selected for processing (16)
  • README.md (7 hunks)
  • config/evaluation_data.yaml (1 hunks)
  • config/sample_scripts/cleanup.sh (1 hunks)
  • config/sample_scripts/setup.sh (1 hunks)
  • config/sample_scripts/verify.sh (1 hunks)
  • config/system.yaml (1 hunks)
  • src/lightspeed_evaluation/__init__.py (2 hunks)
  • src/lightspeed_evaluation/core/metrics/script_eval.py (1 hunks)
  • src/lightspeed_evaluation/core/models/data.py (3 hunks)
  • src/lightspeed_evaluation/core/script/__init__.py (1 hunks)
  • src/lightspeed_evaluation/core/script/manager.py (1 hunks)
  • src/lightspeed_evaluation/core/system/exceptions.py (1 hunks)
  • src/lightspeed_evaluation/core/system/validator.py (5 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/script/__init__.py
  • src/lightspeed_evaluation/__init__.py
  • src/lightspeed_evaluation/core/metrics/script_eval.py
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/core/system/validator.py
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.

Applied to files:

  • src/lightspeed_evaluation/core/system/validator.py
📚 Learning: 2025-07-16T10:41:09.399Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:274-297
Timestamp: 2025-07-16T10:41:09.399Z
Learning: In the lsc_agent_eval package, the team prefers to focus on core functionality testing first and considers testing cleanup script execution after setup failure as early optimization, noting that there's no guarantee cleanup scripts will run successfully anyway.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/processor.py
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.

Applied to files:

  • README.md
📚 Learning: 2025-09-09T08:08:40.654Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/core/api/__init__.py:3-3
Timestamp: 2025-09-09T08:08:40.654Z
Learning: API_KEY and OPENAI_API_KEY are distinct environment variables with different purposes in the lightspeed-evaluation codebase - they should not be treated as interchangeable fallbacks.

Applied to files:

  • README.md
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • config/system.yaml
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.

Applied to files:

  • config/evaluation_data.yaml
🧬 Code graph analysis (4)
src/lightspeed_evaluation/core/system/validator.py (2)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationData (135-182)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • DataValidationError (16-17)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (5)
src/lightspeed_evaluation/core/metrics/manager.py (4)
  • MetricLevel (10-14)
  • MetricManager (17-136)
  • resolve_metrics (24-48)
  • count_metrics_for_conversation (115-136)
src/lightspeed_evaluation/core/models/data.py (3)
  • EvaluationData (135-182)
  • EvaluationResult (185-224)
  • TurnData (35-132)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-101)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
  • mark_all_metrics_as_error (17-77)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
  • amend_conversation_data (20-69)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-101)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • MetricsEvaluator (21-156)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (1)
  • ProcessorComponents (19-26)
src/lightspeed_evaluation/core/script/manager.py (1)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • ScriptExecutionError (28-41)
🪛 markdownlint-cli2 (0.18.1)
README.md

334-334: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: mypy
🔇 Additional comments (1)
config/evaluation_data.yaml (1)

95-105: Script sample aligns with the new handler.
The example wires setup, verify, and cleanup scripts together, which should help exercise the ScriptExecutionManager path end-to-end.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
README.md (1)

334-341: Promote “Script Path Resolution” to a proper heading.

Markdownlint flags the bold text used as a faux heading; switching to an H4 keeps the hierarchy consistent and quiets the lint warning.

-**Script Path Resolution**
+#### Script Path Resolution
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fd97031 and b212474.

📒 Files selected for processing (7)
  • README.md (7 hunks)
  • config/evaluation_data.yaml (1 hunks)
  • config/sample_scripts/cleanup.sh (1 hunks)
  • config/sample_scripts/setup.sh (1 hunks)
  • config/sample_scripts/verify.sh (1 hunks)
  • src/lightspeed_evaluation/core/script/manager.py (1 hunks)
  • src/lightspeed_evaluation/core/system/validator.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • config/sample_scripts/cleanup.sh
  • config/sample_scripts/verify.sh
  • config/evaluation_data.yaml
  • config/sample_scripts/setup.sh
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:274-297
Timestamp: 2025-07-16T10:41:09.399Z
Learning: In the lsc_agent_eval package, the team prefers to focus on core functionality testing first and considers testing cleanup script execution after setup failure as early optimization, noting that there's no guarantee cleanup scripts will run successfully anyway.
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/core/system/validator.py
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.

Applied to files:

  • src/lightspeed_evaluation/core/system/validator.py
  • src/lightspeed_evaluation/core/script/manager.py
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.

Applied to files:

  • README.md
📚 Learning: 2025-09-09T08:08:40.654Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/core/api/__init__.py:3-3
Timestamp: 2025-09-09T08:08:40.654Z
Learning: API_KEY and OPENAI_API_KEY are distinct environment variables with different purposes in the lightspeed-evaluation codebase - they should not be treated as interchangeable fallbacks.

Applied to files:

  • README.md
🧬 Code graph analysis (2)
src/lightspeed_evaluation/core/system/validator.py (2)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationData (135-182)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • DataValidationError (16-17)
src/lightspeed_evaluation/core/script/manager.py (1)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • ScriptExecutionError (28-41)
🪛 markdownlint-cli2 (0.18.1)
README.md

334-334: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

🪛 GitHub Check: Bandit
src/lightspeed_evaluation/core/script/manager.py

[notice] 62-63:
subprocess call - check for execution of untrusted input.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: mypy
🔇 Additional comments (1)
src/lightspeed_evaluation/core/script/manager.py (1)

62-69: Working directory fix keeps relative script assets working.

Setting cwd=script_path.parent closes the gap where scripts sourcing sibling helpers would fail when invoked from another root. Thanks for wiring this in.

Based on learnings

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b212474 and 2e2a764.

📒 Files selected for processing (11)
  • README.md (7 hunks)
  • config/evaluation_data.yaml (1 hunks)
  • config/sample_scripts/cleanup.sh (1 hunks)
  • config/sample_scripts/setup.sh (1 hunks)
  • config/sample_scripts/verify.sh (1 hunks)
  • src/lightspeed_evaluation/core/metrics/script_eval.py (1 hunks)
  • src/lightspeed_evaluation/core/models/data.py (3 hunks)
  • src/lightspeed_evaluation/core/script/manager.py (1 hunks)
  • src/lightspeed_evaluation/core/system/validator.py (5 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/metrics/script_eval.py
  • config/sample_scripts/cleanup.sh
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:274-297
Timestamp: 2025-07-16T10:41:09.399Z
Learning: In the lsc_agent_eval package, the team prefers to focus on core functionality testing first and considers testing cleanup script execution after setup failure as early optimization, noting that there's no guarantee cleanup scripts will run successfully anyway.
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/core/system/validator.py
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.

Applied to files:

  • src/lightspeed_evaluation/core/system/validator.py
  • src/lightspeed_evaluation/core/script/manager.py
📚 Learning: 2025-09-28T01:11:08.828Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#62
File: config/sample_scripts/setup.sh:7-8
Timestamp: 2025-09-28T01:11:08.828Z
Learning: Sample scripts in config/sample_scripts/ are templates that individual teams will customize for their specific environments and requirements, not production-ready scripts.

Applied to files:

  • config/sample_scripts/setup.sh
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.

Applied to files:

  • config/evaluation_data.yaml
📚 Learning: 2025-07-16T10:41:09.399Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:274-297
Timestamp: 2025-07-16T10:41:09.399Z
Learning: In the lsc_agent_eval package, the team prefers to focus on core functionality testing first and considers testing cleanup script execution after setup failure as early optimization, noting that there's no guarantee cleanup scripts will run successfully anyway.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/processor.py
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.

Applied to files:

  • README.md
📚 Learning: 2025-09-09T08:08:40.654Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/core/api/__init__.py:3-3
Timestamp: 2025-09-09T08:08:40.654Z
Learning: API_KEY and OPENAI_API_KEY are distinct environment variables with different purposes in the lightspeed-evaluation codebase - they should not be treated as interchangeable fallbacks.

Applied to files:

  • README.md
🧬 Code graph analysis (3)
src/lightspeed_evaluation/core/system/validator.py (2)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationData (135-182)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • DataValidationError (16-17)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (6)
src/lightspeed_evaluation/core/metrics/manager.py (4)
  • MetricLevel (10-14)
  • MetricManager (17-136)
  • resolve_metrics (24-48)
  • count_metrics_for_conversation (115-136)
src/lightspeed_evaluation/core/models/data.py (4)
  • EvaluationData (135-182)
  • EvaluationRequest (241-294)
  • EvaluationResult (185-224)
  • TurnData (35-132)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-109)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
  • mark_all_metrics_as_error (17-77)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
  • amend_conversation_data (20-69)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • evaluate_metric (58-129)
src/lightspeed_evaluation/core/script/manager.py (1)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • ScriptExecutionError (28-41)
🪛 GitHub Check: Bandit
src/lightspeed_evaluation/core/script/manager.py

[notice] 69-70:
subprocess call - check for execution of untrusted input.

🪛 GitHub Actions: Python linter
src/lightspeed_evaluation/core/script/manager.py

[error] 28-28: Pylint: Too many branches (13/12).

🪛 markdownlint-cli2 (0.18.1)
README.md

334-334: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: mypy
🔇 Additional comments (4)
config/sample_scripts/setup.sh (1)

1-10: Good baseline for the sample setup script

The strict shell options and clear logging make this a solid template. No changes needed.

config/sample_scripts/verify.sh (1)

1-13: Sample verify script looks solid

The success/failure messaging tied to exit codes is exactly what downstream tooling needs. All good here.

src/lightspeed_evaluation/core/models/data.py (1)

76-172: Model field additions are correct

The new setup_script, cleanup_script, and verify_script fields integrate cleanly with existing validation patterns and give downstream components the data they need. Looks good.

README.md (1)

330-332: Quote the script:action_eval example to avoid YAML ambiguity

Line 331 leaves script:action_eval unquoted. YAML treats a colon with no trailing space as part of a plain scalar, but some parsers, linters, and editors still read it as a script: action_eval mapping, and adding a space later silently changes the data shape. Please quote the string in the example so readers can copy/paste it without surprises.

       turn_metrics:
-        - script:action_eval
+        - "script:action_eval"

(symfony.com)
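
A quick check with PyYAML (used here purely for illustration; the project's own loader may differ) shows where the ambiguity lives and why quoting settles it:

```python
import yaml  # PyYAML, used only to demonstrate how the metric name is parsed

unquoted = "turn_metrics:\n  - script:action_eval\n"    # no space after the colon
spaced = "turn_metrics:\n  - script: action_eval\n"     # space after the colon
quoted = 'turn_metrics:\n  - "script:action_eval"\n'    # explicitly a string

print(yaml.safe_load(unquoted))  # {'turn_metrics': ['script:action_eval']} -> plain scalar
print(yaml.safe_load(spaced))    # {'turn_metrics': [{'script': 'action_eval'}]} -> a mapping
print(yaml.safe_load(quoted))    # {'turn_metrics': ['script:action_eval']} -> unambiguous string
```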

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aad0c7f and 675af1c.

📒 Files selected for processing (2)
  • src/lightspeed_evaluation/core/metrics/script_eval.py (1 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lightspeed_evaluation/core/metrics/script_eval.py
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:274-297
Timestamp: 2025-07-16T10:41:09.399Z
Learning: In the lsc_agent_eval package, the team prefers to focus on core functionality testing first and considers testing cleanup script execution after setup failure as early optimization, noting that there's no guarantee cleanup scripts will run successfully anyway.
🧬 Code graph analysis (1)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (3)
src/lightspeed_evaluation/core/metrics/script_eval.py (1)
  • ScriptEvalMetrics (13-52)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationRequest (241-294)
  • EvaluationScope (227-238)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-117)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: mypy
  • GitHub Check: tests (3.12)

@asamal4
Collaborator Author

asamal4 commented Sep 29, 2025

@tisnik @VladimirKadlec PTAL

Contributor

@VladimirKadlec left a comment

LGTM
Do we have a Jira story for this?

@asamal4
Collaborator Author

asamal4 commented Sep 29, 2025

Do we have a Jira story for this?

@VladimirKadlec yes, LCORE-692

Contributor

@tisnik left a comment

LGTM

@tisnik tisnik merged commit 5a2ed1c into lightspeed-core:main Sep 29, 2025
15 checks passed
@coderabbitai bot mentioned this pull request Oct 6, 2025