Env setup/cleanup ability and verify through script #62

tisnik merged 8 commits into lightspeed-core:main
Conversation
Walkthrough

Adds script-based evaluation support: new ScriptExecutionManager and ScriptExecutionError; ScriptEvalMetrics; model fields for setup/cleanup/verify scripts; validator script-path checks (YAML-dir resolution, executability); pipeline/evaluator/processor wiring to run setup scripts, per-turn verify scripts, and cleanup; README and sample scripts added.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
actor User
participant Pipeline
participant Validator
participant Processor
participant ScriptMgr as ScriptExecutionManager
participant Metrics as MetricsEvaluator
User->>Pipeline: start(evaluation_data)
Pipeline->>Validator: validate(data) %% includes script path checks
Pipeline->>Processor: process(conversation)
Note over Processor: Setup phase (if provided)
Processor->>ScriptMgr: run_script(setup_script)?
alt Setup fails
ScriptMgr-->>Processor: error / raises ScriptExecutionError
Processor-->>Pipeline: return ERROR (short-circuit)
else Setup OK or absent
ScriptMgr-->>Processor: success/skip
loop For each turn
Processor->>Metrics: evaluate(turn_metrics)
alt metric == script:action_eval
Metrics->>ScriptMgr: run_script(verify_script)
ScriptMgr-->>Metrics: pass / fail
end
end
Processor->>Metrics: evaluate(conversation_metrics)
Processor-->>Pipeline: return RESULTS
end
Note over Processor: Cleanup (always attempt)
Processor->>ScriptMgr: run_script(cleanup_script)?
ScriptMgr-->>Processor: success / warn
```

```mermaid
sequenceDiagram
autonumber
participant ScriptEval as ScriptEvalMetrics
participant ScriptMgr as ScriptExecutionManager
participant OS as Subprocess
ScriptEval->>ScriptEval: validate scope & api_enabled
alt API disabled
ScriptEval-->>Caller: (None, "API disabled")
else no verify_script
ScriptEval-->>Caller: (None, "no verify script")
else run verify script
ScriptEval->>ScriptMgr: run_script(path)
ScriptMgr->>OS: subprocess.run(script, env, timeout)
alt exit code 0
OS-->>ScriptMgr: success (code 0)
ScriptMgr-->>ScriptEval: True
ScriptEval-->>Caller: (1.0, "passed")
else non-zero / error
OS-->>ScriptMgr: failure / stderr
ScriptMgr-->>ScriptEval: False or raises ScriptExecutionError
ScriptEval-->>Caller: (0.0, "failed" / error message)
end
end
```
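To make the flow in the diagrams concrete, below is a minimal, hypothetical Python sketch of a subprocess-based script runner: setup first, a per-turn verify script mapped to a 1.0/0.0 score, and a cleanup that is always attempted. The class and the inline usage are illustrative assumptions that mirror the walkthrough (run_script, ScriptExecutionError, the config/sample_scripts paths); they are not the PR's actual ScriptExecutionManager implementation.

```python
# Illustrative sketch only; the PR's actual ScriptExecutionManager may differ in API and behavior.
import logging
import subprocess
from pathlib import Path

logger = logging.getLogger(__name__)


class ScriptExecutionError(Exception):
    """Raised when a setup/cleanup/verify script cannot be executed at all."""


class ScriptRunner:
    """Minimal runner: executes a script without a shell and reports pass/fail."""

    def __init__(self, timeout: int = 300):
        self.timeout = timeout

    def run_script(self, script_path: str) -> bool:
        path = Path(script_path).resolve()
        if not path.is_file():
            raise ScriptExecutionError(f"Script not found: {path}")
        try:
            result = subprocess.run(
                [str(path)],  # argv list, no shell=True
                capture_output=True,
                text=True,
                timeout=self.timeout,
                check=False,
            )
        except subprocess.TimeoutExpired as exc:
            raise ScriptExecutionError(f"Script timed out after {self.timeout}s: {path}") from exc
        if result.returncode != 0:
            logger.error("Script failed (rc=%s): %s", result.returncode, result.stderr.strip())
        return result.returncode == 0


# Usage mirroring the diagrams: setup, per-turn verify scored 1.0/0.0, cleanup always attempted.
runner = ScriptRunner()
setup_ok = runner.run_script("config/sample_scripts/setup.sh")
try:
    if setup_ok:
        score = 1.0 if runner.run_script("config/sample_scripts/verify.sh") else 0.0
        print(f"script:action_eval score: {score}")
finally:
    # Cleanup is attempted even when setup or verification failed.
    runner.run_script("config/sample_scripts/cleanup.sh")
```

Running scripts as an argv list without shell=True matches the Bandit discussion later in this review.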
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
120-122: Fix potential AttributeError when turn_data is None (conversation-level).

Operator precedence causes request.turn_data.response to be accessed even when turn_data is None.
Apply this diff in both locations:
```diff
-            response=request.turn_data.response or "" if request.turn_data else "",
+            response=(request.turn_data.response or "") if request.turn_data else "",
```

And similarly in _create_error_result:

```diff
-            response=request.turn_data.response or "" if request.turn_data else "",
+            response=(request.turn_data.response or "") if request.turn_data else "",
```

Also applies to: 144-146
config/evaluation_data.yaml (1)
57-69: Ensure referenced scripts exist and are executable
The following files are referenced in config/evaluation_data.yaml but are missing:
- scripts/verify_infrastructure.sh
- scripts/setup_test_env.sh
- scripts/cleanup_test_env.sh
Add each script (with a valid `#!` shebang) and mark them executable, or correct their paths.
🧹 Nitpick comments (10)
config/system.yaml (1)
76-79: Consider setting an explicit threshold for binary script metrics.

Without an explicit threshold, MetricsEvaluator defaults to 0.5; for a binary 0/1 verify script, it's clearer to require success (1.0).
Apply this diff:
- "script:action_eval": - description: "Script-based evaluation for infrastructure/environment validation" + "script:action_eval": + threshold: 1.0 + description: "Script-based evaluation for infrastructure/environment validation"src/lightspeed_evaluation/core/system/exceptions.py (1)
28-41: Add a __str__ override for richer error context.

Include script_path and return_code in the string form to improve logs and surfaced messages.
```diff
 class ScriptExecutionError(EvaluationError):
     """Exception raised when script execution fails."""

     def __init__(self, message: str, script_path: str = "", return_code: int = -1):
         """Initialize script execution error.
@@
         super().__init__(message)
         self.script_path = script_path
         self.return_code = return_code
+
+    def __str__(self) -> str:
+        base = super().__str__()
+        extras = []
+        if self.script_path:
+            extras.append(f"script={self.script_path}")
+        if self.return_code != -1:
+            extras.append(f"return_code={self.return_code}")
+        return f"{base} ({', '.join(extras)})" if extras else base
```

README.md (1)
295-301: Tighten wording for the API-enabled scripts note.

Minor grammar fix.
```diff
-The framework supports script-based evaluations.
-**Note: Scripts only execute when API is enabled** - they're designed to test the actual with actual environment changes.
+The framework supports script-based evaluations.
+**Note: Scripts only execute when API is enabled** — they're designed to test with actual environment changes.
```

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
30-31: Update the constructor docstring for accuracy.

Now initializes multiple managers and handlers, not just LLM.
- """Initialize with LLM manager.""" + """Initialize evaluator with system config, metric manager, and script manager."""src/lightspeed_evaluation/core/models/data.py (2)
76-80: Normalize/validate script paths at the model layer (optional).

Consider a field_validator to strip whitespace and coerce to Path for verify_script to reduce downstream checks.
164-173: Mirror normalization for setup/cleanup scripts.

Same recommendation for setup_script and cleanup_script for consistency and early feedback.
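A sketch of what the suggested model-layer normalization could look like, assuming Pydantic v2 field validators; the real TurnData/EvaluationData models have many more fields, and the validator name below is hypothetical.

```python
# Hypothetical sketch of the suggested normalization; the repo's actual models
# may normalize script paths elsewhere (e.g., in the DataValidator).
from pathlib import Path
from typing import Optional, Union

from pydantic import BaseModel, field_validator


class TurnData(BaseModel):
    verify_script: Optional[Union[str, Path]] = None

    @field_validator("verify_script", mode="before")
    @classmethod
    def _normalize_script(cls, value):
        # Strip whitespace and coerce strings to Path so downstream code sees one type.
        if value is None:
            return None
        if isinstance(value, str):
            value = value.strip()
            if not value:
                return None
            return Path(value)
        return value


print(TurnData(verify_script="  scripts/verify.sh  ").verify_script)  # scripts/verify.sh
```

The same validator could be applied to setup_script and cleanup_script at the conversation level.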
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (1)
61-67: Make script timeout configurable.

Expose a script execution timeout in SystemConfig and pass it here to avoid hardcoding 300s.
Example (once config is added):
```diff
-        script_manager = ScriptExecutionManager()
+        script_manager = ScriptExecutionManager(timeout=config.script.timeout)
```
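For reference, a hedged sketch of the kind of SystemConfig addition the example above assumes; the section and field names (script, timeout) are hypothetical and not part of the current config model.

```python
# Hypothetical config section for the suggested timeout knob; "script" and "timeout"
# are assumed names, not existing SystemConfig attributes.
from pydantic import BaseModel, Field


class ScriptConfig(BaseModel):
    # Per-script execution timeout in seconds; 300 matches the currently hardcoded value.
    timeout: int = Field(default=300, ge=1)


class SystemConfigSketch(BaseModel):
    script: ScriptConfig = Field(default_factory=ScriptConfig)


config = SystemConfigSketch()
# script_manager = ScriptExecutionManager(timeout=config.script.timeout)
print(config.script.timeout)  # 300
```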
src/lightspeed_evaluation/core/system/validator.py (1)

288-295: Avoid chmod in validator; fail fast instead

Silently mutating file permissions in a validator introduces side effects and potential security posture issues (e.g., chmod on shared files). Prefer failing with a clear error and let the caller fix permissions (or perform opt-in chmod at execution time).
Apply this diff:
```diff
-        # Check if script is executable or can be made executable
-        if not os.access(script_file, os.X_OK):
-            try:
-                script_file.chmod(0o755)
-            except (OSError, PermissionError) as exc:
-                raise DataValidationError(
-                    f"Conversation {context}: {script_type} script is not executable: {script_file}"
-                ) from exc
+        # Check if script is executable
+        if not os.access(script_file, os.X_OK):
+            raise DataValidationError(
+                f"Conversation {context}: {script_type} script is not executable: {script_file}"
+            )
```

src/lightspeed_evaluation/core/script/manager.py (2)
62-69: Run scripts with cwd set to the script’s directory

Many scripts assume relative paths to neighbors; setting cwd prevents surprises and improves isolation.
Apply this diff:
```diff
-            result = subprocess.run(
-                [str(script_path)],
-                text=True,
-                capture_output=True,
-                env=env,
-                timeout=self.timeout,
-                check=False,
-            )
+            result = subprocess.run(
+                [str(script_path)],
+                text=True,
+                capture_output=True,
+                env=env,
+                cwd=str(script_path.parent),
+                timeout=self.timeout,
+                check=False,
+            )
```
71-79: Truncate stdout/stderr in logs to avoid log flooding

Large outputs can overwhelm logs. Truncate at a sensible size while keeping full capture in memory if needed later.
Apply this diff:
```diff
-        # Log output
-        if result.stdout:
-            logger.debug("Script stdout: %s", result.stdout.strip())
-        if result.stderr:
-            if result.returncode != 0:
-                logger.error("Script stderr: %s", result.stderr.strip())
-            else:
-                logger.debug("Script stderr: %s", result.stderr.strip())
+        # Log output (truncate to avoid flooding)
+        def _truncate(s: str, limit: int = 8192) -> str:
+            s = s.strip()
+            return (s[:limit] + "... (truncated)") if len(s) > limit else s
+
+        if result.stdout:
+            logger.debug("Script stdout: %s", _truncate(result.stdout))
+        if result.stderr:
+            msg = _truncate(result.stderr)
+            if result.returncode != 0:
+                logger.error("Script stderr: %s", msg)
+            else:
+                logger.debug("Script stderr: %s", msg)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- README.md (2 hunks)
- config/evaluation_data.yaml (1 hunks)
- config/system.yaml (1 hunks)
- src/lightspeed_evaluation/__init__.py (2 hunks)
- src/lightspeed_evaluation/core/metrics/script_eval.py (1 hunks)
- src/lightspeed_evaluation/core/models/data.py (3 hunks)
- src/lightspeed_evaluation/core/script/__init__.py (1 hunks)
- src/lightspeed_evaluation/core/script/manager.py (1 hunks)
- src/lightspeed_evaluation/core/system/exceptions.py (1 hunks)
- src/lightspeed_evaluation/core/system/validator.py (5 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (2 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/processor.py (6 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.
Applied to files:
config/evaluation_data.yaml
📚 Learning: 2025-09-09T08:08:40.654Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/core/api/__init__.py:3-3
Timestamp: 2025-09-09T08:08:40.654Z
Learning: API_KEY and OPENAI_API_KEY are distinct environment variables with different purposes in the lightspeed-evaluation codebase - they should not be treated as interchangeable fallbacks.
Applied to files:
README.md
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Applied to files:
config/system.yaml
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/core/system/validator.py
🧬 Code graph analysis (8)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
- src/lightspeed_evaluation/core/metrics/script_eval.py (1): ScriptEvalMetrics (13-60)
- src/lightspeed_evaluation/core/script/manager.py (1): ScriptExecutionManager (14-101)

src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3)
- src/lightspeed_evaluation/core/script/manager.py (1): ScriptExecutionManager (14-101)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1): MetricsEvaluator (21-156)
- src/lightspeed_evaluation/pipeline/evaluation/processor.py (1): ProcessorComponents (19-26)

src/lightspeed_evaluation/core/metrics/script_eval.py (2)
- src/lightspeed_evaluation/core/models/data.py (1): EvaluationScope (227-238)
- src/lightspeed_evaluation/core/script/manager.py (1): ScriptExecutionManager (14-101)

src/lightspeed_evaluation/__init__.py (2)
- src/lightspeed_evaluation/core/script/manager.py (1): ScriptExecutionManager (14-101)
- src/lightspeed_evaluation/core/system/exceptions.py (3): APIError (8-9), DataValidationError (16-17), EvaluationError (4-5)

src/lightspeed_evaluation/pipeline/evaluation/processor.py (5)
- src/lightspeed_evaluation/core/metrics/manager.py (3): MetricLevel (10-14), resolve_metrics (24-48), count_metrics_for_conversation (115-136)
- src/lightspeed_evaluation/core/models/data.py (3): EvaluationData (135-182), EvaluationResult (185-224), TurnData (35-132)
- src/lightspeed_evaluation/core/script/manager.py (1): ScriptExecutionManager (14-101)
- src/lightspeed_evaluation/pipeline/evaluation/errors.py (1): mark_all_metrics_as_error (17-77)
- src/lightspeed_evaluation/pipeline/evaluation/amender.py (1): amend_conversation_data (20-69)

src/lightspeed_evaluation/core/script/__init__.py (2)
- src/lightspeed_evaluation/core/system/exceptions.py (1): ScriptExecutionError (28-41)
- src/lightspeed_evaluation/core/script/manager.py (1): ScriptExecutionManager (14-101)

src/lightspeed_evaluation/core/script/manager.py (1)
- src/lightspeed_evaluation/core/system/exceptions.py (1): ScriptExecutionError (28-41)

src/lightspeed_evaluation/core/system/validator.py (2)
- src/lightspeed_evaluation/core/models/data.py (1): EvaluationData (135-182)
- src/lightspeed_evaluation/core/system/exceptions.py (1): DataValidationError (16-17)
🪛 GitHub Check: Bandit
src/lightspeed_evaluation/core/script/manager.py
[notice] 5-5:
Consider possible security implications associated with the subprocess module.
[notice] 62-63:
subprocess call - check for execution of untrusted input.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: mypy
- GitHub Check: tests (3.13)
🔇 Additional comments (16)
src/lightspeed_evaluation/core/script/__init__.py (1)
1-9: Public API exposure looks good.

Clean exports for ScriptExecutionManager and ScriptExecutionError.
src/lightspeed_evaluation/__init__.py (1)
27-36: Expose script execution types at top-level — LGTM.

Re-exporting ScriptExecutionManager and ScriptExecutionError is consistent and convenient.
Also applies to: 58-60
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
44-47: Script metric wiring — LGTM.

Properly instantiates ScriptEvalMetrics with API-enabled flag and routes under "script".
Also applies to: 53-54
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (2)
64-67: Pipeline wiring — LGTM.

Creating and passing ScriptExecutionManager into MetricsEvaluator and ProcessorComponents is clean.
69-76: Setup and cleanup scripts correctly gated by api.enabled.

src/lightspeed_evaluation/core/system/validator.py (3)

51-54: Registering script metric requirements looks good

The script:action_eval requirement for verify_script is correctly captured and aligns with the new metric.

123-126: Correctly gating script validation by API flag

Running script path validation only when API is enabled matches the intended behavior and avoids unnecessary failures offline.

202-205: Good: skipping script metric checks when API is disabled

This prevents false positives in non-API runs.
src/lightspeed_evaluation/core/script/manager.py (1)
5-5: Bandit: subprocess usage is acceptable here (no shell), ensure sources are trusted

You correctly avoid shell=True. Inputs are validated earlier (existence, file type). Keep script paths restricted to validated config.
Please confirm that all script paths originate from validated YAML (and not any runtime/user input).
src/lightspeed_evaluation/core/metrics/script_eval.py (2)
35-38: Metric routing is consistent

Handling "action_eval" under the script namespace matches the registry pattern.

47-51: Graceful no-op when API disabled

Skipping execution when API is off aligns with the validator’s behavior.
src/lightspeed_evaluation/pipeline/evaluation/processor.py (4)
22-27: Components container extension looks good

Adding script_manager keeps wiring centralized and clean.

80-101: Happy path + API error short-circuit logic reads well

Short-circuiting to ERROR on API amendment failure, with cleanup in finally, is solid.

164-187: Setup runner is robust

Good logging and error propagation via string reason; integrates with manager exceptions cleanly.

188-215: Cleanup runner behavior is appropriate

Non-blocking on failure and always-attempt semantics are good.
config/evaluation_data.yaml (1)
3-29: YAML structure aligns with model and validator

Fields and nulls (verify_script/setup_script/cleanup_script) match the new schema.
b838b10 to 572d145
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
119-121: Fix potential AttributeError when turn_data is None (bad conditional expression precedence).
Current expression can evaluate request.turn_data.response even when turn_data is None. Apply:

```diff
-            query=request.turn_data.query if request.turn_data else "",
-            response=request.turn_data.response or "" if request.turn_data else "",
+            query=request.turn_data.query if request.turn_data else "",
+            response=(request.turn_data.response or "") if request.turn_data else "",
```

And similarly in _create_error_result:

```diff
-            query=request.turn_data.query if request.turn_data else "",
-            response=request.turn_data.response or "" if request.turn_data else "",
+            query=request.turn_data.query if request.turn_data else "",
+            response=(request.turn_data.response or "") if request.turn_data else "",
```

Also applies to: 143-145
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (1)
128-136: Propagate original_data_path to DataValidator and validate scripts for programmatic inputs.
Currently _validate_scripts runs only in load_evaluation_data; the pipeline path using validate_evaluation_data skips script validation, risking wrong CWD resolution and missing exec checks.

Apply:

```diff
         self.original_data_path = original_data_path
         logger.info("Starting evaluation")
         results: list[EvaluationResult] = []

         # Step 1: Validate data
         logger.info("Validating data")
+        # Ensure validator can resolve relative script paths against the YAML file dir
+        self.data_validator.original_data_path = self.original_data_path
         if not self.validate_data(evaluation_data):
             raise ValueError("Data validation failed. Cannot proceed with evaluation.")
+        # Also validate/normalize scripts when API is enabled and data was provided programmatically
+        config = self.config_loader.system_config
+        if config and config.api.enabled:
+            # Internal method is acceptable here until a public hook is exposed
+            self.data_validator._validate_scripts(evaluation_data)  # pylint: disable=protected-access
```

Alternatively, expose a public validate_scripts() on DataValidator and call it here.
🧹 Nitpick comments (6)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
73-79: Optional: guard split on malformed metric identifiers.
Avoid ValueError on missing ":".

```diff
-        framework, metric_name = request.metric_identifier.split(":", 1)
+        if ":" not in request.metric_identifier:
+            execution_time = time.time() - start_time
+            return self._create_error_result(
+                request, f"Invalid metric identifier: {request.metric_identifier}", execution_time
+            )
+        framework, metric_name = request.metric_identifier.split(":", 1)
```

src/lightspeed_evaluation/core/system/validator.py (3)
123-126: Scripts validated only in load_evaluation_data; also invoke during validate_evaluation_data.
To cover programmatic inputs, mirror this check in validate_evaluation_data when api_enabled.

You can add at the end of validate_evaluation_data:

```diff
         if self.validation_errors:
             ...
         validation_msg = "✅ All data validation passed"
         if self.api_enabled:
             validation_msg += " (API mode - data will be enhanced via API)"
         print(validation_msg)
+        # Normalize/validate script paths in API mode even when data is pre-loaded
+        if self.api_enabled:
+            self._validate_scripts(evaluation_data)
         return True
```

Based on learnings
202-205: Use the method parameter api_enabled consistently.
Currently mixes self.api_enabled and api_enabled; prefer the parameter for clarity.

```diff
-            if metric.startswith("script:") and not self.api_enabled:
+            if metric.startswith("script:") and not api_enabled:
                 continue
```
288-294: Be cautious about chmod in a validator.
Changing file modes may be undesirable in read-only or shared repos; consider warning instead of mutating, or make this opt-in.

src/lightspeed_evaluation/pipeline/evaluation/processor.py (2)
84-87: Prefer fail-safe handling over raising when config is missing

Raising here will bubble an exception and skip error classification. Either validate system_config in __init__ (fail-fast) or mark all metrics as ERROR here and return.

Apply this diff to convert to ERROR results (Option B):

```diff
-        if self.config is None:
-            raise ValueError("SystemConfig must be loaded")
+        if self.config is None:
+            logger.error("SystemConfig not loaded - marking all metrics as ERROR")
+            error_results = self.components.error_handler.mark_all_metrics_as_error(
+                conv_data,
+                "SystemConfig not loaded",
+                resolved_turn_metrics=resolved_turn_metrics,
+                resolved_conversation_metrics=resolved_conversation_metrics,
+            )
+            return error_results
```
172-176: Don’t couple script execution to API.enabled

Gating setup/cleanup scripts on api.enabled is surprising; script eval may be useful even without API calls. Suggest always running scripts when provided (or introduce a dedicated scripts.enabled flag in config).

Apply this diff to remove the API gate:

```diff
-        # Skip script execution if API is disabled
-        if self.config and not self.config.api.enabled:
-            logger.debug("Skipping setup script (API disabled): %s", setup_script)
-            return None
```

```diff
-        # Skip script execution if API is disabled
-        if self.config and not self.config.api.enabled:
-            logger.debug("Skipping cleanup script (API disabled): %s", cleanup_script)
-            return
```

If the intent was to avoid running placeholder scripts in sample data, consider a dedicated flag instead. To verify intent, please confirm expected behavior when api.enabled=false.

Also applies to: 199-203
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- README.md (2 hunks)
- config/evaluation_data.yaml (1 hunks)
- config/system.yaml (1 hunks)
- src/lightspeed_evaluation/__init__.py (2 hunks)
- src/lightspeed_evaluation/core/metrics/script_eval.py (1 hunks)
- src/lightspeed_evaluation/core/models/data.py (3 hunks)
- src/lightspeed_evaluation/core/system/validator.py (5 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (2 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/processor.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- config/system.yaml
- README.md
- src/lightspeed_evaluation/__init__.py
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.
Applied to files:
config/evaluation_data.yaml
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/core/system/validator.py
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.
Applied to files:
src/lightspeed_evaluation/core/system/validator.py
📚 Learning: 2025-07-16T10:41:09.399Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:274-297
Timestamp: 2025-07-16T10:41:09.399Z
Learning: In the lsc_agent_eval package, the team prefers to focus on core functionality testing first and considers testing cleanup script execution after setup failure as early optimization, noting that there's no guarantee cleanup scripts will run successfully anyway.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/processor.py
🧬 Code graph analysis (5)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3)
- src/lightspeed_evaluation/core/script/manager.py (1): ScriptExecutionManager (14-101)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1): MetricsEvaluator (21-156)
- src/lightspeed_evaluation/pipeline/evaluation/processor.py (1): ProcessorComponents (19-26)

src/lightspeed_evaluation/core/metrics/script_eval.py (2)
- src/lightspeed_evaluation/core/models/data.py (1): EvaluationScope (227-238)
- src/lightspeed_evaluation/core/script/manager.py (1): ScriptExecutionManager (14-101)

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
- src/lightspeed_evaluation/core/metrics/script_eval.py (1): ScriptEvalMetrics (13-60)
- src/lightspeed_evaluation/core/script/manager.py (1): ScriptExecutionManager (14-101)

src/lightspeed_evaluation/core/system/validator.py (2)
- src/lightspeed_evaluation/core/models/data.py (1): EvaluationData (135-182)
- src/lightspeed_evaluation/core/system/exceptions.py (1): DataValidationError (16-17)

src/lightspeed_evaluation/pipeline/evaluation/processor.py (6)
- src/lightspeed_evaluation/core/metrics/manager.py (4): MetricLevel (10-14), MetricManager (17-136), resolve_metrics (24-48), count_metrics_for_conversation (115-136)
- src/lightspeed_evaluation/core/models/data.py (4): EvaluationData (135-182), EvaluationRequest (241-294), EvaluationResult (185-224), TurnData (35-132)
- src/lightspeed_evaluation/core/script/manager.py (1): ScriptExecutionManager (14-101)
- src/lightspeed_evaluation/pipeline/evaluation/errors.py (1): mark_all_metrics_as_error (17-77)
- src/lightspeed_evaluation/pipeline/evaluation/amender.py (1): amend_conversation_data (20-69)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1): evaluate_metric (58-129)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: mypy
🔇 Additional comments (21)
src/lightspeed_evaluation/core/models/data.py (3)
4-5: LGTM on imports and typing.
Path/Union imports align with new fields.
76-80: Turn-level verify_script field looks good.
Optional Path/str is appropriate; validator normalizes.
164-173: Conversation-level setup/cleanup scripts added correctly.
Matches downstream ScriptExecutionManager usage.

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
13-16: Correctly wires ScriptEvalMetrics and ScriptExecutionManager.
Keeps handler map cohesive.
28-31: All MetricsEvaluator instantiations updated
Found only one call in src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (line 65) using three arguments; no legacy two-arg usages remain.

src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3)
10-10: Import of ScriptExecutionManager is correct.
61-67: Good: centralized ScriptExecutionManager and passed into evaluator.
75-76: Passing script_manager through ProcessorComponents is appropriate.

src/lightspeed_evaluation/core/system/validator.py (3)
3-6: LGTM on imports and API gating setup.
51-55: Requirement mapping for script:action_eval is correct.
Ensures verify_script presence only when applicable.
242-297: Resolve relative script paths against the YAML file directory, not CWD.
As-is, Path.resolve() will use process CWD; breaks when running from other dirs.

Apply:

```diff
         if isinstance(script_file, str):
             script_file = Path(script_file)
-        script_file = script_file.resolve()
+        # Resolve relative paths against the YAML file directory (fallback to CWD)
+        if not script_file.is_absolute():
+            base_dir = (
+                Path(self.original_data_path).resolve().parent
+                if self.original_data_path
+                else Path.cwd()
+            )
+            script_file = (base_dir / script_file).resolve()
+        else:
+            script_file = script_file.resolve()
```

Based on learnings
src/lightspeed_evaluation/core/metrics/script_eval.py (3)
13-22: Constructor and API gating look good.
Aligns with system_config.api.enabled wiring.
23-39: Scope checks and metric routing are appropriate.
Turn-only restriction and unknown-metric messaging are clear.
40-61: Execution path and error handling are solid.
return (1.0/0.0/None) semantics fit existing evaluator contract.

src/lightspeed_evaluation/pipeline/evaluation/processor.py (4)

67-80: Good fix: attempt cleanup even when setup fails

Cleanup is now attempted before returning on setup failure. This addresses prior feedback.

93-103: Cleanup still runs on API error path

Returning inside try is fine; finally ensures cleanup runs. LGTM.
Also applies to: 130-133
148-151: Evaluator wiring looks correct

Turn-level evaluation now delegates through components.metrics_evaluator. LGTM.

218-221: Summary delegation is consistent

Delegating to metric_manager.count_metrics_for_conversation is appropriate. LGTM.

config/evaluation_data.yaml (3)

20-22: Metrics supported: ragas:faithfulness, ragas:response_relevancy, and ragas:context_precision_without_reference are registered in system config and handled by the RAGAS pipeline.
61-68: Add missing scripts or gate their execution
config/evaluation_data.yaml references verify_infrastructure.sh, setup_test_env.sh, and cleanup_test_env.sh, but none exist in the repo. Either include these scripts under scripts/ or wrap their invocation behind a feature flag.

⛔ Skipped due to learnings

Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/data/script/eval3/setup.sh:1-3
Timestamp: 2025-07-16T09:42:00.691Z
Learning: Scripts in the lsc_agent_eval/data directory are meant to be simple examples/samples for teams to customize according to their needs, not production-ready code.

45-50: Confirmed deepeval conversation metrics registration and handler implementation

Both deepeval:conversation_completeness and deepeval:conversation_relevancy are defined under conversation_level in config/system.yaml, and DeepEvalMetrics is implemented and wired under the “deepeval” key in src/lightspeed_evaluation/pipeline/evaluation/evaluator.py.
572d145 to fd97031
Actionable comments posted: 3
🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/system/validator.py (1)
271-279: Handle user home shortcuts in script paths
Path("~…")stays literal right now, so a config that points to~/scripts/setup.shends up looking for<yaml_dir>/~/scripts/setup.sh, which is almost certainly wrong. Expanding the home shortcut before the absolute/relative logic preserves existing behavior but also makes those configs work.if isinstance(script_file, str): script_file = Path(script_file) + # Normalize home directory shortcuts before resolving + script_file = script_file.expanduser() + # Resolve relative paths against the YAML file directory, not CWD if not script_file.is_absolute() and self.original_data_path: yaml_dir = Path(self.original_data_path).parent script_file = (yaml_dir / script_file).resolve() else:
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
- README.md (7 hunks)
- config/evaluation_data.yaml (1 hunks)
- config/sample_scripts/cleanup.sh (1 hunks)
- config/sample_scripts/setup.sh (1 hunks)
- config/sample_scripts/verify.sh (1 hunks)
- config/system.yaml (1 hunks)
- src/lightspeed_evaluation/__init__.py (2 hunks)
- src/lightspeed_evaluation/core/metrics/script_eval.py (1 hunks)
- src/lightspeed_evaluation/core/models/data.py (3 hunks)
- src/lightspeed_evaluation/core/script/__init__.py (1 hunks)
- src/lightspeed_evaluation/core/script/manager.py (1 hunks)
- src/lightspeed_evaluation/core/system/exceptions.py (1 hunks)
- src/lightspeed_evaluation/core/system/validator.py (5 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (2 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/processor.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/core/script/__init__.py
- src/lightspeed_evaluation/__init__.py
- src/lightspeed_evaluation/core/metrics/script_eval.py
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/core/system/validator.py
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.
Applied to files:
src/lightspeed_evaluation/core/system/validator.py
📚 Learning: 2025-07-16T10:41:09.399Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:274-297
Timestamp: 2025-07-16T10:41:09.399Z
Learning: In the lsc_agent_eval package, the team prefers to focus on core functionality testing first and considers testing cleanup script execution after setup failure as early optimization, noting that there's no guarantee cleanup scripts will run successfully anyway.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/processor.py
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Applied to files:
README.md
📚 Learning: 2025-09-09T08:08:40.654Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/core/api/__init__.py:3-3
Timestamp: 2025-09-09T08:08:40.654Z
Learning: API_KEY and OPENAI_API_KEY are distinct environment variables with different purposes in the lightspeed-evaluation codebase - they should not be treated as interchangeable fallbacks.
Applied to files:
README.md
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Applied to files:
config/system.yaml
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.
Applied to files:
config/evaluation_data.yaml
🧬 Code graph analysis (4)
src/lightspeed_evaluation/core/system/validator.py (2)
- src/lightspeed_evaluation/core/models/data.py (1): EvaluationData (135-182)
- src/lightspeed_evaluation/core/system/exceptions.py (1): DataValidationError (16-17)

src/lightspeed_evaluation/pipeline/evaluation/processor.py (5)
- src/lightspeed_evaluation/core/metrics/manager.py (4): MetricLevel (10-14), MetricManager (17-136), resolve_metrics (24-48), count_metrics_for_conversation (115-136)
- src/lightspeed_evaluation/core/models/data.py (3): EvaluationData (135-182), EvaluationResult (185-224), TurnData (35-132)
- src/lightspeed_evaluation/core/script/manager.py (1): ScriptExecutionManager (14-101)
- src/lightspeed_evaluation/pipeline/evaluation/errors.py (1): mark_all_metrics_as_error (17-77)
- src/lightspeed_evaluation/pipeline/evaluation/amender.py (1): amend_conversation_data (20-69)

src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3)
- src/lightspeed_evaluation/core/script/manager.py (1): ScriptExecutionManager (14-101)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1): MetricsEvaluator (21-156)
- src/lightspeed_evaluation/pipeline/evaluation/processor.py (1): ProcessorComponents (19-26)

src/lightspeed_evaluation/core/script/manager.py (1)
- src/lightspeed_evaluation/core/system/exceptions.py (1): ScriptExecutionError (28-41)
🪛 markdownlint-cli2 (0.18.1)
README.md
334-334: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: mypy
🔇 Additional comments (1)
config/evaluation_data.yaml (1)
95-105: Script sample aligns with the new handler.
The example wires setup, verify, and cleanup scripts together, which should help exercise the ScriptExecutionManager path end-to-end.
fd97031 to 2fbf01e
2fbf01e to b212474
Actionable comments posted: 0
🧹 Nitpick comments (1)
README.md (1)
334-341: Promote “Script Path Resolution” to a proper heading.

Markdownlint flags the bold text used as a faux heading; switching to an H4 keeps the hierarchy consistent and quiets the lint warning.
```diff
-**Script Path Resolution**
+#### Script Path Resolution
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- README.md (7 hunks)
- config/evaluation_data.yaml (1 hunks)
- config/sample_scripts/cleanup.sh (1 hunks)
- config/sample_scripts/setup.sh (1 hunks)
- config/sample_scripts/verify.sh (1 hunks)
- src/lightspeed_evaluation/core/script/manager.py (1 hunks)
- src/lightspeed_evaluation/core/system/validator.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- config/sample_scripts/cleanup.sh
- config/sample_scripts/verify.sh
- config/evaluation_data.yaml
- config/sample_scripts/setup.sh
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:274-297
Timestamp: 2025-07-16T10:41:09.399Z
Learning: In the lsc_agent_eval package, the team prefers to focus on core functionality testing first and considers testing cleanup script execution after setup failure as early optimization, noting that there's no guarantee cleanup scripts will run successfully anyway.
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/core/system/validator.py
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.
Applied to files:
src/lightspeed_evaluation/core/system/validator.py
src/lightspeed_evaluation/core/script/manager.py
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Applied to files:
README.md
📚 Learning: 2025-09-09T08:08:40.654Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/core/api/__init__.py:3-3
Timestamp: 2025-09-09T08:08:40.654Z
Learning: API_KEY and OPENAI_API_KEY are distinct environment variables with different purposes in the lightspeed-evaluation codebase - they should not be treated as interchangeable fallbacks.
Applied to files:
README.md
🧬 Code graph analysis (2)
src/lightspeed_evaluation/core/system/validator.py (2)
- src/lightspeed_evaluation/core/models/data.py (1): EvaluationData (135-182)
- src/lightspeed_evaluation/core/system/exceptions.py (1): DataValidationError (16-17)

src/lightspeed_evaluation/core/script/manager.py (1)
- src/lightspeed_evaluation/core/system/exceptions.py (1): ScriptExecutionError (28-41)
🪛 markdownlint-cli2 (0.18.1)
README.md
334-334: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
🪛 GitHub Check: Bandit
src/lightspeed_evaluation/core/script/manager.py
[notice] 62-63:
subprocess call - check for execution of untrusted input.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: mypy
🔇 Additional comments (1)
src/lightspeed_evaluation/core/script/manager.py (1)
62-69: Working directory fix keeps relative script assets working.

Setting cwd=script_path.parent closes the gap where scripts sourcing sibling helpers would fail when invoked from another root. Thanks for wiring this in.

Based on learnings
b212474 to 2e2a764
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- README.md (7 hunks)
- config/evaluation_data.yaml (1 hunks)
- config/sample_scripts/cleanup.sh (1 hunks)
- config/sample_scripts/setup.sh (1 hunks)
- config/sample_scripts/verify.sh (1 hunks)
- src/lightspeed_evaluation/core/metrics/script_eval.py (1 hunks)
- src/lightspeed_evaluation/core/models/data.py (3 hunks)
- src/lightspeed_evaluation/core/script/manager.py (1 hunks)
- src/lightspeed_evaluation/core/system/validator.py (5 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/processor.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/core/metrics/script_eval.py
- config/sample_scripts/cleanup.sh
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:274-297
Timestamp: 2025-07-16T10:41:09.399Z
Learning: In the lsc_agent_eval package, the team prefers to focus on core functionality testing first and considers testing cleanup script execution after setup failure as early optimization, noting that there's no guarantee cleanup scripts will run successfully anyway.
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/core/system/validator.py
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.
Applied to files:
src/lightspeed_evaluation/core/system/validator.py
src/lightspeed_evaluation/core/script/manager.py
📚 Learning: 2025-09-28T01:11:08.828Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#62
File: config/sample_scripts/setup.sh:7-8
Timestamp: 2025-09-28T01:11:08.828Z
Learning: Sample scripts in config/sample_scripts/ are templates that individual teams will customize for their specific environments and requirements, not production-ready scripts.
Applied to files:
config/sample_scripts/setup.sh
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.
Applied to files:
config/evaluation_data.yaml
📚 Learning: 2025-07-16T10:41:09.399Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:274-297
Timestamp: 2025-07-16T10:41:09.399Z
Learning: In the lsc_agent_eval package, the team prefers to focus on core functionality testing first and considers testing cleanup script execution after setup failure as early optimization, noting that there's no guarantee cleanup scripts will run successfully anyway.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/processor.py
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Applied to files:
README.md
📚 Learning: 2025-09-09T08:08:40.654Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/core/api/__init__.py:3-3
Timestamp: 2025-09-09T08:08:40.654Z
Learning: API_KEY and OPENAI_API_KEY are distinct environment variables with different purposes in the lightspeed-evaluation codebase - they should not be treated as interchangeable fallbacks.
Applied to files:
README.md
🧬 Code graph analysis (3)
src/lightspeed_evaluation/core/system/validator.py (2)
- src/lightspeed_evaluation/core/models/data.py (1): EvaluationData (135-182)
- src/lightspeed_evaluation/core/system/exceptions.py (1): DataValidationError (16-17)

src/lightspeed_evaluation/pipeline/evaluation/processor.py (6)
- src/lightspeed_evaluation/core/metrics/manager.py (4): MetricLevel (10-14), MetricManager (17-136), resolve_metrics (24-48), count_metrics_for_conversation (115-136)
- src/lightspeed_evaluation/core/models/data.py (4): EvaluationData (135-182), EvaluationRequest (241-294), EvaluationResult (185-224), TurnData (35-132)
- src/lightspeed_evaluation/core/script/manager.py (1): ScriptExecutionManager (14-109)
- src/lightspeed_evaluation/pipeline/evaluation/errors.py (1): mark_all_metrics_as_error (17-77)
- src/lightspeed_evaluation/pipeline/evaluation/amender.py (1): amend_conversation_data (20-69)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1): evaluate_metric (58-129)

src/lightspeed_evaluation/core/script/manager.py (1)
- src/lightspeed_evaluation/core/system/exceptions.py (1): ScriptExecutionError (28-41)
🪛 GitHub Check: Bandit
src/lightspeed_evaluation/core/script/manager.py
[notice] 69-70:
subprocess call - check for execution of untrusted input.
🪛 GitHub Actions: Python linter
src/lightspeed_evaluation/core/script/manager.py
[error] 28-28: Pylint: Too many branches (13/12).
🪛 markdownlint-cli2 (0.18.1)
README.md
334-334: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: mypy
🔇 Additional comments (4)
config/sample_scripts/setup.sh (1)
1-10: Good baseline for the sample setup script

The strict shell options and clear logging make this a solid template. No changes needed.
config/sample_scripts/verify.sh (1)
1-13: Sample verify script looks solid

The success/failure messaging tied to exit codes is exactly what downstream tooling needs. All good here.
src/lightspeed_evaluation/core/models/data.py (1)
76-172: Model field additions are correct

The new setup_script, cleanup_script, and verify_script fields integrate cleanly with existing validation patterns and give downstream components the data they need. Looks good.

README.md (1)

330-332: Quote the script:action_eval example to avoid YAML parse errors

Line 331 currently renders the list item as a mapping (script: action_eval) instead of the intended string, and several YAML parsers will either reject it or silently reshape the data. Please quote the string in the example so readers can copy/paste without breakage.

```diff
 turn_metrics:
-  - script:action_eval
+  - "script:action_eval"
```
2e2a764 to 926d703
aad0c7f to 675af1c
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/lightspeed_evaluation/core/metrics/script_eval.py (1 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lightspeed_evaluation/core/metrics/script_eval.py
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:274-297
Timestamp: 2025-07-16T10:41:09.399Z
Learning: In the lsc_agent_eval package, the team prefers to focus on core functionality testing first and considers testing cleanup script execution after setup failure as early optimization, noting that there's no guarantee cleanup scripts will run successfully anyway.
🧬 Code graph analysis (1)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (3)
- src/lightspeed_evaluation/core/metrics/script_eval.py (1): ScriptEvalMetrics (13-52)
- src/lightspeed_evaluation/core/models/data.py (2): EvaluationRequest (241-294), EvaluationScope (227-238)
- src/lightspeed_evaluation/core/script/manager.py (1): ScriptExecutionManager (14-117)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: mypy
- GitHub Check: tests (3.12)
@tisnik @VladimirKadlec PTAL
VladimirKadlec left a comment
LGTM
Do we have a Jira story for this?
Summary by CodeRabbit
New Features
Refactor
Documentation