fix rule for black & pydocstyle #45
Conversation
Walkthrough

Docstring and string-literal formatting updates across multiple modules, addition of Black and pydocstyle configuration in pyproject.toml, and a README code fence closure. No behavioral, API, or control-flow changes.

Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration: you can enable these sources in your CodeRabbit configuration.

📒 Files selected for processing (10)
✅ Files skipped from review due to trivial changes (3)
🚧 Files skipped from review as they are similar to previous changes (6)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
🔇 Additional comments (2)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
src/lightspeed_evaluation/core/metrics/ragas.py (1)
27-36: Fix D212: put summary on the first line of multi-line docstring. Current summary starts on Line 28. Move it to Line 27.
Apply:
- """ - Initialize with LLM Manager. + """Initialize with LLM Manager. Args: llm_manager: Pre-configured LLMManager with validated parameters """src/lightspeed_evaluation/core/metrics/custom.py (2)
29-34: Fix D212 in init docstring.Summary must start on the same line as opening quotes.
Apply:
- """ - Initialize with LLM Manager. + """Initialize with LLM Manager. Args: llm_manager: Pre-configured LLMManager with validated parameters """
86-94: Fix D212 in _parse_score_response docstring.Keep raw string, move summary to first line.
Apply:
- r""" - Parse LLM response to extract score and reason. + r"""Parse LLM response to extract score and reason. Expected formats: - "Score: 0.85\nReason: The answer is accurate..." - "8.5/10 - The response is comprehensive..." - "Rating: 4.2 out of 5" """src/lightspeed_evaluation/core/llm/manager.py (1)
14-21: Fix D212: move docstring summary to the first line
In src/lightspeed_evaluation/core/llm/manager.py (class LLMManager, lines 14–21), update the docstring to start the summary on the first line:

```diff
 class LLMManager:
-    """
-    Generic LLM Manager for all use cases (Ragas, DeepEval, Custom metrics).
-
-    Responsibilities:
-    - Environment validation for multiple providers
-    - Model name construction
-    - Provides LiteLLM parameters for consumption by framework-specific managers
-    """
+    """Generic LLM Manager for all use cases (Ragas, DeepEval, Custom metrics).
+
+    Responsibilities:
+    - Environment validation for multiple providers
+    - Model name construction
+    - Provides LiteLLM parameters for consumption by framework-specific managers
+    """
```

This will resolve the pydocstyle D212 warning.
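For reference, a minimal generic illustration of what D212 accepts and rejects (not project code):

```python
def compliant() -> None:
    """Summary shares the line with the opening quotes.

    Further detail can follow after a blank line; this form satisfies D212.
    """


def non_compliant() -> None:
    """
    Summary placed on the second line triggers D212 when that check is selected.
    """
```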
src/lightspeed_evaluation/core/llm/ragas.py (1)
79-84: Fix pydocstyle D212: start the summary on the first line of the docstring. Matches the repo's pydocstyle config.
Apply this diff:
```diff
 class RagasLLMManager:
-    """
-    Ragas LLM Manager - Takes LLM parameters directly.
-
-    This manager focuses solely on Ragas-specific LLM integration.
-    """
+    """Ragas LLM Manager - Takes LLM parameters directly.
+
+    This manager focuses solely on Ragas-specific LLM integration.
+    """
```

src/lightspeed_evaluation/core/config/loader.py (1)
50-58: Harden log-level parsing and avoid `print` for warnings

- If `source_level`/`package_level` is invalid, `getattr(logging, ...)` raises `AttributeError` before `basicConfig`, causing an early crash. Parse levels defensively and fall back to sane defaults.
- Use `logging.warning` (root logger) instead of `print`.

```diff
 def setup_logging(logging_config: Dict[str, Any]) -> logging.Logger:
     """Configure logging for application and packages."""
     # Get logging settings with new structure
-    source_level = getattr(logging, logging_config.get("source_level", "INFO").upper())
-    package_level = getattr(logging, logging_config.get("package_level", "WARNING").upper())
+    def _resolve_level(name: str, default: int) -> int:
+        try:
+            return getattr(logging, name.upper())
+        except AttributeError:
+            # Will be emitted after basicConfig below
+            return default
+
+    source_level = _resolve_level(logging_config.get("source_level", "INFO"), logging.INFO)
+    package_level = _resolve_level(logging_config.get("package_level", "WARNING"), logging.WARNING)
@@
-    except AttributeError:
-        print(f"Warning: Invalid log level '{override_level}' for package '{package_name}'")
+    except AttributeError:
+        logging.warning(
+            "Invalid log level '%s' for package '%s'", override_level, package_name
+        )
```

Additionally, after `basicConfig`, consider warning if the top-level `source_level`/`package_level` names were invalid and defaults were used. I can add that if you'd like.

Also applies to: 83-94
🧹 Nitpick comments (6)
src/lightspeed_evaluation/core/metrics/custom.py (1)
59-84: Minor: make error propagation more explicit if desired. Optional: preserving the original traceback while shortening the message is already handled via `from e`. No change required.

src/lightspeed_evaluation/core/llm/manager.py (1)

27-27: Prefer logging over print for library code. Use a module logger for consistency (you already do this in visualization.py).
Apply this diff:
- print(f"✅ LLM Manager: {self.config.provider}/{self.config.model} -> {self.model_name}") + import logging + logging.getLogger("lightspeed_evaluation.LLMManager").info( + "✅ LLM Manager: %s/%s -> %s", + self.config.provider, + self.config.model, + self.model_name, + )src/lightspeed_evaluation/core/llm/ragas.py (2)
20-27: Unify “unset temperature” handling with None instead of a sentinel.Using 1e-08 as a sentinel blocks users from intentionally setting that exact value. Prefer Optional[float]=None and fall back to litellm_params when None.
Apply this diff:
```diff
-    def generate_text(  # pylint: disable=too-many-arguments,too-many-positional-arguments
+    def generate_text(  # pylint: disable=too-many-arguments,too-many-positional-arguments
         self,
         prompt: Any,
         n: int = 1,
-        temperature: float = 1e-08,
+        temperature: Optional[float] = None,
         stop: Optional[List[str]] = None,
         callbacks: Optional[Any] = None,
     ) -> LLMResult:
@@
-        # Use temperature from params unless explicitly overridden
-        temp = temperature if temperature != 1e-08 else self.litellm_params.get("temperature", 0.0)
+        # Use temperature from params unless explicitly overridden
+        temp = self.litellm_params.get("temperature", 0.0) if temperature is None else temperature
@@
-    async def agenerate_text(  # pylint: disable=too-many-arguments,too-many-positional-arguments
+    async def agenerate_text(  # pylint: disable=too-many-arguments,too-many-positional-arguments
         self,
         prompt: Any,
         n: int = 1,
-        temperature: Optional[float] = None,
+        temperature: Optional[float] = None,
         stop: Optional[List[str]] = None,
         callbacks: Optional[Any] = None,
     ) -> LLMResult:
         """Async generate."""
-        temp = temperature if temperature is not None else 1e-08
-        return self.generate_text(prompt, n=n, temperature=temp, stop=stop, callbacks=callbacks)
+        return self.generate_text(prompt, n=n, temperature=temperature, stop=stop, callbacks=callbacks)
```

Also applies to: 31-33, 61-71, 71-72

18-19: Use logging instead of print for operational messages and errors. Keeps output controllable and consistent with GraphGenerator.
Apply this diff:
- print(f"✅ Ragas Custom LLM: {self.model_name}") + import logging + self._logger = logging.getLogger("lightspeed_evaluation.RagasCustomLLM") + self._logger.info("✅ Ragas Custom LLM: %s", self.model_name) @@ - print(f"❌ Ragas LLM failed: {e}") + self._logger.exception("❌ Ragas LLM failed: %s", e)Also applies to: 57-59
src/lightspeed_evaluation/core/output/visualization.py (1)
151-153: Swap axis labels for the pass-rate bar chart. Bars plot pass_rates on Y vs metrics on X; labels are currently reversed.
Apply this diff:
- ax.set_xlabel("Pass Rate (%)", fontsize=12) - ax.set_ylabel("Metrics", fontsize=12) + ax.set_xlabel("Metrics", fontsize=12) + ax.set_ylabel("Pass Rate (%)", fontsize=12)src/lightspeed_evaluation/drivers/evaluation.py (1)
222-234: Optional: unwrap nested ternary for readability.The emoji/status selection is compact but a bit dense. Consider a small refactor for future maintainability.
- status_emoji = ( - "✅" if result_status == "PASS" else "❌" if result_status == "FAIL" else "⚠️" - ) + if result_status == "PASS": + status_emoji = "✅" + elif result_status == "FAIL": + status_emoji = "❌" + else: + status_emoji = "⚠️"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (13)
- pyproject.toml (1 hunks)
- src/lightspeed_evaluation/__init__.py (1 hunks)
- src/lightspeed_evaluation/core/config/loader.py (6 hunks)
- src/lightspeed_evaluation/core/config/models.py (4 hunks)
- src/lightspeed_evaluation/core/llm/manager.py (3 hunks)
- src/lightspeed_evaluation/core/llm/ragas.py (2 hunks)
- src/lightspeed_evaluation/core/metrics/custom.py (2 hunks)
- src/lightspeed_evaluation/core/metrics/ragas.py (2 hunks)
- src/lightspeed_evaluation/core/output/generator.py (5 hunks)
- src/lightspeed_evaluation/core/output/statistics.py (1 hunks)
- src/lightspeed_evaluation/core/output/visualization.py (4 hunks)
- src/lightspeed_evaluation/drivers/evaluation.py (10 hunks)
- src/lightspeed_evaluation/runner/evaluation.py (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Applied to files:
src/lightspeed_evaluation/__init__.py
🧬 Code graph analysis (7)
src/lightspeed_evaluation/runner/evaluation.py (1)
src/lightspeed_evaluation/core/output/generator.py (1)
generate_reports(32-76)
src/lightspeed_evaluation/core/output/visualization.py (2)
src/lightspeed_evaluation/core/config/models.py (1)
EvaluationResult(126-162)
src/lightspeed_evaluation/core/output/statistics.py (1)
calculate_basic_stats(19-45)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/config/models.py (1)
EvaluationResult(126-162)
src/lightspeed_evaluation/core/metrics/ragas.py (3)
src/lightspeed_evaluation/core/config/models.py (1)
TurnData(8-44)
src/lightspeed_evaluation/drivers/evaluation.py (1)
_evaluate_metric(203-267)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
_evaluate_metric(51-60)
src/lightspeed_evaluation/core/config/loader.py (1)
src/lightspeed_evaluation/core/config/models.py (1)
validate_metrics(82-87)
src/lightspeed_evaluation/core/output/generator.py (1)
src/lightspeed_evaluation/core/config/models.py (1)
EvaluationResult(126-162)
src/lightspeed_evaluation/drivers/evaluation.py (2)
src/lightspeed_evaluation/runner/evaluation.py (1)
run_evaluation(15-86)
src/lightspeed_evaluation/core/config/models.py (3)
EvaluationData(47-123)
EvaluationResult(126-162)
TurnData(8-44)
🪛 GitHub Actions: Pydocstyle
src/lightspeed_evaluation/runner/evaluation.py
[warning] 18-18: pydocstyle D212: Multi-line docstring summary should start at the first line.
src/lightspeed_evaluation/core/llm/manager.py
[warning] 14-14: pydocstyle D212: Multi-line docstring summary should start at the first line.
src/lightspeed_evaluation/core/metrics/ragas.py
[warning] 27-27: pydocstyle D212: Multi-line docstring summary should start at the first line.
src/lightspeed_evaluation/core/metrics/custom.py
[warning] 29-29: pydocstyle D212: Multi-line docstring summary should start at the first line.
[warning] 86-86: pydocstyle D212: Multi-line docstring summary should start at the first line.
src/lightspeed_evaluation/core/llm/ragas.py
[warning] 79-79: pydocstyle D212: Multi-line docstring summary should start at the first line.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: mypy
🔇 Additional comments (33)
pyproject.toml (1)
51-53: Black config LGTM. Line length 100 is reasonable for this codebase.
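For context, a minimal sketch of the kind of pyproject.toml configuration under review; the line-length value comes from the comment above, while the pydocstyle convention is an assumption based on the Google-style and D212 remarks elsewhere in this review:

```toml
[tool.black]
line-length = 100

[tool.pydocstyle]
# "google" selects D212 (multi-line summary on the first line), among other checks
convention = "google"
```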
src/lightspeed_evaluation/core/metrics/ragas.py (2)
51-62: Helper refactor reads well. Single-line signature and list-comprehension are clear and unchanged semantically.

151-151: Formatting-only return change is fine. No behavioral change; good.
src/lightspeed_evaluation/core/metrics/custom.py (2)
20-22: Field reflow LGTM. Type hints and defaults unchanged.

38-39: Supported metrics dict reflow LGTM. No semantic change.
src/lightspeed_evaluation/__init__.py (1)
1-13: Nice: docstring now complies with D212. Matches Google style and the new pydocstyle config.
src/lightspeed_evaluation/core/output/statistics.py (1)
71-71: Signature reflow LGTM. Purely cosmetic; maintains type clarity.
src/lightspeed_evaluation/core/llm/manager.py (2)
53-53: LGTM — concise raise formatting.
65-65: LGTM — concise raise formatting.
src/lightspeed_evaluation/core/llm/ragas.py (2)
32-33: LGTM — simplified temperature selection is equivalent.
71-72: LGTM — straightforward delegation to sync path.
src/lightspeed_evaluation/core/output/generator.py (6)
54-55: LGTM — inlined JSON summary call; no behavioral change.
58-60: LGTM — inlined text summary call; no behavioral change.
78-83: LGTM — signature reflow of _generate_csv_report only.
177-181: LGTM — concise f-strings for overall stats.
189-193: LGTM — concise f-strings for per-metric stats.
215-218: LGTM — concise f-strings for per-conversation stats.
src/lightspeed_evaluation/core/output/visualization.py (5)
42-45: LGTM — helper signature condensed; behavior unchanged.
46-55: LGTM — grouping helper condensed; behavior unchanged.
87-99: LGTM — graph generation calls inlined; behavior unchanged.
130-130: LGTM — compact status breakdown label assembly.
423-426: LGTM — condensed detailed-summary helper; behavior unchanged.
src/lightspeed_evaluation/core/config/models.py (4)
59-60: LGTM — single-line Field declaration; no semantic change.
86-87: LGTM — clearer error message formatting; no semantic change.
117-122: LGTM — formatting only; logic unchanged.
168-174: LGTM — provider Field collapsed; no semantic change.
src/lightspeed_evaluation/core/config/loader.py (1)

176-179: LGTM on config metadata wiring. Flattened defaults for `default_turn_metrics_metadata` and `default_conversation_metrics_metadata` look correct and align with how `_get_effective_threshold` consumes them.

Also applies to: 236-238
src/lightspeed_evaluation/runner/evaluation.py (2)
41-43: LGTM: concise status prints. Single-line system/eval status messages improve readability without behavior changes.

73-75: LGTM: compact error summary print. The inline f-string reads clearly and is consistent with other output.
src/lightspeed_evaluation/drivers/evaluation.py (4)
1-4: Module docstring is compliant with D212/D205. Summary on first line with a blank line before details. No action needed.

86-91: LGTM: handler routing map formatting. Single-line annotation for `handlers` improves clarity; keys align with frameworks used elsewhere.

116-123: Docstring tightening looks good. Concise bullets and clear responsibilities. No semantic changes.

288-296: LGTM: threshold resolution falls back correctly. Conversation vs. turn default metadata lookup matches the config wiring in `loader.SystemConfig`.
Force-pushed from 4c450a9 to abacc4a.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (2)
51-76: Harden types and parsing to avoid .strip() on non-strings. Broaden input types and validate/normalize values to prevent AttributeError if the API returns non-string values.
Apply:
```diff
-    def query_agent(self, api_input: dict[str, str], timeout: int = 300) -> dict[str, Any]:
+    def query_agent(self, api_input: dict[str, Any], timeout: float = 300) -> dict[str, Any]:
@@
-        response_data = response.json()
-        if "response" not in response_data:
-            raise AgentAPIError("Agent response missing 'response' field")
-
-        agent_response = response_data["response"].strip()
-        conversation_id = response_data.get("conversation_id", "").strip()
-        tool_calls = response_data.get("tool_calls", [])
+        response_data = response.json()
+        if "response" not in response_data:
+            raise AgentAPIError("Agent response missing 'response' field")
+
+        resp = response_data["response"]
+        if not isinstance(resp, str):
+            raise AgentAPIError("Agent response 'response' must be a string")
+        agent_response = resp.strip()
+
+        conv = response_data.get("conversation_id")
+        conversation_id = str(conv).strip() if conv is not None else ""
+
+        tool_calls = response_data.get("tool_calls") or []
```

28-30: Do not disable TLS verification by default. `verify=False` is unsafe. Gate with an env flag and default to verifying TLS.

```diff
-        # enable verify, currently for eval it is set to False
-        self.client = httpx.Client(base_url=self.endpoint, verify=False)
+        # TLS verify is configurable via AGENT_API_VERIFY; defaults to enabled
+        verify_env = os.getenv("AGENT_API_VERIFY", "true").lower()
+        verify = verify_env in {"1", "true", "yes"}
+        self.client = httpx.Client(base_url=self.endpoint, verify=verify)
```

lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (3)
51-54: Preempt W1404 by removing implicit concatenation in Azure error. Same adjacency pattern here; collapse to a single literal to avoid future pylint failures.

```diff
-            raise JudgeModelError(
-                "AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT "
-                "environment variables are required for Azure provider"
-            )
+            raise JudgeModelError(
+                "AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT environment variables are required for Azure provider"
+            )
```

67-71: Preempt W1404 in Watsonx error message. Three adjacent literals; join them.

```diff
-            raise JudgeModelError(
-                "WATSONX_API_KEY, WATSONX_API_BASE, and "
-                "WATSONX_PROJECT_ID environment variables are "
-                "required for Watsonx provider"
-            )
+            raise JudgeModelError(
+                "WATSONX_API_KEY, WATSONX_API_BASE, and WATSONX_PROJECT_ID environment variables are required for Watsonx provider"
+            )
```
115-118: Avoid adjacent f-strings inside raise(). This can also be flagged by W1404 depending on config. Merge into a single f-string.

```diff
-            raise JudgeModelError(
-                f"Judge model evaluation failed after "
-                f"{MAX_RETRY_ATTEMPTS} attempts: {e}"
-            ) from e
+            raise JudgeModelError(
+                f"Judge model evaluation failed after {MAX_RETRY_ATTEMPTS} attempts: {e}"
+            ) from e
```
🧹 Nitpick comments (6)
lsc_agent_eval/tests/core/agent_goal_eval/test_agent_goal_eval.py (1)
180-186: Concise fixture wiring — consider a tiny assertion. Nice consolidation. Optionally assert `get_eval_count()` is invoked once to complete the contract being exercised.

Apply inline after Line [201]:

```diff
+        mock_config_manager.return_value.get_eval_count.assert_called_once()
```

lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (3)
90-105: Type the helper to accept Optional and a precise element shape. Reflect possible None and element mapping; behavior remains unchanged.

```diff
-    def _format_query_endpoint_tool_calls(self, tool_calls: list) -> list[list[dict[str, Any]]]:
+    def _format_query_endpoint_tool_calls(
+        self, tool_calls: Optional[list[dict[str, Any]]]
+    ) -> list[list[dict[str, Any]]]:
```
121-139: Make error-body decoding resilient to non-UTF-8. Avoid UnicodeDecodeError on binary/invalid payloads.

```diff
-            error_content = response.read().decode("utf-8")
+            error_content = response.read().decode("utf-8", errors="replace")
```
106-109: Widen streaming input type. Streaming requests often include nested structures; align with the non-streaming suggestion.

```diff
-    def streaming_query_agent(
-        self, api_input: dict[str, str], timeout: int = 300
-    ) -> dict[str, Any]:
+    def streaming_query_agent(
+        self, api_input: dict[str, Any], timeout: float = 300
+    ) -> dict[str, Any]:
```

lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (2)
110-111: Minor message polish: avoid redundant str() and improve readability
`str(response)` is redundant inside an f-string; using `response` calls `__str__` anyway. Optional style tweak.

```diff
-            raise JudgeModelError(
-                f"No valid response from Judge Model. Check full response\n{str(response)}"
-            )
+            raise JudgeModelError(
+                f"No valid response from Judge Model. Check full response:\n{response}"
+            )
```

82-123: Optional: broaden retry to include transient API errors and add backoff. Currently only `TimeoutError` is retried. Consider also handling transient provider errors (e.g., rate limits, 5xx) and using exponential backoff with jitter to reduce thundering herds. This can materially improve robustness without changing behavior on success.
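A generic sketch of that retry pattern (not the project's judge code; the `call_judge` callable and the set of retryable exceptions are assumptions):

```python
import random
import time

MAX_RETRY_ATTEMPTS = 3
BASE_DELAY_SECONDS = 1.0


def evaluate_with_retry(call_judge, *args, **kwargs):
    """Retry a judge call on transient failures with exponential backoff plus jitter."""
    for attempt in range(1, MAX_RETRY_ATTEMPTS + 1):
        try:
            return call_judge(*args, **kwargs)
        except TimeoutError:  # extend with provider-specific transient errors (rate limits, 5xx)
            if attempt == MAX_RETRY_ATTEMPTS:
                raise
            # Exponential backoff (1s, 2s, 4s, ...) plus jitter to avoid synchronized retries.
            delay = BASE_DELAY_SECONDS * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
            time.sleep(delay)
```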
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (38)
- archive/lightspeed_core_evaluation/query_rag.py (2 hunks)
- archive/lightspeed_core_evaluation/rag_eval.py (5 hunks)
- archive/lightspeed_core_evaluation/response_evaluation.py (9 hunks)
- archive/lightspeed_core_evaluation/taxonomy_eval.py (3 hunks)
- archive/lightspeed_core_evaluation/utils/plot.py (1 hunks)
- archive/lightspeed_core_evaluation/utils/relevancy_score.py (1 hunks)
- archive/lightspeed_core_evaluation/utils/response.py (2 hunks)
- archive/lightspeed_core_evaluation/utils/score.py (2 hunks)
- lsc_agent_eval/src/lsc_agent_eval/agent_eval.py (1 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (3 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py (4 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (3 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (3 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/script_runner.py (1 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (4 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (2 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py (1 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_agent_goal_eval.py (10 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_eval_data.py (8 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (23 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_models.py (2 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_results.py (3 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py (5 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py (2 hunks)
- lsc_agent_eval/tests/core/utils/test_api_client.py (7 hunks)
- lsc_agent_eval/tests/core/utils/test_judge.py (2 hunks)
- lsc_agent_eval/tests/core/utils/test_streaming_parser.py (1 hunks)
- lsc_agent_eval/tests/test_agent_eval.py (2 hunks)
- src/generate_answers/ls_response.py (2 hunks)
- src/lightspeed_evaluation/core/llm/deepeval.py (1 hunks)
- src/lightspeed_evaluation/core/llm/manager.py (4 hunks)
- src/lightspeed_evaluation/core/llm/ragas.py (2 hunks)
- src/lightspeed_evaluation/core/metrics/custom.py (2 hunks)
- src/lightspeed_evaluation/core/metrics/deepeval.py (1 hunks)
- src/lightspeed_evaluation/core/metrics/ragas.py (3 hunks)
- src/lightspeed_evaluation/runner/evaluation.py (4 hunks)
✅ Files skipped from review due to trivial changes (29)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py
- lsc_agent_eval/tests/core/utils/test_streaming_parser.py
- archive/lightspeed_core_evaluation/query_rag.py
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/script_runner.py
- lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py
- archive/lightspeed_core_evaluation/utils/relevancy_score.py
- archive/lightspeed_core_evaluation/rag_eval.py
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py
- lsc_agent_eval/tests/core/utils/test_judge.py
- lsc_agent_eval/src/lsc_agent_eval/agent_eval.py
- archive/lightspeed_core_evaluation/utils/score.py
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py
- src/lightspeed_evaluation/core/metrics/deepeval.py
- lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
- lsc_agent_eval/tests/core/utils/test_api_client.py
- src/generate_answers/ls_response.py
- archive/lightspeed_core_evaluation/utils/response.py
- src/lightspeed_evaluation/core/llm/deepeval.py
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py
- lsc_agent_eval/tests/core/agent_goal_eval/test_results.py
- archive/lightspeed_core_evaluation/utils/plot.py
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py
- archive/lightspeed_core_evaluation/response_evaluation.py
- lsc_agent_eval/tests/test_agent_eval.py
- lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py
- lsc_agent_eval/tests/core/agent_goal_eval/test_eval_data.py
- archive/lightspeed_core_evaluation/taxonomy_eval.py
🚧 Files skipped from review as they are similar to previous changes (5)
- src/lightspeed_evaluation/core/llm/manager.py
- src/lightspeed_evaluation/core/metrics/ragas.py
- src/lightspeed_evaluation/runner/evaluation.py
- src/lightspeed_evaluation/core/llm/ragas.py
- src/lightspeed_evaluation/core/metrics/custom.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.
Applied to files:
lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py
🧬 Code graph analysis (4)
lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (1)
lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (1)
JudgeModelError(20-21)
lsc_agent_eval/tests/core/agent_goal_eval/test_agent_goal_eval.py (2)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py (1)
get_conversations(153-155)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/results.py (1)
save_results(25-46)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (1)
lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (1)
AgentAPIError(12-13)
lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py (1)
lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (1)
ScriptExecutionError(16-17)
🪛 GitHub Actions: Python linter
lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py
[warning] 39-39: Step 'make pylint' failed. Pylint W1404: Implicit string concatenation found in call (implicit-str-concat).
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: mypy
🔇 Additional comments (20)
lsc_agent_eval/tests/core/agent_goal_eval/test_agent_goal_eval.py (10)
80-107: Patch ordering and argument mapping are correct. Bottom-up patch application maps to parameters as expected; constructor dependencies are verified cleanly.

108-133: Init without judge manager covered well. Decorator order matches args; assertion that `evaluator.judge_manager is None` is appropriate.
134-159: Kubeconfig propagation validated.
`ScriptRunner` is called with the provided kubeconfig and wiring to `EvaluationRunner` is checked.

160-177: End-to-end success path wiring looks good. Including `ResultsManager` in the patches and ordering it as the first arg (bottom-most decorator) is correct.
203-204: Result persistence assertion is precise. Verifies correct directory argument; no change needed.

211-216: Inline patch context reads clearer. Shortened `with` block maintains correctness and improves readability.

227-232: No-results path is explicit and accurate. Exception type and message are asserted tightly.

241-248: Cleanup with client: target and assertion are right. Patching at the module-under-test boundary and verifying `close()` is ideal.

262-269: Exception cleanup path exercised well. Mocks set up to raise during `close()` and handled in `_cleanup()` as expected.

277-285: Logger patch verifies message and arg binding. Capturing format string and exception text is robust; good guard against regressions.
lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py (5)
21-21: Signature compaction only; behavior unchanged. Keeps decorator order and assertions intact.

94-94: Failure-path test signature compacted correctly. Test intent and checks preserved.

119-119: Side-effect in one line — good. `SubprocessError` path remains clearly exercised.

138-139: Raised-message match streamlined. Assertion remains specific and stable.

154-155: Chmod error path assertion compact and correct. Test continues to validate unexpected-error wrapping.
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (2)
64-67: Good: explicit 'response' presence check. The added guard for the required "response" field tightens validation.

148-151: Clearer timeout message. Good consistency with the non-streaming path.
lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (3)
31-31: LGTM on log reflow. Single-line debug log is fine and keeps context intact.

120-121: LGTM on single-line warning log. Compact warning message reads well; keeps attempt count and exception context.
38-40: Fix implicit string concatenation (pylint W1404) — pipeline blocker
Adjacent string literals in the exception message trigger W1404. Merge into a single literal:

```diff
-            raise JudgeModelError(
-                "OPENAI_API_KEY environment variable is required " "for OpenAI provider"
-            )
+            raise JudgeModelError("OPENAI_API_KEY environment variable is required for OpenAI provider")
```

Run a direct lint check to confirm W1404 is resolved:
pylint --disable=all --enable=W1404 lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py
Actionable comments posted: 0
♻️ Duplicate comments (4)
lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (4)
259-259: Same `conversation_id` fallback nuance as above. Default won't apply if the value is `None`; consider `...get("conversation_id") or "test-conversation-id"`.

289-289: Same `conversation_id` fallback nuance as above.

349-349: Same `conversation_id` fallback nuance as above.

387-387: Same `conversation_id` fallback nuance as above.
🧹 Nitpick comments (2)
lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (2)
32-32: Use `or` to fall back when `conversation_id` is present but `None`.

`dict.get(key, default)` won't use the default if the key exists with value `None`. If `run_evaluation` passes `conversation_id=None`, this will propagate `None` instead of a generated ID.

Apply:

```diff
-            "conversation_id": api_input.get("conversation_id", "generated-conversation-id"),
+            "conversation_id": api_input.get("conversation_id") or "generated-conversation-id",
```
679-679: Add a negative test for non-streaming `query` missing `"response"`. The PR notes a validation change in `query_agent`; consider a test where `query_agent` raises on missing `"response"` to lock behavior.

Example:

```diff
+    def test_query_mode_missing_response_errors(self, mock_agent_client, mock_script_runner, sample_config_judge_llm):
+        # Simulate query endpoint returning payload without "response"
+        mock_agent_client.query_agent.side_effect = AgentAPIError("Missing 'response' field")
+        runner = EvaluationRunner(mock_agent_client, mock_script_runner)
+        results = runner.run_evaluation(
+            sample_config_judge_llm, "watsonx", "ibm/granite-3-3-8b-instruct", "conv-id-123", endpoint_type="query"
+        )
+        assert len(results) == 1 and results[0].result == "ERROR"
+        assert "Missing 'response' field" in results[0].error
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (38)
- archive/lightspeed_core_evaluation/query_rag.py (2 hunks)
- archive/lightspeed_core_evaluation/rag_eval.py (5 hunks)
- archive/lightspeed_core_evaluation/response_evaluation.py (9 hunks)
- archive/lightspeed_core_evaluation/taxonomy_eval.py (3 hunks)
- archive/lightspeed_core_evaluation/utils/plot.py (1 hunks)
- archive/lightspeed_core_evaluation/utils/relevancy_score.py (1 hunks)
- archive/lightspeed_core_evaluation/utils/response.py (2 hunks)
- archive/lightspeed_core_evaluation/utils/score.py (2 hunks)
- lsc_agent_eval/src/lsc_agent_eval/agent_eval.py (1 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (3 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py (4 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (3 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (3 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/script_runner.py (1 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (4 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (2 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py (1 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_agent_goal_eval.py (10 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_eval_data.py (8 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (23 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_models.py (2 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_results.py (3 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py (5 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py (2 hunks)
- lsc_agent_eval/tests/core/utils/test_api_client.py (7 hunks)
- lsc_agent_eval/tests/core/utils/test_judge.py (2 hunks)
- lsc_agent_eval/tests/core/utils/test_streaming_parser.py (1 hunks)
- lsc_agent_eval/tests/test_agent_eval.py (2 hunks)
- src/generate_answers/ls_response.py (2 hunks)
- src/lightspeed_evaluation/core/llm/deepeval.py (1 hunks)
- src/lightspeed_evaluation/core/llm/manager.py (4 hunks)
- src/lightspeed_evaluation/core/llm/ragas.py (2 hunks)
- src/lightspeed_evaluation/core/metrics/custom.py (2 hunks)
- src/lightspeed_evaluation/core/metrics/deepeval.py (1 hunks)
- src/lightspeed_evaluation/core/metrics/ragas.py (3 hunks)
- src/lightspeed_evaluation/runner/evaluation.py (4 hunks)
✅ Files skipped from review due to trivial changes (12)
- lsc_agent_eval/tests/core/utils/test_streaming_parser.py
- lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py
- archive/lightspeed_core_evaluation/utils/response.py
- archive/lightspeed_core_evaluation/query_rag.py
- archive/lightspeed_core_evaluation/utils/score.py
- src/generate_answers/ls_response.py
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py
- archive/lightspeed_core_evaluation/taxonomy_eval.py
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py
- archive/lightspeed_core_evaluation/utils/relevancy_score.py
- src/lightspeed_evaluation/core/metrics/deepeval.py
🚧 Files skipped from review as they are similar to previous changes (25)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/script_runner.py
- lsc_agent_eval/tests/core/agent_goal_eval/test_eval_data.py
- archive/lightspeed_core_evaluation/utils/plot.py
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py
- lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
- lsc_agent_eval/tests/core/utils/test_judge.py
- lsc_agent_eval/src/lsc_agent_eval/agent_eval.py
- src/lightspeed_evaluation/core/metrics/custom.py
- lsc_agent_eval/tests/core/agent_goal_eval/test_results.py
- archive/lightspeed_core_evaluation/response_evaluation.py
- src/lightspeed_evaluation/runner/evaluation.py
- src/lightspeed_evaluation/core/llm/deepeval.py
- lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py
- lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py
- lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py
- archive/lightspeed_core_evaluation/rag_eval.py
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py
- src/lightspeed_evaluation/core/metrics/ragas.py
- src/lightspeed_evaluation/core/llm/ragas.py
- lsc_agent_eval/tests/core/utils/test_api_client.py
- lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py
- lsc_agent_eval/tests/core/agent_goal_eval/test_agent_goal_eval.py
- lsc_agent_eval/tests/test_agent_eval.py
- src/lightspeed_evaluation/core/llm/manager.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message
Applied to files:
lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py
🧬 Code graph analysis (1)
lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (4)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (2)
EvaluationRunner(20-201)
run_evaluation(34-106)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/script_runner.py (1)
run_script(28-72)
lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (2)
ScriptExecutionError(16-17)
AgentAPIError(12-13)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (1)
streaming_query_agent(106-155)
🔇 Additional comments (21)
lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (21)
126-126: LGTM: formatting-only change. No functional differences.

172-172: LGTM: formatting-only change. No functional differences.

236-236: LGTM: explicit ScriptExecutionError path is correctly exercised. Aligns with FAIL vs ERROR semantics from prior learnings.

315-315: LGTM: agent API error path is covered. Verifies ERROR result with message as intended.
336-336: LGTM: runner construction condensed.
367-367: LGTM: side_effect override is clear and scoped.
379-379: LGTM: side_effect override for no-keywords case.
407-411: Confirm `tool_calls` shape contract (list[list[call]]). Tests assume a nested list-of-lists; ensure `compare_tool_calls` and result serialization use the same shape to avoid brittle mismatches.
420-420: LGTM: runner construction.
426-426: LGTM: asserts include `tool_calls` echoing actuals.
442-446: LGTM: failure path and mismatched tool calls covered.
455-455: LGTM: runner construction.
471-471: LGTM: runner construction.
508-519: LGTM: comprehensive happy-path fixture for multi-eval. Covers script, substring (case-insensitive), judge, and tools.
545-545: LGTM: runner construction.
579-583: LGTM: mixed-results fixture wired correctly.
605-605: LGTM: runner construction.
635-639: LGTM: error-focused fixture keeps `tool_calls` empty.

641-641: LGTM: script error surfaced as `ERROR`, consistent with semantics.
657-657: LGTM: runner construction.
714-714: LGTM: invalid endpoint test is concise and accurate. Asserts ERROR and no agent calls—good guard.