
fix rule for black & pydocstyle #45

Merged

tisnik merged 5 commits into lightspeed-core:main from asamal4:fix-black-docstyle-rule on Sep 4, 2025

Conversation

@asamal4 (Collaborator) commented Sep 4, 2025

Summary by CodeRabbit

  • Documentation
    • Fixed an unclosed code block in the README’s Basic Usage section.
  • Style
    • Standardized docstring formatting across evaluation modules and LLM managers for consistency and readability.
    • Minor punctuation and string literal cleanups in logs and messages.
  • Chores
    • Added project-wide formatting and docstring style configuration (Black, Google convention) to improve consistency across the codebase.

@coderabbitai coderabbitai bot commented Sep 4, 2025

Walkthrough

Docstring and string-literal formatting updates across multiple modules, addition of Black and pydocstyle configuration in pyproject.toml, and a README code fence closure. No behavioral, API, or control-flow changes.

Changes

  • Tooling config: pyproject.toml. Added formatting/style config: [tool.black] (line-length 88) and [tool.pydocstyle] (convention "google").
  • Docstring reflow (lightspeed_evaluation): src/lightspeed_evaluation/__init__.py, src/lightspeed_evaluation/core/llm/manager.py, src/lightspeed_evaluation/core/llm/ragas.py, src/lightspeed_evaluation/core/llm/deepeval.py, src/lightspeed_evaluation/core/metrics/custom.py, src/lightspeed_evaluation/core/metrics/ragas.py, src/lightspeed_evaluation/core/metrics/deepeval.py, src/lightspeed_evaluation/drivers/evaluation.py, src/lightspeed_evaluation/runner/evaluation.py. Converted multi-line opening docstrings to single-line openers; minor punctuation normalization. No logic/API changes.
  • Readme fix: README.md. Added missing closing Markdown code fence for a snippet.
  • Minor literal/layout tweaks (non-functional): archive/lightspeed_core_evaluation/taxonomy_eval.py, lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py, lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py. Reformatted call layout and merged adjacent string literals; error/message text unchanged; no control-flow or API changes.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Suggested reviewers

  • tisnik
  • VladimirKadlec

Poem

Hops on the keys, I tidy each line,
Quotes in a row—now everything’s fine.
Fences are closed, docs sleek and neat,
Carrots for style, a formatting treat.
With whiskers twitching, I lint and I preen—
The code now reads crisp, polished, and clean. 🥕✨


📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 774f206 and 76861fc.

📒 Files selected for processing (10)
  • archive/lightspeed_core_evaluation/taxonomy_eval.py (1 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (1 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (1 hunks)
  • pyproject.toml (1 hunks)
  • src/lightspeed_evaluation/core/llm/manager.py (1 hunks)
  • src/lightspeed_evaluation/core/llm/ragas.py (1 hunks)
  • src/lightspeed_evaluation/core/metrics/custom.py (2 hunks)
  • src/lightspeed_evaluation/core/metrics/ragas.py (1 hunks)
  • src/lightspeed_evaluation/drivers/evaluation.py (4 hunks)
  • src/lightspeed_evaluation/runner/evaluation.py (1 hunks)
✅ Files skipped from review due to trivial changes (3)
  • src/lightspeed_evaluation/core/metrics/custom.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py
  • src/lightspeed_evaluation/core/llm/manager.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py
  • archive/lightspeed_core_evaluation/taxonomy_eval.py
  • src/lightspeed_evaluation/core/llm/ragas.py
  • src/lightspeed_evaluation/drivers/evaluation.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • src/lightspeed_evaluation/core/metrics/ragas.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: mypy
🔇 Additional comments (2)
pyproject.toml (2)

51-53: Install Black, set target versions, and sync linters

  • Confirm Black is available (black --version), then in pyproject.toml under [tool.black] add:
    line-length = 88
    target-version = ["py311", "py312"]
  • Optionally mirror the line length for Ruff in the same file:
    [tool.ruff]
    line-length = 88

Re-run black --check --diff . and ruff format --check . to verify formatting.


54-56: Lock pydocstyle behavior and align Ruff

  • Confirm pydocstyle is installed (the pydocstyle command was not found) or remove the unused [tool.pydocstyle] section; if you keep it, extend it to:

    [tool.pydocstyle]
    convention = "google"
    add-select = ["D212"]
    add-ignore = ["D203", "D213"]
  • Under [tool.ruff.lint], enable Ruff’s pydocstyle plugin:

    [tool.ruff.lint]
    select = ["E", "F", "D"]
    
    [tool.ruff.lint.pydocstyle]
    convention = "google"
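A quick sanity check that both sections landed in pyproject.toml, as a minimal sketch assuming Python 3.11+ (for tomllib) and a run from the repo root:

    import tomllib

    # Load pyproject.toml and print the two settings added in this PR.
    with open("pyproject.toml", "rb") as fh:
        config = tomllib.load(fh)

    tool = config.get("tool", {})
    print("black line-length:", tool.get("black", {}).get("line-length"))
    print("pydocstyle convention:", tool.get("pydocstyle", {}).get("convention"))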

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
src/lightspeed_evaluation/core/metrics/ragas.py (1)

27-36: Fix D212: put summary on the first line of multi-line docstring.

Current summary starts on Line 28. Move it to Line 27.

Apply:

-        """
-        Initialize with LLM Manager.
+        """Initialize with LLM Manager.
 
         Args:
             llm_manager: Pre-configured LLMManager with validated parameters
         """
src/lightspeed_evaluation/core/metrics/custom.py (2)

29-34: Fix D212 in init docstring.

Summary must start on the same line as opening quotes.

Apply:

-        """
-        Initialize with LLM Manager.
+        """Initialize with LLM Manager.
 
         Args:
             llm_manager: Pre-configured LLMManager with validated parameters
         """

86-94: Fix D212 in _parse_score_response docstring.

Keep raw string, move summary to first line.

Apply:

-        r"""
-        Parse LLM response to extract score and reason.
+        r"""Parse LLM response to extract score and reason.
 
         Expected formats:
         - "Score: 0.85\nReason: The answer is accurate..."
         - "8.5/10 - The response is comprehensive..."
         - "Rating: 4.2 out of 5"
         """
src/lightspeed_evaluation/core/llm/manager.py (1)

14-21: Fix D212: move docstring summary to the first line
In src/lightspeed_evaluation/core/llm/manager.py (class LLMManager, lines 14–21), update the docstring to start the summary on the first line:

 class LLMManager:
-    """
-    Generic LLM Manager for all use cases (Ragas, DeepEval, Custom metrics).
-
-    Responsibilities:
-    - Environment validation for multiple providers
-    - Model name construction
-    - Provides LiteLLM parameters for consumption by framework-specific managers
-    """
+    """Generic LLM Manager for all use cases (Ragas, DeepEval, Custom metrics).
+
+    Responsibilities:
+    - Environment validation for multiple providers
+    - Model name construction
+    - Provides LiteLLM parameters for consumption by framework-specific managers
+    """

This will resolve the pydocstyle D212 warning.

src/lightspeed_evaluation/core/llm/ragas.py (1)

79-84: Fix pydocstyle D212: start summary line on the first line of the docstring.

Matches repo’s pydocstyle config.

Apply this diff:

 class RagasLLMManager:
-    """
-    Ragas LLM Manager - Takes LLM parameters directly.
-
-    This manager focuses solely on Ragas-specific LLM integration.
-    """
+    """Ragas LLM Manager - Takes LLM parameters directly.
+
+    This manager focuses solely on Ragas-specific LLM integration.
+    """
src/lightspeed_evaluation/core/config/loader.py (1)

50-58: Harden log-level parsing and avoid print for log warnings.

  • If source_level/package_level is invalid, getattr(logging, ...) raises AttributeError before basicConfig, causing an early crash. Parse levels defensively and fall back to sane defaults.
  • Use logging.warning (root logger) instead of print for the invalid override case to keep diagnostics consistent.
 def setup_logging(logging_config: Dict[str, Any]) -> logging.Logger:
     """Configure logging for application and packages."""
     # Get logging settings with new structure
-    source_level = getattr(logging, logging_config.get("source_level", "INFO").upper())
-    package_level = getattr(logging, logging_config.get("package_level", "WARNING").upper())
+    def _resolve_level(name: str, default: int) -> int:
+        try:
+            return getattr(logging, name.upper())
+        except AttributeError:
+            # Will be emitted after basicConfig below
+            return default
+
+    source_level = _resolve_level(logging_config.get("source_level", "INFO"), logging.INFO)
+    package_level = _resolve_level(logging_config.get("package_level", "WARNING"), logging.WARNING)
@@
-        except AttributeError:
-            print(f"Warning: Invalid log level '{override_level}' for package '{package_name}'")
+        except AttributeError:
+            logging.warning(
+                "Invalid log level '%s' for package '%s'", override_level, package_name
+            )

Additionally, after basicConfig, consider warning if the top-level source_level/package_level names were invalid and defaults were used. I can add that if you’d like.

Also applies to: 83-94
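The defensive level lookup on its own, as a hedged sketch (the helper name is hypothetical, not the loader's API):

    import logging

    def resolve_level(name: str, default: int) -> int:
        """Map a level name like 'INFO' to its numeric value, falling back on bad input."""
        level = logging.getLevelName(str(name).upper())
        return level if isinstance(level, int) else default

    # Invalid names fall back instead of raising AttributeError.
    assert resolve_level("info", logging.WARNING) == logging.INFO
    assert resolve_level("verbose", logging.WARNING) == logging.WARNING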

🧹 Nitpick comments (6)
src/lightspeed_evaluation/core/metrics/custom.py (1)

59-84: Minor: make error propagation more explicit if desired.

Optional: preserving the original traceback while shortening the message is already handled via from e; no change required.

src/lightspeed_evaluation/core/llm/manager.py (1)

27-27: Prefer logging over print for library code.

Use a module logger for consistency (you already do this in visualization.py).

Apply this diff:

-        print(f"✅ LLM Manager: {self.config.provider}/{self.config.model} -> {self.model_name}")
+        import logging
+        logging.getLogger("lightspeed_evaluation.LLMManager").info(
+            "✅ LLM Manager: %s/%s -> %s",
+            self.config.provider,
+            self.config.model,
+            self.model_name,
+        )
src/lightspeed_evaluation/core/llm/ragas.py (2)

20-27: Unify “unset temperature” handling with None instead of a sentinel.

Using 1e-08 as a sentinel blocks users from intentionally setting that exact value. Prefer Optional[float]=None and fall back to litellm_params when None.

Apply this diff:

-    def generate_text(  # pylint: disable=too-many-arguments,too-many-positional-arguments
+    def generate_text(  # pylint: disable=too-many-arguments,too-many-positional-arguments
         self,
         prompt: Any,
         n: int = 1,
-        temperature: float = 1e-08,
+        temperature: Optional[float] = None,
         stop: Optional[List[str]] = None,
         callbacks: Optional[Any] = None,
     ) -> LLMResult:
@@
-        # Use temperature from params unless explicitly overridden
-        temp = temperature if temperature != 1e-08 else self.litellm_params.get("temperature", 0.0)
+        # Use temperature from params unless explicitly overridden
+        temp = self.litellm_params.get("temperature", 0.0) if temperature is None else temperature
@@
-    async def agenerate_text(  # pylint: disable=too-many-arguments,too-many-positional-arguments
+    async def agenerate_text(  # pylint: disable=too-many-arguments,too-many-positional-arguments
         self,
         prompt: Any,
         n: int = 1,
-        temperature: Optional[float] = None,
+        temperature: Optional[float] = None,
         stop: Optional[List[str]] = None,
         callbacks: Optional[Any] = None,
     ) -> LLMResult:
         """Async generate."""
-        temp = temperature if temperature is not None else 1e-08
-        return self.generate_text(prompt, n=n, temperature=temp, stop=stop, callbacks=callbacks)
+        return self.generate_text(prompt, n=n, temperature=temperature, stop=stop, callbacks=callbacks)

Also applies to: 31-33, 61-71, 71-72
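The None-as-unset pattern in isolation, as a small sketch (names are illustrative, not the Ragas wrapper's actual signature):

    from typing import Optional

    LITELLM_PARAMS = {"temperature": 0.0}  # stand-in for self.litellm_params

    def effective_temperature(temperature: Optional[float] = None) -> float:
        """Return the caller's temperature if given, otherwise the configured default."""
        return LITELLM_PARAMS.get("temperature", 0.0) if temperature is None else temperature

    assert effective_temperature() == 0.0          # falls back to the configured default
    assert effective_temperature(1e-08) == 1e-08   # an explicit 1e-08 is now honored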


18-19: Use logging instead of print for operational messages and errors.

Keeps output controllable and consistent with GraphGenerator.

Apply this diff:

-        print(f"✅ Ragas Custom LLM: {self.model_name}")
+        import logging
+        self._logger = logging.getLogger("lightspeed_evaluation.RagasCustomLLM")
+        self._logger.info("✅ Ragas Custom LLM: %s", self.model_name)
@@
-            print(f"❌ Ragas LLM failed: {e}")
+            self._logger.exception("❌ Ragas LLM failed: %s", e)

Also applies to: 57-59

src/lightspeed_evaluation/core/output/visualization.py (1)

151-153: Swap axis labels for the pass-rate bar chart.

Bars plot pass_rates on Y vs metrics on X; labels are currently reversed.

Apply this diff:

-        ax.set_xlabel("Pass Rate (%)", fontsize=12)
-        ax.set_ylabel("Metrics", fontsize=12)
+        ax.set_xlabel("Metrics", fontsize=12)
+        ax.set_ylabel("Pass Rate (%)", fontsize=12)
src/lightspeed_evaluation/drivers/evaluation.py (1)

222-234: Optional: unwrap nested ternary for readability.

The emoji/status selection is compact but a bit dense. Consider a small refactor for future maintainability.

-                status_emoji = (
-                    "✅" if result_status == "PASS" else "❌" if result_status == "FAIL" else "⚠️"
-                )
+                if result_status == "PASS":
+                    status_emoji = "✅"
+                elif result_status == "FAIL":
+                    status_emoji = "❌"
+                else:
+                    status_emoji = "⚠️"
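A dictionary lookup is another option that keeps this to one line per status; a small sketch using the same strings as above:

    STATUS_EMOJI = {"PASS": "✅", "FAIL": "❌"}

    def status_emoji(result_status: str) -> str:
        """Pick the emoji for a status, defaulting to a warning sign for anything else."""
        return STATUS_EMOJI.get(result_status, "⚠️")

    assert status_emoji("PASS") == "✅"
    assert status_emoji("ERROR") == "⚠️"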
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between a10547d and 1b41736.

📒 Files selected for processing (13)
  • pyproject.toml (1 hunks)
  • src/lightspeed_evaluation/__init__.py (1 hunks)
  • src/lightspeed_evaluation/core/config/loader.py (6 hunks)
  • src/lightspeed_evaluation/core/config/models.py (4 hunks)
  • src/lightspeed_evaluation/core/llm/manager.py (3 hunks)
  • src/lightspeed_evaluation/core/llm/ragas.py (2 hunks)
  • src/lightspeed_evaluation/core/metrics/custom.py (2 hunks)
  • src/lightspeed_evaluation/core/metrics/ragas.py (2 hunks)
  • src/lightspeed_evaluation/core/output/generator.py (5 hunks)
  • src/lightspeed_evaluation/core/output/statistics.py (1 hunks)
  • src/lightspeed_evaluation/core/output/visualization.py (4 hunks)
  • src/lightspeed_evaluation/drivers/evaluation.py (10 hunks)
  • src/lightspeed_evaluation/runner/evaluation.py (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.

Applied to files:

  • src/lightspeed_evaluation/__init__.py
🧬 Code graph analysis (7)
src/lightspeed_evaluation/runner/evaluation.py (1)
src/lightspeed_evaluation/core/output/generator.py (1)
  • generate_reports (32-76)
src/lightspeed_evaluation/core/output/visualization.py (2)
src/lightspeed_evaluation/core/config/models.py (1)
  • EvaluationResult (126-162)
src/lightspeed_evaluation/core/output/statistics.py (1)
  • calculate_basic_stats (19-45)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/config/models.py (1)
  • EvaluationResult (126-162)
src/lightspeed_evaluation/core/metrics/ragas.py (3)
src/lightspeed_evaluation/core/config/models.py (1)
  • TurnData (8-44)
src/lightspeed_evaluation/drivers/evaluation.py (1)
  • _evaluate_metric (203-267)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
  • _evaluate_metric (51-60)
src/lightspeed_evaluation/core/config/loader.py (1)
src/lightspeed_evaluation/core/config/models.py (1)
  • validate_metrics (82-87)
src/lightspeed_evaluation/core/output/generator.py (1)
src/lightspeed_evaluation/core/config/models.py (1)
  • EvaluationResult (126-162)
src/lightspeed_evaluation/drivers/evaluation.py (2)
src/lightspeed_evaluation/runner/evaluation.py (1)
  • run_evaluation (15-86)
src/lightspeed_evaluation/core/config/models.py (3)
  • EvaluationData (47-123)
  • EvaluationResult (126-162)
  • TurnData (8-44)
🪛 GitHub Actions: Pydocstyle
src/lightspeed_evaluation/runner/evaluation.py

[warning] 18-18: pydocstyle D212: Multi-line docstring summary should start at the first line.

src/lightspeed_evaluation/core/llm/manager.py

[warning] 14-14: pydocstyle D212: Multi-line docstring summary should start at the first line.

src/lightspeed_evaluation/core/metrics/ragas.py

[warning] 27-27: pydocstyle D212: Multi-line docstring summary should start at the first line.

src/lightspeed_evaluation/core/metrics/custom.py

[warning] 29-29: pydocstyle D212: Multi-line docstring summary should start at the first line.


[warning] 86-86: pydocstyle D212: Multi-line docstring summary should start at the first line.

src/lightspeed_evaluation/core/llm/ragas.py

[warning] 79-79: pydocstyle D212: Multi-line docstring summary should start at the first line.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: mypy
🔇 Additional comments (33)
pyproject.toml (1)

51-53: Black config LGTM.

Line length 100 is reasonable for this codebase.

src/lightspeed_evaluation/core/metrics/ragas.py (2)

51-62: Helper refactor reads well.

Single-line signature and list-comprehension are clear and unchanged semantically.


151-151: Formatting-only return change is fine.

No behavioral change; good.

src/lightspeed_evaluation/core/metrics/custom.py (2)

20-22: Field reflow LGTM.

Type hints and defaults unchanged.


38-39: Supported metrics dict reflow LGTM.

No semantic change.

src/lightspeed_evaluation/__init__.py (1)

1-13: Nice: docstring now complies with D212.

Matches Google style and the new pydocstyle config.

src/lightspeed_evaluation/core/output/statistics.py (1)

71-71: Signature reflow LGTM.

Purely cosmetic; maintains type clarity.

src/lightspeed_evaluation/core/llm/manager.py (2)

53-53: LGTM — concise raise formatting.


65-65: LGTM — concise raise formatting.

src/lightspeed_evaluation/core/llm/ragas.py (2)

32-33: LGTM — simplified temperature selection is equivalent.


71-72: LGTM — straightforward delegation to sync path.

src/lightspeed_evaluation/core/output/generator.py (6)

54-55: LGTM — inlined JSON summary call; no behavioral change.


58-60: LGTM — inlined text summary call; no behavioral change.


78-83: LGTM — signature reflow of _generate_csv_report only.


177-181: LGTM — concise f-strings for overall stats.


189-193: LGTM — concise f-strings for per-metric stats.


215-218: LGTM — concise f-strings for per-conversation stats.

src/lightspeed_evaluation/core/output/visualization.py (5)

42-45: LGTM — helper signature condensed; behavior unchanged.


46-55: LGTM — grouping helper condensed; behavior unchanged.


87-99: LGTM — graph generation calls inlined; behavior unchanged.


130-130: LGTM — compact status breakdown label assembly.


423-426: LGTM — condensed detailed-summary helper; behavior unchanged.

src/lightspeed_evaluation/core/config/models.py (4)

59-60: LGTM — single-line Field declaration; no semantic change.


86-87: LGTM — clearer error message formatting; no semantic change.


117-122: LGTM — formatting only; logic unchanged.


168-174: LGTM — provider Field collapsed; no semantic change.

src/lightspeed_evaluation/core/config/loader.py (1)

176-179: LGTM on config metadata wiring.

Flattened defaults for default_turn_metrics_metadata and default_conversation_metrics_metadata look correct and align with how _get_effective_threshold consumes them.

Also applies to: 236-238

src/lightspeed_evaluation/runner/evaluation.py (2)

41-43: LGTM: concise status prints.

Single-line system/eval status messages improve readability without behavior changes.


73-75: LGTM: compact error summary print.

Inline f-string read is clear and consistent with other output.

src/lightspeed_evaluation/drivers/evaluation.py (4)

1-4: Module docstring is compliant with D212/D205.

Summary on first line with a blank line before details. No action needed.


86-91: LGTM: handler routing map formatting.

Single-line annotation for handlers improves clarity; keys align with frameworks used elsewhere.


116-123: Docstring tightening looks good.

Concise bullets and clear responsibilities. No semantic changes.


288-296: LGTM: threshold resolution falls back correctly.

Conversation vs. turn default metadata lookup matches the config wiring in loader.SystemConfig.

@asamal4 asamal4 force-pushed the fix-black-docstyle-rule branch from 4c450a9 to abacc4a Compare September 4, 2025 01:32
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (2)

51-76: Harden types and parsing to avoid .strip() on non-strings

Broaden input types and validate/normalize values to prevent AttributeError if the API returns non-string values.

Apply:

-    def query_agent(self, api_input: dict[str, str], timeout: int = 300) -> dict[str, Any]:
+    def query_agent(self, api_input: dict[str, Any], timeout: float = 300) -> dict[str, Any]:
@@
-            response_data = response.json()
-            if "response" not in response_data:
-                raise AgentAPIError("Agent response missing 'response' field")
-
-            agent_response = response_data["response"].strip()
-            conversation_id = response_data.get("conversation_id", "").strip()
-            tool_calls = response_data.get("tool_calls", [])
+            response_data = response.json()
+            if "response" not in response_data:
+                raise AgentAPIError("Agent response missing 'response' field")
+
+            resp = response_data["response"]
+            if not isinstance(resp, str):
+                raise AgentAPIError("Agent response 'response' must be a string")
+            agent_response = resp.strip()
+
+            conv = response_data.get("conversation_id")
+            conversation_id = str(conv).strip() if conv is not None else ""
+
+            tool_calls = response_data.get("tool_calls") or []

28-30: Do not disable TLS verification by default

verify=False is unsafe. Gate with an env flag and default to verifying TLS.

-            # enable verify, currently for eval it is set to False
-            self.client = httpx.Client(base_url=self.endpoint, verify=False)
+            # TLS verify is configurable via AGENT_API_VERIFY; defaults to enabled
+            verify_env = os.getenv("AGENT_API_VERIFY", "true").lower()
+            verify = verify_env in {"1", "true", "yes"}
+            self.client = httpx.Client(base_url=self.endpoint, verify=verify)
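The env-gated verify flag on its own, as a hedged sketch (AGENT_API_VERIFY is the variable name proposed above, not an existing setting):

    import os

    import httpx

    def build_client(endpoint: str) -> httpx.Client:
        """Create an HTTP client that verifies TLS unless explicitly disabled via env."""
        verify = os.getenv("AGENT_API_VERIFY", "true").lower() in {"1", "true", "yes"}
        return httpx.Client(base_url=endpoint, verify=verify)

Keeping verification on by default means an operator has to opt out deliberately for eval-only environments.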
lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (3)

51-54: Preempt W1404 by removing implicit concatenation in Azure error

Same adjacency pattern here; collapse to a single literal to avoid future pylint failures.

-                raise JudgeModelError(
-                    "AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT "
-                    "environment variables are required for Azure provider"
-                )
+                raise JudgeModelError(
+                    "AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT environment variables are required for Azure provider"
+                )

67-71: Preempt W1404 in Watsonx error message

Three adjacent literals; join them.

-                raise JudgeModelError(
-                    "WATSONX_API_KEY, WATSONX_API_BASE, and "
-                    "WATSONX_PROJECT_ID environment variables are "
-                    "required for Watsonx provider"
-                )
+                raise JudgeModelError(
+                    "WATSONX_API_KEY, WATSONX_API_BASE, and WATSONX_PROJECT_ID environment variables are required for Watsonx provider"
+                )

115-118: Avoid adjacent f-strings in the raise statement

This can also be flagged by W1404 depending on config. Merge into a single f-string.

-                    raise JudgeModelError(
-                        f"Judge model evaluation failed after "
-                        f"{MAX_RETRY_ATTEMPTS} attempts: {e}"
-                    ) from e
+                    raise JudgeModelError(
+                        f"Judge model evaluation failed after {MAX_RETRY_ATTEMPTS} attempts: {e}"
+                    ) from e
🧹 Nitpick comments (6)
lsc_agent_eval/tests/core/agent_goal_eval/test_agent_goal_eval.py (1)

180-186: Concise fixture wiring — consider a tiny assertion.

Nice consolidation. Optionally assert get_eval_count() is invoked once to complete the contract being exercised.

Apply inline after line 201:

+        mock_config_manager.return_value.get_eval_count.assert_called_once()
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (3)

90-105: Type the helper to accept Optional and precise element shape

Reflect possible None and element mapping; behavior remains unchanged.

-    def _format_query_endpoint_tool_calls(self, tool_calls: list) -> list[list[dict[str, Any]]]:
+    def _format_query_endpoint_tool_calls(
+        self, tool_calls: Optional[list[dict[str, Any]]]
+    ) -> list[list[dict[str, Any]]]:

121-139: Make error-body decoding resilient to non‑UTF‑8

Avoid UnicodeDecodeError on binary/invalid payloads.

-                    error_content = response.read().decode("utf-8")
+                    error_content = response.read().decode("utf-8", errors="replace")

106-109: Widen streaming input type

Streaming requests often include nested structures; align with non-streaming suggestion.

-    def streaming_query_agent(
-        self, api_input: dict[str, str], timeout: int = 300
-    ) -> dict[str, Any]:
+    def streaming_query_agent(
+        self, api_input: dict[str, Any], timeout: float = 300
+    ) -> dict[str, Any]:
lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (2)

110-111: Minor message polish: avoid redundant str() and improve readability

str(response) is redundant inside an f-string; using response calls __str__ anyway. Optional style tweak.

-                raise JudgeModelError(
-                    f"No valid response from Judge Model. Check full response\n{str(response)}"
-                )
+                raise JudgeModelError(
+                    f"No valid response from Judge Model. Check full response:\n{response}"
+                )

82-123: Optional: broaden retry to include transient API errors and add backoff

Currently only TimeoutError is retried. Consider also handling transient provider errors (e.g., rate limits, 5xx) and using exponential backoff with jitter to reduce thundering herds. This can materially improve robustness without changing behavior on success.
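One shape such a retry could take, as a hedged sketch with exponential backoff and jitter (the retried callable and the set of transient exceptions are placeholders, not the judge module's real API):

    import random
    import time

    MAX_RETRY_ATTEMPTS = 3  # mirrors the constant referenced above

    def call_with_backoff(func, *args, base_delay: float = 1.0, **kwargs):
        """Retry a callable on transient errors with exponential backoff plus jitter."""
        for attempt in range(1, MAX_RETRY_ATTEMPTS + 1):
            try:
                return func(*args, **kwargs)
            except (TimeoutError, ConnectionError):  # placeholder transient error types
                if attempt == MAX_RETRY_ATTEMPTS:
                    raise
                delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
                time.sleep(delay)

The jitter spreads retries out when many evaluations hit the same provider at once.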

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 1b41736 and 4c450a9.

📒 Files selected for processing (38)
  • archive/lightspeed_core_evaluation/query_rag.py (2 hunks)
  • archive/lightspeed_core_evaluation/rag_eval.py (5 hunks)
  • archive/lightspeed_core_evaluation/response_evaluation.py (9 hunks)
  • archive/lightspeed_core_evaluation/taxonomy_eval.py (3 hunks)
  • archive/lightspeed_core_evaluation/utils/plot.py (1 hunks)
  • archive/lightspeed_core_evaluation/utils/relevancy_score.py (1 hunks)
  • archive/lightspeed_core_evaluation/utils/response.py (2 hunks)
  • archive/lightspeed_core_evaluation/utils/score.py (2 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/agent_eval.py (1 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (3 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py (4 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (3 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (3 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/script_runner.py (1 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (4 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (2 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py (1 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_agent_goal_eval.py (10 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_eval_data.py (8 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (23 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_models.py (2 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_results.py (3 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py (5 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py (2 hunks)
  • lsc_agent_eval/tests/core/utils/test_api_client.py (7 hunks)
  • lsc_agent_eval/tests/core/utils/test_judge.py (2 hunks)
  • lsc_agent_eval/tests/core/utils/test_streaming_parser.py (1 hunks)
  • lsc_agent_eval/tests/test_agent_eval.py (2 hunks)
  • src/generate_answers/ls_response.py (2 hunks)
  • src/lightspeed_evaluation/core/llm/deepeval.py (1 hunks)
  • src/lightspeed_evaluation/core/llm/manager.py (4 hunks)
  • src/lightspeed_evaluation/core/llm/ragas.py (2 hunks)
  • src/lightspeed_evaluation/core/metrics/custom.py (2 hunks)
  • src/lightspeed_evaluation/core/metrics/deepeval.py (1 hunks)
  • src/lightspeed_evaluation/core/metrics/ragas.py (3 hunks)
  • src/lightspeed_evaluation/runner/evaluation.py (4 hunks)
✅ Files skipped from review due to trivial changes (29)
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py
  • lsc_agent_eval/tests/core/utils/test_streaming_parser.py
  • archive/lightspeed_core_evaluation/query_rag.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/script_runner.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py
  • archive/lightspeed_core_evaluation/utils/relevancy_score.py
  • archive/lightspeed_core_evaluation/rag_eval.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py
  • lsc_agent_eval/tests/core/utils/test_judge.py
  • lsc_agent_eval/src/lsc_agent_eval/agent_eval.py
  • archive/lightspeed_core_evaluation/utils/score.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py
  • src/lightspeed_evaluation/core/metrics/deepeval.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
  • lsc_agent_eval/tests/core/utils/test_api_client.py
  • src/generate_answers/ls_response.py
  • archive/lightspeed_core_evaluation/utils/response.py
  • src/lightspeed_evaluation/core/llm/deepeval.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_results.py
  • archive/lightspeed_core_evaluation/utils/plot.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py
  • archive/lightspeed_core_evaluation/response_evaluation.py
  • lsc_agent_eval/tests/test_agent_eval.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_eval_data.py
  • archive/lightspeed_core_evaluation/taxonomy_eval.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/lightspeed_evaluation/core/llm/manager.py
  • src/lightspeed_evaluation/core/metrics/ragas.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • src/lightspeed_evaluation/core/llm/ragas.py
  • src/lightspeed_evaluation/core/metrics/custom.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.

Applied to files:

  • lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py
🧬 Code graph analysis (4)
lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (1)
lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (1)
  • JudgeModelError (20-21)
lsc_agent_eval/tests/core/agent_goal_eval/test_agent_goal_eval.py (2)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py (1)
  • get_conversations (153-155)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/results.py (1)
  • save_results (25-46)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (1)
lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (1)
  • AgentAPIError (12-13)
lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py (1)
lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (1)
  • ScriptExecutionError (16-17)
🪛 GitHub Actions: Python linter
lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py

[warning] 39-39: Step 'make pylint' failed. Pylint W1404: Implicit string concatenation found in call (implicit-str-concat).

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: mypy
🔇 Additional comments (20)
lsc_agent_eval/tests/core/agent_goal_eval/test_agent_goal_eval.py (10)

80-107: Patch ordering and argument mapping are correct.

Bottom-up patch application maps to parameters as expected; constructor dependencies are verified cleanly.


108-133: Init without judge manager covered well.

Decorator order matches args; assertion that evaluator.judge_manager is None is appropriate.


134-159: Kubeconfig propagation validated.

ScriptRunner is called with the provided kubeconfig and wiring to EvaluationRunner is checked.


160-177: End-to-end success path wiring looks good.

Including ResultsManager in the patches and ordering it as the first arg (bottom-most decorator) is correct.


203-204: Result persistence assertion is precise.

Verifies correct directory argument; no change needed.


211-216: Inline patch context reads clearer.

Shortened with block maintains correctness and improves readability.


227-232: No-results path is explicit and accurate.

Exception type and message are asserted tightly.


241-248: Cleanup with client: target and assertion are right.

Patching at the module-under-test boundary and verifying close() is ideal.


262-269: Exception cleanup path exercised well.

Mocks set up to raise during close() and handled in _cleanup() as expected.


277-285: Logger patch verifies message and arg binding.

Capturing format string and exception text is robust; good guard against regressions.

lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py (5)

21-21: Signature compaction only; behavior unchanged.

Keeps decorator order and assertions intact.


94-94: Failure-path test signature compacted correctly.

Test intent and checks preserved.


119-119: Side-effect in one line — good.

SubprocessError path remains clearly exercised.


138-139: Raised-message match streamlined.

Assertion remains specific and stable.


154-155: Chmod error path assertion compact and correct.

Test continues to validate unexpected-error wrapping.

lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (2)

64-67: Good: explicit 'response' presence check

The added guard for the required "response" field tightens validation.


148-151: Clearer timeout message

Good consistency with the non-streaming path.

lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (3)

31-31: LGTM on log reflow

Single-line debug log is fine and keeps context intact.


120-121: LGTM on single-line warning log

Compact warning message reads well; keeps attempt count and exception context.


38-40: Fix implicit string concatenation (pylint W1404) — pipeline blocker
Adjacent string literals in the exception message trigger W1404. Merge into a single literal:

-                raise JudgeModelError(
-                    "OPENAI_API_KEY environment variable is required " "for OpenAI provider"
-                )
+                raise JudgeModelError("OPENAI_API_KEY environment variable is required for OpenAI provider")

Run a direct lint check to confirm W1404 is resolved:

pylint --disable=all --enable=W1404 lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (4)
lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (4)

259-259: Same conversation_id fallback nuance as above.

Default won’t apply if value is None; consider ...get("conversation_id") or "test-conversation-id".


289-289: Same conversation_id fallback nuance as above.


349-349: Same conversation_id fallback nuance as above.


387-387: Same conversation_id fallback nuance as above.

🧹 Nitpick comments (2)
lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (2)

32-32: Use or to fall back when conversation_id is present but None.

dict.get(key, default) won’t use the default if the key exists with value None. If run_evaluation passes conversation_id=None, this will propagate None instead of a generated ID.

Apply:

-                "conversation_id": api_input.get("conversation_id", "generated-conversation-id"),
+                "conversation_id": api_input.get("conversation_id") or "generated-conversation-id",
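The difference in isolation, since it is easy to miss: dict.get only uses its default when the key is absent, not when the value is None.

    api_input = {"conversation_id": None}  # key present, value None

    assert api_input.get("conversation_id", "generated-id") is None  # default ignored
    assert (api_input.get("conversation_id") or "generated-id") == "generated-id"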

679-679: Add a negative test for non-streaming query missing "response".

The PR notes a validation change in query_agent; consider a test where query_agent raises on missing "response" to lock behavior.

Example:

+    def test_query_mode_missing_response_errors(self, mock_agent_client, mock_script_runner, sample_config_judge_llm):
+        # Simulate query endpoint returning payload without "response"
+        mock_agent_client.query_agent.side_effect = AgentAPIError("Missing 'response' field")
+        runner = EvaluationRunner(mock_agent_client, mock_script_runner)
+        results = runner.run_evaluation(
+            sample_config_judge_llm, "watsonx", "ibm/granite-3-3-8b-instruct", "conv-id-123", endpoint_type="query"
+        )
+        assert len(results) == 1 and results[0].result == "ERROR"
+        assert "Missing 'response' field" in results[0].error
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 4c450a9 and abacc4a.

📒 Files selected for processing (38)
  • archive/lightspeed_core_evaluation/query_rag.py (2 hunks)
  • archive/lightspeed_core_evaluation/rag_eval.py (5 hunks)
  • archive/lightspeed_core_evaluation/response_evaluation.py (9 hunks)
  • archive/lightspeed_core_evaluation/taxonomy_eval.py (3 hunks)
  • archive/lightspeed_core_evaluation/utils/plot.py (1 hunks)
  • archive/lightspeed_core_evaluation/utils/relevancy_score.py (1 hunks)
  • archive/lightspeed_core_evaluation/utils/response.py (2 hunks)
  • archive/lightspeed_core_evaluation/utils/score.py (2 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/agent_eval.py (1 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (3 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py (4 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (3 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (3 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/script_runner.py (1 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (4 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (2 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py (1 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_agent_goal_eval.py (10 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_eval_data.py (8 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (23 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_models.py (2 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_results.py (3 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py (5 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py (2 hunks)
  • lsc_agent_eval/tests/core/utils/test_api_client.py (7 hunks)
  • lsc_agent_eval/tests/core/utils/test_judge.py (2 hunks)
  • lsc_agent_eval/tests/core/utils/test_streaming_parser.py (1 hunks)
  • lsc_agent_eval/tests/test_agent_eval.py (2 hunks)
  • src/generate_answers/ls_response.py (2 hunks)
  • src/lightspeed_evaluation/core/llm/deepeval.py (1 hunks)
  • src/lightspeed_evaluation/core/llm/manager.py (4 hunks)
  • src/lightspeed_evaluation/core/llm/ragas.py (2 hunks)
  • src/lightspeed_evaluation/core/metrics/custom.py (2 hunks)
  • src/lightspeed_evaluation/core/metrics/deepeval.py (1 hunks)
  • src/lightspeed_evaluation/core/metrics/ragas.py (3 hunks)
  • src/lightspeed_evaluation/runner/evaluation.py (4 hunks)
✅ Files skipped from review due to trivial changes (12)
  • lsc_agent_eval/tests/core/utils/test_streaming_parser.py
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py
  • archive/lightspeed_core_evaluation/utils/response.py
  • archive/lightspeed_core_evaluation/query_rag.py
  • archive/lightspeed_core_evaluation/utils/score.py
  • src/generate_answers/ls_response.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py
  • archive/lightspeed_core_evaluation/taxonomy_eval.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py
  • archive/lightspeed_core_evaluation/utils/relevancy_score.py
  • src/lightspeed_evaluation/core/metrics/deepeval.py
🚧 Files skipped from review as they are similar to previous changes (25)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/script_runner.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_eval_data.py
  • archive/lightspeed_core_evaluation/utils/plot.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
  • lsc_agent_eval/tests/core/utils/test_judge.py
  • lsc_agent_eval/src/lsc_agent_eval/agent_eval.py
  • src/lightspeed_evaluation/core/metrics/custom.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_results.py
  • archive/lightspeed_core_evaluation/response_evaluation.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • src/lightspeed_evaluation/core/llm/deepeval.py
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py
  • archive/lightspeed_core_evaluation/rag_eval.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py
  • src/lightspeed_evaluation/core/metrics/ragas.py
  • src/lightspeed_evaluation/core/llm/ragas.py
  • lsc_agent_eval/tests/core/utils/test_api_client.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_agent_goal_eval.py
  • lsc_agent_eval/tests/test_agent_eval.py
  • src/lightspeed_evaluation/core/llm/manager.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message

Applied to files:

  • lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py
🧬 Code graph analysis (1)
lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (4)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (2)
  • EvaluationRunner (20-201)
  • run_evaluation (34-106)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/script_runner.py (1)
  • run_script (28-72)
lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (2)
  • ScriptExecutionError (16-17)
  • AgentAPIError (12-13)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (1)
  • streaming_query_agent (106-155)
🔇 Additional comments (21)
lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (21)

126-126: LGTM: formatting-only change.

No functional differences.


172-172: LGTM: formatting-only change.

No functional differences.


236-236: LGTM: explicit ScriptExecutionError path is correctly exercised.

Aligns with FAIL vs ERROR semantics from prior learnings.


315-315: LGTM: agent API error path is covered.

Verifies ERROR result with message as intended.


336-336: LGTM: runner construction condensed.


367-367: LGTM: side_effect override is clear and scoped.


379-379: LGTM: side_effect override for no-keywords case.


407-411: Confirm tool_calls shape contract (list[list[call]]).

Tests assume a nested list-of-lists; ensure compare_tool_calls and result serialization use the same shape to avoid brittle mismatches.
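For clarity, the nested shape the tests assume, shown as an illustrative value (the dict keys are examples only, not the serialized schema):

    # One inner list of tool calls per conversation turn.
    tool_calls = [
        [{"name": "get_pods", "arguments": {"namespace": "default"}}],  # turn 1
        [],                                                             # turn 2: no tools used
    ]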


420-420: LGTM: runner construction.


426-426: LGTM: asserts include tool_calls echoing actuals.


442-446: LGTM: failure path and mismatched tool calls covered.


455-455: LGTM: runner construction.


471-471: LGTM: runner construction.


508-519: LGTM: comprehensive happy-path fixture for multi-eval.

Covers script, substring (case-insensitive), judge, and tools.


545-545: LGTM: runner construction.


579-583: LGTM: mixed-results fixture wired correctly.


605-605: LGTM: runner construction.


635-639: LGTM: error-focused fixture keeps tool_calls empty.


641-641: LGTM: script error surfaced as ERROR, consistent with semantics.


657-657: LGTM: runner construction.


714-714: LGTM: invalid endpoint test is concise and accurate.

Asserts ERROR and no agent calls—good guard.

@asamal4 (Collaborator, Author) commented Sep 4, 2025

@tisnik @VladimirKadlec @Anxhela21 PTAL

@tisnik tisnik left a comment

LGTM!

@tisnik tisnik merged commit 0913bea into lightspeed-core:main on Sep 4, 2025
15 checks passed