Add client for query endpoint #43
Conversation
Signed-off-by: Anxhela Coba <acoba@redhat.com>
Walkthrough

Adds a Lightspeed Stack client module with config-driven initialization and API calls to /v1/query and /v1/info. Updates system configuration to include a lightspeed_stack block and new environment variables. Introduces a custom metric declaration. Minor formatting change to a metrics file.

Sequence Diagram(s)

sequenceDiagram
autonumber
actor Caller
participant Client as LightspeedStackClient
participant API as Lightspeed Stack API
Caller->>Client: query(query, conversation_id?, attachments?)
Note right of Client: Build payload: query, provider, model, no_tools<br/>+ optional system_prompt, conversation_id, attachments
Client->>API: POST /v1/query (JSON, timeout)
API-->>Client: 200 JSON response
Client-->>Caller: Return parsed JSON
alt Error status
API-->>Client: 4xx/5xx
Client-->>Caller: raise_for_status() exception
end
sequenceDiagram
autonumber
actor App
participant Config as system_config
participant Client as LightspeedStackClient
App->>Client: from_system_config(system_config)
Client->>Config: Read lightspeed_stack section
Note right of Client: Apply defaults if keys missing<br/>Trim base_url whitespace/slash
Client-->>App: LightspeedStackClient instance
App->>Client: get_info()
Client->>+Config: (uses stored config)
Client-->>App: API /v1/info JSON
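For orientation only, here is a minimal sketch of what a client matching the walkthrough and diagrams above could look like. It assumes a `lightspeed_stack` config block with `base_url`, `provider`, `model`, `system_prompt`, `no_tools`, and `timeout` keys and uses `requests`; names and defaults are illustrative, not the code added in this PR.

```python
# Illustrative sketch only - field names and defaults are assumptions based on
# the walkthrough above, not the actual module added in this PR.
from typing import Any, Dict, List, Optional
import requests


class LightspeedStackClient:
    def __init__(self, base_url: str, provider: str, model: str,
                 system_prompt: Optional[str] = None,
                 no_tools: bool = False, timeout: int = 300):
        self.base_url = base_url.strip().rstrip("/")
        self.provider = provider
        self.model = model
        self.system_prompt = system_prompt
        self.no_tools = no_tools
        self.timeout = timeout
        self.session = requests.Session()
        self.session.headers.update({"Content-Type": "application/json"})

    @classmethod
    def from_system_config(cls, system_config: Dict[str, Any]) -> "LightspeedStackClient":
        # Read the lightspeed_stack block, falling back to defaults for missing keys.
        cfg = system_config.get("lightspeed_stack", {})
        return cls(
            base_url=cfg.get("base_url", "http://localhost:8080"),
            provider=cfg.get("provider", "openai"),
            model=cfg.get("model", "gpt-4o-mini"),
            system_prompt=cfg.get("system_prompt"),
            no_tools=cfg.get("no_tools", False),
            timeout=cfg.get("timeout", 300),
        )

    def query(self, query: str, conversation_id: Optional[str] = None,
              attachments: Optional[List[Dict[str, Any]]] = None) -> Dict[str, Any]:
        payload: Dict[str, Any] = {
            "query": query,
            "provider": self.provider,
            "model": self.model,
            "no_tools": self.no_tools,
        }
        if self.system_prompt:
            payload["system_prompt"] = self.system_prompt
        if conversation_id:
            payload["conversation_id"] = conversation_id
        if attachments:
            payload["attachments"] = attachments
        resp = self.session.post(f"{self.base_url}/v1/query", json=payload, timeout=self.timeout)
        resp.raise_for_status()  # 4xx/5xx surfaces as an exception, as in the diagram
        return resp.json()

    def get_info(self) -> Dict[str, Any]:
        resp = self.session.get(f"{self.base_url}/v1/info", timeout=self.timeout)
        resp.raise_for_status()
        return resp.json()
```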
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (1)
322-327: Initialize and cache the RAG manager from system config

Create the RAG manager once here and attach it to the CustomMetrics instance for reuse by `_evaluate_rag_response`.

```diff
 @classmethod
 def from_system_config(cls, system_config: Dict[str, Any]) -> "CustomMetrics":
     """Create CustomMetrics from system configuration."""
-    llm_manager = LLMManager.from_system_config(system_config)
-    return cls(llm_manager)
+    llm_manager = LLMManager.from_system_config(system_config)
+    inst = cls(llm_manager)
+    try:
+        from ..llm_managers.lightspeed_rag_llm import LightspeedRAGManager
+        inst.rag_manager = LightspeedRAGManager.from_system_config(system_config)
+    except ImportError:
+        inst.rag_manager = None
+    return inst
```
🧹 Nitpick comments (11)
lsc_eval/runner.py (2)
70-70: Double-check include_graphs source

Runner reads include_graphs from system_config.include_graphs, while config shows it under output.include_graphs. Confirm the loader flattens this, or pass system_config.output.include_graphs.
84-85: Improve actionable error hint

Consider appending the reports directory to the error reminder for quicker triage.

```diff
-    print(f"⚠️ {summary['ERROR']} evaluations had errors - check detailed report")
+    print(f"⚠️ {summary['ERROR']} evaluations had errors - check detailed report in: {output_handler.output_dir}")
```

lsc_eval/config/evaluation_data.yaml (1)
48-48: Trim trailing spaces flagged by yamllint

Remove trailing whitespace on these lines to satisfy CI linters.
Also applies to: 51-51, 55-55, 66-66, 69-69, 85-85
lsc_eval/config/system.yaml (2)
12-20: Lightspeed RAG config block added—OK

Block is reasonable. Optional: if RAG should default to the primary llm.* values, document/implement fallback when provider/model not set here to avoid duplication.
98-98: Trailing space

One trailing space flagged by yamllint. Please trim.
lsc_eval/README.md (1)
5-5: Fix MD019: extra spaces after ATX headings

Remove double spaces after "##".

```diff
-##  Key Features
+## Key Features

-##  Quick Start
+## Quick Start
```

Also applies to: 15-15
lsc_eval/src/lsc_eval/llm_managers/lightspeed_rag_llm.py (5)
34-39: Avoid per-call client/session creation; pass system_prompt override to query.

Creating a new client each request is unnecessary and leaks idle sockets. Let query accept a per-call system_prompt.

```diff
@@
-    def query(
-        self,
-        query: str,
-        conversation_id: Optional[str] = None,
-        attachments: Optional[List[Dict[str, Any]]] = None,
-    ) -> Dict[str, Any]:
+    def query(
+        self,
+        query: str,
+        conversation_id: Optional[str] = None,
+        attachments: Optional[List[Dict[str, Any]]] = None,
+        system_prompt: Optional[str] = None,
+    ) -> Dict[str, Any]:
@@
-        if self.config.system_prompt:
-            payload["system_prompt"] = self.config.system_prompt
+        if system_prompt is not None:
+            payload["system_prompt"] = system_prompt
+        elif self.config.system_prompt:
+            payload["system_prompt"] = self.config.system_prompt
@@
-        # Create a temporary config for this query if system prompt override is needed
-        if effective_system_prompt != self.rag_config.system_prompt:
-            temp_config = LightspeedRAGConfig(
-                base_url=self.rag_config.base_url,
-                provider=self.rag_config.provider,
-                model=self.rag_config.model,
-                system_prompt=effective_system_prompt,
-                no_tools=self.rag_config.no_tools,
-                timeout=self.rag_config.timeout
-            )
-            temp_client = LightspeedRAGClient(temp_config)
-            result = temp_client.query(
-                query=query,
-                conversation_id=conversation_id,
-                attachments=attachments
-            )
-        else:
-            result = self.rag_client.query(
-                query=query,
-                conversation_id=conversation_id,
-                attachments=attachments
-            )
+        result = self.rag_client.query(
+            query=query,
+            conversation_id=conversation_id,
+            attachments=attachments,
+            system_prompt=effective_system_prompt,
+        )
```

Also applies to: 60-63, 134-156
31-33: Add retry/backoff to the HTTP session.

Resilience: transient 5xx/429 should retry with backoff.

```diff
 from dataclasses import dataclass
+from requests.adapters import HTTPAdapter
+from urllib3.util.retry import Retry
@@
         self.session = requests.Session()
         self.session.headers.update({'Content-Type': 'application/json'})
+        retries = Retry(
+            total=3,
+            backoff_factor=0.5,
+            status_forcelist=[429, 500, 502, 503, 504],
+            allowed_methods=["GET", "POST"],
+        )
+        adapter = HTTPAdapter(max_retries=retries)
+        self.session.mount("http://", adapter)
+        self.session.mount("https://", adapter)
```
223-227: Avoid hard dependency on LLMManager env (e.g., OPENAI_API_KEY).

Wrapper unconditionally constructs LLMManager, which can fail in RAG-only runs. Make it optional and degrade gracefully.

```diff
-class LightspeedRAGLLMWrapper:
+class LightspeedRAGLLMWrapper:
@@
-    def __init__(self, rag_manager: LightspeedRAGManager, llm_manager: LLMManager):
+    def __init__(self, rag_manager: LightspeedRAGManager, llm_manager: Optional[LLMManager]):
@@
-        llm_manager = LLMManager.from_system_config(system_config)
-        return cls(rag_manager, llm_manager)
+        try:
+            llm_manager = LLMManager.from_system_config(system_config)
+        except Exception as e:
+            print(f"⚠️ Skipping LLMManager init for wrapper: {e}")
+            llm_manager = None
+        return cls(rag_manager, llm_manager)
```

And guard getters:

```diff
-        return self.llm_manager.get_litellm_params()
+        return self.llm_manager.get_litellm_params() if self.llm_manager else {}
@@
-        return self.llm_manager.get_config()
+        return self.llm_manager.get_config() if self.llm_manager else LLMConfig(provider="lightspeed_rag", model=self.rag_manager.rag_config.model)
```

Also applies to: 255-260
228-235: Align model reporting with LiteLLM params.

get_model_name returns a synthetic name, but get_litellm_params returns the base LLM model; this may confuse reporting.

```diff
-        return f"lightspeed_rag/{self.rag_manager.rag_config.model}"
+        return self.llm_manager.get_model_name() if self.llm_manager else f"lightspeed_rag/{self.rag_manager.rag_config.model}"
```
93-99: Use logging instead of print for connectivity check.

Prefer structured logging with levels.

```diff
-import print
+import logging
@@
-            print(f"Connected to Lightspeed RAG: {info.get('name', 'Unknown')} v{info.get('version', 'Unknown')}")
+            logging.getLogger(__name__).info(
+                "Connected to Lightspeed RAG: %s v%s",
+                info.get("name", "Unknown"),
+                info.get("version", "Unknown"),
+            )
@@
-            print(f"Warning: Could not connect to Lightspeed RAG service: {e}")
+            logging.getLogger(__name__).warning(
+                "Could not connect to Lightspeed RAG service: %s", e
+            )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- lsc_eval/README.md (3 hunks)
- lsc_eval/config/evaluation_data.yaml (2 hunks)
- lsc_eval/config/system.yaml (2 hunks)
- lsc_eval/runner.py (3 hunks)
- lsc_eval/src/lsc_eval/llm_managers/lightspeed_rag_llm.py (1 hunks)
- lsc_eval/src/lsc_eval/metrics/custom_metrics.py (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Applied to files:
lsc_eval/README.md
🧬 Code graph analysis (3)
lsc_eval/runner.py (1)
lsc_eval/src/lsc_eval/output/output_handler.py (1)
generate_reports(32-80)
lsc_eval/src/lsc_eval/llm_managers/lightspeed_rag_llm.py (3)
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (1)
LLMManager (36-173)

lsc_eval/src/lsc_eval/core/models.py (1)

TurnData (8-44)

lsc_eval/src/lsc_eval/metrics/custom_metrics.py (1)
from_system_config(323-326)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (2)
lsc_eval/src/lsc_eval/core/models.py (1)
TurnData (8-44)

lsc_eval/src/lsc_eval/llm_managers/lightspeed_rag_llm.py (4)

LightspeedRAGManager (80-212)
LightspeedRAGConfig (13-21)
generate_response (100-157)
query (34-71)
🪛 markdownlint-cli2 (0.17.2)
lsc_eval/README.md
5-5: Multiple spaces after hash on atx style heading
(MD019, no-multiple-space-atx)
15-15: Multiple spaces after hash on atx style heading
(MD019, no-multiple-space-atx)
🪛 LanguageTool
lsc_eval/README.md
[grammar] ~39-~39: There might be a mistake here.
Context: ...ation_data.yaml ``` ### Troubleshooting If you encounter `ModuleNotFoundError: N...
(QB_NEW_EN)
[grammar] ~40-~40: There might be a mistake here.
Context: ...No module named 'lsc_eval', ensure you: 1. Are running from the lsc_eval/` directo...
(QB_NEW_EN)
[grammar] ~124-~124: There might be a mistake here.
Context: ....." ``` ## RAG Evaluation ### Overview The evaluation framework includes integr...
(QB_NEW_EN)
[grammar] ~133-~133: There might be a mistake here.
Context: ...l metrics ### Setup #### Prerequisites 1. Lightspeed Stack RAG Service: Must be ...
(QB_NEW_EN)
[grammar] ~134-~134: There might be a mistake here.
Context: ...Stack RAG Service**: Must be running at http://localhost:8080 2. Environment Variables: Set `OPENAI_API...
(QB_NEW_EN)
[grammar] ~138-~138: There might be a mistake here.
Context: ...ckages are installed #### Configuration Add RAG configuration to your `system.ya...
(QB_NEW_EN)
[grammar] ~163-~163: There might be a mistake here.
Context: ...om" ``` ### Usage #### Evaluation Data Add RAG metrics to your `evaluation_data...
(QB_NEW_EN)
[grammar] ~219-~219: There might be a mistake here.
Context: ... Troubleshooting Connection Issues: - Ensure Lightspeed Stack RAG service is r...
(QB_NEW_EN)
[grammar] ~224-~224: There might be a mistake here.
Context: ...080/v1/info **Authentication Errors**: - Set appropriate API keys (OPENAI_API_KE...
(QB_NEW_EN)
🪛 YAMLlint (1.37.1)
lsc_eval/config/system.yaml
[error] 98-98: trailing spaces
(trailing-spaces)
lsc_eval/config/evaluation_data.yaml
[error] 48-48: trailing spaces
(trailing-spaces)
[error] 51-51: trailing spaces
(trailing-spaces)
[error] 55-55: trailing spaces
(trailing-spaces)
[error] 66-66: trailing spaces
(trailing-spaces)
[error] 69-69: trailing spaces
(trailing-spaces)
[error] 85-85: trailing spaces
(trailing-spaces)
🔇 Additional comments (7)
lsc_eval/runner.py (1)
51-53: Verify attribute names exist on system_config

Make sure system_config exposes llm_provider and llm_model (or adjust to flattened/nested names). If they're nested (e.g., system_config.llm.provider), this print will raise AttributeError.
lsc_eval/config/evaluation_data.yaml (3)
47-63: conv_group_3 addition looks good

Turn structure and metadata align with TurnData schema and the custom:answer_correctness metric.
67-69: Ensure metric identifier wiring for custom:rag_response_evaluation

Confirm the engine maps "custom:rag_response_evaluation" to CustomMetrics metric_name "rag_response_evaluation" (i.e., prefix stripping is in place), or adapt the mapping accordingly.
79-92: RAG turns read cleanly

Queries/contexts/expected_response are coherent for evaluating RAG quality.
lsc_eval/config/system.yaml (1)
92-104: New custom metric metadata registered—OK

Registry entry for custom:rag_response_evaluation matches the metric introduced in code.
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (1)
42-45: Confirm metric name resolution

Engine must pass "rag_response_evaluation" (without "custom:" prefix) into CustomMetrics.evaluate to hit this mapping. If not, normalize before lookup.
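If that normalization is not already in place, a small sketch of the prefix handling (hypothetical helper, assuming identifiers arrive as `framework:metric_name`):

```python
def split_metric_identifier(identifier: str) -> tuple[str, str]:
    """Split 'custom:rag_response_evaluation' into ('custom', 'rag_response_evaluation')."""
    framework, _, metric_name = identifier.partition(":")
    if not metric_name:
        raise ValueError(f"Metric '{identifier}' must be in 'framework:metric_name' format")
    return framework, metric_name


# The engine could then dispatch on framework and pass only metric_name
# (e.g., "rag_response_evaluation") into CustomMetrics.evaluate.
```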
lsc_eval/src/lsc_eval/llm_managers/lightspeed_rag_llm.py (1)
120-129: Verify Lightspeed RAG /v1/query attachments schema

We didn't find any local definition or tests for the attachments payload—please confirm against the internal API spec that the request body supports an `attachments` array with keys `"attachment_type"`, `"content"`, and `"content_type"`, and that `"configuration"` and `"text/plain"` are valid values. Applies to lines 58–64 and 120–129.
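For reference, the request body under discussion would look roughly like the sketch below; the key names mirror the snippet being reviewed, and whether the service actually accepts this shape is exactly what needs confirming against the API spec.

```python
# Hypothetical /v1/query payload reconstructed from the code under review;
# not a confirmed Lightspeed Stack API schema.
payload = {
    "query": "How do I configure the cluster?",
    "provider": "openai",
    "model": "gpt-4o-mini",
    "no_tools": False,
    "attachments": [
        {
            "attachment_type": "configuration",  # is "configuration" a valid type?
            "content": "retrieved context text ...",
            "content_type": "text/plain",        # is "text/plain" accepted?
        }
    ],
}
```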
```python
        contexts = None
        if turn_data.contexts:
            contexts = [ctx.content for ctx in turn_data.contexts]
```
Fix TurnData context access (dict vs. attribute).
TurnData.contexts elements are dicts with "content" (per models.py), not objects.
```diff
-        contexts = None
-        if turn_data.contexts:
-            contexts = [ctx.content for ctx in turn_data.contexts]
+        contexts = [ctx["content"] for ctx in turn_data.contexts] if turn_data.contexts else None
```
🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/llm_managers/lightspeed_rag_llm.py around lines 176 to
179, the code treats turn_data.contexts items as objects with a .content
attribute but per models.py they are dicts with a "content" key; update the
access to use ctx["content"] (and handle missing keys safely if needed), e.g.,
build contexts = [ctx["content"] for ctx in turn_data.contexts] only when
turn_data.contexts is truthy, or use ctx.get("content") if you prefer tolerant
behavior.
```python
    def _evaluate_rag_response(
        self,
        _conv_data: Any,
        _turn_idx: Optional[int],
        turn_data: Optional[TurnData],
        is_conversation: bool,
    ) -> Tuple[Optional[float], str]:
        """Evaluate a RAG-generated response against the original response."""
        if is_conversation:
            return None, "RAG response evaluation is a turn-level metric"

        if turn_data is None:
            return None, "TurnData is required for RAG response evaluation"

        try:
            # Try to import and use the RAG manager
            from ..llm_managers.lightspeed_rag_llm import LightspeedRAGManager

            # Create a RAG manager with default config (should be configurable)
            from ..llm_managers.lightspeed_rag_llm import LightspeedRAGConfig
            rag_config = LightspeedRAGConfig()
            rag_manager = LightspeedRAGManager(rag_config)

            # Generate RAG response
            contexts = None
            if turn_data.contexts:
                contexts = [ctx['content'] if isinstance(ctx, dict) else ctx.content for ctx in turn_data.contexts]

            rag_response, _ = rag_manager.generate_response(
                query=turn_data.query,
                contexts=contexts
            )

            # Print the RAG response for debugging/visibility
            print(f"RAG Response: {rag_response}")

            # Compare RAG response with the original response
            params = EvaluationPromptParams(
                metric_name="RAG response quality",
                query=turn_data.query,
                response=rag_response,
                expected_response=turn_data.response,  # Use original response as expected
                contexts=turn_data.contexts if turn_data.contexts else None,
                scale="0.0 to 1.0",
            )
            prompt = self._create_evaluation_prompt(params)

            # Add specific instructions for RAG evaluation
            prompt += "\n\nYou are comparing a RAG-generated response against an original response.\n"
            prompt += "Consider:\n"
            prompt += "- Factual accuracy and relevance\n"
            prompt += "- Use of provided context\n"
            prompt += "- Completeness compared to original response\n"
            prompt += "- Overall quality and helpfulness\n"
            prompt += f"\nRAG Response: {rag_response}\n"
            prompt += f"Original Response (Expected): {turn_data.response}"

            # Make LLM call and parse response
            llm_response = self._call_llm(prompt)
            score, reason = self._parse_score_response(llm_response)

            if score is None:
                return (
                    None,
                    f"Could not parse score from LLM response: {llm_response[:100]}...",
                )

            return score, f"RAG response evaluation: {score:.2f} - {reason}"

        except ImportError:
            return None, "LightspeedRAGManager not available"
        except Exception as e:
            return None, f"RAG response evaluation failed: {str(e)}"
```
🛠️ Refactor suggestion
Don’t instantiate the RAG client per turn; use configured manager once
Current code constructs a new LightspeedRAGManager with defaults on every evaluation, ignoring system.yaml and adding overhead. Reuse a manager created from system config and fail fast if not configured.
Apply this diff to reuse a pre-configured manager and remove the per-call import/instantiation:
def _evaluate_rag_response(
self,
_conv_data: Any,
_turn_idx: Optional[int],
turn_data: Optional[TurnData],
is_conversation: bool,
) -> Tuple[Optional[float], str]:
"""Evaluate a RAG-generated response against the original response."""
if is_conversation:
return None, "RAG response evaluation is a turn-level metric"
if turn_data is None:
return None, "TurnData is required for RAG response evaluation"
- try:
- # Try to import and use the RAG manager
- from ..llm_managers.lightspeed_rag_llm import LightspeedRAGManager
-
- # Create a RAG manager with default config (should be configurable)
- from ..llm_managers.lightspeed_rag_llm import LightspeedRAGConfig
- rag_config = LightspeedRAGConfig()
- rag_manager = LightspeedRAGManager(rag_config)
-
- # Generate RAG response
- contexts = None
- if turn_data.contexts:
- contexts = [ctx['content'] if isinstance(ctx, dict) else ctx.content for ctx in turn_data.contexts]
-
- rag_response, _ = rag_manager.generate_response(
- query=turn_data.query,
- contexts=contexts
- )
-
- # Print the RAG response for debugging/visibility
- print(f"RAG Response: {rag_response}")
-
+ try:
+ # Reuse a pre-configured RAG manager (set in from_system_config)
+ rag_manager = getattr(self, "rag_manager", None)
+ if rag_manager is None:
+ return None, "LightspeedRAGManager not configured"
+
+ # Generate RAG response
+ contexts_list = (
+ [ctx["content"] if isinstance(ctx, dict) else str(ctx) for ctx in (turn_data.contexts or [])]
+ or None
+ )
+ rag_response, _ = rag_manager.generate_response(
+ query=turn_data.query,
+ contexts=contexts_list,
+ )
+
# Compare RAG response with the original response
params = EvaluationPromptParams(
metric_name="RAG response quality",
query=turn_data.query,
response=rag_response,
expected_response=turn_data.response, # Use original response as expected
contexts=turn_data.contexts if turn_data.contexts else None,
scale="0.0 to 1.0",
)
prompt = self._create_evaluation_prompt(params)
# Add specific instructions for RAG evaluation
prompt += "\n\nYou are comparing a RAG-generated response against an original response.\n"
prompt += "Consider:\n"
prompt += "- Factual accuracy and relevance\n"
prompt += "- Use of provided context\n"
prompt += "- Completeness compared to original response\n"
prompt += "- Overall quality and helpfulness\n"
- prompt += f"\nRAG Response: {rag_response}\n"
- prompt += f"Original Response (Expected): {turn_data.response}"
+ prompt += f"\nRAG Response: {rag_response}\n"
+ prompt += f"Original Response (Expected): {turn_data.response}"
# Make LLM call and parse response
llm_response = self._call_llm(prompt)
score, reason = self._parse_score_response(llm_response)
if score is None:
return (
None,
f"Could not parse score from LLM response: {llm_response[:100]}...",
)
return score, f"RAG response evaluation: {score:.2f} - {reason}"
-
- except ImportError:
- return None, "LightspeedRAGManager not available"
     except Exception as e:
         return None, f"RAG response evaluation failed: {str(e)}"
🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/metrics/custom_metrics.py around lines 248-321, the
code currently imports and creates a new LightspeedRAGManager per turn which
ignores system.yaml and is costly; replace that with the pre-configured manager:
remove the in-method imports and instantiation, instead reference the manager
configured on the evaluator (e.g., self.rag_manager or self._rag_manager) and
fail fast if it's not present (return an appropriate error message), reuse that
manager to call generate_response with the same context extraction logic, and
keep the rest of the evaluation flow unchanged.
asamal4
left a comment
Perhaps I misunderstood LCORE-601, but I thought we were collecting context (through vector DBs), whereas this implementation collects the response.
FYI, I am working on lightspeed-stack API integration (with some additional flexibilities), so this is helpful either way.
|
|
||
| return score, f"Custom answer correctness: {score:.2f} - {reason}" | ||
|
|
||
| def _evaluate_rag_response( |
Do we need to add another custom metric? We can utilize any of the metrics from RAGAS.
The current custom metric implementation is just for reference, in case we really need to add a different metric.
Discussed in huddle. We assumed the RAG context already existed in the lightspeed-stack response. We will need to implement a method of storing RAG context in lightspeed-stack.
No, we do not need a custom metric here. This was only to differentiate between using RAG or not, but we can change the logic to utilize the existing metrics.
Signed-off-by: Anxhela Coba <acoba@redhat.com>
Actionable comments posted: 17
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py (1)
129-163: Regex tests are solid; add a list-arg negative case to prevent regressions.

Current implementation could previously pass on list mismatches. Add these for coverage:

```python
def test_list_argument_mismatch_should_fail(self):
    expected = [[{"tool_name": "oc_get", "arguments": {"oc_get_args": ["namespaces"]}}]]
    actual = [[{"tool_name": "oc_get", "arguments": {"oc_get_args": ["nodes"]}}]]
    assert compare_tool_calls(expected, actual) is False

def test_numeric_string_equivalence(self):
    expected = [[{"tool_name": "create_pod", "arguments": {"port": 80}}]]
    actual = [[{"tool_name": "create_pod", "arguments": {"port": "80"}}]]
    assert compare_tool_calls(expected, actual) is True
```

Optionally, add a test to lock in the "extra keys fail" behavior if that's desired.
I can push a follow-up commit with these tests once you confirm desired strictness for extra keys.
config/system.yaml (1)
126-148: Output block schema doesn't match OutputConfig

YAML uses base_directory and nested formats, but OutputConfig expects output_dir and csv_format/json_format/txt_format. This will drop settings unless the loader translates keys.
Option A: align model to YAML with aliases and a post-parse normalizer:
-from pydantic import BaseModel, Field, field_validator +from pydantic import BaseModel, Field, field_validator, ConfigDict, model_validator @@ class OutputConfig(BaseModel): """Output configuration.""" - output_dir: str = "./eval_output" + model_config = ConfigDict(extra="ignore") + output_dir: str = Field("./eval_output", validation_alias="base_directory") base_filename: str = "evaluation" - csv_format: bool = True - json_format: bool = True - txt_format: bool = True + csv_format: bool = True + json_format: bool = True + txt_format: bool = True include_graphs: bool = True csv_columns: List[str] = Field(default_factory=list) - visualization_figsize: List[int] = Field(default=[12, 8]) + visualization_figsize: List[int] = Field(default_factory=lambda: [12, 8]) visualization_dpi: int = 300 default_turn_metrics_metadata: Optional[Dict[str, Any]] = None default_conversation_metrics_metadata: Optional[Dict[str, Any]] = None + + # Accept optional nested 'formats' block from YAML + formats: Optional[Dict[str, bool]] = None + + @model_validator(mode="after") + def _apply_formats(self): + if self.formats: + self.csv_format = bool(self.formats.get("csv", self.csv_format)) + self.json_format = bool(self.formats.get("json", self.json_format)) + self.txt_format = bool(self.formats.get("txt", self.txt_format)) + return selfOption B: change YAML keys to match OutputConfig (rename base_directory -> output_dir and move formats.csv/json/txt to top-level csv_format/json_format/txt_format).
src/lightspeed_evaluation/core/config/models.py (2)
205-216: EvaluationSystemConfig missing keys present in YAML

environment, visualization, and lightspeed_rag exist in YAML but aren't modeled here; extra keys may be ignored or raise depending on loader. Also, llm is a Dict despite having LLMConfig defined.
-class EvaluationSystemConfig(BaseModel): +class EvaluationSystemConfig(BaseModel): """System configuration for evaluation.""" - llm: Dict[str, Any] + model_config = ConfigDict(extra="ignore") + llm: LLMConfig | Dict[str, Any] logging: Dict[str, Any] = Field(default_factory=dict) metrics_metadata: Dict[str, Any] = Field(default_factory=dict) output: Dict[str, Any] = Field(default_factory=dict) + lightspeed_rag: Optional["LightspeedRAGConfig"] = None + environment: Dict[str, Any] = Field(default_factory=dict) + visualization: Dict[str, Any] = Field(default_factory=dict)
218-232: OutputConfig: align with YAML and avoid mutable default

Map base_directory/formats to existing fields and switch to default_factory for the figsize list.
-from pydantic import BaseModel, Field, field_validator +from pydantic import BaseModel, Field, field_validator, ConfigDict, model_validator @@ class OutputConfig(BaseModel): """Output configuration.""" - output_dir: str = "./eval_output" + model_config = ConfigDict(extra="ignore") + output_dir: str = Field("./eval_output", validation_alias="base_directory") base_filename: str = "evaluation" csv_format: bool = True json_format: bool = True txt_format: bool = True include_graphs: bool = True csv_columns: List[str] = Field(default_factory=list) - visualization_figsize: List[int] = Field(default=[12, 8]) + visualization_figsize: List[int] = Field(default_factory=lambda: [12, 8]) visualization_dpi: int = 300 default_turn_metrics_metadata: Optional[Dict[str, Any]] = None default_conversation_metrics_metadata: Optional[Dict[str, Any]] = None + formats: Optional[Dict[str, bool]] = None + + @model_validator(mode="after") + def _apply_formats(self): + if self.formats: + self.csv_format = bool(self.formats.get("csv", self.csv_format)) + self.json_format = bool(self.formats.get("json", self.json_format)) + self.txt_format = bool(self.formats.get("txt", self.txt_format)) + return selfsrc/lightspeed_evaluation/core/output/visualization.py (1)
228-244: Do not fill missing scores with 0 in boxplot; it skews distributions.Padding NaNs to equalize length is fine; replacing with zeros distorts stats. Also prefer a palette over BASE_COLORS to avoid white boxes and limited colors.
- bplot = ax.boxplot( - results_df.fillna(0), - sym=".", - widths=0.5, - vert=False, - patch_artist=True, - ) - - labels = results_df.columns - - colors = list(BASE_COLORS.keys())[: len(labels)] - for patch, color in zip(bplot["boxes"], colors): - patch.set_facecolor(color) + labels = list(results_df.columns) + data = [results_df[c].dropna().to_numpy() for c in labels] + bplot = ax.boxplot( + data, + sym=".", + widths=0.5, + vert=False, + patch_artist=True, + ) + + colors = sns.color_palette("husl", len(labels)) + for patch, color in zip(bplot["boxes"], colors): + patch.set_facecolor(color)src/lightspeed_evaluation/core/metrics/ragas.py (1)
79-83: Don’t rely on hard-coded result keys; infer the metric column from the result.Ragas column names vary by version; this can raise KeyError (e.g., “nv_context_relevance”). Select the sole numeric column returned for a single-metric run.
- result = evaluate(dataset, metrics=[metric_instance]) - df = result.to_pandas() - score = df[result_key].iloc[0] - return score, f"Ragas {metric_name}: {score:.2f}" + result = evaluate(dataset, metrics=[metric_instance]) + df = result.to_pandas() + numeric_cols = df.select_dtypes(include="number").columns.tolist() + if len(numeric_cols) != 1: + raise ValueError(f"Unexpected Ragas output columns: {df.columns.tolist()}") + score_col = numeric_cols[0] + score = float(df[score_col].iloc[0]) + return score, f"Ragas {metric_name}: {score:.2f}"
🧹 Nitpick comments (44)
.gitignore (2)
183-184: Anchor to repo root to avoid unintended matches

If `.deepeval/` is intended only at the repo root, anchor it to prevent ignoring similarly named directories in nested paths.

-.deepeval/
+/.deepeval/
186-187: Scope the WIP ignore pattern

`wip*/` currently matches any nested dir starting with `wip`. If you only mean top-level WIP dirs, anchor it. If broader scope is desired, keep as is.

-wip*/
+/wip*/

.github/workflows/bandit.yaml (3)
11-12: Permissions correctly include security-events: write; consider least-privilege for actions and PR-fork behavior
- Good: security-events: write is required to upload SARIF to Code Scanning.
- Optional: actions: read is only needed for private repos; if this repo is public, you can drop it.
- Heads-up: On pull_request events from public forks, GITHUB_TOKEN cannot get write on security-events, so uploads will fail with “Resource not accessible by integration.” Consider skipping the upload step (or the whole action) for forked PRs or allowing it to continue-on-error in that case. (docs.github.com)
Apply (if repo is public) and add fork guard:

 permissions:
   security-events: write
-  actions: read
   contents: read
   pull-requests: read

And gate/soften on forks (see step suggestion below). (docs.github.com)
16-16: Pin the Bandit action to a specific release for reproducibilityUse the latest v1 tag release (v1.0.1 as of Aug 1, 2025) or a commit SHA to avoid surprise changes from floating @v1.
- - uses: PyCQA/bandit-action@v1
+ - uses: PyCQA/bandit-action@v1.0.1

Also consider skipping or soft-failing on forked PRs where uploads will lack permissions:
- - uses: PyCQA/bandit-action@v1.0.1 + - uses: PyCQA/bandit-action@v1.0.1 + if: ${{ github.event_name != 'pull_request' || github.event.pull_request.head.repo.fork == false }} + continue-on-error: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.repo.fork }}This keeps uploads working on pushes while avoiding red PRs from forks. (github.com)
18-18: Targets set to src is fine; consider severity/confidence and excludes to tune signal-to-noiseBandit-action supports severity/confidence tuning and excludes. Recommend establishing a policy (e.g., block on medium/high, warn on low) and excluding virtualenvs/vendor dirs.
 with:
-  targets: "src"
+  targets: "src"
+  severity: "medium"
+  confidence: "medium"
+  exclude: ".git,__pycache__,.tox,.eggs,*.egg,.venv,venv,dist,build"

Note: exit_zero removal means findings will now fail the job—intentional and good. If you need a gradual rollout, lower severity/confidence first or add a Bandit baseline. (bandit.readthedocs.io, github.com)
archive/example_result/README.md (1)
14-14: Verified relative link resolves correctly; optional wording tweak
The link `../../eval_data/ocp_doc_qna-edited.parquet` exists from both the repo root and this README's directory.

Optional nit: pluralize "OCP doc" for clarity:
- QnA evaluation dataset: [QnAs from OCP doc](../../eval_data/ocp_doc_qna-edited.parquet) + QnA evaluation dataset: [QnAs from OCP docs](../../eval_data/ocp_doc_qna-edited.parquet)src/lightspeed_evaluation/core/config/loader.py (1)
1-1: Docstring rename LGTM; align class docstring for consistency.The module docstring now says “Evaluation Framework” but the class docstring (Line 194) still mentions “LSC”. Suggest aligning.
Outside the changed lines, update the class docstring at Line 194 to:
"""Configuration loader for Evaluation Framework."""lsc_agent_eval/README.md (2)
13-13: Tighten wording for tool_eval semantics.Current: “..., Only regex pattern check (case sensitive) is done for argument value”
Suggested: “..., only regex pattern checks (case-sensitive) are applied to argument values.”- - `tool_eval`: Tool call evaluation comparing expected vs actual tool calls with arguments, Only regex pattern check (case sensitive) is done for argument value + - `tool_eval`: Tool call evaluation comparing expected vs actual tool calls with arguments; only regex pattern checks (case-sensitive) are applied to argument values.
69-69: Clarify expected_tool_calls rules and grammar.Add plural “values” and note strictness about extra args if that’s intended by the implementation.
-- `expected_tool_calls`: Expected tool call sequences (list of lists) with tool_name and arguments (for tool_eval), Regex pattern check is done for argument value +- `expected_tool_calls`: Expected tool call sequences (list of lists) with tool_name and arguments (for tool_eval). Regex pattern checks are applied to argument values (case-sensitive). By default, extra unexpected argument keys in the actual call will cause a mismatch.lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1)
110-115: Decide on strictness for extra keys.Currently, any unexpected actual argument keys fail the match. If optional args are acceptable, consider logging and ignoring extras by default or gating via a flag.
Would you like me to update README (Line 69) to document this strictness and add tests for extra-key behavior?
config/system.yaml (1)
98-98: Trim trailing whitespace

YAML lint flags trailing spaces at Line 98.

archive/pyproject.toml (1)
7-11: Top-level formatting/lint keys are misplaced

target-version, lint.pydocstyle.convention, and line-length aren't under tool sections and will be ignored.
Move them under the appropriate tool tables:
-target-version = "py311" -lint.pydocstyle.convention = "google" -line-length = 100 +[tool.black] +target-version = ["py311"] +line-length = 100 + +[tool.pydocstyle] +convention = "google"src/lightspeed_evaluation/core/config/models.py (2)
172-203: LLMConfig improvements are fine; add extra-handling for robustness

Looks good. Consider adding model_config = ConfigDict(extra="ignore") to tolerate provider-specific extras without failing parsing.
class LLMConfig(BaseModel): """LLM configuration from system configuration.""" + model_config = ConfigDict(extra="ignore")
82-92: Metric requirement validation: consider stricter checks and clearer messages

Strip metric strings before validation and include metric values in errors for easier debugging. Optional, but helps operators.
- for metric in v: + for metric in (m.strip() for m in v): if not metric or ":" not in metric: raise ValueError( f'Metric "{metric}" must be in format "framework:metric_name"' ) return v @@ - errors.append( - f"TurnData {turn_data.turn_id}: Metric '{metric}' requires contexts" - ) + errors.append(f"Turn {turn_data.turn_id}: '{metric}' requires non-empty contexts") @@ - errors.append( - f"TurnData {turn_data.turn_id}: " - f"Metric '{metric}' requires expected_response" - ) + errors.append(f"Turn {turn_data.turn_id}: '{metric}' requires expected_response")Also applies to: 93-131
pyproject.toml (2)
8-8: Use SPDX license identifier for better tooling compatibility

Apache should be expressed as an SPDX id or file reference.
-license = {text = "Apache"} +license = {text = "Apache-2.0"} # or: +# license = {file = "LICENSE"}
22-29: Use a compatible torch version range

Replace the hard pin:

- "torch==2.7.0",
+ "torch>=2.6,<2.8",

to allow patch releases (e.g., 2.7.1) and let pip select the correct wheel across Python 3.10–3.12 on Linux, macOS, and Windows.
README.md (2)
3-3: Grammar fix

Pluralize "application".
-A comprehensive framework/tool to evaluate GenAI application. +A comprehensive framework/tool to evaluate GenAI applications.
133-137: Align prerequisites with actual deps (httpx, not requests)

The project depends on `httpx`. Mentioning `requests` here is misleading.

-3. **Dependencies**: Ensure `requests` and other required packages are installed
+3. **Dependencies**: Ensure required packages are installed (e.g., `httpx`)
116-135: Avoid returning raw per-metric score arrays to keep JSON small
calculate_detailed_statsretainsscoreslists inside each metric’s stats. These get serialized in the JSON summary and can bloat artifacts. Keep only derivedscore_statistics.if stats["scores"]: scores = stats["scores"] stats["score_statistics"] = { "mean": statistics.mean(scores), "median": statistics.median(scores), "std": statistics.stdev(scores) if len(scores) > 1 else 0.0, "min": min(scores), "max": max(scores), "count": len(scores), } else: stats["score_statistics"] = { "mean": 0.0, "median": 0.0, "std": 0.0, "min": 0.0, "max": 0.0, "count": 0, } + # Do not retain raw scores in the returned structure to minimize JSON size + stats.pop("scores", None)
10-17: Remove unused dataclass or document its purpose
`EvaluationScope` isn't referenced in this module. Consider removing or moving it to where it's used.
21-27: Validate YAML top-level type before model construction

If the YAML is empty or not a list, constructing models will raise confusing errors. Guard early.
with open(data_path, "r", encoding="utf-8") as f: raw_data = yaml.safe_load(f) - # Convert raw data to Pydantic models + # Ensure expected structure + if not isinstance(raw_data, list): + raise ValueError("Evaluation data must be a YAML list of conversation groups") + + # Convert raw data to Pydantic models try: evaluation_data = [EvaluationData(**item) for item in raw_data]
35-54: Use logging instead of print for library code (keep emojis if you like)

Printing inside validators makes programmatic use noisy and hard to control. Prefer a named logger.
+import logging +logger = logging.getLogger("lightspeed_evaluation") @@ - print(f"📋 Evaluation data loaded: {len(evaluation_data)} conversations") + logger.info("📋 Evaluation data loaded: %d conversations", len(evaluation_data)) @@ - print("❌ Validation Errors:") + logger.error("❌ Validation Errors:") for error in self.validation_errors: - print(f" • {error}") + logger.error(" • %s", error) return False @@ - print("✅ All data validation passed") + logger.info("✅ All data validation passed")src/lightspeed_evaluation/core/output/visualization.py (4)
161-165: Swap mislabeled axes in pass-rates chart.Bars are vertical (x=metrics, y=pass_rate). Labels are reversed.
- ax.set_xlabel("Pass Rate (%)", fontsize=12) - ax.set_ylabel("Metrics", fontsize=12) + ax.set_xlabel("Metrics", fontsize=12) + ax.set_ylabel("Pass Rate (%)", fontsize=12)
420-422: Heatmap x-axis label should be “Metrics”.Colorbar already conveys “Pass Rate (%)”.
- ax.set_xlabel("Pass Rate (%)", fontsize=12, fontweight="bold") + ax.set_xlabel("Metrics", fontsize=12, fontweight="bold")
144-176: Use configured figsize/dpi consistently when saving pass-rates chart.Currently hard-coded (12,8) and dpi=300; align with self.figsize/self.dpi.
- _, ax = plt.subplots(figsize=(12, 8)) + _, ax = plt.subplots(figsize=tuple(self.figsize), dpi=self.dpi) ... - plt.savefig(filename, dpi=300, bbox_inches="tight") + plt.savefig(filename, dpi=self.dpi, bbox_inches="tight")
79-83: Avoid early return when by_metric is empty; still generate pie/distribution.Right now no graphs are produced at all in that case. Consider gating only pass-rates and heatmap on by_metric, but still emit score distribution (if scores exist) and status breakdown.
If you want, I can propose a small refactor diff that keeps the pie/distribution paths active when by_metric is empty.
src/lightspeed_evaluation/core/metrics/ragas.py (1)
115-133: Guard against missing turn data for response relevancy.Currently falls back to empty strings, producing meaningless evaluations.
def _evaluate_response_relevancy( self, _conv_data: Any, _turn_idx: Optional[int], turn_data: Optional[TurnData], is_conversation: bool, ) -> Tuple[Optional[float], str]: """Evaluate response relevancy.""" - if is_conversation: + if is_conversation: return None, "Response relevancy is a turn-level metric" - - query, response, _ = self._extract_turn_data(turn_data) + if turn_data is None: + return None, "TurnData is required for response relevancy" + if not turn_data.query or not turn_data.response: + return None, "Query and response cannot be empty for response relevancy" + query, response, _ = self._extract_turn_data(turn_data)README-generate-answers.md (2)
36-38: Clarify model selection sentence.Minor grammar polish.
-You use the `models_to_evaluate` list to select which of the available models will be used for answer generation. -All models included in this list must also be defined in the `models` section **and** properly configured and available in the running lightspeed-core service. +Use the `models_to_evaluate` list to select which available models will be used for answer generation. +All models in this list must also be defined in the `models` section and available in the running Lightspeed Core service.
43-55: Fix list style to match markdownlint and add missing punctuation in CSV example.Switch to dashes and add a question mark.
-* **CSV** – must contain two columns: `id` and `question`. + - **CSV** – must contain two columns: `id` and `question`. ... -* **Parquet** – Lightspeed evaluation parquet format is supported. -* **JSON** – Lightspeed evaluation JSON format is supported. + - **Parquet** – Lightspeed evaluation parquet format is supported. + - **JSON** – Lightspeed evaluation JSON format is supported.- 3,How do I manage RBAC in OpenShift Virtualization + 3,How do I manage RBAC in OpenShift Virtualization?src/lightspeed_evaluation/drivers/__init__.py (1)
1-5: Package re-export looks good.Public alias lightspeed_evaluation.drivers.EvaluationDriver now resolves; docstring branding nit: consider “Lightspeed” for consistency elsewhere.
-"""Evaluation drivers for LightSpeed Evaluation Framework.""" +"""Evaluation drivers for Lightspeed Evaluation Framework."""src/lightspeed_evaluation/drivers/evaluation.py (2)
65-66: Nit: user-facing label says "TurnData"

Use "Turn" for display clarity.
- return f"TurnData {self.turn_id}: {self.conv_data.conversation_group_id}" + return f"Turn {self.turn_id}: {self.conv_data.conversation_group_id}"
140-140: Optional: use structured logging instead of print

Consider routing through ConfigLoader.logger when available to unify logs with the rest of the framework.
- print("✅ Evaluation Driver initialized") + if getattr(self.config_loader, "logger", None): + self.config_loader.logger.info("Evaluation Driver initialized") + else: + print("✅ Evaluation Driver initialized")src/lightspeed_evaluation/core/config/__init__.py (1)
7-16: Optional: annotate all for linting clarity.Some linters prefer explicit typing on module exports.
-__all__ = [ +__all__: list[str] = [ "ConfigLoader", "setup_environment_variables", "SystemConfig", "EvaluationData", "EvaluationResult", "TurnData", "LLMConfig", "DataValidator", ]src/lightspeed_evaluation/runner/evaluation.py (3)
18-28: Docstring return type mismatch with signature.The function returns Optional[Dict[str, int]] (None on failure). Align the docstring for clarity.
- Returns: - dict: Summary statistics with keys TOTAL, PASS, FAIL, ERROR + Returns: + Optional[Dict[str, int]]: Summary statistics with keys TOTAL, PASS, FAIL, ERROR; + None on failure.
38-51: Avoid double validation of evaluation data.DataValidator.load_evaluation_data() validates once, and EvaluationDriver.run_evaluation() validates again. Consider validating in a single place to reduce duplicate work and logs.
108-118: Optional: check file existence before early env setup.setup_environment_variables() gracefully falls back on missing files, but calling it before existence checks can emit avoidable warnings when the path is wrong. Consider verifying the file first, then invoking setup_environment_variables().
src/lightspeed_evaluation/core/metrics/custom.py (1)
248-321: Optional: trim sensitive output.Avoid printing full model outputs in libraries. If needed, log at DEBUG with truncation.
src/lightspeed_evaluation/core/__init__.py (1)
4-12: Consolidate duplicate imports from .config.Minor cleanup to avoid split imports.
-from .config import ( - ConfigLoader, - SystemConfig, - EvaluationData, - setup_environment_variables, - DataValidator, -) -from .config import EvaluationResult, TurnData, LLMConfig +from .config import ( + ConfigLoader, + SystemConfig, + EvaluationData, + setup_environment_variables, + DataValidator, + EvaluationResult, + TurnData, + LLMConfig, +)src/lightspeed_evaluation/core/llm/lightspeed_rag_llm.py (4)
13-21: Consider extending RAG config for auth and custom headers.Helpful for secured services and proxies.
@dataclass class LightspeedRAGConfig: """Configuration for Lightspeed RAG integration.""" base_url: str = "http://localhost:8080" provider: str = "openai" model: str = "gpt-4o-mini" system_prompt: Optional[str] = None no_tools: bool = False timeout: int = 300 + api_key: Optional[str] = None + headers: Optional[Dict[str, str]] = None
100-156: Avoid per-call client recreation for system_prompt overrides.Pass system_prompt into query() to reduce overhead and keep connection pooling.
- def generate_response( + def generate_response( self, query: str, contexts: Optional[List[str]] = None, conversation_id: Optional[str] = None, system_prompt: Optional[str] = None ) -> Tuple[str, str]: @@ - # Create a temporary config for this query if system prompt override is needed - if effective_system_prompt != self.rag_config.system_prompt: - temp_config = LightspeedRAGConfig( - base_url=self.rag_config.base_url, - provider=self.rag_config.provider, - model=self.rag_config.model, - system_prompt=effective_system_prompt, - no_tools=self.rag_config.no_tools, - timeout=self.rag_config.timeout - ) - temp_client = LightspeedRAGClient(temp_config) - result = temp_client.query( - query=query, - conversation_id=conversation_id, - attachments=attachments - ) - else: - result = self.rag_client.query( - query=query, - conversation_id=conversation_id, - attachments=attachments - ) + # Single client path, pass override directly + result = self.rag_client.query( + query=query, + conversation_id=conversation_id, + attachments=attachments, + system_prompt=effective_system_prompt, + )And update query() signature to accept the override:
- def query( + def query( self, query: str, conversation_id: Optional[str] = None, - attachments: Optional[List[Dict[str, Any]]] = None, + attachments: Optional[List[Dict[str, Any]]] = None, + system_prompt: Optional[str] = None, ) -> Dict[str, Any]: @@ - if self.config.system_prompt: - payload["system_prompt"] = self.config.system_prompt + if system_prompt is not None: + payload["system_prompt"] = system_prompt + elif self.config.system_prompt: + payload["system_prompt"] = self.config.system_prompt
24-33: Add a close() to release HTTP resources.Provide an explicit way to close the Session (useful for long runs/tests).
class LightspeedRAGClient: @@ def __init__(self, config: LightspeedRAGConfig): """Initialize the client with configuration.""" @@ self.session.headers.update({'Content-Type': 'application/json'}) + def close(self) -> None: + """Close underlying HTTP session.""" + try: + self.session.close() + except Exception: + pass
1-7: Optional: decouple typing import for LLMConfig.To avoid tight coupling on llm.manager, import LLMConfig from config models (or type-only).
-from .manager import LLMManager, LLMConfig +from .manager import LLMManager +from ..config.models import LLMConfig # or: if TYPE_CHECKING: from ..config.models import LLMConfigsrc/lightspeed_evaluation/__init__.py (2)
1-14: Docstring mentions Runner but it’s not exported.Either export the CLI runner symbol or adjust wording to “CLI runner”.
Main components: - EvaluationDriver: Runs complete evaluation -- Runner: Simple runner for command-line usage +- CLI runner available via the package’s command-line entrypoints
21-32: Group LLMConfig import with other config symbols.Tiny import cleanup for consistency.
-from .core.config import ( +from .core.config import ( ConfigLoader, SystemConfig, EvaluationData, TurnData, EvaluationResult, DataValidator, -) -from .core.config import LLMConfig + LLMConfig, +)
```yaml
  turn_metrics_metadata:
    "ragas:faithfulness":
```
💡 Verification agent
🧩 Analysis chain
Strip trailing spaces to satisfy YAML lint
There are trailing spaces on the listed lines.
Quick fixer script:
Also applies to: 16-16, 31-31, 33-33, 37-37, 48-48, 51-51, 55-55, 66-66, 69-69, 71-71, 75-75, 77-77, 85-85
🏁 Script executed:
#!/bin/bash
# Remove trailing spaces and re-run yamllint
sed -i 's/[ \t]\+$//' config/evaluation_data.yaml
yamllint config/evaluation_data.yaml || trueLength of output: 697
Fix remaining YAML lint errors
- Add a YAML document start marker (`---`) at the top of config/evaluation_data.yaml to resolve the document-start warning.
- Wrap or shorten lines 80, 81, 83, 84, 88, 90, and 91 to ≤ 80 characters (or adjust the line-length rule in your yamllint config) to clear the line-length errors.
🧰 Tools
🪛 YAMLlint (1.37.1)
[error] 10-10: trailing spaces
(trailing-spaces)
[error] 11-11: trailing spaces
(trailing-spaces)
[error] 12-12: trailing spaces
(trailing-spaces)
🤖 Prompt for AI Agents
In config/evaluation_data.yaml around lines 10-12 (and affecting linting of
lines 80,81,83,84,88,90,91), the file is missing a YAML document start marker
and contains several lines exceeding the 80-character limit; add a top-of-file
document start marker ("---") as the first line, then wrap or shorten the long
lines at 80, 81, 83, 84, 88, 90 and 91 to be ≤80 characters (or alternatively
update the yamllint line-length rule if you intend to allow longer lines).
config/system.yaml
Outdated
# Lightspeed RAG Configuration TODO: Perhaps we don't need separate from above?
lightspeed_rag:
  base_url: "http://localhost:8080"  # Lightspeed Stack RAG API endpoint
  provider: "openai"  # LLM provider for RAG queries
  model: "gpt-4o-mini"  # Model to use for RAG queries
  system_prompt: "You are a helpful assistant with access to a knowledge base. Use the provided context to answer questions accurately and comprehensively."
  no_tools: false  # Whether to bypass tools and MCP servers
  timeout: 300  # Request timeout for RAG queries
lightspeed_rag block likely ignored by current config models
EvaluationSystemConfig doesn’t define a lightspeed_rag field, so this section will be dropped or cause validation issues depending on loader settings. Either unify with llm or add a dedicated config model and wire it into EvaluationSystemConfig.
Proposed minimal model addition (in src/lightspeed_evaluation/core/config/models.py):
+from pydantic import BaseModel, Field, field_validator, ConfigDict, model_validator
+
+class LightspeedRAGConfig(BaseModel):
+ """Lightspeed RAG client configuration."""
+ base_url: str = Field(..., description="Lightspeed RAG API endpoint")
+ provider: str = Field(..., description="LLM provider for RAG queries")
+ model: str = Field(..., description="Model for RAG queries")
+ system_prompt: str = ""
+ no_tools: bool = False
+ timeout: int = Field(300, ge=1)
+
+ @field_validator("base_url", "provider", "model")
+ @classmethod
+ def _non_empty(cls, v: str) -> str:
+ if not v or not v.strip():
+ raise ValueError("Value cannot be empty")
+ return v.strip()
+
class EvaluationSystemConfig(BaseModel):
"""System configuration for evaluation."""
- llm: Dict[str, Any]
+ llm: Dict[str, Any]
+ lightspeed_rag: Optional[LightspeedRAGConfig] = None
+ environment: Dict[str, Any] = Field(default_factory=dict)
+ visualization: Dict[str, Any] = Field(default_factory=dict)
+ model_config = ConfigDict(extra="ignore")
🤖 Prompt for AI Agents
In config/system.yaml around lines 12 to 20, the new lightspeed_rag block is not
defined in your config models and will be ignored or fail validation; add a
LightspeedRagConfig dataclass/model (e.g., in
src/lightspeed_evaluation/core/config/models.py) with fields matching base_url,
provider, model, system_prompt, no_tools, timeout, then add a lightspeed_rag:
Optional[LightspeedRagConfig] field to EvaluationSystemConfig and update any
config loader/validation code to parse/validate it and wire the new field into
places that consume RAG settings (or alternatively merge these keys into the
existing llm config if that fits design).
        expected_str = str(expected_value)
        actual_str = str(actual_value)

        # Use regex search for flexible matching
        # This is a quick work-around, enhance this to use both regex & exact match.
        try:
            if not re.search(expected_str, actual_str):
                logger.debug(
                    "Argument value mismatch for '%s': pattern '%s' not found in '%s'",
                    key,
                    expected_str,
                    actual_str,
                )
                return False
        except re.error as e:
            logger.debug(
                "Argument value mismatch for '%s': expected %s, got %s",
                key,
                expected_value,
                actual_value,
                "Invalid regex pattern '%s' for key '%s': %s", expected_str, key, e
            )
            # If regex is invalid, fail the comparison
            return False
Regex-on-stringified containers causes false positives (critical).
Casting lists/dicts to strings and running re.search leads to char-class and substring matches that can incorrectly PASS (e.g., expected ["namespaces"] vs actual ["nodes"]). Also conflates numbers/booleans via substring matches (e.g., 80 matches 8080).
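As a standalone illustration (snippet written for this comment, not code from the PR), the stringified comparison produces exactly these false positives:

import re

# str(["namespaces"]) == "['namespaces']"; as a regex the brackets become a character
# class, so the pattern matches any single one of those characters in the target string.
expected = ["namespaces"]
actual = ["nodes"]
print(bool(re.search(str(expected), str(actual))))  # True - false positive

# Plain substring semantics also conflate scalars:
print(bool(re.search(str(80), str(8080))))  # True - "80" is found inside "8080"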
Apply a type-aware matcher to prevent these errors while keeping regex support for strings:
def _compare_tool_arguments(expected: dict[str, Any], actual: dict[str, Any]) -> bool:
- """Compare tool arguments name & value (regex pattern for the value)."""
+ """Compare tool arguments by name and value.
+ - Strings in expected are treated as regex patterns against actual.
+ - Lists are compared element-wise with the same semantics.
+ - Dicts are compared recursively.
+ - Non-string scalars compare by value equality."""
@@
- # Compare each expected argument
+ # Helper for matching values with type-awareness
+ def _match_value(exp: Any, act: Any, key_path: str) -> bool:
+ if isinstance(exp, dict):
+ if not isinstance(act, dict):
+ logger.debug("Type mismatch for '%s': expected dict, got %s", key_path, type(act).__name__)
+ return False
+ return _compare_tool_arguments(exp, act)
+ if isinstance(exp, (list, tuple)):
+ if not isinstance(act, (list, tuple)) or len(exp) != len(act):
+ logger.debug("List length/type mismatch for '%s': expected %s items, got %s", key_path, len(exp), len(act) if isinstance(act, (list, tuple)) else type(act).__name__)
+ return False
+ for idx, (e_item, a_item) in enumerate(zip(exp, act)):
+ if not _match_value(e_item, a_item, f"{key_path}[{idx}]"):
+ return False
+ return True
+ if isinstance(exp, str):
+ try:
+ return re.search(exp, str(act)) is not None
+ except re.error as e:
+ logger.debug("Invalid regex pattern '%s' for key '%s': %s", exp, key_path, e)
+ return False
+ return str(exp) == str(act)
+
+ # Compare each expected argument
for key, expected_value in expected.items():
if key not in actual:
logger.debug("Missing argument key: '%s'", key)
return False
-
- actual_value = actual[key]
-
- expected_str = str(expected_value)
- actual_str = str(actual_value)
-
- # Use regex search for flexible matching
- # This is a quick work-around, enhance this to use both regex & exact match.
- try:
- if not re.search(expected_str, actual_str):
- logger.debug(
- "Argument value mismatch for '%s': pattern '%s' not found in '%s'",
- key,
- expected_str,
- actual_str,
- )
- return False
- except re.error as e:
- logger.debug(
- "Invalid regex pattern '%s' for key '%s': %s", expected_str, key, e
- )
- # If regex is invalid, fail the comparison
- return False
+ actual_value = actual[key]
+ if not _match_value(expected_value, actual_value, key):
+ logger.debug(
+ "Argument value mismatch for '%s': expected '%s', actual '%s'",
+ key, expected_value, actual_value
+ )
+ return False
🤖 Prompt for AI Agents
In lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py
around lines 89 to 108, the current code stringifies containers and runs regex
on them which causes false positives; replace that with a type-aware matcher: if
expected_value is a str only attempt regex matching against actual_value when
actual_value is also a str (catch regex errors), otherwise compare string
equality; if expected_value is an int/float/bool perform strict equality against
actual_value; if expected_value is a dict, ensure every key in expected exists
in actual and recursively compare values; if expected_value is a list, ensure
each element in expected is found in actual by recursively comparing elements
(order-insensitive/subset match); implement recursion to handle nested
structures and return False immediately on mismatch, keeping regex support
exclusively for string-to-string comparisons.
lightspeed-eval = "lightspeed_evaluation.runner.evaluation:main"
generate_answers = "generate_answers.generate_answers:main"
💡 Verification agent
🧩 Analysis chain
Fix packaging: console script points to a module that isn’t shipped
generate_answers = "generate_answers.generate_answers:main" creates an entry point in the wheel, but the wheel only packages src/lightspeed_evaluation. This will fail at runtime after installation from a wheel. Either include generate_answers in the wheel or drop the console script from distribution (dev-only).
Given the prior intent to keep it out of wheels, remove the script here and expose it via a dev-only uv script instead.
[project.scripts]
lightspeed-eval = "lightspeed_evaluation.runner.evaluation:main"
-generate_answers = "generate_answers.generate_answers:main"
Dev-only (outside this diff, add to pyproject.toml):
[tool.uv.scripts]
generate-answers = "python -m generate_answers.generate_answers"
Run to confirm packaging state and avoid broken entry points:
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Checking that entry points resolve within source tree..."
fd -a 'generate_answers.py' -t f src || true
rg -n 'module\s*=\s*"generate_answers\.generate_answers:main"|generate_answers\.generate_answers:main' pyproject.toml
echo "Verifying the packaged packages list..."
rg -n '^\s*packages\s*=\s*\[.*\]' -n -C2 pyproject.toml
echo "Ensure the main CLI target exists:"
fd -a 'evaluation.py' -t f src/lightspeed_evaluation/runner
rg -nP 'def\s+main\s*\(' src/lightspeed_evaluation/runner/evaluation.py || true
Length of output: 987
Remove the broken generate_answers console script
The generate_answers = "generate_answers.generate_answers:main" entry under [project.scripts] points to a package not included in the wheel (packages = ["src/lightspeed_evaluation"]), causing runtime failures after install. Remove it and expose the CLI as a dev-only script instead.
[project.scripts]
lightspeed-eval = "lightspeed_evaluation.runner.evaluation:main"
-generate_answers = "generate_answers.generate_answers:main"
Add to pyproject.toml under dev tools:
[tool.uv.scripts]
generate-answers = "python -m generate_answers.generate_answers"
🤖 Prompt for AI Agents
In pyproject.toml around lines 45 to 46, the console script entry
"generate_answers = \"generate_answers.generate_answers:main\"" points to a
package not included in the wheel (packages = ["src/lightspeed_evaluation"]) and
must be removed; delete that line from [project.scripts] and instead add a
dev-only entry under your dev tooling by adding a [tool.uv.scripts] section with
an entry named generate-answers that invokes the module via the python -m form
so the CLI is available for development but not installed in the distributed
wheel.
    def get_info(self) -> Dict[str, Any]:
        """Get service information."""
        response = self.session.get(f"{self.base_url}/v1/info")
        response.raise_for_status()
        return response.json()
Missing timeout and JSON error handling in get_info.
GET can hang and JSON parsing can fail; add timeout and robust parse.
def get_info(self) -> Dict[str, Any]:
"""Get service information."""
- response = self.session.get(f"{self.base_url}/v1/info")
+ response = self.session.get(
+ f"{self.base_url}/v1/info",
+ timeout=self.config.timeout,
+ )
response.raise_for_status()
- return response.json()
+ try:
+ return response.json()
+ except ValueError as e:
+ raise RuntimeError("Invalid JSON from Lightspeed RAG /v1/info") from e
🤖 Prompt for AI Agents
In src/lightspeed_evaluation/core/llm/lightspeed_rag_llm.py around lines 73 to
77, the get_info method currently calls session.get and response.json without a
timeout or robust error handling; change the call to include a sensible timeout
(e.g. timeout=5 or configurable constant) and wrap the request and JSON parsing
in try/except blocks to handle requests.exceptions (including timeouts and HTTP
errors) and json.JSONDecodeError; call response.raise_for_status() after a
successful request, attempt response.json() inside a try and on JSON errors
raise or return a clear structured error, and ensure any caught exceptions are
logged or re-raised with contextual message so callers can handle failures
safely.
        return result.get("response", ""), result.get("conversation_id", "")
💡 Verification agent
🧩 Analysis chain
Key names in /v1/query response.
You access result["response"] and ["conversation_id"]. Please confirm these keys match the API; some services use "answer" or "output".
🌐 Web query:
For Lightspeed Stack RAG /v1/query responses, what are the exact JSON keys that return the assistant's text and the conversation/session identifier?
💡 Result:
Most Lightspeed Stack RAG /v1/query implementations return the assistant text under the "answer" key and the conversation/session identifier as "session_id" (sometimes called "conversation_id" in related components). [1][2]
Sources:
- Example response fields showing "answer" and "session_id". [1]
- lightspeed-stack project docs referencing /v1/query and conversation/session identifiers. [2]
Update /v1/query key accesses
The RAG API returns the assistant text under "answer" and the session identifier under "session_id". Replace in src/lightspeed_evaluation/core/llm/lightspeed_rag_llm.py (at lines 157–158):
- return result.get("response", ""), result.get("conversation_id", "")
+ return result.get("answer", ""), result.get("session_id", "")
🤖 Prompt for AI Agents
In src/lightspeed_evaluation/core/llm/lightspeed_rag_llm.py around lines 157 to
158, the code currently reads result.get("response", "") and
result.get("conversation_id", "") but the RAG /v1/query response uses "answer"
for assistant text and "session_id" for the session identifier; update those key
accesses to result.get("answer", "") and result.get("session_id", "")
respectively so the function returns the correct values.
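If a given deployment's response shape is uncertain, a tolerant parser along these lines could bridge both key sets while the contract settles; only "answer"/"session_id" are documented above, and the "response"/"conversation_id" fallback plus the helper name are illustrative assumptions, not confirmed API fields:

from typing import Any, Dict, Tuple

def parse_query_response(result: Dict[str, Any]) -> Tuple[str, str]:
    """Extract assistant text and conversation/session id, tolerating either key set."""
    # Documented keys first, assumed legacy keys as fallback.
    text = result.get("answer") or result.get("response") or ""
    conv_id = result.get("session_id") or result.get("conversation_id") or ""
    return text, conv_id

# Example with a stubbed payload:
print(parse_query_response({"answer": "Use 'oc get pods -A'.", "session_id": "abc-123"}))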
            conversation_id: Optional conversation ID
            system_prompt: Optional system prompt override

        Returns:
            Tuple of (response, conversation_id)
        """
        contexts = None
        if turn_data.contexts:
            contexts = [ctx.content for ctx in turn_data.contexts]

        return self.generate_response(
            query=turn_data.query,
            contexts=contexts,
            conversation_id=conversation_id,
            system_prompt=system_prompt
        )
💡 Verification agent
🧩 Analysis chain
Sanity-check all ctx.content usages across repo.
There are similar patterns elsewhere; run this to catch any remaining attr-access on dict contexts.
🏁 Script executed:
#!/bin/bash
# Find suspicious attribute access on contexts that should be dicts
rg -n -C2 -g '!**/dist/**' -g '!**/build/**' -P '\bcontexts?\b.*\n.*\.\s*content\b'
echo "---"
# Show preferred dict access sites
rg -n -C2 -g '!**/dist/**' -g '!**/build/**' -P '\[\s*["'\'']content["'\'']\s*\]'
Length of output: 1648
Use conditional access for ctx.content in lightspeed_rag_llm
In src/lightspeed_evaluation/core/llm/lightspeed_rag_llm.py replace the unconditional attribute access
contexts = [ctx.content for ctx in turn_data.contexts]
with the safe pattern used elsewhere:
contexts = [
ctx['content'] if isinstance(ctx, dict) else ctx.content
for ctx in turn_data.contexts
]This prevents errors when a context is a dict.
🤖 Prompt for AI Agents
In src/lightspeed_evaluation/core/llm/lightspeed_rag_llm.py around lines 170 to
186, the list comprehension unconditionally accesses ctx.content which will
raise if a context is a dict; change it to a conditional access that checks the
type and uses ctx['content'] when ctx is a dict and ctx.content otherwise so
contexts becomes a safe list comprehension handling both dict and object context
entries.
        contexts = None
        if turn_data.contexts:
            contexts = [ctx.content for ctx in turn_data.contexts]
TurnData.contexts access bug (dict vs attr).
TurnData.contexts is a list of dicts; using ctx.content will raise. Use dict access or fallback to str.
- if turn_data.contexts:
- contexts = [ctx.content for ctx in turn_data.contexts]
+ if turn_data.contexts:
+ contexts = [
+ (ctx["content"] if isinstance(ctx, dict) else str(ctx))
+ for ctx in turn_data.contexts
+ ]
🤖 Prompt for AI Agents
In src/lightspeed_evaluation/core/llm/lightspeed_rag_llm.py around lines 176 to
179, TurnData.contexts is a list of dicts so accessing ctx.content will fail;
replace that comprehension with one that extracts the content via dict access
and falls back to a string (e.g. for each ctx use ctx.get("content", str(ctx))
or if not a dict use str(ctx)) so contexts becomes a list of strings.
    def _evaluate_rag_response(
        self,
        _conv_data: Any,
        _turn_idx: Optional[int],
        turn_data: Optional[TurnData],
        is_conversation: bool,
    ) -> Tuple[Optional[float], str]:
        """Evaluate a RAG-generated response against the original response."""
        if is_conversation:
            return None, "RAG response evaluation is a turn-level metric"

        if turn_data is None:
            return None, "TurnData is required for RAG response evaluation"

        try:
            # Try to import and use the RAG manager
            from ..llm.lightspeed_rag_llm import LightspeedRAGManager

            # Create a RAG manager with default config (should be configurable)
            from ..llm.lightspeed_rag_llm import LightspeedRAGConfig
            rag_config = LightspeedRAGConfig()
            rag_manager = LightspeedRAGManager(rag_config)

            # Generate RAG response
            contexts = None
            if turn_data.contexts:
                contexts = [ctx['content'] if isinstance(ctx, dict) else ctx.content for ctx in turn_data.contexts]

            rag_response, _ = rag_manager.generate_response(
                query=turn_data.query,
                contexts=contexts
            )

            # Print the RAG response for debugging/visibility
            print(f"RAG Response: {rag_response}")

            # Compare RAG response with the original response
            params = EvaluationPromptParams(
                metric_name="RAG response quality",
                query=turn_data.query,
                response=rag_response,
                expected_response=turn_data.response,  # Use original response as expected
                contexts=turn_data.contexts if turn_data.contexts else None,
                scale="0.0 to 1.0",
            )
            prompt = self._create_evaluation_prompt(params)

            # Add specific instructions for RAG evaluation
            prompt += "\n\nYou are comparing a RAG-generated response against an original response.\n"
            prompt += "Consider:\n"
            prompt += "- Factual accuracy and relevance\n"
            prompt += "- Use of provided context\n"
            prompt += "- Completeness compared to original response\n"
            prompt += "- Overall quality and helpfulness\n"
            prompt += f"\nRAG Response: {rag_response}\n"
            prompt += f"Original Response (Expected): {turn_data.response}"

            # Make LLM call and parse response
            llm_response = self._call_llm(prompt)
            score, reason = self._parse_score_response(llm_response)

            if score is None:
                return (
                    None,
                    f"Could not parse score from LLM response: {llm_response[:100]}...",
                )

            return score, f"RAG response evaluation: {score:.2f} - {reason}"

        except ImportError:
            return None, "LightspeedRAGManager not available"
        except Exception as e:
            return None, f"RAG response evaluation failed: {str(e)}"
Critical: RAG manager is constructed without configuration; likely to fail.
LightspeedRAGConfig appears to require fields like base_url/provider/model. Instantiating it with no args will raise, and the metric will always error. Also, contexts extraction assumes objects with .content despite validated dicts, and printing full RAG responses can leak PII.
Apply this diff to build the RAG manager from system config, fix contexts extraction, and avoid printing full responses:
@@
- try:
- # Try to import and use the RAG manager
- from ..llm.lightspeed_rag_llm import LightspeedRAGManager
-
- # Create a RAG manager with default config (should be configurable)
- from ..llm.lightspeed_rag_llm import LightspeedRAGConfig
- rag_config = LightspeedRAGConfig()
- rag_manager = LightspeedRAGManager(rag_config)
-
- # Generate RAG response
- contexts = None
- if turn_data.contexts:
- contexts = [ctx['content'] if isinstance(ctx, dict) else ctx.content for ctx in turn_data.contexts]
-
- rag_response, _ = rag_manager.generate_response(
- query=turn_data.query,
- contexts=contexts
- )
-
- # Print the RAG response for debugging/visibility
- print(f"RAG Response: {rag_response}")
+ try:
+ # Build RAG manager from system configuration
+ from ..llm.lightspeed_rag_llm import LightspeedRAGManager
+
+ system_config = getattr(self, "system_config", None)
+ if not system_config:
+ return None, "LightspeedRAGManager requires lightspeed_rag configuration (system_config missing)"
+
+ rag_manager = LightspeedRAGManager.from_system_config(system_config)
+
+ # Generate RAG response
+ contexts_list = [ctx.get("content", str(ctx)) for ctx in (turn_data.contexts or [])]
+ contexts = contexts_list or None
+
+ rag_response, _ = rag_manager.generate_response(
+ query=turn_data.query,
+ contexts=contexts,
+ )
@@
- # Add specific instructions for RAG evaluation
+ # Add specific instructions for RAG evaluation
prompt += "\n\nYou are comparing a RAG-generated response against an original response.\n"
prompt += "Consider:\n"
prompt += "- Factual accuracy and relevance\n"
prompt += "- Use of provided context\n"
prompt += "- Completeness compared to original response\n"
prompt += "- Overall quality and helpfulness\n"
- prompt += f"\nRAG Response: {rag_response}\n"
- prompt += f"Original Response (Expected): {turn_data.response}"
+ # Note: the main comparison data is already embedded in the structured prompt above.
And update the constructor to carry system_config through, plus the factory:
@@
- def __init__(self, llm_manager: LLMManager):
+ def __init__(self, llm_manager: LLMManager, system_config: Optional[Dict[str, Any]] = None):
@@
- self.model_name = llm_manager.get_model_name()
- self.litellm_params = llm_manager.get_litellm_params()
+ self.model_name = llm_manager.get_model_name()
+ self.litellm_params = llm_manager.get_litellm_params()
+ self.system_config = system_config
@@
@classmethod
def from_system_config(cls, system_config: Dict[str, Any]) -> "CustomMetrics":
"""Create CustomMetrics from system configuration."""
llm_manager = LLMManager.from_system_config(system_config)
- return cls(llm_manager)
+ return cls(llm_manager, system_config)
Signed-off-by: Anxhela Coba <acoba@redhat.com>
Signed-off-by: Anxhela Coba <acoba@redhat.com>
Actionable comments posted: 6
🧹 Nitpick comments (3)
src/lightspeed_evaluation/core/llm/lightspeed_stack_client.py (2)
27-29: Set Accept header for consistency with JSON API.
Minor, but helps with proxies/gateways.
- self.session.headers.update({"Content-Type": "application/json"})
+ self.session.headers.update(
+ {"Content-Type": "application/json", "Accept": "application/json"}
+ )
64-65: Harden JSON decoding error path.
Surface clearer errors on invalid JSON.
- return response.json()
+ try:
+ return response.json()
+ except ValueError as e:
+ raise RuntimeError(f"Invalid JSON response: {e}") from e
src/lightspeed_evaluation/core/metrics/custom.py (1)
45-45: Prefer logging over print for library code.
Avoids noisy stdout in consumers.
+import logging
@@
- print(f"✅ Custom Metrics initialized: {self.model_name}")
+ logger = logging.getLogger(__name__)
+ logger.info("Custom Metrics initialized: %s", self.model_name)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
config/system.yaml (2 hunks)
src/lightspeed_evaluation/core/llm/lightspeed_stack_client.py (1 hunks)
src/lightspeed_evaluation/core/metrics/custom.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/lightspeed_evaluation/core/llm/lightspeed_stack_client.py (1)
src/lightspeed_evaluation/core/metrics/custom.py (1)
from_system_config(247-250)
🪛 GitHub Actions: Black
src/lightspeed_evaluation/core/metrics/custom.py
[error] 1-1: Black formatting check failed (command: 'uv run black --check .'). This file would be reformatted by Black.
🪛 GitHub Actions: Pydocstyle
src/lightspeed_evaluation/core/llm/lightspeed_stack_client.py
[error] 36-36: pydocstyle D212: Multi-line docstring summary should start at the first line (public method 'query').
[error] 77-77: pydocstyle D212: Multi-line docstring summary should start at the first line (public method 'from_system_config').
🪛 GitHub Actions: Python linter
src/lightspeed_evaluation/core/llm/lightspeed_stack_client.py
[error] 4-4: Pylint C0411: standard import 'typing.Any' should be placed before third party import 'requests' (wrong-import-order)
[error] 5-5: Pylint C0411: standard import 'dataclasses.dataclass' should be placed before third party import 'requests' (wrong-import-order)
🪛 YAMLlint (1.37.1)
config/system.yaml
[error] 98-98: trailing spaces
(trailing-spaces)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: mypy
🔇 Additional comments (3)
src/lightspeed_evaluation/core/metrics/custom.py (2)
42-43: Mapping entry for custom:answer_correctness looks good.
Aligned with system.yaml.
245-251: Reformat src/lightspeed_evaluation/core/metrics/custom.py with Black
CI indicates this file isn’t Black-formatted; install Black locally, run black src/lightspeed_evaluation/core/metrics/custom.py, and then commit the formatted changes.
config/system.yaml (1)
92-98: Custom metric metadata aligns with code.
Threshold/type/framework look correct.
# Lightspeed Stack Configuration
lightspeed_stack:
  base_url: "http://localhost:8080"  # Lightspeed Stack API endpoint
  provider: "openai"  # LLM provider for queries
  model: "gpt-4o-mini"  # Model to use for queries
  system_prompt: "You are a helpful assistant with access to a knowledge base. Use the provided context to answer questions accurately and comprehensively."
  no_tools: false  # Whether to bypass tools and MCP servers
  timeout: 300  # Request timeout for queries
💡 Verification agent
🧩 Analysis chain
lightspeed_stack block addition looks consistent with client defaults.
Ensure the loader/model preserves this section (not dropped by validation).
🏁 Script executed:
#!/bin/bash
# Inspect usage and model support for `lightspeed_stack`
rg -nC2 '\blightspeed_stack\b' -S
rg -nP '(?s)class\s+EvaluationSystemConfig\b.*' -C5 src | sed -n '1,200p'
Length of output: 2256
Add lightspeed_stack to the system config model
EvaluationSystemConfig only defines llm, logging, and metrics_metadata, so by default Pydantic will ignore any lightspeed_stack section (extra='ignore') (docs.pydantic.dev). In src/lightspeed_evaluation/core/config/models.py, extend the class with:
lightspeed_stack: Dict[str, Any] = Field(default_factory=dict)
to ensure the lightspeed_stack block in config/system.yaml isn’t dropped by validation.
🤖 Prompt for AI Agents
In config/system.yaml around lines 12-20, the new lightspeed_stack section will
be ignored by the Pydantic EvaluationSystemConfig because it only defines llm,
logging, and metrics_metadata (extra='ignore'); to fix, open
src/lightspeed_evaluation/core/config/models.py and add a new field declaration
lightspeed_stack: Dict[str, Any] = Field(default_factory=dict) to the
EvaluationSystemConfig (and import Dict, Any from typing and Field from
pydantic) so the lightspeed_stack block from config/system.yaml is preserved
during validation.
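A minimal sketch of that change, assuming EvaluationSystemConfig is a plain Pydantic BaseModel and that an untyped Dict[str, Any] is acceptable for now (existing fields elided):

from typing import Any, Dict

from pydantic import BaseModel, Field

class EvaluationSystemConfig(BaseModel):
    """System configuration (sketch; existing llm/logging/metrics_metadata fields elided)."""

    # Preserve the lightspeed_stack block from config/system.yaml instead of
    # letting extra="ignore" silently drop it.
    lightspeed_stack: Dict[str, Any] = Field(default_factory=dict)

A typed LightspeedStackConfig model (mirroring the LightspeedRAGConfig proposed earlier) would be the stricter option once the field set stabilizes.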
# Custom metrics
"custom:answer_correctness":
  threshold: 0.75
  type: "turn"
  description: "Correctness vs expected answer using custom LLM evaluation"
  framework: "custom"
Trailing spaces trigger yamllint.
Trim trailing whitespace after framework line.
- framework: "custom"
+ framework: "custom"
🧰 Tools
🪛 YAMLlint (1.37.1)
[error] 98-98: trailing spaces
(trailing-spaces)
🤖 Prompt for AI Agents
In config/system.yaml around lines 92 to 98, there is trailing whitespace after
the 'framework: "custom"' line which triggers yamllint; remove the trailing
spaces at the end of that line (and ensure no other trailing whitespace exists
on surrounding lines) so the YAML conforms to linting rules.
    def get_info(self) -> Dict[str, Any]:
        """Get service information."""
        response = self.session.get(f"{self.base_url}/v1/info")
        response.raise_for_status()
        return response.json()
🛠️ Refactor suggestion
Add timeout to GET request to avoid indefinite hangs.
POST has a timeout; GET should match.
- response = self.session.get(f"{self.base_url}/v1/info")
+ response = self.session.get(
+ f"{self.base_url}/v1/info", timeout=self.config.timeout
+ )
🤖 Prompt for AI Agents
In src/lightspeed_evaluation/core/llm/lightspeed_stack_client.py around lines
67-71, the GET request to "{self.base_url}/v1/info" lacks a timeout and can hang
indefinitely; update the call to pass the same timeout used by the POST requests
(e.g., timeout=self.timeout or the module-level REQUEST_TIMEOUT constant) so the
GET matches POST behavior, ensuring you use the existing timeout variable/name
used elsewhere in the class and keep response.raise_for_status() and return
response.json() as before.
Thank you. This change adds a standalone API client for the query endpoint to get the response.
It is not doing the actual integration, and the structure of the client can be improved.
But I don't want to drag this PR out and waste your time; I already have changes in place for complete integration + additional features.
So let's merge this.
@tisnik @VladimirKadlec PTAL
VladimirKadlec
left a comment
It's not an integration as the new code is not called. I guess that will be in the following PRs + some tests :)
Why is the PR named "RAG eval integration"? Are there any RAG-specific changes?
Actually, I am working on the end-to-end integration + a few other capabilities (working on tests now). Even this code will change.
No. Initially Anxhela thought that we could get the RAG context and do the evaluation, but we can't get this info as of now. Eventually this PR became the query endpoint client creation logic (which I am already covering). The title is based on the initially intended work (can be changed).
VladimirKadlec
left a comment
No. Initially Anxhela thought that we could get the RAG context and do the evaluation, but we can't get this info as of now. Eventually this PR became the query endpoint client creation logic (which I am already covering). The title is based on the initially intended work (can be changed).
OK, let's change the title to something describing the changes and merge it.
I changed the title and removed the LCORE-601 reference. Commit messages are not very clean, but we can merge.
Summary by CodeRabbit
New Features
Chores
Documentation