Add client for query endpoint #43
Conversation
Signed-off-by: Anxhela Coba <acoba@redhat.com>
Walkthrough

Adds a Lightspeed Stack client module with config-driven initialization and API calls to /v1/query and /v1/info. Updates system configuration to include a lightspeed_stack block and new environment variables. Introduces a custom metric declaration. Minor formatting change to a metrics file.

Sequence Diagram(s)

sequenceDiagram
autonumber
actor Caller
participant Client as LightspeedStackClient
participant API as Lightspeed Stack API
Caller->>Client: query(query, conversation_id?, attachments?)
Note right of Client: Build payload: query, provider, model, no_tools<br/>+ optional system_prompt, conversation_id, attachments
Client->>API: POST /v1/query (JSON, timeout)
API-->>Client: 200 JSON response
Client-->>Caller: Return parsed JSON
alt Error status
API-->>Client: 4xx/5xx
Client-->>Caller: raise_for_status() exception
end
sequenceDiagram
autonumber
actor App
participant Config as system_config
participant Client as LightspeedStackClient
App->>Client: from_system_config(system_config)
Client->>Config: Read lightspeed_stack section
Note right of Client: Apply defaults if keys missing<br/>Trim base_url whitespace/slash
Client-->>App: LightspeedStackClient instance
App->>Client: get_info()
Client->>+Config: (uses stored config)
Client-->>App: API /v1/info JSON
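For orientation only, here is a minimal sketch of what a client matching the walkthrough and diagrams above could look like. It assumes a `lightspeed_stack` config block with `base_url`, `provider`, `model`, `system_prompt`, `no_tools`, and `timeout` keys and uses `requests`; names and defaults are illustrative, not the code added in this PR.

```python
# Illustrative sketch only - field names and defaults are assumptions based on
# the walkthrough above, not the actual module added in this PR.
from typing import Any, Dict, List, Optional
import requests


class LightspeedStackClient:
    def __init__(self, base_url: str, provider: str, model: str,
                 system_prompt: Optional[str] = None,
                 no_tools: bool = False, timeout: int = 300):
        self.base_url = base_url.strip().rstrip("/")
        self.provider = provider
        self.model = model
        self.system_prompt = system_prompt
        self.no_tools = no_tools
        self.timeout = timeout
        self.session = requests.Session()
        self.session.headers.update({"Content-Type": "application/json"})

    @classmethod
    def from_system_config(cls, system_config: Dict[str, Any]) -> "LightspeedStackClient":
        # Read the lightspeed_stack block, falling back to defaults for missing keys.
        cfg = system_config.get("lightspeed_stack", {})
        return cls(
            base_url=cfg.get("base_url", "http://localhost:8080"),
            provider=cfg.get("provider", "openai"),
            model=cfg.get("model", "gpt-4o-mini"),
            system_prompt=cfg.get("system_prompt"),
            no_tools=cfg.get("no_tools", False),
            timeout=cfg.get("timeout", 300),
        )

    def query(self, query: str, conversation_id: Optional[str] = None,
              attachments: Optional[List[Dict[str, Any]]] = None) -> Dict[str, Any]:
        payload: Dict[str, Any] = {
            "query": query,
            "provider": self.provider,
            "model": self.model,
            "no_tools": self.no_tools,
        }
        if self.system_prompt:
            payload["system_prompt"] = self.system_prompt
        if conversation_id:
            payload["conversation_id"] = conversation_id
        if attachments:
            payload["attachments"] = attachments
        resp = self.session.post(f"{self.base_url}/v1/query", json=payload, timeout=self.timeout)
        resp.raise_for_status()  # 4xx/5xx surfaces as an exception, as in the diagram
        return resp.json()

    def get_info(self) -> Dict[str, Any]:
        resp = self.session.get(f"{self.base_url}/v1/info", timeout=self.timeout)
        resp.raise_for_status()
        return resp.json()
```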
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (1)
322-327: Initialize and cache the RAG manager from system config

Create the RAG manager once here and attach it to the CustomMetrics instance for reuse by `_evaluate_rag_response`.

```diff
 @classmethod
 def from_system_config(cls, system_config: Dict[str, Any]) -> "CustomMetrics":
     """Create CustomMetrics from system configuration."""
-    llm_manager = LLMManager.from_system_config(system_config)
-    return cls(llm_manager)
+    llm_manager = LLMManager.from_system_config(system_config)
+    inst = cls(llm_manager)
+    try:
+        from ..llm_managers.lightspeed_rag_llm import LightspeedRAGManager
+        inst.rag_manager = LightspeedRAGManager.from_system_config(system_config)
+    except ImportError:
+        inst.rag_manager = None
+    return inst
```
🧹 Nitpick comments (11)
lsc_eval/runner.py (2)
70-70: Double-check include_graphs source

Runner reads include_graphs from system_config.include_graphs, while config shows it under output.include_graphs. Confirm the loader flattens this, or pass system_config.output.include_graphs.
84-85: Improve actionable error hint

Consider appending the reports directory to the error reminder for quicker triage.

```diff
-    print(f"⚠️ {summary['ERROR']} evaluations had errors - check detailed report")
+    print(f"⚠️ {summary['ERROR']} evaluations had errors - check detailed report in: {output_handler.output_dir}")
```

lsc_eval/config/evaluation_data.yaml (1)
48-48: Trim trailing spaces flagged by yamllint

Remove trailing whitespace on these lines to satisfy CI linters.
Also applies to: 51-51, 55-55, 66-66, 69-69, 85-85
lsc_eval/config/system.yaml (2)
12-20: Lightspeed RAG config block added—OK

Block is reasonable. Optional: if RAG should default to the primary llm.* values, document/implement fallback when provider/model not set here to avoid duplication.
98-98: Trailing space

One trailing space flagged by yamllint. Please trim.
lsc_eval/README.md (1)
5-5: Fix MD019: extra spaces after ATX headings

Remove double spaces after "##".

```diff
-##  Key Features
+## Key Features

-##  Quick Start
+## Quick Start
```

Also applies to: 15-15
lsc_eval/src/lsc_eval/llm_managers/lightspeed_rag_llm.py (5)
34-39: Avoid per-call client/session creation; pass system_prompt override to query.

Creating a new client each request is unnecessary and leaks idle sockets. Let query accept a per-call system_prompt.

```diff
@@
-    def query(
-        self,
-        query: str,
-        conversation_id: Optional[str] = None,
-        attachments: Optional[List[Dict[str, Any]]] = None,
-    ) -> Dict[str, Any]:
+    def query(
+        self,
+        query: str,
+        conversation_id: Optional[str] = None,
+        attachments: Optional[List[Dict[str, Any]]] = None,
+        system_prompt: Optional[str] = None,
+    ) -> Dict[str, Any]:
@@
-        if self.config.system_prompt:
-            payload["system_prompt"] = self.config.system_prompt
+        if system_prompt is not None:
+            payload["system_prompt"] = system_prompt
+        elif self.config.system_prompt:
+            payload["system_prompt"] = self.config.system_prompt
@@
-        # Create a temporary config for this query if system prompt override is needed
-        if effective_system_prompt != self.rag_config.system_prompt:
-            temp_config = LightspeedRAGConfig(
-                base_url=self.rag_config.base_url,
-                provider=self.rag_config.provider,
-                model=self.rag_config.model,
-                system_prompt=effective_system_prompt,
-                no_tools=self.rag_config.no_tools,
-                timeout=self.rag_config.timeout
-            )
-            temp_client = LightspeedRAGClient(temp_config)
-            result = temp_client.query(
-                query=query,
-                conversation_id=conversation_id,
-                attachments=attachments
-            )
-        else:
-            result = self.rag_client.query(
-                query=query,
-                conversation_id=conversation_id,
-                attachments=attachments
-            )
+        result = self.rag_client.query(
+            query=query,
+            conversation_id=conversation_id,
+            attachments=attachments,
+            system_prompt=effective_system_prompt,
+        )
```

Also applies to: 60-63, 134-156
31-33: Add retry/backoff to the HTTP session.

Resilience: transient 5xx/429 should retry with backoff.

```diff
 from dataclasses import dataclass
+from requests.adapters import HTTPAdapter
+from urllib3.util.retry import Retry
@@
         self.session = requests.Session()
         self.session.headers.update({'Content-Type': 'application/json'})
+        retries = Retry(
+            total=3,
+            backoff_factor=0.5,
+            status_forcelist=[429, 500, 502, 503, 504],
+            allowed_methods=["GET", "POST"],
+        )
+        adapter = HTTPAdapter(max_retries=retries)
+        self.session.mount("http://", adapter)
+        self.session.mount("https://", adapter)
```
223-227: Avoid hard dependency on LLMManager env (e.g., OPENAI_API_KEY).

Wrapper unconditionally constructs LLMManager, which can fail in RAG-only runs. Make it optional and degrade gracefully.

```diff
-class LightspeedRAGLLMWrapper:
+class LightspeedRAGLLMWrapper:
@@
-    def __init__(self, rag_manager: LightspeedRAGManager, llm_manager: LLMManager):
+    def __init__(self, rag_manager: LightspeedRAGManager, llm_manager: Optional[LLMManager]):
@@
-        llm_manager = LLMManager.from_system_config(system_config)
-        return cls(rag_manager, llm_manager)
+        try:
+            llm_manager = LLMManager.from_system_config(system_config)
+        except Exception as e:
+            print(f"⚠️ Skipping LLMManager init for wrapper: {e}")
+            llm_manager = None
+        return cls(rag_manager, llm_manager)
```

And guard getters:

```diff
-        return self.llm_manager.get_litellm_params()
+        return self.llm_manager.get_litellm_params() if self.llm_manager else {}
@@
-        return self.llm_manager.get_config()
+        return self.llm_manager.get_config() if self.llm_manager else LLMConfig(provider="lightspeed_rag", model=self.rag_manager.rag_config.model)
```

Also applies to: 255-260
228-235: Align model reporting with LiteLLM params.

get_model_name returns a synthetic name, but get_litellm_params returns the base LLM model; this may confuse reporting.

```diff
-        return f"lightspeed_rag/{self.rag_manager.rag_config.model}"
+        return self.llm_manager.get_model_name() if self.llm_manager else f"lightspeed_rag/{self.rag_manager.rag_config.model}"
```
93-99: Use logging instead of print for connectivity check.

Prefer structured logging with levels.

```diff
-import print
+import logging
@@
-            print(f"Connected to Lightspeed RAG: {info.get('name', 'Unknown')} v{info.get('version', 'Unknown')}")
+            logging.getLogger(__name__).info(
+                "Connected to Lightspeed RAG: %s v%s",
+                info.get("name", "Unknown"),
+                info.get("version", "Unknown"),
+            )
@@
-            print(f"Warning: Could not connect to Lightspeed RAG service: {e}")
+            logging.getLogger(__name__).warning(
+                "Could not connect to Lightspeed RAG service: %s", e
+            )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- lsc_eval/README.md (3 hunks)
- lsc_eval/config/evaluation_data.yaml (2 hunks)
- lsc_eval/config/system.yaml (2 hunks)
- lsc_eval/runner.py (3 hunks)
- lsc_eval/src/lsc_eval/llm_managers/lightspeed_rag_llm.py (1 hunks)
- lsc_eval/src/lsc_eval/metrics/custom_metrics.py (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Applied to files:
lsc_eval/README.md
🧬 Code graph analysis (3)
lsc_eval/runner.py (1)
lsc_eval/src/lsc_eval/output/output_handler.py (1)
generate_reports(32-80)
lsc_eval/src/lsc_eval/llm_managers/lightspeed_rag_llm.py (3)
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (1)
LLMManager (36-173)

lsc_eval/src/lsc_eval/core/models.py (1)

TurnData (8-44)

lsc_eval/src/lsc_eval/metrics/custom_metrics.py (1)
from_system_config(323-326)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (2)
lsc_eval/src/lsc_eval/core/models.py (1)
TurnData (8-44)

lsc_eval/src/lsc_eval/llm_managers/lightspeed_rag_llm.py (4)

LightspeedRAGManager (80-212)
LightspeedRAGConfig (13-21)
generate_response (100-157)
query (34-71)
🪛 markdownlint-cli2 (0.17.2)
lsc_eval/README.md
5-5: Multiple spaces after hash on atx style heading
(MD019, no-multiple-space-atx)
15-15: Multiple spaces after hash on atx style heading
(MD019, no-multiple-space-atx)
🪛 LanguageTool
lsc_eval/README.md
[grammar] ~39-~39: There might be a mistake here.
Context: ...ation_data.yaml ``` ### Troubleshooting If you encounter `ModuleNotFoundError: N...
(QB_NEW_EN)
[grammar] ~40-~40: There might be a mistake here.
Context: ...No module named 'lsc_eval', ensure you: 1. Are running from the lsc_eval/` directo...
(QB_NEW_EN)
[grammar] ~124-~124: There might be a mistake here.
Context: ....." ``` ## RAG Evaluation ### Overview The evaluation framework includes integr...
(QB_NEW_EN)
[grammar] ~133-~133: There might be a mistake here.
Context: ...l metrics ### Setup #### Prerequisites 1. Lightspeed Stack RAG Service: Must be ...
(QB_NEW_EN)
[grammar] ~134-~134: There might be a mistake here.
Context: ...Stack RAG Service**: Must be running at http://localhost:8080 2. Environment Variables: Set `OPENAI_API...
(QB_NEW_EN)
[grammar] ~138-~138: There might be a mistake here.
Context: ...ckages are installed #### Configuration Add RAG configuration to your `system.ya...
(QB_NEW_EN)
[grammar] ~163-~163: There might be a mistake here.
Context: ...om" ``` ### Usage #### Evaluation Data Add RAG metrics to your `evaluation_data...
(QB_NEW_EN)
[grammar] ~219-~219: There might be a mistake here.
Context: ... Troubleshooting Connection Issues: - Ensure Lightspeed Stack RAG service is r...
(QB_NEW_EN)
[grammar] ~224-~224: There might be a mistake here.
Context: ...080/v1/info **Authentication Errors**: - Set appropriate API keys (OPENAI_API_KE...
(QB_NEW_EN)
🪛 YAMLlint (1.37.1)
lsc_eval/config/system.yaml
[error] 98-98: trailing spaces
(trailing-spaces)
lsc_eval/config/evaluation_data.yaml
[error] 48-48: trailing spaces
(trailing-spaces)
[error] 51-51: trailing spaces
(trailing-spaces)
[error] 55-55: trailing spaces
(trailing-spaces)
[error] 66-66: trailing spaces
(trailing-spaces)
[error] 69-69: trailing spaces
(trailing-spaces)
[error] 85-85: trailing spaces
(trailing-spaces)
🔇 Additional comments (7)
lsc_eval/runner.py (1)
51-53: Verify attribute names exist on system_config

Make sure system_config exposes llm_provider and llm_model (or adjust to flattened/nested names). If they're nested (e.g., system_config.llm.provider), this print will raise AttributeError.
lsc_eval/config/evaluation_data.yaml (3)
47-63: conv_group_3 addition looks good

Turn structure and metadata align with TurnData schema and the custom:answer_correctness metric.
67-69: Ensure metric identifier wiring for custom:rag_response_evaluation

Confirm the engine maps "custom:rag_response_evaluation" to CustomMetrics metric_name "rag_response_evaluation" (i.e., prefix stripping is in place), or adapt the mapping accordingly.
79-92: RAG turns read cleanly

Queries/contexts/expected_response are coherent for evaluating RAG quality.
lsc_eval/config/system.yaml (1)
92-104: New custom metric metadata registered—OK

Registry entry for custom:rag_response_evaluation matches the metric introduced in code.
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (1)
42-45: Confirm metric name resolution

Engine must pass "rag_response_evaluation" (without "custom:" prefix) into CustomMetrics.evaluate to hit this mapping. If not, normalize before lookup.
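If that normalization is not already in place, a small sketch of the prefix handling (hypothetical helper, assuming identifiers arrive as `framework:metric_name`):

```python
def split_metric_identifier(identifier: str) -> tuple[str, str]:
    """Split 'custom:rag_response_evaluation' into ('custom', 'rag_response_evaluation')."""
    framework, _, metric_name = identifier.partition(":")
    if not metric_name:
        raise ValueError(f"Metric '{identifier}' must be in 'framework:metric_name' format")
    return framework, metric_name


# The engine could then dispatch on framework and pass only metric_name
# (e.g., "rag_response_evaluation") into CustomMetrics.evaluate.
```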
lsc_eval/src/lsc_eval/llm_managers/lightspeed_rag_llm.py (1)
120-129: Verify Lightspeed RAG /v1/query attachments schema

We didn't find any local definition or tests for the attachments payload—please confirm against the internal API spec that the request body supports an `attachments` array with keys `"attachment_type"`, `"content"`, and `"content_type"`, and that `"configuration"` and `"text/plain"` are valid values. Applies to lines 58–64 and 120–129.
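For reference, the request body under discussion would look roughly like the sketch below; the key names mirror the snippet being reviewed, and whether the service actually accepts this shape is exactly what needs confirming against the API spec.

```python
# Hypothetical /v1/query payload reconstructed from the code under review;
# not a confirmed Lightspeed Stack API schema.
payload = {
    "query": "How do I configure the cluster?",
    "provider": "openai",
    "model": "gpt-4o-mini",
    "no_tools": False,
    "attachments": [
        {
            "attachment_type": "configuration",  # is "configuration" a valid type?
            "content": "retrieved context text ...",
            "content_type": "text/plain",        # is "text/plain" accepted?
        }
    ],
}
```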
```python
        contexts = None
        if turn_data.contexts:
            contexts = [ctx.content for ctx in turn_data.contexts]
```
Fix TurnData context access (dict vs. attribute).
TurnData.contexts elements are dicts with "content" (per models.py), not objects.
```diff
-        contexts = None
-        if turn_data.contexts:
-            contexts = [ctx.content for ctx in turn_data.contexts]
+        contexts = [ctx["content"] for ctx in turn_data.contexts] if turn_data.contexts else None
```
🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/llm_managers/lightspeed_rag_llm.py around lines 176 to
179, the code treats turn_data.contexts items as objects with a .content
attribute but per models.py they are dicts with a "content" key; update the
access to use ctx["content"] (and handle missing keys safely if needed), e.g.,
build contexts = [ctx["content"] for ctx in turn_data.contexts] only when
turn_data.contexts is truthy, or use ctx.get("content") if you prefer tolerant
behavior.
```python
    def _evaluate_rag_response(
        self,
        _conv_data: Any,
        _turn_idx: Optional[int],
        turn_data: Optional[TurnData],
        is_conversation: bool,
    ) -> Tuple[Optional[float], str]:
        """Evaluate a RAG-generated response against the original response."""
        if is_conversation:
            return None, "RAG response evaluation is a turn-level metric"

        if turn_data is None:
            return None, "TurnData is required for RAG response evaluation"

        try:
            # Try to import and use the RAG manager
            from ..llm_managers.lightspeed_rag_llm import LightspeedRAGManager

            # Create a RAG manager with default config (should be configurable)
            from ..llm_managers.lightspeed_rag_llm import LightspeedRAGConfig
            rag_config = LightspeedRAGConfig()
            rag_manager = LightspeedRAGManager(rag_config)

            # Generate RAG response
            contexts = None
            if turn_data.contexts:
                contexts = [ctx['content'] if isinstance(ctx, dict) else ctx.content for ctx in turn_data.contexts]

            rag_response, _ = rag_manager.generate_response(
                query=turn_data.query,
                contexts=contexts
            )

            # Print the RAG response for debugging/visibility
            print(f"RAG Response: {rag_response}")

            # Compare RAG response with the original response
            params = EvaluationPromptParams(
                metric_name="RAG response quality",
                query=turn_data.query,
                response=rag_response,
                expected_response=turn_data.response,  # Use original response as expected
                contexts=turn_data.contexts if turn_data.contexts else None,
                scale="0.0 to 1.0",
            )
            prompt = self._create_evaluation_prompt(params)

            # Add specific instructions for RAG evaluation
            prompt += "\n\nYou are comparing a RAG-generated response against an original response.\n"
            prompt += "Consider:\n"
            prompt += "- Factual accuracy and relevance\n"
            prompt += "- Use of provided context\n"
            prompt += "- Completeness compared to original response\n"
            prompt += "- Overall quality and helpfulness\n"
            prompt += f"\nRAG Response: {rag_response}\n"
            prompt += f"Original Response (Expected): {turn_data.response}"

            # Make LLM call and parse response
            llm_response = self._call_llm(prompt)
            score, reason = self._parse_score_response(llm_response)

            if score is None:
                return (
                    None,
                    f"Could not parse score from LLM response: {llm_response[:100]}...",
                )

            return score, f"RAG response evaluation: {score:.2f} - {reason}"

        except ImportError:
            return None, "LightspeedRAGManager not available"
        except Exception as e:
            return None, f"RAG response evaluation failed: {str(e)}"
```
🛠️ Refactor suggestion
Don’t instantiate the RAG client per turn; use configured manager once
Current code constructs a new LightspeedRAGManager with defaults on every evaluation, ignoring system.yaml and adding overhead. Reuse a manager created from system config and fail fast if not configured.
Apply this diff to reuse a pre-configured manager and remove the per-call import/instantiation:
def _evaluate_rag_response(
self,
_conv_data: Any,
_turn_idx: Optional[int],
turn_data: Optional[TurnData],
is_conversation: bool,
) -> Tuple[Optional[float], str]:
"""Evaluate a RAG-generated response against the original response."""
if is_conversation:
return None, "RAG response evaluation is a turn-level metric"
if turn_data is None:
return None, "TurnData is required for RAG response evaluation"
- try:
- # Try to import and use the RAG manager
- from ..llm_managers.lightspeed_rag_llm import LightspeedRAGManager
-
- # Create a RAG manager with default config (should be configurable)
- from ..llm_managers.lightspeed_rag_llm import LightspeedRAGConfig
- rag_config = LightspeedRAGConfig()
- rag_manager = LightspeedRAGManager(rag_config)
-
- # Generate RAG response
- contexts = None
- if turn_data.contexts:
- contexts = [ctx['content'] if isinstance(ctx, dict) else ctx.content for ctx in turn_data.contexts]
-
- rag_response, _ = rag_manager.generate_response(
- query=turn_data.query,
- contexts=contexts
- )
-
- # Print the RAG response for debugging/visibility
- print(f"RAG Response: {rag_response}")
-
+ try:
+ # Reuse a pre-configured RAG manager (set in from_system_config)
+ rag_manager = getattr(self, "rag_manager", None)
+ if rag_manager is None:
+ return None, "LightspeedRAGManager not configured"
+
+ # Generate RAG response
+ contexts_list = (
+ [ctx["content"] if isinstance(ctx, dict) else str(ctx) for ctx in (turn_data.contexts or [])]
+ or None
+ )
+ rag_response, _ = rag_manager.generate_response(
+ query=turn_data.query,
+ contexts=contexts_list,
+ )
+
# Compare RAG response with the original response
params = EvaluationPromptParams(
metric_name="RAG response quality",
query=turn_data.query,
response=rag_response,
expected_response=turn_data.response, # Use original response as expected
contexts=turn_data.contexts if turn_data.contexts else None,
scale="0.0 to 1.0",
)
prompt = self._create_evaluation_prompt(params)
# Add specific instructions for RAG evaluation
prompt += "\n\nYou are comparing a RAG-generated response against an original response.\n"
prompt += "Consider:\n"
prompt += "- Factual accuracy and relevance\n"
prompt += "- Use of provided context\n"
prompt += "- Completeness compared to original response\n"
prompt += "- Overall quality and helpfulness\n"
- prompt += f"\nRAG Response: {rag_response}\n"
- prompt += f"Original Response (Expected): {turn_data.response}"
+ prompt += f"\nRAG Response: {rag_response}\n"
+ prompt += f"Original Response (Expected): {turn_data.response}"
# Make LLM call and parse response
llm_response = self._call_llm(prompt)
score, reason = self._parse_score_response(llm_response)
if score is None:
return (
None,
f"Could not parse score from LLM response: {llm_response[:100]}...",
)
return score, f"RAG response evaluation: {score:.2f} - {reason}"
-
- except ImportError:
- return None, "LightspeedRAGManager not available"
     except Exception as e:
         return None, f"RAG response evaluation failed: {str(e)}"
🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/metrics/custom_metrics.py around lines 248-321, the
code currently imports and creates a new LightspeedRAGManager per turn which
ignores system.yaml and is costly; replace that with the pre-configured manager:
remove the in-method imports and instantiation, instead reference the manager
configured on the evaluator (e.g., self.rag_manager or self._rag_manager) and
fail fast if it's not present (return an appropriate error message), reuse that
manager to call generate_response with the same context extraction logic, and
keep the rest of the evaluation flow unchanged.
asamal4
left a comment
Perhaps I misunderstood LCORE-601, but I thought we were collecting context (through vector DBs), whereas this implementation collects the response.
FYI, I am working on lightspeed-stack API integration (with some additional flexibilities), so this is helpful either way.
|
|
||
| return score, f"Custom answer correctness: {score:.2f} - {reason}" | ||
|
|
||
| def _evaluate_rag_response( |
Do we need to add another custom metric? We can utilize any of the metrics from RAGAS.
The current custom metric implementation is just for reference, in case we really need to add a different metric.
Discussed in huddle. We assumed the RAG context already existed in the lightspeed-stack response. We will need to implement a method of storing RAG context in lightspeed-stack.
No, we do not need a custom metric here. This was only to differentiate between using RAG or not, but we can change the logic to utilize the existing metrics.
Signed-off-by: Anxhela Coba <acoba@redhat.com>
Actionable comments posted: 17
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py (1)
129-163: Regex tests are solid; add a list-arg negative case to prevent regressions.

Current implementation could previously pass on list mismatches. Add these for coverage:

```python
def test_list_argument_mismatch_should_fail(self):
    expected = [[{"tool_name": "oc_get", "arguments": {"oc_get_args": ["namespaces"]}}]]
    actual = [[{"tool_name": "oc_get", "arguments": {"oc_get_args": ["nodes"]}}]]
    assert compare_tool_calls(expected, actual) is False

def test_numeric_string_equivalence(self):
    expected = [[{"tool_name": "create_pod", "arguments": {"port": 80}}]]
    actual = [[{"tool_name": "create_pod", "arguments": {"port": "80"}}]]
    assert compare_tool_calls(expected, actual) is True
```

Optionally, add a test to lock in the "extra keys fail" behavior if that's desired.
I can push a follow-up commit with these tests once you confirm desired strictness for extra keys.
config/system.yaml (1)
126-148: Output block schema doesn't match OutputConfig

YAML uses base_directory and nested formats, but OutputConfig expects output_dir and csv_format/json_format/txt_format. This will drop settings unless the loader translates keys.
Option A: align model to YAML with aliases and a post-parse normalizer:
-from pydantic import BaseModel, Field, field_validator +from pydantic import BaseModel, Field, field_validator, ConfigDict, model_validator @@ class OutputConfig(BaseModel): """Output configuration.""" - output_dir: str = "./eval_output" + model_config = ConfigDict(extra="ignore") + output_dir: str = Field("./eval_output", validation_alias="base_directory") base_filename: str = "evaluation" - csv_format: bool = True - json_format: bool = True - txt_format: bool = True + csv_format: bool = True + json_format: bool = True + txt_format: bool = True include_graphs: bool = True csv_columns: List[str] = Field(default_factory=list) - visualization_figsize: List[int] = Field(default=[12, 8]) + visualization_figsize: List[int] = Field(default_factory=lambda: [12, 8]) visualization_dpi: int = 300 default_turn_metrics_metadata: Optional[Dict[str, Any]] = None default_conversation_metrics_metadata: Optional[Dict[str, Any]] = None + + # Accept optional nested 'formats' block from YAML + formats: Optional[Dict[str, bool]] = None + + @model_validator(mode="after") + def _apply_formats(self): + if self.formats: + self.csv_format = bool(self.formats.get("csv", self.csv_format)) + self.json_format = bool(self.formats.get("json", self.json_format)) + self.txt_format = bool(self.formats.get("txt", self.txt_format)) + return selfOption B: change YAML keys to match OutputConfig (rename base_directory -> output_dir and move formats.csv/json/txt to top-level csv_format/json_format/txt_format).
src/lightspeed_evaluation/core/config/models.py (2)
205-216: EvaluationSystemConfig missing keys present in YAML

environment, visualization, and lightspeed_rag exist in YAML but aren't modeled here; extra keys may be ignored or raise depending on loader. Also, llm is a Dict despite having LLMConfig defined.
-class EvaluationSystemConfig(BaseModel): +class EvaluationSystemConfig(BaseModel): """System configuration for evaluation.""" - llm: Dict[str, Any] + model_config = ConfigDict(extra="ignore") + llm: LLMConfig | Dict[str, Any] logging: Dict[str, Any] = Field(default_factory=dict) metrics_metadata: Dict[str, Any] = Field(default_factory=dict) output: Dict[str, Any] = Field(default_factory=dict) + lightspeed_rag: Optional["LightspeedRAGConfig"] = None + environment: Dict[str, Any] = Field(default_factory=dict) + visualization: Dict[str, Any] = Field(default_factory=dict)
218-232: OutputConfig: align with YAML and avoid mutable default

Map base_directory/formats to existing fields and switch to default_factory for the figsize list.
-from pydantic import BaseModel, Field, field_validator +from pydantic import BaseModel, Field, field_validator, ConfigDict, model_validator @@ class OutputConfig(BaseModel): """Output configuration.""" - output_dir: str = "./eval_output" + model_config = ConfigDict(extra="ignore") + output_dir: str = Field("./eval_output", validation_alias="base_directory") base_filename: str = "evaluation" csv_format: bool = True json_format: bool = True txt_format: bool = True include_graphs: bool = True csv_columns: List[str] = Field(default_factory=list) - visualization_figsize: List[int] = Field(default=[12, 8]) + visualization_figsize: List[int] = Field(default_factory=lambda: [12, 8]) visualization_dpi: int = 300 default_turn_metrics_metadata: Optional[Dict[str, Any]] = None default_conversation_metrics_metadata: Optional[Dict[str, Any]] = None + formats: Optional[Dict[str, bool]] = None + + @model_validator(mode="after") + def _apply_formats(self): + if self.formats: + self.csv_format = bool(self.formats.get("csv", self.csv_format)) + self.json_format = bool(self.formats.get("json", self.json_format)) + self.txt_format = bool(self.formats.get("txt", self.txt_format)) + return selfsrc/lightspeed_evaluation/core/output/visualization.py (1)
228-244: Do not fill missing scores with 0 in boxplot; it skews distributions.Padding NaNs to equalize length is fine; replacing with zeros distorts stats. Also prefer a palette over BASE_COLORS to avoid white boxes and limited colors.
- bplot = ax.boxplot( - results_df.fillna(0), - sym=".", - widths=0.5, - vert=False, - patch_artist=True, - ) - - labels = results_df.columns - - colors = list(BASE_COLORS.keys())[: len(labels)] - for patch, color in zip(bplot["boxes"], colors): - patch.set_facecolor(color) + labels = list(results_df.columns) + data = [results_df[c].dropna().to_numpy() for c in labels] + bplot = ax.boxplot( + data, + sym=".", + widths=0.5, + vert=False, + patch_artist=True, + ) + + colors = sns.color_palette("husl", len(labels)) + for patch, color in zip(bplot["boxes"], colors): + patch.set_facecolor(color)src/lightspeed_evaluation/core/metrics/ragas.py (1)
79-83: Don’t rely on hard-coded result keys; infer the metric column from the result.Ragas column names vary by version; this can raise KeyError (e.g., “nv_context_relevance”). Select the sole numeric column returned for a single-metric run.
- result = evaluate(dataset, metrics=[metric_instance]) - df = result.to_pandas() - score = df[result_key].iloc[0] - return score, f"Ragas {metric_name}: {score:.2f}" + result = evaluate(dataset, metrics=[metric_instance]) + df = result.to_pandas() + numeric_cols = df.select_dtypes(include="number").columns.tolist() + if len(numeric_cols) != 1: + raise ValueError(f"Unexpected Ragas output columns: {df.columns.tolist()}") + score_col = numeric_cols[0] + score = float(df[score_col].iloc[0]) + return score, f"Ragas {metric_name}: {score:.2f}"
🧹 Nitpick comments (44)
.gitignore (2)
183-184: Anchor to repo root to avoid unintended matches

If `.deepeval/` is intended only at the repo root, anchor it to prevent ignoring similarly named directories in nested paths.

-.deepeval/
+/.deepeval/
186-187: Scope the WIP ignore pattern

`wip*/` currently matches any nested dir starting with `wip`. If you only mean top-level WIP dirs, anchor it. If broader scope is desired, keep as is.

-wip*/
+/wip*/

.github/workflows/bandit.yaml (3)
11-12: Permissions correctly include security-events: write; consider least-privilege for actions and PR-fork behavior
- Good: security-events: write is required to upload SARIF to Code Scanning.
- Optional: actions: read is only needed for private repos; if this repo is public, you can drop it.
- Heads-up: On pull_request events from public forks, GITHUB_TOKEN cannot get write on security-events, so uploads will fail with “Resource not accessible by integration.” Consider skipping the upload step (or the whole action) for forked PRs or allowing it to continue-on-error in that case. (docs.github.com)
Apply (if repo is public) and add fork guard:

 permissions:
   security-events: write
-  actions: read
   contents: read
   pull-requests: read

And gate/soften on forks (see step suggestion below). (docs.github.com)
16-16: Pin the Bandit action to a specific release for reproducibilityUse the latest v1 tag release (v1.0.1 as of Aug 1, 2025) or a commit SHA to avoid surprise changes from floating @v1.
- - uses: PyCQA/bandit-action@v1
+ - uses: PyCQA/bandit-action@v1.0.1

Also consider skipping or soft-failing on forked PRs where uploads will lack permissions:
- - uses: PyCQA/bandit-action@v1.0.1 + - uses: PyCQA/bandit-action@v1.0.1 + if: ${{ github.event_name != 'pull_request' || github.event.pull_request.head.repo.fork == false }} + continue-on-error: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.repo.fork }}This keeps uploads working on pushes while avoiding red PRs from forks. (github.com)
18-18: Targets set to src is fine; consider severity/confidence and excludes to tune signal-to-noiseBandit-action supports severity/confidence tuning and excludes. Recommend establishing a policy (e.g., block on medium/high, warn on low) and excluding virtualenvs/vendor dirs.
 with:
-  targets: "src"
+  targets: "src"
+  severity: "medium"
+  confidence: "medium"
+  exclude: ".git,__pycache__,.tox,.eggs,*.egg,.venv,venv,dist,build"

Note: exit_zero removal means findings will now fail the job—intentional and good. If you need a gradual rollout, lower severity/confidence first or add a Bandit baseline. (bandit.readthedocs.io, github.com)
archive/example_result/README.md (1)
14-14: Verified relative link resolves correctly; optional wording tweak
The link `../../eval_data/ocp_doc_qna-edited.parquet` exists from both the repo root and this README's directory.

Optional nit: pluralize "OCP doc" for clarity:
- QnA evaluation dataset: [QnAs from OCP doc](../../eval_data/ocp_doc_qna-edited.parquet) + QnA evaluation dataset: [QnAs from OCP docs](../../eval_data/ocp_doc_qna-edited.parquet)src/lightspeed_evaluation/core/config/loader.py (1)
1-1: Docstring rename LGTM; align class docstring for consistency.The module docstring now says “Evaluation Framework” but the class docstring (Line 194) still mentions “LSC”. Suggest aligning.
Outside the changed lines, update the class docstring at Line 194 to:
"""Configuration loader for Evaluation Framework."""lsc_agent_eval/README.md (2)
13-13: Tighten wording for tool_eval semantics.Current: “..., Only regex pattern check (case sensitive) is done for argument value”
Suggested: “..., only regex pattern checks (case-sensitive) are applied to argument values.”- - `tool_eval`: Tool call evaluation comparing expected vs actual tool calls with arguments, Only regex pattern check (case sensitive) is done for argument value + - `tool_eval`: Tool call evaluation comparing expected vs actual tool calls with arguments; only regex pattern checks (case-sensitive) are applied to argument values.
69-69: Clarify expected_tool_calls rules and grammar.Add plural “values” and note strictness about extra args if that’s intended by the implementation.
-- `expected_tool_calls`: Expected tool call sequences (list of lists) with tool_name and arguments (for tool_eval), Regex pattern check is done for argument value +- `expected_tool_calls`: Expected tool call sequences (list of lists) with tool_name and arguments (for tool_eval). Regex pattern checks are applied to argument values (case-sensitive). By default, extra unexpected argument keys in the actual call will cause a mismatch.lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1)
110-115: Decide on strictness for extra keys.Currently, any unexpected actual argument keys fail the match. If optional args are acceptable, consider logging and ignoring extras by default or gating via a flag.
Would you like me to update README (Line 69) to document this strictness and add tests for extra-key behavior?
config/system.yaml (1)
98-98: Trim trailing whitespace

YAML lint flags trailing spaces at Line 98.

archive/pyproject.toml (1)
7-11: Top-level formatting/lint keys are misplaced

target-version, lint.pydocstyle.convention, and line-length aren't under tool sections and will be ignored.
Move them under the appropriate tool tables:
-target-version = "py311" -lint.pydocstyle.convention = "google" -line-length = 100 +[tool.black] +target-version = ["py311"] +line-length = 100 + +[tool.pydocstyle] +convention = "google"src/lightspeed_evaluation/core/config/models.py (2)
172-203: LLMConfig improvements are fine; add extra-handling for robustness

Looks good. Consider adding model_config = ConfigDict(extra="ignore") to tolerate provider-specific extras without failing parsing.
class LLMConfig(BaseModel): """LLM configuration from system configuration.""" + model_config = ConfigDict(extra="ignore")
82-92: Metric requirement validation: consider stricter checks and clearer messages

Strip metric strings before validation and include metric values in errors for easier debugging. Optional, but helps operators.
- for metric in v: + for metric in (m.strip() for m in v): if not metric or ":" not in metric: raise ValueError( f'Metric "{metric}" must be in format "framework:metric_name"' ) return v @@ - errors.append( - f"TurnData {turn_data.turn_id}: Metric '{metric}' requires contexts" - ) + errors.append(f"Turn {turn_data.turn_id}: '{metric}' requires non-empty contexts") @@ - errors.append( - f"TurnData {turn_data.turn_id}: " - f"Metric '{metric}' requires expected_response" - ) + errors.append(f"Turn {turn_data.turn_id}: '{metric}' requires expected_response")Also applies to: 93-131
pyproject.toml (2)
8-8: Use SPDX license identifier for better tooling compatibility

Apache should be expressed as an SPDX id or file reference.
-license = {text = "Apache"} +license = {text = "Apache-2.0"} # or: +# license = {file = "LICENSE"}
22-29: Use a compatible torch version range

Replace the hard pin:

- "torch==2.7.0",
+ "torch>=2.6,<2.8",

to allow patch releases (e.g., 2.7.1) and let pip select the correct wheel across Python 3.10–3.12 on Linux, macOS, and Windows.
README.md (2)
3-3: Grammar fix

Pluralize "application".
-A comprehensive framework/tool to evaluate GenAI application. +A comprehensive framework/tool to evaluate GenAI applications.
133-137: Align prerequisites with actual deps (httpx, not requests)

The project depends on `httpx`. Mentioning `requests` here is misleading.

-3. **Dependencies**: Ensure `requests` and other required packages are installed
+3. **Dependencies**: Ensure required packages are installed (e.g., `httpx`)
116-135: Avoid returning raw per-metric score arrays to keep JSON small
calculate_detailed_statsretainsscoreslists inside each metric’s stats. These get serialized in the JSON summary and can bloat artifacts. Keep only derivedscore_statistics.if stats["scores"]: scores = stats["scores"] stats["score_statistics"] = { "mean": statistics.mean(scores), "median": statistics.median(scores), "std": statistics.stdev(scores) if len(scores) > 1 else 0.0, "min": min(scores), "max": max(scores), "count": len(scores), } else: stats["score_statistics"] = { "mean": 0.0, "median": 0.0, "std": 0.0, "min": 0.0, "max": 0.0, "count": 0, } + # Do not retain raw scores in the returned structure to minimize JSON size + stats.pop("scores", None)
10-17: Remove unused dataclass or document its purpose
`EvaluationScope` isn't referenced in this module. Consider removing or moving it to where it's used.
21-27: Validate YAML top-level type before model construction

If the YAML is empty or not a list, constructing models will raise confusing errors. Guard early.
with open(data_path, "r", encoding="utf-8") as f: raw_data = yaml.safe_load(f) - # Convert raw data to Pydantic models + # Ensure expected structure + if not isinstance(raw_data, list): + raise ValueError("Evaluation data must be a YAML list of conversation groups") + + # Convert raw data to Pydantic models try: evaluation_data = [EvaluationData(**item) for item in raw_data]
35-54: Use logging instead of print for library code (keep emojis if you like)

Printing inside validators makes programmatic use noisy and hard to control. Prefer a named logger.
+import logging +logger = logging.getLogger("lightspeed_evaluation") @@ - print(f"📋 Evaluation data loaded: {len(evaluation_data)} conversations") + logger.info("📋 Evaluation data loaded: %d conversations", len(evaluation_data)) @@ - print("❌ Validation Errors:") + logger.error("❌ Validation Errors:") for error in self.validation_errors: - print(f" • {error}") + logger.error(" • %s", error) return False @@ - print("✅ All data validation passed") + logger.info("✅ All data validation passed")src/lightspeed_evaluation/core/output/visualization.py (4)
161-165: Swap mislabeled axes in pass-rates chart.Bars are vertical (x=metrics, y=pass_rate). Labels are reversed.
- ax.set_xlabel("Pass Rate (%)", fontsize=12) - ax.set_ylabel("Metrics", fontsize=12) + ax.set_xlabel("Metrics", fontsize=12) + ax.set_ylabel("Pass Rate (%)", fontsize=12)
420-422: Heatmap x-axis label should be “Metrics”.Colorbar already conveys “Pass Rate (%)”.
- ax.set_xlabel("Pass Rate (%)", fontsize=12, fontweight="bold") + ax.set_xlabel("Metrics", fontsize=12, fontweight="bold")
144-176: Use configured figsize/dpi consistently when saving pass-rates chart.Currently hard-coded (12,8) and dpi=300; align with self.figsize/self.dpi.
- _, ax = plt.subplots(figsize=(12, 8)) + _, ax = plt.subplots(figsize=tuple(self.figsize), dpi=self.dpi) ... - plt.savefig(filename, dpi=300, bbox_inches="tight") + plt.savefig(filename, dpi=self.dpi, bbox_inches="tight")
79-83: Avoid early return when by_metric is empty; still generate pie/distribution.Right now no graphs are produced at all in that case. Consider gating only pass-rates and heatmap on by_metric, but still emit score distribution (if scores exist) and status breakdown.
If you want, I can propose a small refactor diff that keeps the pie/distribution paths active when by_metric is empty.
src/lightspeed_evaluation/core/metrics/ragas.py (1)
115-133: Guard against missing turn data for response relevancy.Currently falls back to empty strings, producing meaningless evaluations.
def _evaluate_response_relevancy( self, _conv_data: Any, _turn_idx: Optional[int], turn_data: Optional[TurnData], is_conversation: bool, ) -> Tuple[Optional[float], str]: """Evaluate response relevancy.""" - if is_conversation: + if is_conversation: return None, "Response relevancy is a turn-level metric" - - query, response, _ = self._extract_turn_data(turn_data) + if turn_data is None: + return None, "TurnData is required for response relevancy" + if not turn_data.query or not turn_data.response: + return None, "Query and response cannot be empty for response relevancy" + query, response, _ = self._extract_turn_data(turn_data)README-generate-answers.md (2)
36-38: Clarify model selection sentence.Minor grammar polish.
-You use the `models_to_evaluate` list to select which of the available models will be used for answer generation. -All models included in this list must also be defined in the `models` section **and** properly configured and available in the running lightspeed-core service. +Use the `models_to_evaluate` list to select which available models will be used for answer generation. +All models in this list must also be defined in the `models` section and available in the running Lightspeed Core service.
43-55: Fix list style to match markdownlint and add missing punctuation in CSV example.Switch to dashes and add a question mark.
-* **CSV** – must contain two columns: `id` and `question`. + - **CSV** – must contain two columns: `id` and `question`. ... -* **Parquet** – Lightspeed evaluation parquet format is supported. -* **JSON** – Lightspeed evaluation JSON format is supported. + - **Parquet** – Lightspeed evaluation parquet format is supported. + - **JSON** – Lightspeed evaluation JSON format is supported.- 3,How do I manage RBAC in OpenShift Virtualization + 3,How do I manage RBAC in OpenShift Virtualization?src/lightspeed_evaluation/drivers/__init__.py (1)
1-5: Package re-export looks good.Public alias lightspeed_evaluation.drivers.EvaluationDriver now resolves; docstring branding nit: consider “Lightspeed” for consistency elsewhere.
-"""Evaluation drivers for LightSpeed Evaluation Framework.""" +"""Evaluation drivers for Lightspeed Evaluation Framework."""src/lightspeed_evaluation/drivers/evaluation.py (2)
65-66: Nit: user-facing label says "TurnData"

Use "Turn" for display clarity.
- return f"TurnData {self.turn_id}: {self.conv_data.conversation_group_id}" + return f"Turn {self.turn_id}: {self.conv_data.conversation_group_id}"
140-140: Optional: use structured logging instead of print

Consider routing through ConfigLoader.logger when available to unify logs with the rest of the framework.
- print("✅ Evaluation Driver initialized") + if getattr(self.config_loader, "logger", None): + self.config_loader.logger.info("Evaluation Driver initialized") + else: + print("✅ Evaluation Driver initialized")src/lightspeed_evaluation/core/config/__init__.py (1)
7-16: Optional: annotate all for linting clarity.Some linters prefer explicit typing on module exports.
-__all__ = [ +__all__: list[str] = [ "ConfigLoader", "setup_environment_variables", "SystemConfig", "EvaluationData", "EvaluationResult", "TurnData", "LLMConfig", "DataValidator", ]src/lightspeed_evaluation/runner/evaluation.py (3)
18-28: Docstring return type mismatch with signature.The function returns Optional[Dict[str, int]] (None on failure). Align the docstring for clarity.
- Returns: - dict: Summary statistics with keys TOTAL, PASS, FAIL, ERROR + Returns: + Optional[Dict[str, int]]: Summary statistics with keys TOTAL, PASS, FAIL, ERROR; + None on failure.
38-51: Avoid double validation of evaluation data.DataValidator.load_evaluation_data() validates once, and EvaluationDriver.run_evaluation() validates again. Consider validating in a single place to reduce duplicate work and logs.
108-118: Optional: check file existence before early env setup.setup_environment_variables() gracefully falls back on missing files, but calling it before existence checks can emit avoidable warnings when the path is wrong. Consider verifying the file first, then invoking setup_environment_variables().
src/lightspeed_evaluation/core/metrics/custom.py (1)
248-321: Optional: trim sensitive output.Avoid printing full model outputs in libraries. If needed, log at DEBUG with truncation.
src/lightspeed_evaluation/core/__init__.py (1)
4-12: Consolidate duplicate imports from .config.Minor cleanup to avoid split imports.
-from .config import ( - ConfigLoader, - SystemConfig, - EvaluationData, - setup_environment_variables, - DataValidator, -) -from .config import EvaluationResult, TurnData, LLMConfig +from .config import ( + ConfigLoader, + SystemConfig, + EvaluationData, + setup_environment_variables, + DataValidator, + EvaluationResult, + TurnData, + LLMConfig, +)src/lightspeed_evaluation/core/llm/lightspeed_rag_llm.py (4)
13-21: Consider extending RAG config for auth and custom headers.Helpful for secured services and proxies.
@dataclass class LightspeedRAGConfig: """Configuration for Lightspeed RAG integration.""" base_url: str = "http://localhost:8080" provider: str = "openai" model: str = "gpt-4o-mini" system_prompt: Optional[str] = None no_tools: bool = False timeout: int = 300 + api_key: Optional[str] = None + headers: Optional[Dict[str, str]] = None
100-156: Avoid per-call client recreation for system_prompt overrides.Pass system_prompt into query() to reduce overhead and keep connection pooling.
- def generate_response( + def generate_response( self, query: str, contexts: Optional[List[str]] = None, conversation_id: Optional[str] = None, system_prompt: Optional[str] = None ) -> Tuple[str, str]: @@ - # Create a temporary config for this query if system prompt override is needed - if effective_system_prompt != self.rag_config.system_prompt: - temp_config = LightspeedRAGConfig( - base_url=self.rag_config.base_url, - provider=self.rag_config.provider, - model=self.rag_config.model, - system_prompt=effective_system_prompt, - no_tools=self.rag_config.no_tools, - timeout=self.rag_config.timeout - ) - temp_client = LightspeedRAGClient(temp_config) - result = temp_client.query( - query=query, - conversation_id=conversation_id, - attachments=attachments - ) - else: - result = self.rag_client.query( - query=query, - conversation_id=conversation_id, - attachments=attachments - ) + # Single client path, pass override directly + result = self.rag_client.query( + query=query, + conversation_id=conversation_id, + attachments=attachments, + system_prompt=effective_system_prompt, + )And update query() signature to accept the override:
- def query( + def query( self, query: str, conversation_id: Optional[str] = None, - attachments: Optional[List[Dict[str, Any]]] = None, + attachments: Optional[List[Dict[str, Any]]] = None, + system_prompt: Optional[str] = None, ) -> Dict[str, Any]: @@ - if self.config.system_prompt: - payload["system_prompt"] = self.config.system_prompt + if system_prompt is not None: + payload["system_prompt"] = system_prompt + elif self.config.system_prompt: + payload["system_prompt"] = self.config.system_prompt
24-33: Add a close() to release HTTP resources.Provide an explicit way to close the Session (useful for long runs/tests).
class LightspeedRAGClient: @@ def __init__(self, config: LightspeedRAGConfig): """Initialize the client with configuration.""" @@ self.session.headers.update({'Content-Type': 'application/json'}) + def close(self) -> None: + """Close underlying HTTP session.""" + try: + self.session.close() + except Exception: + pass
1-7: Optional: decouple typing import for LLMConfig.To avoid tight coupling on llm.manager, import LLMConfig from config models (or type-only).
-from .manager import LLMManager, LLMConfig +from .manager import LLMManager +from ..config.models import LLMConfig # or: if TYPE_CHECKING: from ..config.models import LLMConfigsrc/lightspeed_evaluation/__init__.py (2)
1-14: Docstring mentions Runner but it’s not exported.Either export the CLI runner symbol or adjust wording to “CLI runner”.
Main components: - EvaluationDriver: Runs complete evaluation -- Runner: Simple runner for command-line usage +- CLI runner available via the package’s command-line entrypoints
21-32: Group LLMConfig import with other config symbols.Tiny import cleanup for consistency.
-from .core.config import ( +from .core.config import ( ConfigLoader, SystemConfig, EvaluationData, TurnData, EvaluationResult, DataValidator, -) -from .core.config import LLMConfig + LLMConfig, +)
```yaml
  turn_metrics_metadata:
    "ragas:faithfulness":
```
💡 Verification agent
🧩 Analysis chain
Strip trailing spaces to satisfy YAML lint
There are trailing spaces on the listed lines.
Quick fixer script:
Also applies to: 16-16, 31-31, 33-33, 37-37, 48-48, 51-51, 55-55, 66-66, 69-69, 71-71, 75-75, 77-77, 85-85
🏁 Script executed:
#!/bin/bash
# Remove trailing spaces and re-run yamllint
sed -i 's/[ \t]\+$//' config/evaluation_data.yaml
yamllint config/evaluation_data.yaml || trueLength of output: 697
Fix remaining YAML lint errors
- Add a YAML document start marker (`---`) at the top of config/evaluation_data.yaml to resolve the document-start warning.
- Wrap or shorten lines 80, 81, 83, 84, 88, 90, and 91 to ≤ 80 characters (or adjust the line-length rule in your yamllint config) to clear the line-length errors.
🧰 Tools
🪛 YAMLlint (1.37.1)
[error] 10-10: trailing spaces
(trailing-spaces)
[error] 11-11: trailing spaces
(trailing-spaces)
[error] 12-12: trailing spaces
(trailing-spaces)
🤖 Prompt for AI Agents
In config/evaluation_data.yaml around lines 10-12 (and affecting linting of
lines 80,81,83,84,88,90,91), the file is missing a YAML document start marker
and contains several lines exceeding the 80-character limit; add a top-of-file
document start marker ("---") as the first line, then wrap or shorten the long
lines at 80, 81, 83, 84, 88, 90 and 91 to be ≤80 characters (or alternatively
update the yamllint line-length rule if you intend to allow longer lines).
config/system.yaml
Outdated
# Lightspeed RAG Configuration TODO: Perhaps we don't need separate from above?
lightspeed_rag:
  base_url: "http://localhost:8080"  # Lightspeed Stack RAG API endpoint
  provider: "openai"  # LLM provider for RAG queries
  model: "gpt-4o-mini"  # Model to use for RAG queries
  system_prompt: "You are a helpful assistant with access to a knowledge base. Use the provided context to answer questions accurately and comprehensively."
  no_tools: false  # Whether to bypass tools and MCP servers
  timeout: 300  # Request timeout for RAG queries
lightspeed_rag block likely ignored by current config models
EvaluationSystemConfig doesn’t define a lightspeed_rag field, so this section will be dropped or cause validation issues depending on loader settings. Either unify with llm or add a dedicated config model and wire it into EvaluationSystemConfig.
Proposed minimal model addition (in src/lightspeed_evaluation/core/config/models.py):
+from pydantic import BaseModel, Field, field_validator, ConfigDict, model_validator
+
+class LightspeedRAGConfig(BaseModel):
+ """Lightspeed RAG client configuration."""
+ base_url: str = Field(..., description="Lightspeed RAG API endpoint")
+ provider: str = Field(..., description="LLM provider for RAG queries")
+ model: str = Field(..., description="Model for RAG queries")
+ system_prompt: str = ""
+ no_tools: bool = False
+ timeout: int = Field(300, ge=1)
+
+ @field_validator("base_url", "provider", "model")
+ @classmethod
+ def _non_empty(cls, v: str) -> str:
+ if not v or not v.strip():
+ raise ValueError("Value cannot be empty")
+ return v.strip()
+
class EvaluationSystemConfig(BaseModel):
"""System configuration for evaluation."""
- llm: Dict[str, Any]
+ llm: Dict[str, Any]
+ lightspeed_rag: Optional[LightspeedRAGConfig] = None
+ environment: Dict[str, Any] = Field(default_factory=dict)
+ visualization: Dict[str, Any] = Field(default_factory=dict)
+ model_config = ConfigDict(extra="ignore")
🤖 Prompt for AI Agents
In config/system.yaml around lines 12 to 20, the new lightspeed_rag block is not
defined in your config models and will be ignored or fail validation; add a
LightspeedRagConfig dataclass/model (e.g., in
src/lightspeed_evaluation/core/config/models.py) with fields matching base_url,
provider, model, system_prompt, no_tools, timeout, then add a lightspeed_rag:
Optional[LightspeedRagConfig] field to EvaluationSystemConfig and update any
config loader/validation code to parse/validate it and wire the new field into
places that consume RAG settings (or alternatively merge these keys into the
existing llm config if that fits design).
        expected_str = str(expected_value)
        actual_str = str(actual_value)

        # Use regex search for flexible matching
        # This is a quick work-around, enhance this to use both regex & exact match.
        try:
            if not re.search(expected_str, actual_str):
                logger.debug(
                    "Argument value mismatch for '%s': pattern '%s' not found in '%s'",
                    key,
                    expected_str,
                    actual_str,
                )
                return False
        except re.error as e:
            logger.debug(
                "Argument value mismatch for '%s': expected %s, got %s",
                key,
                expected_value,
                actual_value,
                "Invalid regex pattern '%s' for key '%s': %s", expected_str, key, e
            )
            # If regex is invalid, fail the comparison
            return False
Regex-on-stringified containers causes false positives (critical).
Casting lists/dicts to strings and running re.search leads to char-class and substring matches that can incorrectly PASS (e.g., expected ["namespaces"] vs actual ["nodes"]). Also conflates numbers/booleans via substring matches (e.g., 80 matches 8080).
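As a standalone illustration (snippet written for this comment, not code from the PR), the stringified comparison produces exactly these false positives:

import re

# str(["namespaces"]) == "['namespaces']"; as a regex the brackets become a character
# class, so the pattern matches any single one of those characters in the target string.
expected = ["namespaces"]
actual = ["nodes"]
print(bool(re.search(str(expected), str(actual))))  # True - false positive

# Plain substring semantics also conflate scalars:
print(bool(re.search(str(80), str(8080))))  # True - "80" is found inside "8080"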
Apply a type-aware matcher to prevent these errors while keeping regex support for strings:
def _compare_tool_arguments(expected: dict[str, Any], actual: dict[str, Any]) -> bool:
- """Compare tool arguments name & value (regex pattern for the value)."""
+ """Compare tool arguments by name and value.
+ - Strings in expected are treated as regex patterns against actual.
+ - Lists are compared element-wise with the same semantics.
+ - Dicts are compared recursively.
+ - Non-string scalars compare by value equality."""
@@
- # Compare each expected argument
+ # Helper for matching values with type-awareness
+ def _match_value(exp: Any, act: Any, key_path: str) -> bool:
+ if isinstance(exp, dict):
+ if not isinstance(act, dict):
+ logger.debug("Type mismatch for '%s': expected dict, got %s", key_path, type(act).__name__)
+ return False
+ return _compare_tool_arguments(exp, act)
+ if isinstance(exp, (list, tuple)):
+ if not isinstance(act, (list, tuple)) or len(exp) != len(act):
+ logger.debug("List length/type mismatch for '%s': expected %s items, got %s", key_path, len(exp), len(act) if isinstance(act, (list, tuple)) else type(act).__name__)
+ return False
+ for idx, (e_item, a_item) in enumerate(zip(exp, act)):
+ if not _match_value(e_item, a_item, f"{key_path}[{idx}]"):
+ return False
+ return True
+ if isinstance(exp, str):
+ try:
+ return re.search(exp, str(act)) is not None
+ except re.error as e:
+ logger.debug("Invalid regex pattern '%s' for key '%s': %s", exp, key_path, e)
+ return False
+ return str(exp) == str(act)
+
+ # Compare each expected argument
for key, expected_value in expected.items():
if key not in actual:
logger.debug("Missing argument key: '%s'", key)
return False
-
- actual_value = actual[key]
-
- expected_str = str(expected_value)
- actual_str = str(actual_value)
-
- # Use regex search for flexible matching
- # This is a quick work-around, enhance this to use both regex & exact match.
- try:
- if not re.search(expected_str, actual_str):
- logger.debug(
- "Argument value mismatch for '%s': pattern '%s' not found in '%s'",
- key,
- expected_str,
- actual_str,
- )
- return False
- except re.error as e:
- logger.debug(
- "Invalid regex pattern '%s' for key '%s': %s", expected_str, key, e
- )
- # If regex is invalid, fail the comparison
- return False
+ actual_value = actual[key]
+ if not _match_value(expected_value, actual_value, key):
+ logger.debug(
+ "Argument value mismatch for '%s': expected '%s', actual '%s'",
+ key, expected_value, actual_value
+ )
+ return False
🤖 Prompt for AI Agents
In lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py
around lines 89 to 108, the current code stringifies containers and runs regex
on them which causes false positives; replace that with a type-aware matcher: if
expected_value is a str only attempt regex matching against actual_value when
actual_value is also a str (catch regex errors), otherwise compare string
equality; if expected_value is an int/float/bool perform strict equality against
actual_value; if expected_value is a dict, ensure every key in expected exists
in actual and recursively compare values; if expected_value is a list, ensure
each element in expected is found in actual by recursively comparing elements
(order-insensitive/subset match); implement recursion to handle nested
structures and return False immediately on mismatch, keeping regex support
exclusively for string-to-string comparisons.
lightspeed-eval = "lightspeed_evaluation.runner.evaluation:main"
generate_answers = "generate_answers.generate_answers:main"
💡 Verification agent
🧩 Analysis chain
Fix packaging: console script points to a module that isn’t shipped
generate_answers = "generate_answers.generate_answers:main" creates an entry point in the wheel, but the wheel only packages src/lightspeed_evaluation. This will fail at runtime after installation from a wheel. Either include generate_answers in the wheel or drop the console script from distribution (dev-only).
Given the prior intent to keep it out of wheels, remove the script here and expose it via a dev-only uv script instead.
[project.scripts]
lightspeed-eval = "lightspeed_evaluation.runner.evaluation:main"
-generate_answers = "generate_answers.generate_answers:main"
Dev-only (outside this diff, add to pyproject.toml):
[tool.uv.scripts]
generate-answers = "python -m generate_answers.generate_answers"
Run to confirm packaging state and avoid broken entry points:
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Checking that entry points resolve within source tree..."
fd -a 'generate_answers.py' -t f src || true
rg -n 'module\s*=\s*"generate_answers\.generate_answers:main"|generate_answers\.generate_answers:main' pyproject.toml
echo "Verifying the packaged packages list..."
rg -n '^\s*packages\s*=\s*\[.*\]' -n -C2 pyproject.toml
echo "Ensure the main CLI target exists:"
fd -a 'evaluation.py' -t f src/lightspeed_evaluation/runner
rg -nP 'def\s+main\s*\(' src/lightspeed_evaluation/runner/evaluation.py || true
Length of output: 987
Remove the broken generate_answers console script
The generate_answers = "generate_answers.generate_answers:main" entry under [project.scripts] points to a package not included in the wheel (packages = ["src/lightspeed_evaluation"]), causing runtime failures after install. Remove it and expose the CLI as a dev-only script instead.
[project.scripts]
lightspeed-eval = "lightspeed_evaluation.runner.evaluation:main"
-generate_answers = "generate_answers.generate_answers:main"
Add to pyproject.toml under dev tools:
[tool.uv.scripts]
generate-answers = "python -m generate_answers.generate_answers"
🤖 Prompt for AI Agents
In pyproject.toml around lines 45 to 46, the console script entry
"generate_answers = \"generate_answers.generate_answers:main\"" points to a
package not included in the wheel (packages = ["src/lightspeed_evaluation"]) and
must be removed; delete that line from [project.scripts] and instead add a
dev-only entry under your dev tooling by adding a [tool.uv.scripts] section with
an entry named generate-answers that invokes the module via the python -m form
so the CLI is available for development but not installed in the distributed
wheel.
    def get_info(self) -> Dict[str, Any]:
        """Get service information."""
        response = self.session.get(f"{self.base_url}/v1/info")
        response.raise_for_status()
        return response.json()
Missing timeout and JSON error handling in get_info.
GET can hang and JSON parsing can fail; add timeout and robust parse.
def get_info(self) -> Dict[str, Any]:
"""Get service information."""
- response = self.session.get(f"{self.base_url}/v1/info")
+ response = self.session.get(
+ f"{self.base_url}/v1/info",
+ timeout=self.config.timeout,
+ )
response.raise_for_status()
- return response.json()
+ try:
+ return response.json()
+ except ValueError as e:
+ raise RuntimeError("Invalid JSON from Lightspeed RAG /v1/info") from e
🤖 Prompt for AI Agents
In src/lightspeed_evaluation/core/llm/lightspeed_rag_llm.py around lines 73 to
77, the get_info method currently calls session.get and response.json without a
timeout or robust error handling; change the call to include a sensible timeout
(e.g. timeout=5 or configurable constant) and wrap the request and JSON parsing
in try/except blocks to handle requests.exceptions (including timeouts and HTTP
errors) and json.JSONDecodeError; call response.raise_for_status() after a
successful request, attempt response.json() inside a try and on JSON errors
raise or return a clear structured error, and ensure any caught exceptions are
logged or re-raised with contextual message so callers can handle failures
safely.
        return result.get("response", ""), result.get("conversation_id", "")
💡 Verification agent
🧩 Analysis chain
Key names in /v1/query response.
You access result["response"] and ["conversation_id"]. Please confirm these keys match the API; some services use "answer" or "output".
🌐 Web query:
For Lightspeed Stack RAG /v1/query responses, what are the exact JSON keys that return the assistant's text and the conversation/session identifier?
💡 Result:
Most Lightspeed Stack RAG /v1/query implementations return the assistant text under the "answer" key and the conversation/session identifier as "session_id" (sometimes called "conversation_id" in related components). [1][2]
Sources:
- Example response fields showing "answer" and "session_id". [1]
- lightspeed-stack project docs referencing /v1/query and conversation/session identifiers. [2]
Update /v1/query key accesses
The RAG API returns the assistant text under "answer" and the session identifier under "session_id". Replace in src/lightspeed_evaluation/core/llm/lightspeed_rag_llm.py (at lines 157–158):
- return result.get("response", ""), result.get("conversation_id", "")
+ return result.get("answer", ""), result.get("session_id", "")
🤖 Prompt for AI Agents
In src/lightspeed_evaluation/core/llm/lightspeed_rag_llm.py around lines 157 to
158, the code currently reads result.get("response", "") and
result.get("conversation_id", "") but the RAG /v1/query response uses "answer"
for assistant text and "session_id" for the session identifier; update those key
accesses to result.get("answer", "") and result.get("session_id", "")
respectively so the function returns the correct values.
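If a given deployment's response shape is uncertain, a tolerant parser along these lines could bridge both key sets while the contract settles; only "answer"/"session_id" are documented above, and the "response"/"conversation_id" fallback plus the helper name are illustrative assumptions, not confirmed API fields:

from typing import Any, Dict, Tuple

def parse_query_response(result: Dict[str, Any]) -> Tuple[str, str]:
    """Extract assistant text and conversation/session id, tolerating either key set."""
    # Documented keys first, assumed legacy keys as fallback.
    text = result.get("answer") or result.get("response") or ""
    conv_id = result.get("session_id") or result.get("conversation_id") or ""
    return text, conv_id

# Example with a stubbed payload:
print(parse_query_response({"answer": "Use 'oc get pods -A'.", "session_id": "abc-123"}))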
            conversation_id: Optional conversation ID
            system_prompt: Optional system prompt override

        Returns:
            Tuple of (response, conversation_id)
        """
        contexts = None
        if turn_data.contexts:
            contexts = [ctx.content for ctx in turn_data.contexts]

        return self.generate_response(
            query=turn_data.query,
            contexts=contexts,
            conversation_id=conversation_id,
            system_prompt=system_prompt
        )
💡 Verification agent
🧩 Analysis chain
Sanity-check all ctx.content usages across repo.
There are similar patterns elsewhere; run this to catch any remaining attr-access on dict contexts.
🏁 Script executed:
#!/bin/bash
# Find suspicious attribute access on contexts that should be dicts
rg -n -C2 -g '!**/dist/**' -g '!**/build/**' -P '\bcontexts?\b.*\n.*\.\s*content\b'
echo "---"
# Show preferred dict access sites
rg -n -C2 -g '!**/dist/**' -g '!**/build/**' -P '\[\s*["'\'']content["'\'']\s*\]'
Length of output: 1648
Use conditional access for ctx.content in lightspeed_rag_llm
In src/lightspeed_evaluation/core/llm/lightspeed_rag_llm.py replace the unconditional attribute access
contexts = [ctx.content for ctx in turn_data.contexts]
with the safe pattern used elsewhere:
contexts = [
ctx['content'] if isinstance(ctx, dict) else ctx.content
for ctx in turn_data.contexts
]This prevents errors when a context is a dict.
🤖 Prompt for AI Agents
In src/lightspeed_evaluation/core/llm/lightspeed_rag_llm.py around lines 170 to
186, the list comprehension unconditionally accesses ctx.content which will
raise if a context is a dict; change it to a conditional access that checks the
type and uses ctx['content'] when ctx is a dict and ctx.content otherwise so
contexts becomes a safe list comprehension handling both dict and object context
entries.
        contexts = None
        if turn_data.contexts:
            contexts = [ctx.content for ctx in turn_data.contexts]
TurnData.contexts access bug (dict vs attr).
TurnData.contexts is a list of dicts; using ctx.content will raise. Use dict access or fallback to str.
- if turn_data.contexts:
- contexts = [ctx.content for ctx in turn_data.contexts]
+ if turn_data.contexts:
+ contexts = [
+ (ctx["content"] if isinstance(ctx, dict) else str(ctx))
+ for ctx in turn_data.contexts
+ ]
🤖 Prompt for AI Agents
In src/lightspeed_evaluation/core/llm/lightspeed_rag_llm.py around lines 176 to
179, TurnData.contexts is a list of dicts so accessing ctx.content will fail;
replace that comprehension with one that extracts the content via dict access
and falls back to a string (e.g. for each ctx use ctx.get("content", str(ctx))
or if not a dict use str(ctx)) so contexts becomes a list of strings.
    def _evaluate_rag_response(
        self,
        _conv_data: Any,
        _turn_idx: Optional[int],
        turn_data: Optional[TurnData],
        is_conversation: bool,
    ) -> Tuple[Optional[float], str]:
        """Evaluate a RAG-generated response against the original response."""
        if is_conversation:
            return None, "RAG response evaluation is a turn-level metric"

        if turn_data is None:
            return None, "TurnData is required for RAG response evaluation"

        try:
            # Try to import and use the RAG manager
            from ..llm.lightspeed_rag_llm import LightspeedRAGManager

            # Create a RAG manager with default config (should be configurable)
            from ..llm.lightspeed_rag_llm import LightspeedRAGConfig
            rag_config = LightspeedRAGConfig()
            rag_manager = LightspeedRAGManager(rag_config)

            # Generate RAG response
            contexts = None
            if turn_data.contexts:
                contexts = [ctx['content'] if isinstance(ctx, dict) else ctx.content for ctx in turn_data.contexts]

            rag_response, _ = rag_manager.generate_response(
                query=turn_data.query,
                contexts=contexts
            )

            # Print the RAG response for debugging/visibility
            print(f"RAG Response: {rag_response}")

            # Compare RAG response with the original response
            params = EvaluationPromptParams(
                metric_name="RAG response quality",
                query=turn_data.query,
                response=rag_response,
                expected_response=turn_data.response,  # Use original response as expected
                contexts=turn_data.contexts if turn_data.contexts else None,
                scale="0.0 to 1.0",
            )
            prompt = self._create_evaluation_prompt(params)

            # Add specific instructions for RAG evaluation
            prompt += "\n\nYou are comparing a RAG-generated response against an original response.\n"
            prompt += "Consider:\n"
            prompt += "- Factual accuracy and relevance\n"
            prompt += "- Use of provided context\n"
            prompt += "- Completeness compared to original response\n"
            prompt += "- Overall quality and helpfulness\n"
            prompt += f"\nRAG Response: {rag_response}\n"
            prompt += f"Original Response (Expected): {turn_data.response}"

            # Make LLM call and parse response
            llm_response = self._call_llm(prompt)
            score, reason = self._parse_score_response(llm_response)

            if score is None:
                return (
                    None,
                    f"Could not parse score from LLM response: {llm_response[:100]}...",
                )

            return score, f"RAG response evaluation: {score:.2f} - {reason}"

        except ImportError:
            return None, "LightspeedRAGManager not available"
        except Exception as e:
            return None, f"RAG response evaluation failed: {str(e)}"
Critical: RAG manager is constructed without configuration; likely to fail.
LightspeedRAGConfig appears to require fields like base_url/provider/model. Instantiating it with no args will raise, and the metric will always error. Also, contexts extraction assumes objects with .content despite validated dicts, and printing full RAG responses can leak PII.
Apply this diff to build the RAG manager from system config, fix contexts extraction, and avoid printing full responses:
@@
- try:
- # Try to import and use the RAG manager
- from ..llm.lightspeed_rag_llm import LightspeedRAGManager
-
- # Create a RAG manager with default config (should be configurable)
- from ..llm.lightspeed_rag_llm import LightspeedRAGConfig
- rag_config = LightspeedRAGConfig()
- rag_manager = LightspeedRAGManager(rag_config)
-
- # Generate RAG response
- contexts = None
- if turn_data.contexts:
- contexts = [ctx['content'] if isinstance(ctx, dict) else ctx.content for ctx in turn_data.contexts]
-
- rag_response, _ = rag_manager.generate_response(
- query=turn_data.query,
- contexts=contexts
- )
-
- # Print the RAG response for debugging/visibility
- print(f"RAG Response: {rag_response}")
+ try:
+ # Build RAG manager from system configuration
+ from ..llm.lightspeed_rag_llm import LightspeedRAGManager
+
+ system_config = getattr(self, "system_config", None)
+ if not system_config:
+ return None, "LightspeedRAGManager requires lightspeed_rag configuration (system_config missing)"
+
+ rag_manager = LightspeedRAGManager.from_system_config(system_config)
+
+ # Generate RAG response
+ contexts_list = [ctx.get("content", str(ctx)) for ctx in (turn_data.contexts or [])]
+ contexts = contexts_list or None
+
+ rag_response, _ = rag_manager.generate_response(
+ query=turn_data.query,
+ contexts=contexts,
+ )
@@
- # Add specific instructions for RAG evaluation
+ # Add specific instructions for RAG evaluation
prompt += "\n\nYou are comparing a RAG-generated response against an original response.\n"
prompt += "Consider:\n"
prompt += "- Factual accuracy and relevance\n"
prompt += "- Use of provided context\n"
prompt += "- Completeness compared to original response\n"
prompt += "- Overall quality and helpfulness\n"
- prompt += f"\nRAG Response: {rag_response}\n"
- prompt += f"Original Response (Expected): {turn_data.response}"
+ # Note: the main comparison data is already embedded in the structured prompt above.
And update the constructor to carry system_config through, plus the factory:
@@
- def __init__(self, llm_manager: LLMManager):
+ def __init__(self, llm_manager: LLMManager, system_config: Optional[Dict[str, Any]] = None):
@@
- self.model_name = llm_manager.get_model_name()
- self.litellm_params = llm_manager.get_litellm_params()
+ self.model_name = llm_manager.get_model_name()
+ self.litellm_params = llm_manager.get_litellm_params()
+ self.system_config = system_config
@@
@classmethod
def from_system_config(cls, system_config: Dict[str, Any]) -> "CustomMetrics":
"""Create CustomMetrics from system configuration."""
llm_manager = LLMManager.from_system_config(system_config)
- return cls(llm_manager)
+ return cls(llm_manager, system_config)
Signed-off-by: Anxhela Coba <acoba@redhat.com>
Signed-off-by: Anxhela Coba <acoba@redhat.com>
Actionable comments posted: 6
🧹 Nitpick comments (3)
src/lightspeed_evaluation/core/llm/lightspeed_stack_client.py (2)
27-29: Set Accept header for consistency with JSON API.
Minor, but helps with proxies/gateways.
- self.session.headers.update({"Content-Type": "application/json"})
+ self.session.headers.update(
+ {"Content-Type": "application/json", "Accept": "application/json"}
+ )
64-65: Harden JSON decoding error path.
Surface clearer errors on invalid JSON.
- return response.json()
+ try:
+ return response.json()
+ except ValueError as e:
+ raise RuntimeError(f"Invalid JSON response: {e}") from e
src/lightspeed_evaluation/core/metrics/custom.py (1)
45-45: Prefer logging over print for library code.
Avoids noisy stdout in consumers.
+import logging
@@
- print(f"✅ Custom Metrics initialized: {self.model_name}")
+ logger = logging.getLogger(__name__)
+ logger.info("Custom Metrics initialized: %s", self.model_name)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
config/system.yaml (2 hunks)
src/lightspeed_evaluation/core/llm/lightspeed_stack_client.py (1 hunks)
src/lightspeed_evaluation/core/metrics/custom.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/lightspeed_evaluation/core/llm/lightspeed_stack_client.py (1)
src/lightspeed_evaluation/core/metrics/custom.py (1)
from_system_config(247-250)
🪛 GitHub Actions: Black
src/lightspeed_evaluation/core/metrics/custom.py
[error] 1-1: Black formatting check failed (command: 'uv run black --check .'). This file would be reformatted by Black.
🪛 GitHub Actions: Pydocstyle
src/lightspeed_evaluation/core/llm/lightspeed_stack_client.py
[error] 36-36: pydocstyle D212: Multi-line docstring summary should start at the first line (public method 'query').
[error] 77-77: pydocstyle D212: Multi-line docstring summary should start at the first line (public method 'from_system_config').
🪛 GitHub Actions: Python linter
src/lightspeed_evaluation/core/llm/lightspeed_stack_client.py
[error] 4-4: Pylint C0411: standard import 'typing.Any' should be placed before third party import 'requests' (wrong-import-order)
[error] 5-5: Pylint C0411: standard import 'dataclasses.dataclass' should be placed before third party import 'requests' (wrong-import-order)
🪛 YAMLlint (1.37.1)
config/system.yaml
[error] 98-98: trailing spaces
(trailing-spaces)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: mypy
🔇 Additional comments (3)
src/lightspeed_evaluation/core/metrics/custom.py (2)
42-43: Mapping entry for custom:answer_correctness looks good.
Aligned with system.yaml.
245-251: Reformat src/lightspeed_evaluation/core/metrics/custom.py with Black
CI indicates this file isn’t Black-formatted; install Black locally, run black src/lightspeed_evaluation/core/metrics/custom.py, and then commit the formatted changes.
config/system.yaml (1)
92-98: Custom metric metadata aligns with code.
Threshold/type/framework look correct.
# Lightspeed Stack Configuration
lightspeed_stack:
  base_url: "http://localhost:8080"  # Lightspeed Stack API endpoint
  provider: "openai"  # LLM provider for queries
  model: "gpt-4o-mini"  # Model to use for queries
  system_prompt: "You are a helpful assistant with access to a knowledge base. Use the provided context to answer questions accurately and comprehensively."
  no_tools: false  # Whether to bypass tools and MCP servers
  timeout: 300  # Request timeout for queries
💡 Verification agent
🧩 Analysis chain
lightspeed_stack block addition looks consistent with client defaults.
Ensure the loader/model preserves this section (not dropped by validation).
🏁 Script executed:
#!/bin/bash
# Inspect usage and model support for `lightspeed_stack`
rg -nC2 '\blightspeed_stack\b' -S
rg -nP '(?s)class\s+EvaluationSystemConfig\b.*' -C5 src | sed -n '1,200p'
Length of output: 2256
Add lightspeed_stack to the system config model
EvaluationSystemConfig only defines llm, logging, and metrics_metadata, so by default Pydantic will ignore any lightspeed_stack section (extra='ignore') (docs.pydantic.dev). In src/lightspeed_evaluation/core/config/models.py, extend the class with:
lightspeed_stack: Dict[str, Any] = Field(default_factory=dict)
to ensure the lightspeed_stack block in config/system.yaml isn’t dropped by validation.
🤖 Prompt for AI Agents
In config/system.yaml around lines 12-20, the new lightspeed_stack section will
be ignored by the Pydantic EvaluationSystemConfig because it only defines llm,
logging, and metrics_metadata (extra='ignore'); to fix, open
src/lightspeed_evaluation/core/config/models.py and add a new field declaration
lightspeed_stack: Dict[str, Any] = Field(default_factory=dict) to the
EvaluationSystemConfig (and import Dict, Any from typing and Field from
pydantic) so the lightspeed_stack block from config/system.yaml is preserved
during validation.
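A minimal sketch of that change, assuming EvaluationSystemConfig is a plain Pydantic BaseModel and that an untyped Dict[str, Any] is acceptable for now (existing fields elided):

from typing import Any, Dict

from pydantic import BaseModel, Field

class EvaluationSystemConfig(BaseModel):
    """System configuration (sketch; existing llm/logging/metrics_metadata fields elided)."""

    # Preserve the lightspeed_stack block from config/system.yaml instead of
    # letting extra="ignore" silently drop it.
    lightspeed_stack: Dict[str, Any] = Field(default_factory=dict)

A typed LightspeedStackConfig model (mirroring the LightspeedRAGConfig proposed earlier) would be the stricter option once the field set stabilizes.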
# Custom metrics
"custom:answer_correctness":
  threshold: 0.75
  type: "turn"
  description: "Correctness vs expected answer using custom LLM evaluation"
  framework: "custom"
Trailing spaces trigger yamllint.
Trim trailing whitespace after framework line.
- framework: "custom"
+ framework: "custom"
🧰 Tools
🪛 YAMLlint (1.37.1)
[error] 98-98: trailing spaces
(trailing-spaces)
🤖 Prompt for AI Agents
In config/system.yaml around lines 92 to 98, there is trailing whitespace after
the 'framework: "custom"' line which triggers yamllint; remove the trailing
spaces at the end of that line (and ensure no other trailing whitespace exists
on surrounding lines) so the YAML conforms to linting rules.
    def get_info(self) -> Dict[str, Any]:
        """Get service information."""
        response = self.session.get(f"{self.base_url}/v1/info")
        response.raise_for_status()
        return response.json()
🛠️ Refactor suggestion
Add timeout to GET request to avoid indefinite hangs.
POST has a timeout; GET should match.
- response = self.session.get(f"{self.base_url}/v1/info")
+ response = self.session.get(
+ f"{self.base_url}/v1/info", timeout=self.config.timeout
+ )
🤖 Prompt for AI Agents
In src/lightspeed_evaluation/core/llm/lightspeed_stack_client.py around lines
67-71, the GET request to "{self.base_url}/v1/info" lacks a timeout and can hang
indefinitely; update the call to pass the same timeout used by the POST requests
(e.g., timeout=self.timeout or the module-level REQUEST_TIMEOUT constant) so the
GET matches POST behavior, ensuring you use the existing timeout variable/name
used elsewhere in the class and keep response.raise_for_status() and return
response.json() as before.
Thank you. This change adds a standalone API client for the query endpoint to get the response.
It is not doing the actual integration, and the structure of the client can be improved.
But I don't want to drag this PR out and waste your time; I already have changes in place for complete integration + additional features.
So let's merge this.
@tisnik @VladimirKadlec PTAL
VladimirKadlec
left a comment
It's not an integration as the new code is not called. I guess that will be in the following PRs + some tests :)
Why is the PR named "RAG eval integration"? Are there any RAG-specific changes?
Actually, I am working on the end-to-end integration + a few other capabilities (working on tests now). Even this code will change.
No. Initially Anxhela thought that we could get the RAG context and do the evaluation, but we can't get this info as of now. Eventually this PR became the query endpoint client creation logic (which I am already covering). The title is based on the initially intended work (can be changed).
VladimirKadlec
left a comment
No. Initially Anxhela thought that we could get the RAG context and do the evaluation, but we can't get this info as of now. Eventually this PR became the query endpoint client creation logic (which I am already covering). The title is based on the initially intended work (can be changed).
OK, let's change the title to something describing the changes and merge it.
I changed the title and removed the LCORE-601 reference. Commit messages are not very clean, but we can merge.
Summary by CodeRabbit
New Features
Chores
Documentation