Conversation
Walkthrough

README.md rewritten to document an API-centric evaluation workflow with toggleable API modes, an expanded YAML configuration schema (llm, api, metrics_metadata, output, visualization), an enriched metrics taxonomy (including tool_eval), detailed evaluation data and tool-call structures, installation/usage scenarios, and output/reporting guidance.
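For readers skimming the walkthrough, a minimal sketch of a configuration with those sections might look like the following. The section names (`llm`, `api`, `metrics_metadata`, `output`, `visualization`) and the metric identifiers come from this review; the remaining field names and values are illustrative assumptions, not the project's authoritative schema.

```yaml
# Illustrative sketch only; field names outside this review are assumptions.
llm:
  provider: openai            # Judge-LLM provider
  model: gpt-4o-mini          # Judge-LLM model

api:
  enabled: true               # Toggle real-time data collection via the API
  api_base: http://localhost:8080
  endpoint_type: streaming    # "streaming" or "query"
  timeout: 300                # seconds

metrics_metadata:
  # Metric identifiers mentioned in this review; the grouping keys are assumed.
  turn_metrics:
    - "ragas:faithfulness"
    - "custom:answer_correctness"
    - "custom:tool_eval"
  conversation_metrics:
    - "deepeval:conversation_completeness"

output:
  formats: [json, csv, txt]   # Report formats named in the walkthrough (key name assumed)

visualization:
  enabled: true               # Graph/PNG generation (key name assumed)
```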
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
actor User
participant Evaluator
participant LLM as LLM/API Provider
participant Store as DataStore
participant Metrics
participant Visualizer
User->>Evaluator: Start evaluation (config + api.enabled?)
alt API enabled
Evaluator->>LLM: Request response (model, timeout, retries)
LLM-->>Evaluator: Response + tool_calls + context
Evaluator->>Store: Persist generated turns
else API disabled
User->>Evaluator: Supply static evaluation dataset
end
Evaluator->>Metrics: Compute turn & conversation metrics (including tool_eval)
Metrics-->>Evaluator: Scores, statuses, reasoning
Evaluator->>Store: Write reports (JSON/CSV/TXT)
Evaluator->>Visualizer: Generate graphs/PNG
Visualizer-->>User: Render visualizations
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 3
🧹 Nitpick comments (5)
README.md (5)
5-5: Polish the WIP banner for clarity and tone.

```diff
-**This is WIP; We are actively working to add more features, fix any issues & add more examples. Please give a try, provide your feedback & report any bug.**
+**This is a WIP. We’re actively adding features, fixing issues, and expanding examples. Please give it a try, share feedback, and report bugs.**
```
44-44: Tighten section title wording.

```diff
-#### 1. API-Enabled Real-time actual data collection
+#### 1. API-Enabled Real-time data collection
```
46-46: Fix typos in environment variable setup comments.

```diff
-# Set equired environment variable(s) for both Judge-LLM and API authentication (for MCP)
+# Set required environment variable(s) for both Judge-LLM and API authentication (for MCP)
```

```diff
-# Set equired environment variable(s) for Judge-LLM
+# Set required environment variable(s) for Judge-LLM
```

Also applies to: 60-60
183-189: Fix brand capitalization (“OpenShift”).

```diff
- query: What is Openshift Virtualization?
+ query: What is OpenShift Virtualization?
```
205-206: Clarify data persistence wording and capitalization.

```diff
-- **Data Save**: Update original data which can be used with api disabled
+- **Data persistence**: Updates the original evaluation data so it can be reused with API disabled
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
README.md (4 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.
Applied to files:
README.md
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Applied to files:
README.md
🪛 markdownlint-cli2 (0.17.2)
README.md
226-226: Table pipe style
Expected: leading_and_trailing; Actual: no_leading_or_trailing; Missing leading pipe
(MD055, table-pipe-style)
226-226: Table pipe style
Expected: leading_and_trailing; Actual: no_leading_or_trailing; Missing trailing pipe
(MD055, table-pipe-style)
226-226: Table column count
Expected: 5; Actual: 1; Too few cells, row will be missing data
(MD056, table-column-count)
226-226: Tables should be surrounded by blank lines
(MD058, blanks-around-tables)
50688ba to 0545eb7
Actionable comments posted: 0
♻️ Duplicate comments (1)
README.md (1)
200-206: Document API-enabled behavior: fresh call per turn; pre-filled data ignored/overwritten.

Align README with implementation and prior guidance so users aren’t surprised when pre-filled response/tool_calls are replaced.
Apply:
```diff
 #### With API Enabled (`api.enabled: true`)
 - **Real-time data generation**: Queries are sent to external API
 - **Dynamic responses**: `response` and `tool_calls` fields populated by API
+- **Fresh calls per turn**: Every turn triggers a new API call; any pre-filled `response`/`tool_calls` are ignored and overwritten to ensure consistency across runs
 - **Conversation context**: Conversation context is maintained across turns
 - **Authentication**: Use `API_KEY` environment variable
-- **Data persistence**: Writes `response`/`tool_calls` back to the original evaluation data so it can be reused with API disabled
+- **Data persistence**: Writes back and overwrites `response`/`tool_calls` in the original evaluation data for reuse when API is disabled
```
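To make the overwrite behavior concrete, a hypothetical turn could look like this before and after an API-enabled run. Field names follow the evaluation data structure discussed in this review; `turn_id`, the tool name, and the argument keys are invented for illustration.

```yaml
# Before an API-enabled run (values illustrative; pre-filled data is ignored):
- turn_id: 1                                    # field name assumed
  query: What is OpenShift Virtualization?
  response: "stale answer from an earlier run"  # will be replaced
  tool_calls: []                                # will be replaced

# After the run, the fresh API output is written back, so the same file can
# later be reused with api.enabled: false:
- turn_id: 1
  query: What is OpenShift Virtualization?
  response: "fresh answer returned by the API"
  tool_calls:
    - - tool_name: retrieve_docs                # hypothetical tool call
        arguments: {topic: "OpenShift Virtualization"}
```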
🧹 Nitpick comments (5)
README.md (5)
46-47: Expand acronym “MCP” on first use.

Spell out to avoid ambiguity.
```diff
-# Set required environment variable(s) for both Judge-LLM and API authentication (for MCP)
+# Set required environment variable(s) for both Judge-LLM and API authentication (for Model Context Protocol, MCP)
```
35-40: Make provider-agnostic in Quick Start.

Clarify that the env var depends on chosen provider (OpenAI, Watsonx, etc.) to reduce confusion.
```diff
-# Set required environment variable(s) for Judge-LLM
-export OPENAI_API_KEY="your-key"
+# Set required environment variable(s) for your Judge-LLM provider
+# e.g., OpenAI:
+export OPENAI_API_KEY="your-key"
+# or IBM Watsonx:
+# export WATSONX_API_KEY="your-key"
```
236-245: Clarify nested structure semantics for `expected_tool_calls`.

Briefly describe list-of-lists meaning (steps/turns vs parallel calls) to prevent schema confusion.
```diff
-#### Tool Call Structure
+#### Tool Call Structure
+Each turn’s `expected_tool_calls` is a list of steps; every step is a list of tool calls expected within that step (i.e., `list[list[dict]]`).
```
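A small sketch may help readers see the nesting: the outer list holds steps, each inner list holds the call(s) expected within that step. The tool names and argument keys below are hypothetical, not taken from the project.

```yaml
expected_tool_calls:
  # step 1: a single expected call
  - - tool_name: list_namespaces
      arguments: {}
  # step 2: two calls expected within the same step
  - - tool_name: get_pod_logs
      arguments: {namespace: "demo"}
    - tool_name: get_events
      arguments: {namespace: "demo"}
```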
58-66: Cross-reference required fields for static mode.

Remind users which fields must be present when API is disabled.
```diff
 #### 2. Static Data Evaluation (API Disabled)
 ...
-# Pre-fill response, contexts & tool_calls data in YAML
+# Pre-fill response, contexts & tool_calls data in YAML
+# Note: Some metrics require additional fields, e.g., `expected_response` for `custom:answer_correctness`
+# and `expected_tool_calls` for `custom:tool_eval` (see examples below).
```
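For orientation, a static-mode turn carrying the fields those metrics need might look like the sketch below. The field names (`query`, `response`, `contexts`, `expected_response`, `expected_tool_calls`) appear in this review; the values and the tool/argument names are invented.

```yaml
# Illustrative static-mode turn (api.enabled: false); values are invented.
- query: What is OpenShift Virtualization?
  response: "OpenShift Virtualization runs virtual machines alongside containers."
  contexts:
    - "OpenShift Virtualization is an add-on that lets you run and manage VMs on OpenShift."
  expected_response: "An add-on for running and managing virtual machines on OpenShift."  # needed by custom:answer_correctness
  expected_tool_calls:                          # needed by custom:tool_eval
    - - tool_name: retrieve_docs                # hypothetical tool name
        arguments: {topic: "virtualization"}
```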
5-5: Tone tweak: replace “WIP” banner with a badge or softer phrasing.

Keeps main-branch README polished while still signaling active development.
```diff
-**This is a WIP. We’re actively adding features, fixing issues, and expanding examples. Please give it a try, share feedback, and report bugs.**
+**Actively evolving**: We’re adding features and examples. Try it out and share feedback or bugs.
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
README.md (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.
Applied to files:
README.md
🔇 Additional comments (3)
README.md (3)
64-65: Good fix: use `contexts` (plural).

This aligns with the schema and resolves the prior comment.
108-112: Allowed values confirmed — "streaming" and "query".

Parser enforces choices=["streaming","query"] and client/validators check the same; README.md and config/system.yaml already match — no changes needed.
295-301: Verified: linked docs exist. Both referenced files are present: lsc_agent_eval/README.md and README-generate-answers.md.
0545eb7 to a61e1da
Actionable comments posted: 2
♻️ Duplicate comments (1)
README.md (1)
200-206: Make API-enabled behavior explicit: fresh API call per turn; pre-filled data ignored.

This aligns docs with implementation and prior guidance. Please add the following bullet.
```diff
 #### With API Enabled (`api.enabled: true`)
 - **Real-time data generation**: Queries are sent to external API
 - **Dynamic responses**: `response` and `tool_calls` fields populated by API
+- **Fresh calls per turn**: Every turn triggers a new API call; any pre-filled `response`/`tool_calls` in the data are ignored to ensure consistency across runs
 - **Conversation context**: Conversation context is maintained across turns
 - **Authentication**: Use `API_KEY` environment variable
 - **Data persistence**: Writes `response`/`tool_calls` back to the original evaluation data so it can be reused with API disabled
```
🧹 Nitpick comments (2)
README.md (2)
251-263: Add env vars for Anthropic and Azure for completeness (mentioned as supported).

You list these providers but don’t show how to configure keys. Add optional snippets.
```diff
 #### For LLM Evaluation (Always Required)
 ```bash
 # OpenAI
 export OPENAI_API_KEY="your-openai-key"
+# Anthropic
+export ANTHROPIC_API_KEY="your-key"
+
 # IBM Watsonx
 export WATSONX_API_KEY="your-key"
 export WATSONX_API_BASE="https://us-south.ml.cloud.ibm.com"
 export WATSONX_PROJECT_ID="your-project-id"
 # Gemini
 export GEMINI_API_KEY="your-key"
+
+# Azure OpenAI
+export AZURE_OPENAI_API_KEY="your-key"
+export AZURE_OPENAI_ENDPOINT="https://<your-endpoint>.openai.azure.com"
+# Optional per-SDK: AZURE_OPENAI_API_VERSION="2024-xx-xx"
```
165-176: Optional: quote metric identifiers in lists for consistency.

List items like ragas:faithfulness and deepeval:conversation_completeness parse as plain scalars, but quoting improves readability and symmetry with mapping keys.
```diff
 turn_metrics:
-  - ragas:faithfulness
-  - custom:answer_correctness
+  - "ragas:faithfulness"
+  - "custom:answer_correctness"
 ...
 conversation_metrics:
-  - deepeval:conversation_completeness
+  - "deepeval:conversation_completeness"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
README.md (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.
Applied to files:
README.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: black
- GitHub Check: mypy
- GitHub Check: tests (3.13)
a61e1da to 070e580
Actionable comments posted: 0
🧹 Nitpick comments (5)
README.md (5)
200-206: Document “fresh API call per turn” and make persistence optional.

Make API-enabled behavior explicit (per retrieved learnings) and avoid implying persistence is always on.
```diff
 #### With API Enabled (`api.enabled: true`)
 - **Real-time data generation**: Queries are sent to external API
 - **Dynamic responses**: `response` and `tool_calls` fields populated by API
+- **Fresh calls per turn**: Every turn triggers a new API call; any pre-filled `response`/`tool_calls` are ignored to ensure consistency across runs
 - **Conversation context**: Conversation context is maintained across turns
 - **Authentication**: Use `API_KEY` environment variable
-- **Data persistence**: Writes `response`/`tool_calls` back to the original evaluation data so it can be reused with API disabled
+- **Data persistence**: Optionally writes `response`/`tool_calls` back to the evaluation data for reuse in API‑disabled runs
```

Note: This aligns with the retrieved learning dated 2025-09-09 stating that when API is enabled, each turn must make a fresh API call. Please confirm current pipeline behavior matches this.
216-227: Add a blank line after the table to satisfy markdownlint (MD058).

Minor formatting to keep tables clean.
```diff
 | `tool_calls` | list[list[dict]] | ❌ | Actual tool calls from API | ✅ (if API enabled) |
+
 Note: Context will be collected automatically in the future.
```
44-56: Call out that pre-filled response/tool_calls are ignored when API is enabled.

A short inline note reduces surprises during runs.
```diff
 # Run with API-enabled configuration
+# Note: With api.enabled: true, each turn triggers a fresh API call; any pre-filled
+# `response`/`tool_calls` in eval data are ignored.
 lightspeed-eval --system-config config/system.yaml --eval-data config/evaluation_data.yaml
```
106-117: Clarify endpoint_type values and when api.provider/model apply.

Avoid ambiguity by documenting accepted values and defaults.
```diff
 api:
   enabled: true                    # Enable/disable API calls
   api_base: http://localhost:8080  # Base API URL
-  endpoint_type: streaming         # streaming or query endpoint
+  endpoint_type: streaming         # Allowed: "streaming" or "query" (default: "query")
   timeout: 300                     # API request timeout in seconds
-
+
   provider: openai                 # LLM provider for API queries (optional)
   model: gpt-4o-mini               # Model to use for API queries (optional)
   no_tools: null                   # Whether to bypass tools (optional)
   system_prompt: null              # Custom system prompt (optional)
+
+# Note: provider/model/no_tools/system_prompt are used only if your API expects the client
+# to specify these for its own LLM calls; omit if the server decides them.
```
251-269: Expand acronym for clarity.

Spell out MCP once.
```diff
-# API authentication for external system (MCP)
+# API authentication for external system (MCP — Model Context Protocol)
 export API_KEY="your-api-endpoint-key"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
README.md (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.
Applied to files:
README.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: pydocstyle
🔇 Additional comments (1)
README.md (1)
261-263: Nice addition: Gemini env var.

Good to see provider coverage documented.
@tisnik PTAL.
Update readme as per latest changes.
Summary by CodeRabbit
New Features
Documentation