Conversation
Walkthrough

README.md rewritten to document an API-centric evaluation workflow with toggleable API modes, an expanded YAML configuration schema (llm, api, metrics_metadata, output, visualization), an enriched metrics taxonomy (including tool_eval), detailed evaluation data and tool-call structures, installation/usage scenarios, and output/reporting guidance.
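For readers skimming the walkthrough, a minimal sketch of a configuration with those sections might look like the following. The section names (`llm`, `api`, `metrics_metadata`, `output`, `visualization`) and the metric identifiers come from this review; the remaining field names and values are illustrative assumptions, not the project's authoritative schema.

```yaml
# Illustrative sketch only; field names outside this review are assumptions.
llm:
  provider: openai            # Judge-LLM provider
  model: gpt-4o-mini          # Judge-LLM model

api:
  enabled: true               # Toggle real-time data collection via the API
  api_base: http://localhost:8080
  endpoint_type: streaming    # "streaming" or "query"
  timeout: 300                # seconds

metrics_metadata:
  # Metric identifiers mentioned in this review; the grouping keys are assumed.
  turn_metrics:
    - "ragas:faithfulness"
    - "custom:answer_correctness"
    - "custom:tool_eval"
  conversation_metrics:
    - "deepeval:conversation_completeness"

output:
  formats: [json, csv, txt]   # Report formats named in the walkthrough (key name assumed)

visualization:
  enabled: true               # Graph/PNG generation (key name assumed)
```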
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
actor User
participant Evaluator
participant LLM as LLM/API Provider
participant Store as DataStore
participant Metrics
participant Visualizer
User->>Evaluator: Start evaluation (config + api.enabled?)
alt API enabled
Evaluator->>LLM: Request response (model, timeout, retries)
LLM-->>Evaluator: Response + tool_calls + context
Evaluator->>Store: Persist generated turns
else API disabled
User->>Evaluator: Supply static evaluation dataset
end
Evaluator->>Metrics: Compute turn & conversation metrics (including tool_eval)
Metrics-->>Evaluator: Scores, statuses, reasoning
Evaluator->>Store: Write reports (JSON/CSV/TXT)
Evaluator->>Visualizer: Generate graphs/PNG
Visualizer-->>User: Render visualizations
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 3
🧹 Nitpick comments (5)
README.md (5)
5-5: Polish the WIP banner for clarity and tone.

```diff
-**This is WIP; We are actively working to add more features, fix any issues & add more examples. Please give a try, provide your feedback & report any bug.**
+**This is a WIP. We’re actively adding features, fixing issues, and expanding examples. Please give it a try, share feedback, and report bugs.**
```
44-44: Tighten section title wording.

```diff
-#### 1. API-Enabled Real-time actual data collection
+#### 1. API-Enabled Real-time data collection
```
46-46: Fix typos in environment variable setup comments.

```diff
-# Set equired environment variable(s) for both Judge-LLM and API authentication (for MCP)
+# Set required environment variable(s) for both Judge-LLM and API authentication (for MCP)
```

```diff
-# Set equired environment variable(s) for Judge-LLM
+# Set required environment variable(s) for Judge-LLM
```

Also applies to: 60-60
183-189: Fix brand capitalization (“OpenShift”).

```diff
- query: What is Openshift Virtualization?
+ query: What is OpenShift Virtualization?
```
205-206: Clarify data persistence wording and capitalization.

```diff
-- **Data Save**: Update original data which can be used with api disabled
+- **Data persistence**: Updates the original evaluation data so it can be reused with API disabled
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
README.md (4 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.
Applied to files:
README.md
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Applied to files:
README.md
🪛 markdownlint-cli2 (0.17.2)
README.md
226-226: Table pipe style
Expected: leading_and_trailing; Actual: no_leading_or_trailing; Missing leading pipe
(MD055, table-pipe-style)
226-226: Table pipe style
Expected: leading_and_trailing; Actual: no_leading_or_trailing; Missing trailing pipe
(MD055, table-pipe-style)
226-226: Table column count
Expected: 5; Actual: 1; Too few cells, row will be missing data
(MD056, table-column-count)
226-226: Tables should be surrounded by blank lines
(MD058, blanks-around-tables)
50688ba to 0545eb7
Actionable comments posted: 0
♻️ Duplicate comments (1)
README.md (1)
200-206: Document API-enabled behavior: fresh call per turn; pre-filled data ignored/overwritten.

Align README with implementation and prior guidance so users aren’t surprised when pre-filled response/tool_calls are replaced.
Apply:
```diff
 #### With API Enabled (`api.enabled: true`)
 - **Real-time data generation**: Queries are sent to external API
 - **Dynamic responses**: `response` and `tool_calls` fields populated by API
+- **Fresh calls per turn**: Every turn triggers a new API call; any pre-filled `response`/`tool_calls` are ignored and overwritten to ensure consistency across runs
 - **Conversation context**: Conversation context is maintained across turns
 - **Authentication**: Use `API_KEY` environment variable
-- **Data persistence**: Writes `response`/`tool_calls` back to the original evaluation data so it can be reused with API disabled
+- **Data persistence**: Writes back and overwrites `response`/`tool_calls` in the original evaluation data for reuse when API is disabled
```
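To make the overwrite behavior concrete, a hypothetical turn could look like this before and after an API-enabled run. Field names follow the evaluation data structure discussed in this review; `turn_id`, the tool name, and the argument keys are invented for illustration.

```yaml
# Before an API-enabled run (values illustrative; pre-filled data is ignored):
- turn_id: 1                                    # field name assumed
  query: What is OpenShift Virtualization?
  response: "stale answer from an earlier run"  # will be replaced
  tool_calls: []                                # will be replaced

# After the run, the fresh API output is written back, so the same file can
# later be reused with api.enabled: false:
- turn_id: 1
  query: What is OpenShift Virtualization?
  response: "fresh answer returned by the API"
  tool_calls:
    - - tool_name: retrieve_docs                # hypothetical tool call
        arguments: {topic: "OpenShift Virtualization"}
```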
🧹 Nitpick comments (5)
README.md (5)
46-47: Expand acronym “MCP” on first use.

Spell out to avoid ambiguity.
```diff
-# Set required environment variable(s) for both Judge-LLM and API authentication (for MCP)
+# Set required environment variable(s) for both Judge-LLM and API authentication (for Model Context Protocol, MCP)
```
35-40: Make provider-agnostic in Quick Start.

Clarify that the env var depends on chosen provider (OpenAI, Watsonx, etc.) to reduce confusion.
```diff
-# Set required environment variable(s) for Judge-LLM
-export OPENAI_API_KEY="your-key"
+# Set required environment variable(s) for your Judge-LLM provider
+# e.g., OpenAI:
+export OPENAI_API_KEY="your-key"
+# or IBM Watsonx:
+# export WATSONX_API_KEY="your-key"
```
236-245: Clarify nested structure semantics for `expected_tool_calls`.

Briefly describe list-of-lists meaning (steps/turns vs parallel calls) to prevent schema confusion.
```diff
-#### Tool Call Structure
+#### Tool Call Structure
+Each turn’s `expected_tool_calls` is a list of steps; every step is a list of tool calls expected within that step (i.e., `list[list[dict]]`).
```
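A small sketch may help readers see the nesting: the outer list holds steps, each inner list holds the call(s) expected within that step. The tool names and argument keys below are hypothetical, not taken from the project.

```yaml
expected_tool_calls:
  # step 1: a single expected call
  - - tool_name: list_namespaces
      arguments: {}
  # step 2: two calls expected within the same step
  - - tool_name: get_pod_logs
      arguments: {namespace: "demo"}
    - tool_name: get_events
      arguments: {namespace: "demo"}
```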
58-66: Cross-reference required fields for static mode.

Remind users which fields must be present when API is disabled.
```diff
 #### 2. Static Data Evaluation (API Disabled)
 ...
-# Pre-fill response, contexts & tool_calls data in YAML
+# Pre-fill response, contexts & tool_calls data in YAML
+# Note: Some metrics require additional fields, e.g., `expected_response` for `custom:answer_correctness`
+# and `expected_tool_calls` for `custom:tool_eval` (see examples below).
```
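For orientation, a static-mode turn carrying the fields those metrics need might look like the sketch below. The field names (`query`, `response`, `contexts`, `expected_response`, `expected_tool_calls`) appear in this review; the values and the tool/argument names are invented.

```yaml
# Illustrative static-mode turn (api.enabled: false); values are invented.
- query: What is OpenShift Virtualization?
  response: "OpenShift Virtualization runs virtual machines alongside containers."
  contexts:
    - "OpenShift Virtualization is an add-on that lets you run and manage VMs on OpenShift."
  expected_response: "An add-on for running and managing virtual machines on OpenShift."  # needed by custom:answer_correctness
  expected_tool_calls:                          # needed by custom:tool_eval
    - - tool_name: retrieve_docs                # hypothetical tool name
        arguments: {topic: "virtualization"}
```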
5-5: Tone tweak: replace “WIP” banner with a badge or softer phrasing.

Keeps main-branch README polished while still signaling active development.
```diff
-**This is a WIP. We’re actively adding features, fixing issues, and expanding examples. Please give it a try, share feedback, and report bugs.**
+**Actively evolving**: We’re adding features and examples. Try it out and share feedback or bugs.
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
README.md (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.
Applied to files:
README.md
🔇 Additional comments (3)
README.md (3)
64-65: Good fix: use `contexts` (plural).

This aligns with the schema and resolves the prior comment.
108-112: Allowed values confirmed — "streaming" and "query".

Parser enforces choices=["streaming","query"] and client/validators check the same; README.md and config/system.yaml already match — no changes needed.
295-301: Verified: linked docs exist. Both referenced files are present: lsc_agent_eval/README.md and README-generate-answers.md.
0545eb7 to a61e1da
Actionable comments posted: 2
♻️ Duplicate comments (1)
README.md (1)
200-206: Make API-enabled behavior explicit: fresh API call per turn; pre-filled data ignored.

This aligns docs with implementation and prior guidance. Please add the following bullet.
```diff
 #### With API Enabled (`api.enabled: true`)
 - **Real-time data generation**: Queries are sent to external API
 - **Dynamic responses**: `response` and `tool_calls` fields populated by API
+- **Fresh calls per turn**: Every turn triggers a new API call; any pre-filled `response`/`tool_calls` in the data are ignored to ensure consistency across runs
 - **Conversation context**: Conversation context is maintained across turns
 - **Authentication**: Use `API_KEY` environment variable
 - **Data persistence**: Writes `response`/`tool_calls` back to the original evaluation data so it can be reused with API disabled
```
🧹 Nitpick comments (2)
README.md (2)
251-263: Add env vars for Anthropic and Azure for completeness (mentioned as supported).

You list these providers but don’t show how to configure keys. Add optional snippets.
```diff
 #### For LLM Evaluation (Always Required)
 ```bash
 # OpenAI
 export OPENAI_API_KEY="your-openai-key"
+# Anthropic
+export ANTHROPIC_API_KEY="your-key"
+
 # IBM Watsonx
 export WATSONX_API_KEY="your-key"
 export WATSONX_API_BASE="https://us-south.ml.cloud.ibm.com"
 export WATSONX_PROJECT_ID="your-project-id"
 # Gemini
 export GEMINI_API_KEY="your-key"
+
+# Azure OpenAI
+export AZURE_OPENAI_API_KEY="your-key"
+export AZURE_OPENAI_ENDPOINT="https://<your-endpoint>.openai.azure.com"
+# Optional per-SDK: AZURE_OPENAI_API_VERSION="2024-xx-xx"
```
165-176: Optional: quote metric identifiers in lists for consistency.

List items like ragas:faithfulness and deepeval:conversation_completeness parse as plain scalars, but quoting improves readability and symmetry with mapping keys.
```diff
 turn_metrics:
-  - ragas:faithfulness
-  - custom:answer_correctness
+  - "ragas:faithfulness"
+  - "custom:answer_correctness"
 ...
 conversation_metrics:
-  - deepeval:conversation_completeness
+  - "deepeval:conversation_completeness"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
README.md (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.
Applied to files:
README.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: black
- GitHub Check: mypy
- GitHub Check: tests (3.13)
a61e1da to 070e580
Actionable comments posted: 0
🧹 Nitpick comments (5)
README.md (5)
200-206: Document “fresh API call per turn” and make persistence optional.

Make API-enabled behavior explicit (per retrieved learnings) and avoid implying persistence is always on.
```diff
 #### With API Enabled (`api.enabled: true`)
 - **Real-time data generation**: Queries are sent to external API
 - **Dynamic responses**: `response` and `tool_calls` fields populated by API
+- **Fresh calls per turn**: Every turn triggers a new API call; any pre-filled `response`/`tool_calls` are ignored to ensure consistency across runs
 - **Conversation context**: Conversation context is maintained across turns
 - **Authentication**: Use `API_KEY` environment variable
-- **Data persistence**: Writes `response`/`tool_calls` back to the original evaluation data so it can be reused with API disabled
+- **Data persistence**: Optionally writes `response`/`tool_calls` back to the evaluation data for reuse in API‑disabled runs
```

Note: This aligns with the retrieved learning dated 2025-09-09 stating that when API is enabled, each turn must make a fresh API call. Please confirm current pipeline behavior matches this.
216-227: Add a blank line after the table to satisfy markdownlint (MD058).

Minor formatting to keep tables clean.
```diff
 | `tool_calls` | list[list[dict]] | ❌ | Actual tool calls from API | ✅ (if API enabled) |
+
 Note: Context will be collected automatically in the future.
```
44-56: Call out that pre-filled response/tool_calls are ignored when API is enabled.

A short inline note reduces surprises during runs.
```diff
 # Run with API-enabled configuration
+# Note: With api.enabled: true, each turn triggers a fresh API call; any pre-filled
+# `response`/`tool_calls` in eval data are ignored.
 lightspeed-eval --system-config config/system.yaml --eval-data config/evaluation_data.yaml
```
106-117: Clarify endpoint_type values and when api.provider/model apply.

Avoid ambiguity by documenting accepted values and defaults.
```diff
 api:
   enabled: true                    # Enable/disable API calls
   api_base: http://localhost:8080  # Base API URL
-  endpoint_type: streaming         # streaming or query endpoint
+  endpoint_type: streaming         # Allowed: "streaming" or "query" (default: "query")
   timeout: 300                     # API request timeout in seconds
-
+
   provider: openai                 # LLM provider for API queries (optional)
   model: gpt-4o-mini               # Model to use for API queries (optional)
   no_tools: null                   # Whether to bypass tools (optional)
   system_prompt: null              # Custom system prompt (optional)
+
+# Note: provider/model/no_tools/system_prompt are used only if your API expects the client
+# to specify these for its own LLM calls; omit if the server decides them.
```
251-269: Expand acronym for clarity.

Spell out MCP once.
```diff
-# API authentication for external system (MCP)
+# API authentication for external system (MCP — Model Context Protocol)
 export API_KEY="your-api-endpoint-key"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
README.md (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.
Applied to files:
README.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: pydocstyle
🔇 Additional comments (1)
README.md (1)
261-263: Nice addition: Gemini env var.

Good to see provider coverage documented.
@tisnik PTAL.
Update readme as per latest changes.
Summary by CodeRabbit
New Features
Documentation