Releases · lightspeed-core/lightspeed-evaluation
LightSpeed Evaluation v0.4.0
What's Changed
Key Changes
- Flexible Tool Evaluation: Configurable ordered/unordered & full/partial match modes for tool call validation (see the first sketch after this list)
- Classical Evaluation Metrics: Support for traditional evaluation metrics (BLEU, ROUGE, distance metrics; sketched below)
- Alternate Expected Response: Ability to set alternate ground-truth responses for static evaluation metrics
- Eval Configuration Tracking: Evaluation configuration details now included in generated reports for better reproducibility
- API Latency Metrics: Latency tracking and reporting for API performance analysis (for the API streaming endpoint; sketched below)
- Data Grouping: Tag-based grouping of evaluation conversations for better organization
- Data Filtering: Filter evaluation datasets by tags and conversation IDs (CLI arguments) for targeted testing
- Cache Warmup: New optional CLI argument to pre-warm (clear) caches before evaluation runs
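The ordered/unordered and full/partial match modes can be pictured with a small, self-contained Python sketch. This is an illustration only: the function name, arguments, and exact semantics are assumptions, not the framework's actual implementation.

```python
# Illustrative only: a generic ordered/unordered, full/partial comparison of
# expected vs. actual tool-call names. Not the framework's real code.
from collections import Counter


def tool_calls_match(expected, actual, ordered=True, full_match=True):
    """Compare two lists of tool-call names under the selected match mode."""
    if ordered:
        if full_match:
            return expected == actual          # exact sequence match
        it = iter(actual)                      # partial: in-order subsequence
        return all(call in it for call in expected)
    if full_match:
        return Counter(expected) == Counter(actual)   # same calls, any order
    return not (Counter(expected) - Counter(actual))  # expected is a subset


print(tool_calls_match(["search", "fetch"], ["fetch", "search"], ordered=False))  # True
print(tool_calls_match(["search"], ["search", "fetch"], full_match=False))        # True
```

In this sketch, `full_match=False` reduces to subset matching, which mirrors the behavior described in #145.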
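For the classical metrics, the flavor of the computation can be shown with dependency-free stand-ins for an edit-distance metric and a ROUGE-1-style overlap score; the framework itself may delegate BLEU/ROUGE to dedicated libraries, so treat these purely as sketches.

```python
# Dependency-free sketches of classical string metrics: Levenshtein edit
# distance and a ROUGE-1-style unigram F-score. Illustrative, not the
# framework's implementation.
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def rouge1_f(reference: str, candidate: str) -> float:
    """F-score over unigram overlap between reference and candidate."""
    ref, cand = reference.split(), candidate.split()
    overlap = sum((Counter(ref) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(levenshtein("kitten", "sitting"))                       # 3
print(round(rouge1_f("the cat sat", "the cat sat down"), 3))  # 0.857
```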
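The latency numbers for a streaming endpoint typically come down to time-to-first-chunk and total response time; a generic way to capture both while consuming a stream is sketched below. The function and field names are placeholders, not the project's API.

```python
# Generic latency capture for a streaming response: time to first chunk and
# total duration. The input is any iterable of text chunks; this is a
# stand-in, not the framework's API client.
import time


def consume_with_latency(chunks):
    start = time.perf_counter()
    first_chunk_latency = None
    parts = []
    for chunk in chunks:
        if first_chunk_latency is None:
            first_chunk_latency = time.perf_counter() - start
        parts.append(chunk)
    total_latency = time.perf_counter() - start
    return "".join(parts), {
        "time_to_first_chunk_s": first_chunk_latency,
        "total_latency_s": total_latency,
    }
```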
Pull Requests
- bump eval to v0.4.0 by @asamal4 in #128
- fix: azure env variable names for judgeLLM by @asamal4 in #129
- [LEADS-141] Add Latency Metrics to Evaluation Reports by @bsatapat-jpg in #127
- chore: consolidate test_data models by @asamal4 in #131
- chore: refactor generator & statistics module by @asamal4 in #132
- Add optional property tag to group eval conversations by @asamal4 in #134
- add git hooks by @VladimirKadlec in #133
- [LEADS-172] Support classical evaluation metrics by @bsatapat-jpg in #130
- fix: align docs for updated make targets by @asamal4 in #135
- [LEADS-153] Adding the ordered matching logic in tool eval by @bsatapat-jpg in #136
- [LEADS-153] Implement match logic (full/partial) by @bsatapat-jpg in #137
- Remove duplicate data validation in pipeline by @asamal4 in #141
- chore: refactor evaluation runner by @asamal4 in #140
- feat: add data filter by tags & conv_ids by @asamal4 in #143
- [LEADS-153] Wiring the configuration and adding the config in system.yaml by @bsatapat-jpg in #139
- [LEADS-182] - Add eval config data to the report by @arin-deloatch in #142
- Leads 6 set expected responses by @xmican10 in #138
- map max_tokens to max_completion_tokens internally by @asamal4 in #144
- fix: Do subset matching for full_match=false by @saswatamcode in #145
- Enhance test quality by @xmican10 in #146
- use .model_dump instead of .dict by @asamal4 in #147
- add cache-warmup flag by @VladimirKadlec in #149
- Leads 212 remove unittest mocking by @xmican10 in #148
New Contributors
- @saswatamcode made their first contribution in #145
Full Changelog: v0.3.0...v0.4.0
LightSpeed Evaluation v0.3.0
What's Changed
Key Changes
- Token Usage Statistics: Track and report token consumption during evaluations (both API and JudgeLLM usage; sketched below)
- Certificate Support for JudgeLLM: Configure custom certificates when connecting to Judge LLM endpoints
- Skip on Failure: Optional config to skip the remaining evaluations in a conversation group when any evaluation criterion fails (see the sketch after this list)
- Optional Packages: torch and nvidia-* packages are now optional, significantly reducing install size for use cases that don't require them
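Token usage reporting boils down to accumulating prompt and completion counts per source (API calls vs. judge LLM calls). The sketch below is a generic illustration; the class and field names are assumptions, not the framework's schema.

```python
# Generic token-usage accumulator; names are illustrative only.
from dataclasses import dataclass


@dataclass
class TokenUsage:
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def add(self, prompt: int, completion: int) -> None:
        self.prompt_tokens += prompt
        self.completion_tokens += completion

    @property
    def total(self) -> int:
        return self.prompt_tokens + self.completion_tokens


# Separate counters let a report show API and judge-LLM usage side by side.
api_usage, judge_usage = TokenUsage(), TokenUsage()
api_usage.add(prompt=1200, completion=350)
judge_usage.add(prompt=800, completion=40)
print(api_usage.total, judge_usage.total)  # 1550 840
```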
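The skip-on-failure behavior is a simple control-flow change in the runner: once any metric in a conversation group fails, the remaining evaluations in that group are marked as skipped instead of executed. A minimal sketch follows; the function and result labels are assumptions, not the framework's.

```python
# Minimal sketch of "skip on failure" within one conversation group.
def run_group(evaluations, skip_on_failure=True):
    results, failed = [], False
    for name, check in evaluations:
        if skip_on_failure and failed:
            results.append((name, "skipped"))
            continue
        passed = check()
        failed = failed or not passed
        results.append((name, "pass" if passed else "fail"))
    return results


print(run_group([
    ("answer_correctness", lambda: True),
    ("answer_relevancy", lambda: False),
    ("faithfulness", lambda: True),   # skipped because an earlier metric failed
]))
```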
Pull Requests
- bump eval version to 0.3.0 by @asamal4 in #113
- docs: reorganize docs, add configuration docs by @VladimirKadlec in #111
- Configuration base url update by @yangcao77 in #110
- [LEADS-40]: Get statistics about the token usage for lightspeed-evaluation by @bsatapat-jpg in #112
- LEADS-160: Adding python 3.13 compatibility by @bsatapat-jpg in #115
- add additional fields to output for non-error scenarios by @asamal4 in #114
- remove dynamic all by @asamal4 in #116
- make agents.md more concise by @asamal4 in #117
- add bandit to make target by @asamal4 in #118
- chore: refactor processor & errors.py by @asamal4 in #119
- [LEADS-119] code scanning found multiple security problems by @bsatapat-jpg in #122
- Skip rest of the eval for a metric failure within a conversation group by @asamal4 in #121
- Leads 44 certificates for judge llm by @xmican10 in #120
- [LEADS-140] lightspeed-evaluation has dependency on torch and nvidia* packages that are not required for all usecases by @bsatapat-jpg in #123
- doc: note for rhaiis, models.corp judgellm by @asamal4 in #124
- chore: update docs/key features by @asamal4 in #125
- doc: Add troubleshooting for known issues by @asamal4 in #126
New Contributors
- @yangcao77 made their first contribution in #110
- @xmican10 made their first contribution in #120
Full Changelog: v0.2.0...v0.3.0
LightSpeed Evaluation v0.2.0
What's Changed
- bump lightspeed evaluation version by @asamal4 in #78
- LCORE-723: Added statistical comparison between two evaluation result files by @bsatapat-jpg in #74
- remove unused LightspeedStackClient module by @asamal4 in #81
- add agents.md by @asamal4 in #82
- LCORE-417 Convert unittest mocking to pytest mocking by @max-svistunov in #84
- Concurrent eval by @VladimirKadlec in #85
- LCORE-834: Added script to run evaluation across multiple providers and models by @bsatapat-jpg in #83
- add .caches/ folder to gitignore by @asamal4 in #87
- LCORE-899: created the evaluation methodology by @bsatapat-jpg in #88
- remove archived OLS eval tool by @asamal4 in #86
- add CLAUDE.md by @asamal4 in #89
- add agent-eval deprecation note by @asamal4 in #91
- LCORE-900: Added the parallel execution for multi-modal evaluation in… by @bsatapat-jpg in #92
- Ability to set alternate tool calls for eval by @asamal4 in #90
- LCORE-748: Added unit test cases coverage for the evaluation framework by @bsatapat-jpg in #95
- LEADS-113: Added support for gemini embedding models by @bsatapat-jpg in #99
- LEADS-2: Fix Path Object Serialization in Amended YAML Files by @bsatapat-jpg in #100
- handle no tool call alternative by @asamal4 in #101
- LCORE-916: configuration for CodeRabbitAI by @tisnik in #103
- GEval Integration by @arin-deloatch in #97
- Add keyword eval metric by @asamal4 in #93
- fix: run turn evaluation immediately after api call by @asamal4 in #105
- LCORE-664: Section about AI tools by @tisnik in #107
- LCORE-974: fixed issues found by Pyright by @tisnik in #108
- LEADS-8: Lazy imports for eval tool by @bsatapat-jpg in #106
- add support for fail_on_invalid_data option by @VladimirKadlec in #94
- LEADS-26: Increased Unit test cases coverage by @bsatapat-jpg in #109
New Contributors
- @max-svistunov made their first contribution in #84
- @arin-deloatch made their first contribution in #97
Full Changelog: v0.1.0...v0.2.0
LightSpeed Evaluation v0.1.0
What's Changed
- initial copy of OLS eval by @asamal4 in #1
- merge ols and road-core, first working version by @VladimirKadlec in #2
- delete old scripts/evaluation, add README by @VladimirKadlec in #3
- add evaluation datasets by @VladimirKadlec in #4
- LCORE-162: Setup all CI all linters/checkers by @matysek in #5
- Add some type hints into rag_eval.py by @tisnik in #6
- Fixed docstrings by @tisnik in #7
- Added type hints for functions without return value by @tisnik in #8
- LCORE-276: Pin HTTPX version for now by @tisnik in #9
- add generate answers tool by @VladimirKadlec in #10
- Update dependencies by @tisnik in #12
- Fix error: missing argument by @tisnik in #13
- Check provider models by @tisnik in #14
- fix readme reference post migration by @asamal4 in #11
- LCORE-210: Added Contribution Guide by @jrobertboos in #15
- fix empty question, change retry strategy by @VladimirKadlec in #17
- fix few lint issues by @asamal4 in #18
- feat: add agent e2e eval by @asamal4 in #19
- agent eval: verbose print and fixes by @asamal4 in #20
- temp-fix: fix/suppress pyright issues by @asamal4 in #21
- agent eval: multi-turn & refactoring by @asamal4 in #22
- agent-eval: py version by @asamal4 in #23
- Agent eval: add tool call comparison by @asamal4 in #24
- update dependencies by @VladimirKadlec in #25
- fix: streaming error handling by @asamal4 in #26
- Generic eval tool by @asamal4 in #28
- fix runner by @asamal4 in #31
- use uv instead of pdm by @Anxhela21 in #30
- Fix Bandit checker on CI by @tisnik in #32
- archive old eval and make lsc eval as primary by @asamal4 in #35
- switch to regex check for tool arg value by @asamal4 in #41
- docs: Add input data to generate answers documentation by @are-ces in #36
- fix rule for black & pydocstyle by @asamal4 in #45
- Added Unit test cases as well as integration test cases by @bsatapat-jpg in #42
- Add client for query endpoint by @Anxhela21 in #43
- Feature: Add response_eval:intent evaluation type for LLM response intent assessment by @ItzikEzra-rh in #46
- API integration & refactoring by @asamal4 in #47
- [nit] Clean up evaluation_data.yaml by @lpiwowar in #52
- fix: use uv pip instead of pip by @are-ces in #50
- [LCORE-646] Disable default tracking in RAGAS by @lpiwowar in #49
- [LCORE-648] Fix processing of `float('NaN')` values when OutputParserException by @lpiwowar in #48
- allow none llm for LS API by @asamal4 in #53
- feat: Added parallelism for answer generation by @are-ces in #39
- update readme by @asamal4 in #54
- fix: propagate arg output dir by @asamal4 in #57
- Turn metric override by @asamal4 in #55
- feat: add support for custom embedding model by @VladimirKadlec in #56
- keep original input file intact by @asamal4 in #59
- docs: add links to metrics docs by @VladimirKadlec in #60
- Retrieved RAG context from lightspeed-stack API by @bsatapat-jpg in #58
- Setting the execution bit only if it's not set by @andrej1991 in #61
- provider vertex support for judge llm by @andrej1991 in #29
- update tool call property by @asamal4 in #64
- add vertex to main eval & refactor by @asamal4 in #63
- Env setup/cleanup ability and verify through script by @asamal4 in #62
- add example & check for vLLM hosted inference server by @asamal4 in #66
- fix sample data by @asamal4 in #69
- use absolute imports by @asamal4 in #68
- fix: propagate api error message by @asamal4 in #72
- add common custom llm by @asamal4 in #70
- LCORE-723: Compute correct confidence interval by @bsatapat-jpg in #71
- Simplify custom prompt handling & re-organize by @asamal4 in #73
- add support for caching llm and api responses by @VladimirKadlec in #75
- standardize file name as per framework name in metric by @asamal4 in #76
- add intent eval by @asamal4 in #77
New Contributors
- @asamal4 made their first contribution in #1
- @VladimirKadlec made their first contribution in #2
- @matysek made their first contribution in #5
- @tisnik made their first contribution in #6
- @jrobertboos made their first contribution in #15
- @Anxhela21 made their first contribution in #30
- @are-ces made their first contribution in #36
- @bsatapat-jpg made their first contribution in #42
- @ItzikEzra-rh made their first contribution in #46
- @lpiwowar made their first contribution in #52
- @andrej1991 made their first contribution in #61
Full Changelog: https://github.com/lightspeed-core/lightspeed-evaluation/commits/v0.1.0