
[LCORE-648] Fix processing of float('NaN') values when OutputParserException #48

Merged
tisnik merged 2 commits into lightspeed-core:main from lpiwowar:lpiwowar/fix-output-parser-exception on Sep 12, 2025

Conversation

@lpiwowar
Contributor

@lpiwowar lpiwowar commented Sep 9, 2025

The RAGAS framework returns float('NaN') when it encounters malformed output from the LLM. The malformed output is accompanied by an OutputParserException in the logs, but this exception is caught internally.

The NaN causes a later failure when statistics such as the standard deviation are generated at the end of the evaluation, so ultimately no results are obtained from the evaluation whenever RAGAS encounters malformed output.

This commit fixes the issue by checking whether RAGAS returned float('NaN') and, if so, making the evaluate() function return None, as in other failure cases. This ensures that NaN does not reach the computation of the final statistics.

Resolves: #44
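
For readers unfamiliar with the failure mode: NaN compares unequal to everything, including itself, so it slips past ordinary value and threshold checks and has to be detected explicitly. A minimal, self-contained illustration (not code from this PR):

import math

score = float("nan")  # what RAGAS hands back after the swallowed OutputParserException

print(score == score)     # False: NaN never equals anything, so equality checks miss it
print(score > 0.8)        # False: threshold comparisons silently "fail" instead of erroring
print(math.isnan(score))  # True:  the explicit check the fix relies on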

Summary by CodeRabbit

  • Bug Fixes
    • Better handling of network/timeout errors (e.g., broken connections) with clearer, user-friendly messages.
    • Detects and safely handles malformed or NaN model outputs, returning a safe result and explanatory error instead of crashing.
    • Unified post-processing ensures consistent result handling and more informative feedback during evaluation.

@coderabbitai
Contributor

coderabbitai bot commented Sep 9, 2025

Walkthrough

Adds math import and unified post-processing in ragas.evaluate: stores metric result in a local variable, refines OSError handling with a composed message (special-case errno 32 for broken-pipe/timeouts), checks for NaN in the metric score and returns a structured "malformed output" error before returning the final result.

Changes

Cohort: RAGAS evaluation error handling
File(s): src/lightspeed_evaluation/core/metrics/ragas.py
Summary of changes: Add import math; capture the metric computation in a local result variable; improve OSError handling by composing err_msg and appending a network/LLM timeout note for errno 32; add a post-try check that returns (None, "malformed output from the LLM") when result[0] is NaN; return result after the checks.
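
A rough sketch of the post-processing flow described above (the function shape, variable names, and metric callable are assumptions for illustration, not the exact diff):

import math


def evaluate(metric_name, sample, metric_fn):
    """Sketch of the described flow; not the actual ragas.py implementation."""
    try:
        # e.g. a (score, reason) tuple from the underlying RAGAS metric
        result = metric_fn(sample)
    except OSError as e:
        err_msg = f"Ragas {metric_name} evaluation failed: {e}"
        if e.errno == 32:  # broken pipe, typically a network/LLM timeout
            err_msg = (
                f"Ragas {metric_name} evaluation failed due to broken pipe "
                f"(network/LLM timeout): {e}"
            )
        return None, err_msg

    # RAGAS swallows OutputParserException internally and hands back float('NaN'),
    # so convert that into the same (None, error) shape used for other failures.
    if math.isnan(result[0]):
        return None, "malformed output from the LLM"
    return result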

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant ragas_evaluate as ragas.evaluate
  participant MetricFn as MetricFn

  Caller->>ragas_evaluate: evaluate(input)
  ragas_evaluate->>MetricFn: lookup_and_compute(...)
  alt MetricFn raises OSError
    MetricFn-->>ragas_evaluate: OSError
    ragas_evaluate->>Caller: err_msg (includes timeout note if errno 32)
  else MetricFn returns result
    MetricFn-->>ragas_evaluate: result
    alt result[0] is NaN
      Note over ragas_evaluate: Guard for malformed LLM output
      ragas_evaluate->>Caller: (None, "malformed output from the LLM")
    else
      ragas_evaluate->>Caller: result
    end
  end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Pre-merge checks (5 passed)

✅ Passed checks (5 passed)
  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title is concise and accurately describes the primary change: fixing handling of float('NaN') produced when an OutputParserException occurs. It directly reflects the code changes in src/lightspeed_evaluation/core/metrics/ragas.py and is clear to reviewers.
  • Linked Issues Check: ✅ Passed. The changes implement a clear NaN check (using math.isnan) and return None with a "malformed output from the LLM" error when ragas returns NaN, which prevents NaN scores from reaching aggregate statistics and addresses the core failure described in issue #44; the added OSError messaging is related error handling and does not conflict with the issue objectives. The PR description and testing notes indicate final reports are generated after the fix, so the coding objectives from the linked issue are met.
  • Out of Scope Changes Check: ✅ Passed. The diff is limited to changes in src/lightspeed_evaluation/core/metrics/ragas.py (import and evaluate error/NaN handling) and does not modify public signatures or other unrelated files, so there are no apparent out-of-scope changes introduced by this PR.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.

Poem

I nibbled through the error vine,
Found a NaN—a bitter sign.
Sewed a message for the broken pipe,
Whispered timeouts when networks gripe.
Now evaluations hop along just fine. 🐇✨

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4a138ce and 05d0880.

📒 Files selected for processing (1)
  • src/lightspeed_evaluation/core/metrics/ragas.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lightspeed_evaluation/core/metrics/ragas.py


@lpiwowar lpiwowar force-pushed the lpiwowar/fix-output-parser-exception branch from bf635e2 to 8382f64 on September 9, 2025 at 13:30
@lpiwowar lpiwowar changed the title from "Fix processing of NaN values when OutputParserException" to "Fix processing of float('NaN') values when OutputParserException" on Sep 9, 2025
@lpiwowar lpiwowar force-pushed the lpiwowar/fix-output-parser-exception branch from 162690e to 4a138ce on September 9, 2025 at 13:47
@lpiwowar
Contributor Author

lpiwowar commented Sep 9, 2025

Proof of Testing

With The Fix

lightspeed-evaluation reaches the end and generates the final report.

lightspeed-eval --system-config config/system.yaml --eval-data ./config/evaluation_data_rhoso_faithfulness.yaml
🚀 LightSpeed Evaluation Framework
==================================================

📋 Loading Configuration...
✅ All data validation passed
📋 Evaluation data loaded: 2 conversations
✅ System config: gemini/gemini-2.0-flash
✅ Evaluation data: 2 conversation groups

⚙️ Initializing Evaluation Driver...
✅ LLM Manager: gemini/gemini-2.0-flash -> gemini/gemini-2.0-flash
✅ Ragas Custom LLM: gemini/gemini-2.0-flash
✅ Ragas LLM Manager configured
✅ DeepEval LLM Manager: gemini/gemini-2.0-flash
✅ Custom Metrics initialized: gemini/gemini-2.0-flash
✅ Evaluation Driver initialized

🔄 Running Evaluation...
🚀 Starting evaluation...

1️⃣ Validating data...
✅ All data validation passed

2️⃣ Processing conversations...

📋 Evaluating: authentication-0
🔄 Turn-level metrics: ['ragas:faithfulness']
    ragas:faithfulness (threshold: 0.8)
Evaluating:   0%|                                                                                                                                                                                                                                                   | 0/1 [00:00<?, ?it/s]15:18:30 - LiteLLM:INFO: utils.py:3338 -
LiteLLM completion() model= gemini-2.0-flash; provider = gemini
2025-09-09 15:18:30,346 - LiteLLM - INFO -
LiteLLM completion() model= gemini-2.0-flash; provider = gemini
15:18:31 - LiteLLM:INFO: utils.py:1274 - Wrapper: Completed Call, calling success_handler
2025-09-09 15:18:31,211 - LiteLLM - INFO - Wrapper: Completed Call, calling success_handler
15:18:31 - LiteLLM:INFO: utils.py:3338 -
LiteLLM completion() model= gemini-2.0-flash; provider = gemini
2025-09-09 15:18:31,216 - LiteLLM - INFO -
LiteLLM completion() model= gemini-2.0-flash; provider = gemini
15:18:33 - LiteLLM:INFO: utils.py:1274 - Wrapper: Completed Call, calling success_handler
2025-09-09 15:18:33,308 - LiteLLM - INFO - Wrapper: Completed Call, calling success_handler
Evaluating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.97s/it]
      ❌ FAIL: 0.750

📋 Evaluating: authentication-1
🔄 Turn-level metrics: ['ragas:faithfulness']
    ragas:faithfulness (threshold: 0.8)
Evaluating:   0%|                                                                                                                                                                                                                                                   | 0/1 [00:00<?, ?it/s]15:18:34 - LiteLLM:INFO: utils.py:3338 -
LiteLLM completion() model= gemini-2.0-flash; provider = gemini
2025-09-09 15:18:34,654 - LiteLLM - INFO -
LiteLLM completion() model= gemini-2.0-flash; provider = gemini
15:18:36 - LiteLLM:INFO: utils.py:1274 - Wrapper: Completed Call, calling success_handler
2025-09-09 15:18:36,038 - LiteLLM - INFO - Wrapper: Completed Call, calling success_handler
15:18:36 - LiteLLM:INFO: utils.py:3338 -
LiteLLM completion() model= gemini-2.0-flash; provider = gemini
2025-09-09 15:18:36,043 - LiteLLM - INFO -
LiteLLM completion() model= gemini-2.0-flash; provider = gemini
15:18:38 - LiteLLM:INFO: utils.py:1274 - Wrapper: Completed Call, calling success_handler
2025-09-09 15:18:38,765 - LiteLLM - INFO - Wrapper: Completed Call, calling success_handler
15:18:38 - LiteLLM:INFO: utils.py:3338 -
LiteLLM completion() model= gemini-2.0-flash; provider = gemini
2025-09-09 15:18:38,800 - LiteLLM - INFO -
LiteLLM completion() model= gemini-2.0-flash; provider = gemini
15:18:41 - LiteLLM:INFO: utils.py:1274 - Wrapper: Completed Call, calling success_handler
2025-09-09 15:18:41,559 - LiteLLM - INFO - Wrapper: Completed Call, calling success_handler
2025-09-09 15:18:41,596 - ragas.executor - ERROR - Exception raised in Job[0]: OutputParserException(Failed to parse NLIStatementOutput from completion {"statements": [{"statement": xxxxxxxxxGot: 1 validation error for NLIStatementOutput
statements.6.verdict
  Field required [type=missing, input_value={'statement': 'This might...plicitly states: "This'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.11/v/missing
For troubleshooting, visit: https://python.langchain.com/docs/troubleshooting/errors/OUTPUT_PARSING_FAILURE )
Evaluating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00,  6.95s/it]
      ❌ ERROR: Ragas faithfulness evaluation failed due to malformed output from the LLM

✅ Evaluation complete: 2 results generated

📊 Generating Reports...
✅ Output handler initialized: eval_output

📊 Generating reports: evaluation_20250909_151842
  ✅ CSV: eval_output/evaluation_20250909_151842_detailed.csv
  ✅ JSON: eval_output/evaluation_20250909_151842_summary.json
  ✅ TXT: eval_output/evaluation_20250909_151842_summary.txt
2025-09-09 15:18:43,147 - lightspeed_evaluation.GraphGenerator - INFO - Generated 4 graphs
  ✅ Graphs: 4 files

🎉 Evaluation Complete!
📊 2 evaluations completed
📁 Reports generated in: eval_output
✅ Pass: 0, ❌ Fail: 1, ⚠️ Error: 1
⚠️ 1 evaluations had errors - check detailed report

Without The Fix

lightspeed-evaluation does not generate the final report when RAGAS encounters the OutputParserException.

lightspeed-eval --system-config config/system.yaml --eval-data ./config/evaluation_data_rhoso_faithfulness.yaml
🚀 LightSpeed Evaluation Framework
==================================================

📋 Loading Configuration...
✅ All data validation passed
📋 Evaluation data loaded: 2 conversations
✅ System config: gemini/gemini-2.0-flash
✅ Evaluation data: 2 conversation groups

⚙️ Initializing Evaluation Driver...
✅ LLM Manager: gemini/gemini-2.0-flash -> gemini/gemini-2.0-flash
✅ Ragas Custom LLM: gemini/gemini-2.0-flash
✅ Ragas LLM Manager configured
✅ DeepEval LLM Manager: gemini/gemini-2.0-flash
✅ Custom Metrics initialized: gemini/gemini-2.0-flash
✅ Evaluation Driver initialized

🔄 Running Evaluation...
🚀 Starting evaluation...

1️⃣ Validating data...
✅ All data validation passed

2️⃣ Processing conversations...

📋 Evaluating: authentication-0
🔄 Turn-level metrics: ['ragas:faithfulness']
    ragas:faithfulness (threshold: 0.8)
Evaluating:   0%|                                                                                                                                                                                                                                                   | 0/1 [00:00<?, ?it/s]15:20:12 - LiteLLM:INFO: utils.py:3338 -
LiteLLM completion() model= gemini-2.0-flash; provider = gemini
2025-09-09 15:20:12,224 - LiteLLM - INFO -
LiteLLM completion() model= gemini-2.0-flash; provider = gemini
15:20:13 - LiteLLM:INFO: utils.py:1274 - Wrapper: Completed Call, calling success_handler
2025-09-09 15:20:13,417 - LiteLLM - INFO - Wrapper: Completed Call, calling success_handler
15:20:13 - LiteLLM:INFO: utils.py:3338 -
LiteLLM completion() model= gemini-2.0-flash; provider = gemini
2025-09-09 15:20:13,421 - LiteLLM - INFO -
LiteLLM completion() model= gemini-2.0-flash; provider = gemini
15:20:15 - LiteLLM:INFO: utils.py:1274 - Wrapper: Completed Call, calling success_handler
2025-09-09 15:20:15,717 - LiteLLM - INFO - Wrapper: Completed Call, calling success_handler
Evaluating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.51s/it]
      ❌ FAIL: 0.750

📋 Evaluating: authentication-1
🔄 Turn-level metrics: ['ragas:faithfulness']
    ragas:faithfulness (threshold: 0.8)
Evaluating:   0%|                                                                                                                                                                                                                                                   | 0/1 [00:00<?, ?it/s]15:20:17 - LiteLLM:INFO: utils.py:3338 -
LiteLLM completion() model= gemini-2.0-flash; provider = gemini
2025-09-09 15:20:17,078 - LiteLLM - INFO -
LiteLLM completion() model= gemini-2.0-flash; provider = gemini
15:20:18 - LiteLLM:INFO: utils.py:1274 - Wrapper: Completed Call, calling success_handler
2025-09-09 15:20:18,351 - LiteLLM - INFO - Wrapper: Completed Call, calling success_handler
15:20:18 - LiteLLM:INFO: utils.py:3338 -
LiteLLM completion() model= gemini-2.0-flash; provider = gemini
2025-09-09 15:20:18,355 - LiteLLM - INFO -
LiteLLM completion() model= gemini-2.0-flash; provider = gemini
15:20:21 - LiteLLM:INFO: utils.py:1274 - Wrapper: Completed Call, calling success_handler
2025-09-09 15:20:21,479 - LiteLLM - INFO - Wrapper: Completed Call, calling success_handler
15:20:21 - LiteLLM:INFO: utils.py:3338 -
LiteLLM completion() model= gemini-2.0-flash; provider = gemini
2025-09-09 15:20:21,520 - LiteLLM - INFO -
LiteLLM completion() model= gemini-2.0-flash; provider = gemini
15:20:24 - LiteLLM:INFO: utils.py:1274 - Wrapper: Completed Call, calling success_handler
2025-09-09 15:20:24,529 - LiteLLM - INFO - Wrapper: Completed Call, calling success_handler
2025-09-09 15:20:24,566 - ragas.executor - ERROR - Exception raised in Job[0]: OutputParserException(Failed to parse NLIStatementOutput from completion {"statements": [{"statemexxxxxxxxxxxx. Got: 1 validation error for NLIStatementOutput
statements.6.verdict
  Field required [type=missing, input_value={'statement': 'This might...ly states: "This might'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.11/v/missing
For troubleshooting, visit: https://python.langchain.com/docs/troubleshooting/errors/OUTPUT_PARSING_FAILURE )
Evaluating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:07<00:00,  7.49s/it]
      ❌ FAIL: nan

✅ Evaluation complete: 2 results generated

📊 Generating Reports...
✅ Output handler initialized: eval_output

📊 Generating reports: evaluation_20250909_152025
Traceback (most recent call last):
  File "/home/lpiwowar/git_repositories/github.com/lightspeed-core/lightspeed-evaluation/.venv/bin/lightspeed-eval", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/lpiwowar/git_repositories/github.com/lightspeed-core/lightspeed-evaluation/src/lightspeed_evaluation/runner/evaluation.py", line 126, in main
    summary = run_evaluation(args.system_config, args.eval_data, args.output_dir)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lpiwowar/git_repositories/github.com/lightspeed-core/lightspeed-evaluation/src/lightspeed_evaluation/runner/evaluation.py", line 61, in run_evaluation
    output_handler.generate_reports(
  File "/home/lpiwowar/git_repositories/github.com/lightspeed-core/lightspeed-evaluation/src/lightspeed_evaluation/core/output/generator.py", line 44, in generate_reports
    calculate_detailed_stats(results)
  File "/home/lpiwowar/git_repositories/github.com/lightspeed-core/lightspeed-evaluation/src/lightspeed_evaluation/core/output/statistics.py", line 63, in calculate_detailed_stats
    _finalize_metric_stats(stats)
  File "/home/lpiwowar/git_repositories/github.com/lightspeed-core/lightspeed-evaluation/src/lightspeed_evaluation/core/output/statistics.py", line 122, in _finalize_metric_stats
    "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/statistics.py", line 922, in stdev
    return _float_sqrt_of_frac(mss.numerator, mss.denominator)
                               ^^^^^^^^^^^^^
AttributeError: 'float' object has no attribute 'numerator'
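
For context, a minimal sketch of why returning None instead of NaN avoids this crash: the aggregation can simply skip missing scores, whereas a NaN score reaching statistics.stdev is what produced the AttributeError above (illustrative code only, not the actual statistics.py):

import math
import statistics

results = [0.75, None]  # with the fix, the failed evaluation yields None instead of float('nan')

# Keep only real scores before computing aggregates.
scores = [s for s in results if s is not None and not math.isnan(s)]
std = statistics.stdev(scores) if len(scores) > 1 else 0.0
print(std)  # 0.0 -- report generation proceeds instead of raising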

@lpiwowar lpiwowar marked this pull request as ready for review September 9, 2025 13:50
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/metrics/ragas.py (1)

105-112: Broaden network OSError handling; avoid a single errno special-case.

Consider catching other transient network errno/winerror codes (e.g., 104/110 on POSIX; 10054/10060 on Windows) and emitting a unified “network/LLM connectivity” message.

 except OSError as e:
-    err_msg = f"Ragas {metric_name} evaluation failed: {str(e)}"
-    if e.errno == 32:  # Broken pipe
-        err_msg = (
-            f"Ragas {metric_name} evaluation failed due to broken pipe "
-            f"(network/LLM timeout): {str(e)}"
-        )
-    return None, err_msg
+    err = getattr(e, "errno", None)
+    winerr = getattr(e, "winerror", None)
+    # EPIPE(32), ECONNRESET(104), ETIMEDOUT(110), EHOSTUNREACH(113), ENETUNREACH(101)
+    network_errnos = {32, 104, 110, 113, 101}
+    # Windows: WSAECONNRESET(10054), WSAETIMEDOUT(10060)
+    network_winerr = {10054, 10060}
+    if err in network_errnos or winerr in network_winerr:
+        return None, (
+            f"Ragas {metric_name} evaluation failed due to network/LLM "
+            f"connectivity issue: {str(e)}"
+        )
+    return None, f"Ragas {metric_name} evaluation failed: {str(e)}"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 00d7182 and 4a138ce.

📒 Files selected for processing (1)
  • src/lightspeed_evaluation/core/metrics/ragas.py (3 hunks)
🔇 Additional comments (4)
src/lightspeed_evaluation/core/metrics/ragas.py (4)

3-3: Import for NaN checks — good.

This enables the downstream NaN guard. No issues.


95-98: Unifying return path via local result — good.

This makes room for centralized post-processing and error normalization.


124-125: Final return after checks — good.

Flow is clear and safe post-validation.


115-123: Ragas non-finite score guard in place – aggregates only append non-None scores, so no NaN/Inf can flow into statistics; no further changes required.

@lpiwowar lpiwowar changed the title from "Fix processing of float('NaN') values when OutputParserException" to "[LCORE-648] Fix processing of float('NaN') values when OutputParserException" on Sep 9, 2025
Collaborator

@asamal4 asamal4 left a comment


Thank you !! LGTM

The RAGAS framework returns NaN when it encounters malformed output from
the LLM. The malformed output is accompanied by an OutputParserException
in the logs, but this exception is caught internally.

The NaN causes later failure during the generation of statistics like
standard deviation at the end of the evaluation and ultimately causes
no results to be obtained from the evaluation when the malformed output
is encountered by RAGAS.

This commit fixes this issue by checking whether NaN was returned from
RAGAS and, if so, ensures that the evaluate() function returns None, as
in other cases of failure. This ensures that NaN does not reach the
computation of the final statistics.

Resolves: lightspeed-core#44
@lpiwowar lpiwowar force-pushed the lpiwowar/fix-output-parser-exception branch from 4a138ce to 05d0880 on September 12, 2025 at 09:52
Contributor

@tisnik tisnik left a comment


LGTM

@tisnik tisnik merged commit 64d6a14 into lightspeed-core:main Sep 12, 2025
15 checks passed


Development

Successfully merging this pull request may close these issues.

ragas:response_relevancy sometimes fails parsing output, causing evaluation crash

3 participants