
[Bug]: JSON format errors when using built-in metrics: Hallucination and Answer Relevance #506

Closed
SrBliss opened this issue Oct 30, 2024 · 2 comments
Labels: bug (Something isn't working), work in progress

Comments

SrBliss commented Oct 30, 2024

Willingness to contribute

No. I can't contribute a fix for this bug at this time.

What component(s) are affected?

  • Python SDK
  • Opik UI
  • Opik Server
  • Documentation

Opik version

  • Opik version: 1.0.2

Describe the problem

When calling evaluate with the built-in metrics Hallucination or AnswerRelevance, I get a JSON format error and the evaluation fails.

Reproduction steps

Snippet:

from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination, AnswerRelevance

# dataset, evaluation_task and system_prompts are defined elsewhere in the notebook

# Define the metrics
hallucination_metric = Hallucination(name="Hallucination")
answerrelevance_metric = AnswerRelevance(name="AnswerRelevance")

SWEEP_ID = "03"

for i, prompt in enumerate(system_prompts):
    SYSTEM_PROMPT = prompt
    experiment_config = {"system_prompt": SYSTEM_PROMPT, "model": "gpt-3.5-turbo"}
    experiment_name = f"comet-chatbot-{SWEEP_ID}-{i}"

    res = evaluate(
        experiment_name=experiment_name,
        dataset=dataset,
        experiment_config=experiment_config,
        task=evaluation_task,
        scoring_metrics=[hallucination_metric,
                         answerrelevance_metric]
    )

Error:

Evaluation:   0%|          | 0/5 [00:00<?, ?it/s]OPIK: Failed to compute metric Hallucination. Score result will be marked as failed.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/opik/evaluation/metrics/llm_judges/hallucination/metric.py", line 121, in _parse_model_output
    dict_content = json.loads(content)
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/opik/evaluation/tasks_scorer.py", line 29, in _score_test_case
    result = metric.score(**score_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/opik/evaluation/metrics/llm_judges/hallucination/metric.py", line 87, in score
    return self._parse_model_output(model_output)
  File "/usr/local/lib/python3.10/dist-packages/opik/evaluation/metrics/llm_judges/hallucination/metric.py", line 130, in _parse_model_output
    raise exceptions.MetricComputationError(
opik.evaluation.metrics.exceptions.MetricComputationError: Failed hallucination detection
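
For context, a minimal sketch of the failure mode (the model_output value is hypothetical, and this is not Opik's actual parsing code): when the judge model wraps its answer in prose or a markdown fence, json.loads on the raw completion raises exactly this error.

import json

# Hypothetical completion: the JSON is wrapped in a markdown fence
model_output = '```json\n{"score": 0.0, "reason": "No hallucination found."}\n```'

try:
    json.loads(model_output)  # the leading backtick is not a valid JSON value
except json.JSONDecodeError as err:
    print(f"Parse failed: {err}")  # Expecting value: line 1 column 1 (char 0)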
SrBliss added the bug label Oct 30, 2024
Collaborator

jverre commented Oct 30, 2024

@SrBliss This is because the LLM does not consistently return valid JSON. I've opened a PR to add support for structured outputs: #506
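
For reference, enforcing structured outputs typically looks like the sketch below. This is an illustrative example against the OpenAI client (model name and prompts are placeholders), not the code from the PR.

import json
from openai import OpenAI

client = OpenAI()

# JSON mode guarantees the completion parses as JSON, avoiding the
# json.JSONDecodeError seen in the traceback above.
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Reply with a JSON object containing 'score' and 'reason'."},
        {"role": "user", "content": "Judge whether the answer below is hallucinated: ..."},
    ],
)

result = json.loads(response.choices[0].message.content)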

Collaborator

jverre commented Nov 3, 2024

We are now enforcing structured outputs in our evaluation metrics, so you shouldn't face this issue anymore.

@jverre jverre closed this as completed Nov 3, 2024