In entity recognition tasks, the user needs to evaluate multiple outputs. For instance, say the task is to extract an author and a date from a PDF. The user would create an LLM application that returns a JSON object with an author field and a date field.
To evaluate this application, they would like to create three evaluators: one that evaluates the precision of the whole prediction (the fraction of fields that have been predicted correctly), one that evaluates the prediction for the author, and a last one for the date.
The challenge arises when the test set contains the correct answers in two different columns: one for the author and one for the date.
The user would want to create an evaluator by specifying the field in the LLM output and the column in the test set that holds the correct answer. Currently this is not possible, since the evaluator worker only has access to the correct_answer column.
This issue is not only relevant for entity recognition tasks, but also for other evaluators such as RAGAS, which require a context column in addition to the ground truth. For these cases, we need to provide the evaluator function with a custom list of columns.
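To make the limitation concrete, here is a minimal sketch of the scenario (column names, field names, and function names are hypothetical, not the actual Agenta schema):

```python
# Hypothetical test-set row: the ground truth is spread over several columns.
test_set_row = {
    "pdf_text": "...the extracted text of the PDF...",  # input to the LLM application
    "author": "Jane Doe",                               # ground truth for the author field
    "date": "2023-05-12",                                # ground truth for the date field
}

# Output of the LLM application: a JSON object with one field per entity.
app_output = {"author": "Jane Doe", "date": "2023-06-12"}

# The "whole prediction" evaluator needs *both* ground-truth columns:
def field_precision(output: dict, row: dict, fields=("author", "date")) -> float:
    """Fraction of the fields that were predicted correctly."""
    correct = sum(output.get(f) == row.get(f) for f in fields)
    return correct / len(fields)

# The per-field evaluators each need a *different* column:
def exact_match(output: dict, row: dict, field: str) -> float:
    return float(output.get(field) == row.get(field))

# Today the evaluator worker only receives row["correct_answer"], so none of
# the three evaluators above can be expressed against the current interface.
```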
1. The user will specify the evaluator config settings for each ground-truth column (as sketched below),
2. and we would need to create an evaluator config for each ground-truth column.
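A rough sketch of what such per-column evaluator configs could look like (the keys below are assumptions for illustration, not the existing evaluator-config schema):

```python
# One evaluator config per ground-truth column (hypothetical settings keys).
evaluator_configs = [
    {
        "name": "author_match",
        "evaluator_key": "auto_exact_match",
        "settings_values": {
            "ground_truth_column": "author",   # column in the test set
            "output_field": "author",          # field in the LLM output JSON
        },
    },
    {
        "name": "date_match",
        "evaluator_key": "auto_exact_match",
        "settings_values": {
            "ground_truth_column": "date",
            "output_field": "date",
        },
    },
]
```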
@aakrem, the RAG evaluators don't have one correct-answer column but multiple columns. So in the end there is no single correct_answer column that the user would label in the test set.
Same for 2: the RAG evaluator does not have one ground_truth column but multiple.
I think you are trying to keep the logic in the evaluators as it is while changing the rest. I don't think that can work. I think the solution is to give the evaluators access to all the columns in the test set, not only one correct_answer value/column.
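A minimal sketch of what that change could look like for the evaluator functions (the signatures below are assumptions about a possible design, not the current agenta-backend code):

```python
# Current shape (simplified): the worker hands each evaluator a single string.
def evaluate_old(app_output: str, correct_answer: str, settings: dict) -> float:
    return float(app_output == correct_answer)

# Proposed shape: the evaluator receives the full test-set row, and the settings
# say which column(s) and which output field it should compare.
def evaluate_new(app_output: dict, data_point: dict, settings: dict) -> float:
    column = settings["ground_truth_column"]    # e.g. "author", "date", "context"
    field = settings.get("output_field")        # e.g. "author" in the output JSON
    predicted = app_output.get(field) if field else app_output
    return float(predicted == data_point[column])
```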
After finishing this issue, we will be able to create evaluators that reference any column in the test set (for example, separate ground-truth columns for the author and the date, or a context column for RAGAS).
Description:
Design a solution to enable the evaluators (in agenta-backend/agenta_backend/resources/evaluators/evaluators.py) to access any column in the test set.
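One possible shape for the solution, sketched below: instead of extracting a single correct_answer value, the evaluation worker passes the whole test-set row to each evaluator and lets the evaluator's settings decide which columns to read. All function, registry, and parameter names here are assumptions for illustration, not the actual agenta-backend code:

```python
from typing import Callable, Dict

# Hypothetical registry of evaluator functions that all accept the full row.
EVALUATORS: Dict[str, Callable[[dict, dict, dict], float]] = {}

def run_evaluator(app_output: dict, data_point: dict, evaluator_config: dict) -> float:
    """Worker side: forward the entire test-set row, not just row['correct_answer']."""
    evaluate = EVALUATORS[evaluator_config["evaluator_key"]]
    return evaluate(app_output, data_point, evaluator_config["settings_values"])

# An evaluator that needs several columns (e.g. a RAGAS-style metric) can then
# pick whatever it needs out of the row:
def ragas_style_evaluator(app_output: dict, data_point: dict, settings: dict) -> float:
    ground_truth = data_point[settings["ground_truth_column"]]
    context = data_point[settings["context_column"]]
    # ...compute the metric from (app_output, ground_truth, context)...
    raise NotImplementedError
```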