
Automated evaluation (end-to-end) #69

@0ptim

Description

Using LangChain+

We could use LangChain+, so we don't need to code everything from scratch.
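As a rough illustration of what we would get off the shelf, LangChain's `langchain.evaluation` module already ships evaluators that grade an answer against a reference. A minimal sketch, assuming we use gpt-4 as the grading model (the question, prediction, and reference strings are placeholders):

```python
from langchain.chat_models import ChatOpenAI
from langchain.evaluation import load_evaluator

# gpt-4 as the grading model, as proposed below.
grading_llm = ChatOpenAI(model="gpt-4", temperature=0)

# "labeled_criteria" grades a prediction against a reference answer.
evaluator = load_evaluator("labeled_criteria", criteria="correctness", llm=grading_llm)

result = evaluator.evaluate_strings(
    input="Sample user question (placeholder)",
    prediction="What our agent answered (placeholder)",
    reference="The 'perfect answer' we wrote ourselves (placeholder)",
)
print(result)  # contains a score and the grader's reasoning
```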

Custom solution

The idea is that we need a way to track the agent's behavior. It's important to measure how well it does and in which cases it fails.

We need to be able to run the evaluation after making changes so we can measure the impact and make sure we don't introduce any performance regressions.

The evaluation should be done by a state-of-the-art LLM. For the time being, this would be gpt-4.

We need to:

  • Prepare a set of sample questions/inputs that could come from a user, covering the functionality we want to provide (a sketch of such a set follows below).
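
A sketch of what such a sample set could look like, kept in a simple JSON file so it is easy to extend (file name, schema, and the example questions are only a suggestion):

```python
import json

# eval_dataset.json -- file name and fields are only a suggestion.
SAMPLE_QUESTIONS = [
    {
        "id": 1,
        "question": "Example user question 1 (placeholder)",
        "perfect_answer": "Reference answer written by us (placeholder)",
    },
    {
        "id": 2,
        "question": "Example user question 2 (placeholder)",
        "perfect_answer": "Reference answer written by us (placeholder)",
    },
]

with open("eval_dataset.json", "w", encoding="utf-8") as f:
    json.dump(SAMPLE_QUESTIONS, f, indent=2, ensure_ascii=False)
```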

With this, we'll then:

  • Create a Python script that runs over these queries and uses the main-agent to work through them (sketched below).
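
A minimal sketch of that script, assuming the dataset file from above; `run_main_agent` is a hypothetical stand-in for however the main-agent is actually invoked in this repo:

```python
import json


def run_main_agent(question: str) -> str:
    """Hypothetical placeholder for the real main-agent call."""
    raise NotImplementedError


def run_eval_queries(dataset_path: str = "eval_dataset.json") -> list[dict]:
    with open(dataset_path, encoding="utf-8") as f:
        samples = json.load(f)

    results = []
    for sample in samples:
        answer = run_main_agent(sample["question"])
        results.append({**sample, "agent_answer": answer})
    return results


if __name__ == "__main__":
    for r in run_eval_queries():
        print(r["id"], r["agent_answer"][:80])
```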

To evaluate the agent's answers:

  • The input and output, as well as a “perfect answer” for reference, are passed to gpt-4, which evaluates how well the agent did (sketched below).
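
A hedged sketch of that grading step, calling the OpenAI chat API directly; the prompt wording and the 1-10 scale are just one possible choice:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

GRADER_PROMPT = """You are grading an AI agent.
Question: {question}
Agent answer: {agent_answer}
Reference ("perfect") answer: {perfect_answer}

Rate the agent answer from 1 (useless) to 10 (matches the reference),
then explain the rating in one or two sentences."""


def grade_answer(question: str, agent_answer: str, perfect_answer: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{
            "role": "user",
            "content": GRADER_PROMPT.format(
                question=question,
                agent_answer=agent_answer,
                perfect_answer=perfect_answer,
            ),
        }],
    )
    return response.choices[0].message.content
```

Persisting these scores per run would let us compare results before and after a change and spot regressions, as described above.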
