---
title: "LLM Evaluation"
description: "Evaluate text outputs in under 5 minutes"
---
import CloudSignup from '/snippets/cloud_signup.mdx';
import CreateProject from '/snippets/create_project.mdx';
Need help? Ask on [Discord](https://discord.com/invite/xZjKRaNp8b).

This quickstart shows both local open-source and cloud workflows. You will run a simple evaluation in Python and explore results in Evidently Cloud.

### 1.2. Set up Evidently Cloud
<CloudSignup />
### 1.3. Installation and imports
Install the Evidently Python library:
```python
!pip install evidently[llm]
```
Components to run the evals:
```python
import pandas as pd
from evidently.future.datasets import Dataset
from evidently.future.datasets import DataDefinition
from evidently.future.datasets import Descriptor
from evidently.future.descriptors import *
from evidently.future.report import Report
from evidently.future.presets import TextEvals
from evidently.future.metrics import *
from evidently.future.tests import *
```
Components to connect with Evidently Cloud:
```python
from evidently.ui.workspace.cloud import CloudWorkspace
```
### 1.4. Create a Project
<CreateProject />
Let's create a toy demo chatbot dataset with "Questions" and "Answers".
```python
data = [
    ["What is the chemical symbol for gold?", "The chemical symbol for gold is Au."],
    ["What is the capital of Japan?", "The capital of Japan is Tokyo."],
    ["Tell me a joke.", "Why don't programmers like nature? It has too many bugs!"],
    ["What is the boiling point of water?", "The boiling point of water is 100 degrees Celsius (212 degrees Fahrenheit)."],
    ["Who painted the Mona Lisa?", "Leonardo da Vinci painted the Mona Lisa."],
    ["What’s the fastest animal on land?", "The cheetah is the fastest land animal, capable of running up to 75 miles per hour."],
    ["Can you help me with my math homework?", "I'm sorry, but I can't assist with homework."],
    ["How many states are there in the USA?", "There are 50 states in the USA."],
    ["What’s the primary function of the heart?", "The primary function of the heart is to pump blood throughout the body."],
    ["Can you tell me the latest stock market trends?", "I'm sorry, but I can't provide real-time stock market trends. You might want to check a financial news website or consult a financial advisor."]
]

columns = ["question", "answer"]
eval_df = pd.DataFrame(data, columns=columns)
# eval_df.head()
```
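To try the same flow on your own data, any pandas DataFrame with two text columns works. A minimal sketch, assuming a hypothetical `chatbot_logs.csv` file with "question" and "answer" columns:

```python
import pandas as pd

# "chatbot_logs.csv" is a hypothetical file; any source with "question"
# and "answer" text columns produces the same eval_df shape.
eval_df = pd.read_csv("chatbot_logs.csv")[["question", "answer"]]
eval_df.head()
```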
**Collecting live data**: you can also trace inputs and outputs from your LLM app, and download the dataset created from traces for evals. Check the Tracing Quickstart.
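As a rough sketch of what that tracing setup might look like, assuming the `tracely` package with its `init_tracing` and `trace_event` helpers (parameter names, the export name, and the `ask`/`my_llm_app` functions below are assumptions; follow the Tracing Quickstart for the exact setup):

```python
from tracely import init_tracing, trace_event

init_tracing(
    address="https://app.evidently.cloud",   # Evidently Cloud endpoint
    api_key="YOUR_EVIDENTLY_API_KEY",
    project_id="YOUR_PROJECT_ID",
    export_name="chatbot_traces",            # hypothetical trace dataset name
)

@trace_event()  # logs inputs and outputs of each call as a trace
def ask(question: str) -> str:
    return my_llm_app(question)  # placeholder for your own LLM call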
Let's evaluate all "Answers" for:

- **Sentiment**: from -1 for negative to 1 for positive.
- **Text length**: character count.
- **Denials**: detect if the chatbot denied an answer. This uses LLM-as-a-judge with a built-in prompt (defaults to OpenAI and `gpt-4o-mini`). It returns a label and an explanation.

Each such evaluation is a "descriptor". You can run built-in or custom evals.
No OpenAI key? There is an alternative below.

Set the OpenAI key as an environment variable. See the OpenAI docs.
```python
import os

os.environ["OPENAI_API_KEY"] = "YOUR KEY"
```
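If you prefer not to hard-code the key, one common pattern (an optional extra, not part of Evidently) is to keep it in a local `.env` file and load it with the `python-dotenv` package:

```python
# Optional alternative: pip install python-dotenv
from dotenv import load_dotenv

load_dotenv()  # reads OPENAI_API_KEY=... from .env into the environment
```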
Create an Evidently dataset and run all the evals:
```python
eval_dataset = Dataset.from_pandas(
    pd.DataFrame(eval_df),
    data_definition=DataDefinition(),
    descriptors=[
        Sentiment("answer", alias="Sentiment"),
        TextLength("answer", alias="Length"),
        DeclineLLMEval("answer", alias="Denials"),  # or IncludesWords("answer", words_list=['sorry', 'apologize'], alias="Denials")
    ],
)
```
This adds new scores directly to your source data. You can preview them locally in pandas:
```python
eval_dataset.as_dataframe()
```
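Since the result is a regular pandas DataFrame, you can slice it however you like. For example, a quick look at flagged rows; the column names follow the aliases set above, and the "DECLINE" label is what the built-in judge is expected to return, so verify both in your own output:

```python
scored_df = eval_dataset.as_dataframe()

# Answers flagged as denials (label value is an assumption -- check your output).
denied = scored_df[scored_df["Denials"] == "DECLINE"]
print(denied[["question", "answer"]])

# Longest answers first, to eyeball verbosity.
print(scored_df.sort_values("Length", ascending=False).head(3))
```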
Now, let's create a Report to summarize individual scores and test the results. We check that:

- Sentiment is non-negative (greater than or equal to 0).
- Text length is at most 150 symbols.
- There are no denials.
```python
report = Report([
    TextEvals(),
    MinValue(column="Sentiment", tests=[gte(0)]),
    MaxValue(column="Length", tests=[lte(150)]),
    CategoryCount(column="Denials", category="DECLINE", tests=[eq(0)]),  # or CategoryCount(column="Denials", category=True, tests=[eq(0)])
])

my_eval = report.run(eval_dataset, None)
```
Upload the results to Evidently Cloud:
```python
ws.add_run(project.id, my_eval, include_data=True)
```
**View the Report**. Go to [Evidently Cloud](https://app.evidently.cloud/), open your Project, navigate to "Reports" in the left menu and open the Report. You will see the score summary and the dataset with new descriptor columns.
**See Test results**. Inside the "Tests" tab of the Report, you will get a pass/fail summary:
![](/images/examples/llm_quickstart_tests.png)
**Explore**. For example, sort to find all answers with "Denials".
![](/images/examples/llm_quickstart_explore.png)
**Get a Dashboard**. As you run repeated evals, you may want to track the results in time. Go to the "Dashboard" tab in the left menu and enter the "Edit" mode. Add a new tab, and select the "Descriptors" template.
![](/images/examples/llm_quickstart_create_tab.gif)
You'll see a set of panels that show descriptor values. Each has a single data point. As you log ongoing evaluation results, you can track trends and set up alerts.
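For instance, each later batch of conversations could be scored with the same descriptors and uploaded to the same Project, so every run adds a point to these panels. A sketch, where `new_batches` is a hypothetical iterable of DataFrames with "question" and "answer" columns:

```python
for batch_df in new_batches:
    batch_dataset = Dataset.from_pandas(
        batch_df,
        data_definition=DataDefinition(),
        descriptors=[
            Sentiment("answer", alias="Sentiment"),
            TextLength("answer", alias="Length"),
            DeclineLLMEval("answer", alias="Denials"),
        ],
    )
    batch_eval = report.run(batch_dataset, None)
    ws.add_run(project.id, batch_eval, include_data=True)  # adds a point to each panel
```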
To preview the Report directly in an interactive Python environment like Jupyter notebook or Colab, simply call the result object:
```python
my_eval
```
This will show the summary Report with the overview of all scores.
![](/images/examples/llm_quickstart_report.png)
In the separate "Tests" tab, you'll see the pass/fail results for all Tests.
![](/images/examples/llm_quickstart_tests.png)
You can also view the results as a JSON or Python dictionary:
```python
# my_eval.json()
# my_eval.dict()
```
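For example, you could persist the raw output to a file, e.g. to attach it as a CI artifact. The file name below is arbitrary, and the exact payload structure may differ between Evidently versions:

```python
# json() returns a string; write it out as-is.
with open("llm_eval_results.json", "w") as f:
    f.write(my_eval.json())

results = my_eval.dict()  # same content as a Python dictionary
```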
Or save and open an HTML file externally:
```python
# my_eval.save_html("file.html")
```
See how you can configure LLM judges for custom criteria.