---
title: "LLM Evaluation"
description: "Evaluate text outputs in under 5 minutes"
---
import CloudSignup from '/snippets/cloud_signup.mdx';
import CreateProject from '/snippets/create_project.mdx';
Need help? Ask on [Discord](https://discord.com/invite/xZjKRaNp8b).

This quickstart shows both local open-source and cloud workflows. You will run a simple evaluation in Python and explore results in Evidently Cloud.

### 1.2. Set up Evidently Cloud
<CloudSignup />
### 1.3. Installation and imports
Install the Evidently Python library:
```python
!pip install evidently[llm]
```
Components to run the evals:
```python
import pandas as pd
from evidently.future.datasets import Dataset
from evidently.future.datasets import DataDefinition
from evidently.future.datasets import Descriptor
from evidently.future.descriptors import *
from evidently.future.report import Report
from evidently.future.presets import TextEvals
from evidently.future.metrics import *
from evidently.future.tests import *
```
Components to connect with Evidently Cloud:
```python
from evidently.ui.workspace.cloud import CloudWorkspace
```
### 1.4. Create a Project
<CreateProject />
Let's create a toy demo chatbot dataset with "Questions" and "Answers".
```python
data = [
    ["What is the chemical symbol for gold?", "The chemical symbol for gold is Au."],
    ["What is the capital of Japan?", "The capital of Japan is Tokyo."],
    ["Tell me a joke.", "Why don't programmers like nature? It has too many bugs!"],
    ["What is the boiling point of water?", "The boiling point of water is 100 degrees Celsius (212 degrees Fahrenheit)."],
    ["Who painted the Mona Lisa?", "Leonardo da Vinci painted the Mona Lisa."],
    ["What’s the fastest animal on land?", "The cheetah is the fastest land animal, capable of running up to 75 miles per hour."],
    ["Can you help me with my math homework?", "I'm sorry, but I can't assist with homework."],
    ["How many states are there in the USA?", "There are 50 states in the USA."],
    ["What’s the primary function of the heart?", "The primary function of the heart is to pump blood throughout the body."],
    ["Can you tell me the latest stock market trends?", "I'm sorry, but I can't provide real-time stock market trends. You might want to check a financial news website or consult a financial advisor."]
]

columns = ["question", "answer"]
eval_df = pd.DataFrame(data, columns=columns)
# eval_df.head()
```
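To try the same flow on your own data, any pandas DataFrame with two text columns works. A minimal sketch, assuming a hypothetical `chatbot_logs.csv` file with "question" and "answer" columns:

```python
import pandas as pd

# "chatbot_logs.csv" is a hypothetical file; any source with "question"
# and "answer" text columns produces the same eval_df shape.
eval_df = pd.read_csv("chatbot_logs.csv")[["question", "answer"]]
eval_df.head()
```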
**Collecting live data**: you can also trace inputs and outputs from your LLM app, and download the dataset created from traces for evals. Check the Tracing Quickstart.
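As a rough sketch of what that tracing setup might look like, assuming the `tracely` package with its `init_tracing` and `trace_event` helpers (parameter names, the export name, and the `ask`/`my_llm_app` functions below are assumptions; follow the Tracing Quickstart for the exact setup):

```python
from tracely import init_tracing, trace_event

init_tracing(
    address="https://app.evidently.cloud",   # Evidently Cloud endpoint
    api_key="YOUR_EVIDENTLY_API_KEY",
    project_id="YOUR_PROJECT_ID",
    export_name="chatbot_traces",            # hypothetical trace dataset name
)

@trace_event()  # logs inputs and outputs of each call as a trace
def ask(question: str) -> str:
    return my_llm_app(question)  # placeholder for your own LLM call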
Let's evaluate all "Answers" for:

- **Sentiment**: from -1 for negative to 1 for positive.
- **Text length**: character count.
- **Denials**: detect if the chatbot denied an answer. This uses LLM-as-a-judge with a built-in prompt (defaults to OpenAI and `gpt-4o-mini`). It returns a label and an explanation.

Each such evaluation is a "descriptor". You can run built-in or custom evals.
No OpenAI key? There is an alternative below.

Set the OpenAI key as an environment variable. See the OpenAI docs.
```python
import os

os.environ["OPENAI_API_KEY"] = "YOUR KEY"
```
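If you prefer not to hard-code the key, one common pattern (an optional extra, not part of Evidently) is to keep it in a local `.env` file and load it with the `python-dotenv` package:

```python
# Optional alternative: pip install python-dotenv
from dotenv import load_dotenv

load_dotenv()  # reads OPENAI_API_KEY=... from .env into the environment
```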
Create an Evidently dataset and run all the evals:
```python
eval_dataset = Dataset.from_pandas(
    pd.DataFrame(eval_df),
    data_definition=DataDefinition(),
    descriptors=[
        Sentiment("answer", alias="Sentiment"),
        TextLength("answer", alias="Length"),
        DeclineLLMEval("answer", alias="Denials"),  # or IncludesWords("answer", words_list=['sorry', 'apologize'], alias="Denials")
    ],
)
```
This adds new scores directly to your source data. You can preview them locally in pandas:
```python
eval_dataset.as_dataframe()
```
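Since the result is a regular pandas DataFrame, you can slice it however you like. For example, a quick look at flagged rows; the column names follow the aliases set above, and the "DECLINE" label is what the built-in judge is expected to return, so verify both in your own output:

```python
scored_df = eval_dataset.as_dataframe()

# Answers flagged as denials (label value is an assumption -- check your output).
denied = scored_df[scored_df["Denials"] == "DECLINE"]
print(denied[["question", "answer"]])

# Longest answers first, to eyeball verbosity.
print(scored_df.sort_values("Length", ascending=False).head(3))
```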
Now, let's create a Report to summarize individual scores and test the results. We check that:

- Sentiment is non-negative (greater than or equal to 0).
- Text length is at most 150 symbols.
- There are no denials.
```python
report = Report([
    TextEvals(),
    MinValue(column="Sentiment", tests=[gte(0)]),
    MaxValue(column="Length", tests=[lte(150)]),
    CategoryCount(column="Denials", category="DECLINE", tests=[eq(0)]),  # or CategoryCount(column="Denials", category=True, tests=[eq(0)])
])

my_eval = report.run(eval_dataset, None)
```
Upload the results to Evidently Cloud:
```python
ws.add_run(project.id, my_eval, include_data=True)
```
**View the Report**. Go to [Evidently Cloud](https://app.evidently.cloud/), open your Project, navigate to "Reports" in the left menu and open the Report. You will see the score summary and the dataset with new descriptor columns.
**See Test results**. Inside the "Tests" tab of the Report, you will get a pass/fail summary:
![](/images/examples/llm_quickstart_tests.png)
**Explore**. For example, sort to find all answers with "Denials".
![](/images/examples/llm_quickstart_explore.png)
**Get a Dashboard**. As you run repeated evals, you may want to track the results in time. Go to the "Dashboard" tab in the left menu and enter the "Edit" mode. Add a new tab, and select the "Descriptors" template.
![](/images/examples/llm_quickstart_create_tab.gif)
You'll see a set of panels that show descriptor values. Each has a single data point. As you log ongoing evaluation results, you can track trends and set up alerts.
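For instance, each later batch of conversations could be scored with the same descriptors and uploaded to the same Project, so every run adds a point to these panels. A sketch, where `new_batches` is a hypothetical iterable of DataFrames with "question" and "answer" columns:

```python
for batch_df in new_batches:
    batch_dataset = Dataset.from_pandas(
        batch_df,
        data_definition=DataDefinition(),
        descriptors=[
            Sentiment("answer", alias="Sentiment"),
            TextLength("answer", alias="Length"),
            DeclineLLMEval("answer", alias="Denials"),
        ],
    )
    batch_eval = report.run(batch_dataset, None)
    ws.add_run(project.id, batch_eval, include_data=True)  # adds a point to each panel
```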
To preview the Report directly in an interactive Python environment like Jupyter notebook or Colab, simply call the result object:
```python
my_eval
```
This will show the summary Report with the overview of all scores.
![](/images/examples/llm_quickstart_report.png)
In the separate "Tests" tab, you'll see the pass/fail results for all Tests.
![](/images/examples/llm_quickstart_tests.png)
You can also view the results as a JSON or Python dictionary:
```python
# my_eval.json()
# my_eval.dict()
```
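For example, you could persist the raw output to a file, e.g. to attach it as a CI artifact. The file name below is arbitrary, and the exact payload structure may differ between Evidently versions:

```python
# json() returns a string; write it out as-is.
with open("llm_eval_results.json", "w") as f:
    f.write(my_eval.json())

results = my_eval.dict()  # same content as a Python dictionary
```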
Or save and open an HTML file externally:
```python
# my_eval.save_html("file.html")
```
See how you can configure LLM judges for custom criteria.