8 changes: 7 additions & 1 deletion lib/redirects.js
@@ -807,7 +807,7 @@ const mainDocsReorder202507 = [
"/docs/evaluation/features/prompt-experiments",
"/docs/evaluation/dataset-runs/native-run",
],
["/docs/evaluation/data-model", "/docs/evaluation/dataset-runs/data-model"],
["/docs/evaluation/experiments/data-model", "/docs/evaluation/dataset-runs/data-model"],
[
"/guides/cookbook/integration_mirascope",
"/integrations/frameworks/mirascope",
@@ -1090,6 +1090,11 @@ const sdkReferenceRedirects202512 = [
["/docs/sdk/typescript/guide#score", "/docs/evaluation/evaluation-methods/scores-via-sdk#ingesting-scores-via-apisdks"],
];

// Data model moved under experiments (Jan 2026)
const dataModelExperimentsRedirects202601 = [
["/docs/evaluation/data-model", "/docs/evaluation/experiments/data-model"],
];

const permanentRedirects = [
...selfHostFolders202508,
...newSelfHostingSection,
@@ -1110,6 +1115,7 @@ const permanentRedirects = [
...instanceManagementApiRedirects202512,
...evaluationMethodsRestructure202512,
...sdkReferenceRedirects202512,
...dataModelExperimentsRedirects202601,
];

module.exports = { nonPermanentRedirects, permanentRedirects };
2 changes: 1 addition & 1 deletion pages/blog/2025-09-05-automated-evaluations.mdx
@@ -62,7 +62,7 @@ For this guide, we will set up an evaluator for the "Out of Scope" failure mode.

## How to Measure [#how-to-measure]

In Langfuse, all evaluations are tracked as **Scores**, which [can be attached to traces, observations, sessions or dataset runs](/docs/evaluation/evaluation-methods/data-model). Evaluations in Langfuse can be set up in two main ways:
In Langfuse, all evaluations are tracked as **Scores**, which [can be attached to traces, observations, sessions or dataset runs](/docs/evaluation/experiments/data-model#scores). Evaluations in Langfuse can be set up in two main ways:

**In the Langfuse UI:** In Langfuse, you can set up **LLM-as-a-Judge Evaluators** that use another LLM to evaluate your application's output on subjective and nuanced criteria. These are easily configured directly in Langfuse. For a guide on setting them up in the UI, check the documentation on **[LLM-as-a-Judge evaluators](/docs/evaluation/evaluation-methods/llm-as-a-judge)**.

2 changes: 1 addition & 1 deletion pages/docs/administration/billable-units.mdx
@@ -5,7 +5,7 @@ description: Learn how billable units are calculated in Langfuse.

# Billable Units

Langfuse Cloud [pricing](/pricing) is based on the number of ingested units per billing period. Units are either [traces](/docs/observability/data-model#traces), [observations](/docs/observability/data-model#observations) or [scores](/docs/evaluation/evaluation-methods/data-model#scores).
Langfuse Cloud [pricing](/pricing) is based on the number of ingested units per billing period. Units are either [traces](/docs/observability/data-model#traces), [observations](/docs/observability/data-model#observations) or [scores](/docs/evaluation/experiments/data-model#scores).

`Units` = `Count of Traces` + `Count of Observations` + `Count of Scores`

2 changes: 1 addition & 1 deletion pages/docs/evaluation/_meta.tsx
@@ -1,6 +1,6 @@
export default {
overview: "Overview",
concepts: "Concepts",
"core-concepts": "Concepts",
"evaluation-methods": "Evaluation Methods",

experiments: "Experiments",
134 changes: 0 additions & 134 deletions pages/docs/evaluation/concepts.mdx

This file was deleted.

109 changes: 109 additions & 0 deletions pages/docs/evaluation/core-concepts.mdx
@@ -0,0 +1,109 @@
---
title: Concepts
description: Learn the fundamental concepts behind LLM evaluation in Langfuse - Scores, Evaluation Methods, Datasets, and Experiments.
---
# Core Concepts

This page covers the core concepts behind evaluation in Langfuse and what's available.

Ready to start?

- [Create a dataset](/docs/evaluation/experiments/datasets) to measure your LLM application's performance consistently
- [Run an experiment](/docs/evaluation/experiments/overview) to get an overview of how your application is doing
- [Set up a live evaluator](/docs/evaluation/evaluation-methods/llm-as-a-judge) to monitor your live traces


## The Evaluation Loop

Developing an LLM application typically involves a continuous loop of testing and monitoring.

**Experiments** let you test your application against a fixed dataset before you deploy. You run your new prompt or model against test cases, review the scores, iterate until the results look good, then deploy your changes.

**Online evaluation** scores live traces to catch issues in real traffic. When you find edge cases your dataset didn't cover, you add them back to your dataset so future experiments will catch them.

<Frame fullWidth>
![Online and offline evaluation loop](/images/docs/evaluation/online-offline-eval-loop.png)
</Frame>


> **Here's an example workflow** for building a customer support chatbot
> 1. You update your prompt to make responses less formal.
> 2. Before deploying, you run an **experiment**: test the new prompt against your dataset of customer questions.
> 3. You review the scores and outputs. The tone improved, but responses are longer and some miss important links.
> 4. You refine the prompt and run the experiment again.
> 5. The results look good now. You deploy the new prompt to production.
> 6. You monitor with **online evaluation** to catch any new edge cases.
> 7. You notice that a customer asked a question in French, but the bot responded in English.
> 8. You add this French query to your dataset so future experiments will catch this issue.
> 9. You update your prompt to support French responses and run another experiment.
>
> Over time, your dataset grows from a couple of examples to a diverse, representative set of real-world test cases.

## Experiments [#experiments]

An experiment runs your application against a dataset and evaluates the outputs. This is how you test changes before deploying to production.

### Definitions

Before diving into experiments, it's helpful to understand the building blocks in Langfuse: datasets, dataset items, tasks, evaluators, scores, and experiment runs.

| Object | Definition |
| --- | --- |
| **Dataset** | A collection of test cases (dataset items). You can run experiments on a dataset. |
| **Dataset item** | One item in a dataset. Each dataset item contains an input (the scenario to test) and, optionally, an expected output. |
| **Task** | The application code that you want to test in an experiment. The task runs on each dataset item, and its output is scored. |
| **Evaluator** | A function that scores experiment results. In the context of a Langfuse experiment, this can be a [deterministic check](/docs/evaluation/evaluation-methods/custom-scores), or [LLM-as-a-Judge](/docs/evaluation/evaluation-methods/llm-as-a-judge). |
| **Score** | The result of an evaluator. This can be numeric, categorical, or boolean. See [Scores](/docs/evaluation/experiments/data-model#scores) for more details. |
| **Experiment Run** | A single execution of your task against all items in a dataset, producing outputs (and scores). |

You can find the data model for these objects [here](/docs/evaluation/experiments/data-model).
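
To make these definitions concrete, here is a minimal TypeScript sketch of how these objects relate. The type and field names are illustrative only; they do not mirror the Langfuse SDK or API schema.

```ts
// Illustrative types only; names and fields are hypothetical,
// not the Langfuse SDK or API schema.
type DatasetItem = {
  input: unknown;            // the scenario to test
  expectedOutput?: unknown;  // optional ground truth
};

type Dataset = {
  name: string;
  items: DatasetItem[];
};

// The application code under test: produces an output per dataset item.
type Task = (item: DatasetItem) => Promise<unknown>;

// The result of an evaluator: numeric, categorical, or boolean.
type Score = {
  name: string;
  value: number | string | boolean;
  comment?: string;
};

// Scores an output against the dataset item it was produced from.
type Evaluator = (args: { item: DatasetItem; output: unknown }) => Promise<Score>;

// One execution of the task over all items, plus the resulting scores.
type ExperimentRun = {
  datasetName: string;
  results: { item: DatasetItem; output: unknown; scores: Score[] }[];
};
```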


### How these work together

This is what happens conceptually:

When you run an experiment on a given **dataset**, each **dataset item** is passed to the **task function** you defined. The task function is typically the part of your application you want to test, often an LLM call, and it produces an output for each dataset item. This process is called an **experiment run**. The resulting outputs, linked to their dataset items, are the **experiment results**.

Often, you want to score these experiment results using different **evaluators**. An evaluator takes the dataset item and the output produced by the task function and produces a score based on criteria you define. Together, these scores give you a complete picture of how your application performs across all test cases.
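
Continuing the illustrative sketch above, the conceptual flow of an experiment run looks roughly like this (a conceptual outline, not the SDK's actual experiment API):

```ts
// Conceptual sketch of an experiment run, using the illustrative
// types defined above (not the Langfuse SDK's actual experiment API).
async function runExperiment(
  dataset: Dataset,
  task: Task,
  evaluators: Evaluator[]
): Promise<ExperimentRun> {
  const results: ExperimentRun["results"] = [];

  for (const item of dataset.items) {
    // 1. The task (your application code) produces an output for the item.
    const output = await task(item);

    // 2. Each evaluator scores the output against the dataset item.
    const scores = await Promise.all(
      evaluators.map((evaluate) => evaluate({ item, output }))
    );

    results.push({ item, output, scores });
  }

  // The collected outputs and scores are the experiment results.
  return { datasetName: dataset.name, results };
}
```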

<Frame fullWidth>
![Experiments flow](/images/docs/evaluation/experiments-flow.png)
</Frame>

You can compare experiment runs to see if a new prompt version improves scores, or identify specific inputs where your application struggles. Based on these experiment results, you can decide whether the change is ready to be deployed to production.

You can find more details on how these objects link together under the hood on the [data model page](/docs/evaluation/experiments/data-model).


### Two ways to run experiments

You can **run experiments programmatically using the Langfuse SDK**. This gives you full control over the task, the evaluators, and how results are handled. [Learn more about running experiments via SDK](/docs/evaluation/run-via-sdk).
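
For illustration, a programmatic run can look roughly like the sketch below. It assumes the classic dataset-run pattern of the TypeScript SDK (`getDataset`, `item.link`, `score`); the dataset name, task, and evaluator are placeholders, and exact package names and signatures may differ between SDK versions, so follow the linked guide for the current API.

```ts
import { Langfuse } from "langfuse";

const langfuse = new Langfuse(); // reads LANGFUSE_* environment variables

// Placeholder task and evaluator; replace with your application code.
async function myTask(input: unknown): Promise<string> {
  return `echo: ${JSON.stringify(input)}`;
}
function myEvaluator(output: string, expected: unknown): number {
  return JSON.stringify(output) === JSON.stringify(expected) ? 1 : 0;
}

async function main() {
  // "my-dataset" and "my-experiment-run-1" are placeholder names.
  const dataset = await langfuse.getDataset("my-dataset");

  for (const item of dataset.items) {
    const trace = langfuse.trace({ name: "my-experiment" });
    const output = await myTask(item.input);
    trace.update({ input: item.input, output });

    // Link the trace to this dataset item under a named experiment run.
    await item.link(trace, "my-experiment-run-1");

    // Attach an evaluator score to the trace.
    langfuse.score({
      traceId: trace.id,
      name: "exact-match",
      value: myEvaluator(output, item.expectedOutput),
    });
  }

  // Flush pending events before the process exits.
  await langfuse.flushAsync();
}

main().catch(console.error);
```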

Another way is to **run experiments directly from the Langfuse interface** by selecting a dataset and prompt version. This is useful for quick iterations on prompts without writing code. [Learn more about running experiments via UI](/docs/evaluation/run-via-ui).


## Online Evaluation [#online-evaluation]

For online evaluation, evaluators are triggered on production traces to produce scores, which helps you catch issues as they happen.

Langfuse currently supports LLM-as-a-Judge and human annotation checks for online evaluation. [Deterministic checks are on the roadmap](https://github.com/orgs/langfuse/discussions/6087).


### Monitoring with dashboards

Langfuse offers dashboards to monitor your application's performance, including scores, in real time. You can find more details on how to use dashboards in the [custom dashboards documentation](/docs/metrics/features/custom-dashboards).


## Evaluation Methods

For both experiments and online evaluation, you can use a variety of evaluation methods to add [scores](/docs/evaluation/experiments/data-model#scores) to traces, observations, sessions, or dataset runs.


| Method | What | Use when |
| --- | --- | --- |
| [LLM-as-a-Judge](/docs/evaluation/evaluation-methods/llm-as-a-judge) | Use an LLM to evaluate outputs based on custom criteria | Subjective assessments at scale (tone, accuracy, helpfulness) |
| [Scores via UI](/docs/evaluation/evaluation-methods/scores-via-ui) | Manually add scores to traces directly in the Langfuse UI | Quick quality spot checks, reviewing individual traces |
| [Annotation Queues](/docs/evaluation/evaluation-methods/annotation-queues) | Structured human review workflows with customizable queues | Building ground truth, systematic labeling, team collaboration |
| [Scores via API/SDK](/docs/evaluation/evaluation-methods/scores-via-sdk) | Programmatically add scores using the Langfuse API or SDK | Custom evaluation pipelines, deterministic checks, automated workflows |
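
As a small illustration of the API/SDK method, attaching a score to an existing trace with the TypeScript SDK can look roughly like this sketch (the trace ID and score name are placeholders, and exact signatures may vary by SDK version):

```ts
import { Langfuse } from "langfuse";

const langfuse = new Langfuse(); // reads LANGFUSE_* environment variables

async function scoreTrace() {
  // "trace-id-123" and "contains-link" are placeholder values.
  langfuse.score({
    traceId: "trace-id-123",
    name: "contains-link",
    value: 1, // numeric; categorical (string) and boolean scores are also supported
    comment: "Response includes the required documentation link",
  });

  // Flush pending events before the process exits.
  await langfuse.flushAsync();
}

scoreTrace().catch(console.error);
```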
