9 changes: 2 additions & 7 deletions _search-plugins/search-relevance/compare-query-sets.md
@@ -2,20 +2,15 @@
layout: default
title: Comparing query sets
nav_order: 12
parent: Using Search Relevance Workbench
parent: Search Relevance Workbench
grand_parent: Search relevance
has_children: false
has_toc: false
---

# Comparing query sets

This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/17735).
{: .warning}

To compare the results of two different search configurations, you can run a pairwise experiment. To do this, you need two search configurations and a query set to run against them.

For more information about creating a query set, see [Query sets]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/query-sets/).

For more information about creating search configurations, see [Search configurations]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/search-configurations/).
@@ -48,7 +43,7 @@ Field | Data type | Description
`querySetId` | String | The query set ID.
`searchConfigurationList` | List | A list of search configuration IDs to use for comparison.
`size` | Integer | The number of documents to return in the results.
`type` | String | Defines the type of experiment to run. Valid values are `PAIRWISE_COMPARISON`, `HYBRID_OPTIMIZER`, or `POINTWISE_EVALUATION`. Depending on the experiment type, you must provide different body fields in the request. `PAIRWISE_COMPARISON` is for comparing two search configurations against a query set and is used [here]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/compare-query-sets/). `HYBRID_OPTIMIZER` is for combining results and is used [here]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/optimize-hybrid-search/). `POINTWISE_EVALUATION` is for evaluating a search configuration against judgments and is used [here]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/evaluate-search-quality/).
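For illustration, a pairwise comparison experiment request might look like the following sketch. The endpoint path and ID values are assumptions used for this example; substitute the IDs of your own query set and search configurations.

```json
PUT _plugins/_search_relevance/experiments
{
  "querySetId": "5f0115ad-94b9-403a-912f-3e762870ccf6",
  "searchConfigurationList": [
    "2f90d4fd-bd5e-450f-95bb-eabe4a740bd1",
    "9a8b7c6d-1234-4e5f-8a9b-0c1d2e3f4a5b"
  ],
  "size": 10,
  "type": "PAIRWISE_COMPARISON"
}
```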

The response contains the experiment ID of the created experiment:

5 changes: 2 additions & 3 deletions _search-plugins/search-relevance/compare-search-results.md
@@ -1,11 +1,10 @@
---
layout: default
title: Comparing single queries
nav_order: 11
parent: Using Search Relevance Workbench
nav_order: 10
parent: Search Relevance Workbench
grand_parent: Search relevance
has_children: false
has_toc: false
---

# Comparing single queries
@@ -1,18 +1,14 @@
---
layout: default
title: Comparing search results
nav_order: 10
parent: Using Search Relevance Workbench
nav_order: 11
parent: Search Relevance Workbench
grand_parent: Search relevance
has_children: true
has_toc: false
---

# Comparing search results

This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/17735).
{: .warning}

Comparing search results in OpenSearch Dashboards, also called a _pairwise experiment_, allows you to compare the results of multiple search configurations. Using this tool helps you assess how results change when different search configurations are applied to queries.

For example, you can see how results change when you apply one of the following query changes:
8 changes: 3 additions & 5 deletions _search-plugins/search-relevance/evaluate-search-quality.md
@@ -2,17 +2,13 @@
layout: default
title: Evaluating search quality
nav_order: 50
parent: Using Search Relevance Workbench
parent: Search Relevance Workbench
grand_parent: Search relevance
has_children: false
has_toc: false
---

# Evaluating search quality

This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/17735).
{: .warning}

Search Relevance Workbench can run pointwise experiments to evaluate search configuration quality using provided queries and relevance judgments.

For more information about creating a query set, see [Query sets]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/query-sets/).
@@ -210,3 +206,5 @@ The results include the original request parameters along with the following met
- `MAP@k`: The Mean Average Precision, which calculates the average precision across all documents. For more information, see [Average precision](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Average_precision).

- `NDCG@k`: The Normalized Discounted Cumulative Gain, which compares the actual ranking of results against a perfect ranking, with higher weights given to top results. This measures the quality of result ordering.
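For reference, NDCG@k is commonly defined as shown in the following formulas. This is general background for interpreting the scores rather than a specification of the plugin's exact implementation, which may use a different gain or normalization convention.

$$
\mathrm{DCG@k} = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \qquad \mathrm{NDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}},
$$

where $rel_i$ is the judged relevance of the document at rank $i$ and $\mathrm{IDCG@k}$ is the DCG of an ideally ordered result list, so a perfect ranking yields an NDCG of 1.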

To review these results visually, see [Exploring search evaluation results]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/explore-experiment-results/).
75 changes: 75 additions & 0 deletions _search-plugins/search-relevance/explore-experiment-results.md
@@ -0,0 +1,75 @@
---
layout: default
title: Exploring search evaluation results
nav_order: 65
parent: Search Relevance Workbench
grand_parent: Search relevance
has_children: false
---

# Exploring search evaluation results
Introduced 3.2
{: .label .label-purple }

In addition to retrieving experiment results using the API, you can explore the results visually. Search Relevance Workbench provides dashboards that you can install to review the results of search evaluation and hybrid search optimization experiments.

## Installing the dashboards

You can install the dashboards in one of the following ways:

* In the experiment overview, select the visualization icon in the **Actions** column.

* Select the **Install Dashboards** button in the upper-right corner of the experiment overview.

Both options are shown in the following image.

<img src="{{site.url}}{{site.baseurl}}/images/search-relevance-workbench/experiment_overview_dashboard_installation_options.png" alt="Experiment overview of the Search Relevance Workbench including dashboard installation options"/>{: .img-fluid }

**Reviewer comment:** Please minimize the number of screenshots. Modal screenshots are not necessary since they are self-explanatory. All screenshots must have an intro sentence ending with "as shown in the following image".

A modal appears in which you can install the dashboards, as shown in the following image.

<img src="{{site.url}}{{site.baseurl}}/images/search-relevance-workbench/install_dashboards_modal.png" alt="Modal to install dashboards"/>{: .img-fluid }

## Using the dashboards

After you install the dashboards, select the visualization icon in the **Actions** column of the experiment overview to open the experiment result dashboard. The view depends on the type of experiment you chose:

* The search evaluation dashboard focuses on the individual query level and provides insights into well-performing queries and queries with room for relevance improvement.

* The hybrid search dashboard provides an overview of how the different hybrid search parameter configurations performed and lets you identify candidate queries for further exploration and experimentation.

### Search evaluation dashboard

The search evaluation dashboard, shown in the following image, aggregates performance metrics across all queries in your selected experiment. Use the search evaluation dashboard to get a high-level view of overall experiment performance and identify the queries that need attention.

<img src="{{site.url}}{{site.baseurl}}/images/search-relevance-workbench/search_evaulation_dashboard.png" alt="Search evaluation dashboard with visualizations"/>{: .img-fluid }

The **Deep Dive Summary** pane shows the aggregate metrics for NDCG, MAP, precision, and coverage (see [Evaluating search quality]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/evaluate-search-quality/)).

The **Deep Dive Query Scores** pane shows individual query performance ranked by NDCG score (highest to lowest). Use this pane to identify your best- and worst-performing queries.

The **Deep Dive Score Densities** pane shows how metric values are distributed across your query set. Use this pane to understand whether poor performance is widespread or concentrated in specific queries. The x-axis shows metric values, while the y-axis shows how frequently those values occur.

The **Deep Dive Score Scatter Plot** pane shows an interactive view of the preceding distribution data, with each query shown as a separate point. Use this pane to investigate specific queries at performance extremes. Points are scattered vertically to prevent overlap while maintaining the same x-axis metric values as the preceding distribution view.

### Hybrid search evaluation dashboard

Use the hybrid search evaluation dashboard, shown in the following image, to compare experiment variants and identify the optimal parameter configurations for your hybrid experiment.

<img src="{{site.url}}{{site.baseurl}}/images/search-relevance-workbench/hybrid_search_optimizer_dashboard.png" alt="Hybrid search optimization evaluation dashboard with visualizations"/>{: .img-fluid }

The **Variant Performance Chart** shows your experiment variants arranged visually from best to worst performing (left to right, by decreasing NDCG). Use this chart to quickly identify your top-performing variants and view performance patterns across different parameter combinations at a glance.

The **Variant Performance** pane shows the same variant data in a sortable table format with all metrics visible. Use this pane to compare specific metric values across variants and customize your analysis by sorting on different performance measures. To sort by a column, select the column header.


### Customizing the dashboards

The dashboards are installed as saved objects. After installing them, you can edit the dashboards or clone and customize them to your specific requirements.

To learn how to customize the source files, see [Updating the default dashboards](https://github.com/opensearch-project/dashboards-search-relevance/blob/main/DEVELOPER_GUIDE.md#updating-default-dashboards).

### Resetting dashboards

To reset the dashboards, select the **Install Dashboards** button in the upper-right corner of the experiment overview. This reinstalls the dashboards.

18 changes: 8 additions & 10 deletions _search-plugins/search-relevance/judgments.md
@@ -2,17 +2,13 @@
layout: default
title: Judgments
nav_order: 8
parent: Using Search Relevance Workbench
parent: Search Relevance Workbench
grand_parent: Search relevance
has_children: false
has_toc: false
---

# Judgments

This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/17735).
{: .warning}

A judgment is a relevance rating assigned to a specific document in the context of a particular query. Multiple judgments are grouped together into judgment lists.
Typically, judgments are categorized into two types---implicit and explicit:

@@ -120,10 +116,10 @@ To use AI-assisted judgment generation, ensure that you have configured the foll
* A query set: Together with the `size` parameter, the query set defines the scope for generating judgments. For each query, the top k documents are retrieved from the specified index, where k is defined in the `size` parameter.
* A search configuration: A search configuration defines how documents are retrieved for use in query/document pairs.

The AI-assisted judgment process works as follows:
- For each query, the top k documents are retrieved using the defined search configuration, which includes the index information.
- Each query and retrieved document forms a query/document pair.
- The LLM is then called with a predefined prompt (stored as a static variable in the backend) to generate a judgment for each query/document pair.
- All generated judgments are stored in the judgments index for reuse in future experiments.

To create a judgment list, provide the model ID of the LLM, an available query set, and a created search configuration:
@@ -132,7 +128,7 @@ To create a judgment list, provide the model ID of the LLM, an available query s
```json
PUT _plugins/_search_relevance/judgments
{
"name":"COEC",
"name":"AI-assisted judgment list",
"type":"LLM_JUDGMENT",
"querySetId":"5f0115ad-94b9-403a-912f-3e762870ccf6",
"searchConfigurationList":["2f90d4fd-bd5e-450f-95bb-eabe4a740bd1"],
@@ -177,6 +173,8 @@ Parameter | Data type | Description
`clickModel` | String | The model used to calculate implicit judgments. Only `coec` (Clicks Over Expected Clicks) is supported.
`type` | String | Set to `UBI_JUDGMENT`.
`maxRank` | Integer | The maximum rank to consider when including events in the judgment calculation.
`startDate` | Date | The optional start date from which behavioral data events are considered for implicit judgment generation. The format is `yyyy-MM-dd`.
`endDate` | Date | The optional end date up to which behavioral data events are considered for implicit judgment generation. The format is `yyyy-MM-dd`.
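For illustration, an implicit judgment request using these parameters might look like the following sketch. The `name`, `maxRank`, and date values are placeholders; the optional date fields restrict which behavioral data events are considered.

```json
PUT _plugins/_search_relevance/judgments
{
  "name": "Implicit judgments from UBI data",
  "clickModel": "coec",
  "type": "UBI_JUDGMENT",
  "maxRank": 20,
  "startDate": "2025-01-01",
  "endDate": "2025-03-31"
}
```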

## Managing judgment lists

8 changes: 3 additions & 5 deletions _search-plugins/search-relevance/optimize-hybrid-search.md
@@ -2,17 +2,13 @@
layout: default
title: Optimizing hybrid search
nav_order: 60
parent: Using Search Relevance Workbench
parent: Search Relevance Workbench
grand_parent: Search relevance
has_children: false
has_toc: false
---

# Optimizing hybrid search

This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/17735).
{: .warning}

A key challenge of using hybrid search in OpenSearch is combining results from lexical and vector-based search effectively. OpenSearch provides different techniques and various parameters you can experiment with to find the best setup for your application. What works best, however, depends heavily on your data, user behavior, and application domain—there is no one-size-fits-all solution.

Search Relevance Workbench helps you systematically find the ideal set of parameters for your needs.
@@ -114,3 +110,5 @@ POST _plugins/_sql
}
```
{% include copy-curl.html %}

To review these results visually, see [Exploring search evaluation results]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/explore-experiment-results/).
60 changes: 52 additions & 8 deletions _search-plugins/search-relevance/query-sets.md
@@ -2,17 +2,13 @@
layout: default
title: Query sets
nav_order: 3
parent: Using Search Relevance Workbench
parent: Search Relevance Workbench
grand_parent: Search relevance
has_children: false
has_toc: false
---

# Query sets

This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/17735).
{: .warning}

A query set is a collection of queries. These queries are used in experiments for search relevance evaluation. Search Relevance Workbench offers different sampling techniques for creating query sets from real user data that adheres to the [User Behavior Insights (UBI)]({{site.url}}{{site.baseurl}}/search-plugins/ubi/schemas/) specification.
Additionally, Search Relevance Workbench allows you to import a query set.

@@ -37,10 +33,10 @@ The following table lists the available input parameters.

Field | Data type | Description
:--- | :--- | :---
`name` | String | The name of the query set.
`description` | String | A short description of the query set.
`name` | String | The name of the query set. The maximum length is 50 characters.
`description` | String | A short description of the query set. The maximum length is 250 characters.
`sampling` | String | Defines which sampler to use. Valid values are `pptss` (Probability-Proportional-to-Size-Sampling), `random`, `topn` (most frequent queries), and `manual`.
`querySetSize` | Integer | The target number of queries in the query set. Depending on the number of unique queries in `ubi_queries`, the resulting query set may contain fewer queries.
`querySetSize` | Integer | The target number of queries in the query set. Depending on the number of unique queries in `ubi_queries`, the resulting query set may contain fewer queries. Must be a positive integer.

### Example request: Sampling 20 queries with the Top N sampler

@@ -73,6 +69,54 @@ PUT _plugins/_search_relevance/query_sets
}
```

## Query set formats

Search Relevance Workbench supports two formats for query sets, each designed for different use cases. Both formats consist of a collection of user queries but differ in whether each query includes an expected answer.

* **Basic query set**: A list of user queries without any additional information. This is useful for general relevance testing where no specific answer is expected.

* **Query set with reference answers**: A list of user queries, in which each query is paired with its expected answer. This format is particularly useful for evaluating applications designed to provide a specific answer, such as question-answering systems.

### Fields

All query sets comprise one or more entries. Each entry is a JSON object containing the following fields.

| Field | Data type | Description |
| :--- | :--- | :--- |
| `queryText` | String | The user query string. Required. |
| `referenceAnswer` | String | The expected or correct answer to the user query. This field is used for generating judgments, especially with large language models (LLMs). Optional. |

### Basic query set example

A basic query set contains only the `queryText` field for each entry. It is suitable for general relevance tests where no single "correct" answer exists.

#### Example query set without reference answers

```json
{"queryText": "t towels kitchen"}
{"queryText": "table top bandsaw for metal"}
{"queryText": "tan strappy heels for women"}
{"queryText": "tank top plus size women"}
{"queryText": "tape and mudding tools"}
```

### Query set with reference answers example

This format includes the `referenceAnswer` field alongside the `queryText`. It is ideal for evaluating applications designed to provide specific answers, such as chatbots or question-answering systems.

#### Example query set with reference answers

```json
{"queryText": "What is the capital of France?", "referenceAnswer": "Paris"}
{"queryText": "Who wrote 'Romeo and Juliet'?", "referenceAnswer": "William Shakespeare"}
{"queryText": "What is the chemical symbol for water?", "referenceAnswer": "H2O"}
{"queryText": "What is the highest mountain in the world?", "referenceAnswer": "Mount Everest"}
{"queryText": "When was the first iPhone released?", "referenceAnswer": "June 29, 2007"}
```


The `referenceAnswer` field is particularly useful when using [LLMs to generate judgments]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/judgments/). The LLM can use the reference answer as ground truth to compare against the retrieved search results, allowing it to accurately score the relevance of the response.

## Managing query sets

You can retrieve or delete query sets using the following APIs.