
Commit 0ad53ab

wrigleyDan, epugh, kolchfa-aws, and natebower authored
update Search Relevance Workbench docs for 3.2 (#10514)
* add query set size validation information
* add explanation of different types of query sets (with and without reference texts)
* add experiment dashboards to docs
* add `startDate` and `endDate` parameters to implicit judgment generation
* change misleading name in LLM_JUDGMENT example
* update navigation structure in left nav
* Update _search-plugins/search-relevance/explore-experiment-results.md
* remove experimental tag
* add section on reinstalling dashboards and rework section on updating dashboards
* Apply suggestions from code review

---------

Signed-off-by: wrigleyDan <dwrigley@opensourceconnections.com>
Signed-off-by: Daniel Wrigley <54574577+wrigleyDan@users.noreply.github.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: Nathan Bower <nbower@amazon.com>
Co-authored-by: Eric Pugh <epugh@opensourceconnections.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
1 parent f1c94c0 commit 0ad53ab

15 files changed, +148 -53 lines changed

_search-plugins/search-relevance/compare-query-sets.md

Lines changed: 2 additions & 7 deletions
@@ -2,20 +2,15 @@
 layout: default
 title: Comparing query sets
 nav_order: 12
-parent: Using Search Relevance Workbench
+parent: Search Relevance Workbench
 grand_parent: Search relevance
 has_children: false
-has_toc: false
 ---
 
 # Comparing query sets
 
-This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/17735).
-{: .warning}
-
 To compare the results of two different search configurations, you can run a pairwise experiment. To achieve this, you need two search configurations and a query set to use for the search configuration.
 
-
 For more information about creating a query set, see [Query Sets]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/query-sets/).
 
 For more information about creating search configurations, see [Search Configurations]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/search-configurations/).
@@ -48,7 +43,7 @@ Field | Data type | Description
 `querySetId` | String | The query set ID.
 `searchConfigurationList` | List | A list of search configuration IDs to use for comparison.
 `size` | Integer | The number of documents to return in the results.
-`type` | String | Defines the type of experiment to run. Valid values are `PAIRWISE_COMPARISON`, `HYBRID_OPTIMIZER`, or `POINTWISE_EVALUATION`. Depending on the experiment type, you must provide different body fields in the request. `PAIRWISE_COMPARISON` is for comparing two search configurations against a query set and is used [here]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/compare-query-sets/). `HYBRID_OPTIMIZER` is for combining results and is used [here]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/optimize-hybrid-search/). `POINTWISE_EVALUATION` is for evaluating a search configuration against judgments and is used [here]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/evaluate-search-quality/).
+`type` | String | Defines the type of experiment to run. Valid values are `PAIRWISE_COMPARISON`, `HYBRID_OPTIMIZER`, or `POINTWISE_EVALUATION`. Depending on the experiment type, you must provide different body fields in the request. `PAIRWISE_COMPARISON` is for comparing two search configurations against a query set and is used [here]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/compare-query-sets/). `HYBRID_OPTIMIZER` is for combining results and is used [here]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/optimize-hybrid-search/). `POINTWISE_EVALUATION` is for evaluating a search configuration against judgments and is used [here]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/evaluate-search-quality/).
 
 The response contains the experiment ID of the created experiment:
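For reference, a minimal `PAIRWISE_COMPARISON` request built from the fields in this table might look like the following sketch. The `_plugins/_search_relevance/experiments` endpoint and the IDs shown here are illustrative assumptions rather than values taken from this diff.

```json
PUT _plugins/_search_relevance/experiments
{
  "querySetId": "5f0115ad-94b9-403a-912f-3e762870ccf6",
  "searchConfigurationList": [
    "2f90d4fd-bd5e-450f-95bb-eabe4a740bd1",
    "9a8c4e21-7c3d-44d1-a0f2-8d1f6f0f1b2a"
  ],
  "size": 10,
  "type": "PAIRWISE_COMPARISON"
}
```

The response would then contain the experiment ID referenced in the surrounding context line.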

_search-plugins/search-relevance/compare-search-results.md

Lines changed: 2 additions & 3 deletions
@@ -1,11 +1,10 @@
 ---
 layout: default
 title: Comparing single queries
-nav_order: 11
-parent: Using Search Relevance Workbench
+nav_order: 10
+parent: Search Relevance Workbench
 grand_parent: Search relevance
 has_children: false
-has_toc: false
 ---
 
 # Comparing single queries

_search-plugins/search-relevance/comparing-search-results.md

Lines changed: 2 additions & 6 deletions
@@ -1,18 +1,14 @@
 ---
 layout: default
 title: Comparing search results
-nav_order: 10
-parent: Using Search Relevance Workbench
+nav_order: 11
+parent: Search Relevance Workbench
 grand_parent: Search relevance
 has_children: true
-has_toc: false
 ---
 
 # Comparing search results
 
-This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/17735).
-{: .warning}
-
 Comparing search results, also called a _pairwise experiment_, in OpenSearch Dashboards allows you to compare results of multiple search configurations. Using this tool helps assess how results change when applying different search configurations to queries.
 
 For example, you can see how results change when you apply one of the following query changes:

_search-plugins/search-relevance/evaluate-search-quality.md

Lines changed: 3 additions & 5 deletions
@@ -2,17 +2,13 @@
 layout: default
 title: Evaluating search quality
 nav_order: 50
-parent: Using Search Relevance Workbench
+parent: Search Relevance Workbench
 grand_parent: Search relevance
 has_children: false
-has_toc: false
 ---
 
 # Evaluating search quality
 
-This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/17735).
-{: .warning}
-
 Search Relevance Workbench can run pointwise experiments to evaluate search configuration quality using provided queries and relevance judgments.
 
 For more information about creating a query set, see [Query sets]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/query-sets/).
@@ -210,3 +206,5 @@ The results include the original request parameters along with the following met
 - `MAP@k`: The Mean Average Precision, which calculates the average precision across all documents. For more information, see [Average precision](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Average_precision).
 
 - `NDCG@k`: The Normalized Discounted Cumulative Gain, which compares the actual ranking of results against a perfect ranking, with higher weights given to top results. This measures the quality of result ordering.
+
+To review these results visually, see [Exploring search evaluation results]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/explore-experiment-results/).
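Because the metric descriptions above are prose-only, a standard formulation of NDCG at cutoff k is sketched below; the exact gain variant used by Search Relevance Workbench is not stated in this diff, so treat it as illustrative.

```latex
\mathrm{DCG@k} = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}
\qquad
\mathrm{NDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}}
```

Here, `rel_i` is the judged relevance of the document at rank `i`, and IDCG@k is the DCG of an ideal reordering of the judged documents, so NDCG@k equals 1 when the top k results are perfectly ranked.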
_search-plugins/search-relevance/explore-experiment-results.md

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
+---
+layout: default
+title: Exploring search evaluation results
+nav_order: 65
+parent: Search Relevance Workbench
+grand_parent: Search relevance
+has_children: false
+---
+
+# Exploring search evaluation results
+Introduced 3.2
+{: .label .label-purple }
+
+In addition to retrieving the experiment results using the API, you can explore the results visually. The Search Relevance Workbench comes with dashboards that you can install to review search evaluation and hybrid search optimization experiment results.
+
+## Installing the dashboards
+
+You can install the dashboards in one of the following ways:
+
+* In the **Actions** column, select a visualization icon in the experiment overview.
+
+* Select the **Install Dashboards** button in the upper-right corner of the experiment overview.
+
+<img src="{{site.url}}{{site.baseurl}}/images/search-relevance-workbench/experiment_overview_dashboard_installation_options.png" alt="Experiment overview of the Search Relevance Workbench including dashboard installation options"/>{: .img-fluid }
+
+The modal offers to install the dashboards for the user.
+
+<img src="{{site.url}}{{site.baseurl}}/images/search-relevance-workbench/install_dashboards_modal.png" alt="Modal to install dashboards"/>{: .img-fluid }
+
+## Using the dashboards
+
+Once you install the dashboards, in the **Actions** column, select the visualization icon in the experiment overview. This opens the experiment result dashboard. The view presented depends on the type of experiment you chose:
+
+* The search evaluation dashboard focuses on the individual query level and provides insights about well-performing queries and queries with open relevance potential.
+
+* The hybrid search dashboard provides an overview of how the different hybrid search parameter configurations performed and lets you identify candidate queries for further exploration and experimentation.
+
+### Search evaluation dashboard
+
+The search evaluation dashboard, shown in the following image, aggregates performance metrics across all queries in your selected experiment. Use the search evaluation dashboard to get a high-level view of overall experiment performance and identify the queries that need attention.
+
+<img src="{{site.url}}{{site.baseurl}}/images/search-relevance-workbench/search_evaulation_dashboard.png" alt="Search evaluation dashboard with visualizations"/>{: .img-fluid }
+
+The **Deep Dive Summary** panel shows the aggregate metrics for NDCG, MAP, precision, and coverage (see [Evaluating search quality]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/search-configurations/)).
+
+The **Deep Dive Query Scores** pane shows individual query performance ranked by NDCG score (highest to lowest). Use this pane to identify your best- and worst-performing queries.
+
+The **Deep Dive Score Densities** pane shows how metric values are distributed across your query set. Use this pane to understand whether poor performance is widespread or concentrated in specific queries. The x-axis shows metric values, while the y-axis shows how frequently those values occur.
+
+The **Deep Dive Score Scatter Plot** pane shows an interactive view of the preceding distribution data, with each query shown as a separate point. Use this pane to investigate specific queries at performance extremes. Points are scattered vertically to prevent overlap while maintaining the same x-axis metric values as the preceding distribution view.
+
+### Hybrid search evaluation dashboard
+
+Use the hybrid search evaluation dashboard, shown in the following image, to compare experiment variants and identify the optimal parameter configurations for your hybrid experiment.
+
+<img src="{{site.url}}{{site.baseurl}}/images/search-relevance-workbench/hybrid_search_optimizer_dashboard.png" alt="Hybrid search optimization evaluation dashboard with visualizations"/>{: .img-fluid }
+
+The **Variant Performance Chart** shows your experiment variants arranged visually from best to worst performing (left to right, by decreasing NDCG). Use this chart to quickly identify your top-performing queries and view performance patterns across different parameter combinations at a glance.
+
+The **Variant Performance** pane shows the same variant data in a sortable table format with all metrics visible. Use this pane to compare specific metric values across variants and customize your analysis by sorting on different performance measures. To sort by a column, select the column header.
+
+
+### Customizing the dashboards
+
+The dashboards are installed as saved objects. After installing them, you can edit the dashboards or clone and customize them to your specific requirements.
+
+To learn how to customize the source files, see [Updating the default dashboards](https://github.com/opensearch-project/dashboards-search-relevance/blob/main/DEVELOPER_GUIDE.md#updating-default-dashboards).
+
+### Resetting dashboards
+
+To reset the dashboards, select the **Install Dashboards** button in the upper-right corner of the experiment overview. This will reinstall the dashboards.
+
+
+
+
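Because the dashboards above are installed as OpenSearch Dashboards saved objects, you can also list or export them through the standard saved objects API of the Dashboards server. The following is a sketch only; the `search` term is a placeholder assumption, and the request goes to the Dashboards host rather than the OpenSearch REST API.

```json
GET api/saved_objects/_find?type=dashboard&search=relevance
```

Exported saved objects can be modified and reimported if you prefer to manage customized copies alongside the plugin-provided defaults.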

_search-plugins/search-relevance/judgments.md

Lines changed: 8 additions & 10 deletions
@@ -2,17 +2,13 @@
 layout: default
 title: Judgments
 nav_order: 8
-parent: Using Search Relevance Workbench
+parent: Search Relevance Workbench
 grand_parent: Search relevance
 has_children: false
-has_toc: false
 ---
 
 # Judgments
 
-This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/17735).
-{: .warning}
-
 A judgment is a relevance rating assigned to a specific document in the context of a particular query. Multiple judgments are grouped together into judgment lists.
 Typically, judgments are categorized into two types---implicit and explicit:
 
@@ -120,10 +116,10 @@ To use AI-assisted judgment generation, ensure that you have configured the foll
 * A query set: Together with the `size` parameter, the query set defines the scope for generating judgments. For each query, the top k documents are retrieved from the specified index, where k is defined in the `size` parameter.
 * A search configuration: A search configuration defines how documents are retrieved for use in query/document pairs.
 
-The AI-assisted judgment process works as follows:
-- For each query, the top k documents are retrieved using the defined search configuration, which includes the index information. The query and each document from the result list create a query/document pair.
-- Each query and document pair forms a query/document pair.
-- The LLM is then called with a predefined prompt (stored as a static variable in the backend) to generate a judgment for each query/document pair.
+The AI-assisted judgment process works as follows:
+- For each query, the top k documents are retrieved using the defined search configuration, which includes the index information. The query and each document from the result list create a query/document pair.
+- Each query and document pair forms a query/document pair.
+- The LLM is then called with a predefined prompt (stored as a static variable in the backend) to generate a judgment for each query/document pair.
 - All generated judgments are stored in the judgments index for reuse in future experiments.
 
 To create a judgment list, provide the model ID of the LLM, an available query set, and a created search configuration:
@@ -132,7 +128,7 @@ To create a judgment list, provide the model ID of the LLM, an available query s
 ```json
 PUT _plugins/_search_relevance/judgments
 {
-  "name":"COEC",
+  "name":"AI-assisted judgment list",
   "type":"LLM_JUDGMENT",
   "querySetId":"5f0115ad-94b9-403a-912f-3e762870ccf6",
   "searchConfigurationList":["2f90d4fd-bd5e-450f-95bb-eabe4a740bd1"],
@@ -177,6 +173,8 @@ Parameter | Data type | Description
 `clickModel` | String | The model used to calculate implicit judgments. Only `coec` (Clicks Over Expected Clicks) is supported.
 `type` | String | Set to `UBI_JUDGMENT`.
 `maxRank` | Integer | The maximum rank to consider when including events in the judgment calculation.
+`startDate` | Date | The optional starting date from which behavioral data events are considered for implicit judgment generation. The format is `yyyy-MM-dd`.
+`endDate` | Date | The optional end date until which behavioral data events are considered for implicit judgment generation. The format is `yyyy-MM-dd`.
 
 ## Managing judgment lists
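To make the new `startDate` and `endDate` parameters concrete, a minimal implicit judgment request using them might look like the following sketch; the name, dates, and `maxRank` value are placeholders rather than values from this diff.

```json
PUT _plugins/_search_relevance/judgments
{
  "name": "Implicit judgments from UBI events",
  "type": "UBI_JUDGMENT",
  "clickModel": "coec",
  "maxRank": 20,
  "startDate": "2025-01-01",
  "endDate": "2025-03-31"
}
```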

_search-plugins/search-relevance/optimize-hybrid-search.md

Lines changed: 3 additions & 5 deletions
@@ -2,17 +2,13 @@
 layout: default
 title: Optimizing hybrid search
 nav_order: 60
-parent: Using Search Relevance Workbench
+parent: Search Relevance Workbench
 grand_parent: Search relevance
 has_children: false
-has_toc: false
 ---
 
 # Optimizing hybrid search
 
-This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/17735).
-{: .warning}
-
 A key challenge of using hybrid search in OpenSearch is combining results from lexical and vector-based search effectively. OpenSearch provides different techniques and various parameters you can experiment with to find the best setup for your application. What works best, however, depends heavily on your data, user behavior, and application domain—there is no one-size-fits-all solution.
 
 Search Relevance Workbench helps you systematically find the ideal set of parameters for your needs.
@@ -114,3 +110,5 @@ POST _plugins/_sql
 }
 ```
 {% include copy-curl.html %}
+
+To review these results visually, see [Exploring search evaluation results]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/explore-experiment-results/).

_search-plugins/search-relevance/query-sets.md

Lines changed: 52 additions & 8 deletions
@@ -2,17 +2,13 @@
 layout: default
 title: Query sets
 nav_order: 3
-parent: Using Search Relevance Workbench
+parent: Search Relevance Workbench
 grand_parent: Search relevance
 has_children: false
-has_toc: false
 ---
 
 # Query sets
 
-This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/17735).
-{: .warning}
-
 A query set is a collection of queries. These queries are used in experiments for search relevance evaluation. Search Relevance Workbench offers different sampling techniques for creating query sets from real user data that adheres to the [User Behavior Insights (UBI)]({{site.url}}{{site.baseurl}}/search-plugins/ubi/schemas/) specification.
 Additionally, Search Relevance Workbench allows you to import a query set.
 
@@ -37,10 +33,10 @@ The following table lists the available input parameters.
 
 Field | Data type | Description
 :--- | :--- | :---
-`name` | String | The name of the query set.
-`description` | String | A short description of the query set.
+`name` | String | The name of the query set. The maximum length is 50 characters.
+`description` | String | A short description of the query set. The maximum length is 250 characters.
 `sampling` | String | Defines which sampler to use. Valid values are `pptss` (Probability-Proportional-to-Size-Sampling), `random`, `topn` (most frequent queries), and `manual`.
-`querySetSize` | Integer | The target number of queries in the query set. Depending on the number of unique queries in `ubi_queries`, the resulting query set may contain fewer queries.
+`querySetSize` | Integer | The target number of queries in the query set. Depending on the number of unique queries in `ubi_queries`, the resulting query set may contain fewer queries. Must be a positive integer.
 
 ### Example request: Sampling 20 queries with the Top N sampler
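The hunk ends just before that example request. Based on the parameter table above, it presumably resembles the following sketch; the endpoint and the field values shown here are assumptions rather than content from this diff.

```json
PUT _plugins/_search_relevance/query_sets
{
  "name": "Top 20 queries",
  "description": "Most frequent queries sampled from UBI data",
  "sampling": "topn",
  "querySetSize": 20
}
```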

@@ -73,6 +69,54 @@ PUT _plugins/_search_relevance/query_sets
 }
 ```
 
+## Query set formats
+
+Search Relevance Workbench supports two formats for query sets, each designed for different use cases. Both formats are a collection of user queries, but they differ in whether they include an expected answer.
+
+* **Basic query set**: A list of user queries without any additional information. This is useful for general relevance testing where no specific answer is expected.
+
+* **Query set with reference answers**: A list of user queries, in which each query is paired with its expected answer. This format is particularly useful for evaluating applications designed to provide a specific answer, such as question-answering systems.
+
+### Fields
+
+All query sets comprise one or more entries. Each entry is a JSON object containing the following fields.
+
+| Field | Data type | Description |
+| :--- | :--- | :--- |
+| `queryText` | String | The user query string. Required. |
+| `referenceAnswer` | String | The expected or correct answer to the user query. This field is used for generating judgments, especially with large language models (LLMs). Optional. |
+
+### Basic query set example
+
+A basic query set contains only the `queryText` field for each entry. It is suitable for general relevance tests where no single "correct" answer exists.
+
+#### Example query set without reference answers
+
+```json
+{"queryText": "t towels kitchen"}
+{"queryText": "table top bandsaw for metal"}
+{"queryText": "tan strappy heels for women"}
+{"queryText": "tank top plus size women"}
+{"queryText": "tape and mudding tools"}
+```
+
+### Query set with reference answers example
+
+This format includes the `referenceAnswer` field alongside the `queryText`. It is ideal for evaluating applications designed to provide specific answers, such as chatbots or question-answering systems.
+
+#### Example query set with reference answers
+
+```json
+{"queryText": "What is the capital of France?", "referenceAnswer": "Paris"}
+{"queryText": "Who wrote 'Romeo and Juliet'?", "referenceAnswer": "William Shakespeare"}
+{"queryText": "What is the chemical symbol for water?", "referenceAnswer": "H2O"}
+{"queryText": "What is the highest mountain in the world?", "referenceAnswer": "Mount Everest"}
+{"queryText": "When was the first iPhone released?", "referenceAnswer": "June 29, 2007"}
+```
+
+
+The `referenceAnswer` field is particularly useful when using [LLMs to generate judgments]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/judgments/). The LLM can use the reference answer as a ground truth to compare against the retrieved search results, allowing it to accurately score the relevance of the response.
+
 ## Managing query sets
 
 You can retrieve or delete query sets using the following APIs.
