Skip to content

Semantic_text match_all with Highlighter #128702

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

Samiul-TheSoccerFan
Copy link
Contributor

@Samiul-TheSoccerFan Samiul-TheSoccerFan commented May 30, 2025

This PR addresses the missing highlighting for semantic_text fields when a match_all query is used. We guide users to use semantic highlighting to view their chunks, but highlighting does not currently return any results with match_all. Highlighting is usually unnecessary for non-inference fields with match_all, but it's essential for semantic_text fields to return the generated chunks. This change ensures those highlighting results are returned as expected.

Test cases:

Only text fields

PUT my-index/
{
  "mappings": {
    "properties": {
      "semantic1": {
        "type": "text"
      },
      "semantic2": {
        "type": "text"
      }
    }
  }
}

POST my-index/_doc/1
{
  "semantic1": "Puggles are pugs and beagles",
  "semantic2": "Chiweenies are chihuahuas and dachshunds"
}


GET my-index/_search
{
  "query": {
    "match_all": {}
  }
}

only inference fields

PUT my-semantic-index/
{
  "mappings": {
    "properties": {
      "semantic1": {
        "type": "semantic_text"
      },
      "semantic2": {
        "type": "semantic_text"
      }
    }
  }
}

POST my-semantic-index/_doc/1
{
  "semantic1": "Puggles are pugs and beagles",
  "semantic2": "Chiweenies are chihuahuas and dachshunds"
}

GET my-semantic-index/_search
{
  "query": {
    "match_all": {}
  },
  "highlight": {
    "fields": {
      "semantic1": {
        "type": "semantic",
        "number_of_fragments": 1
      },
      "semantic2": {
        "type": "semantic",
        "number_of_fragments": 1
      }
    }
  }
}

Both infer and non-inference fields

PUT test-index
{
  "mappings": {
    "properties": {
      "test_field": {
        "type": "semantic_text"
      },
      "non_infer_field": {
        "type": "text"
      }
    }
  }
}

PUT test-index/_doc/doc1
{
  "test_field": "these are not the droids you're looking for. He's free to go around",
  "non_infer_field": "this is a non-inference field"
}

GET test-index/_search
{
  "query": {
    "match_all": {}
  },
  "highlight": {
    "fields": {
      "test_field": {
        "type": "semantic",
        "number_of_fragments": 1
      }
    }
  }
}

Both dense vector fields

PUT _inference/text_embedding/my-e5-model
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1,
    "model_id": ".multilingual-e5-small"
  }
}

PUT my-dense-semantic-index/
{
  "mappings": {
    "properties": {
      "semantic1": {
        "type": "semantic_text",
        "inference_id": "my-e5-model"
      },
      "semantic2": {
        "type": "semantic_text",
        "inference_id": "my-e5-model"
      }
    }
  }
}

POST my-dense-semantic-index/_doc/1
{
  "semantic1": ["Puggles are pugs and beagles", "    ", "now with chunks"],
  "semantic2": ["Chiweenies are chihuahuas and dachshunds"  , "    ", "now with chunks"]
}

GET my-dense-semantic-index/_search
{
  "query": {
    "match_all": {}
  },
  "highlight": {
    "fields": {
      "semantic1": {
        "type": "semantic",
        "number_of_fragments": 3
      },
      "semantic2": {
        "type": "semantic",
        "number_of_fragments": 3
      }
    }
  }
}

Some additional cases

GET my-*/_search
{
  "query": {
    "match_all": {}
  },
  "highlight": {
    "fields": {
      "semantic1": {
        "number_of_fragments": 1
      },
      "semantic2": {
        "number_of_fragments": 1
      }
    }
  }
}

GET my-*/_search
{
  "query": {
    "match_all": {}
  },
  "highlight": {
    "fields": {
      "semantic1": {
        "type": "semantic",
        "number_of_fragments": 1
      },
      "semantic2": {
        "type": "semantic",
        "number_of_fragments": 1
      }
    }
  }
}

// throws error as fields are semantic_text field
GET my-*/_search
{
  "query": {
    "match_all": {}
  },
  "highlight": {
    "fields": {
      "semantic1": {
        "type": "score",
        "number_of_fragments": 1
      },
      "semantic2": {
        "type": "score",
        "number_of_fragments": 1
      }
    }
  }
}

Copy link
Member

@kderusso kderusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice start, I've added some high level comments on overall organization. We can do a deeper review when it's closer to ready.

Copy link
Contributor

@Mikep86 Mikep86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a much simpler way to fix this by updating the semantic highlighter. Take a look at how you can change QueryVisitor in extractDenseVectorQueries and extractSparseVectorQueries:  https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/highlight/SemanticTextHighlighter.java#L250

@Samiul-TheSoccerFan
Copy link
Contributor Author

@elasticmachine update branch

@Samiul-TheSoccerFan Samiul-TheSoccerFan marked this pull request as ready for review June 3, 2025 13:39
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Jun 3, 2025
@Samiul-TheSoccerFan Samiul-TheSoccerFan added auto-backport Automatically create backport pull requests when merged :SearchOrg/Relevance Label for the Search (solution/org) Relevance team v8.19.0 and removed needs:triage Requires assignment of a team area label labels Jun 3, 2025
@elasticsearchmachine elasticsearchmachine added Team:SearchOrg Meta label for the Search Org (Enterprise Search) Team:Search - Relevance The Search organization Search Relevance team labels Jun 3, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/search-eng (Team:SearchOrg)

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/search-relevance (Team:Search - Relevance)

@elasticsearchmachine
Copy link
Collaborator

Hi @Samiul-TheSoccerFan, I've created a changelog YAML for you.

Copy link
Member

@kderusso kderusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for these changes, looks great @Samiul-TheSoccerFan !

@@ -0,0 +1,5 @@
pr: 128702
summary: Highlighting support added to semantic_text with `match_all`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: Since this is a bug, perhaps we call the changelog summary something like `Fix issue where highlighting was not returned with match_all queries"

@@ -267,6 +268,8 @@ public void visitLeaf(Query query) {
queries.add(fieldType.createExactKnnQuery(VectorData.fromFloats(knnQuery.getTargetCopy()), null));
} else if (query instanceof KnnByteVectorQuery knnQuery) {
queries.add(fieldType.createExactKnnQuery(VectorData.fromBytes(knnQuery.getTargetCopy()), null));
} else if (query instanceof MatchAllDocsQuery) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😍 Nice optimization from the first solution!

@Samiul-TheSoccerFan
Copy link
Contributor Author

@elasticmachine update branch

Copy link
Contributor

@Mikep86 Mikep86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach looks good! Caught a copy/paste error in the tests that we should fix though.

Copy link
Contributor

@Mikep86 Mikep86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Good clean solution and tests :)

@Samiul-TheSoccerFan
Copy link
Contributor Author

@elasticmachine update branch

@Samiul-TheSoccerFan
Copy link
Contributor Author

@elasticmachine update branch

@Samiul-TheSoccerFan Samiul-TheSoccerFan merged commit d1b5532 into elastic:main Jun 4, 2025
18 checks passed
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.19 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 128702

@Samiul-TheSoccerFan
Copy link
Contributor Author

💚 All backports created successfully

Status Branch Result
8.19

Questions ?

Please refer to the Backport tool documentation

Samiul-TheSoccerFan added a commit to Samiul-TheSoccerFan/elasticsearch that referenced this pull request Jun 4, 2025
* initial implementation for match_All

* reformat

* [CI] Auto commit changes from spotless

* Excluding matchAllintercepter

* Adding matchAllDocs support for vector fields

* [CI] Auto commit changes from spotless

* Remove previous implementation

* Adding yaml tests for match_all

* fixed yaml tests

* Update docs/changelog/128702.yaml

* Update changelog

* changelog - update summary

* Fix wrong inference names for the yaml tests

---------

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
(cherry picked from commit d1b5532)

# Conflicts:
#	x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/InferenceFeatures.java
elasticsearchmachine pushed a commit that referenced this pull request Jun 4, 2025
* initial implementation for match_All

* reformat

* [CI] Auto commit changes from spotless

* Excluding matchAllintercepter

* Adding matchAllDocs support for vector fields

* [CI] Auto commit changes from spotless

* Remove previous implementation

* Adding yaml tests for match_all

* fixed yaml tests

* Update docs/changelog/128702.yaml

* Update changelog

* changelog - update summary

* Fix wrong inference names for the yaml tests

---------

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
(cherry picked from commit d1b5532)

# Conflicts:
#	x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/InferenceFeatures.java
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged backport pending >bug :SearchOrg/Relevance Label for the Search (solution/org) Relevance team Team:Search - Relevance The Search organization Search Relevance team Team:SearchOrg Meta label for the Search Org (Enterprise Search) v8.19.0 v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants