[RFC] Enhanced Access to Term-Level Statistics in OpenSearch #8702

### **Is your feature request related to a problem? Please describe.**

OpenSearch, a fork of Elasticsearch, currently offers only limited access, through its scripting functionality, to the term-level statistics that Lucene maintains. A similarity model, including a scripted similarity, must be configured at the index level when the index is created: the index settings and mappings specify the similarity for a particular field or for the whole index. At search time, OpenSearch then scores documents with that predefined similarity model.

This design was chosen for performance. The similarity model is applied at index time to precompute values that are needed at search time, such as length norms. Because it also influences how the inverted index is stored and queried, changing the similarity settings on a per-query basis is not practical.
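To illustrate the index-time configuration described above, here is a minimal sketch that builds an index-creation request body with a scripted similarity attached to one field. The similarity name `scripted_tfidf`, the field name `title`, and the helper `build_index_body` are hypothetical; the script follows the TF-IDF style example commonly shown for scripted similarity and should be checked against the OpenSearch version in use.

```python
import json

def build_index_body(field_name, similarity_name="scripted_tfidf"):
    """Build a create-index body that fixes the similarity at index creation.

    Both names are illustrative assumptions, not values from this RFC.
    """
    # Painless source evaluated by the scripted similarity for each matching term.
    script_source = (
        "double tf = Math.sqrt(doc.freq); "
        "double idf = Math.log((field.docCount + 1.0) / (term.docFreq + 1.0)) + 1.0; "
        "double norm = 1 / Math.sqrt(doc.length); "
        "return query.boost * tf * idf * norm;"
    )
    return {
        "settings": {
            "index": {
                "similarity": {
                    similarity_name: {
                        "type": "scripted",
                        "script": {"source": script_source},
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                # The similarity is bound to the field here, not per query.
                field_name: {"type": "text", "similarity": similarity_name}
            }
        },
    }

body = build_index_body("title")
print(json.dumps(body, indent=2))
```

Note that nothing in this body can be supplied by a search request: to change the formula, the index has to be created again with different settings.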

### **Describe the solution you'd like**

We suggest broadening direct access to detailed statistics such as term frequency (termfreq), term frequency-inverse document frequency (tf-idf), total term frequency (totaltermfreq), sum of total term frequencies (sumtotaltermfreq), and payload information. Exposing these values would make it possible to build more refined information retrieval and ranking algorithms.
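For readers less familiar with these statistics, the following sketch computes them over a toy in-memory corpus. It is not how Lucene or OpenSearch implement them; the function names are hypothetical, and the tf-idf formula is an assumption modeled on the classic `sqrt(tf) * (log((N + 1) / (df + 1)) + 1)` form.

```python
import math
from collections import Counter

# Toy corpus: each document is the token list of a single field.
docs = [
    ["open", "search", "open"],
    ["search", "engine"],
    ["open", "source", "engine", "engine"],
]

def term_freq(doc, term):
    """termfreq: occurrences of the term in one document."""
    return Counter(doc)[term]

def doc_freq(docs, term):
    """Number of documents that contain the term at least once."""
    return sum(1 for d in docs if term in d)

def total_term_freq(docs, term):
    """totaltermfreq: occurrences of the term across all documents."""
    return sum(term_freq(d, term) for d in docs)

def sum_total_term_freq(docs):
    """sumtotaltermfreq: total number of tokens in the field, over all terms."""
    return sum(len(d) for d in docs)

def tf_idf(docs, doc, term):
    """tf-idf with an assumed classic formula (illustrative only)."""
    tf = math.sqrt(term_freq(doc, term))
    idf = math.log((len(docs) + 1) / (doc_freq(docs, term) + 1)) + 1.0
    return tf * idf

print(term_freq(docs[0], "open"))      # 2
print(total_term_freq(docs, "open"))   # 3
print(sum_total_term_freq(docs))       # 9
print(tf_idf(docs, docs[0], "open"))
```

The proposal is about reading values like these from a script at query time, rather than recomputing them outside the engine.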

We propose extending OpenSearch's scripting functionality to expose more of Lucene's ValueSource statistics. This would involve extending existing scripting classes and creating new ones where necessary, relying on Lucene's existing ValueSource and Similarity classes for the underlying statistics. The new functionality must be carefully integrated and thoroughly tested for reliability and performance. It would give script authors new tools for customizing information retrieval and ranking in OpenSearch.

### **Describe alternatives you've considered**

1. Implementing this functionality outside OpenSearch: This would involve pulling data out of OpenSearch, calculating the statistics externally, and then pushing the data back into OpenSearch. However, this approach is likely to be inefficient and would not benefit from the optimizations available within OpenSearch and Lucene.
2. Relying solely on OpenSearch's existing scripting functionality: While OpenSearch's scripting does provide some access to term-level statistics, it is not flexible enough for tuning and customizing during the fetch phase.
    1. Term vector: As described in https://github.com/opensearch-project/OpenSearch/issues/7558#issuecomment-1623180520, this is not a one-pass approach, because the document IDs have to be obtained first.
    2. Rank feature: The rank feature query scores by adding a weighted value to the original score, for example: `<BM25> + boost * <value>`.
    3. Scripted similarity: As described in https://github.com/opensearch-project/OpenSearch/issues/7558#issuecomment-1635164575, scripted similarity does not allow parameters to be passed into the similarity score on a per-query basis. While the multiplier and default_value can be injected through a function_score query, the target term must be in the query context and cannot be supplied through params.
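As a rough illustration of the rank feature limitation, the sketch below applies the additive form quoted above, `<BM25> + boost * <value>`. The function name and the numbers are hypothetical; the point is only that the added value is a fixed per-document number, independent of which term the query targets.

```python
def additive_score(bm25, feature_value, boost=1.0):
    """Combine a base score with a stored per-document value, additively."""
    return bm25 + boost * feature_value

# The same stored value is added whatever term produced the BM25 score,
# so the adjustment cannot depend on term-level statistics.
print(additive_score(2.0, 3.0, boost=0.5))
```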

### **Additional context**

Related issue: https://github.com/opensearch-project/OpenSearch/issues/7558

The proposed enhancement to OpenSearch's scripting functionality will provide a wider range of statistics for use in complex information retrieval and ranking algorithms. This opens up new possibilities for improving the accuracy and relevance of search results, tailoring the retrieval process to specific use cases, and optimizing performance. These statistics can be particularly useful in domains such as information retrieval research, e-commerce, document classification, and others where fine-grained control over the ranking algorithm is desirable.

