Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add main classes for Query and basic unit tests #172

Conversation

martin-gaievski
Copy link
Member

@martin-gaievski martin-gaievski commented May 17, 2023

Description

Part of Normalization and Score Combination feature.
Adding new query, query builder, scorer and weight classes for new hybrid query. Includes basic unit tests, integ tests are coming in next PRs.

Issues Resolved

#175

Check List

  • New functionality includes testing.
    • All tests pass
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@martin-gaievski martin-gaievski force-pushed the feature/normalization-query branch 3 times, most recently from 3b9b049 to 5f07090 Compare May 17, 2023 18:54
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
@martin-gaievski martin-gaievski force-pushed the feature/normalization-query branch from 5f07090 to b4f6717 Compare May 17, 2023 18:58
@martin-gaievski martin-gaievski changed the title [Normalization] Add main classes for Query along with basic unit tests [Normalization] Add main classes for Query and basic unit tests May 17, 2023
@codecov
Copy link

codecov bot commented May 17, 2023

Codecov Report

Merging #172 (975d80f) into feature/normalization (5feefd5) will decrease coverage by 8.00%.
The diff coverage is 69.36%.

@@                     Coverage Diff                     @@
##             feature/normalization     #172      +/-   ##
===========================================================
- Coverage                    89.55%   81.56%   -8.00%     
- Complexity                     103      161      +58     
===========================================================
  Files                            7       11       +4     
  Lines                          316      537     +221     
  Branches                        52       87      +35     
===========================================================
+ Hits                           283      438     +155     
- Misses                          16       68      +52     
- Partials                        17       31      +14     
Impacted Files Coverage Δ
...g/opensearch/neuralsearch/plugin/NeuralSearch.java 0.00% <0.00%> (ø)
...ensearch/neuralsearch/query/HybridQueryWeight.java 52.00% <52.00%> (ø)
...nsearch/neuralsearch/query/HybridQueryBuilder.java 65.04% <65.04%> (ø)
...org/opensearch/neuralsearch/query/HybridQuery.java 77.77% <77.77%> (ø)
...ensearch/neuralsearch/query/HybridQueryScorer.java 81.25% <81.25%> (ø)

... and 1 file with indirect coverage changes

@martin-gaievski martin-gaievski changed the base branch from feature/normalization to main May 17, 2023 19:05
@martin-gaievski martin-gaievski changed the base branch from main to feature/normalization May 17, 2023 19:06
@martin-gaievski martin-gaievski changed the base branch from feature/normalization to main May 17, 2023 19:14
@martin-gaievski martin-gaievski changed the base branch from main to feature/normalization May 17, 2023 19:14
@martin-gaievski martin-gaievski added Features Introduces a new unit of functionality that satisfies a requirement skip-changelog and removed Features Introduces a new unit of functionality that satisfies a requirement labels May 17, 2023
@martin-gaievski martin-gaievski marked this pull request as ready for review May 17, 2023 19:17
* Implementation fo Query interface for type "hybrid". It allows execution of multiple sub-queries and collect individual
* scores for each sub-query.
*/
public class HybridQuery extends Query implements Iterable<Query> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets add unit test for HybridQuery and hybrid scorer too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to add tests for Query, need more time to check is for Scorer it's possible

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added unit test for scorer

Comment on lines 31 to 34
/**
* Implementation fo Query interface for type "hybrid". It allows execution of multiple sub-queries and collect individual
* scores for each sub-query.
*/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets add @opensearch.internal tag on all these classes. These shouldn't be extended.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opensearch.internal cannot be used, seems it's specific to core OpenSearch repo, and for plugins javadoc tasks throws

error: unknown tag: opensearch.internal
 * @opensearch.internal
   ^

we can use final for classes for now

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is same here. opensearch.internal is for OpenSearch core to indicate that the class is not public to plugins.

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
@martin-gaievski martin-gaievski force-pushed the feature/normalization-query branch 4 times, most recently from d6b43bb to 247d6f4 Compare May 22, 2023 05:01
In standard scorer.score implementation return sum of all sub-scores as
one score for doc id.
Fixed unit tests

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
@martin-gaievski martin-gaievski force-pushed the feature/normalization-query branch from 247d6f4 to 6bd9f8e Compare May 22, 2023 05:25
private final List<Query> subQueries;

public HybridQuery(Collection<Query> subQueries) {
Objects.requireNonNull(subQueries, "Collection of Queries must not be null");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check empty list also here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, let me add this check. Such empty collect doesn't harm but also does not make sense to process such search request.

}

if (subQueries.size() == 1) {
return subQueries.iterator().next();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dont we want to call the rewrite if that query here? We are just returning the query iterator?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also why use iterator here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, it's an artifact from POC, we need to call rewrite and wrap it into HybridQuery even for 1 sub-query, I'll do the change

Comment on lines +113 to +116
public void visit(QueryVisitor queryVisitor) {
QueryVisitor v = queryVisitor.getSubVisitor(BooleanClause.Occur.SHOULD, this);
for (Query q : subQueries) {
q.visit(v);
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of this visitor can provide some details here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main purpose is to allow visitor pattern for Hybrid query clients in future. It's an abstract method in Query class, we need to provide some implementation. As for the visitor, I've found one example of visitor in core: In TopDocsCollectorContext we do have a visitor that is checking maxScore flag in each sub-query
We can throw UnsupportedOperation for now as I do not recall usage of visitor in other parts of normalization related code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked it, visitors are executed by IndexSearcher, so we need some sort of implementation. Leaving it as is for now.

Comment on lines 228 to 233
static void writeQueries(StreamOutput out, List<? extends QueryBuilder> queries) throws IOException {
out.writeVInt(queries.size());
for (QueryBuilder query : queries) {
out.writeNamedWriteable(query);
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not this one. Let me send you that.

}
}

static Collection<Query> toQueries(Collection<QueryBuilder> queryBuilders, QueryShardContext context) throws QueryShardException,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does this function is used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We call it from doToQuery in order to get Query objects from QueryBuilder objects.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if that is the case why this is package static private?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, not needed, changing to just "private"


private final HybridQuery queries;
// The Weights for our subqueries, in 1-1 correspondence
protected final ArrayList<Weight> weights;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why protected?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack, can be private

Comment on lines 37 to 39
float[] subScores;

Map<Query, Integer> queryToIndex;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why they are not private?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack, making private and final

@martin-gaievski martin-gaievski force-pushed the feature/normalization-query branch from 820d543 to 5eb6bc3 Compare May 22, 2023 20:37
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
@martin-gaievski martin-gaievski force-pushed the feature/normalization-query branch from 5eb6bc3 to cbd9552 Compare May 22, 2023 20:37
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
* @return
* @throws IOException
*/
public float[] hybridScores() throws IOException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how we are going to use it? Since there can be 2 query scores, the return of the float[] is for which query?

Copy link
Member Author

@martin-gaievski martin-gaievski May 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is required for later stages, we'll call it from collector. Something like this call in POC code https://github.com/navneet1v/neural-search/blob/normalization-poc/src/main/java/org/opensearch/neuralsearch/search/CompoundTopScoreDocCollector.java#L74.

Array returned here is for all sub-queries, say we do have term1, neural1, term2, then array will be of length 3, all order will correspond to other of sub-queries parsed by query builder, e.g. hybridScores[0] -> term1, hybridScores[1] -> neural1, hybridScores[2] -> term2. For example, we return

for doc id "1": hybridScores: [2.4, 0.676, 0]
for doc id "2": hybridScores: [0.0, 0.576, 1.4]

in this case we have some score for neural1, as k-NN return some score for practically any doc. doc 1 doesn't match term query term2, and doc 2 doesn't match term1, so each has score 0.0 at corresponding index.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hybridScores is our custom method, not required for Query phase. I put it here so we can test earlier if scorer is actually getting results for all sub-queries.

@martin-gaievski martin-gaievski requested a review from navneet1v May 30, 2023 22:45
Copy link
Collaborator

@navneet1v navneet1v left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets make sure to add more UTs in follow up PRs to cover all the lines,

Copy link
Collaborator

@heemin32 heemin32 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try stream as much as you can. Declarative style has less error prone.

Comment on lines 31 to 34
/**
* Implementation fo Query interface for type "hybrid". It allows execution of multiple sub-queries and collect individual
* scores for each sub-query.
*/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is same here. opensearch.internal is for OpenSearch core to indicate that the class is not public to plugins.

Copy link
Collaborator

@heemin32 heemin32 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try stream as much as you can. Declarative style has less error prone.

heemin32

This comment was marked as duplicate.

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
@martin-gaievski martin-gaievski force-pushed the feature/normalization-query branch from c8d945d to d170529 Compare May 31, 2023 22:08
@martin-gaievski martin-gaievski requested a review from heemin32 May 31, 2023 22:12
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
@martin-gaievski martin-gaievski force-pushed the feature/normalization-query branch from 8bae248 to 975d80f Compare June 1, 2023 20:28
@martin-gaievski martin-gaievski merged commit 99ec3b4 into opensearch-project:feature/normalization Jun 1, 2023
martin-gaievski added a commit that referenced this pull request Aug 3, 2023
* Add main classes for Query along with basic unit tests

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants