Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CachingLeafSlicesSupplier to compute the LeafSlices for concurrent segment search #12374

Merged
merged 1 commit into from
Jun 30, 2023

Conversation

sohami
Copy link
Contributor

@sohami sohami commented Jun 14, 2023

Description

Add a constructor which takes in the computed slices from extensions and uses that for running the search concurrently on provided executor. This is based on the discussion on the issue #12347

@sohami
Copy link
Contributor Author

sohami commented Jun 15, 2023

@javanna Can you help to look into this PR ? Would be great if we can merge it in before release branch cut off.

@javanna
Copy link
Contributor

javanna commented Jun 15, 2023

hi @sohami thanks for opening this PR! I have been considering a different approach, in the attempt of not adding additional constructors to IndexSearcher, as it already has quite some.

I am thinking that pre-computing the slices in the constructor may not be so important. How about removing that and computing the slices each time they are retrieved? There's only three methods where that is used between searchAfter and search. If needed, subclasses can still pre-compute and cache the slices if that's considered important. This way, subclasses are free to override how slices are computed, and the change is a little less invasive.

@sohami
Copy link
Contributor Author

sohami commented Jun 15, 2023

hi @sohami thanks for opening this PR! I have been considering a different approach, in the attempt of not adding additional constructors to IndexSearcher, as it already has quite some.

I am thinking that pre-computing the slices in the constructor may not be so important. How about removing that and computing the slices each time they are retrieved? There's only three methods where that is used between searchAfter and search. If needed, subclasses can still pre-compute and cache the slices if that's considered important. This way, subclasses are free to override how slices are computed, and the change is a little less invasive.

@javanna Thanks for the suggestion. I would think the base implementation will want to avoid computing it on every usage specially for the immutable indices where IndexSearcher is suggested to be reused. If it is not cached then for each search it will compute the slices which seems redundant and also deviating from current baseline. If we want to cache the computed slices in base class but remove it from getting initialized from constructor, then essentially we are making leafSlices variable non-final and relying on users of the leafSlice variable across the methods to use getSlices() atleast once which doesn't seem clean as well.

I think if the concern is to avoid adding yet another constructor then other way could be to make the leafSlice variable protected and non-final and initialize it in constructor of base class. The subclasses can then choose to override it as needed with the understanding that it should not be changed in between the execution (not clean but probably a middle ground and similar to how similarity implementation is provided to IndexSearcher)

@sohami
Copy link
Contributor Author

sohami commented Jun 16, 2023

@javanna I am also thinking in regards to too many constructor, can we see if deprecating the ones with IndexReader as inputs and keeping the ones with IndexReaderContext over the time will make sense. For the new one as well we can avoid adding one with IndexReader if we go down that path vs non final variable.

@jpountz
Copy link
Contributor

jpountz commented Jun 19, 2023

I'm not entirely happy with this new constructor either, because now there are two ways to customize slicing: either with this new constructor or by overriding the slices() method. I'd like to have a single one, and the approach to override a method to compute slices given a list of segments looks more user-friendly to me than having to compute slices up-front?

For reference, this 20-years-old comment on IndexSearcher about making sure to reuse IndexSearcher instances is outdated, all IndexSearcher constructors are cheap compared to the work that running any query would need to do. It's actually required to create a new IndexSearcher for every search if you wish to leverage IndexSearcher's timeout support.

If there is reluctance to computing slices on every call to #search, could we compute slices lazily and cache them in getSlices()? This way, it wouldn't be needed to increase the visibility of slices to protected?

@andrross
Copy link
Contributor

+1 to not having two ways to customize slicing.

the approach to override a method to compute slices given a list of segments looks more user-friendly to me than having to compute slices up-front?

@jpountz I agree. I think the basic problem here is that the protected method is called from the constructor of the parent class, which really limits the flexibility and can lead to unexpected behavior. The simplest approach is to just compute the slices on every call to #search. If we want to be more defensive about possible performance regressions, then lazy computation + caching like you said is the way to go (i.e. memoization). Is there any utility in Lucene like this memoizing supplier in Guava? (I couldn't find one but I'm not super familiar with the code base). Hand rolling the memoization logic inside IndexSearch is an option too, we'll just need to take some care to ensure that it is thread safe.

@sohami
Copy link
Contributor Author

sohami commented Jun 20, 2023

+1 to not having two ways to customize slicing.

the approach to override a method to compute slices given a list of segments looks more user-friendly to me than having to compute slices up-front?

Agreed to this however the slices method doesn't provide good way to overwrite the computation for all the cases as it is called from base class constructor.

For reference, this 20-years-old comment on IndexSearcher about making sure to reuse IndexSearcher instances is outdated, all IndexSearcher constructors are cheap compared to the work that running any query would need to do. It's actually required to create a new IndexSearcher for every search if you wish to leverage IndexSearcher's timeout support.

Thanks for sharing the context @jpountz

If there is reluctance to computing slices on every call to #search, could we compute slices lazily and cache them in getSlices()? This way, it wouldn't be needed to increase the visibility of slices to protected?

With getSlices() based approach and using member variable leafSlices to cache the computed value, my only concern is that it will be hard to enforce any usage to fetch the slices within IndexSearcher to use getSlices always (and not the cached member directly). With that in mind I guess its cleaner to compute slices always and remove the member leafSlices. Thoughts ?

Also wondering what is the disadvantage in making leafSlices, protected and non final. This way we will not add any new constructor and keep current way of overriding the slices but also allow extensions to update it if needed ? I may be missing something here.

@jpountz
Copy link
Contributor

jpountz commented Jun 20, 2023

Is there any utility in Lucene like this memoizing supplier in Guava? (I couldn't find one but I'm not super familiar with the code base). Hand rolling the memoization logic inside IndexSearch is an option too, we'll just need to take some care to ensure that it is thread safe.

No, Lucene doesn't have such a utility. +1 to hand rolling the memoization logic. I don't think it needs to be fancy in terms of concurrency, IndexSearcher is expected to spend much more time scoring hits than computing slices, so making it synchronized should be good enough?

With that in mind I guess its cleaner to compute slices always and remove the member leafSlices. Thoughts ?

I am happy either way. Caching feels like it has less potential for being perceived as a regression by some users. One approach could be to do the caching approach now which is a bit safer, and then move to the approach that you're suggesting on the main branch only so that it goes in the next major where changes in runtime behavior would be less surprising. I don't have strong feelings about this.

Also wondering what is the disadvantage in making leafSlices, protected and non final.

I don't like this approach because it exposes internals of IndexSearcher than I'd like users to not touch or have to know about. For instance if we made it protected and non-final, and later made a change to compute slices lazily instead of eagerly in the constructor, this could be perceived as a breaking change as there would be cases when leafSlices would now be null when it wasn't before.

@sohami sohami changed the title Provide constructor to accept the LeafSlice computed by extensions Add CachingLeafSlicesSupplier to compute the LeafSlices for concurrent segment search Jun 21, 2023
@sohami
Copy link
Contributor Author

sohami commented Jun 21, 2023

@jpountz @andrross As discussed, I have updated the PR with a thread safe caching LeafSlices supplier implementation. The supplier uses the slices method to compute the leaf slices

public LeafSlice[] get() {
if (executor == null) {
return null;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would make more sense to have to have the executor == null check in IndexSearcher#getSlices than here so that this class focuses solely on caching logic?

Copy link
Contributor Author

@sohami sohami Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it this way because IndexSearcher can access leafSlicesSupplier without using getSlices. So everywhere it will need to perform explicit check for executor being non-null too for such usages. With current mechanism non null check for executor will not be needed for any future usages as well. This is also keeping the current behavior to return null slices when executor is null

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an alternative, could you move this check into the provider lambda? It still feels wrong to me that CachingLeafSupplier has to check the executor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpountz I have moved it to getSlices method and using that everywhere as you suggested earlier. If executor is null then initializing the supplier as null in that case. Please let me know if this looks good.

@sohami
Copy link
Contributor Author

sohami commented Jun 26, 2023

@jpountz Gentle reminder for the review

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Can you add a CHANGES entry under 9.8?

…LeafSlices used with concurrent segment

search. It uses the protected method `slices` by default to compute the slices which can be
overriden by the sub classes of IndexSearcher
@sohami
Copy link
Contributor Author

sohami commented Jun 29, 2023

LGTM. Can you add a CHANGES entry under 9.8?

Added. I have also rebased the changes. Thanks all for your time and guidance.

@jpountz jpountz merged commit 223eecc into apache:main Jun 30, 2023
@jpountz jpountz added this to the 9.8.0 milestone Jun 30, 2023
jpountz pushed a commit to jpountz/lucene that referenced this pull request Jun 30, 2023
…LeafSlices used with concurrent segment (apache#12374)

search. It uses the protected method `slices` by default to compute the slices which can be
overriden by the sub classes of IndexSearcher
Bukhtawar added a commit to opensearch-project/OpenSearch that referenced this pull request Jul 23, 2023
I have nominated and maintainers have agreed to invite Sorabh Hamirwasia(@sohami) to be a co-maintainer. Sorabh has kindly accepted.

Sorabh has led the design and implementation of multiple features like query cancellation, concurrent segment search for aggregations and snapshot inter-operability w/remote storage in OpenSearch. Some significant issues and PRs authored by Sorabh for this effort are as follows:

Feature proposals
Concurrent segment search for aggregations : #6798
Searchable Remote Index : #2900

Implementations
Concurrent segment search for aggregations: #7514
Lucene changes to leaf slices for concurrent search: apache/lucene#12374
Moving concurrent search to core : #7203
Query cancellation support : #986
In total, Sorabh has authored 14 PRs going back to Aug 2021. He also frequently reviews contributions from others and has reviewed nearly 20 PRs in the same time frame. I believe Sorabh will be a valuable addition as a maintainer of OpenSearch and will continue to contribute to the success of the project going forward.

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
opensearch-trigger-bot bot pushed a commit to opensearch-project/OpenSearch that referenced this pull request Jul 23, 2023
I have nominated and maintainers have agreed to invite Sorabh Hamirwasia(@sohami) to be a co-maintainer. Sorabh has kindly accepted.

Sorabh has led the design and implementation of multiple features like query cancellation, concurrent segment search for aggregations and snapshot inter-operability w/remote storage in OpenSearch. Some significant issues and PRs authored by Sorabh for this effort are as follows:

Feature proposals
Concurrent segment search for aggregations : #6798
Searchable Remote Index : #2900

Implementations
Concurrent segment search for aggregations: #7514
Lucene changes to leaf slices for concurrent search: apache/lucene#12374
Moving concurrent search to core : #7203
Query cancellation support : #986
In total, Sorabh has authored 14 PRs going back to Aug 2021. He also frequently reviews contributions from others and has reviewed nearly 20 PRs in the same time frame. I believe Sorabh will be a valuable addition as a maintainer of OpenSearch and will continue to contribute to the success of the project going forward.

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
(cherry picked from commit 7769682)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Bukhtawar pushed a commit to opensearch-project/OpenSearch that referenced this pull request Jul 24, 2023
I have nominated and maintainers have agreed to invite Sorabh Hamirwasia(@sohami) to be a co-maintainer. Sorabh has kindly accepted.

Sorabh has led the design and implementation of multiple features like query cancellation, concurrent segment search for aggregations and snapshot inter-operability w/remote storage in OpenSearch. Some significant issues and PRs authored by Sorabh for this effort are as follows:

Feature proposals
Concurrent segment search for aggregations : #6798
Searchable Remote Index : #2900

Implementations
Concurrent segment search for aggregations: #7514
Lucene changes to leaf slices for concurrent search: apache/lucene#12374
Moving concurrent search to core : #7203
Query cancellation support : #986
In total, Sorabh has authored 14 PRs going back to Aug 2021. He also frequently reviews contributions from others and has reviewed nearly 20 PRs in the same time frame. I believe Sorabh will be a valuable addition as a maintainer of OpenSearch and will continue to contribute to the success of the project going forward.


(cherry picked from commit 7769682)

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
baba-devv pushed a commit to baba-devv/OpenSearch that referenced this pull request Jul 29, 2023
I have nominated and maintainers have agreed to invite Sorabh Hamirwasia(@sohami) to be a co-maintainer. Sorabh has kindly accepted.

Sorabh has led the design and implementation of multiple features like query cancellation, concurrent segment search for aggregations and snapshot inter-operability w/remote storage in OpenSearch. Some significant issues and PRs authored by Sorabh for this effort are as follows:

Feature proposals
Concurrent segment search for aggregations : opensearch-project#6798
Searchable Remote Index : opensearch-project#2900

Implementations
Concurrent segment search for aggregations: opensearch-project#7514
Lucene changes to leaf slices for concurrent search: apache/lucene#12374
Moving concurrent search to core : opensearch-project#7203
Query cancellation support : opensearch-project#986
In total, Sorabh has authored 14 PRs going back to Aug 2021. He also frequently reviews contributions from others and has reviewed nearly 20 PRs in the same time frame. I believe Sorabh will be a valuable addition as a maintainer of OpenSearch and will continue to contribute to the success of the project going forward.

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
kaushalmahi12 pushed a commit to kaushalmahi12/OpenSearch that referenced this pull request Sep 12, 2023
I have nominated and maintainers have agreed to invite Sorabh Hamirwasia(@sohami) to be a co-maintainer. Sorabh has kindly accepted.

Sorabh has led the design and implementation of multiple features like query cancellation, concurrent segment search for aggregations and snapshot inter-operability w/remote storage in OpenSearch. Some significant issues and PRs authored by Sorabh for this effort are as follows:

Feature proposals
Concurrent segment search for aggregations : opensearch-project#6798
Searchable Remote Index : opensearch-project#2900

Implementations
Concurrent segment search for aggregations: opensearch-project#7514
Lucene changes to leaf slices for concurrent search: apache/lucene#12374
Moving concurrent search to core : opensearch-project#7203
Query cancellation support : opensearch-project#986
In total, Sorabh has authored 14 PRs going back to Aug 2021. He also frequently reviews contributions from others and has reviewed nearly 20 PRs in the same time frame. I believe Sorabh will be a valuable addition as a maintainer of OpenSearch and will continue to contribute to the success of the project going forward.

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
Signed-off-by: Kaushal Kumar <ravi.kaushal97@gmail.com>
brusic pushed a commit to brusic/OpenSearch that referenced this pull request Sep 25, 2023
I have nominated and maintainers have agreed to invite Sorabh Hamirwasia(@sohami) to be a co-maintainer. Sorabh has kindly accepted.

Sorabh has led the design and implementation of multiple features like query cancellation, concurrent segment search for aggregations and snapshot inter-operability w/remote storage in OpenSearch. Some significant issues and PRs authored by Sorabh for this effort are as follows:

Feature proposals
Concurrent segment search for aggregations : opensearch-project#6798
Searchable Remote Index : opensearch-project#2900

Implementations
Concurrent segment search for aggregations: opensearch-project#7514
Lucene changes to leaf slices for concurrent search: apache/lucene#12374
Moving concurrent search to core : opensearch-project#7203
Query cancellation support : opensearch-project#986
In total, Sorabh has authored 14 PRs going back to Aug 2021. He also frequently reviews contributions from others and has reviewed nearly 20 PRs in the same time frame. I believe Sorabh will be a valuable addition as a maintainer of OpenSearch and will continue to contribute to the success of the project going forward.

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
Signed-off-by: Ivan Brusic <ivan.brusic@flocksafety.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
I have nominated and maintainers have agreed to invite Sorabh Hamirwasia(@sohami) to be a co-maintainer. Sorabh has kindly accepted.

Sorabh has led the design and implementation of multiple features like query cancellation, concurrent segment search for aggregations and snapshot inter-operability w/remote storage in OpenSearch. Some significant issues and PRs authored by Sorabh for this effort are as follows:

Feature proposals
Concurrent segment search for aggregations : opensearch-project#6798
Searchable Remote Index : opensearch-project#2900

Implementations
Concurrent segment search for aggregations: opensearch-project#7514
Lucene changes to leaf slices for concurrent search: apache/lucene#12374
Moving concurrent search to core : opensearch-project#7203
Query cancellation support : opensearch-project#986
In total, Sorabh has authored 14 PRs going back to Aug 2021. He also frequently reviews contributions from others and has reviewed nearly 20 PRs in the same time frame. I believe Sorabh will be a valuable addition as a maintainer of OpenSearch and will continue to contribute to the success of the project going forward.

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants