feat(search_family): Introduce merging of sorted search results from shards. FIRST PR #5381

Conversation

@BagritsevichStepan (Contributor) commented Jun 29, 2025

Fixes non-optimal sorting and limiting introduced in #4942

The problem:
In PR #4942, we moved sorting and limiting of search results to the coordinator thread to support sorting on non-sortable fields. However, this introduced performance issues: each shard returns all matching IDs without applying any local limit, the coordinator moves everything into a single array, and only then performs a partial or full sort.
Even before that PR, KNN search performed a full sort on the coordinator thread instead of merging sorted shard results.

How it works now:

  1. Sorting and limiting are now performed on the shards in all cases: SORTBY on sortable fields, SORTBY on non-sortable fields, KNN search, and KNN search + SORTBY.
  2. The coordinator thread now performs only merging of already sorted results; sorting has been completely removed from this stage.
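The two steps above can be sketched as a k-way merge with an early limit. This is an illustrative stand-in, not the PR's code: `ShardResult`, `MergeSorted`, and the `int` doc type are hypothetical names, assuming each shard already returns its results sorted ascending.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one shard's already-sorted result set.
struct ShardResult {
  std::vector<int> docs;  // sorted ascending by the shard
};

// Merge sorted shard results, keeping at most `limit` elements.
// The coordinator never sorts; it only picks the next smallest head.
std::vector<int> MergeSorted(const std::vector<ShardResult>& shards, size_t limit) {
  std::vector<size_t> pos(shards.size(), 0);  // cursor per shard
  std::vector<int> out;
  out.reserve(limit);  // only up to `limit` elements are ever allocated
  while (out.size() < limit) {
    const int* best = nullptr;
    size_t best_shard = 0;
    for (size_t i = 0; i < shards.size(); ++i) {
      if (pos[i] < shards[i].docs.size() &&
          (!best || shards[i].docs[pos[i]] < *best)) {
        best = &shards[i].docs[pos[i]];
        best_shard = i;
      }
    }
    if (!best) break;  // all shards exhausted
    out.push_back(*best);
    ++pos[best_shard];
  }
  return out;
}
```

Because each shard already limited locally, the coordinator touches at most `shards * limit` elements instead of every candidate ID.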

This change should improve performance (benchmark tests are still in progress), especially because limiting is now applied early on the shards.

The coordinator logic is already quite optimized: no unnecessary large allocations (e.g., of size offset + limit) are made; only up to limit elements are allocated and processed.

However, the shard-side logic still requires improvements, which will be fixed in the second PR:

  1. Remove unnecessary allocations of size offset + limit in sort indexes and in non-sortable field sorting.
    Instead, we should use variant<vector<DocId>, vector<pair<DocId, ResultScore>>> in SearchAlrgorithmResult.
  2. If the limit is much smaller than the size of the array, use k-min-heap selection instead of std::partial_sort.
    This will significantly reduce allocations and improve efficiency.
  3. Do the sorting in the search algorithm before calling the Take method.

There are also other possible improvements, but I haven't implemented them yet; I don't think it's worth optimizing this further after the second PR.
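The k-min-heap selection mentioned in item 2 is a standard technique; a minimal sketch (assumed shape, not the Dragonfly implementation) that keeps the k smallest scores using a bounded max-heap, so the full candidate array never needs to be materialized:

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <vector>

// Keep the k smallest values seen so far using a max-heap of size k.
// Unlike std::partial_sort, this never requires all inputs in one array.
std::vector<double> KSmallest(const std::vector<double>& input, size_t k) {
  std::priority_queue<double> heap;  // max-heap: top() is the largest kept
  for (double v : input) {
    if (heap.size() < k) {
      heap.push(v);
    } else if (v < heap.top()) {
      heap.pop();   // evict the current worst
      heap.push(v);
    }
  }
  std::vector<double> out;
  out.reserve(heap.size());
  while (!heap.empty()) {
    out.push_back(heap.top());
    heap.pop();
  }
  std::reverse(out.begin(), out.end());  // ascending order
  return out;
}
```

This does O(n log k) comparisons and O(k) extra memory, which is the win when the limit is much smaller than the candidate count.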

…shards

Signed-off-by: Stepan Bagritsevich <stefan@dragonflydb.io>
@BagritsevichStepan force-pushed the search/speed-up-sort-limit branch from f7a6ddf to ff598ab on June 29, 2025 20:54
// Sort order for sorting indices and results.
enum class SortOrder { ASC, DESC };

/* Optimized comparator for ascending or descending sort.
Collaborator
Can you demonstrate the performance improvement here?

Contributor

Especially after introducing dynamic dispatch with std::function 😆 It will be at least two times slower. I'm not sure the instantiated sort is fully optimized for the ascending parameter being immutable, but I doubt it has a serious impact.

@BagritsevichStepan (Contributor, Author) commented Jun 30, 2025

Yes, I'm aware that std::function introduces dynamic dispatch. It was initially added based on several comments and discussions from Borys in other PRs (regarding the SORTBY option in the FT.AGGREGATE command). I still added std::function here because we can always implement our own lightweight version; there are a lot of such implementations, and I remember implementing one myself some time ago.

For this PR we can remove it, and reintroduce the change once a proper implementation is added.

Contributor

They optimize allocations, not dispatch cost. Why add so much code for this at all, when

  • we can just use static dispatch with templates
  • this is likely premature optimization here
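The static-dispatch alternative the reviewer refers to can be sketched as follows (illustrative names, not the PR's code): the sort order becomes a template parameter, so the comparator inlines into the sort and no std::function indirection is paid.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

enum class SortOrder { ASC, DESC };

// The direction is baked in at compile time, so the comparator
// inlines into std::sort with no std::function call overhead.
template <SortOrder Order, typename T>
void SortByOrder(std::vector<T>* values) {
  if constexpr (Order == SortOrder::ASC) {
    std::sort(values->begin(), values->end(), std::less<T>{});
  } else {
    std::sort(values->begin(), values->end(), std::greater<T>{});
  }
}

// A single runtime branch at the call site selects the instantiation.
template <typename T>
void Sort(std::vector<T>* values, SortOrder order) {
  if (order == SortOrder::ASC)
    SortByOrder<SortOrder::ASC>(values);
  else
    SortByOrder<SortOrder::DESC>(values);
}
```

The cost of the dynamic order is one branch per call instead of one indirect call per comparison.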

@romange (Collaborator) left a comment

This PR mixes many changes, but in practice the change you suggest can be demonstrated with a much smaller PR. It does not have to work for all cases to demonstrate the value.

What this PR does not have is benchmarks, which in fact are necessary for such changes.


Comment on lines +420 to +423
// TODO: use min-heap if limit is much smaller than the number of results
std::partial_sort(knn_distances_.begin(), knn_distances_.begin() + knn.limit,
                  knn_distances_.end());
knn_distances_.resize(knn.limit);
Contributor

> If the limit is much smaller than the size of the array, use k-min-heap selection instead of std::partial_sort.
> This will significantly reduce allocations and improve efficiency.

Bold claim. I'm not at all sure it's more effective than partial sort.

@BagritsevichStepan (Contributor, Author) commented Jun 30, 2025

Why? Note that std::partial_sort also uses a heap-based approach for small k (as I understand from the documentation). But here you first need to do knn_distances_.reserve(sub_results.Size()); and push the values into the array. If the limit is small and the number of sub_results is large, the heap approach should be better. But I agree that it is not a priority for the next PR.

@dranikpg (Contributor) commented Jul 1, 2025

True, we can avoid allocating the knn distances vector (but it is likely cheap compared to computing vector distances).

Comment on lines +82 to +93
search_result->RearrangeAccordingToIndexes(ids_to_sort);

std::vector<SortableValue> out(size);
for (size_t i = 0; i < size; i++) {
  const auto doc_id = search_result->ids[i];
  auto& value = values_[doc_id];
  if constexpr (!std::is_arithmetic_v<T>) {
    out[i] = std::string{value};
  } else {
    out[i] = static_cast<double>(value);
  }
}
Contributor

  1. If there were a simple way to sort two arrays at once, you wouldn't need the indices array and could keep a separate array for scores (to avoid reallocating into an array of pairs). This requires modifying the swap.
  2. I understand it's temporary(?), but passing SearchAlgorithmResult* is not the best interface.
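The "modified swap" idea from point 1 can be sketched with a hand-rolled sort that keeps ids and scores in separate parallel arrays and swaps both on every exchange. This is a hypothetical illustration (names and `int`/`double` element types assumed), not the PR's SearchAlgorithmResult code:

```cpp
#include <utility>
#include <vector>

// Quicksort over two parallel arrays, ordered by `scores`. Every exchange
// swaps both arrays, so no array-of-pairs reallocation is needed.
void SortTogether(std::vector<int>& ids, std::vector<double>& scores,
                  long lo, long hi) {
  if (lo >= hi) return;
  double pivot = scores[lo + (hi - lo) / 2];
  long i = lo, j = hi;
  while (i <= j) {
    while (scores[i] < pivot) ++i;
    while (scores[j] > pivot) --j;
    if (i <= j) {
      std::swap(scores[i], scores[j]);
      std::swap(ids[i], ids[j]);  // the "modified swap": move the pair as a unit
      ++i;
      --j;
    }
  }
  SortTogether(ids, scores, lo, j);
  SortTogether(ids, scores, i, hi);
}

void SortTogether(std::vector<int>& ids, std::vector<double>& scores) {
  SortTogether(ids, scores, 0, static_cast<long>(ids.size()) - 1);
}
```

In real code one would reuse the library sort with a zip-style iterator instead of hand-rolling the partition, but the sketch shows why only the swap needs to change.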

Contributor Author

Yes, it's temporary. I added TODOs and also mentioned in the PR description that it will be fixed in the very next PR.

Good idea regarding the swap.

@BagritsevichStepan (Contributor, Author) commented Jun 30, 2025

Also, take into account that now (on the current main branch) this method is not called at all 😄

search::SortOrder sort_order, search::SearchAlrgorithmResult* search_result) const {
  auto fident = field.GetIdentifier(base_->schema, false);

  if (IsSortableField(fident, base_->schema)) {
Contributor

Nit: maybe let's not move logic higher than needed; doc_index itself knows whether a field is sortable, so it can handle this internally, and we just pass an accessor object unconditionally.

for (size_t i = 0; i < results.size(); ++i) {
  if (!results[i].docs.empty()) {
    const auto& first_doc = results[i].docs[0];
    update_cached_value(i, first_doc, &cached_values);
Contributor

nit: update_cached_value can really just be a get_value(i) 🤔

Comment on lines +619 to +624
for (size_t i = 0; i < results.size(); ++i) {
  if (indexes[i] < results[i].docs.size() && (!next_doc || comparator(i, next_index))) {
    next_doc = &results[i].docs[indexes[i]];
    next_index = i;
  }
}
Contributor

this is actually where you can use a heap optimization with make_heap or something akin
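A make_heap-based version of that merge loop could look like this sketch (the `Cursor` type and function names are assumptions, not the PR's code): the heap holds one cursor per shard, so choosing the next element costs O(log shards) instead of a linear scan.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One cursor per shard result; the heap orders cursors by their head value.
struct Cursor {
  const std::vector<int>* docs;
  size_t pos = 0;
  int Head() const { return (*docs)[pos]; }
};

// Heap-based k-way merge: O(total * log(num_shards)) comparisons
// instead of scanning every shard for each emitted element.
std::vector<int> MergeWithHeap(const std::vector<std::vector<int>>& shards,
                               size_t limit) {
  auto greater = [](const Cursor& a, const Cursor& b) { return a.Head() > b.Head(); };
  std::vector<Cursor> heap;
  for (const auto& docs : shards)
    if (!docs.empty()) heap.push_back(Cursor{&docs, 0});
  std::make_heap(heap.begin(), heap.end(), greater);  // min-heap via `greater`

  std::vector<int> out;
  out.reserve(limit);
  while (out.size() < limit && !heap.empty()) {
    std::pop_heap(heap.begin(), heap.end(), greater);  // smallest head to back
    Cursor cur = heap.back();
    heap.pop_back();
    out.push_back(cur.Head());
    if (++cur.pos < cur.docs->size()) {  // shard not exhausted: reinsert
      heap.push_back(cur);
      std::push_heap(heap.begin(), heap.end(), greater);
    }
  }
  return out;
}
```

As noted in the replies below, with a small shard count the linear scan is likely fine, so this is an optional refinement.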

Contributor Author

Yes, it is temporary; note that the number of shards is not big.

Contributor Author

Maybe later we can change this

@romange (Collaborator) commented Jul 1, 2025

Let's not review this PR for now - it won't be submitted in this form as it's too big and not focused.
