feat(tdigest): add the support of TDIGEST.REVRANK command#3130
feat(tdigest): add the support of TDIGEST.REVRANK command#3130LindaSummer merged 75 commits intoapache:unstablefrom
Conversation
|
Thank you for your contribution. Could you add some golang test cases for it? Refer to https://github.com/apache/kvrocks/blob/unstable/tests/gocase/unit/type/tdigest/tdigest_test.go. |
|
@PragmaTwice ok,I will add some golang test |
|
src/types/redis_tdigest.cc
Outdated
| for (auto value : inputs) { | ||
| auto status_or_rank = TDigestRevRank(dump_centroids, value); | ||
| if (!status_or_rank) { | ||
| return rocksdb::Status::InvalidArgument(status_or_rank.Msg()); | ||
| } | ||
| result->push_back(*status_or_rank); | ||
| } |
There was a problem hiding this comment.
Hi @donghao526 ,
We could sort the inputs and get the ranks with just one scan of the centroids since it's sorted.
Best Regards,
Edward
There was a problem hiding this comment.
Hi @LindaSummer
I encountered a problem when I was testing. After the nodes merged, are there two adjacent centroids can be with the same mean?
I Test with
TDIGEST.CREATE s COMPRESSION 1000
TDIGEST.ADD s 10 10 10 10 20 20
I found the centroids after merged are:
(1) mean: 10 weight: 1
(2) mean: 10 weight: 1
(3) mean: 10 weight: 1
(4) mean: 10 weight: 1
(5) mean: 20 weight: 1
(6) mean: 20 weight: 1
Is this as expected or a bug?
There was a problem hiding this comment.
Hi @donghao526 ,
It is expected, and you could refer to #2878 for more details.
So we need a stable way for both serialization and deserialization.
The trigger for the merge is the weight, not the mean. So we could treat the mean only as a label of one centroid. The whole logic is driven by weight.
Best Regards,
Edward
src/types/tdigest.h
Outdated
| } | ||
|
|
||
| template <typename TD> | ||
| inline StatusOr<int> TDigestRevRank(TD&& td, double value) { |
There was a problem hiding this comment.
Hi @donghao526 ,
We need to use a stable way to compare between doubles.
It will be tough to assume that the two double numbers are equal to or greater than.
After solving this, we should add some test cases for this corner case.
Best Regards,
Edward
There was a problem hiding this comment.
Hi @donghao526 ,
Since the other code snippets use this way now. You could leave it with the current logic.
I will try to create a new PR to solve the unstable comparison problem in this file.
Best Regards,
Edward
There was a problem hiding this comment.
OK, after your new PR, I can help to fix here.
…ks into feature/tdigest-revrank
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ks into feature/tdigest-revrank
…ks into feature/tdigest-revrank
There was a problem hiding this comment.
Pull Request Overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| std::map<double, size_t, DoubleComparator> value_to_indices; | ||
| for (size_t i = 0; i < inputs.size(); ++i) { | ||
| value_to_indices[inputs[i]] = i; | ||
| } |
There was a problem hiding this comment.
When duplicate input values exist, this map will only store the index of the last occurrence, causing incorrect results to be returned. The value_to_indices map should use a multimap or map to vector to store all indices for duplicate values, not just the last one.
There was a problem hiding this comment.
The inputs have been deduplicated before calling this function, and there are no duplicate input values
src/commands/cmd_tdigest.cc
Outdated
| std::unordered_set<std::string> unique_inputs_set(args.begin() + 2, args.end()); | ||
| origin_inputs_.assign(args.begin() + 2, args.end()); | ||
|
|
||
| unique_inputs_.reserve(unique_inputs_set.size()); | ||
| size_t i = 0; | ||
| for (const auto &input : unique_inputs_set) { | ||
| auto value = ParseFloat(input); | ||
| if (!value) { | ||
| return {Status::RedisParseErr, errValueIsNotFloat}; | ||
| } | ||
| unique_inputs_.push_back(*value); | ||
| unique_inputs_order_[input] = i++; | ||
| } |
There was a problem hiding this comment.
The iteration order of std::unordered_set is not deterministic, which means the mapping in unique_inputs_order_ may not correspond correctly to the indices in unique_inputs_. This can lead to incorrect results when looking up values. Consider using the string representation as the key and ensuring consistent ordering, or iterate over unique_inputs_ after population to build the mapping.
There was a problem hiding this comment.
The origin_inputs_ vector stores the correct order of the origin inputs. So the order of the results is deterministic.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…eading Co-authored-by: Edward Xu <xuxiangad@foxmail.com>
Co-authored-by: Edward Xu <xuxiangad@foxmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
LindaSummer
left a comment
There was a problem hiding this comment.
LGTM, left one nitpit comment.
Co-authored-by: Edward Xu <xuxiangad@foxmail.com>
|





ISSUE
It closes #3063.
Proposed Changes
Add TDIGEST.REVRANK command implementation
Add cpp unit tests