feat: impl CpcSketch by tisonkun · Pull Request #75 · apache/datasketches-rust

tisonkun · 2026-01-20T15:14:29Z

This refers to #37.

I plan to implement the following steps:

(DONE) PairTable (for storing sparse data)
(DONE) CpcSketch without union
1. Empty state with empty data
2. Sparse state with pair table
3. Hybrid state with dense vector
4. Pinned state with dense vector (with ICON estimator)
5. Sliding state with dense vector
(DONE) Union
Serde
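
For readers unfamiliar with the five states listed above, here is a minimal sketch of how a sketch's state could be derived from its coupon count. The `Flavor` enum and `determine_flavor` function are hypothetical names for illustration, not necessarily what this PR uses; the thresholds are my understanding of the Java/C++ implementations and should be checked against them.

```rust
/// Hypothetical names for the five states in the plan above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Flavor {
    Empty,
    Sparse,
    Hybrid,
    Pinned,
    Sliding,
}

/// Picks the state from lg_k and the number of collected coupons.
/// Assumed thresholds (with k = 2^lg_k):
/// sparse below 3k/32, hybrid below k/2, pinned below 27k/8, sliding after.
fn determine_flavor(lg_k: u8, num_coupons: u64) -> Flavor {
    let k = 1u64 << lg_k;
    let c = num_coupons;
    if c == 0 {
        Flavor::Empty
    } else if (c << 5) < 3 * k {
        Flavor::Sparse
    } else if (c << 1) < k {
        Flavor::Hybrid
    } else if (c << 3) < 27 * k {
        Flavor::Pinned
    } else {
        Flavor::Sliding
    }
}
```

With `lg_k = 11` (k = 2048), this sketch switches from sparse to hybrid at 192 coupons, to pinned at 1024, and to sliding at 6912.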

Signed-off-by: tison <wander4096@gmail.com>

tisonkun · 2026-01-20T17:22:36Z

datasketches/src/cpc/pair_table.rs

We may not implement the merge function for PairTable the way the Java/C++ impls do, and instead find another way to do the two-way merge. This is because in Rust a mutable ref cannot coexist with an immutable ref to the same data that is still in use, which is exactly how PairTable::merge is used in practice:

PairTable.merge(srcPairArr, 0, srcNumPairs, allPairs, srcNumPairs, numPairsFromArray, allPairs, 0); // note the overlapping subarray trick

The real effect here is a two-way merge of allPairs[srcNumPairs..numPairsFromArray] and srcPairArr. There should be a more idiomatic way to do this in Rust.
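
One possible shape for this, as a sketch only: take the source slice by shared reference and the destination by a single mutable reference, and address the destination by index so that no two borrows of it are alive at once. The function name `merge_into_prefix` and the layout assumption (the first `src.len()` slots of `all` are free, and the rest already holds the other sorted run) are mine, not the PR's.

```rust
/// Merges sorted `src` with the sorted run stored in `all[src.len()..]`,
/// writing the result over `all` starting at index 0.
///
/// Assumed layout: `all[..src.len()]` is scratch space and
/// `all[src.len()..]` is the second sorted run.
fn merge_into_prefix(src: &[u32], all: &mut [u32]) {
    let n = all.len();
    let mut a = 0; // read cursor into `src`
    let mut b = src.len(); // read cursor into the run inside `all`
    let mut c = 0; // write cursor into `all`

    // Invariant: c == a + (b - src.len()), so c < b while `src` has
    // elements left; the write cursor never clobbers unread data.
    while a < src.len() && b < n {
        if src[a] <= all[b] {
            all[c] = src[a];
            a += 1;
        } else {
            all[c] = all[b];
            b += 1;
        }
        c += 1;
    }

    // Copy whatever is left of `src`. If `src` ran out first, then c == b
    // and the rest of the run in `all` is already in its final position.
    while a < src.len() {
        all[c] = src[a];
        a += 1;
        c += 1;
    }
}
```

Because `src` and `all` are separate slices here, the borrow checker is satisfied without `unsafe`; the overlapping-subarray trick survives only as index arithmetic inside one mutable slice.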

Signed-off-by: tison <wander4096@gmail.com>

tisonkun · 2026-01-29T15:58:02Z

This PR is now ready for review.

It's mainly ported from the datasketches-cpp impl, so I tag @AlexanderSaydakov as a potential reviewer.

Union and serde (compression) will be implemented as follow-ups. The current state is already a reviewable and mergeable minimal feature set.

tisonkun · 2026-01-29T16:00:47Z

datasketches/src/cpc/mod.rs

+// specific language governing permissions and limitations
+// under the License.
+
+#![allow(dead_code)]


To be removed when Union and Serde get implemented.

Signed-off-by: tison <wander4096@gmail.com>

tisonkun · 2026-01-30T13:51:31Z

Cargo.toml

 dbg_macro = "deny"
+
 too_many_arguments = "allow"
+needless_range_loop = "allow"


This lint gives false positives in places where iterating over an index is more expressive.
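
As an illustration of the kind of loop meant here (a made-up example, not code from this PR): when the index itself carries meaning, such as a bit position used as a shift amount, rewriting the loop with iterators tends to hide the intent. The function `column_counts` is hypothetical, and the attribute shows the per-function alternative to the workspace-wide allow.

```rust
/// Counts, for each of the 64 bit columns, how many rows have that bit set.
/// `col` doubles as the shift amount and `row` as the row number, so the
/// index-based form mirrors the matrix view of the data.
#[allow(clippy::needless_range_loop)]
fn column_counts(rows: &[u64]) -> [u32; 64] {
    let mut counts = [0u32; 64];
    for col in 0..64 {
        for row in 0..rows.len() {
            counts[col] += ((rows[row] >> col) & 1) as u32;
        }
    }
    counts
}
```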

tisonkun · 2026-01-30T13:53:01Z

datasketches/src/cpc/pair_table.rs

+fn knuth_shell_sort3(a: &mut [u32]) {
+    let len = a.len();
+
+    // Start at 1 so that slices shorter than 9 elements still get the final h = 1 insertion pass.
+    let mut h = 1;
+    while h < len / 9 {
+        h = 3 * h + 1;
+    }
+
+    while h > 0 {
+        for i in h..len {
+            let v = a[i];
+            let mut j = i;
+            while j >= h && v < a[j - h] {
+                a[j] = a[j - h];
+                j -= h;
+            }
+            a[j] = v;
+        }
+        h /= 3;
+    }
+}


Java uses the standard Arrays.sort here. We could use [T]::sort (stable) or [T]::sort_unstable as well, but this is how the C++ impl does it.
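
On the stable-versus-unstable question, a minimal sketch of the standard-library alternative: for plain `u32` values, equal elements are indistinguishable, so a stable and an unstable sort produce the same slice, and `sort_unstable` avoids the extra allocation that the stable `sort` may make. The wrapper name `sort_pairs` is hypothetical.

```rust
/// Sorts the pairs in ascending order using the standard library.
/// Stability is irrelevant for bare `u32` keys, so the in-place
/// unstable sort is sufficient.
fn sort_pairs(a: &mut [u32]) {
    a.sort_unstable();
}
```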

Signed-off-by: tison <wander4096@gmail.com>

tisonkun · 2026-01-31T01:44:58Z

I'm going to merge this patch soon and continue with the serde (compression) part.

But this patch is ported manually, so I'd like more eyes on the concrete code to avoid mistakes like #63.

Also, it takes about 3 seconds to accumulate 100M distinct values on my local dev machine with the release profile. Much of the time is spent on hashing. I hope we can establish a baseline and improve the performance a bit.
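
A rough way to get such a baseline, sketched with the standard library only: time a closure over `n` distinct values, once with a hash-only closure and once with the real sketch update, and compare. Everything here is an assumption for illustration: `time_distinct_updates` and `hash_only` are hypothetical helpers, and `DefaultHasher` merely stands in for whatever hash function the sketch actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Feeds the values 0..n to `update` and returns the elapsed wall-clock time.
fn time_distinct_updates<F: FnMut(u64)>(n: u64, mut update: F) -> Duration {
    let start = Instant::now();
    for i in 0..n {
        update(i);
    }
    start.elapsed()
}

/// Stand-in for the hashing step alone (not the sketch's real hash).
/// `black_box` keeps the optimizer from discarding the work.
fn hash_only(value: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    black_box(hasher.finish())
}
```

Calling `time_distinct_updates(100_000_000, |v| { hash_only(v); })` in a release build gives the hashing-only cost; passing a closure that calls the sketch's update method instead gives the total, and the difference approximates the time spent in the sketch itself.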

Signed-off-by: tison <wander4096@gmail.com>

tisonkun · 2026-02-01T03:12:11Z

I'm going to merge this now to keep one commit maintainable.

Review after merge is welcome and desired :D

feat: impl CpcSketch

848131a

Signed-off-by: tison <wander4096@gmail.com>

tisonkun marked this pull request as draft January 20, 2026 15:14

pair table

d42cb8f

Signed-off-by: tison <wander4096@gmail.com>

tisonkun force-pushed the cpcsketch branch from 25a83ff to d42cb8f Compare January 20, 2026 16:02

tisonkun added 11 commits January 29, 2026 13:02

Merge branch 'main' into cpcsketch

94f76f7

icon_estimator

8b9d82f

Signed-off-by: tison <wander4096@gmail.com>

lookup tables

6219eed

Signed-off-by: tison <wander4096@gmail.com>

cpc_confidence

2ae4095

Signed-off-by: tison <wander4096@gmail.com>

cpc sketch impl

d201bfb

Signed-off-by: tison <wander4096@gmail.com>

max_serialized_bytes

315f4e8

Signed-off-by: tison <wander4096@gmail.com>

update structure

0234127

Signed-off-by: tison <wander4096@gmail.com>

impl update

3e09c9c

Signed-off-by: tison <wander4096@gmail.com>

promote_sparse_to_windowed

f914fb8

Signed-off-by: tison <wander4096@gmail.com>

move_window

822cc6b

Signed-off-by: tison <wander4096@gmail.com>

add tests

73b514c

Signed-off-by: tison <wander4096@gmail.com>

tisonkun marked this pull request as ready for review January 29, 2026 15:56

tisonkun requested review from AlexanderSaydakov, leerho and notfilippo January 29, 2026 15:56

tisonkun added 6 commits January 30, 2026 00:41

refresh_kxp

08b7a61

Signed-off-by: tison <wander4096@gmail.com>

update comments

5b34a74

Signed-off-by: tison <wander4096@gmail.com>

add validate method

a3ccddc

Signed-off-by: tison <wander4096@gmail.com>

impl Union

cfcd0b7

Signed-off-by: tison <wander4096@gmail.com>

add tests

acbcc1c

Signed-off-by: tison <wander4096@gmail.com>

less cast

770ed74

Signed-off-by: tison <wander4096@gmail.com>

more tests

21ea3f5

Signed-off-by: tison <wander4096@gmail.com>

tisonkun added 2 commits February 1, 2026 11:01

add more docs

aa0fe4a

Signed-off-by: tison <wander4096@gmail.com>

tidy

99e71ff

Signed-off-by: tison <wander4096@gmail.com>

tisonkun force-pushed the cpcsketch branch from 9f3f9e8 to 99e71ff Compare February 1, 2026 03:11

tisonkun enabled auto-merge (squash) February 1, 2026 03:12

tisonkun merged commit dd12abf into apache:main Feb 1, 2026
9 checks passed

tisonkun deleted the cpcsketch branch February 1, 2026 03:27

tisonkun merged 22 commits into apache:main from tisonkun:cpcsketch
