DartMinHash & Efficient Rejection Sampling: Fast Sketching for Weighted Sets

This crate implements the DartMinHash (1) and Efficient Rejection Sampling (ERS) (3) algorithms for estimating weighted Jaccard similarity. To reproduce the algorithm in the paper, we use the same tabulation hashing idea (4), with a Mersenne Twister PRNG used for seeding. Other high-quality 64-bit hash functions, such as xxhash-rust or wyhash-rs, should work as well.

Note: DartMinHash is significantly faster than Rejection Sampling (RS) (2) and Efficient Rejection Sampling (ERS) (3) for very sparse vectors, that is, when the number of nonzero elements (d) is on average less than ~5% of the vector dimension (D). This is especially true for large-scale datasets. However, RS and ERS require that the maximum weight of the input vectors be known for each dimension; otherwise, the estimate is significantly biased (6). Their general applicability is therefore limited by the need for a priori knowledge of sharp upper bounds $w_{max}(d)$. ERS is also not unbiased (3).
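For reference, the quantity both sketches estimate is the weighted Jaccard similarity, the sum of per-dimension minima divided by the sum of per-dimension maxima. The function below is a minimal sketch of the exact computation for sparse (id, weight) vectors; it is not part of the crate's API, the name `weighted_jaccard` is hypothetical, and it assumes non-negative weights with each id appearing at most once per vector.

```rust
use std::collections::HashMap;

/// Exact weighted Jaccard similarity of two sparse, non-negative vectors:
/// sum_i min(a_i, b_i) / sum_i max(a_i, b_i).
/// Hypothetical helper for illustration; ids missing from a vector have weight 0.
fn weighted_jaccard(a: &[(u64, f64)], b: &[(u64, f64)]) -> f64 {
    let a_map: HashMap<u64, f64> = a.iter().copied().collect();
    let b_map: HashMap<u64, f64> = b.iter().copied().collect();

    let mut min_sum = 0.0;
    let mut max_sum = 0.0;

    // Ids present in a (and possibly in b).
    for (&id, &wa) in &a_map {
        let wb = b_map.get(&id).copied().unwrap_or(0.0);
        min_sum += wa.min(wb);
        max_sum += wa.max(wb);
    }
    // Ids present only in b contribute to the denominator alone.
    for (&id, &wb) in &b_map {
        if !a_map.contains_key(&id) {
            max_sum += wb;
        }
    }

    // Convention chosen here: two all-zero vectors get similarity 0.
    if max_sum == 0.0 { 0.0 } else { min_sum / max_sum }
}

fn main() {
    let a: Vec<(u64, f64)> = vec![(5, 1.2), (17, 0.9), (23, 1.1), (42, 0.95), (100, 1.0)];
    let b: Vec<(u64, f64)> = vec![(5, 1.0), (17, 1.0), (44, 1.1), (100, 1.05)];
    println!("exact weighted Jaccard: {:.4}", weighted_jaccard(&a, &b));
}
```

An exact value like this is only practical for small inputs or spot checks, but it gives a baseline against which a sketch-based estimate can be compared.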

Install & test

Add the line below to the dependencies in your Cargo.toml. The official release is available on crates.io.

dartminhash = "0.1"

Test case to evaluate the accuracy of the DartMinHash algorithm:

cargo test --release dartminhash_approximates_weighted_jaccard -- --nocapture

Test for true weighted Jaccard from 0.005 to 0.98. Output for DartMinHash:

true weighted Jaccard: 0.9801980198019941
estimated weighted Jaccard: 0.98046875
true weighted Jaccard: 0.9230769230769282
estimated weighted Jaccard: 0.921142578125
true weighted Jaccard: 0.8691588785046781
estimated weighted Jaccard: 0.872802734375
true weighted Jaccard: 0.8181818181818195
estimated weighted Jaccard: 0.8193359375
true weighted Jaccard: 0.7391304347825998
estimated weighted Jaccard: 0.73876953125
true weighted Jaccard: 0.6666666666666655
estimated weighted Jaccard: 0.67138671875
true weighted Jaccard: 0.5999999999999979
estimated weighted Jaccard: 0.6025390625
true weighted Jaccard: 0.5384615384615447
estimated weighted Jaccard: 0.53662109375
true weighted Jaccard: 0.48148148148148356
estimated weighted Jaccard: 0.484619140625
true weighted Jaccard: 0.42857142857142844
estimated weighted Jaccard: 0.4267578125
true weighted Jaccard: 0.37931034482758635
estimated weighted Jaccard: 0.37060546875
true weighted Jaccard: 0.3333333333333311
estimated weighted Jaccard: 0.336181640625
true weighted Jaccard: 0.25000000000000155
estimated weighted Jaccard: 0.245361328125
true weighted Jaccard: 0.17647058823529352
estimated weighted Jaccard: 0.17626953125
true weighted Jaccard: 0.11111111111111141
estimated weighted Jaccard: 0.109130859375
true weighted Jaccard: 0.05263157894736854
estimated weighted Jaccard: 0.05224609375
true weighted Jaccard: 0.02564102564102588
estimated weighted Jaccard: 0.026123046875
true weighted Jaccard: 0.005025125628140735
estimated weighted Jaccard: 0.004638671875
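One way to judge whether gaps like those above are within expected noise is the binomial standard error of a collision-fraction estimate. The snippet below is a rough sketch under the assumption that the k sketch positions behave like independent Bernoulli trials with success probability J; that assumption and the value of k used in `main` are illustrative only and are not taken from the crate or its tests.

```rust
/// Approximate standard error of a collision-fraction estimate built from
/// k sketch positions, treating each position as an independent
/// Bernoulli(j) trial: sqrt(j * (1 - j) / k).
/// Hypothetical helper for illustration only.
fn minhash_std_error(j: f64, k: usize) -> f64 {
    (j * (1.0 - j) / k as f64).sqrt()
}

fn main() {
    let k = 4096; // hypothetical sketch size, chosen only for illustration
    for j in [0.98, 0.5, 0.05] {
        println!(
            "J = {:.2}, k = {}: standard error ~ {:.4}",
            j,
            k,
            minhash_std_error(j, k)
        );
    }
}
```

Under this model the error is largest near J = 0.5 and shrinks proportionally to 1/sqrt(k), so quadrupling k roughly halves the expected deviation.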

Test case to evaluate the accuracy of the RS and ERS algorithms:

cargo test --release ers_approximates_weighted_jaccard -- --nocapture

Test for true weighted Jaccard from 0.005 to 0.98. Output for ERS:

true weighted Jaccard: 0.980198019801974
estimated weighted Jaccard: 0.981689453125
true weighted Jaccard: 0.9230769230769282
estimated weighted Jaccard: 0.930419921875
true weighted Jaccard: 0.869158878504667
estimated weighted Jaccard: 0.873779296875
true weighted Jaccard: 0.818181818181824
estimated weighted Jaccard: 0.813232421875
true weighted Jaccard: 0.7391304347826144
estimated weighted Jaccard: 0.743408203125
true weighted Jaccard: 0.6666666666666666
estimated weighted Jaccard: 0.6650390625
true weighted Jaccard: 0.6000000000000038
estimated weighted Jaccard: 0.593994140625
true weighted Jaccard: 0.5384615384615451
estimated weighted Jaccard: 0.52978515625
true weighted Jaccard: 0.4814814814814917
estimated weighted Jaccard: 0.477783203125
true weighted Jaccard: 0.42857142857142766
estimated weighted Jaccard: 0.4443359375
true weighted Jaccard: 0.3793103448275924
estimated weighted Jaccard: 0.36767578125
true weighted Jaccard: 0.33333333333333287
estimated weighted Jaccard: 0.32275390625
true weighted Jaccard: 0.24999999999999994
estimated weighted Jaccard: 0.25390625
true weighted Jaccard: 0.17647058823529382
estimated weighted Jaccard: 0.1767578125
true weighted Jaccard: 0.11111111111111145
estimated weighted Jaccard: 0.112060546875
true weighted Jaccard: 0.052631578947368515
estimated weighted Jaccard: 0.05126953125
true weighted Jaccard: 0.025641025641025373
estimated weighted Jaccard: 0.030029296875
true weighted Jaccard: 0.00502512562814075
estimated weighted Jaccard: 0.005859375

Usage

DartMinHash:

use dartminhash::dartminhash::DartMinHash;
use dartminhash::rng_utils::mt_from_seed;
use dartminhash::similarity::jaccard_estimate_from_minhashes;

fn main() {
    let mut rng = mt_from_seed(42);
    let k = 128;

    let dartminhash = DartMinHash::new_mt(&mut rng, k);

    // Weighted inputs: overlap in IDs, but weights differ a bit
    let sample_a = vec![
        (5, 1.2),
        (17, 0.9),
        (23, 1.1),
        (42, 0.95),
        (100, 1.0),
    ];
    let sample_b = vec![
        (5, 1.0),
        (17, 1.0),
        (44, 1.1),
        (100, 1.05),
    ];

    let sketch_a = dartminhash.sketch(&sample_a);
    let sketch_b = dartminhash.sketch(&sample_b);

    let est_jaccard = jaccard_estimate_from_minhashes(&sketch_a, &sketch_b);

    println!("Estimated weighted Jaccard similarity: {:.4}", est_jaccard);
}

Efficient Rejection Sampling:

use dartminhash::{ErsWmh};
use dartminhash::rng_utils::mt_from_seed;

// Tight, real-valued caps: m_i = max_s w_i(s) (no ceil, no max(1))
fn caps_from_sets(d: usize, sets: &[&[(u64, f64)]]) -> Vec<f64> {
    let mut m = vec![0.0f64; d];
    for s in sets {
        for &(id, w) in *s {
            if w > 0.0 {
                let idx = id as usize;
                if w > m[idx] { m[idx] = w; }
            }
        }
    }
    m
}

fn main() {
    let mut rng = mt_from_seed(1337);

    let d: usize = 200_000;
    let k: u64   = 1024;
    let L: u64   = 512;  // with tight caps you can usually keep this modest

    // Two weighted vectors
    let a = vec![(5, 1.2), (17, 0.9), (23, 1.1), (42, 0.95), (100, 1.0)];
    let b = vec![(5, 1.0), (17, 1.0), (44, 1.1), (100, 1.05)];

    // Caps must dominate both vectors; length d should cover the largest id+1
    let m_per_dim: Vec<f64> = caps_from_sets(d, &[&a, &b]);

    // New constructor takes &[f64] caps
    let ers = ErsWmh::new_mt(&mut rng, &m_per_dim, k);

    // ERS returns k (id, rank) pairs; collisions on id estimate Jaccard
    let sk_a = ers.sketch(&a, Some(L));
    let sk_b = ers.sketch(&b, Some(L));

    let hits = sk_a.iter().zip(&sk_b).filter(|(x, y)| x.0 == y.0).count();
    let j_est = hits as f64 / k as f64;

    println!("ERS (L={}) estimated weighted Jaccard: {:.4}", L, j_est);
}
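Because RS and ERS estimates become biased when a weight exceeds its cap, it can help to validate inputs before sketching. The check below is a minimal sketch that uses only the standard library; `first_cap_violation` is a hypothetical helper name, not a function provided by the crate.

```rust
/// Returns the first (id, weight) pair that violates the caps, if any.
/// A pair violates the caps when its id lies outside the cap vector
/// or when its weight exceeds the cap for that dimension.
/// Hypothetical helper for illustration only.
fn first_cap_violation(caps: &[f64], v: &[(u64, f64)]) -> Option<(u64, f64)> {
    v.iter().copied().find(|&(id, w)| match caps.get(id as usize) {
        Some(&cap) => w > cap, // weight above the per-dimension cap
        None => true,          // id not covered by the cap vector
    })
}

fn main() {
    let caps = vec![0.0, 1.5, 2.0];
    let ok: Vec<(u64, f64)> = vec![(1, 1.2), (2, 2.0)];
    let too_heavy: Vec<(u64, f64)> = vec![(1, 1.6)];
    let out_of_range: Vec<(u64, f64)> = vec![(7, 0.1)];

    println!("{:?}", first_cap_violation(&caps, &ok));
    println!("{:?}", first_cap_violation(&caps, &too_heavy));
    println!("{:?}", first_cap_violation(&caps, &out_of_range));
}
```

Running such a check over every vector that will be sketched with the same caps catches both out-of-range ids and weights that would otherwise silently bias the estimate.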

Choosing L for Efficient Rejection Sampling (ERS)

The best L for achieving a given accuracy depends on the sparsity of the data (see the ERS paper (3)). The authors recommend $L=\frac{\alpha}{s}$, where s is the sparsity of the data (d/D, see above) and $\alpha$ is a constant, normally between 0.5 and 5. For real-world datasets, $\alpha$ = 5 works better.
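The rule of thumb above can be turned into a small helper. This is a sketch only: `choose_l` is a hypothetical name, rounding up to a whole number is an assumption made here, and the inputs in `main` are made-up values rather than measurements.

```rust
/// Rule-of-thumb L = alpha / s, where s = mean_nnz / dim is the sparsity
/// (average number of nonzero elements d over vector dimension D).
/// Computed as alpha * dim / mean_nnz and rounded up (rounding is an
/// assumption of this sketch). Hypothetical helper for illustration only.
fn choose_l(alpha: f64, mean_nnz: f64, dim: usize) -> u64 {
    assert!(alpha > 0.0 && mean_nnz > 0.0 && dim > 0);
    (alpha * dim as f64 / mean_nnz).ceil() as u64
}

fn main() {
    // Made-up example: D = 20_000 dimensions, d = 1_000 nonzeros on average,
    // so s = 0.05.
    let dim = 20_000;
    let mean_nnz = 1_000.0;
    for alpha in [0.5, 5.0] {
        println!("alpha = {}: L = {}", alpha, choose_l(alpha, mean_nnz, dim));
    }
}
```

The resulting value can be passed as the `Some(L)` argument shown in the ERS usage example. Note that L grows as the data gets sparser, which is consistent with DartMinHash being the faster choice for very sparse vectors.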

References

1. Christiani, T., 2020. DartMinHash: Fast sketching for weighted sets. arXiv preprint arXiv:2005.11547.

2. Shrivastava, A., 2016. Simple and efficient weighted minwise hashing. Advances in Neural Information Processing Systems, 29.

3. Li, X. and Li, P., 2021, May. Rejection sampling for weighted Jaccard similarity revisited. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, No. 5, pp. 4197-4205).

4. Pǎtraşcu, M. and Thorup, M., 2012. The power of simple tabulation hashing. Journal of the ACM (JACM), 59(3), pp.1-50.

5. Ertl, O., 2025. TreeMinHash: Fast sketching for weighted Jaccard similarity estimation. Zenodo. doi: 10.5281/zenodo.16730965.

6. Ertl, O., 2018, July. BagMinHash: Minwise hashing algorithm for weighted sets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1368-1377).
