Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.18: implements weighted shuffle using N-ary tree (backport of #259) #429

Merged
merged 1 commit into from
Mar 27, 2024

Conversation

mergify[bot]
Copy link

@mergify mergify bot commented Mar 26, 2024

This is port of firedancer's implementation of weighted shuffle:
https://github.com/firedancer-io/firedancer/blob/3401bfc26/src/ballet/wsample/fd_wsample.c

Problem

#185 implemented weighted shuffle using binary tree. Though asymptotically a
binary tree has better performance, compared to a Fenwick tree, it is
less cache local resulting in smaller improvements and in particular
slower WeightedShuffle::new.

Summary of Changes

In order to improve cache locality and reduce the overheads of
traversing the tree, this commit instead uses a generalized N-ary tree
with fanout of 16, showing significant improvements in both
WeightedShuffle::new and WeightedShuffle::shuffle.

With 4000 weights:

N-ary tree (fanout 16):

test bench_weighted_shuffle_new     ... bench:      36,244 ns/iter (+/- 243)
test bench_weighted_shuffle_shuffle ... bench:     149,082 ns/iter (+/- 1,474)

Binary tree:

test bench_weighted_shuffle_new     ... bench:      58,514 ns/iter (+/- 229)
test bench_weighted_shuffle_shuffle ... bench:     269,961 ns/iter (+/- 16,446)

Fenwick tree:

test bench_weighted_shuffle_new     ... bench:      39,413 ns/iter (+/- 179)
test bench_weighted_shuffle_shuffle ... bench:     364,771 ns/iter (+/- 2,078)

The improvements become even more significant as there are more items to
shuffle. With 20_000 weights:

N-ary tree (fanout 16):

test bench_weighted_shuffle_new     ... bench:     200,659 ns/iter (+/- 4,395)
test bench_weighted_shuffle_shuffle ... bench:     941,928 ns/iter (+/- 26,492)

Binary tree:

test bench_weighted_shuffle_new     ... bench:     881,114 ns/iter (+/- 12,343)
test bench_weighted_shuffle_shuffle ... bench:   1,822,257 ns/iter (+/- 12,772)

Fenwick tree:

test bench_weighted_shuffle_new     ... bench:     276,936 ns/iter (+/- 14,692)
test bench_weighted_shuffle_shuffle ... bench:   2,644,713 ns/iter (+/- 49,252)

This is an automatic backport of pull request #259 done by [Mergify](https://mergify.com).

This is port of firedancer's implementation of weighted shuffle:
https://github.com/firedancer-io/firedancer/blob/3401bfc26/src/ballet/wsample/fd_wsample.c

#185
implemented weighted shuffle using binary tree. Though asymptotically a
binary tree has better performance, compared to a Fenwick tree, it has
less cache locality resulting in smaller improvements and in particular
slower WeightedShuffle::new.

In order to improve cache locality and reduce the overheads of
traversing the tree, this commit instead uses a generalized N-ary tree
with fanout of 16, showing significant improvements in both
WeightedShuffle::new and WeightedShuffle::shuffle.

With 4000 weights:

N-ary tree (fanout 16):

    test bench_weighted_shuffle_new     ... bench:      36,244 ns/iter (+/- 243)
    test bench_weighted_shuffle_shuffle ... bench:     149,082 ns/iter (+/- 1,474)

Binary tree:

    test bench_weighted_shuffle_new     ... bench:      58,514 ns/iter (+/- 229)
    test bench_weighted_shuffle_shuffle ... bench:     269,961 ns/iter (+/- 16,446)

Fenwick tree:

    test bench_weighted_shuffle_new     ... bench:      39,413 ns/iter (+/- 179)
    test bench_weighted_shuffle_shuffle ... bench:     364,771 ns/iter (+/- 2,078)

The improvements become even more significant as there are more items to
shuffle. With 20_000 weights:

N-ary tree (fanout 16):

    test bench_weighted_shuffle_new     ... bench:     200,659 ns/iter (+/- 4,395)
    test bench_weighted_shuffle_shuffle ... bench:     941,928 ns/iter (+/- 26,492)

Binary tree:

    test bench_weighted_shuffle_new     ... bench:     881,114 ns/iter (+/- 12,343)
    test bench_weighted_shuffle_shuffle ... bench:   1,822,257 ns/iter (+/- 12,772)

Fenwick tree:

    test bench_weighted_shuffle_new     ... bench:     276,936 ns/iter (+/- 14,692)
    test bench_weighted_shuffle_shuffle ... bench:   2,644,713 ns/iter (+/- 49,252)

(cherry picked from commit 30eecd6)
@behzadnouri behzadnouri added the automerge automerge Merge this Pull Request automatically once CI passes label Mar 26, 2024
@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.6%. Comparing base (e356a44) to head (05c205f).

Additional details and impacted files
@@            Coverage Diff            @@
##            v1.18     #429     +/-   ##
=========================================
- Coverage    81.6%    81.6%   -0.1%     
=========================================
  Files         827      827             
  Lines      224756   224753      -3     
=========================================
- Hits       183419   183404     -15     
- Misses      41337    41349     +12     

@mergify mergify bot merged commit 48a850d into v1.18 Mar 27, 2024
36 checks passed
@mergify mergify bot deleted the mergify/bp/v1.18/pr-259 branch March 27, 2024 05:36
anwayde pushed a commit to firedancer-io/agave that referenced this pull request Jul 23, 2024
…-xyz#259) (anza-xyz#429)

implements weighted shuffle using N-ary tree (anza-xyz#259)

This is port of firedancer's implementation of weighted shuffle:
https://github.com/firedancer-io/firedancer/blob/3401bfc26/src/ballet/wsample/fd_wsample.c

anza-xyz#185
implemented weighted shuffle using binary tree. Though asymptotically a
binary tree has better performance, compared to a Fenwick tree, it has
less cache locality resulting in smaller improvements and in particular
slower WeightedShuffle::new.

In order to improve cache locality and reduce the overheads of
traversing the tree, this commit instead uses a generalized N-ary tree
with fanout of 16, showing significant improvements in both
WeightedShuffle::new and WeightedShuffle::shuffle.

With 4000 weights:

N-ary tree (fanout 16):

    test bench_weighted_shuffle_new     ... bench:      36,244 ns/iter (+/- 243)
    test bench_weighted_shuffle_shuffle ... bench:     149,082 ns/iter (+/- 1,474)

Binary tree:

    test bench_weighted_shuffle_new     ... bench:      58,514 ns/iter (+/- 229)
    test bench_weighted_shuffle_shuffle ... bench:     269,961 ns/iter (+/- 16,446)

Fenwick tree:

    test bench_weighted_shuffle_new     ... bench:      39,413 ns/iter (+/- 179)
    test bench_weighted_shuffle_shuffle ... bench:     364,771 ns/iter (+/- 2,078)

The improvements become even more significant as there are more items to
shuffle. With 20_000 weights:

N-ary tree (fanout 16):

    test bench_weighted_shuffle_new     ... bench:     200,659 ns/iter (+/- 4,395)
    test bench_weighted_shuffle_shuffle ... bench:     941,928 ns/iter (+/- 26,492)

Binary tree:

    test bench_weighted_shuffle_new     ... bench:     881,114 ns/iter (+/- 12,343)
    test bench_weighted_shuffle_shuffle ... bench:   1,822,257 ns/iter (+/- 12,772)

Fenwick tree:

    test bench_weighted_shuffle_new     ... bench:     276,936 ns/iter (+/- 14,692)
    test bench_weighted_shuffle_shuffle ... bench:   2,644,713 ns/iter (+/- 49,252)

(cherry picked from commit 30eecd6)

Co-authored-by: behzad nouri <behzadnouri@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automerge automerge Merge this Pull Request automatically once CI passes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants