Add BracketedSort a new, faster algorithm for partialsort and friends #52006
Merged
Changes from 37 commits
41 commits
9f4ad8b
add sample implementation
e47321f
add fallback and remove instrumentation
c61e6c3
add a faster, non-allocating version
9093cb9
small tweaks
ca7bd59
add tests and support target ranges
a170fc7
Add tuning
8c6eff6
implement threshold
4a6fce7
Merge branch 'master' into lh/fast-partialsort
1f1fc3b
add slow version to Julia
4aad01f
fix some bugs and fiddle with optimization passes (specifically disa…
8e66d4b
a bit more fiddling. The remaining performance gap is due to NaN safety
5d07ffe
revert whitespace change
0f81beb
update comments and increase tries from 4 to 5
e1df36e
remove 'deleteme' development file
0ebef7e
Merge branch 'master' into lh/fast-partialsort
8003a0c
update docstring
8e933c3
support non-unit-range targets
76d2833
bugfix TODO: add tests that catch this
847172e
another bugfix (this one caught by CI)
5a85c03
update invalid lt tests
86fc129
add todo
a3a6c47
Tweak dispatch to avoid >100% regressions on 39 element arrays & opti…
bda1b6d
more performance characteristic tweaks (and a dynamic dispatch perfor…
b2e4529
use standard optimizations for recursive calls
fd8d967
cleanup, add comments, and admit weakness against inputs with duplica…
8361184
make lots of duplicates non-pathological (still not great, but not te…
5d52194
fix some bugs (wow, we need better test coverage!) and add a dispatch…
6f8048f
change offset from .5 to .7 (helps a huge amount for small to medium …
d0a38a2
noting that running a hundred benchmarks doesn't fail a single trial,…
52a6785
implement NFC todo that requires rebuilding Julia
1d90487
fix typo
83c9e27
check and document the invariant that makes the `@inbounds`s safe
0b2b399
fix some unimportant off by one errors that have been bugging me
422a14b
round less coarsely
ad82125
micro-refactor to use more code sharing
ccb5c99
Avoid overflow and nfc refactor add comments, and variable rename for…
650c6a2
randomize initial hash seed; use consistent recursive algorithms; add…
5c18e25
REVERT ME: revert the re-introduction of PartialQuickSort
5657e5f
implement Oscar's suggestion to speed up heuristic computation
069c453
accept that invalid lt continues to work
eb86ec5
Merge branch 'master' into lh/fast-partialsort
LilithHafner
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -91,7 +91,15 @@ issorted(itr; | |
issorted(itr, ord(lt,by,rev,order)) | ||
|
||
function partialsort!(v::AbstractVector, k::Union{Integer,OrdinalRange}, o::Ordering) | ||
_sort!(v, InitialOptimizations(ScratchQuickSort(k)), o, (;)) | ||
# TODO move k from `alg` to `kw` | ||
# Don't perform InitialOptimizations before Bracketing. The optimizations take O(n) | ||
# time and so does the whole sort. But do perform them before recursive calls because | ||
# that can cause significant speedups when the target range is large so the runtime is | ||
# dominated by k log k and the optimizations run in O(k) time. | |
_sort!(v, BoolOptimization( | ||
Small{12}( # Very small inputs should go straight to insertion sort | ||
BracketedSort(k))), | ||
o, (;)) | ||
maybeview(v, k) | ||
end | ||
|
||
|
@@ -1111,6 +1119,197 @@ function _sort!(v::AbstractVector, a::ScratchQuickSort, o::Ordering, kw; | |
end | ||
|
||
|
||
""" | ||
BracketedSort(target[, next::Algorithm]) <: Algorithm | ||
|
||
Perform a partialsort for the elements that fall into the indices specified by the `target` | ||
using BracketedSort with the `next` algorithm for subproblems. | ||
|
||
BracketedSort takes a random* sample of the input, estimates the quantiles of the input | ||
using the quantiles of the sample to find signposts that almost certainly bracket the target | ||
values, filters the values in the input that fall between the signpost values to the front of | |
the input, and then, if that "almost certainly" turned out to be true, finds the target | |
within the small chunk of elements that are, by value, between the signposts and now, by position, at the | |
front of the vector. On small inputs or when target is close to the size of the input, | ||
BracketedSort falls back to the `next` algorithm directly. Otherwise, BracketedSort uses the | ||
`next` algorithm only to compute quantiles of the sample and to find the target within the | ||
small chunk. | ||
|
||
## Performance | ||
|
||
If the `next` algorithm has `O(n * log(n))` runtime and the input is not pathological then | ||
the runtime of this algorithm is `O(n + k * log(k))` where `n` is the length of the input | ||
and `k` is `length(target)`. On pathological inputs the asymptotic runtime is the same as | ||
the runtime of the `next` algorithm. | ||
|
||
BracketedSort itself does not allocate. If `next` is in-place then BracketedSort is also | ||
in-place. If `next` is not in-place, and its space usage increases monotonically with input | |
length then BracketedSort's maximum space usage will never be more than the space usage | ||
of `next` on the input BracketedSort receives. For large nonpathological inputs and targets | ||
substantially smaller than the size of the input, BracketedSort's maximum memory usage will | ||
be much less than `next`'s. If the maximum additional space usage of `next` scales linearly | ||
then for small k the average* maximum additional space usage of BracketedSort will be | ||
`O(n^(2.3/3))`. | ||
|
||
By default, BracketedSort uses the in place `PartialQuickSort` algorithm recursively for | ||
integer `target`s and the faster but not in place `ScratchQuickSort` for unit range | ||
`target`s. This is because the runtime of recursive calls is negligible for large inputs | ||
unless `k` is similar in size to `n`. | ||
|
||
*Sorting is unable to depend on Random.jl because Random.jl depends on sorting. | ||
Consequently, we use `hash` as a source of randomness. The average runtime guarantees | ||
assume that `hash(x::Int)` produces a random result. However, as this randomization is | ||
deterministic, if you try hard enough you can find inputs that consistently reach the | ||
worst case bounds. Actually constructing such inputs is an exercise left to the reader. | ||
Have fun :). | ||
|
||
Characteristics: | ||
* *unstable*: does not preserve the ordering of elements that compare equal | ||
(e.g. "a" and "A" in a sort of letters that ignores case). | ||
* *in-place* in memory if the `next` algorithm is in-place. | ||
* *estimate-and-filter* strategy | |
* *linear runtime* if `length(target)` is constant and `next` is reasonable | ||
* *n + k log k* worst case runtime if `next` has that runtime. | ||
* *pathological inputs* can significantly increase constant factors. | ||
""" | ||
struct BracketedSort{T, F} <: Algorithm | ||
target::T | ||
get_next::F | ||
end | ||
|
||
# TODO: this composition between BracketedSort and ScratchQuickSort does not bring me joy | ||
Can be avoided via moving |
||
BracketedSort(k::Integer) = BracketedSort(k, k -> InitialOptimizations(PartialQuickSort(k))) | ||
BracketedSort(k::OrdinalRange) = BracketedSort(k, k -> InitialOptimizations(ScratchQuickSort(k))) | ||
|
||
function bracket_kernel!(v::AbstractVector, lo, hi, lo_signpost, hi_signpost, o) | ||
i = 0 | ||
count_below = 0 | ||
checkbounds(v, lo:hi) | ||
for j in lo:hi | ||
x = @inbounds v[j] | ||
a = lo_signpost !== nothing && lt(o, x, lo_signpost) | ||
b = hi_signpost === nothing || !lt(o, hi_signpost, x) | ||
count_below += a | ||
# if a != b # This branch is almost never taken, so making it branchless is bad. | ||
# @inbounds v[i], v[j] = v[j], v[i] | ||
# i += 1 | ||
# end | ||
c = a != b # JK, this is faster. | ||
k = i * c + j | ||
# Invariant: @assert firstindex(v) ≤ lo ≤ i + j ≤ k ≤ j ≤ hi ≤ lastindex(v) | ||
@inbounds v[j], v[k] = v[k], v[j] | ||
i += c - 1 | ||
end | ||
count_below, i+hi | ||
end | ||
|
||
function move!(v, target, source) | ||
# This function never dominates runtime—only add `@inbounds` if you can demonstrate a | ||
# performance improvement. And if you do, also double check behavior when `target` | ||
# is out of bounds. | ||
@assert length(target) == length(source) | ||
if length(target) == 1 || isdisjoint(target, source) | ||
for (i, j) in zip(target, source) | ||
v[i], v[j] = v[j], v[i] | ||
end | ||
else | ||
@assert minimum(source) <= minimum(target) | ||
reverse!(v, minimum(source), maximum(target)) | ||
reverse!(v, minimum(target), maximum(target)) | ||
end | ||
end | ||
|
||
function _sort!(v::AbstractVector, a::BracketedSort, o::Ordering, kw) | ||
@getkw lo hi scratch | ||
# TODO for further optimization: reuse scratch between trials better, from signpost | |
# selection to recursive calls, and from the fallback (but be aware of type stability, | |
# especially when sorting IEEE floats). | |
|
||
# We don't need to bounds check target because that is done higher up in the stack | ||
# However, we cannot assume the target is inbounds. | ||
lo < hi || return scratch | ||
ln = hi - lo + 1 | ||
|
||
# This is simply a precomputed short-circuit to avoid doing scalar math for small inputs. | ||
# It does not change dispatch at all. | ||
ln < 260 && return _sort!(v, a.get_next(a.target), o, kw) | ||
|
||
target = a.target | ||
k2 = round(Int, ln^(2/3)) | ||
k2ln = k2/ln | ||
offset = .7k2^0.575 # TODO for further optimization: tune this | ||
|
||
lo_signpost_i, hi_signpost_i = | ||
(floor(Int, (tar - lo) * k2ln + lo + off) for (tar, off) in | ||
((minimum(target), -offset), (maximum(target), offset))) | ||
lastindex_sample = lo+k2-1 | ||
expected_middle_ln = (min(lastindex_sample, hi_signpost_i) - max(lo, lo_signpost_i) + 1) / k2ln | ||
# This heuristic is complicated because it fairly accurately reflects the runtime of | |
# this algorithm, which is necessary to get good dispatch when both the target and the | |
# input are large. | |
# expected_middle_ln is a float and k2 is significantly below typemax(Int), so this will | ||
# not overflow: | ||
# TODO move target from alg to kw to avoid this ickyness: | ||
ln <= 130 + 2k2 + 2expected_middle_ln && return _sort!(v, a.get_next(a.target), o, kw) | ||
|
||
# We store the random sample in | ||
# sample = view(v, lo:lo+k2) | ||
# but views are not quite as fast as using the input array directly, | ||
# so we don't actually construct this view at runtime. | ||
|
||
# TODO for further optimization: handle lots of duplicates better. | ||
# Right now lots of duplicates rounds up when it could use some super fast optimizations | ||
# in some cases. | ||
# e.g. | ||
# | ||
# Target: |----| | ||
# Sorted input: 000000000000000000011111112222223333333333 | ||
# | ||
# Will filter all zeros and ones to the front when it could just take the first few | ||
# it encounters. This optimization would be especially potent when `allequal(ans)` and | ||
# equal elements are egal. | ||
|
||
# 3 random trials should typically give us 0.99999 reliability; we can assume | ||
# the input is pathological and abort to fallback if we fail three trials. | ||
seed = hash(ln, Int === Int64 ? 0x85eb830e0216012d : 0xae6c4e15) | ||
for attempt in 1:3 | ||
seed = hash(attempt, seed) | ||
for i in lo:lo+k2-1 | ||
j = mod(hash(i, seed), i:hi) # TODO for further optimization: be sneaky and remove this division | ||
v[i], v[j] = v[j], v[i] | ||
end | ||
count_below, lastindex_middle = if lo_signpost_i <= lo && lastindex_sample <= hi_signpost_i | ||
# The heuristics higher up in this function that dispatch to the `next` | ||
# algorithm should prevent this from happening. | ||
# Specifically, this means that expected_middle_ln == ln, so | ||
# ln <= ... + 2.0expected_middle_ln && return ... | ||
# will trigger. | ||
@assert false | ||
# But if it does happen, the kernel reduces to | ||
0, hi | ||
elseif lo_signpost_i <= lo | ||
_sort!(v, a.get_next(hi_signpost_i), o, (;kw..., hi=lastindex_sample)) | ||
bracket_kernel!(v, lo, hi, nothing, v[hi_signpost_i], o) | ||
elseif lastindex_sample <= hi_signpost_i | ||
_sort!(v, a.get_next(lo_signpost_i), o, (;kw..., hi=lastindex_sample)) | ||
bracket_kernel!(v, lo, hi, v[lo_signpost_i], nothing, o) | ||
else | ||
# TODO for further optimization: don't sort the middle elements | ||
_sort!(v, a.get_next(lo_signpost_i:hi_signpost_i), o, (;kw..., hi=lastindex_sample)) | ||
bracket_kernel!(v, lo, hi, v[lo_signpost_i], v[hi_signpost_i], o) | ||
end | ||
target_in_middle = target .- count_below | ||
if lo <= minimum(target_in_middle) && maximum(target_in_middle) <= lastindex_middle | ||
scratch = _sort!(v, a.get_next(target_in_middle), o, (;kw..., hi=lastindex_middle)) | ||
move!(v, target, target_in_middle) | ||
return scratch | ||
end | ||
# This line almost never runs. | ||
end | ||
# This line only runs on pathological inputs. Make sure it's covered by tests :) | ||
_sort!(v, a.get_next(target), o, kw) | ||
end | ||
|
||
|
||
""" | ||
StableCheckSorted(next) <: Algorithm | ||
|
||
|
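For readers skimming the diff, the estimate-and-filter strategy described in the `BracketedSort` docstring can be illustrated with a much simpler, allocating sketch. This is not the code from this PR: `bracketed_select` and its `attempts` keyword are hypothetical names, it handles only a single integer target under the default `isless` ordering, it samples with replacement via `rand` instead of the `hash`-based in-place shuffle, and it copies the middle chunk instead of filtering it to the front of the input. The sample size `n^(2/3)`, the `0.7 * m^0.575` offset, the length-260 cutoff and the three attempts are taken from the diff above.

```julia
# Simplified, allocating illustration of estimate-and-filter selection.
# `bracketed_select` is a hypothetical helper, not part of Base or of this PR.
function bracketed_select(v::AbstractVector, k::Integer; attempts::Integer = 3)
    n = length(v)
    1 <= k <= n || throw(BoundsError(v, k))
    n < 260 && return sort(v)[k]           # small inputs: sort directly

    m = round(Int, n^(2 / 3))              # sample size
    offset = 0.7 * m^0.575                 # safety margin, in sample ranks
    center = k * m / n                     # expected rank of the target in the sample
    lo_i = floor(Int, center - offset)
    hi_i = ceil(Int, center + offset)

    for _ in 1:attempts
        sample = sort!(rand(v, m))         # assumption: sampling with replacement
        # A signpost index outside the sample means "no bound on that side".
        lo_sign = lo_i >= 1 ? sample[lo_i] : nothing
        hi_sign = hi_i <= m ? sample[hi_i] : nothing

        # Count the elements strictly below the lower signpost ...
        count_below = lo_sign === nothing ? 0 : count(x -> isless(x, lo_sign), v)
        # ... and keep only the elements that fall between the signposts.
        middle = filter(v) do x
            (lo_sign === nothing || !isless(x, lo_sign)) &&
                (hi_sign === nothing || !isless(hi_sign, x))
        end

        # If the signposts really bracket the target, it lives in `middle`.
        k_mid = k - count_below
        if 1 <= k_mid <= length(middle)
            return sort!(middle)[k_mid]
        end
    end
    return sort(v)[k]                      # fallback after repeated failures
end
```

Because every path either verifies that the target rank lies inside the bracketed chunk or falls back to a full sort, `bracketed_select(v, k)` should agree with `partialsort(v, k)` for any valid `k`; only the running time depends on how lucky the sample is.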
This is a TODO that predates this PR; I'm just adding a note because this PR touches target range handling and reminded me that this is not the best approach.