Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move sorting algorithms from base to stdlib (First try) #45584

Closed
wants to merge 1 commit into from

Conversation

LilithHafner
Copy link
Member

@LilithHafner LilithHafner commented Jun 4, 2022

I believe this makes sense for two reasons:

  1. It allows advanced sorting algorithms to rely on stdlibs (e.g. Random)
  2. Sorting that occurs without stdlibs loaded (i.e. Core.Compiler.Sort.sort!) is heavily skewed toward sorting very short vectors. InsertionSort is well suited to those sorting tasks.

I did some profiling by inserting

const SORTS = []
function sort!(v::AbstractVector{T}, alg::Algorithm,
        order::Ordering, t::Union{AbstractVector{T}, Nothing}=nothing) where T
    push!(SORTS, (length(v), T))
    sort!(v, firstindex(v), lastindex(v), alg, order, t)
end

into sort.jl before building from source and then doing a bit of plotting.

The results I got indicate that the vast majority of sorting happens for vectors of length less than 30, well before any of our existing advanced sorting algorithm start to outperform InsertionSort.

Here are some figures I and a script that helped me reach that conclusion:

In the top row are some data on runtie if InsertionSort and standard sorting. InsertionSort can be a tiny bit faster for short inputs, and a decent bit slower for medium inputs.

In the bottom row are some data on which lengths of vectors are sorted, the conclusion is that the vectors are almost all very short. Even when weighted by runtime (which all the figures are) the vast majority of runtime is spent sorting very small vectors. This makes me think it is okay to not have advanced sorting in base.

Screen Shot 2022-06-04 at 7 10 39 PM

Script
using GLMakie
using Random
using Statistics

function runtime(n; alg)
    x = [rand(Int64, n) for _ in 1:82]
    y = [[(rand(Int64), rand(Int64), rand(Bool)) for _ in 1:n] for _ in 1:277]
    z = [[(rand(Int64), rand(Int64)) for _ in 1:n] for _ in 1:127]
    (@elapsed begin
        for i in x
            sort!(i; alg)
        end
        for i in y
            sort!(i; alg)
        end
        for i in z
            sort!(i; alg)
        end
    end) ./ 486
end

x = shuffle(1:600)
@time y1 = runtime.(x; alg=InsertionSort)
@time y2 = runtime.(x; alg=Base.Sort.DEFAULT_UNSTABLE)

outlier_threshold = mean(vcat(y1, y2)) + 3std(vcat(y1, y2))
inliers = (y1 .< outlier_threshold) .& (y2 .< outlier_threshold)
x = x[inliers]
y1 = y1[inliers]
y2 = y2[inliers]

insertion_estimated_runtime(n) = 1.8e-10(n^2+60n)
default_estimated_runtime(n) = 5.3e-9(n*log(n)+1.5)

fig = Figure()

Label(fig[0,1:4], "estimates: red, insertion: blue, default: orange")

ax = Axis(fig[1,1], xlabel="length", ylabel="runtime")
plot!(x, y1; color=:blue)
plot!(x, y2; color=:orange)
lines!(sort(x), insertion_estimated_runtime.(sort(x)), color=:red, linewidth=3)
lines!(sort(x), default_estimated_runtime.(sort(x)), color=:red, linewidth=3)

ax = Axis(fig[1,2], xlabel="length", ylabel="runtime ratio")
plot!(x, y1 ./ insertion_estimated_runtime.(x); color=:blue)
lines!(sort(x), ones(length(x)), color=:red, linewidth=3)
ylims!(.7,4)

ax = Axis(fig[1,3], xlabel="length", ylabel="runtime ratio")
plot!(x, y2 ./ default_estimated_runtime.(x), color=:orange)
lines!(sort(x), ones(length(x)), color=:red, linewidth=3)
ylims!(.7,4)

ax = Axis(fig[1,4], xlabel="length", ylabel="runtime ratio (common estimate)")
plot!(x, y1 ./ insertion_estimated_runtime.(x); color=:blue)
plot!(x, y2 ./ insertion_estimated_runtime.(x), color=:orange)
xlims!(0,100)
ylims!(.7, 4)

ax = Axis(fig[2,1], title="Histogram", xlabel="length", ylabel="fraction of time")
hist!(first.(Core.Compiler.Sort.SORTS), weights=insertion_estimated_runtime.(first.(Core.Compiler.Sort.SORTS)))
ylims!(0,.045)

e = ecdf(first.(Core.Compiler.Sort.SORTS), weights=insertion_estimated_runtime.(first.(Core.Compiler.Sort.SORTS)))
ax = Axis(fig[2,3], title="Cumulative distribution function (CDF)", xlabel="length", ylabel="fraction of time spent sorting shorter inputs")
lines!(sort(x), e.(sort(x)), )
ax = Axis(fig[2,4], title="CDF close up", xlabel="length", ylabel="fraction of time spent sorting shorter inputs")
lines!(sort(x), e.(sort(x)), )
xlims!(0,100)
ylims!(.8,1)

display("Unweighted")
display(sort(collect(countmap(last.(Core.Compiler.Sort.SORTS))); by=last))
display("Weighted")
display(sort(collect(countmap(last.(Core.Compiler.Sort.SORTS), insertion_estimated_runtime.(first.(Core.Compiler.Sort.SORTS)))); by=last))

display(fig)

This is a work in progress (e.g. docs and test are copied directly from Random), I'm not sure exactly how to carry out the migration and want to hear feedback before proceeding any further.

Likely fixes #45522

@LilithHafner LilithHafner added the excision Removal of code from Base or the repository label Jun 4, 2022
@oscardssmith
Copy link
Member

IMO, it would be fine if (like *) sort automatically picks a good algorithm, but choosing a specific algorithm requires using a library.

@LilithHafner LilithHafner added the stdlib Julia's standard library label Jun 4, 2022
@LilithHafner
Copy link
Member Author

LilithHafner commented Jun 4, 2022

This PR puts StandardSortingAlgorithms in the same category as Random. It is automatically loaded for user code. When a user calls sort(rand(1000)) it will still perform a presorted check, a reverse sorted check, and then dispatch to radix sort via fpsort (i.e. picks a good default algorithm).

The point is to allow sorting to depend on stdlibs, the cost is that Core.Compiler.Sort.sort! will have fewer algorithms available (just InsertionSort). I believe that cost may even be e benefit because of how the compiler uses sorting.

@oscardssmith
Copy link
Member

The one thing that I would be worried about here is that even if the compiler typically only has to sort small things, we have worked really hard in the past year to remove accidental quadratic algorithms from the compiler that result in bad scaling for large amounts of generated code, and we don't want to add a new one.

@Seelengrab
Copy link
Contributor

Seelengrab commented Jun 5, 2022

I think this is a good idea! The name is quite long though, how about keeping the Sort name from the original module?

we have worked really hard in the past year to remove accidental quadratic algorithms from the compiler

Would using mergesort instead of insertionsort for longer vectors as a fallback be an alternative? It's slower on average, but doesn't have the worst case quadratic behavior.

@LilithHafner
Copy link
Member Author

@oscardssmith Yep, that makes sense. In my profiling, I never compiled an expression consisting of the sum of 1000 variables, or really anything quite unusual.

@Seelengrab that sounds reasonable. This PR would leave MergeSort in base.

Then, #45222, which stabilizes QuickSort, could move MergeSort out of base (and deprecate it?) and move QuickSort back in. Core.Compiler.Sort.sort!(alg=QuickSort) would have to use non-randomized pivot selection, but I think it is safe to assume non-malicious input to Core.Compiler.Sort.sort! (if you are compiling malicious code you have bigger problems than quadratic runtime!).

I don't want to merge this PR with #45222 because both are already pretty big.

@LilithHafner
Copy link
Member Author

The name is quite long though, how about keeping the Sort name from the original module?

Agreed. I had hoped we could find a shorter name. I had assumed that Sort would result in a naming conflict, but it appears not to. I'll switch it to Sort.

@Seelengrab
Copy link
Contributor

In my profiling, I never compiled an expression consisting of the sum of 1000 variables, or really anything quite unusual.

I think Keno sometimes reaches such quite unusual cases, from function & method generation from AD.

@LilithHafner
Copy link
Member Author

Thanks for the feedback, y'all! I'm going to plan to move forward with this after several other sorting-related PRs are finished to avoid merge conflicts.

Hold pending other PRs (see #44876 (reply in thread) for roadmap)

@LilithHafner LilithHafner marked this pull request as draft June 6, 2022 13:10
@ViralBShah
Copy link
Member

Loving all your work on the sorting stuff!

@ViralBShah ViralBShah added the sorting Put things in order label Jul 19, 2022
@LilithHafner LilithHafner changed the title [RFC] Move sorting algorithms from base to stdlib Move sorting algorithms from base to stdlib (First try) Sep 10, 2022
@LilithHafner
Copy link
Member Author

Superseded by #46679

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
excision Removal of code from Base or the repository sorting Put things in order stdlib Julia's standard library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Core.Compiler calls into sort with undefined view
4 participants