Skip to content

Conversation

@matthias314
Copy link
Contributor

@matthias314 matthias314 commented May 20, 2021

This PR speeds up mergewith! in two ways:

  • A function boundary inside mergewith! makes it operate faster if the additional dictionaries use different types. As an added bonus, one now needs to implement dedicated mergewith! methods only for merging exactly two dictionaries.
    The following table shows the speed-up factor for merging an IdDict{Int,Float64} and an IdDict{Int, Int} into an IdDict{Int,Float64}. "Partial overlap" means that the dictionaries have half of the keys in common.
    size: complete/partial/no overlap
      10:  4.23  3.96  3.79
     100:  3.41  2.56  2.37
    1000:  3.45  2.75  2.29
   10000:  2.36  2.07  1.92
  100000:  2.17  1.97  1.42
 1000000:  1.88  1.47  1.34

I'm comparing to Julia 1.6.1. Because of the second change, I'm not using Dict here.

  • To avoid unnecessary hash table lookups, a dedicated method for Dict is added.
    The following table shows the speed-up factor for Dict{Int,Int} compared to the existing generic implementation for AbstractDict. The new method appears to be always faster, for dictionaries with less than 600 elements usually by a factor of at least 2.
    size: complete/partial/no overlap
      10:  1.99  1.97  2.45
     100:  2.14  2.23  2.27
    1000:  2.61  1.50  1.19
   10000:  1.51  1.24  1.29
  100000:  1.42  1.25  1.12
 1000000:  1.42  1.17  1.24

Both changes combined lead to the following speed-up for merging a Dict{Int,Float64} and a Dict{Int,Int} into a Dict{Int,Float64}:

    size: complete/partial/no overlap
      10:  34.88  35.38  37.51
     100:  36.90  33.30  29.31
    1000:  22.89  11.77  12.08
   10000:  15.50  10.46   9.64
  100000:  14.07   7.84   7.11
 1000000:  16.59  10.07   4.87
Click here for the code used to obtain the timings
using BenchmarkTools, Test, Printf
using Base: ht_keyindex2!, _setindex!

function new_mergewith!(combine, d::AbstractDict, others::AbstractDict...)
    foldl(new_mergewith!(combine), others; init = d)
end

function new_mergewith!(combine, d1::AbstractDict, d2::AbstractDict)
    for (k, v) in d2
        d1[k] = haskey(d1, k) ? combine(d1[k], v) : v
    end
    return d1
end

new_mergewith!(combine) = (args...) -> new_mergewith!(combine, args...)

function new_mergewith!(combine, d1::Dict{K, V}, d2::AbstractDict) where {K, V}
    for (k, v) in d2
        i = ht_keyindex2!(d1, k)
        if i > 0
            d1.vals[i] = combine(d1.vals[i], v)
            # We don't use @inbounds in front of the whole formula
            # because we don't know what the combine function does.
            # Using @inbounds only for array access seems to slow
            # things down for simple functions like integer addition.
        else
            if !isequal(k, convert(K, k))
                throw(ArgumentError("$(limitrepr(k)) is not a valid key for type $K"))
            end
            @inbounds _setindex!(d1, convert(V, v), k, -i)
        end
    end
    return d1
end

function testmerge(a::AbstractDict, b::AbstractDict...)
    @test mergewith!(+, copy(a), b...) == new_mergewith!(+, copy(a), b...)
    t_old = @belapsed mergewith!(+, a1, $b...) setup = (a1 = copy($a))
    t_new = @belapsed new_mergewith!(+, a1, $b...) setup = (a1 = copy($a))
    return t_old/t_new
end

function testmerge2(D, nn::Int...)
    println("merging 2 $D's with values Int, Int")
    println("    size: complete/partial/no overlap")
    for n in nn
        n ÷= 2
        a = D(k => k for k in -n:n)
        b = D(k => k for k in -n:n)
        f1 = testmerge(a, b)
        b = D(k => k for k in 0:2*n)
        f2 = testmerge(a, b)
        b = D(k => k for k in n+1:3*n)
        f3 = testmerge(a, b)
        @printf "%8i:  %.2f  %.2f  %.2f\n" 2*n f1 f2 f3
    end
end

function testmerge3(D, nn::Int...)
    println("merging 3 $D's with values Float64, Float64, Int")
    println("    size: complete/partial/no overlap")
    for n in nn
        n ÷= 2
        a = D(k => Float64(k) for k in -n:n)
        b = D(k => Float64(k) for k in -n:n)
        c = D(k => k for k in -n:n)
        f1 = testmerge(a, b, c)
        b = D(k => Float64(k) for k in 0:2*n)
        c = D(k => k for k in -2*n:0)
        f2 = testmerge(a, b, c)
        b = D(k => Float64(k) for k in n+1:3*n)
        c = D(k => k for k in -3*n:-n-1)
        f3 = testmerge(a, b, c)
        @printf "%8i:  %.2f  %.2f  %.2f\n" 2*n f1 f2 f3
    end
end

testmerge2(Dict, [10^k for k in 1:6]...)
testmerge2(IdDict, [10^k for k in 1:6]...)
testmerge3(Dict, [10^k for k in 1:6]...)
testmerge3(IdDict, [10^k for k in 1:6]...)

Some more comments:

  • I don't explicitly change the age parameter for Dict. My understanding is that this is not necessary if only existing values are modified. (For new entries this is done inside _setindex!.)
  • mergewith! is already extensively tested with Dict in test/dict.jl. These test sets now call the new dedicated function. I have added tests for the AbstractDict method to test/dict.jl; they use IdDict.

@dkarrasch dkarrasch added the performance Must go faster label May 21, 2021
@matthias314 matthias314 changed the title add mergewith! method for Dict faster mergewith! May 23, 2021
@matthias314
Copy link
Contributor Author

There hasn't been any feedback on this PR so far. Is it not a good idea? Or is there anything I should improve?

@oscardssmith
Copy link
Member

From a quick glance, this looks good. Adding some labels so it doesn't get forgotten.

Copy link
Member

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems good to me

Co-authored-by: Jameson Nash <vtjnash@gmail.com>
@dkarrasch dkarrasch merged commit 16f433b into JuliaLang:master Jun 5, 2021
@matthias314 matthias314 deleted the m3/mergewith branch June 5, 2021 20:36
shirodkara pushed a commit to shirodkara/julia that referenced this pull request Jun 9, 2021
Co-authored-by: matthias314 <matthias314@posteo.net>
Co-authored-by: Jameson Nash <vtjnash@gmail.com>
johanmon pushed a commit to johanmon/julia that referenced this pull request Jul 5, 2021
Co-authored-by: matthias314 <matthias314@posteo.net>
Co-authored-by: Jameson Nash <vtjnash@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Must go faster

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants