
Transducer as an optimization: map, filter and flatten #33526

Merged 11 commits into JuliaLang:master from tkf:filterxf on Dec 4, 2019

Conversation

@tkf (Member) commented Oct 11, 2019

This PR implements very minimal (stateless) transducers in Base and uses them internally in foldl (mapfoldl). My intention is to make the transducer-based code entirely an implementation detail, so that code using iterators simply becomes faster. Here is a benchmark:

using BenchmarkTools
xs = [abs(x) < 1 ? x : missing for x in randn(1000)]
@btime foldl(+, (x for x in $xs if x !== missing))
  • Before (a68237f): 1.484 μs (9 allocations: 144 bytes)
  • After (fb89009): 692.771 ns (2 allocations: 32 bytes)
  • @btime sum(skipmissing($xs)): 881.146 ns (5 allocations: 80 bytes)

This benchmark shows that the transducer-based optimization can yield a 2x speedup for a simple filter. It is also "faster" than skipmissing, although skipmissing uses pairwise summation, so this is not really a fair comparison. However, the transducer-based optimization can also be applied to reduce, which would then eliminate the need for the specialized mapreduce_impl implementation (ref #27743). Furthermore, a similar speedup may apply to any type-based filtering, not just missing.

(By the way, the specialized mapreduce_impl for skipmissing seems to be required because === missing must be hard-coded, according to this comment #27681 (comment) by @Keno. However, that does not seem to apply to transducers. Why is that? It has puzzled me for a while.)

I think I can optimize this further, but this is a good starting point where the code is still simple and easy to review. Other "stateless" iterator transforms like Flatten should be easy to support.

tl;dr

  • Transducers make foldl combined with iterator comprehensions (Iterators.filter and Generator) faster.
  • Full comprehension support is easy (i.e., adding Flatten). (Edit: Flatten is supported in this PR now)
  • Some skipmissing-specific code can be removed (maybe).

What do people think about this? Is it worth using simple transducers in Julia Base?

Implementation

Quoting the docstring, the central idea is implemented in _xfadjoint which (roughly speaking) converts iterators to transducers:

_xfadjoint(op, itr) -> op′, itr′

Given a pair of reducing function op and an iterator itr, return a pair (op′, itr′) of similar types. If the iterator itr is transformed by an iterator transform ixf whose adjoint transducer xf is known, op′ = xf(op) and itr′ = ixf⁻¹(itr) is returned. Otherwise, op and itr are returned as-is. For example, transducer rf -> MappingRF(f, rf) is the adjoint of iterator transform itr -> Generator(f, itr).

Nested iterator transforms are converted recursively. That is to say, given op and

itr = (ixf₁ ∘ ixf₂ ∘ ... ∘ ixfₙ)(itr′)

what is returned is itr′ and

op′ = (xfₙ ∘ ... ∘ xf₂ ∘ xf₁)(op)

This conversion is invoked inside foldl/mapfoldl just before starting the main loop.
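
For illustration, here is a minimal sketch of the conversion (simplified; it assumes the field names of Base.Generator (f, iter) and Base.Iterators.Filter (flt, itr), and a real implementation can additionally elide an identity mapping):

struct MappingRF{F,T}
    f::F
    rf::T
end
# Apply f to each element before handing it to the inner reducing function.
@inline (op::MappingRF)(acc, x) = op.rf(acc, op.f(x))

struct FilteringRF{F,T}
    f::F
    rf::T
end
# Hand only elements satisfying the predicate to the inner reducing function.
@inline (op::FilteringRF)(acc, x) = op.f(x) ? op.rf(acc, x) : acc

# Base case: an iterator with no known adjoint is returned as-is.
_xfadjoint(op, itr) = (op, itr)

# Generator(f, iter) peels off into a MappingRF wrapping op.
_xfadjoint(op, itr::Base.Generator) =
    _xfadjoint(MappingRF(itr.f, op), itr.iter)

# Iterators.Filter(flt, itr) peels off into a FilteringRF wrapping op.
_xfadjoint(op, itr::Base.Iterators.Filter) =
    _xfadjoint(FilteringRF(itr.flt, op), itr.itr)

For the benchmark above, _xfadjoint(+, (x for x in xs if x !== missing)) peels off the generator and the filter and returns (roughly) FilteringRF(pred, MappingRF(identity, +)) together with the plain array xs, so the main loop iterates an Array instead of a nested iterator.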

@JeffBezanson (Member)

Nice, this seems a lot more elegant than having mapfoldl as a special case.

Now I better understand why you want inner "transformed" iterators to have consistent field names. It would be fine to change them all to be consistent.

@JeffBezanson added the performance (Must go faster) label on Oct 11, 2019
@ararslan (Member)

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@tkf (Member, Author) commented Oct 12, 2019

I added the transducer for Flatten too, as I noticed that doing so would force me to refactor/simplify some implementations.

@tkf changed the title from "Transducer as an optimization: map and filter" to "Transducer as an optimization: map, filter and flatten" on Oct 12, 2019
@chriselrod (Contributor)

Isn't Array{Union{T,Missing}} an Array{T} and a BitArray under the hood somehow? If so, and all we're talking about is summing the non-missing elements, we can get a lot faster. Of course, foldl comes with "guaranteed left associativity", so it shouldn't be vectorized like the simple loop below.

using BenchmarkTools
xs = [abs(x) < 1 ? x : missing for x in randn(1000)];
xsf64 = [x isa Missing ? NaN : x for x ∈ xs];
xsb = isa.(xs, Missing);
function summissing(x::AbstractArray{T}, b::BitArray) where {T}
    s = zero(T)
    @inbounds @simd for i ∈ eachindex(x, b)
        s += b[i] ? zero(T) : x[i]
    end
    s
end

Now:

julia> @btime foldl(+, (x for x in $xs if x !== missing))
  1.473 μs (9 allocations: 144 bytes)
-17.98924543402577

julia> @btime summissing($xsf64, $xsb)
  311.917 ns (0 allocations: 0 bytes)
-17.989245434025776

julia> @btime sum(skipmissing($xs))
  1.252 μs (5 allocations: 80 bytes)
-17.98924543402577

312 ns vs >1.2 μs.

The 312 ns is also a lot better than I get when trying to use the Union Array:

julia> function summissing(x::AbstractVector{Union{T,Missing}}) where {T}
           s = zero(T)
           @inbounds @simd for i ∈ eachindex(x)
               xᵢ = x[i]; s += ismissing(xᵢ) ? zero(T) : xᵢ
           end
           s
       end
summissing (generic function with 2 methods)

julia> @btime summissing($xs)
  874.055 ns (0 allocations: 0 bytes)
25.598370606424098

The actual Float64 elements are stored contiguously, so as long as there is a string of bits located somewhere (instead of Bools), we should be able to get that kind of performance.

julia> xsp = Base.unsafe_convert(Ptr{Float64}, pointer(xs)); # Don't want Ptr{Union{Float64,Missing}}

julia> xs'
1×1000 LinearAlgebra.Adjoint{Union{Missing, Float64},Array{Union{Missing, Float64},1}}:
 -0.0323233  -0.426845  missing  -0.141168  0.0985111  -0.258741  missing  missing  0.116394  missing  0.0855161  -0.949796  0.751656  0.998423  missing  -0.921113  -0.952986  0.330352  missing  0.322523  …  -0.0225253  -0.117459  missing  missing  0.327171  missing  missing  missing  0.663337  missing  0.25228  0.152217  -0.658007  -0.1019  -0.903252  -0.832974  0.964921  0.117777  missing  0.760332  0.000516222

julia> unsafe_load(xsp,1), unsafe_load(xsp,2), unsafe_load(xsp,4)
(-0.032323336118395316, -0.42684453051201715, -0.1411677613982569)

@tkf (Member, Author) commented Oct 27, 2019

The slowness of ismissing(xᵢ) ? zero(T) : xᵢ is exactly why skipmissing has a specialized mapreduce (see #27679 and #27681 (comment)). If I use if xᵢ !== missing to help inference:

function summissing2(x::AbstractVector{Union{T,Missing}}) where {T}
    s = zero(T)
    @inbounds @simd for i ∈ eachindex(x)
        xᵢ = x[i]
        if xᵢ !== missing
            s += xᵢ
        end
    end
    s
end

then I get

julia> @btime summissing($xsf64, $xsb)
  430.477 ns (0 allocations: 0 bytes)
11.40270702296504

julia> @btime summissing($xs)
  985.214 ns (0 allocations: 0 bytes)
11.402707022965044

julia> @btime summissing2($xs)
  666.930 ns (0 allocations: 0 bytes)
11.402707022965044

summissing2 is still slower than summissing(x::AbstractArray{T}, b::BitArray), but the gap is not as drastic as with summissing(::AbstractVector{Union{T,Missing}}).

Note that the code structure of summissing2 is exactly what the filtering transducer produces (except for the @simd).
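
For concreteness, here is a hand-written sketch of the loop shape that foldl over the filtered generator boils down to after the transducer conversion (not the actual generated code; init handling and @simd are omitted):

function foldl_filtered(xs; init = 0.0)
    acc = init
    for x in xs
        if x !== missing   # FilteringRF: skip elements failing the predicate
            acc = acc + x  # inner reducing function
        end
    end
    return acc
end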

Aren't Array{Union{T,Missing}} an Array{T} and a BitArray under the hood somehow?

If the implementation hasn't changed since this blog post https://julialang.org/blog/2018/06/missing, the mask is an Array{UInt8}, not a BitArray. The rationale is explained in the post:

Even if it consumes more memory, the Array{UInt8} mask approach is faster (at least in the current state of BitArray), and it generalizes to Unions of more than two types.

(though I guess your benchmark is a counterexample to the claim that it's faster)
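
For the curious, the UInt8 mask can be inspected directly; this sketch relies on the undocumented runtime function jl_array_typetagdata (used internally by Base) and may break on later Julia versions:

xs = Union{Float64,Missing}[1.0, missing, 2.0]
# One selector byte per element, indicating the active union member:
tags = unsafe_wrap(Array,
                   ccall(:jl_array_typetagdata, Ptr{UInt8}, (Any,), xs),
                   length(xs))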

@JeffBezanson (Member)

Please rebase and I'll merge this.

struct MappingRF{F,T}
    f::F
    rf::T
end

@inline (op::MappingRF)(acc, x) = op.rf(acc, op.f(x))
A reviewer (Member) commented on this hunk:

I'm curious about the use of @inline here; with such a minimal implementation, you'd think it wouldn't be needed. Yet I think I've found myself doing the same thing because, if I understand correctly, even though op.f might itself have @inline, leaving this simple function barrier unannotated can prevent total inlining (I imagine because op.f gets inlined into this one-liner, which then itself doesn't get inlined due to its now-non-inlineable size). Is that correct thinking?

A reviewer (Member) replied:

Yes, that's probably what happens.
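
One way to probe this (a sketch; MappingRF is internal to Base, and the exact IR depends on the Julia version):

# The wrapper's own body: op.f and op.rf should appear inlined here.
rf = Base.MappingRF(abs, +)
code_typed(rf, Tuple{Float64,Float64})

# Whether rf itself then inlines into the fold loop is what @inline helps
# ensure; to check that, inspect a full fold:
code_typed(xs -> foldl(+, (abs(x) for x in xs)), Tuple{Vector{Float64}})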

@tkf (Member, Author) commented Dec 3, 2019

In d0b1c04, I merged master by re-doing what was done in #33917.

@tkf (Member, Author) commented Dec 3, 2019

Hmm... So, it seems that the speedup from using transducers is completely gone on current master. What is puzzling is that @descending into _foldl_impl does not show any difference in the Julia IR when comparing d0b1c04 to dfcd792 (which shows the speedup). I uploaded the Julia IR and LLVM IR here: https://gist.github.com/tkf/d9397c14f8612f61be9ea72257c50591 You can see that there is basically no difference in the Julia IR, but there are some differences in the LLVM IR. Were there some recent updates to the pipeline between these commits?

@KristofferC (Member)

What is puzzling is that @descending to _foldl_impl does not show any difference in the IR when comparing d0b1c04 to dfcd792 (which shows the speedup)

In cases where specialization doesn't happen due to compiler heuristics, the code reflection macros don't show the code that will actually execute (see e.g. #33142).

@JeffBezanson (Member)

Wow, thanks for checking that again.

@tkf (Member, Author) commented Dec 4, 2019

@KristofferC Thanks, that makes sense.

@JeffBezanson I probably should have used a less "magic" example of transducers, i.e., one that does not interact with compiler details. After adding an iterator-to-transducer conversion I had forgotten to implement before (ce49c7e), the following example shows a good speedup over current master:

using BenchmarkTools
@btime foldl(+, (y for x in 1:1000 for y in 1:x if y % 2 == 0))
  • Current master (5a9cce1): 219.535 μs (0 allocations: 0 bytes)
  • This PR (ce49c7e): 72.512 μs (0 allocations: 0 bytes)

Can this be merged?

@tkf (Member, Author) commented Dec 4, 2019

Actually, I still see the speedup I mentioned in the OP if I set init:

using BenchmarkTools
xs = [abs(x) < 1 ? x : missing for x in randn(1000)]
@btime foldl(+, (x for x in $xs if x !== missing); init=0.0)
  • This PR (ce49c7e): 676.987 ns (0 allocations: 0 bytes)
  • Master (5a9cce1): 1.279 μs (2 allocations: 32 bytes)

I guess the init-less version is just complicated enough that it crosses some threshold changed by the recent compilation-time fixes for 1.3, or something like that?

@StefanKarpinski (Member)

I'm not too worried about the lost optimization for the moment, since the fact that it used to be faster proves that this approach generates simpler, easier-to-optimize code, and that seems to me to keep it justifiable. Of course, it would be nice to recover the optimization as well, but that's somewhat independent of this change.

@JeffBezanson merged commit 3c182bc into JuliaLang:master on Dec 4, 2019
@nalimilan (Member)

Sorry, I had missed this. Can anything be done to improve the interaction of skipmissing with reductions, and/or to drop the specialized implementation? Would adding _xfadjoint(op, itr::SkipMissing) make sense? Would a special Filter-like iterator that uses === work?

In particular, it would be nice to be able to implement reductions over dimensions without copying all the definitions like #28027 currently does.

@tkf deleted the filterxf branch on January 5, 2020
@tkf (Member, Author) commented Jan 5, 2020

I think we need two things. First, as you said, we need to add

_xfadjoint(op, itr::SkipMissing) = _xfadjoint(FilteringRF(!ismissing, op), itr.itr)

Second, we need to hook the transducers into reduce. See mapfoldl_impl for how to use _xfadjoint.
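
For reference, the hook looks roughly like this (simplified; BottomRF is the internal wrapper that marks the innermost reducing function):

# Convert iterator transforms into transducers just before the fold loop:
function mapfoldl_impl(f, op, nt, itr)
    op′, itr′ = _xfadjoint(BottomRF(op), Generator(f, itr))
    return foldl_impl(op′, nt, itr′)
end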

But, as I mentioned in #33526 (comment) and #33526 (comment), it seems that you need to have the concrete identity element (init) in order to have union splitting (?) with the current compiler heuristics. Maybe "tail call function-barrier" can solve this but it is a bit tricky to implement due to the problem I discussed in this discourse post.

@tkf (Member, Author) commented Jan 7, 2020

Maybe "tail call function-barrier" can solve this

It does. See #34293.

@tkf (Member, Author) commented Jan 7, 2020

I thought I'd give this a shot (at least the single-dimensional version), but then realized that it would introduce a big conflict with my other PR #31020...

@nalimilan (Member)

OK, thanks. Maybe let's revive #31020? It would be nice to get this merged anyway!

@tkf (Member, Author) commented Jan 7, 2020

#31020 is still alive and waiting for @mbauman to review.
