Conversation


@Tokazama Tokazama commented May 3, 2021

While working on layouts for #141 I realized that unless getindex was skipped (e.g., by directly calling unsafe_getindex), element-wise indexing was a lot slower. I approached this from a lot of different angles (some of which will probably come up in another PR), but the issue appears to be that the compiler doesn't know it can elide reconstruction of something like Tuple{Int,Int}, even though the tuple undergoes no change throughout to_indices. So I'll start with the fix I had for that.

Canonical Indexing Types

I chose the name because I saw a similar concept in LoopVectorization. In to_indices(A, args::Tuple) each element of args is checked to see if it is a canonical indexing type up front, avoiding recursive reconstruction of the tuple of indexers. This check is performed by is_canonical, resulting in the following improvement:

x = ones(4,4,4);

#  current version
julia> @btime ArrayInterface.getindex($x, 3, 3, 3)
  4.566 ns (0 allocations: 0 bytes)
1.0

#  new version
julia> @btime ArrayInterface.getindex($x, 3, 3, 3)
  1.755 ns (0 allocations: 0 bytes)
1.0

# base version
julia> @btime getindex($x, 3, 3, 3)
  1.756 ns (0 allocations: 0 bytes)
1.0
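As a rough illustration of the up-front check (these are simplified stand-ins, not the actual ArrayInterface definitions — `is_canonical` here only covers `Int` and unit ranges, and `demo_to_indices` is a hypothetical toy), the fast path works something like this:

```julia
# Simplified stand-ins for the canonical-index check; the real ArrayInterface
# definitions cover more types (e.g. OptionallyStaticUnitRange).
is_canonical(::Type{Int}) = true
is_canonical(::Type{<:AbstractUnitRange{Int}}) = true
is_canonical(::Type) = false

# Check every element of the index tuple up front.
all_canonical() = true
all_canonical(x, rest...) = is_canonical(typeof(x)) && all_canonical(rest...)

# If everything is already canonical, return the tuple untouched instead of
# recursively rebuilding it, so the compiler can elide the whole call.
function demo_to_indices(A, args::Tuple)
    all_canonical(args...) ? args : map(Int, args)  # simplified fallback
end
```

With this structure, `demo_to_indices(ones(4,4,4), (3, 3, 3))` returns the input tuple identically, rather than reconstructing it element by element.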

I've also expanded the number of types that are considered "canonical" indexers beyond those in Base (e.g., we never need to change OptionallyStaticUnitRange), so there may also be some (very small) improvements when indexing collections.
I think we could still squeeze some more performance out of this by avoiding construction of axes for types where the conversion to the canonical state is known already.
This is kind of the idea behind canonicalize(x), which converts x to a canonical indexer and helped clean up some internal code that does the same thing.
I'm probably going to work on that optimization in another PR because it will probably require some involved work on the internal mechanics of to_indices.
For now we are at least as fast as Base.
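The canonicalize idea can be sketched as follows (hedged: `demo_canonicalize` is a hypothetical simplified version, not the actual implementation):

```julia
# Hypothetical sketch of the idea behind canonicalize: convert an indexer to
# its canonical form once, so later is-canonical checks become no-ops.
demo_canonicalize(i::Int) = i                            # already canonical
demo_canonicalize(i::Integer) = Int(i)                   # e.g. UInt8 -> Int
demo_canonicalize(r::AbstractUnitRange{Int}) = r         # already canonical
demo_canonicalize(r::AbstractUnitRange) = Int(first(r)):Int(last(r))
```

When the conversion for a type is known up front like this, the axes of the array never need to be constructed just to normalize the index.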

LazyAxis

A lot of array traits can be derived from looking at an array's axes.
Many arrays don't store fully constructed axes, meaning we usually spend some unnecessary time composing an axis just to find out its size or offset.
Using lazy_axes we can sometimes avoid these costs.

For example, we can avoid the cost of retrieving size information for an Array.

axes_map_first(x) = map(first, ArrayInterface.axes(x))
lazy_axes_map_first(x) = map(first, ArrayInterface.lazy_axes(x))

julia> @btime axes_map_first($x)
  1.754 ns (0 allocations: 0 bytes)
(1, 1, 1)

julia> @btime lazy_axes_map_first($x)
  0.046 ns (0 allocations: 0 bytes)
(1, 1, 1)

Much of this improvement is lost when size information has to be computed by both approaches.

axes_map_last(x) = map(last, ArrayInterface.axes(x))
lazy_axes_map_last(x) = map(last, ArrayInterface.lazy_axes(x))

julia> @btime axes_map_last($x)
  1.608 ns (0 allocations: 0 bytes)
(4, 4, 4)

julia> @btime lazy_axes_map_last($x)
  1.470 ns (0 allocations: 0 bytes)
(4, 4, 4)

I've started putting this in several places and it looks like it either does as well as the eager method or improves on it.
I could probably be a little more exhaustive with improving the performance of this, but figured it was helpful enough to include as is right now.
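The core trick can be illustrated with a toy type (hedged: `ToyLazyAxis` and `toy_lazy_axes` are hypothetical stand-ins, not the actual LazyAxis implementation): for an `Array`, `first` of every axis is statically 1, so it can be answered without ever touching the array's size, and only `last` has to.

```julia
# Toy illustration of the lazy-axis idea. The axis wraps the parent array and
# its dimension instead of eagerly materializing a range.
struct ToyLazyAxis{D,A}
    parent::A
end
ToyLazyAxis{D}(A) where {D} = ToyLazyAxis{D,typeof(A)}(A)

Base.first(::ToyLazyAxis) = 1                          # known without a size lookup
Base.last(ax::ToyLazyAxis{D}) where {D} = size(ax.parent, D)

toy_lazy_axes(A::AbstractArray{T,N}) where {T,N} =
    ntuple(d -> ToyLazyAxis{d}(A), Val(N))
```

Since `first` never reads the parent, `map(first, toy_lazy_axes(x))` can constant-fold away entirely, matching the sub-nanosecond timing above.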

One final note, I'm aware that I've added a bunch of messy code for bounds checking. I'm going to clean this up when I fine tune to_indices later on.

@codecov

codecov bot commented May 3, 2021

Codecov Report

Merging #149 (730afb0) into master (16defd4) will decrease coverage by 1.65%.
The diff coverage is 69.72%.


@@            Coverage Diff             @@
##           master     #149      +/-   ##
==========================================
- Coverage   84.47%   82.82%   -1.66%     
==========================================
  Files          11       11              
  Lines        1546     1694     +148     
==========================================
+ Hits         1306     1403      +97     
- Misses        240      291      +51     
Impacted Files Coverage Δ
src/ArrayInterface.jl 84.76% <ø> (-0.07%) ⬇️
src/axes.jl 61.59% <50.00%> (-2.79%) ⬇️
src/indexing.jl 75.58% <76.25%> (-2.25%) ⬇️
src/stridelayout.jl 88.25% <81.25%> (-0.59%) ⬇️
src/dimensions.jl 95.96% <100.00%> (+0.72%) ⬆️
src/ndindex.jl 85.36% <100.00%> (ø)
src/ranges.jl 92.06% <100.00%> (-0.02%) ⬇️
src/size.jl 83.33% <100.00%> (+0.40%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 16defd4...730afb0.

@Tokazama Tokazama requested a review from chriselrod May 3, 2021 02:34

Tokazama commented May 3, 2021

Just saw the coverage diff. I'll try to fix that up


chriselrod commented May 3, 2021

As an unrelated comment, I noticed this recently:

julia> using ArrayInterface

julia> ci = CartesianIndices(ntuple(i -> Base.OneTo(32), 4));

julia> cis = CartesianIndices(ntuple(i -> ArrayInterface.One():32, 4));

julia> @btime sum($ci)
  827.116 μs (0 allocations: 0 bytes)
CartesianIndex(17301504, 17301504, 17301504, 17301504)

julia> @btime sum($cis)
  915.828 ms (8655868 allocations: 376.07 MiB)
CartesianIndex(17301504, 17301504, 17301504, 17301504)

This happens when iterating over CartesianIndices with OptionallyStaticUnitRange{StaticInt{F}}. For now, I've switched StrideArraysCore to use Base.OneTo for 1-indexed axes.
The issue is a type instability in __inc:

@inline function __inc(state::Tuple{Int,Int,Vararg{Int,N}}, indices::Tuple{OrdinalRangeInt,OrdinalRangeInt,Vararg{OrdinalRangeInt,N}}) where {N}
    rng = indices[1]
    I = state[1] + step(rng)
    if __is_valid_range(I, rng) && state[1] != last(rng)
        return true, (I, tail(state)...)
    end
    t1, t2 = tail(state), tail(indices)
    # avoid dynamic dispatch by telling the compiler relational invariants
    valid, I = isa(t1, Tuple{Int}) ? __inc(t1, t2::Tuple{OrdinalRangeInt}) : __inc(t1, t2::Tuple{OrdinalRangeInt,OrdinalRangeInt,Vararg{OrdinalRangeInt}})
    return valid, (first(rng), I...)
end

state[1] + step(rng) and first(rng) won't have the same type.
Could be fixed by making first(::OptionallyStaticUnitRange) always return Int.
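A minimal sketch of that fix, under stated assumptions (`MyStaticInt` and `MyStaticUnitRange` below are hypothetical stand-ins for the real StaticInt and OptionallyStaticUnitRange):

```julia
# Stand-ins for StaticInt / OptionallyStaticUnitRange to illustrate the fix.
struct MyStaticInt{N} end
struct MyStaticUnitRange{F,L}
    start::F
    stop::L
end

# Before the fix, `first` on a statically-started range would return the
# static start itself (a MyStaticInt), while the iteration state
# `state[1] + step(rng)` is an Int -- the mismatch makes `__inc` unstable.
# The fix: always materialize a plain Int from `first`.
Base.first(r::MyStaticUnitRange{MyStaticInt{F},L}) where {F,L} = F
Base.first(r::MyStaticUnitRange{Int,L}) where {L} = r.start
Base.last(r::MyStaticUnitRange) = r.stop
```

With both branches returning `Int`, the tuple produced by `__inc` has a single concrete type regardless of whether the range start was static.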


Tokazama commented May 3, 2021

I'll fix that.

@chriselrod
Collaborator

I approached this from a lot of different angles (some of which will probably come up in another PR still), but the issue appears to be due to the compiler not knowing to elide reconstruction of something like Tuple{Int,Int}, despite not undergoing any change throughout to_indices. So I'll start with the fix I had for that.

Do you have an example?
All of this should inline, in which case it should be able to decompose all the Tuples into individual scalar operations.

I chose the name because I saw a similar concept in LoopVectorization

It should largely be unnecessary, because the known_* functions should cover all the cases when it comes time for LoopVectorization to process the loops.
The fact that there are different methods there is somewhat a hold over from before it began using the ArrayInterface.known_* methods.

A lot of array traits can be derived from looking an array's axes.
Many arrays don't have fully constructed axes stored with them, meaning we usually spend some unnecessary time composing an axis to just find out its size or offset.

This seems like it really ought to be compiled away.
Seems like a benchmarking artifact:

julia> A = rand(4,5);

julia> @inline sum_first_axes(A) = sum(map(first, ArrayInterface.axes(A)))
sum_first_axes (generic function with 1 method)

julia> @btime sum_first_axes($A)
  1.410 ns (0 allocations: 0 bytes)
2

julia> @code_typed sum_first_axes(A)
CodeInfo(
1 ─     Base.arraysize(A, 1)::Int64
│       Base.arraysize(A, 2)::Int64
└──     return 2
) => Int64

julia> @code_llvm sum_first_axes(A)
;  @ REPL[23]:1 within `sum_first_axes'
define i64 @julia_sum_first_axes_1792({}* nonnull align 16 dereferenceable(40) %0) #0 {
top:
  ret i64 2
}

julia> @code_native sum_first_axes(A)
        .text
; ┌ @ REPL[23]:1 within `sum_first_axes'
        movl    $2, %eax
        retq
        nopw    %cs:(%rax,%rax)
; └

One clock cycle is around 0.25 ns, so any timing below that means the benchmark was completely compiled away.
It seems that doesn't happen with the current axes in a benchmark, but that otherwise everything does disappear.

@chriselrod
Collaborator

I still need to spend the time to look over your other PR, but this one looks fine to me.
I'm generally in favor of being easier on the compiler. If a benchmark doesn't get compiled away with axes but does with lazy_axes, there are probably more cases where the compiler ends up doing better with the latter.

@Tokazama
Copy link
Member Author

Tokazama commented May 3, 2021

Do you have an example?
All of this should inline, in which case it should be able to decompose all the Tuples into individual scalar operations.

I always operated under that assumption until I started benchmarking everything.

# old
julia> @btime ArrayInterface.to_indices($x, (3,3,3))
  4.453 ns (0 allocations: 0 bytes)

# new
julia> @btime ArrayInterface.to_indices($x, (3,3,3))
  2.045 ns (0 allocations: 0 bytes)
(3, 3, 3)

Apparently it's a known issue.

@Tokazama
Copy link
Member Author

Tokazama commented May 3, 2021

I still need to spend the time to look over your other PR

Just get to that when you have a chance. I recently discovered the Data Layout Modeling section of the MLIR manual and I'm considering changing a couple of things in how I approached that.

@Tokazama
Copy link
Member Author

Tokazama commented May 3, 2021

@chriselrod , is this fine to merge?


@chriselrod chriselrod left a comment


Yes.

@Tokazama Tokazama merged commit f1b5f65 into master May 3, 2021
@Tokazama Tokazama deleted the lazy branch May 3, 2021 17:33