Conversation


@Tokazama Tokazama commented May 3, 2021

While working on layouts for #141 I realized that unless getindex was skipped (e.g., by directly calling unsafe_getindex), element-wise indexing was a lot slower. I approached this from a lot of different angles (some of which will probably come up in another PR), but the issue appears to be that the compiler doesn't know it can elide reconstruction of something like Tuple{Int,Int}, even though the tuple undergoes no change throughout to_indices. So I'll start with the fix I had for that.

Canonical Indexing Types

I chose the name because I saw a similar concept in LoopVectorization. In to_indices(A, args::Tuple) each element of args is checked to see if it is a canonical indexing type up front, avoiding recursive reconstruction of the tuple of indexers. This check is performed by is_canonical, resulting in the following improvement:

x = ones(4,4,4);

#  current version
julia> @btime ArrayInterface.getindex($x, 3, 3, 3)
  4.566 ns (0 allocations: 0 bytes)
1.0

#  new version
julia> @btime ArrayInterface.getindex($x, 3, 3, 3)
  1.755 ns (0 allocations: 0 bytes)
1.0

# base version
julia> @btime getindex($x, 3, 3, 3)
  1.756 ns (0 allocations: 0 bytes)
1.0
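As a rough illustration of the up-front check (these are simplified stand-ins, not the actual ArrayInterface definitions — `is_canonical` here only covers `Int` and unit ranges, and `demo_to_indices` is a hypothetical toy), the fast path works something like this:

```julia
# Simplified stand-ins for the canonical-index check; the real ArrayInterface
# definitions cover more types (e.g. OptionallyStaticUnitRange).
is_canonical(::Type{Int}) = true
is_canonical(::Type{<:AbstractUnitRange{Int}}) = true
is_canonical(::Type) = false

# Check every element of the index tuple up front.
all_canonical() = true
all_canonical(x, rest...) = is_canonical(typeof(x)) && all_canonical(rest...)

# If everything is already canonical, return the tuple untouched instead of
# recursively rebuilding it, so the compiler can elide the whole call.
function demo_to_indices(A, args::Tuple)
    all_canonical(args...) ? args : map(Int, args)  # simplified fallback
end
```

With this structure, `demo_to_indices(ones(4,4,4), (3, 3, 3))` returns the input tuple identically, rather than reconstructing it element by element.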

I've also expanded the number of types that are considered "canonical" indexers beyond those in Base (e.g., we never need to change OptionallyStaticUnitRange), so there may also be some (very small) improvements when indexing collections.
I think we could still squeeze some more performance out of this by avoiding construction of axes for types where the conversion to the canonical state is known already.
This is kind of the idea behind canonicalize(x), which converts x to a canonical indexer and helped clean up some internal code that does the same thing.
I'm probably going to work on that optimization in another PR because it will probably require some involved work on the internal mechanics of to_indices.
For now we are at least as fast as Base.
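The canonicalize idea can be sketched as follows (hedged: `demo_canonicalize` is a hypothetical simplified version, not the actual implementation):

```julia
# Hypothetical sketch of the idea behind canonicalize: convert an indexer to
# its canonical form once, so later is-canonical checks become no-ops.
demo_canonicalize(i::Int) = i                            # already canonical
demo_canonicalize(i::Integer) = Int(i)                   # e.g. UInt8 -> Int
demo_canonicalize(r::AbstractUnitRange{Int}) = r         # already canonical
demo_canonicalize(r::AbstractUnitRange) = Int(first(r)):Int(last(r))
```

When the conversion for a type is known up front like this, the axes of the array never need to be constructed just to normalize the index.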

LazyAxis

A lot of array traits can be derived from looking at an array's axes.
Many arrays don't store fully constructed axes, meaning we usually spend some unnecessary time composing an axis just to find out its size or offset.
Using lazy_axes we can sometimes avoid these costs.

For example, we can avoid the cost of retrieving size information for an Array.

axes_map_first(x) = map(first, ArrayInterface.axes(x))
lazy_axes_map_first(x) = map(first, ArrayInterface.lazy_axes(x))

julia> @btime axes_map_first($x)
  1.754 ns (0 allocations: 0 bytes)
(1, 1, 1)

julia> @btime lazy_axes_map_first($x)
  0.046 ns (0 allocations: 0 bytes)
(1, 1, 1)

Much of this improvement is lost when size information has to be computed by both approaches.

axes_map_last(x) = map(last, ArrayInterface.axes(x))
lazy_axes_map_last(x) = map(last, ArrayInterface.lazy_axes(x))

julia> @btime axes_map_last($x)
  1.608 ns (0 allocations: 0 bytes)
(4, 4, 4)

julia> @btime lazy_axes_map_last($x)
  1.470 ns (0 allocations: 0 bytes)
(4, 4, 4)

I've started putting this in several places and it looks like it either does as well as the eager method or improves on it.
I could probably be a little more exhaustive with improving the performance of this, but figured it was helpful enough to include as is right now.
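The core trick can be illustrated with a toy type (hedged: `ToyLazyAxis` and `toy_lazy_axes` are hypothetical stand-ins, not the actual LazyAxis implementation): for an `Array`, `first` of every axis is statically 1, so it can be answered without ever touching the array's size, and only `last` has to.

```julia
# Toy illustration of the lazy-axis idea. The axis wraps the parent array and
# its dimension instead of eagerly materializing a range.
struct ToyLazyAxis{D,A}
    parent::A
end
ToyLazyAxis{D}(A) where {D} = ToyLazyAxis{D,typeof(A)}(A)

Base.first(::ToyLazyAxis) = 1                          # known without a size lookup
Base.last(ax::ToyLazyAxis{D}) where {D} = size(ax.parent, D)

toy_lazy_axes(A::AbstractArray{T,N}) where {T,N} =
    ntuple(d -> ToyLazyAxis{d}(A), Val(N))
```

Since `first` never reads the parent, `map(first, toy_lazy_axes(x))` can constant-fold away entirely, matching the sub-nanosecond timing above.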

One final note, I'm aware that I've added a bunch of messy code for bounds checking. I'm going to clean this up when I fine tune to_indices later on.

@codecov

codecov bot commented May 3, 2021

Codecov Report

Merging #149 (730afb0) into master (16defd4) will decrease coverage by 1.65%.
The diff coverage is 69.72%.


@@            Coverage Diff             @@
##           master     #149      +/-   ##
==========================================
- Coverage   84.47%   82.82%   -1.66%     
==========================================
  Files          11       11              
  Lines        1546     1694     +148     
==========================================
+ Hits         1306     1403      +97     
- Misses        240      291      +51     
Impacted Files Coverage Δ
src/ArrayInterface.jl 84.76% <ø> (-0.07%) ⬇️
src/axes.jl 61.59% <50.00%> (-2.79%) ⬇️
src/indexing.jl 75.58% <76.25%> (-2.25%) ⬇️
src/stridelayout.jl 88.25% <81.25%> (-0.59%) ⬇️
src/dimensions.jl 95.96% <100.00%> (+0.72%) ⬆️
src/ndindex.jl 85.36% <100.00%> (ø)
src/ranges.jl 92.06% <100.00%> (-0.02%) ⬇️
src/size.jl 83.33% <100.00%> (+0.40%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 16defd4...730afb0.

@Tokazama Tokazama requested a review from chriselrod May 3, 2021 02:34

Tokazama commented May 3, 2021

Just saw the coverage diff. I'll try to fix that up


chriselrod commented May 3, 2021

As an unrelated comment, I noticed this recently:

julia> using ArrayInterface

julia> ci = CartesianIndices(ntuple(i -> Base.OneTo(32), 4));

julia> cis = CartesianIndices(ntuple(i -> ArrayInterface.One():32, 4));

julia> @btime sum($ci)
  827.116 μs (0 allocations: 0 bytes)
CartesianIndex(17301504, 17301504, 17301504, 17301504)

julia> @btime sum($cis)
  915.828 ms (8655868 allocations: 376.07 MiB)
CartesianIndex(17301504, 17301504, 17301504, 17301504)

This happens when iterating over CartesianIndices with OptionallyStaticUnitRange{StaticInt{F}}. For now, I've switched StrideArraysCore to use Base.OneTo for 1-indexed axes.
The issue is a type instability in __inc:

@inline function __inc(state::Tuple{Int,Int,Vararg{Int,N}}, indices::Tuple{OrdinalRangeInt,OrdinalRangeInt,Vararg{OrdinalRangeInt,N}}) where {N}
    rng = indices[1]
    I = state[1] + step(rng)
    if __is_valid_range(I, rng) && state[1] != last(rng)
        return true, (I, tail(state)...)
    end
    t1, t2 = tail(state), tail(indices)
    # avoid dynamic dispatch by telling the compiler relational invariants
    valid, I = isa(t1, Tuple{Int}) ? __inc(t1, t2::Tuple{OrdinalRangeInt}) : __inc(t1, t2::Tuple{OrdinalRangeInt,OrdinalRangeInt,Vararg{OrdinalRangeInt}})
    return valid, (first(rng), I...)
end

state[1] + step(rng) and first(rng) won't have the same type.
Could be fixed by making first(::OptionallyStaticUnitRange) always return Int.
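A minimal sketch of that fix, under stated assumptions (`MyStaticInt` and `MyStaticUnitRange` below are hypothetical stand-ins for the real StaticInt and OptionallyStaticUnitRange):

```julia
# Stand-ins for StaticInt / OptionallyStaticUnitRange to illustrate the fix.
struct MyStaticInt{N} end
struct MyStaticUnitRange{F,L}
    start::F
    stop::L
end

# Before the fix, `first` on a statically-started range would return the
# static start itself (a MyStaticInt), while the iteration state
# `state[1] + step(rng)` is an Int -- the mismatch makes `__inc` unstable.
# The fix: always materialize a plain Int from `first`.
Base.first(r::MyStaticUnitRange{MyStaticInt{F},L}) where {F,L} = F
Base.first(r::MyStaticUnitRange{Int,L}) where {L} = r.start
Base.last(r::MyStaticUnitRange) = r.stop
```

With both branches returning `Int`, the tuple produced by `__inc` has a single concrete type regardless of whether the range start was static.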


Tokazama commented May 3, 2021

I'll fix that.

@chriselrod
Collaborator

I approached this from a lot of different angles (some of which will probably come up in another PR still), but the issue appears to be due to the compiler not knowing to elide reconstruction of something like Tuple{Int,Int}, despite not undergoing any change throughout to_indices. So I'll start with the fix I had for that.

Do you have an example?
All of this should inline, in which case it should be able to decompose all the Tuples into individual scalar operations.

I chose the name because I saw a similar concept in LoopVectorization

It should largely be unnecessary, because the known_* functions should cover all the cases when it comes time for LoopVectorization to process the loops.
The fact that there are different methods there is somewhat a hold over from before it began using the ArrayInterface.known_* methods.

A lot of array traits can be derived from looking an array's axes.
Many arrays don't have fully constructed axes stored with them, meaning we usually spend some unnecessary time composing an axis to just find out its size or offset.

This seems like it really ought to be compiled away.
Seems like a benchmarking artifact:

julia> A = rand(4,5);

julia> @inline sum_first_axes(A) = sum(map(first, ArrayInterface.axes(A)))
sum_first_axes (generic function with 1 method)

julia> @btime sum_first_axes($A)
  1.410 ns (0 allocations: 0 bytes)
2

julia> @code_typed sum_first_axes(A)
CodeInfo(
1 ─     Base.arraysize(A, 1)::Int64
│       Base.arraysize(A, 2)::Int64
└──     return 2
) => Int64

julia> @code_llvm sum_first_axes(A)
;  @ REPL[23]:1 within `sum_first_axes'
define i64 @julia_sum_first_axes_1792({}* nonnull align 16 dereferenceable(40) %0) #0 {
top:
  ret i64 2
}

julia> @code_native sum_first_axes(A)
        .text
; ┌ @ REPL[23]:1 within `sum_first_axes'
        movl    $2, %eax
        retq
        nopw    %cs:(%rax,%rax)
; └

One clock cycle is around 0.25 ns, so any timing below that means the benchmark was completely compiled away.
It seems that doesn't happen with the current axes in a benchmark, but that otherwise everything does disappear.

@chriselrod
Collaborator

I still need to spend the time to look over your other PR, but this one looks fine to me.
I'm generally in favor of being easier on the compiler. If a benchmark doesn't get compiled away with axes but does with lazy_axes, there are probably more cases where the compiler ends up doing better with the latter.

@Tokazama
Copy link
Member Author

Tokazama commented May 3, 2021

Do you have an example?
All of this should inline, in which case it should be able to decompose all the Tuples into individual scalar operations.

I always operated under that assumption until I started benchmarking everything.

# old
julia> @btime ArrayInterface.to_indices($x, (3,3,3))
  4.453 ns (0 allocations: 0 bytes)

# new
julia> @btime ArrayInterface.to_indices($x, (3,3,3))
  2.045 ns (0 allocations: 0 bytes)
(3, 3, 3)

Apparently it's a known issue.

@Tokazama
Copy link
Member Author

Tokazama commented May 3, 2021

I still need to spend the time to look over your other PR

Just get to that when you have a chance. I recently discovered the Data Layout Modeling section of the MLIR manual and I'm considering changing a couple of things in how I approached that.

@Tokazama
Copy link
Member Author

Tokazama commented May 3, 2021

@chriselrod , is this fine to merge?


@chriselrod chriselrod left a comment


Yes.

@Tokazama Tokazama merged commit f1b5f65 into master May 3, 2021
@Tokazama Tokazama deleted the lazy branch May 3, 2021 17:33