Improve single element indexing (and other quality of life improvements) #149
Codecov Report

```
@@            Coverage Diff             @@
##           master     #149      +/-   ##
==========================================
- Coverage   84.47%   82.82%   -1.66%
==========================================
  Files          11       11
  Lines        1546     1694     +148
==========================================
+ Hits         1306     1403      +97
- Misses        240      291      +51
```

Continue to review full report at Codecov.
Just saw the coverage diff. I'll try to fix that up.
As an unrelated comment, I noticed this recently:

```julia
julia> using ArrayInterface

julia> ci = CartesianIndices(ntuple(i -> Base.OneTo(32), 4));

julia> cis = CartesianIndices(ntuple(i -> ArrayInterface.One():32, 4));

julia> @btime sum($ci)
  827.116 μs (0 allocations: 0 bytes)
CartesianIndex(17301504, 17301504, 17301504, 17301504)

julia> @btime sum($cis)
  915.828 ms (8655868 allocations: 376.07 MiB)
CartesianIndex(17301504, 17301504, 17301504, 17301504)
```

This happens when iterating over:

```julia
@inline function __inc(state::Tuple{Int,Int,Vararg{Int,N}}, indices::Tuple{OrdinalRangeInt,OrdinalRangeInt,Vararg{OrdinalRangeInt,N}}) where {N}
    rng = indices[1]
    I = state[1] + step(rng)
    if __is_valid_range(I, rng) && state[1] != last(rng)
        return true, (I, tail(state)...)
    end
    t1, t2 = tail(state), tail(indices)
    # avoid dynamic dispatch by telling the compiler relational invariants
    valid, I = isa(t1, Tuple{Int}) ? __inc(t1, t2::Tuple{OrdinalRangeInt}) :
               __inc(t1, t2::Tuple{OrdinalRangeInt,OrdinalRangeInt,Vararg{OrdinalRangeInt}})
    return valid, (first(rng), I...)
end
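```

The type assertions in that ternary are the load-bearing part: without them the compiler can lose track of the recursive call's argument types and fall back to dynamic dispatch on every step. A minimal, self-contained sketch of the same pattern (`count_leading` is a hypothetical helper I made up for illustration; it is not part of ArrayInterface or Base):

```julia
# Hypothetical example of branching on concrete tuple shapes so each
# recursive call site stays type-stable, mirroring __inc above.
count_leading(x::Tuple{Int}) = 1
function count_leading(x::Tuple{Int,Int,Vararg{Int,N}}) where {N}
    t = Base.tail(x)
    # The isa-branch plus type assertion tells the compiler exactly which
    # method is called, so no dynamic dispatch (or allocation) is needed.
    n = isa(t, Tuple{Int}) ? count_leading(t::Tuple{Int}) :
        count_leading(t::Tuple{Int,Int,Vararg{Int}})
    return n + 1
end
```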
I'll fix that.
Do you have an example?
It should largely be unnecessary, because the
This seems like it really ought to be compiled away:

```julia
julia> A = rand(4, 5);

julia> @inline sum_first_axes(A) = sum(map(first, ArrayInterface.axes(A)))
sum_first_axes (generic function with 1 method)

julia> @btime sum_first_axes($A)
  1.410 ns (0 allocations: 0 bytes)
2

julia> @code_typed sum_first_axes(A)
CodeInfo(
1 ─     Base.arraysize(A, 1)::Int64
│       Base.arraysize(A, 2)::Int64
└──     return 2
) => Int64

julia> @code_llvm sum_first_axes(A)
;  @ REPL[23]:1 within `sum_first_axes'
define i64 @julia_sum_first_axes_1792({}* nonnull align 16 dereferenceable(40) %0) #0 {
top:
  ret i64 2
}

julia> @code_native sum_first_axes(A)
        .text
; ┌ @ REPL[23]:1 within `sum_first_axes'
        movl    $2, %eax
        retq
        nopw    %cs:(%rax,%rax)
; └
```

One clock cycle is around 0.25 ns, so anything less than that means the benchmark was completely compiled away.
I still need to spend the time to look over your other PR, but this one looks fine to me.
I always operated under that assumption until I started benchmarking everything.

```julia
# old
julia> @btime ArrayInterface.to_indices($x, (3,3,3))
  4.453 ns (0 allocations: 0 bytes)

# new
julia> @btime ArrayInterface.to_indices($x, (3,3,3))
  2.045 ns (0 allocations: 0 bytes)
(3, 3, 3)
```

Apparently it's a known issue.
Just get to that when you have a chance. I recently discovered the Data Layout Modeling section of the MLIR manual and I'm considering changing a couple of things in how I approached that.
@chriselrod , is this fine to merge? |
chriselrod left a comment:

Yes.
While working on layouts for #141 I realized that unless `getindex` was skipped (e.g., by directly calling `unsafe_getindex`), element-wise indexing was a lot slower. I approached this from a lot of different angles (some of which will probably come up in another PR still), but the issue appears to be due to the compiler not knowing to elide reconstruction of something like `Tuple{Int,Int}`, despite it not undergoing any change throughout `to_indices`. So I'll start with the fix I had for that.

Canonical Indexing Types

I chose the name because I saw a similar concept in LoopVectorization. In `to_indices(A, args::Tuple)` each element of `args` is checked up front to see if it is a canonical indexing type, avoiding recursive reconstruction of the tuple of indexers. This check is performed by `is_canonical`, resulting in the following improvement:

I've also expanded the number of types that are considered "canonical" indexers beyond those in Base (e.g., we don't ever need to change `OptionallyStaticUnitRange`), so there may also be some (very small) improvements on indexing collections.

I think we could still squeeze some more performance out of this by avoiding construction of axes for types where the conversion to the canonical state is already known. This is kind of the idea behind `canonicalize(x)`, which converts `x` to a canonical indexer and helped clean up some internal code that does the same thing. I'm probably going to work on that optimization in another PR b/c it will probably require some involved work on the internal mechanics of `to_indices`. For now we are at least as fast as Base.
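For readers following along, here is a rough sketch of the `is_canonical`/`canonicalize` idea. These definitions are illustrative assumptions on my part, not the actual ArrayInterface implementation:

```julia
# Sketch only: a "canonical" index is one to_indices can pass through
# untouched; everything else gets converted once, up front.
is_canonical(::Type{Int}) = true
is_canonical(::Type{<:AbstractUnitRange{Int}}) = true
is_canonical(::Type) = false

canonicalize(x::Integer) = Int(x)                        # e.g. 0x03 -> 3
canonicalize(x::AbstractUnitRange) = Int(first(x)):Int(last(x))
canonicalize(x) = x                                      # assume already canonical
```

Checking `is_canonical` on each element's type means a tuple of already-canonical indexers can be returned as-is, which is what avoids the recursive reconstruction mentioned above.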
LazyAxis

A lot of array traits can be derived by looking at an array's axes. Many arrays don't have fully constructed axes stored with them, meaning we usually spend some unnecessary time composing an axis just to find out its size or offset. Using `lazy_axes` we can sometimes avoid these costs. For example, we can avoid the cost of retrieving size information for an `Array`. Much of this improvement is lost when size information has to be computed by both approaches.

I've started putting this in several places and it looks like it either does as well as the eager method or improves on it. I could probably be a little more exhaustive with improving the performance of this, but figured it was helpful enough to include as is right now.
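To make the idea concrete, here is a stripped-down sketch of a lazy axis. The actual `LazyAxis` in this PR is more general; `MyLazyAxis` and its 1-based assumption are mine:

```julia
# Simplified sketch; assumes a 1-based parent array. The wrapper stores
# the parent and the dimension, and computes nothing until asked.
struct MyLazyAxis{D,A} <: AbstractUnitRange{Int}
    parent::A    # the array whose D-th axis we represent
end
MyLazyAxis{D}(A) where {D} = MyLazyAxis{D,typeof(A)}(A)

Base.first(::MyLazyAxis) = 1
Base.last(x::MyLazyAxis{D}) where {D} = size(x.parent, D)
```

A caller that only needs the offset (`first`) never pays for a `size` call, which is where the savings over eagerly constructing every axis come from.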
One final note: I'm aware that I've added a bunch of messy code for bounds checking. I'm going to clean this up when I fine-tune `to_indices` later on.