Layouts - combining array structure and indexing #141
Conversation
This needed to be explicitly defined, or else static values would end up in the `range(x, y; kwargs...)` pipeline (which always causes problems).
This isn't fully optimized, but it passes tests and begins connecting the stride layout information to the indexing information.
Codecov Report

```
@@            Coverage Diff             @@
##           master     #141      +/-   ##
==========================================
- Coverage   84.38%   79.32%   -5.07%
==========================================
  Files          11       12       +1
  Lines        1531     1727     +196
==========================================
+ Hits         1292     1370      +78
- Misses        239      357     +118
```

Continue to review full report at Codecov.
The approach here is typically faster than that in Base without any additional work on fusing loops across strides. This is especially noticeable when Base has to dig into parent arrays. However, single-element indexing is significantly slower here, due to `to_indices`. The changes here mostly try to pull as much as possible out of `to_indices` so that it can be inlined and avoid allocating. It's not there yet, but it's a start.
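As a point of reference for the `to_indices` cost mentioned above, here's a small sketch of what Base's normalization produces (this is plain Base behavior, not this PR's code):

```julia
A = reshape(1:12, 3, 4)

# Base normalizes every indexing expression into a tuple of concrete indices.
# A colon becomes a Slice over that axis:
Base.to_indices(A, (2, :))  # (2, Base.Slice(Base.OneTo(4)))

# Single-element indexing runs through the same machinery, so any failure to
# inline or avoid allocation here shows up directly in A[i, j] benchmarks:
Base.to_indices(A, (2, 3))  # (2, 3)
```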
@chriselrod, before you review the code here I want to pass something by you: after some exploratory work on improving this, I wonder if there's a better way to do it. The approach here basically constructs something like

My concern is that this approach won't translate as well to arrays that don't have strides (sparse arrays), and that it will be difficult to incorporate device and machine info into the codegen for VectorizationBase.jl. I started playing around with something that was more based in traits, but at a certain point I realized I was just imitating IR passes and probably triggering a lot of unnecessary compiler work. I took a look at the MLIR manual and came across the Data Layout Modeling section, which describes a lot of what I'm trying to accomplish here. I haven't had time to test this out and don't know how feasible it is, but what if we had an array IR here that focused only on combining nested array representations and composed the final function that iteratively copied indices to the new array? Then packages from JuliaSIMD would be able to alter this IR with info about hardware. This would change the current indexing pipeline to basically be
If this worked, then theoretically other array types would be a lot simpler to incorporate and optimize using ArrayInterface.jl. You know this stuff way better than I do, so if you think this would be an unhelpful distraction from what you're already trying to do, then perhaps the original direction I started in here is better. Also, if you think the IR approach is right but want me to take a back seat, I can just focus my efforts here on documentation and cleaning up what I've already implemented.
I haven't taken a close look at the MLIR document yet, but it looks like it's talking about more low-level details, of the sort covered by `Base.datatype_alignment`?
```julia
Returns `static(true)` if the current environment does not have `@inbounds` metadata and
bounds checking should still occur.
"""
@propagate_inbounds function should_check_bounds()
```
Why propagate inbounds this way instead of `@propagate_inbounds`?
I probably shouldn't have included a full doc string on this. I was playing around with micro-optimizations for single element indexing and how far I could delay bounds checking. I'll just take this out.
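For reference, the standard Base mechanism under discussion here, `@boundscheck` blocks elided by a caller's `@inbounds`, can be sketched as follows (`getvalue`, `safe`, and `fast` are hypothetical names for illustration):

```julia
# @boundscheck blocks are removed when this function is inlined into a caller
# that marks the call @inbounds; @inline makes that inlining reliable.
@inline function getvalue(A::Vector, i::Int)
    @boundscheck checkbounds(A, i)
    return A[i]
end

safe(A, i) = getvalue(A, i)            # bounds check runs
fast(A, i) = @inbounds getvalue(A, i)  # @boundscheck block is elided

A = [10, 20, 30]
safe(A, 2)  # 20
```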
```julia
    offset1::O1
    strides::S
    axes::A
end
```
On the VectorizationBase side of things, it'd be nice if `StridedPointer` could be a `StrideLayout` + a `Ptr`.
However, `StrideLayout` is missing some things, most importantly the contiguous axis, and also has `axes`, which it does not want (although it does want to store `map(static_first, axes(x))`).
Also, `StridedPointer` + `axes` == `PtrArray`. `StrideLayout` already subtypes `AbstractArray2`, so it is almost there itself.
How about something that is indexable but isn't necessarily an iterator, like this:

```julia
struct StrideLayout{N,S<:Tuple{Vararg{Any,N}},R,O,O1,C}
    strides::S
    rank::R
    offset1::O1
    offsets::O
    contiguous::C
end
```
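For a plain `Matrix{Float64}`, the data these proposed fields would carry can be read off with existing Base queries. A sketch (field names follow the struct above; no constructor exists yet, so this only shows the raw values each field would hold):

```julia
A = zeros(Float64, 3, 4)

strides(A)           # (1, 3): column-major element strides
firstindex(A)        # 1: offset1, the first linear index
map(first, axes(A))  # (1, 1): offsets, the first index along each axis
# rank would encode the stride ordering (axis 1 is densest here),
# and contiguous would mark axis 1 as the contiguous axis.
```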
I don't have any actual support anywhere for `ArrayInterface.contiguous_batch_size` yet, so we could add that later, but it's probably best to start carrying that around now as well.
But this looks good.
```julia
end
end
_layout_indices(::IndexStyle, axs) = CartesianIndices(axs)
_layout_indices(::IndexLinear, axs) = LinearIndices(axs)
```
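A quick check of what each method returns (the two definitions are reproduced here so the snippet is self-contained; `IndexLinear` is more specific than `IndexStyle`, so it wins dispatch):

```julia
_layout_indices(::IndexStyle, axs) = CartesianIndices(axs)
_layout_indices(::IndexLinear, axs) = LinearIndices(axs)

axs = (Base.OneTo(2), Base.OneTo(3))
_layout_indices(IndexCartesian(), axs)  # CartesianIndices((2, 3))
_layout_indices(IndexLinear(), axs)     # 2×3 LinearIndices
```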
Whether to use linear or Cartesian indexing is of course context dependent (in the rewrite, I'll add support for switching between representations).
E.g., if you're summing a `Matrix{Float64}` or an `Adjoint{Float64,Matrix{Float64}}`, you'd want to use linear indexing in both cases (for the `Adjoint`, the linear indices would traverse it in row-major order). Same if you're calculating the sum of two instances of `Matrix{Float64}` or two instances of `Adjoint{Float64,Matrix{Float64}}`. However, if you add a `Matrix{Float64}` to an `Adjoint{Float64,Matrix{Float64}}`, you suddenly need Cartesian indexing.
`eachindex` gets some of this right, but it'll always return `IndexCartesian()` for the `Adjoint`, meaning performance for those sums/additions is worse than necessary.
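The `eachindex` behavior described above can be seen directly (plain Base/LinearAlgebra behavior):

```julia
using LinearAlgebra

A = rand(3, 3)

IndexStyle(typeof(A))   # IndexLinear()
IndexStyle(typeof(A'))  # IndexCartesian()

# eachindex follows IndexStyle, so iterating A' walks CartesianIndices even
# though a row-major linear traversal of the parent would be valid here:
eachindex(A)   # Base.OneTo(9)
eachindex(A')  # CartesianIndices((3, 3))
```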
Is there a "combine layouts" function?
I'll get LoopVectorization to be able to transform between these correctly in the rewrite.
One of my goals there is for something like this to work:
```julia
@avx begin
    C .= 0
    for n in indices((C,B),2), m in indices((C,A),1), k in indices((A,B),(2,1))
        C[m,n] += A[m,k] * B[k,n]
    end
end
```
and to have LoopVectorization be capable of converting the linear indexing of `C .= 0` into Cartesian indexing, and then fuse that broadcast with the following loop.
What you're describing is basically what I was getting at with the reference to MLIR. The final "layout" after codegen would be the result of scoping (for summation this would be unordered access to all elements once), which informs how to combine each layer of the arrays.
see #169
This is the first big step towards solving a lot of the recent issues I've created and also further combines the stride traits with the rest of ArrayInterface.jl.
The conceptual piece here is the concept of a "layout" (if you have a better name, feel free to recommend it, b/c I hate naming things). The idea is that we need to propagate information about an array's structure and anything it wraps when accessing collections of its elements, so I've included `layout(::AbstractArray, index)`. The return value of `layout` is intended to be the most appropriate iterator for accessing the memory that corresponds to `index`. Note that this is different from `eachindex`, `LinearIndices`, or `CartesianIndices`, which are meant to directly index the array they are called on. This is why I've also included `buffer` for extracting the most direct reference to an array's memory. This PR strictly focuses on the strided array layout, but I've played around with a bunch of different ways this could be generalized to other situations (sparse, lazy stacked). For an idea of how we are already circling this in Base, you could look at `SCartesianIndex2` for reinterpreted arrays.

Currently `layout` is very basic and geared towards arrays with strides. If linear indexing is performed, then we see if the array is `IndexLinear`; otherwise we construct an instance of `StrideIndices`. I've also changed what flows into `to_indices`. Instead of passing the indexed array, the result of `layout` is passed. This works out pretty nicely, because if you know the layout of an array you also know the axes, so I've mostly done some cleaning up there.

Here are some very basic benchmarks (with the array print out removed).
I also wanted to see how much of the difference was due to cleaning up the overall pipeline, and how much was due to using strides, so I compared comparable spots in Base's indexing pipeline and in ArrayInterface.jl.
None of this is combining slices or taking advantage of any fancy memory management, just turning multidim indexing into linear indexing with strides.
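The multidim-to-linear conversion described above amounts to a stride-weighted dot product with the index offsets. A minimal sketch (`stride_index` is a hypothetical helper for illustration, not part of this PR):

```julia
# Collapse a multidimensional index into a single stride-based linear index:
# 1 (the first linear index) plus the stride-weighted distance of each axis's
# index from that axis's first index.
function stride_index(strd::NTuple{N,Int}, offsets::NTuple{N,Int},
                      I::NTuple{N,Int}) where {N}
    1 + sum(ntuple(d -> (I[d] - offsets[d]) * strd[d], Val(N)))
end

A = reshape(collect(1:12), 3, 4)
li = stride_index(strides(A), map(first, axes(A)), (2, 3))
A[li] == A[2, 3]  # true: 1 + (2 - 1)*1 + (3 - 1)*3 == 8
```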
I'll continue to put together more robust tests, but I wanted to know if/how you'd like this to work with StrideArrays.jl and LoopVectorization.jl. This was discussed in some detail in #94 (which this replaces). I was leaning towards having something vaguely akin to...
...and then whatever buffers and pointers are produced could control dispatch for more optimized memory management based on hardware.