
Layouts - combining array structure and indexing #141

Closed · wants to merge 7 commits
Conversation

Tokazama
Member

This is the first big step towards solving a lot of the recent issues I've created and also further combines the stride traits with the rest of ArrayInterface.jl.

The conceptual piece here is the notion of a "layout" (if you have a better name, feel free to recommend it, b/c I hate naming things). The idea is that we need to propagate information about an array's structure, and anything it wraps, when accessing collections of its elements, so I've included layout(::AbstractArray, index). The return value of layout is intended to be the most appropriate iterator for accessing the memory that corresponds to index. Note that this is different from eachindex, LinearIndices, or CartesianIndices, which are meant to directly index the array they are called on. This is why I've also included buffer for extracting the most direct reference to an array's memory. This PR is strictly focused on the strided array layout, but I've played around with a bunch of different ways this could be generalized to other situations (sparse, lazy stacked). For an idea of how we are already circling this in Base, you could look at SCartesianIndex2 for reinterpreted arrays.
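As a rough illustration of the intended split between the two functions (the names `layout` and `buffer` are from this PR, but the method bodies below are a simplified sketch, not the actual implementation):

```julia
using LinearAlgebra

# `buffer` peels wrappers to get the most direct reference to memory.
buffer(x::DenseArray) = x
buffer(x::Adjoint)    = buffer(parent(x))
buffer(x::SubArray)   = buffer(parent(x))

# `layout` returns the most appropriate iterator for reaching the memory
# corresponding to `index`, rather than an iterator for indexing `A`
# directly (contrast with `eachindex`/`LinearIndices`/`CartesianIndices`).
function layout(A::AbstractArray, index::Integer)
    # a linear index into a linearly-indexable buffer stays linear
    Base.IndexStyle(A) isa IndexLinear ? LinearIndices(A) : CartesianIndices(A)
end
layout(A::AbstractArray, index::Tuple) = CartesianIndices(A)
```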

Currently layout is very basic and geared towards arrays with strides. If linear indexing is performed, we check whether the array is IndexLinear; otherwise we construct an instance of StrideIndices. I've also changed the flow into to_indices: instead of passing the indexed array, the result of layout is passed. This works out pretty nicely because if you know the layout of an array you also know the axes, so I've mostly done some cleaning up there.

Here are some very basic benchmarks (with the array print out removed).

julia> x = ones(4,4,4);

julia> @btime ArrayInterface.getindex(x, 3, :, :)
  74.965 ns (2 allocations: 224 bytes)

julia> @btime getindex(x, 3, :, :)
  347.114 ns (4 allocations: 288 bytes)

julia> @btime ArrayInterface.getindex(x, :, 3, :)
  69.938 ns (2 allocations: 224 bytes)

julia> @btime getindex(x, :, 3, :)
  404.116 ns (7 allocations: 352 bytes)

julia> @btime ArrayInterface.getindex(x, :, :, 3)
  73.510 ns (2 allocations: 224 bytes)

julia> @btime getindex(x, :, :, 3)
  276.487 ns (4 allocations: 288 bytes)

I also wanted to see how much of the difference was due to cleaning up the overall pipeline, and how much was due to using strides. So I compared corresponding steps in the Base indexing pipeline with their counterparts in ArrayInterface.jl:

inds = Base.to_indices(x, (:, :, :))
@btime inner_index(x, inds)
@btime inner_index_base(x, inds)

julia> @btime inner_index(x, inds)
  119.776 ns (1 allocation: 624 bytes)

julia> @btime inner_index_base(x, inds)
  238.631 ns (5 allocations: 688 bytes)

None of this is combining slices or taking advantage of any fancy memory management, just turning multidim indexing into linear indexing with strides.
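For reference, the stride arithmetic that makes this work is just the usual offset computation; a minimal hand-rolled version (not this PR's code):

```julia
x = reshape(collect(1:64), 4, 4, 4)
s = strides(x)  # (1, 4, 16) for a dense 4×4×4 array
# the 1-based index (i, j, k) maps to linear offset
# 1 + (i-1)*s[1] + (j-1)*s[2] + (k-1)*s[3]
lin(i, j, k) = 1 + (i - 1) * s[1] + (j - 1) * s[2] + (k - 1) * s[3]
@assert x[lin(3, 2, 4)] == x[3, 2, 4]          # both are 55
# x[3, :, :] then corresponds to a fixed grid of linear indices
@assert x[lin.(3, 1:4, (1:4)')] == x[3, :, :]
```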

I'll continue to put together more robust tests, but I wanted to know if/how you'd like this to work with StrideArrays.jl and LoopVectorization.jl. This was discussed in some detail in #94 (which this PR replaces). I was leaning towards something vaguely akin to...

function unsafe_get_collection(A, lyt, inds)
    axs = to_axes(lyt, inds)
    dst = similar(A, axs)
    map(Base.unsafe_length, axes(dst)) == map(Base.unsafe_length, axs) || Base.throw_checksize_error(dst, axs)
    # usually a generated function, don't allow it to impact inference result
    Ab, Aptr = buffer_pointer(A)
    dstb, dstptr = buffer_pointer(dst)
    GC.@preserve Ab dstb copyto!(dstptr, layout(dstptr, :), Aptr, view(lyt, inds...))
    return dst
end

...and then whatever buffers and pointers are produced could control dispatch for more optimized memory management based on hardware.

This needed to be explicitly defined or else static values would end up
in the `range(x, y; kwargs...)` pipeline (which always causes problems).
This isn't fully optimized but it passes tests and begins connecting
the stride layout information to the indexing information.
codecov bot commented Apr 11, 2021

Codecov Report

Merging #141 (2c345c7) into master (cd6e5d2) will decrease coverage by 5.06%.
The diff coverage is 66.13%.


@@            Coverage Diff             @@
##           master     #141      +/-   ##
==========================================
- Coverage   84.38%   79.32%   -5.07%     
==========================================
  Files          11       12       +1     
  Lines        1531     1727     +196     
==========================================
+ Hits         1292     1370      +78     
- Misses        239      357     +118     
Impacted Files Coverage Δ
src/layouts.jl 42.85% <42.85%> (ø)
src/ArrayInterface.jl 84.74% <66.66%> (-0.09%) ⬇️
src/indexing.jl 61.67% <70.76%> (-16.16%) ⬇️
src/ranges.jl 92.30% <85.71%> (+0.22%) ⬆️
src/axes.jl 70.27% <100.00%> (+4.99%) ⬆️
src/dimensions.jl 95.23% <100.00%> (ø)
src/size.jl 83.33% <100.00%> (+0.40%) ⬆️
src/stridelayout.jl 88.20% <100.00%> (+0.05%) ⬆️
src/ndindex.jl 83.13% <0.00%> (-2.24%) ⬇️
... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

The approach here is typically faster than that in base without any
additional work on fusing loops across strides. This is especially
noticeable when Base has to dig into parent arrays. However, single
element indexing is significantly slower here. This is due to
`to_indices`. The changes here are mostly trying to pull out as much as
possible from to_indices so that it can be inlined and avoid allocating.
It's not there yet, but it's a start.
Member Author

Tokazama commented May 3, 2021

@chriselrod, before you review the code here I want to pass something by you:

After some exploratory work on improving this, I wonder if there's a better way to do it. The approach here basically constructs something like LinearIndices or CartesianIndices, but instead of representing a single array, StrideLayout is the unified representation of multiple arrays with strides. The end goal was that optimizing unique scoping/indexing patterns would be easier because the mapping between user-provided indices and the core buffer/data would just be a "layout". With some more cleaning up I found this improved indexing pretty consistently (though we still need VectorizationBase.jl to make pointers fast).

My concern is that this approach won't translate as well to arrays that don't have strides (e.g., sparse arrays) and will make it difficult to incorporate device and machine info into the codegen for VectorizationBase.jl. I started playing around with something more trait-based, but at a certain point I realized I was just imitating IR passes and probably triggering a lot of unnecessary compiler work. I took a look at the MLIR manual and came across the Data Layout Modeling section, which describes a lot of what I'm trying to accomplish here.

I haven't had time to test this out and don't know how feasible it is, but what if we had array IR here that only focused on combining nested array representations and composed the final function that iteratively copies indices to the new array? Then packages from JuliaSIMD would be able to alter this IR with info about hardware. This would change the current indexing pipeline to basically be:

  1. getindex - index entry point
  2. to_indices - make sure user indexing is inbounds and in the appropriate form
  3. unsafe_getindex - generate destination array
  4. unsafe_getindex! - composes the code that iteratively copies to the destination array

If this worked, then theoretically other array types would be a lot simpler to incorporate and optimize using ArrayInterface.jl.
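A minimal sketch of what that four-step pipeline could look like (hypothetical names prefixed with `my_`; this is a naive reference version, not the proposed implementation):

```julia
function my_getindex(A, args...)                  # 1. index entry point
    inds = Base.to_indices(A, args)               # 2. canonicalize indices
    checkbounds(A, inds...)                       #    ...and bounds check
    dst = similar(A, Base.index_shape(inds...))   # 3. generate destination array
    return my_unsafe_getindex!(dst, A, inds...)   # 4. copy into destination
end

# naive copy loop; this is the step that stride/layout info would optimize
function my_unsafe_getindex!(dst, A, inds...)
    i = 0
    for I in Iterators.product(inds...)
        dst[i += 1] = A[I...]
    end
    return dst
end

x = reshape(collect(1:64), 4, 4, 4)
my_getindex(x, 3, :, :) == x[3, :, :]  # true
```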

You know this stuff way better than I do, so if you think this would be an unhelpful distraction from what you're already trying to do then perhaps the original direction I started in here is better. Also, if you think the IR approach is right but want me to take a back seat I can just focus my efforts here on documentation and cleaning up what I've already implemented.

Collaborator

@chriselrod left a comment


I haven't taken a close look at the MLIR document yet, but it looks like it's talking about lower-level details, of the sort covered by Base.datatype_alignment?

Returns `static(true)` if the current environment does not have `@inbounds` metadata and
bounds checking should still occur.
"""
@propagate_inbounds function should_check_bounds()
Collaborator


Why propagate inbounds this way instead of @propagate_inbounds?

Member Author


I probably shouldn't have included a full doc string on this. I was playing around with micro-optimizations for single element indexing and how far I could delay bounds checking. I'll just take this out.

offset1::O1
strides::S
axes::A
end
Collaborator


On the VectorizationBase side of things, it'd be nice if StridedPointer could be a StrideLayout + a Ptr.
However, StrideLayout is missing some things, most importantly the contiguous axis, and also has axes which it does not want (although it does want to store map(static_first, axes(x))).

Also, StridedPointer + axes == PtrArray. StrideLayout already subtypes AbstractArray2, so it is almost there itself.
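The composition being described could be pictured roughly as follows (illustrative types only, prefixed with `My`; these are not VectorizationBase.jl's actual definitions):

```julia
struct MyStrideLayout{S,O,C}
    strides::S      # static/dynamic strides
    offsets::O      # e.g. map(static_first, axes(x))
    contiguous::C   # which axis is contiguous
end

# StridedPointer ≈ StrideLayout + a Ptr
struct MyStridedPointer{T,S,O,C}
    ptr::Ptr{T}
    layout::MyStrideLayout{S,O,C}
end

# PtrArray ≈ StridedPointer + axes
struct MyPtrArray{T,N,S,O,C,A<:Tuple} <: AbstractArray{T,N}
    ptr::MyStridedPointer{T,S,O,C}
    axes::A
end
```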

Member Author


How about something that is indexable but isn't necessarily an iterator, like this:

struct StrideLayout{N,S<:Tuple{Vararg{Any,N}},R,O,O1,C}
    strides::S
    rank::R
    offset1::O1
    offsets::O
    contiguous::C
end

Collaborator


I don't have any actual support anywhere for ArrayInterface.contiguous_batch_size yet, so we could add that later, but probably best to start carrying that around now as well.
But this looks good.

end
end
_layout_indices(::IndexStyle, axs) = CartesianIndices(axs)
_layout_indices(::IndexLinear, axs) = LinearIndices(axs)
Collaborator


Whether to use linear or Cartesian indexing is of course context dependent (in the rewrite, I'll add support for switching between representations).
E.g., if you're summing a Matrix{Float64} or an Adjoint{Float64,Matrix{Float64}}, you'd want to use linear indexing in both cases (for the Adjoint, the linear indices would traverse it in row-major order). Same if you're adding two instances of Matrix{Float64} or two instances of Adjoint{Float64,Matrix{Float64}}. However, if you add a Matrix{Float64} to an Adjoint{Float64,Matrix{Float64}}, you suddenly need Cartesian indexing.
eachindex gets some of this right, but it always falls back to Cartesian indices for the Adjoint, meaning performance for those sums/additions is worse than necessary.
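To make the `eachindex` point concrete:

```julia
using LinearAlgebra
A  = rand(3, 3)
At = A'
Base.IndexStyle(typeof(At))  # IndexCartesian(), even though the parent is dense
first(eachindex(At))         # CartesianIndex(1, 1)
# An order-independent reduction like sum could instead traverse the parent's
# buffer linearly (row-major with respect to At) with cheaper index arithmetic:
sum(At) ≈ sum(parent(At))    # true
```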

Is there a "combine layouts"?

I'll get LoopVectorization to be able to transform between these correctly in the rewrite.
One of my goals there is for something like this to work:

@avx begin
  C .= 0
  for n in indices((C,B),2), m in indices((C,A),1), k in indices((A,B),(2,1))
    C[m,n] += A[m,k] * B[k,n]
  end
end

and to have LoopVectorization be capable of converting the linear indexing of C .= 0 into cartesian indexing, and then fuse that broadcast with the following loop.

Member Author


What you're describing is basically what I was getting at with the reference to MLIR. The final "layout" after code gen would be the result of scoping (for summation this would be unordered access to all elements once) which informs how to combine each layer of the arrays.

This was referenced Jun 1, 2021
Member Author

Tokazama commented Jun 7, 2021

see #169
