Skip to content

Commit

Permalink
Add devdocs for SubArrays
Browse files Browse the repository at this point in the history
  • Loading branch information
timholy committed Nov 20, 2014
1 parent 6445fff commit c7fb8cb
Show file tree
Hide file tree
Showing 3 changed files with 281 additions and 13 deletions.
13 changes: 0 additions & 13 deletions doc/devdocs/index.rst

This file was deleted.

1 change: 1 addition & 0 deletions doc/devdocs/julia.rst
Original file line number Diff line number Diff line change
Expand Up @@ -11,3 +11,4 @@

cartesian
meta
subarrays
280 changes: 280 additions & 0 deletions doc/devdocs/subarrays.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,280 @@
.. _devdocs-subarrays:

.. currentmodule:: Base

**************************
SubArrays
**************************

Julia's ``SubArray`` type is a container encoding a "view" of a parent
``AbstractArray``. This page documents some of the design principles
and implementation of ``SubArrays``.

Indexing: cartesian vs. linear indexing
---------------------------------------

Broadly speaking, there are two main ways to access data in an array.
The first, often called cartesian indexing, uses ``N`` indexes for an
``N`` -dimensional ``AbstractArray``. For example, a matrix ``A``
(2-dimensional) can be indexed in cartesian style as ``A[i,j]``. The
second indexing method, refered to as linear indexing, uses a single
index even for higher-dimensional objects. For example, if ``A =
reshape(1:12, 3, 4)``, then the expression ``A[5]`` returns the
value 5. Julia allows you to combine these styles of indexing: for
example, a 3d array ``A3`` can be indexed as ``A3[i,j]``, in which
case ``i`` is interpreted as a cartesian index for the first
dimension, and ``j`` is a linear index over dimensions 2 and 3.

For ``Arrays``, linear indexing appeals to the underlying storage
format: an array is laid out as a contiguous block of memory, and
hence the linear index is just the offset (+1) of the corresponding
entry relative to the beginning of the array. However, this is not
true for many other ``AbstractArrays``: examples include

This comment has been minimized.

Copy link
@ivarne

ivarne Nov 21, 2014

Member

The type is called AbstractArray, not AbstractArrays. Is it OK to emphasize them with the s inside?

This comment has been minimized.

Copy link
@timholy

timholy Nov 21, 2014

Author Member

I tried. ReST won't format it correctly if `` is followed by a non-space, non-punctuation character. (It's different in that respect from Markdown.)

This comment has been minimized.

Copy link
@ivarne

ivarne Nov 21, 2014

Member

Yes, it's nice that markdowns parser lets you emphasize part of a name (but that also adds too much space for my taste). My suggestion will likely be to use AbstractArray in plural, or remove the emphasization.

This comment has been minimized.

Copy link
@StefanKarpinski

StefanKarpinski Nov 21, 2014

Member

I tend to just write to avoid the plural forms of things that need to go in backticks since even though Markdown allows it, it tends not to look great. For example, here you can just write

However, this is not true for many other AbstractArray subtypes:

This comment has been minimized.

Copy link
@jiahao

jiahao Nov 21, 2014

Member

If you really want to, you can do

 ``AbstractArray``\ s

This comment has been minimized.

Copy link
@timholy

timholy Nov 21, 2014

Author Member

Good suggestions all. See 6061806.

This comment has been minimized.

Copy link
@ivarne

ivarne Nov 21, 2014

Member

Nice! That's much better

``SparseMatrixCSC``, arrays that require some kind of computation
(such as interpolation), and the type under discussion here,
``SubArrays``. For these types, the underlying information is more
naturally described in terms of cartesian indexes.

You can manually convert from a cartesian index to a linear index with
``sub2ind``, and vice versa using ``ind2sub``. ``getindex`` and
``setindex!`` functions for ``AbstractArray`` types may include
similar operations.

While converting from a cartesian index to a linear index is fast
(it's just multiplication and addition), converting from a linear
index to a cartesian index is very slow: it relies on the ``div``
operation, which is one of the slowest low-level operations you can
perform with a CPU. For this reason, any code that deals with
``AbstractArrays`` is best designed in terms of cartesian, rather than
linear, indexing.

Index replacement
-----------------

Consider making 2d slices of a 3d array::

S1 = slice(A, :, 5, 2:6)
S2 = slice(A, 5, :, 2:6)

``slice`` drops "singleton" dimensions (ones that are specified by an
``Int``), so both ``S1`` and ``S2`` are two-dimensional ``SubArrays``.
Consequently, the natural way to index these is with ``S1[i,j]``. To
extract the value from the parent array ``A``, the natural approach is
to replace ``S1[i,j]`` with ``A[i,5,(2:6)[j]]`` and ``S2[i,j]`` with
``A[5,i,(2:6)[j]]``.

The key feature of the design of SubArrays is that this index
replacement can be performed without any runtime overhead.

SubArray design
---------------

Type parameters and fields
~~~~~~~~~~~~~~~~~~~~~~~~~~

The strategy adopted is first and foremost expressed in the definition
of the type::

type SubArray{T,N,P<:AbstractArray,I<:(ViewIndex...),LD} <: AbstractArray{T,N}
parent::P
indexes::I
dims::NTuple{N,Int}
first_index::Int # for linear indexing and pointer
stride1::Int # used only for linear indexing
end

``SubArrays`` have 5 template parameters. The first two are the

This comment has been minimized.

Copy link
@binarybana

binarybana Nov 21, 2014

Contributor

'template' -> 'type' I would think.

This comment has been minimized.

Copy link
@timholy

timholy Nov 21, 2014

Author Member

It's funny how C++ never quite leaves one's system...

Fixed in bc88cc5. Thanks for reading & commenting!

standard element type and dimensionality. The next is the type of the
parent ``AbstractArray``. The most heavily-used is the fourth
parameter, a ``tuple`` of the types of the indexes for each dimension.
The final one, ``LD``, is used only in special circumstances, to
implement efficient linear indexing for those types that can support
it.

If in our example above ``A`` is a ``Array{Float64, 3}``, our ``S1``
case above would be a
``SubArray{Float64,2,Array{Float64,3},(Colon,Int64,UnitRange{Int64}),1}``.
Note in particular the tuple parameter, which stores the types of
the indexes used to create ``S1``. Likewise,
::

julia> S1.indexes
(Colon(),5,2:6)

Storing these values allows index replacement, and having the types
encoded as parameters allows one to dispatch to efficient algorithms.

An ``Int`` index is used to represent a parent dimension that should
be dropped. The distinction between the ``sub`` and ``slice``
commands is that ``sub`` converts *interior* ``Int`` indices into
ranges at the time of construction. For example::

S3 = sub(A, :, 5, 2:6)

julia> S3.indexes
(Colon(),5:5,2:6)

Because of this conversion, ``S3`` is three-dimensional.

``getindex`` and ``setindex!`` (index translation)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Performing index translation requires that you do different things for
different types of ``SubArrays``. For example, for ``S1``, one needs
to apply the ``i,j`` indexes to the first and third dimensions of the
parent array, whereas for ``S2`` one needs to apply them to the
second and third. The simplest approach to indexing would be to do
the type-analysis at runtime::

parentindexes = Array(Any, 0)
for i = 1:ndims(S.parent)
...
if isa(thisindex, Int)
# Don't consume one of the input indexes
push!(parentindexes, thisindex)
else
# Consume an input index
push!(parentindexes, thisindex[inputindex[j]])
j += 1
end
end
S.parent[parentindexes...]

Unfortunately, this would be disastrous in terms of performance: each
element access would allocate memory, and involves the running of a
lot of poorly-typed code.

The better approach is to dispatch to specific methods to handle each
type of input. Note, however, that the number of distinct methods
needed grows exponentially in the number of dimensions, and since
Julia supports arrays of any dimension the number of methods required
is in fact infinite. Fortunately, ``stagedfunctions`` allow one to
generate the necessary methods quite straightforwardly. The resulting
code looks quite a lot like the runtime approach above, but all of the
type analysis is performed at the time of method instantiation. For a
``SubArray`` of the type of ``S1``, the method executed at runtime is
literally ::

getindex(S::<type of S1>, i, j) = S.parent[i, S.indexes[2], S.indexes[3][j]]

Linear indexing
~~~~~~~~~~~~~~~

Linear indexing can be implemented efficiently when the entire array
has a single stride that separates successive elements. For
``SubArrays``, the availability of efficient linear indexing is based
purely on the types of the indexes, and does not depend on values like
the size of the array. It therefore can miss some cases in which the
stride happens to be uniform::

julia> A = reshape(1:4*2, 4, 2)
4x2 Array{Int64,2}:
1 5
2 6
3 7
4 8

julia> diff(A[2:2:4,:][:])
3-element Array{Int64,1}:
2
2
2

A view constructed as ``sub(A, 2:2:4, :)`` happens to have uniform
stride, and therefore linear indexing indeed could be performed
efficiently. However, success in this case depends on the size of the
array: if the first dimension instead were odd, ::

julia> A = reshape(1:5*2, 5, 2)
5x2 Array{Int64,2}:
1 6
2 7
3 8
4 9
5 10

julia> diff(A[2:2:4,:][:])
3-element Array{Int64,1}:
2
3
2

then ``A[2:2:4,:]`` does not have uniform stride, so we cannot
guaranteee efficient linear indexing. Since we have to base this
decision based purely on types encoded in the parameters of the
``SubArray``, ``S = sub(A, 2:2:4, :)`` cannot implement efficient
linear indexing.

The last parameter of ``SubArrays``, ``LD``, encodes the highest
dimension up to which elements are guaranteed to have uniform stride.
When ``LD == length(I)``, the length of the ``indexes`` tuple,
efficient linear indexing becomes possible.

An example might help clarify what this means:

- For ``S1`` above, the ``Colon`` along the first dimension is
uniformly spaced (all elements are displaced by 1 from the previous
value), so this dimension does not "break" linear indexing.
Consequently ``LD`` has a value of at least 1.

- The second dimension of the parent, sliced out as ``5``, does not
not by itself break linear indexing: if all of the remaining
indexes were ``Int``, the entire ``SubArray`` would have efficient
linear indexing. Consequently, ``LD`` is at least 2.

- The last dimension is a ``Range``. This would by itself break
linear indexing (even though it is a ``UnitRange``, the fact that it
might not start at 1 means that there might be gaps). Additionally,
given the preceeding indexes any choice other than ``Int`` would
also have truncated ``LD`` at 2.

Consequently, as a whole ``S1`` does not have efficient linear
indexing.

However, if we were to later say ``S1a = slice(S1, 2:2:7, 3)``,
``S1a`` would have an ``LD`` of 3 (its indexes tuple has type
``(Colon, Int, Int)``) and would have efficient linear indexing. This
ability to re-slice is the main motivation to use an integer ``LD``
rather than a boolean flag to encode the applicability of linear
indexing.

The main reason ``LD`` cannot be inferred from the ``indexes`` tuple
is because ``sub`` converts internal ``Int`` indexes into
``UnitRanges``. Consequently it is important to encode "safe"
dimensions of size 1 prior to conversion. Up to the ``LDth`` entry,
we can be sure that any ``UnitRange`` was, in fact, an ``Integer``
prior to conversion.


A few details
~~~~~~~~~~~~~

- Hopefully by now it's fairly clear that supporting slices means that
the dimensionality, given by the parameter ``N``, is not necessarily
equal to the dimensionality of the parent array or the length of the
``indexes`` tuple. Neither do user-supplied indexes necessarily
line up with entries in the ``indexes`` tuple (e.g., the second
user-supplied index might correspond to the third dimension of the
parent array, and the third element in the ``indexes`` tuple).

What might be less obvious is that the dimensionality of the parent
array may not be equal to the length of the ``indexes`` tuple. Some
examples::

A = reshape(1:35, 5, 7) # A 2d parent Array
S = sub(A, 2:7) # A 1d view created by linear indexing
S = sub(A, :, :, 1) # Appending extra indexes is supported
S = sub(A, :, :, 1:1)

Consequently, internal ``SubArray`` code needs to be fairly careful
about which of these three notions of dimensionality is relevant in
each circumstance.

- Because the processing needed to implement all of the stagedfunction
expressions isn't readily available at the time ``subarray.jl``
appears in the bootstrap process, ``SubArray`` functionality is
split into two files, the second being ``subarray2.jl``.

- Bounds-checking has currently not been tackled. There are two
relevant notions of bounds-checking, one at construction time and
one during element access. This is an important outstanding issue.

1 comment on commit c7fb8cb

@binarybana
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another great stride forward in documentation quality for the Julia project. Thanks again Tim!

Please sign in to comment.