Skip to content

Add a concept of memory-backed contiguous collection in Base #54581

@jakobnissen

Description

@jakobnissen

I propose the creation of a new concrete (parametric) type in Base that corresponds to memory-backed contiguous arrays, and to implement Base methods that specialize on these properties in terms of this type.

Edit: I've made a proof-of-concept package here: https://github.com/BioJulia/MemoryViews.jl

Motivation

Various places in Base and elsewhere, Julia has methods that can operate on any contiguous memory region.
Let's call this type of data a dense array.
Examples include:

  • Search methods which work by ccalling memchr, as in string/search.jl
  • CRC32c
  • Various IO methods, e.g. read!(::IO, A::AbstractArray), which calls to unsafe_write.

This is currently handled inconsistently in Base:

  • string/search.jl uses the internal ByteArray union type to cover dense arrays of bytes
  • CRC32c uses Union{Array{UInt8},FastContiguousSubArray{UInt8,N,<:Array{UInt8}} where N}
  • read dispatches on AbstractArray, and checks of isbitstype(eltype(A)) && _checkcontiguous(Bool, A).

There are several issues with this handling:

First, all the approaches above fail to cover some important types. I've attempted to address this ad-hoc in #47693, #53178 and #54579. However, this ad-hoc patching is deeply unsatisfying, and I'm certain I have missed several types. In fact, I'm starting to think it's literally impossible to cover all permutations of views, reinterpretarrays, codeunits, memory and vector using a Union.
The practical implication is that lots of function calls needlessly fail to hit the fast path, while at the same time, the code is harder to reason about because this ad-hoc implementations uses complex big unions in their signature, and is inconsistent with each other.

The fact that this code is duplicated thrice in Base, and in all three places was lacking important methods, suggests to me that there is a need for a better (internal) API to write methods for dense arrays. That was also my experience when making those three PR's: "Surely, there must be a better way to do this". The main usability issue is that in lieu of any API to cover dense arrays, it's up to the implementer of every method to make sure they've covered non-obvious types like CodeUnits{UInt8, SubString{String}} and SubArray{UInt8, 1, Vector{UInt8}, Tuple{Base.OneTo{Int64}}, true}. Unsurprisingly, people don't correctly do this.

There are also some other, minor issues with the existing approaches: Namely, it causes unnecessary specialization, as there is no reason to compile two methods for e.g. String and Vector{UInt8} if they each just operate on a slice of bytes in memory. Also, it's more difficult to introspect and reason about methods whose signature includes a "big union".

Why not use DenseArray?

Because it doesn't cover the correct types:

  • It doesn't cover SubArray, despite many subarrays being dense vectors.
  • Currently, we incorrectly have CodeUnits <: DenseVector (CodeUnits <: DenseVector, but does not fulfill its only criteria #53996). If this is fixed, then e.g. CodeUnits{T, String} will also not be covered by DenseArray.
  • Users may create a new subtype <: DenseArray without implementing methods like pointer, or indeed without even realizing that subtyping DenseArray requires that their new type must be memory-backed and contiguous.

Also, reinterpret may create dense arrays that are not DenseArrays - however, my proposed implementation doesn't handle this, either.

Proposal

I have cooked up a proof-of-concept package here: https://github.com/BioJulia/MemoryViews.jl which you may see for more details

I propose creating a new, internal Base type with the following layout:

# M is a marker type that is either Mutable or Immutable 
struct MemoryView{T, M} <: DenseVector{T}
    ref::MemoryRef{T}
    len::Int
end

These types can be constructed from the various contiguous, dense arrays:

MemoryView(A::Memory{T}) where {T} = MutableMemoryView{T}(unsafe, memoryref(A), length(A))
MemoryView(A::Array{T}) where {T} = MutableMemoryView{T}(unsafe, A.ref, length(A))

MemoryView(s::String) = ImmutableMemoryView(unsafe_wrap(Memory{UInt8}, s))

function MemoryView(s::SubString{String})
    memview = MemoryView(parent(s))
    isempty(memview) && return memview
    newref = @inbounds memoryref(memview.ref, s.offset + 1)
    ImmutableMemoryView{UInt8}(unsafe, newref, s.ncodeunits)
end

[etc...]

Optimised methods can be implemented in terms of MemView, when it's not possible to write optimised methods generic to AbstractArray, and where Union{Vector, Memory} etc. is too restrictive:

function some_function_working_on_memory(mem::MemoryView{UInt8})
    # the optimised function here, maybe calling ccall or intrinsics
end

Which would only need to be compiled once, and which is conceptually easier to understand then working with a huge union.

Further, my proposal implements a trait called MemoryKind, which is IsMemory if the type is semantically equivalent to its MemoryView and NotMemory otherwise. For example, codeunits are semantically equivalent to MemoryView (both being dense arrays of bytes), whereas strings have a MemoryView method, but are not semantically equivalent to an array of bytes.
The purpose of MemoryKind is that, for any type implementing that, methods can immediately construct a MemoryView, then pass it on to the low-level implementation function.

The surface level API would be along the lines below:

function foo(x::MemoryView)
    # low-level implementation
end

function foo(::NotMemory, x::AbstractArray)
    # slow, generic fallback
end

# Dispatch with the `MemoryKind` trait
foo(::IsMemory, x) = foo(MemoryView(x))
foo(x) = foo(MemoryKind(typeof(x)), x)

# Optionally: Also support strings
foo(x::AbstractString) = foo(codeunits(x))

This proposal has two important features:

  1. At a high level, dispatching on the MemoryKind trait is more accurate at selecting dense arrays than dispatching on big unions. It both correctly includes complex nested types such as views of codeunits of substrings, while also rejecting incorrect types, such as an incorrectly implemented DenseArray.
  2. At a low level, it is easier to reason about the behaviour (and safety) of low-level methods operating on memory, if it's implemented in terms of a single, concrete MemoryView type. It's also better for compile times and sysimg size if the methods for many different types all dispatch to the same single implementation.

Alternatives

  1. Instead of creating concrete types, the different parts in Base could more consistently use Base._checkcontiguous and isbitstype, perhaps extracted into a function is_memory_backed_contiguous. Base could then be reviewed for places where memory-backed contiguous types are used, and methods could be added to "funnel" all the compatible types into these methods.
    Importantly, any alternative approach should include an internal API, such that it makes it easy to write a method that will correctly dispatch if, and only if, the argument is a dense array - which is not the status quo.

  2. MemoryView could, instead of being backed by Memory, be backed by a simple pointer. In that case, MemoryView would be much like Random.UnsafeView. The novelty of this proposal would be in the MemoryKind trait, mostly. See more here.

Metadata

Metadata

Assignees

No one assigned

    Labels

    arrays[a, r, r, a, y, s]collectionsData structures holding multiple items, e.g. sets

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions