Add a concept of memory-backed contiguous collection in Base

I propose the creation of a new concrete (parametric) type in Base that corresponds to memory-backed contiguous arrays, and to implement Base methods that specialize on these properties in terms of this type.

Edit: I've made a proof-of-concept package here: https://github.com/BioJulia/MemoryViews.jl

## Motivation
Various places in Base and elsewhere, Julia has methods that can operate on _any_ contiguous memory region.
Let's call this type of data a _dense array_.
Examples include:
* Search methods which work by ccalling `memchr`, as in `string/search.jl`
* CRC32c
* Various IO methods, e.g. `read!(::IO, A::AbstractArray)`, which calls to `unsafe_write`.

This is currently handled inconsistently in Base:
* `string/search.jl` uses the internal `ByteArray` union type to cover dense arrays of bytes
* CRC32c uses `Union{Array{UInt8},FastContiguousSubArray{UInt8,N,<:Array{UInt8}} where N}`
* `read` dispatches on `AbstractArray`, and checks of `isbitstype(eltype(A)) && _checkcontiguous(Bool, A)`.

There are several issues with this handling:

First, all the approaches above fail to cover some important types. I've attempted to address this ad-hoc in #47693, #53178 and #54579. However, this ad-hoc patching is deeply unsatisfying, and I'm certain I have missed several types. In fact, I'm starting to think it's literally impossible to cover all permutations of views, reinterpretarrays, codeunits, memory and vector using a `Union`.
The practical implication is that lots of function calls needlessly fail to hit the fast path, while at the same time, the code is harder to reason about because this ad-hoc implementations uses complex big unions in their signature, and is inconsistent with each other.

The fact that this code is duplicated thrice in Base, and in all three places was lacking important methods, suggests to me that there is a need for a better (internal) API to write methods for dense arrays. That was also my experience when making those three PR's: "Surely, there must be a better way to do this". The main usability issue is that in lieu of any API to cover dense arrays, it's up to the implementer of every method to make sure they've covered non-obvious types like `CodeUnits{UInt8, SubString{String}}` and `SubArray{UInt8, 1, Vector{UInt8}, Tuple{Base.OneTo{Int64}}, true}`. Unsurprisingly, people don't correctly do this.

There are also some other, minor issues with the existing approaches: Namely, it causes unnecessary specialization, as there is no reason to compile two methods for e.g. `String` and `Vector{UInt8}` if they each just operate on a slice of bytes in memory. Also, it's more difficult to introspect and reason about methods whose signature includes a "big union".

#### Why not use `DenseArray`?
Because it doesn't cover the correct types:
* It doesn't cover `SubArray`, despite many subarrays being dense vectors.
* Currently, we incorrectly have `CodeUnits <: DenseVector` (#53996). If this is fixed, then e.g. `CodeUnits{T, String}` will also not be covered by `DenseArray`.
* Users may create a new subtype `<: DenseArray` without implementing methods like `pointer`, or indeed without even realizing that subtyping `DenseArray` requires that their new type must be memory-backed and contiguous. 

Also, reinterpret may create dense arrays that are not `DenseArray`s - however, my proposed implementation doesn't handle this, either.

## Proposal
I have cooked up a proof-of-concept package here: https://github.com/BioJulia/MemoryViews.jl which you may see for more details

I propose creating a new, internal Base type with the following layout:
```julia
# M is a marker type that is either Mutable or Immutable 
struct MemoryView{T, M} <: DenseVector{T}
    ref::MemoryRef{T}
    len::Int
end
```

These types can be constructed from the various contiguous, dense arrays:
```julia
MemoryView(A::Memory{T}) where {T} = MutableMemoryView{T}(unsafe, memoryref(A), length(A))
MemoryView(A::Array{T}) where {T} = MutableMemoryView{T}(unsafe, A.ref, length(A))

MemoryView(s::String) = ImmutableMemoryView(unsafe_wrap(Memory{UInt8}, s))

function MemoryView(s::SubString{String})
    memview = MemoryView(parent(s))
    isempty(memview) && return memview
    newref = @inbounds memoryref(memview.ref, s.offset + 1)
    ImmutableMemoryView{UInt8}(unsafe, newref, s.ncodeunits)
end

[etc...]
```

Optimised methods can be implemented in terms of `MemView`, when it's not possible to write optimised methods generic to `AbstractArray`, and where `Union{Vector, Memory}` etc. is too restrictive:
```julia
function some_function_working_on_memory(mem::MemoryView{UInt8})
    # the optimised function here, maybe calling ccall or intrinsics
end
```
Which would only need to be compiled once, and which is conceptually easier to understand then working with a huge union.

Further, my proposal implements a trait called `MemoryKind`, which is `IsMemory` if the type is _semantically equivalent to its `MemoryView`_ and `NotMemory` otherwise. For example, codeunits _are_ semantically equivalent to `MemoryView` (both being dense arrays of bytes), whereas strings have a `MemoryView` method, but are not semantically equivalent to an array of bytes.
The purpose of `MemoryKind` is that, for any type implementing that, methods can immediately construct a `MemoryView`, then pass it on to the low-level implementation function.

The surface level API would be along the lines below:
```julia
function foo(x::MemoryView)
    # low-level implementation
end

function foo(::NotMemory, x::AbstractArray)
    # slow, generic fallback
end

# Dispatch with the `MemoryKind` trait
foo(::IsMemory, x) = foo(MemoryView(x))
foo(x) = foo(MemoryKind(typeof(x)), x)

# Optionally: Also support strings
foo(x::AbstractString) = foo(codeunits(x))
```

This proposal has two important features:
1. At a _high level_, dispatching on the `MemoryKind` trait is more accurate at selecting dense arrays than dispatching on big unions. It both correctly includes complex nested types such as views of codeunits of substrings, while also rejecting incorrect types, such as an incorrectly implemented `DenseArray`.
2. At a _low_ level, it is easier to reason about the behaviour (and safety) of low-level methods operating on memory, if it's implemented in terms of a single, concrete `MemoryView` type. It's also better for compile times and sysimg size if the methods for many different types all dispatch to the same single implementation.

## Alternatives
1. Instead of creating concrete types, the different parts in Base could more consistently use `Base._checkcontiguous` and `isbitstype`, perhaps extracted into a function `is_memory_backed_contiguous`. Base could then be reviewed for places where memory-backed contiguous types are used, and methods could be added to "funnel" all the compatible types into these methods.
Importantly, _any alternative approach should include an internal API_, such that it makes it easy to write a method that will correctly dispatch if, and only if, the argument is a dense array - which is _not_ the status quo.

2. MemoryView could, instead of being backed by `Memory`, be backed by a simple pointer. In that case, `MemoryView` would be much like `Random.UnsafeView`. The novelty of this proposal would be in the `MemoryKind` trait, mostly. [See more here](https://github.com/jakobnissen/MemViews.jl?tab=readme-ov-file#alternative-proposal).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Add a concept of memory-backed contiguous collection in Base #54581

Motivation

Why not use `DenseArray`?

Proposal

Alternatives

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Add a concept of memory-backed contiguous collection in Base #54581

Description

Motivation

Why not use DenseArray?

Proposal

Alternatives

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Why not use `DenseArray`?