Why does this library exist? #6

jpivarski · 2023-12-30T18:46:15Z

jpivarski
Dec 30, 2023
Maintainer

Ragged is a library that provides a ragged.array data type, which is like a NumPy/CuPy array except that some dimensions may be variable-length lists instead of fixed-length.

>>> import ragged
>>> array = ragged.array([[[1.1, 2.2, 3.3], [], [4.4, 5.5]], [], [[6.6], [7.7, 8.8, 9.9]]])
>>> array
ragged.array([
    [[1.1, 2.2, 3.3], [], [4.4, 5.5]],
    [],
    [[6.6], [7.7, 8.8, 9.9]]
])

It is a subset of Awkward Array, which allows any tree-like data structure in the arrays, but by focusing on lists (of lists of lists...) of numbers, a ragged.array can have a shape and a dtype,

>>> array.shape
(3, None, None)

>>> array.dtype
dtype('float64')

which makes it usable in contexts that Awkward Arrays don't fit. In fact, Ragged adheres to the Array API Standard, which might allow it to be dropped into programs that were written with conventional arrays in mind. That is, because it exposes the Array API interface, it may be consumed by libraries or applications written against that interface.

The Array API Standard specifies that an array's shape must have the type

shape: tuple[None | int, ...]

The Array API's purpose in allowing None as a shape item is to accommodate dimensions of unknown length, rather than variable length: not the same thing!

This library, Ragged, is an experiment in reinterpreting that part of the standard to see what happens. If it works, that is, if Ragged easily composes with other libraries that consume the standard, then perhaps future versions of the standard can include language that explicitly allows it. If not, then perhaps an amendment or extension would allow us to introduce a new token such as VAR in

shape: tuple[None | int | VAR, ...]
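
To make the "variable length" reading of None concrete, here is a minimal sketch in plain Python. The helper name ragged_shape is hypothetical and is not part of Ragged's API. It infers a shape from the values of nested lists, whereas a real library would presumably take this from the array's type rather than from its contents. It assumes the input is a list of lists (of lists...) of numbers.

```python
from typing import Optional, Tuple


def ragged_shape(data: list) -> Tuple[Optional[int], ...]:
    """Toy shape inference for nested lists of numbers (hypothetical helper).

    A dimension is reported as an int when every list at that depth has the
    same length, and as None when the lengths differ (a ragged dimension).
    """
    shape = []
    level = [data]  # every list found at the current nesting depth
    while level and all(isinstance(item, list) for item in level):
        lengths = {len(item) for item in level}
        shape.append(lengths.pop() if len(lengths) == 1 else None)
        # Descend one level: concatenate the contents of all lists.
        level = [child for item in level for child in item]
    return tuple(shape)


example = [[[1.1, 2.2, 3.3], [], [4.4, 5.5]], [], [[6.6], [7.7, 8.8, 9.9]]]
rectangular = [[1, 2], [3, 4]]

example_shape = ragged_shape(example)          # (3, None, None)
rectangular_shape = ragged_shape(rectangular)  # (2, 2)
```

Here None marks a dimension whose lengths differ from list to list, which is the reinterpretation under test, and not a dimension whose single length is merely unknown.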

Discussions of a pure ragged array library have come up many times, usually in conjunction with xarray. Awkward Array was originally developed for particle physics, a field that comes from C++ and its assumption that data can be organized in any object graph. Awkward (eventually) narrowed that scope to arrays of identically typed trees and showed that NumPy-like, array-oriented idioms are possible in such a system. Other communities, such as Pangeo/climate science, anndata and Zarr/genetics, have traditionally used arrays and NetCDF/HDF5 files with the assumption that data can be arranged as tables. But in some cases, they need to break out of that restriction.

In general, there's a spectrum of data structures from numerical arrays to object graphs, and if you're starting in one corner, it is easiest to take a small step toward the middle, rather than jump across to the other side. I would describe the spectrum like this:

1. N-dimensional arrays of numbers; all dimensions are fixed-length and data cannot be missing. The array can be described by a shape (tuple of integers) and a dtype (numeric interpretation of the fixed-width numbers).
2. N-dimensional, rectangular array that either contains a ragged array or is contained within ragged dimensions. The shape only describes the fixed-width part and the variable-length part can be part of the dtype description. Scipp is an example of this. (@SimonHeybrock can correct me if I'm wrong.)
3. N-dimensional, rectangular array with missing values; "nullable" types can be an extension of the dtype. Pandas and xarray do this.
4. N-dimensional array in which some dimensions are variable-length lists and others are fixed-length. This generalizes the shape, not the dtype, and this new Ragged library targets this scope.
5. Combinations of variable-length dimensions and missing data (which I strongly considered, to give argmin/argmax a meaningful value for empty lists, but it directly contradicts the Array API's definition of dtypes while adding complexity everywhere).
6. Heterogeneous levels of nested dimensions, such as [1, 2, 3, [4, 5]]. This is allowed by JSON and GeoJSON uses it to describe coordinates, some of which are points and some are boundaries. At this level of generality, no shape is possible, since the shape tuple would need to have different lengths to describe the different nesting levels.
7. Nested records, also a common feature of JSON ("objects"). Different fields of a record can have different data types and different nesting levels, so this precludes both shape and dtype.
8. The Awkward Array library has all of the above, as long as there are no cycles in the type description (using Datashape, rather than separate shape and dtype). Despite this generality, it's possible to apply array-oriented idioms.
9. General, cyclic data graph, such as Python or C++ objects. Not only are shape and dtype impossible, but it's unclear how you'd apply array-oriented functions to such data. (At this level of generality, array-oriented programming becomes functional programming: you'd be writing lambda functions to map and reduce over the data, like Spark's RDDs or Pandas's apply function.)
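
One way to see why a shape is possible for ragged lists but not for heterogeneously nested data such as [1, 2, 3, [4, 5]] is to look at the depths at which numbers occur. The sketch below is plain Python with a hypothetical helper, leaf_depths. It is only an illustration, not something Ragged or Awkward Array provides.

```python
def leaf_depths(data, depth=0):
    """Return the set of nesting depths at which non-list values appear."""
    if isinstance(data, list):
        depths = set()
        for item in data:
            depths |= leaf_depths(item, depth + 1)
        return depths
    return {depth}


ragged_lists = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
heterogeneous = [1, 2, 3, [4, 5]]

ragged_depths = leaf_depths(ragged_lists)          # {2}
heterogeneous_depths = leaf_depths(heterogeneous)  # {1, 2}
```

When all numbers sit at a single depth, a shape tuple of that length can describe the data, with None for the variable-length dimensions. When numbers sit at several depths, no single tuple length fits.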

Particle physicists were starting at the bottom of this spectrum and had to move upward to better take advantage of Python at scale. Communities who are starting at the top of this spectrum but need more complex data models can move downward in a smaller step when the Ragged library is ready for use.

There has been a lot of discussion about this over a long period of time. Here's a list of threads on this topic, each of which I'm going to point to this thread for consolidation.

Dec 2019‒Jul 2020: can Awkward Array be used in or with xarray? No. "Using Awkward Arrays in or with Xarray" (awkward#27)
Aug 2020‒Sep 2020: xarray with a variable-length dimension in Pangeo? https://discourse.pangeo.io/t/xarray-for-raster-data-dems-with-inconsistent-spatial-extent/821
Nov 2020‒Dec 2020: will the Array API Standard be able to accommodate Awkward Array? No. "How do I get involved?" (data-apis/consortium-feedback#6)
Mar 2022‒Sep 2023: AnnData, xarray, and Awkward Array. "Alignment with xarray" (scverse/anndata#744)
Jul 2020‒Nov 2022: Awkward Array backend in xarray? "Awkward array backend?" (pydata/xarray#4285)
Sep 2022: Ragged array summit in Scientific Python (good idea; didn't happen). https://discuss.scientific-python.org/t/ragged-array-summit/465
Jul 2023‒Nov 2023: Awkward Array backend in xarray again, but differently? "Ragged DataArrays (Variables) and coordinates" (pydata/xarray#7988)
Nov 2023: Ragged arrays in Scientific Python? https://discuss.scientific-python.org/t/best-practices-regarding-accepting-ragged-arrays/857
Dec 2023: Ragged/fixed-length array at scale in Pangeo? https://discourse.pangeo.io/t/data-format-for-a-nested-2-d-big-array/3922

We have discussed nearly every aspect of this: "Do we really need a library to focus on ragged arrays only when there's already Awkward Array?" "What would be the ragged library's scope? Should it include missing values?" "What token should denote a ragged dimension in the shape? None? Something else?" "Can raggedness be introduced to xarray directly, by some clever use of coordinates?"

What's new here is a commitment to build a subset library; the observation that the Array API Standard probably accommodates it, thanks to allowing None in the shape specification; otherwise strict adherence to the Array API Standard (already stubbed out!); and the exclusion of missing data from the scope.

swamidass · 2024-01-03T08:25:31Z

swamidass
Jan 3, 2024

Interesting. I hope it works.

0 replies

ConstantinVasilev · 2024-01-07T17:33:28Z

ConstantinVasilev
Jan 7, 2024

Hi @jpivarski, may I ask why missing data is excluded from the scope? What is the issue with having dimensions of different lengths but also allowing None values in a given dimension?

3 replies

jpivarski Jan 7, 2024
Maintainer Author

There are three possible levels of missing value support:

1. don't allow them at all
2. allow numerical data to be missing, but not lists
3. allow numerical data or lists to be missing.

Level 2 would effectively double the set of dtypes: for each dtype T, there would be a dtype Option[T] that would allow the numeric values to be None. Maybe this wouldn't be needed for floating point (including complex) data, since NaN can take the place of None, but integer types and booleans would need to be expanded to accommodate missing values.

Level 3 would allow a whole list/row to be missing, and None (list/row is missing) is different from [] (list/row has zero length), which is different from [None, None, None, ...] (every numerical value in the list/row is missing). You can make these distinctions in Awkward Array, but they seem out of place here. Also, there wouldn't be any way to express that a given dimension can have missing lists/rows: it can't be expressed in the dtype, since the dtype doesn't say anything about how dimensions are arranged. It could be in the shape, but we're already interpreting None in a shape tuple to mean "no fixed length; lengths can be ragged."
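
The three cases that Level 3 would have to keep apart can be spelled out with plain Python lists. This is only an illustration with a hypothetical helper, classify_row. It does not show how Awkward Array represents option types.

```python
from typing import List, Optional

# A row may itself be missing, and each value inside a row may be missing.
Row = Optional[List[Optional[float]]]


def classify_row(row: Row) -> str:
    """Name which of the distinct cases a row falls into."""
    if row is None:
        return "missing list"
    if len(row) == 0:
        return "empty list"
    if all(value is None for value in row):
        return "list of missing values"
    return "list with data"


rows = [None, [], [None, None, None], [1.1, None, 3.3]]
labels = [classify_row(row) for row in rows]
# ['missing list', 'empty list', 'list of missing values', 'list with data']
```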

So we're down to only option 1 versus option 2.

I strongly considered option 2, since the min, max, argmin, argmax functions need to return something for length-0 lists. In Awkward Array, they return None. In fact, these functions return option-type even if none of the lists have zero length—output types should not depend on values.
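
As a plain-Python illustration of the empty-list problem (a hypothetical helper, argmax_per_list, not Ragged's or Awkward Array's implementation): there is no valid integer index into an empty list, so a per-list argmax has to return something that is not an integer for those lists.

```python
from typing import List, Optional


def argmax_per_list(lists: List[List[float]]) -> List[Optional[int]]:
    """Index of the largest value in each list, or None for an empty list."""
    return [
        max(range(len(values)), key=values.__getitem__) if values else None
        for values in lists
    ]


with_empty = argmax_per_list([[1.1, 3.3, 2.2], [], [4.4, 5.5]])  # [1, None, 1]
without_empty = argmax_per_list([[1.1, 3.3, 2.2], [4.4, 5.5]])   # [1, 1]
```

The declared return type is Optional[int] in both calls, even though the second result contains no None. That is the sense in which the output type does not depend on the values.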

However, the Array API is explicit about what dtypes it includes. Adding more would be an extension of the API, which is allowed, but it would be getting away from what we're trying to do here: provide an intermediate between rectilinear arrays and the full Awkward Array at one particularly interesting point along the spectrum: the Array API. The reason this project never got started before is that we got stuck at the question of where along the spectrum to do it. (I got stuck there, anyway: see the above-linked discussions.) We already have a library that handles all data types; if ragged arrays without nullable data are too restrictive, you can use Awkward Array.

ConstantinVasilev Jan 8, 2024

Thank you for the detailed answer! Is there some possibility for a very limited / minimal effort "2.1 allow missing floats" which is already supported by most float dtypes? It could be some balance between not completely ruling out missings and avoiding weird cases + extending the API.

jpivarski Jan 8, 2024
Maintainer Author

NaN values come for free with the floating point types, so they're included in option 1, which I'll clarify with an edit right now. (Also the complex types, though I don't know what happens if the real part is NaN and the imaginary part isn't, or vice-versa. Such a thing probably becomes all NaN pretty quickly.)

For instance, there's already a symbol called ragged.nan, which is required by the specification.

What remains is to see what functions do with these missing values. argmin/argmax can't return floating-point NaN for missing lists, since argmin/argmax must return an integer type. I've noticed that the specification doesn't have the nansum/nanmean/nanstd/etc. functions that prevent NaN from spreading virally through a calculation. (For elementwise functions, it's contained, but not for reducers.) I think I will have a namespace for extensions beyond the Array API, so that we'd have a place to add functions like nansum.
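
A small plain-Python sketch of the difference between elementwise functions and reducers when NaN is present. The nansum below is a hypothetical stand-in written for this example, modeled on the NumPy function of the same name. It does not show Ragged's extension namespace, which is only proposed above.

```python
import math


def nansum(values):
    """Sum the values, skipping NaN (treated here as missing)."""
    return sum(value for value in values if not math.isnan(value))


row = [1.0, math.nan, 2.0]

# Elementwise: the NaN stays in its own position and affects nothing else.
doubled = [value * 2 for value in row]

# Reducer: a single NaN turns the whole result into NaN.
total = sum(row)

# NaN-skipping reducer: the missing value is ignored.
total_skipping_nan = nansum(row)  # 3.0
```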

jpivarski · 2024-01-16T00:15:30Z

jpivarski
Jan 16, 2024
Maintainer Author

Version 0.1.0 has been released (GitHub, PyPI), and this is the first "ready-for-customers" version of Ragged.

Try it out and let me know what's missing or broken. Thanks!

1 reply

hmaarrfk Aug 16, 2024

and on conda-forge soon: conda-forge/staged-recipes#27281
