-
Interesting. I hope it works.
-
Hi @jpivarski, may I ask why missing data is excluded from the scope? What is the issue with having dimensions of different lengths while also allowing None values in a given dimension?
-
Version 0.1.0 has been released (GitHub, PyPI), and this is the first "ready-for-customers" version of Ragged. Try it out and let me know what's missing or broken. Thanks!
-
Ragged is a library that provides a `ragged.array` data type, which is like a NumPy/CuPy array except that some dimensions may be variable-length lists instead of fixed-length. It is a subset of Awkward Array, which allows any tree-like data structure in the arrays, but by focusing on lists (of lists of lists...) of numbers, a `ragged.array` can have a `shape` and a `dtype`, which makes it usable in contexts that Awkward Arrays don't fit. In fact, Ragged adheres to the Array API Standard, which might allow it to be dropped into programs that were written with conventional arrays in mind. That is, by producing the Array API interface, it may be consumed by libraries or applications that use it.
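To make the idea of a `shape` for lists of lists concrete, here is a minimal pure-Python sketch. It assumes variable-length dimensions are reported as `None`, which is the convention discussed in this post. The helper `ragged_shape` is hypothetical and not part of Ragged; the real library may decide which dimensions count as variable-length differently (for instance, from the array's type rather than from its values).

```python
from typing import Optional, Tuple


def ragged_shape(data: list) -> Tuple[Optional[int], ...]:
    """Shape of nested lists of numbers that all nest to the same depth.

    Hypothetical helper for illustration only, not Ragged's API.
    A dimension is an int if every list at that level has the same
    length, and None if the lengths differ.
    """
    shape = [len(data)]
    level = data
    # Descend one nesting level at a time while the level holds only lists.
    while level and all(isinstance(item, list) for item in level):
        lengths = {len(item) for item in level}
        shape.append(lengths.pop() if len(lengths) == 1 else None)
        # Flatten one level so the next pass sees the inner lists or numbers.
        level = [inner for item in level for inner in item]
    return tuple(shape)


regular = ragged_shape([[1, 2], [3, 4]])
ragged = ragged_shape([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
```

Here `regular` is `(2, 2)`, as for a conventional array, while `ragged` is `(3, None)` because the inner lists have different lengths.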
The Array API Standard specifies that an array's `shape` must have data type `Tuple[Optional[int], ...]`. The Array API's purpose in allowing `None` as a shape item is to accommodate dimensions of unknown length, rather than variable length: not the same thing!
This library, Ragged, is an experiment in reinterpreting that part of the standard to see what happens. If it works, that is, if Ragged easily composes with other libraries that consume the standard, then perhaps future versions of the standard can include language that explicitly allows it. If not, then perhaps an amendment or extension would allow us to introduce a new token such as `VAR` in the `shape`.
Discussions of a pure ragged array library have come up many times, usually in conjunction with xarray. Awkward Array was originally developed for particle physics, a field that comes from C++ and assumes that data can be organized in any object graph. Awkward (eventually) narrowed that scope to arrays of identically typed trees and showed that NumPy-like, array-oriented idioms are possible in such a system. Other communities, such as Pangeo/climate science, anndata and Zarr/genetics, have traditionally used arrays and NetCDF/HDF5 files with the assumption that data can be arranged as tables of data. But in some cases, they need to break out of that restriction.
In general, there's a spectrum of data structures from numerical arrays to object graphs, and if you're starting in one corner, it is easiest to take a small step toward the middle, rather than jump across to the other side. I would describe the spectrum like this:
- [...] `shape` (tuple of integers) and a `dtype` (numeric interpretation of the fixed-width numbers).
- [...] `shape` only describes the fixed-width part and the variable-length part can be part of the `dtype` description. Scipp is an example of this. (@SimonHeybrock can correct me if I'm wrong.)
- [...] `dtype`. Pandas and xarray do this.
- [...] `shape`, not the `dtype`, and this new Ragged library targets this scope.
- [...] `argmin`/`argmax` a meaningful value for empty lists, but it directly contradicts the Array API's definition of `dtypes` while adding complexity everywhere).
- [...] `[1, 2, 3, [4, 5]]`. This is allowed by JSON, and GeoJSON uses it to describe coordinates, some of which are points and some are boundaries. At this level of generality, no `shape` is possible, since the `shape` tuple would need to have different lengths to describe the different nesting levels.
- [...] `shape` and `dtype`.
- [...] `shape` and `dtype`). Despite this generality, it's possible to apply array-oriented idioms.
- [...] `shape` and `dtype` impossible, but it's unclear how you'd apply array-oriented functions to such data. (At this level of generality, array-oriented programming becomes functional programming: you'd be writing lambda functions to map and reduce over the data, like Spark's RDDs or Pandas's `apply` function.)
Particle physicists were starting at the bottom of this spectrum and had to move upward to better take advantage of Python at scale. Communities who are starting at the top of this spectrum but need more complex data models can move downward in a smaller step when the Ragged library is ready for use.
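The point in the spectrum about mixed nesting depths, as in `[1, 2, 3, [4, 5]]`, can be illustrated with a small pure-Python sketch. The helper `leaf_depths` is hypothetical and belongs to neither Ragged nor Awkward Array; it only shows why a single `shape` tuple cannot describe such data.

```python
def leaf_depths(data, depth=0):
    """Return the set of nesting depths at which numbers occur.

    Hypothetical helper for illustration only.
    """
    if not isinstance(data, list):
        # A number: record how deeply it is nested.
        return {depth}
    depths = set()
    for item in data:
        depths |= leaf_depths(item, depth + 1)
    return depths


uniform = leaf_depths([[1, 2], [3]])      # every number sits at depth 2
mixed = leaf_depths([1, 2, 3, [4, 5]])    # numbers sit at depths 1 and 2
```

When the result contains a single depth, a `shape` tuple of that length can describe the data (possibly with variable-length dimensions). When it contains more than one depth, as in `mixed`, the `shape` tuple would need a different length for different elements.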
There has been a lot of discussion about this over a long period of time. Here's a list of threads on this topic that I'm going to point to this thread, for consolidation.
We have discussed nearly every aspect of this: "Do we really need a library to focus on ragged arrays only when there's already Awkward Array?" "What would be the ragged library's scope? Should it include missing values?" "What token should denote a ragged dimension in the `shape`? `None`? Something else?" "Can raggedness be introduced to xarray directly, by some clever use of coordinates?"
What's new here is a commitment to build a subset library, the observation that the Array API Standard probably accommodates it, thanks to allowing `None` in the `shape` specification, otherwise strict adherence to the Array API standard (already stubbed-out!), and an exclusion of missing data from the scope.