-
-
Notifications
You must be signed in to change notification settings - Fork 131
Description
CSD stands for Compressed Sparse Dimensions. It is a multidimensional generalization of CSR, CSC and COO.
CSD has a few "compressed" dimensions and a few "uncompressed" dimensions. The "compressed" dimensions are linearized and go into indptr similar to CSR/CSC. The "uncompressed" dimensions (corresponding to indices in CSR/CSC, but there can be multiple) will be stored in coords.
The plan is to:
- Write
CSD. - Test it.
- Make
COOand others inherit from it.
Task checklist:
Implement CSD (parametrize tests over compressed axes).:
- Element-wise
- Reductions
- Slicing
- Change compressed dimensions
Post-Implementation (parametrize tests over types):
- Make COO inherit from CSD
- Make CSR inherit from CSD
- Make CSC inherit from CSD
Compressed Sparse Dimensions — Format Specification
Compressed Sparse Dimensions (CSD) is an multidimensional sparse array storage format.
Let’s say that the number of dimensions is ndim and compressed_axes represents the axes/dimensions that are compressed, as a tuple.
It contains three main arrays, one of them two-dimensional:
data(one dimensional,(nnz,))coords(two-dimensional,(ndim - len(compressed_axes), nnz))indptr(one dimensional(reduce(operator.mul, (shape[a] for a in compressed_axes), 1) + 1,))
data/coords/indptr from an ndarray
Relating to a normal NumPy array, we can specify data and coords directly. Let’s say arr is the Numpy array to convert.
non_compressed_axes = tuple(a for a in range(ndim) if a not in set(compressed_axis))
coo_coords = np.where(arr)
data = arr[coo_coords]
axes_order = compressed_axes + non_compressed_axes
if axes_order != tuple(range(ndim)):
lin_coo_coords = linear_loc(coo_coords[np.asarray(axes_order)], tuple(self.shape[a] for a in axes_order))
idx = np.argsort(lin_coo_coords)
data = data[idx]
coords = coords[:, idx]
coords = coo_coords[np.asarray(non_compressed_axes)].astype(np.min_scalar_type(max(non_compressed_axes)))
compressed_linear_shape = reduce(operator.mul, (shape[a] for a in compressed_axes), 1)
compressed_linearized_coords = linear_loc(coo_coords[np.array(compressed_axes)], compressed_axes)
indptr = np.empty(compressed_linear_shape + 1, dtype=np.min_scalar_type(compressed_linear_shape))
indptr[0] = 0
np.cumsum(np.bincount(compressed_linearized_coords, minlength=compressed_linear_shape), out=indptr[1:])Interpreting the arrays
For CSR, the interpretation is simple: indices represents the columns, data the respective data. indptr[i]:indptr[i+1] represents the slice corresponding to row i for which columns and data are valid.
We can extend this definition to CSD as well. coords represent the non-compressed axes, and data the respective data. indptr can be interpreted the same way, replacing the word “row” with “linearized compressed axes index”, but it turns out that there is a simpler way to work with things.
Alternative representation for indptr
Consider the following two arrays produced from indptr:
starts = indptr[:-1].reshape((shape[a] for a in compressed_axes))
stops indptr[1:].reshape((shape[a] for a in compressed_axes))Note that these are views, no copies are made. This gives a simpler and more intuitive interpretation for indptr. For every possible index corresponding to the compressed axes: Indexing into starts gives the starts for that index and stops gives the stops.
The “multiple-COO” view
For every value in starts/stops, coords[:, start_val:stop_val] and data[start_val:stop_val] can be seen as a mini-COO array, and can thus have any COO-related algorithms applied to it.