
Add advanced indexing support #114

Open
woodmd opened this issue Mar 1, 2018 · 12 comments

Labels: enhancement (Indicates new feature requests), help wanted

Comments

@woodmd

woodmd commented Mar 1, 2018

I would like to have support for advanced indexing for both retrieval and assignment. Ideally, I was hoping to find something that could serve as a drop-in replacement for numpy.ndarray for these types of operations. Is this functionality within the scope of this library? Are there any thoughts on how it should be implemented?

On a related note, I noticed that COO is currently immutable and thus doesn't allow item assignment. However, I wonder if one could support assignment by having COO make an in-place copy of itself. Of course, this will be extremely inefficient for updating a single element, but when addressing a large number of elements in parallel the overhead of the copy should be more manageable. The documentation could stress that setting elements of COO individually is not recommended.
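To make the copy-on-assignment idea concrete, here is a minimal numpy-only sketch (not library code; `coo_assign_copy` and its layout are hypothetical). It rebuilds the coordinate/data arrays once per batch of updates, which is why it only amortizes well when many elements are set at once:

```python
import numpy as np

def coo_assign_copy(coords, data, shape, new_coords, new_data):
    """Return updated (coords, data) for a COO-like array by rebuilding it.

    coords: (ndim, nnz) int array; data: (nnz,) values.
    new_coords/new_data: elements to set, overwriting any existing entries.
    O(nnz) per call, so only reasonable for bulk updates.
    """
    # Linearize coordinates so overwritten entries can be found with np.isin.
    lin = np.ravel_multi_index(tuple(coords), shape)
    new_lin = np.ravel_multi_index(tuple(new_coords), shape)
    keep = ~np.isin(lin, new_lin)          # drop entries being overwritten
    out_coords = np.concatenate([coords[:, keep], new_coords], axis=1)
    out_data = np.concatenate([data[keep], new_data])
    return out_coords, out_data

# Update two elements of a 4x4 sparse array in one pass.
coords = np.array([[0, 1, 2], [0, 1, 2]])   # diagonal entries
data = np.array([1.0, 2.0, 3.0])
upd_coords = np.array([[1, 3], [1, 3]])
upd_data = np.array([9.0, 4.0])
coords2, data2 = coo_assign_copy(coords, data, (4, 4), upd_coords, upd_data)
```

Note the result is no longer sorted lexicographically; a real implementation would re-sort before constructing the new COO.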

@hameerabbasi
Collaborator

hameerabbasi commented Mar 1, 2018

Hi!

Advanced indexing support is definitely on the to-do list (see #1), however it would be extremely inefficient to assign with COO, as you noted.

It isn't a high priority right now, but a PR is always welcome. Also, since you have shown interest, the priority has gone up.

If you want fast writes, DOK is the way to go; you can then convert it to COO once you are done writing. It supports assigning an entire array at once, but the downside is that it doesn't support advanced indexing either. Unfortunately, it is implemented in Python (and so isn't super fast) and does not accept another COO/DOK as input, though it does accept NumPy arrays for slices.

If we want to make advanced indexing super-fast for both COO and DOK, we will need to rewrite DOK in C++ to use std::unordered_map. For COO, even though the data structure won't need to change, a large part of the indexing will also need to be in C++.

If you want to interleave reads and writes, DOK is the best choice. However, currently it only supports accessing single elements, though that isn't too hard to extend.

Hope this helps!

Making COO itself mutable is out of scope, I'm afraid. It just isn't meant to support writes. Instead, in your scenario, I suggest you construct a DOK, mutate it, and take "snapshots" to COO whenever you want to perform operations other than writes.
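The write-to-DOK, snapshot-to-COO pattern suggested above can be illustrated with a tiny dict-of-keys sketch (this is not the library's DOK class, just the underlying idea):

```python
import numpy as np

class TinyDOK:
    """Toy dict-of-keys store: O(1) writes, with a COO-style snapshot."""

    def __init__(self, shape):
        self.shape = shape
        self.store = {}                      # (i, j, ...) -> value

    def __setitem__(self, key, value):       # cheap single-element writes
        if value == 0:
            self.store.pop(key, None)        # zeros are simply not stored
        else:
            self.store[key] = value

    def to_coo(self):
        """Snapshot: (coords, data) arrays in lexicographic COO layout."""
        keys = sorted(self.store)            # lexicographic, like COO.coords
        coords = np.array(keys, dtype=np.intp).T
        data = np.array([self.store[k] for k in keys])
        return coords, data

dok = TinyDOK((3, 3))
dok[0, 0] = 1.0
dok[2, 1] = 5.0
dok[0, 0] = 2.0        # overwriting is cheap, unlike rebuilding a COO
coords, data = dok.to_coo()
```

Each `to_coo()` snapshot costs O(nnz log nnz) for the sort, which is why taking snapshots only between batches of writes is the recommended workflow.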

@hameerabbasi hameerabbasi added enhancement Indicates new feature requests help wanted labels Mar 1, 2018
@hameerabbasi
Collaborator

However, if assigning to COO won't change its sparsity structure, then that is in scope.

@hameerabbasi hameerabbasi mentioned this issue Apr 18, 2018
@hameerabbasi
Collaborator

I'm blocked on this by numba/numba#2560 or #126.

@hameerabbasi
Collaborator

cc @woodmd would it be good enough if we converted on-the-fly between the most optimal formats for each operation? So writing would automatically convert it to DOK, element-wise ops or reads to COO, for example?

@woodmd
Author

woodmd commented Apr 30, 2018

I don't think having element-wise update is critical for us. The use-case would generally be reading/updating a very large number of elements at once. Probably for updating the easiest approach would be to instantiate a new COO on the fly.

@ahwillia
Contributor

ahwillia commented May 5, 2018

The following indexing pattern no longer works on master. I believe the problem originates at commit 222fb8c:

import sparse
import numpy as np
s = sparse.random((100, 100, 100), density=.1)
idx = np.random.permutation(100)
s[idx]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

.../sparse/coo/indexing.py in getitem(x, index)
     54 
     55     # zip_longest so things like x[..., None] are picked up.
---> 56     if len(index) != 0 and all(ind == slice(0, dim, 1) for ind, dim in zip_longest(index, x.shape)):
     57         return x
     58 

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

@hameerabbasi
Collaborator

This is correct, it's a known issue, and it was discussed in the PR related to that commit.

We sacrificed this feature for performance. If you're interested, I can walk you through how to add full-blown advanced indexing.

I have no immediate plans to do this myself, however.

@ahwillia
Contributor

ahwillia commented May 6, 2018

I'd definitely be happy to help with this! (As long as you don't think it would take too much effort on your part to get me started on it.)

@hameerabbasi
Collaborator

Great! It is a bit of work to understand, but not too much work to actually implement. I'll say this right now: this isn't an easy task for a new contributor to this library, and it's completely okay if you'd rather not take it on. That said, I'm more than willing to guide you through it in case you have questions or need any kind of support (algorithmic or technical). Don't hesitate to ask anything (either on this issue or on any related PR you create), and I'll help where I can.

Most of what you need is in https://github.com/pydata/sparse/blob/8f2a9aebe595762eace6bc48531119462f979e21/sparse/coo/indexing.py
Here is a brief sketch of the indexing algorithm:

  • The main idea is to create a sort of "mask" such that output.data = input.data[mask]
  • Since indices are sorted in lexicographic order, it tries to find starts/ends for an integer index via binary search.
  • It does this for all supplied indices.
  • For slices, it repeats the above for every possible integer in the slice.
  • When it detects that the algorithm is becoming too slow (or if it reaches the ending index), it converts the starts/stops to a mask, and switches to filtering that mask using the slices directly.
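The binary-search step above can be sketched in plain numpy. This is an illustrative simplification, not the library's implementation (the helper name and linearization are assumptions): because COO coordinates are lexicographically sorted, all entries matching integer `i` along the first axis form one contiguous run that `np.searchsorted` locates in O(log nnz):

```python
import numpy as np

def index_first_axis(coords, data, shape, i):
    """Select entries with first coordinate == i via binary search."""
    # Linearized sorted coords: entries for row i occupy one contiguous range.
    lin = np.ravel_multi_index(tuple(coords), shape)
    row_size = int(np.prod(shape[1:]))
    start = np.searchsorted(lin, i * row_size, side="left")
    stop = np.searchsorted(lin, (i + 1) * row_size, side="left")
    mask = slice(start, stop)              # the start/stop pair from the text
    return coords[1:, mask], data[mask]    # drop the indexed dimension

coords = np.array([[0, 0, 1, 2], [0, 2, 1, 0]])   # 3x3 array, sorted coords
data = np.array([1.0, 2.0, 3.0, 4.0])
sub_coords, sub_data = index_first_axis(coords, data, (3, 3), 1)
```

For a slice like `0:3:2`, the same search would simply be repeated for each integer in the slice, as described above.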

A couple of quick side notes:

  • All indices are represented as rows of an (n_indices, 3) array, with each row being a start/stop/step triple into input.coords.
  • Integers are represented as the slice i:i+1:1
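A small sketch of that normalization, under the assumptions above (the helper name `normalize_index` is made up for illustration): every index on an array becomes one (start, stop, step) row, with integers `i` becoming `(i, i+1, 1)`:

```python
import numpy as np

def normalize_index(index, shape):
    """Turn a tuple of ints/slices into an (n_indices, 3) start/stop/step array."""
    rows = []
    for ind, dim in zip(index, shape):
        if isinstance(ind, slice):
            rows.append(ind.indices(dim))    # resolves None and negatives
        else:                                # integer i -> i:i+1:1
            rows.append((ind, ind + 1, 1))
    return np.array(rows, dtype=np.intp)

# Index (2, :, 1:5:2) on a (4, 6, 8) array.
norm = normalize_index((2, slice(None), slice(1, 5, 2)), (4, 6, 8))
```

Keeping everything in one integer array is what lets the core loops be compiled without special-casing each index type.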

What you will need to do (no pressure):

  • Refactor the algorithm to return output.coords and output.shape as well.
  • Broadcast advanced indices to a common shape.
  • Iterate over zip(advanced indices) and treat them as if they were single indices.
  • Combine all the coords and data masks.
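The broadcast-and-zip steps can be shown in plain numpy (illustrative only; the real implementation would work on masks rather than materializing pairs):

```python
import numpy as np

# Two advanced indices with different shapes are first broadcast together...
rows = np.array([[0], [2]])        # shape (2, 1)
cols = np.array([1, 3])            # shape (2,)
b_rows, b_cols = np.broadcast_arrays(rows, cols)   # both become (2, 2)

# ...then each zipped combination is treated like a single integer index.
pairs = list(zip(b_rows.ravel(), b_cols.ravel()))
```

Each pair would then go through the binary-search path described earlier, and the resulting coords/data masks get combined into the output.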

A couple of things you might need to know during the implementation phase:

  • It's best to flatten all advanced coordinates, concatenate them into a single 2-D array, and work on it that way. This way, there won't be a million different Numba-compiled versions for every dimension and number of advanced indices.
  • You might need to keep track of where the "advanced index dimension" goes. Specifically, if they're all next to each other, it goes at that place. If not, it goes at the beginning. https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#combining-advanced-and-basic-indexing.
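NumPy's placement rule from the linked documentation, demonstrated on a dense array: when the advanced indices are adjacent, their broadcast dimension replaces them in place; when a basic index separates them, it moves to the front of the result.

```python
import numpy as np

x = np.zeros((4, 5, 6))
idx = np.array([0, 1, 2])

adjacent = x[:, idx, idx]     # advanced indices together: dim stays in place
separated = x[idx, :, idx]    # separated by a slice: dim moves to the front
```

A sparse implementation has to reproduce these same output shapes, which is why tracking the "advanced index dimension" is called out above.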

@hameerabbasi
Collaborator

hameerabbasi commented Aug 6, 2018

I have added outer indexing for a single array in #172. If outer indexing for multiple arrays is needed, then we might add it after NEP 21 is accepted.
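For context, "outer" indexing with multiple arrays (what NEP 21 calls `oindex`) differs from NumPy's default pointwise behavior; it can be emulated in NumPy today with `np.ix_`:

```python
import numpy as np

x = np.arange(16).reshape(4, 4)
rows = np.array([0, 2])
cols = np.array([1, 3])

outer = x[np.ix_(rows, cols)]   # every rows x cols combination: shape (2, 2)
pointwise = x[rows, cols]       # zipped pairs (0,1) and (2,3): shape (2,)
```

The multi-array case would extend #172's single-array outer indexing to this grid-selection semantics.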

@k-a-mendoza

@ahwillia Seems like an older thread, but I was wondering if there's been any progress made on this front?

@hameerabbasi
Collaborator

#343 adds this for the one-dimensional case.
