-
Notifications
You must be signed in to change notification settings - Fork 3.9k
ARROW-4453: [Python] Cython wrappers for SparseTensor #4446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@rok You can fix lint errors in pyarrow.cc and pyarrow.h by |
12068bc to
d3065ec
Compare
46d82c6 to
0fa8d82
Compare
|
I feel this is ready for a first review now. @mrkn Should we implement a sparse_tensor.to_dense() method? We would probably want to do that in c++ with a sparse_tensor.to_tensor method and only wrap it with Python. |
|
@rok Yes, I also think we should have |
|
@mrkn I'd be happy to do it. |
|
@rok Great! Please ask me if you want to any help to do it. |
962c364 to
4dd83c3
Compare
|
I added this to the 0.14 milestone so we can maybe get it merged early next week |
|
@wesm - that would be great. |
|
Either is fine so long as we close this current patch out |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. This is a nice start. I posted a bunch of comments below. Some of them might have to do with the CI crash.
a7032b2 to
9d560f6
Compare
05add3f to
09b8341
Compare
Codecov Report
@@ Coverage Diff @@
## master #4446 +/- ##
==========================================
+ Coverage 88.39% 88.99% +0.6%
==========================================
Files 908 719 -189
Lines 114157 99293 -14864
Branches 1418 0 -1418
==========================================
- Hits 100904 88366 -12538
+ Misses 12891 10927 -1964
+ Partials 362 0 -362
Continue to review full report at Codecov.
|
|
I rebased, did a lot of polish and fixed some issues. |
|
AppVeyor build: https://ci.appveyor.com/project/pitrou/arrow/builds/25661941 |
|
Travis passed as well: https://travis-ci.org/apache/arrow/builds/552756394 |
| def test_sparse_tensor_coo_from_dense(dtype_str, arrow_type): | ||
| dtype = np.dtype(dtype_str) | ||
| data = np.array([[4], [9], [7], [5]]).astype(dtype) | ||
| coords = np.array([[0, 0], [0, 2], [1, 1], [3, 3]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mrkn Can you quickly double-check this is ok? The original code had different values here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Results match scipy's reference. However our API for COO is slightly different to scipy's. Would it make sense to go closer to scipy?
Another reference is pydata/sparse.
| scipy | pydata/sparse | pyarrow | |
|---|---|---|---|
| coo constructor | row(1, M), col(1, M), data(M), shape(ndims) | coords(2, M), data(M), shape(ndims) | coords(M, 2), data(M), shape(ndims) |
| csr constructor | indptr(1, k), indices(1, M), data(M), shape(ndims) | NA | indptr(1, k), indices(1, M), data(M), shape(ndims) |
Here M is number on non zero elements and k is number of pointers that will depend on the data present in the tensor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the current API allows zero-copy between Numpy and Arrow, so I think it's fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just mean the 'python facing' interface, so users would see this:
pyarrow:
data = np.array([[4], [9], [7], [5]])
coords = np.array([[0, 0], [3, 3], [1, 1], [0, 2]])
pa.SparseTensorCOO.from_numpy(data, coords, (4, 4))
scipy:
row = np.array([0, 3, 1, 0])
col = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])
coo_matrix((data, (row, col)), shape=(4, 4)).toarray()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is that different from currently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, either the PyArrow API is closer to pydata/sparse or it's closer to the C++ Arrow API.
In any case, it seems it's just a transpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, and since transpose can be done from Python it's maybe best to leave this to integrations (https://issues.apache.org/jira/browse/ARROW-4224, https://issues.apache.org/jira/browse/ARROW-4223).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(1) The reason why the current coords matrix is column-major is due to the discussion in #2546. But I'm working on introducing the srides array in a SparseCOOIndex at mrkn/arrow@c40d8e4...31beafd. This allows us to have coords matrix in any-contiguousness.
(2) Arrow's COO sparse tensor manages an index as a matrix while SciPy's COO matrix manages an index in two vectors. This difference comes from Arrow's sparse tensor is not a matrix, but a tensor. Tensorflow and PyTorch employ the same approach for COO sparse tensor.
(3) If we should support zero-copy data sharing of a COO matrix with SciPy, we can introduce a special index format for COO matrix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I think this can be revisited later. Will merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating cython wrappers for SparseTensor. This is to resolve [ARROW-4453](https://issues.apache.org/jira/browse/ARROW-4453). Author: Rok <rok@mihevc.org> Author: Antoine Pitrou <antoine@python.org> Closes #4446 from rok/ARROW-4453 and squashes the following commits: db5d620 <Rok> Typo. 9e0363a <Antoine Pitrou> Polish code c31b8eb <Rok> Enabling SparseTensor.Equals checks. 654002a <Rok> Partial review feedback implementation. e89edc6 <Rok> Refactoring to_numpy methods. 3fcc192 <Rok> Add equality methods. 4a30487 <Rok> Set base object in to_numpy methods. 4eeae02 <Rok> Cython wrapper for SparseTensor.
Creating cython wrappers for SparseTensor. This is to resolve [ARROW-4453](https://issues.apache.org/jira/browse/ARROW-4453). Author: Rok <rok@mihevc.org> Author: Antoine Pitrou <antoine@python.org> Closes #4446 from rok/ARROW-4453 and squashes the following commits: db5d620 <Rok> Typo. 9e0363a <Antoine Pitrou> Polish code c31b8eb <Rok> Enabling SparseTensor.Equals checks. 654002a <Rok> Partial review feedback implementation. e89edc6 <Rok> Refactoring to_numpy methods. 3fcc192 <Rok> Add equality methods. 4a30487 <Rok> Set base object in to_numpy methods. 4eeae02 <Rok> Cython wrapper for SparseTensor.
Creating cython wrappers for SparseTensor.
This is to resolve ARROW-4453.