Skip to content

Conversation

@rok
Copy link
Member

@rok rok commented Jun 2, 2019

Creating cython wrappers for SparseTensor.
This is to resolve ARROW-4453.

@mrkn
Copy link
Member

mrkn commented Jun 3, 2019

@rok You can fix lint errors in pyarrow.cc and pyarrow.h by ninja format or make format.

@rok
Copy link
Member Author

rok commented Jun 3, 2019

@rok You can fix lint errors in pyarrow.cc and pyarrow.h by ninja format or make format.

@mrkn I did, thanks for the tip. :)

@rok rok force-pushed the ARROW-4453 branch 9 times, most recently from 12068bc to d3065ec Compare June 19, 2019 00:02
@rok rok changed the title [WIP] ARROW-4453: Cython wrappers for SparseTensor ARROW-4453: [Python] Cython wrappers for SparseTensor Jun 19, 2019
@rok rok force-pushed the ARROW-4453 branch 2 times, most recently from 46d82c6 to 0fa8d82 Compare June 20, 2019 00:03
@rok
Copy link
Member Author

rok commented Jun 20, 2019

I feel this is ready for a first review now.

@mrkn Should we implement a sparse_tensor.to_dense() method? We would probably want to do that in c++ with a sparse_tensor.to_tensor method and only wrap it with Python.

@mrkn
Copy link
Member

mrkn commented Jun 20, 2019

@rok Yes, I also think we should have sparse_tensor.ToTensor() method in C++ and wrap it in Python.
Could you implement such a method in C++, or shall I do it?

@rok
Copy link
Member Author

rok commented Jun 20, 2019

@mrkn I'd be happy to do it.

@mrkn
Copy link
Member

mrkn commented Jun 20, 2019

@rok Great! Please ask me if you want to any help to do it.

@rok rok force-pushed the ARROW-4453 branch 2 times, most recently from 962c364 to 4dd83c3 Compare June 20, 2019 02:38
@wesm
Copy link
Member

wesm commented Jun 22, 2019

I added this to the 0.14 milestone so we can maybe get it merged early next week

@rok
Copy link
Member Author

rok commented Jun 22, 2019

@wesm - that would be great.
Also: shall I open a new issue for implementing sparse_tensor.ToTensor() in c++ or just do it here?

@wesm
Copy link
Member

wesm commented Jun 23, 2019

Either is fine so long as we close this current patch out

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. This is a nice start. I posted a bunch of comments below. Some of them might have to do with the CI crash.

@rok rok force-pushed the ARROW-4453 branch 2 times, most recently from a7032b2 to 9d560f6 Compare June 28, 2019 23:08
@rok rok force-pushed the ARROW-4453 branch 4 times, most recently from 05add3f to 09b8341 Compare June 29, 2019 14:06
@codecov-io
Copy link

codecov-io commented Jun 30, 2019

Codecov Report

Merging #4446 into master will increase coverage by 0.6%.
The diff coverage is 84.23%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4446      +/-   ##
==========================================
+ Coverage   88.39%   88.99%    +0.6%     
==========================================
  Files         908      719     -189     
  Lines      114157    99293   -14864     
  Branches     1418        0    -1418     
==========================================
- Hits       100904    88366   -12538     
+ Misses      12891    10927    -1964     
+ Partials      362        0     -362
Impacted Files Coverage Δ
python/pyarrow/lib.pyx 97.82% <ø> (ø) ⬆️
python/pyarrow/array.pxi 80.61% <ø> (-0.25%) ⬇️
cpp/src/arrow/python/serialize.cc 89.93% <ø> (ø) ⬆️
python/pyarrow/__init__.py 40.54% <ø> (ø) ⬆️
cpp/src/arrow/python/pyarrow.cc 23.91% <0%> (-5.82%) ⬇️
cpp/src/arrow/python/pyarrow_api.h 82.35% <100%> (+1.7%) ⬆️
cpp/src/arrow/compare.cc 85.81% <100%> (+5.81%) ⬆️
python/pyarrow/tests/test_sparse_tensor.py 100% <100%> (ø)
cpp/src/arrow/sparse_tensor-test.cc 100% <100%> (ø) ⬆️
python/pyarrow/tests/test_tensor.py 100% <100%> (+5.4%) ⬆️
... and 202 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 175ad65...db5d620. Read the comment docs.

@pitrou
Copy link
Member

pitrou commented Jul 1, 2019

I rebased, did a lot of polish and fixed some issues.

@pitrou
Copy link
Member

pitrou commented Jul 1, 2019

@rok
Copy link
Member Author

rok commented Jul 1, 2019

Travis passed as well: https://travis-ci.org/apache/arrow/builds/552756394

def test_sparse_tensor_coo_from_dense(dtype_str, arrow_type):
dtype = np.dtype(dtype_str)
data = np.array([[4], [9], [7], [5]]).astype(dtype)
coords = np.array([[0, 0], [0, 2], [1, 1], [3, 3]])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrkn Can you quickly double-check this is ok? The original code had different values here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Results match scipy's reference. However our API for COO is slightly different to scipy's. Would it make sense to go closer to scipy?
Another reference is pydata/sparse.

scipy pydata/sparse pyarrow
coo constructor row(1, M), col(1, M), data(M), shape(ndims) coords(2, M), data(M), shape(ndims) coords(M, 2), data(M), shape(ndims)
csr constructor indptr(1, k), indices(1, M), data(M), shape(ndims) NA indptr(1, k), indices(1, M), data(M), shape(ndims)

Here M is number on non zero elements and k is number of pointers that will depend on the data present in the tensor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the current API allows zero-copy between Numpy and Arrow, so I think it's fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just mean the 'python facing' interface, so users would see this:

pyarrow:
data = np.array([[4], [9], [7], [5]])
coords = np.array([[0, 0], [3, 3], [1, 1], [0, 2]])
pa.SparseTensorCOO.from_numpy(data, coords, (4, 4))

scipy:
row = np.array([0, 3, 1, 0])
col = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])
coo_matrix((data, (row, col)), shape=(4, 4)).toarray()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is that different from currently?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, either the PyArrow API is closer to pydata/sparse or it's closer to the C++ Arrow API.
In any case, it seems it's just a transpose?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, and since transpose can be done from Python it's maybe best to leave this to integrations (https://issues.apache.org/jira/browse/ARROW-4224, https://issues.apache.org/jira/browse/ARROW-4223).

Copy link
Member

@mrkn mrkn Jul 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(1) The reason why the current coords matrix is column-major is due to the discussion in #2546. But I'm working on introducing the srides array in a SparseCOOIndex at mrkn/arrow@c40d8e4...31beafd. This allows us to have coords matrix in any-contiguousness.

(2) Arrow's COO sparse tensor manages an index as a matrix while SciPy's COO matrix manages an index in two vectors. This difference comes from Arrow's sparse tensor is not a matrix, but a tensor. Tensorflow and PyTorch employ the same approach for COO sparse tensor.

(3) If we should support zero-copy data sharing of a COO matrix with SciPy, we can introduce a special index format for COO matrix.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think this can be revisited later. Will merge.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great. Thanks @mrkn and @pitrou! :)

@pitrou pitrou closed this in 74b9294 Jul 2, 2019
kou pushed a commit that referenced this pull request Jul 4, 2019
Creating cython wrappers for SparseTensor.
This is to resolve [ARROW-4453](https://issues.apache.org/jira/browse/ARROW-4453).

Author: Rok <rok@mihevc.org>
Author: Antoine Pitrou <antoine@python.org>

Closes #4446 from rok/ARROW-4453 and squashes the following commits:

db5d620 <Rok> Typo.
9e0363a <Antoine Pitrou> Polish code
c31b8eb <Rok> Enabling SparseTensor.Equals checks.
654002a <Rok> Partial review feedback implementation.
e89edc6 <Rok> Refactoring to_numpy methods.
3fcc192 <Rok> Add equality methods.
4a30487 <Rok> Set base object in to_numpy methods.
4eeae02 <Rok> Cython wrapper for SparseTensor.
kszucs pushed a commit that referenced this pull request Jul 22, 2019
Creating cython wrappers for SparseTensor.
This is to resolve [ARROW-4453](https://issues.apache.org/jira/browse/ARROW-4453).

Author: Rok <rok@mihevc.org>
Author: Antoine Pitrou <antoine@python.org>

Closes #4446 from rok/ARROW-4453 and squashes the following commits:

db5d620 <Rok> Typo.
9e0363a <Antoine Pitrou> Polish code
c31b8eb <Rok> Enabling SparseTensor.Equals checks.
654002a <Rok> Partial review feedback implementation.
e89edc6 <Rok> Refactoring to_numpy methods.
3fcc192 <Rok> Add equality methods.
4a30487 <Rok> Set base object in to_numpy methods.
4eeae02 <Rok> Cython wrapper for SparseTensor.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants