ARROW-854: [Format] Add tentative SparseTensor format #2546
Conversation
format/SparseTensor.fbs
Outdated
The name indptr comes from scipy's CSR implementation.
I want to replace this with a more suitable name.
(force-pushed 1c903a1 to 43c7cd7)
All the Travis CI jobs passed.

I will review. This is a difficult week because of the Strata conference in New York City, so I won't have much time to review until later in the day on Thursday, my time.
wesm
left a comment
Thank you for working on this. I agree that it would be valuable to be able to continue to have standardized metadata and an IPC protocol for data structures outside of Arrow columnar structures, like tensors and sparse tensors.
We should probably solicit feedback from folks with experience with scipy.sparse. I can help with this if you like.
cpp/src/arrow/ipc/CMakeLists.txt
Outdated
What do you think about including this definition in Tensor.fbs also?
OK. I'll move the definition of SparseTensor into Tensor.fbs.
format/SparseTensor.fbs
Outdated
Could you clarify what data you expect to be in indicesBuffer? I am not that familiar with the COO representation and this comment is not sufficient to understand what I would find here. Maybe a simple example would help?
I wrote an example of the COO format in 5094c98.
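For readers unfamiliar with COO, here is a minimal illustration using scipy.sparse; the matrix values are hypothetical and not taken from the patch:

```python
import numpy as np
from scipy.sparse import coo_matrix

# A small 4x4 matrix with three nonzero entries (hypothetical values).
dense = np.array([
    [0, 0, 5, 0],
    [0, 7, 0, 0],
    [0, 0, 0, 0],
    [9, 0, 0, 0],
])
coo = coo_matrix(dense)

# COO stores one coordinate per dimension for each nonzero value,
# so the indices buffer is conceptually this stacked coordinate list.
print(coo.row)   # [0 1 3]
print(coo.col)   # [2 1 0]
print(coo.data)  # [5 7 9]
```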
format/SparseTensor.fbs
Outdated
It seems like this could be large. Should this be a Buffer also? We don't want the metadata to be too big.
I agree with you. I should make it a Buffer.
Done in 7416960.
format/SparseTensor.fbs
Outdated
Maybe a small example would help here too
I've described an example of CSR format in 152cb10.
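As a companion to the referenced CSR example, a small sketch with scipy.sparse (hypothetical values, not code from the patch):

```python
import numpy as np
from scipy.sparse import csr_matrix

# The same style of hypothetical 4x4 matrix, now in CSR form.
dense = np.array([
    [0, 0, 5, 0],
    [0, 7, 0, 0],
    [0, 0, 0, 0],
    [9, 0, 0, 0],
])
csr = csr_matrix(dense)

# indptr has length nrows + 1: row i's nonzero values live in
# data[indptr[i]:indptr[i+1]], with columns in indices[indptr[i]:indptr[i+1]].
print(csr.indptr)   # [0 1 2 2 3]
print(csr.indices)  # [2 1 0]
print(csr.data)     # [5 7 9]
```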
format/SparseTensor.fbs
Outdated
In this patch maybe we want to write down how the Buffers will be arranged in the IPC body.
I'm thinking
<metadata> <padding> <indicesBuffer> <data>
and possibly in the case of sparse matrix CSR format:
<metadata> <padding> <indptr> <indicesBuffer> <data>
Let me know what you think
Both layouts you wrote match what I had in mind.
I'll document them in the comment.
Where is the most suitable place to describe the message layout of SparseTensor?
I think adding it to IPC.md would be acceptable, but I'd like to know whether a policy exists.
IPC.md sounds good to me
I wrote the description of the sparse tensor format in IPC.md.
I have one question. @wesm, I found that the Tensor format has the following order in IPC.md:
<padding> <metadata length> <metadata> <tensor body>
I followed this order for the sparse tensor format in 29ed742, but in the example format you mentioned above, the padding and the metadata were swapped.
Could you please tell me which you think is better?
I noticed that additional padding between the sparse index and the sparse tensor body is necessary to align the sparse tensor body to a multiple of 64 bytes.
So I added that in f0ec57d.
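The 64-byte alignment rule discussed here can be sketched as follows; `pad_to_64` and the buffer sizes are hypothetical illustrations, not code from the patch:

```python
def pad_to_64(offset: int) -> int:
    """Round an offset up to the next multiple of 64 bytes."""
    return (offset + 63) & ~63

# Hypothetical region sizes in a COO sparse tensor message.
metadata_len = 200
indices_len = 120

# Each region starts at the next 64-byte boundary after the previous one.
indices_start = pad_to_64(metadata_len)               # 256
body_start = pad_to_64(indices_start + indices_len)   # 384
```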
You may already be aware that the Python sparse ndarray implementation is being developed over at https://github.com/pydata/sparse. It was a project started by Matthew Rocklin and is now continued by Hameer Abbasi, so they may have some valuable insights to add here.

@wesm @dhirschfeld Yes, I'm considering that, too. I referred to scipy.sparse and pydata/sparse when I wrote this pull request, especially the implementation of scipy.sparse and these two issues: scipy/scipy#8162 and pydata/sparse#125. I need to cooperate with the people behind scipy.sparse and pydata/sparse to make Arrow's SparseTensor widely interoperable, so I want to get feedback from them. @wesm Could you please connect me with the scipy.sparse folks? I don't know who the best person to mention here is. And I'll consult with Hameer Abbasi about cooperating with pydata/sparse.

@rgommers We are looking at adding sparse matrix and sparse ndarray metadata descriptors to allow for zero-copy reads of such objects from shared memory using the Arrow library ecosystem, and to eventually develop interop with scipy.sparse and other frameworks. Would you or someone from the scipy.sparse community be able to review and give some feedback on the proposed metadata?
Nice :)
I'm pretty much out of bandwidth till the end of October, and there are people with more expertise; let me ping a few: @hameerabbasi @pv @perimosocordiae
hameerabbasi
left a comment
This, in general, looks okay to me. Perhaps in the future, if zero-copy and future-proofness are really what we want, we might want to add the CSF (compressed sparse fiber) format, a generalisation of CSR/CSC. I'm currently working on adding it to PyData/Sparse, and I plan to make it the preferred format (COO will still be around, though).
CSF involves two-dimensional jagged arrays; I'm not entirely sure how that is handled in this grammar, otherwise I'd send a PR myself.
COO and CSR are probably sufficient for many use cases with matrices. If CSR is included, I wonder whether it would also be useful to include CSC, as IIRC it was the default I/O format used in several sparse matrix libraries (UMFPACK, SuperLU).

@wesm @rgommers Thank you very much for arranging this. @hameerabbasi @pv Thank you for your comments. @hameerabbasi I was hoping to consult with you about introducing CSF in Arrow, so it was great that you commented recommending CSF. @pv OK. CSC is just the transpose of CSR, so we can introduce CSC very easily.
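The "CSC is just the transpose of CSR" observation can be checked directly with scipy.sparse (hypothetical matrix):

```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix

dense = np.array([[0, 5],
                  [7, 0]])

csc = csc_matrix(dense)
csr_of_transpose = csr_matrix(dense.T)

# CSC of A carries the same indptr/indices/data buffers as CSR of A^T,
# so supporting CSC on top of CSR is mostly a matter of interpretation.
assert (csc.indptr == csr_of_transpose.indptr).all()
assert (csc.indices == csr_of_transpose.indices).all()
assert (csc.data == csr_of_transpose.data).all()
```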
Codecov Report

@@           Coverage Diff            @@
##           master    #2546    +/-  ##
========================================
+ Coverage   87.55%   88.56%      +1%
========================================
  Files         403      342      -61
  Lines       62113    58322    -3791
========================================
- Hits        54386    51651    -2735
+ Misses       7657     6671     -986
+ Partials       70        0      -70

Continue to review the full report at Codecov.
format/SparseTensor.fbs
Outdated
If you'd like zero-copy writes to scipy.sparse.coo_matrix and sparse, then I'd recommend that this array be transposed, or stored as F-contiguous.
Thank you for your valuable advice.
I made COO use a column-major coordinates tensor.
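To illustrate why column-major coordinates help: in a nonzeros-by-ndim coordinate matrix stored in Fortran order, each dimension's coordinates are contiguous, matching scipy's separate row/col arrays. A sketch with hypothetical values:

```python
import numpy as np
from scipy.sparse import coo_matrix

dense = np.array([[0, 5],
                  [7, 0]])
coo = coo_matrix(dense)

# Stack the per-dimension coordinates into an (nnz, ndim) matrix and
# store it column-major, so each column (one dimension) is contiguous.
coords = np.asfortranarray(np.stack([coo.row, coo.col], axis=1))

row_view = coords[:, 0]  # contiguous view over the row coordinates
col_view = coords[:, 1]  # contiguous view over the column coordinates
assert row_view.flags["C_CONTIGUOUS"]
assert col_view.flags["C_CONTIGUOUS"]
```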
(force-pushed cd79dda to 74c6c7b)
This commit defines the new NumericTensor<T> class as a subclass of the Tensor class. NumericTensor<T> extends the Tensor class by adding a member function to access element values in a tensor. I want to use this new feature for writing tests of SparseTensor in #2546.

Author: Kenta Murata <mrkn@mrkn.jp>

Closes #2759 from mrkn/tensor_element_access and squashes the following commits:
37f0bb4 <Kenta Murata> Add tests for column-major strides
14fa527 <Kenta Murata> Remove needless cases
1646461 <Kenta Murata> Introduce NumericTensor class
(force-pushed 74c6c7b to 9b7468b)
What are the next steps here? I am not enough of an expert to give my seal of approval; if someone more experienced with sparse matrices could weigh in, it would be helpful.

@wesm I'm working on implementing the sparse index for the COO and CSR formats.
(force-pushed c61bed1 to cb0efef)
(force-pushed 76e205b to 61aa7f5)
- Rename SparseTensorBase to SparseTensor
- Rename SparseTensor<...> to SparseTensorImpl<...>
- Rename sparse_tensor_format_id to format_id
(force-pushed 00a7e3c to 148bff8)
@wesm Thank you for your comments.

I've restarted the failed job and CI is green now.

Merging. Thank you again for your patience and care on this.
data: Buffer;
}

root_type SparseTensor;
@wesm @mrkn - (I'm not a FlatBuffers expert by any means), but can you have two root_types in a single .fbs file?
I was updating the C# FlatBuffers generated code, and I noticed that the generated Tensor struct no longer has these methods:
public static void FinishTensorBuffer(FlatBufferBuilder builder, Offset<Tensor> offset) { builder.Finish(offset.Value); }
public static void FinishSizePrefixedTensorBuffer(FlatBufferBuilder builder, Offset<Tensor> offset) { builder.FinishSizePrefixed(offset.Value); }
I traced this to the code generator, which looks like it expects only a single root_type from the parser.
Looking at the docs, it sounds like there should only be one:
"The last part of the schema is the root_type. The root type declares what will be the root table for the serialized data. In our case, the root type is our Monster table."
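For reference, a FlatBuffers schema accepts only one root_type declaration per file; a hypothetical excerpt of the fix might look like:

```
// Excerpt (hypothetical): both tables may be defined in one schema,
// but flatc accepts exactly one root_type per .fbs file.
table Tensor { /* ... */ }
table SparseTensor { /* ... */ }

root_type Tensor;
// root_type SparseTensor;  <-- the duplicate declaration to remove
```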
Yes, it looks like we should remove the other root_type. Can you open a JIRA issue about fixing this?
@wesm Shall I make a new pull-request to fix it?
I've opened https://issues.apache.org/jira/browse/ARROW-4571 for this issue.
I'm working on it.
- Update FlatBuffers code to latest version.
- Mark all FlatBuffer types as internal.

Note: I didn't use the latest `.fbs` file version because it included the SparseTensor support. Using the latest `.fbs` change caused problems with the `Tensor` generation. I have a question whether the latest `.fbs` file is correct (#2546 (comment)). /cc @chutchinson @stephentoub - FYI

Author: Eric Erhardt <eric.erhardt@microsoft.com>

Closes #3637 from eerhardt/UpdateFlatBufferCode and squashes the following commits:
15e0831 <Eric Erhardt> Adding how to update FlatBuffers code to the README
043ce7e <Eric Erhardt> ARROW-4543: Update Flat Buffers code to latest version
I'm interested in making a language-agnostic sparse tensor format. I believe one of the suitable places to do this is Apache Arrow, so let me propose my idea of this here.
First of all, I found in my investigation that there is no common memory layout for sparse tensor representations. It means we need some kind of conversion to share sparse tensors among different systems even when the data is logically the same. This is the same situation as with dataframes, and it is the reason why I believe Apache Arrow is the suitable place.
There are many formats for representing a sparse tensor. Most of them are specialized for a matrix, which has two dimensions. There are few formats for general sparse tensors with more than two dimensions.
I think the COO format is a suitable starting point because COO can handle any number of dimensions, and many systems support it. In my investigation, the systems that support COO include SciPy, Dask, pydata/sparse, TensorFlow, and PyTorch.
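A sketch of why COO generalizes beyond matrices, using only NumPy (the tensor values are hypothetical):

```python
import numpy as np

# COO stores one coordinate per dimension for each nonzero value,
# so it works for any rank. Here, a 2x2x2 tensor with two nonzeros.
dense = np.zeros((2, 2, 2))
dense[0, 1, 0] = 3.0
dense[1, 0, 1] = 4.0

coords = np.array(np.nonzero(dense))  # coords.shape == (ndim, nnz)
values = dense[tuple(coords)]         # the nonzero values, [3.0, 4.0]
```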
Additionally, the CSR format for matrices may also be good to support initially. The reason is that CSR is efficient for extracting row slices, which may be important for extracting samples from tidy data, and it is supported by SciPy, MXNet, and R's Matrix library.
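The row-slicing advantage of CSR can be sketched with scipy.sparse (hypothetical matrix):

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([
    [1, 0, 0],
    [0, 2, 0],
    [0, 0, 3],
])
m = csr_matrix(dense)

# Extracting a row range only needs the indptr[i]:indptr[i+1] lookups,
# so row slices (e.g. sampling rows from tidy data) are cheap in CSR.
rows = m[1:3]
assert rows.shape == (2, 3)
assert (rows.toarray() == dense[1:3]).all()
```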
I added my prototype definition of the SparseTensor format in this pull request. I designed this prototype format to be extensible so that we can support additional sparse formats. I think we will at least need an additional sparse tensor format for more than two dimensions besides COO, so we will need this extensibility.